LiteralSubs
A pipeline that turns anime into Japanese-learning material — but keeps the literal meaning of the Japanese instead of flattening it into smooth, localized English. It reads a show, understands every line, and ships ready-made subtitles and spaced-repetition decks. The pilot is Oregairu.
The problem
Official subtitles are localized — they trade the structure of the Japanese for natural-sounding English. Great for watching, quietly terrible for learning: you read fluent English and never see how the language actually works.
The localized column is easier to read. The literal column is what teaches you — it preserves the idiom, the imagery, and the way Japanese encodes the idea. That gap is the whole product.
How it’s built
LiteralSubs is a deterministic pipeline with a single language-model step in the middle. Curated inputs go in; subtitles and decks come out; the learner meets them in tools they already use.
Deliberately, LiteralSubs ships only text and decks — never video. It owns the curation and the literal translation; Migaku and Anki already do the playback and the scheduling.
The AI pipeline
Ten stages take one episode from a raw subtitle file to finished decks. Most of it is fast, local, deterministic language processing. Exactly one stage calls a model — and it’s the one judgement no rule-set can fake: the translation itself.
- 0
Ingest
pysubs2Parse the raw Japanese .srt, pull speaker tags from (名前) markers, clean broadcast formatting into timed dialogue lines.
- 1
Tokenize
SudachiMorphological analysis (Mode B) over a custom user-dictionary of the show’s character names and terms — surface, lemma, reading, part-of-speech.
- 2
Filter & annotate
rulesDrop punctuation noise, mark non-cardable grammar (particles, copulas), tag register (gendered / emphatic / archaic), resolve character names.
- 3
Enrich
JPDB corpusLook up each word’s frequency rank in an anime-weighted corpus — how common is this word in the wild?
- 4
Score
aggregatePer word: how many times it appears in the show, where it first shows up, every line it lives in.
- 5
Stratify
frequency cutoffsBucket vocabulary into five JLPT-labelled tiers by rank, keeping only words that recur (≥2× in the show).
- 6
Translate
ClaudeClaude writes the literal translation for each line — a readable sentence plus a word-by-word breakdown — under a cached house style-guide and the show’s manifest. Adaptive thinking on hard lines; resumable.
- 7
Validate
checksAutomated house-style enforcement: idioms stay literal, honorifics preserved, every word covered, lines that lean too hard on inference get flagged.
- 8
Render
pysubs2Emit three subtitle tracks from one source — pure (Japanese order), natural (readable), and an optional gloss track of etymology notes.
- 9
Build decks
genanki · ffmpegPick the best example sentence per word, synthesize furigana + pitch accent + kanji breakdown, bake in the audio clip and a screenshot, de-duplicate across the season, export one .apkg per tier.
The translate step is calibrated for cost and consistency: the heavy context — a literal-translation style guide plus the show’s full character manifest — is sent once and prompt-cached, so every line after the first reuses it. The model can be swapped (Opus / Sonnet / Haiku) to trade quality against price.
What the model actually returns — one line of dialogue
n: 1
time: "12190ms-->17860ms"
jp: "青春をおう歌せし者達は 常に自己と周囲を欺き"
glossed: "Those-people who sang-the-praises-of youth always deceive self and surroundings,"
segments:
- { jp: 青春, en: "youth" }
- { jp: を, type: particle, pure: "-wo" }
- { jp: おう歌せし, en: "sang-the-praises-of", notes: "archaic; literal calque kept" }
- { jp: 者達, en: "those-people", notes: "達 pluralizer preserved" }
- { jp: は, type: particle, pure: "-wa" }
- { jp: 常に, en: "always" }
- { jp: 自己, en: "self" }
... What a subtitle looks like
Every line is understood at the word level, then re-expressed at three depths. Here’s one real line from the pilot — Yukinoshita’s cold open — with every layer the pipeline produces.
もう無理して来なくていいわ
already forcing-yourself, not-coming is-fine-wa
(you) already forcing-yourself, not-coming is-fine
Pure keeps strict Japanese order with particles as romaji — a purist reference. Natural reorders into readable English but keeps the literal word choices — this is the track you actually watch with. Idioms never get smoothed away; only true set-phrases (ありがとう → “Thanks”) read naturally, with the literal sense tucked into the gloss track.
How it’s designed to teach
LiteralSubs is built on immersion (comprehensible input / AJATT), not grammar drills. The principle: deck + subtitles + audio together produce comprehension — no single piece tries to do it alone.
Prime with the deck
A few days before an episode, study its new vocabulary in Anki. The deck is an on-ramp, not the whole teacher — it seeds high-value words so the episode lands.
Watch with dual subs
Watch in Migaku with Japanese and the literal English side by side. You see the syntax and a faithful gloss at once, so structure starts to feel native.
Mine & review
The subtitle files are Migaku-compatible, so power users can mine extra cards from the same episode — extending the curated deck instead of rebuilding it.
Vocabulary is split into five tiers so a learner studies only what’s reachable from where they are. Lower tiers teach the word itself; higher tiers teach it in the show’s own voice:
Cards are chosen for teachability: the pipeline prefers the “1T” sentence for each word — the one where it’s the only unknown — so context carries you to the meaning.
The card
Each card pairs the word with the moment it was said. The back carries both translations on purpose — the literal one teaches structure, the natural one confirms meaning.
Furigana, pitch accent and the kanji breakdown are all synthesized by the pipeline (subtitle files don’t carry them). Audio and the screenshot are cut straight from the episode with ffmpeg, so review feels like the show.
The pilot — Oregairu
Yahari Ore no Seishun Love Comedy wa Machigatteiru. Chosen because its narrator, Hachiman, thinks in dense, idiom-heavy, literary Japanese — exactly the writing that localized subs flatten most. If the literal approach proves itself anywhere, it’s here.
“Curator-built” means the show’s manifest — every character, alias, and show-specific term, plus tone and premise — is curated by hand, and the model translates against that context under a fixed literal style-guide. One consistent voice, held to one standard; no crowdsourced drift.
Why I built it
I’m learning Japanese the immersion way, and the same frustration kept recurring: the subtitles I was learning from were lying to me a little. Not wrong — just smoothed. The Japanese was doing something specific and the English quietly erased it. I wanted to see the real shape of the language while I watched.
So I built the thing I wanted. The interesting part wasn’t the model — it was everything around it: defining what “literal but readable” even means precisely enough to enforce, turning that into a style guide a model could follow, and wrapping a single language-model call in a deterministic pipeline that tokenizes, ranks, validates, and packages, so the output is consistent instead of merely impressive once.
It also taught me where to not use cleverness. An early version tried to reorder English with a hand-written heuristic; it made everything worse. Cutting it — letting the model write natural order directly and keeping the literalness in word choice — was the version that worked. Knowing which part of a system to simplify is most of the job.
For me this is the clearest example of how I work: take a fuzzy intuition, make it concrete and measurable, and ship it end to end — pipeline, model, product, and the business case around it.