Case study · pilot in progress

LiteralSubs

A pipeline that turns anime into Japanese-learning material — but keeps the literal meaning of the Japanese instead of flattening it into smooth, localized English. It reads a show, understands every line, and ships ready-made subtitles and spaced-repetition decks. The pilot is Oregairu.

Python pipeline Claude API Sudachi NLP Anki / genanki FastAPI

The problem

Official subtitles are localized — they trade the structure of the Japanese for natural-sounding English. Great for watching, quietly terrible for learning: you read fluent English and never see how the language actually works.

JapaneseLocalized subLiteral (LiteralSubs)
“can’t eat hot food” “cat-tongue”
“stands out” “emit unique-brilliance”

The localized column is easier to read. The literal column is what teaches you — it preserves the idiom, the imagery, and the way Japanese encodes the idea. That gap is the whole product.

How it’s built

LiteralSubs is a deterministic pipeline with a single language-model step in the middle. Curated inputs go in; subtitles and decks come out; the learner meets them in tools they already use.

Sources
Raw Japanese subtitles .srt
Show manifest characters · terms · context
Reference data JPDB · JMdict · KANJIDIC · pitch
Episode video optional — audio & stills
Engine
LiteralSubs pipeline tokenize → translate → stratify → render → build
SudachiClaudegenankiffmpeg
Outputs & delivery
3 subtitle tracks pure · natural · gloss
Anki decks .apkg, one per JLPT tier
▶ Migaku player watch — dual JP + literal-EN subs
🗂 Anki review — spaced repetition

Deliberately, LiteralSubs ships only text and decks — never video. It owns the curation and the literal translation; Migaku and Anki already do the playback and the scheduling.

The AI pipeline

Ten stages take one episode from a raw subtitle file to finished decks. Most of it is fast, local, deterministic language processing. Exactly one stage calls a model — and it’s the one judgement no rule-set can fake: the translation itself.

  1. 0

    Ingest

    pysubs2

    Parse the raw Japanese .srt, pull speaker tags from (名前) markers, clean broadcast formatting into timed dialogue lines.

  2. 1

    Tokenize

    Sudachi

    Morphological analysis (Mode B) over a custom user-dictionary of the show’s character names and terms — surface, lemma, reading, part-of-speech.

  3. 2

    Filter & annotate

    rules

    Drop punctuation noise, mark non-cardable grammar (particles, copulas), tag register (gendered / emphatic / archaic), resolve character names.

  4. 3

    Enrich

    JPDB corpus

    Look up each word’s frequency rank in an anime-weighted corpus — how common is this word in the wild?

  5. 4

    Score

    aggregate

    Per word: how many times it appears in the show, where it first shows up, every line it lives in.

  6. 5

    Stratify

    frequency cutoffs

    Bucket vocabulary into five JLPT-labelled tiers by rank, keeping only words that recur (≥2× in the show).

  7. 6

    Translate

    Claude

    Claude writes the literal translation for each line — a readable sentence plus a word-by-word breakdown — under a cached house style-guide and the show’s manifest. Adaptive thinking on hard lines; resumable.

  8. 7

    Validate

    checks

    Automated house-style enforcement: idioms stay literal, honorifics preserved, every word covered, lines that lean too hard on inference get flagged.

  9. 8

    Render

    pysubs2

    Emit three subtitle tracks from one source — pure (Japanese order), natural (readable), and an optional gloss track of etymology notes.

  10. 9

    Build decks

    genanki · ffmpeg

    Pick the best example sentence per word, synthesize furigana + pitch accent + kanji breakdown, bake in the audio clip and a screenshot, de-duplicate across the season, export one .apkg per tier.

The translate step is calibrated for cost and consistency: the heavy context — a literal-translation style guide plus the show’s full character manifest — is sent once and prompt-cached, so every line after the first reuses it. The model can be swapped (Opus / Sonnet / Haiku) to trade quality against price.

What the model actually returns — one line of dialogue
n: 1
time: "12190ms-->17860ms"
jp: "青春をおう歌せし者達は 常に自己と周囲を欺き"
glossed: "Those-people who sang-the-praises-of youth always deceive self and surroundings,"
segments:
  - { jp: 青春,      en: "youth" }
  - { jp: を,        type: particle, pure: "-wo" }
  - { jp: おう歌せし, en: "sang-the-praises-of", notes: "archaic; literal calque kept" }
  - { jp: 者達,      en: "those-people", notes: "達 pluralizer preserved" }
  - { jp: は,        type: particle, pure: "-wa" }
  - { jp: 常に,      en: "always" }
  - { jp: 自己,      en: "self" }
  ...

What a subtitle looks like

Every line is understood at the word level, then re-expressed at three depths. Here’s one real line from the pilot — Yukinoshita’s cold open — with every layer the pipeline produces.

雪ノ下雪乃 · Yukinoshita Yukino 00:13.550 → 00:15.650
Japanese
Word by word
もうalready
無理してforcing-yourself,
来なくてnot-coming
いいis-fine
-wa · feminine
Pure

already forcing-yourself, not-coming is-fine-wa

Natural

(you) already forcing-yourself, not-coming is-fine

Pure keeps strict Japanese order with particles as romaji — a purist reference. Natural reorders into readable English but keeps the literal word choices — this is the track you actually watch with. Idioms never get smoothed away; only true set-phrases (ありがとう → “Thanks”) read naturally, with the literal sense tucked into the gloss track.

How it’s designed to teach

LiteralSubs is built on immersion (comprehensible input / AJATT), not grammar drills. The principle: deck + subtitles + audio together produce comprehension — no single piece tries to do it alone.

A

Prime with the deck

A few days before an episode, study its new vocabulary in Anki. The deck is an on-ramp, not the whole teacher — it seeds high-value words so the episode lands.

B

Watch with dual subs

Watch in Migaku with Japanese and the literal English side by side. You see the syntax and a faithful gloss at once, so structure starts to feel native.

C

Mine & review

The subtitle files are Migaku-compatible, so power users can mine extra cards from the same episode — extending the curated deck instead of rebuilding it.

Vocabulary is split into five tiers so a learner studies only what’s reachable from where they are. Lower tiers teach the word itself; higher tiers teach it in the show’s own voice:

TierFrequencyCards / episodeCard type
sub-N5 top ~300 18 word-target
N5 top ~800 15 word-target
N4 top ~1,500 12 word-target
N3 top ~4,000 9 sentence-in-context
N2+ top ~10,000 6 sentence-in-context

Cards are chosen for teachability: the pipeline prefers the “1T” sentence for each word — the one where it’s the only unknown — so context carries you to the meaning.

The card

Each card pairs the word with the moment it was said. The back carries both translations on purpose — the literal one teaches structure, the natural one confirms meaning.

Front
無理むり
Back
もう無理むりしてなくていいわ ▶ 0:02 · 🖼 still
literal already forcing-yourself, not-coming is-fine-wa
natural (you) already forcing-yourself, not-coming is-fine

Reading むり · auto pitch-accent
Meaning unreasonable; impossible; overdoing it
Kanji 無 nothing理 reason
speaker · 雪ノ下雪乃tier · N5 · 1T

Furigana, pitch accent and the kanji breakdown are all synthesized by the pipeline (subtitle files don’t carry them). Audio and the screenshot are cut straight from the episode with ffmpeg, so review feels like the show.

The pilot — Oregairu

Yahari Ore no Seishun Love Comedy wa Machigatteiru. Chosen because its narrator, Hachiman, thinks in dense, idiom-heavy, literary Japanese — exactly the writing that localized subs flatten most. If the literal approach proves itself anywhere, it’s here.

“Curator-built” means the show’s manifest — every character, alias, and show-specific term, plus tone and premise — is curated by hand, and the model translates against that context under a fixed literal style-guide. One consistent voice, held to one standard; no crowdsourced drift.

Phase 0–3 Pipeline proven end-to-end on episode 1; card schema locked; all deck types generating.
Phase 4 Build dashboard (FastAPI) — upload an episode, edit the manifest, build a season, download the decks.

Why I built it

I’m learning Japanese the immersion way, and the same frustration kept recurring: the subtitles I was learning from were lying to me a little. Not wrong — just smoothed. The Japanese was doing something specific and the English quietly erased it. I wanted to see the real shape of the language while I watched.

So I built the thing I wanted. The interesting part wasn’t the model — it was everything around it: defining what “literal but readable” even means precisely enough to enforce, turning that into a style guide a model could follow, and wrapping a single language-model call in a deterministic pipeline that tokenizes, ranks, validates, and packages, so the output is consistent instead of merely impressive once.

It also taught me where to not use cleverness. An early version tried to reorder English with a hand-written heuristic; it made everything worse. Cutting it — letting the model write natural order directly and keeping the literalness in word choice — was the version that worked. Knowing which part of a system to simplify is most of the job.

For me this is the clearest example of how I work: take a fuzzy intuition, make it concrete and measurable, and ship it end to end — pipeline, model, product, and the business case around it.