---
description: ""
draft: true
cascade:
  type: docs
date: 2026-03-28
---

Author: Mykola Dementiev · Date: 2026-03-28


TL;DR

We had 106 episodes of a TV series with no Ukrainian subtitles. We built a Python script that:

  1. Searches for ready-made Ukrainian subtitles on OpenSubtitles.com (free, 100 downloads/day)
  2. If not found — translates through a configurable chain of free translators: Google → Argos (offline) → MyMemory → LibreTranslate
  3. Uses Claude AI only as a last resort — with token budget control
  4. Keeps a full run log in Log.md and Log_console.txt
  5. Automatically synchronizes subtitle timings between different video releases

Result: 25 episodes translated in the first day, the rest processed incrementally.


The Problem

We have a complete collection of “The Fosters” — 106 episodes in .mkv format. Subtitles are either missing entirely or only available as English .srt files. Watching with English subtitles is fine, but we wanted our native language.

Options we ruled out:

  • Manual translation — 106 episodes × ~45 minutes = not realistic
  • Paid service — expensive, quality for Ukrainian is inconsistent
  • Plain Claude — burns through tokens fast; the monthly limit runs out quickly

We needed a solution that:

  • Runs autonomously (fire and forget)
  • Minimizes spending on paid resources
  • Verifies translation quality
  • Can be interrupted and resumed

Solution Architecture

Overall Pipeline

@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13
skinparam ArrowColor #444

title Subtitle Translation Pipeline

start

:Find episode without subtitles;

:Search on **OpenSubtitles.com**\n(uk, 100 requests/day);

if (Found?) then (yes)
  :Download subtitles;
  :Verify language\n(uk-exclusive chars: ї, є, ґ);
  if (Ukrainian?) then (yes)
    :Sync timings\n(2-point linear calibration);
    :✅ Done;
    stop
  else (no — it's Russian)
    :Discard file;
  endif
else (no)
endif

:Translate text\n(Chain of Responsibility);

:Check quality\n(Cyrillic ratio ≥ 70%);

if (Quality OK?) then (yes)
  :✅ Save .uk.srt;
else (no, < 70%)
  :❌ Delete file;
  note right: Will retry\non next run
endif

stop
@enduml

Pipeline Diagram


Step 1 — OpenSubtitles.com

The first step is to look for already-translated Ukrainian subtitles. OpenSubtitles has a huge database and often has exactly what you need.

API: REST, free tier = 100 downloads/day.

# Search by title and season/episode number
GET /api/v1/subtitles?query=The+Fosters&season_number=1&episode_number=3&languages=uk
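As an illustration, a minimal client for this search call could look like the following (the Api-Key and User-Agent headers follow the OpenSubtitles REST API; the function itself is a sketch, not the article's code):

```python
API_URL = "https://api.opensubtitles.com/api/v1/subtitles"

def build_search_params(query: str, season: int, episode: int) -> dict:
    """Query parameters for the episode search shown above."""
    return {
        "query": query,
        "season_number": season,
        "episode_number": episode,
        "languages": "uk",
    }

def search_subtitles(api_key: str, query: str, season: int, episode: int) -> list:
    import requests  # third-party; imported lazily so offline steps still work
    resp = requests.get(
        API_URL,
        params=build_search_params(query, season, episode),
        headers={"Api-Key": api_key, "User-Agent": "subtitle-pipeline v1.0"},
        timeout=30,
    )
    resp.raise_for_status()   # surfaces HTTP 429 when the daily quota runs out
    return resp.json().get("data", [])
```

Each search result still has to pass the language check below before being trusted.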

Important catch: subtitles tagged as uk (Ukrainian) are often actually Russian. We detect the real language using Ukrainian-exclusive characters:

import re

def is_ukrainian(text):
    uk_exclusive = len(re.findall(r'[їЇєЄґҐ]', text))
    uk_i         = len(re.findall(r'[іІ]', text))
    ru_exclusive = len(re.findall(r'[ыЫэЭъЪ]', text))
    total_cyr    = len(re.findall(r'[а-яА-ЯіІїЇєЄґҐёЁыЫэЭъЪ]', text))

    if total_cyr < 20:                      return True   # too little text — don't reject
    if ru_exclusive > total_cyr * 0.02:     return False  # has "ы/э/ъ" — this is RU
    return (uk_exclusive + uk_i) > total_cyr * 0.03       # has "ї/є/і" — this is UK

Step 2 — Timing Synchronization

Subtitles from OpenSubtitles may have been timed for a different video release (different fps, different start offset). We synchronize using 2-point linear calibration:

@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13

title Subtitle Timing Synchronization (2-point calibration)

participant "EN file\n(our timing)" as en #EEF3FF
participant "UK file\n(from OpenSubtitles)" as uk #FFF3EE
participant "compute_sync()" as sync #EEFFEE
participant "apply_sync()" as apply #EEFFEE
participant "verify_sync()" as verify #EEFFEE

en -> sync : t0, t1 (first/last timestamp)
uk -> sync : t0, t1 (first/last timestamp)

sync -> sync : scale = (en_t1 - en_t0) / (uk_t1 - uk_t0)
sync -> sync : offset = en_t0 - scale × uk_t0

sync --> apply : (offset_ms, scale)
apply -> apply : new_t = scale × t + offset_ms
apply --> verify : synchronized subtitles

verify -> verify : |en_t - uk_t| < 2000ms\nfor first 5 entries

alt check passed
  verify --> en : ✅ accepted
else drift > 2 sec
  verify --> en : ❌ discard file
end
@enduml

Timing Sync Diagram

The math is simple: two equations, two unknowns (scale and offset). Handles both start offset and fps difference (e.g. 23.976 vs 25) simultaneously.

# parse_time() converts an SRT timestamp "HH:MM:SS,mmm" into milliseconds
def compute_sync(en_blocks, uk_blocks):
    en_t0 = parse_time(en_blocks[0][1].split('-->')[0])
    en_t1 = parse_time(en_blocks[-1][1].split('-->')[0])
    uk_t0 = parse_time(uk_blocks[0][1].split('-->')[0])
    uk_t1 = parse_time(uk_blocks[-1][1].split('-->')[0])

    scale  = (en_t1 - en_t0) / (uk_t1 - uk_t0)
    offset = en_t0 - scale * uk_t0
    return offset, scale
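A worked example on raw millisecond values makes the two-equation solution concrete (the anchor numbers are invented: the EN timeline here is a 1.04× stretch of the UK one, shifted by 1.5 s):

```python
def compute_sync(en_t0, en_t1, uk_t0, uk_t1):
    """Two anchor points give two equations: en_t = scale * uk_t + offset."""
    scale = (en_t1 - en_t0) / (uk_t1 - uk_t0)
    offset = en_t0 - scale * uk_t0
    return offset, scale

def apply_sync(t_ms, offset, scale):
    """Map a UK-file timestamp (ms) onto the EN file's timeline."""
    return scale * t_ms + offset

# Invented anchors: EN timeline = 1.04 × UK timeline + 1500 ms
offset, scale = compute_sync(en_t0=6_700, en_t1=2_497_500,
                             uk_t0=5_000, uk_t1=2_400_000)
# scale ≈ 1.04, offset ≈ 1500 ms; every mid-episode cue follows the same line:
mid = apply_sync(1_000_000, offset, scale)   # ≈ 1_041_500 ms
```

Because the mapping is linear, any fps mismatch shows up purely in `scale` and any intro shift purely in `offset`, which is why two anchors are enough.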

Step 3 — Configurable Chain of Responsibility for Translation

If no subtitles are found on OpenSubtitles, we translate ourselves. We used the Chain of Responsibility pattern — and made the entire chain configurable from the top of the file:

# ─── Translation chain (Chain of Responsibility) ──────────────────
# Format: "Service (fallbacks:timeout_inc) -> ..."
#   fallbacks    — retries after the first attempt (0 = one attempt only)
#   timeout_inc  — seconds added to timeout per retry
# First service = primary (Phase 1 batch). Rest = CoR fallback for failures.
# Unavailable service (not installed) → skipped automatically.
TRANSLATOR_CHAIN = "Google (2:30) -> Argos (1) -> MyMemory (1) -> LibreTranslate (1) -> Claude"

# ─── Local LibreTranslate server ──────────────────────────────────
# POST /translate  {"q": text, "source": "en", "target": "uk", "format": "text"}
# Leave "" to skip LibreTranslate entirely.
LIBRETRANSLATE_URL = "http://192.168.1.147:6455"

At startup, the script analyses the chain and prints what will actually run:

════════════════════════════════════════════════════════════════
 🔗 TRANSLATOR_CHAIN: Google (2:30) -> Argos (1) -> MyMemory (1) -> LibreTranslate (1) -> Claude

    Phase 1 (primary):      Google
    Phase 2 (CoR fallback):
    ├─ Google         3 attempts, +30s/retry  ✓
    ├─ Argos          2 attempts              ✓  offline model ready
    ├─ MyMemory       2 attempts              ✓
    ├─ LibreTranslate 2 attempts              ✓  POST http://192.168.1.147:6455/translate
    └─ Claude         1 attempt               ✓
════════════════════════════════════════════════════════════════

If a service is unavailable (e.g. Argos not installed), it’s automatically removed from the chain and the next service takes over. No crashes, no config changes needed.

How the chain config is parsed

def _parse_chain_config(chain_str: str) -> list:
    steps = []
    for part in re.split(r'\s*->\s*', chain_str.strip()):
        # Allows spaces around colon: (1:30) or (1 : 30) or (1: 30)
        m = re.match(r'^(\w+)\s*(?:\(\s*(\d+)\s*(?::\s*(\d+)\s*)?\))?$', part.strip())
        if not m:
            continue
        steps.append({
            'name':        m.group(1),
            'fallbacks':   int(m.group(2)) if m.group(2) else 0,
            'timeout_inc': int(m.group(3)) if m.group(3) else 0,
        })
    return steps

Two-phase execution

Phase 1 — the first service in the chain processes all entries in batches, showing progress:

  [Google 1/12] 80 entries... ✓ 77/80 (3 → retry)
  [Google 2/12] 80 entries... ✓ 80/80
  ...

Phase 2 — failed entries go through the full CoR chain:

@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13
skinparam objectBorderColor #666
skinparam objectBackgroundColor #EEF3FF

title Chain of Responsibility — Configurable Fallback Translators

object "🌐 **Google Translate**\nonline · free\nPhase 1 + 2 retries (+30s each)" as google {
  fallbacks = 2
  timeout_inc = 30s
}
object "🦜 **Argos Translate**\noffline · unlimited" as argos {
  fallbacks = 1
}
object "💬 **MyMemory**\nonline · free" as mymemory {
  fallbacks = 1
}
object "🖥️ **LibreTranslate**\nlocal server · POST /translate" as libre {
  fallbacks = 1
}
object "🤖 **Claude AI**\ntoken budget control" as claude {
  fallbacks = 0
}

google --> argos : failed entries
argos --> mymemory : failed entries
mymemory --> libre : failed entries
libre --> claude : last resort
@enduml

Chain of Responsibility Diagram

TranslatorStep — with timeout increment support

class TranslatorStep:
    def __init__(self, name, fn, fallbacks=0, timeout_inc=0, timeout_base=30):
        self.name         = name
        self.fn           = fn
        self.fallbacks    = fallbacks    # retries after first attempt
        self.timeout_inc  = timeout_inc  # seconds added to timeout per retry
        self.timeout_base = timeout_base
        self.next: 'TranslatorStep | None' = None

    def set_next(self, step):
        self.next = step
        return step

    def execute(self, entries):
        remaining  = entries
        total_att  = self.fallbacks + 1
        for attempt in range(total_att):
            if not remaining:
                return
            timeout = self.timeout_base + attempt * self.timeout_inc
            label   = self.name if attempt == 0 else f"{self.name}{attempt}"
            print(f"  [{label}] {len(remaining)} entries...", end='', flush=True)
            result    = self.fn(remaining, timeout=timeout)
            remaining = [
                (idx, clean) for idx, clean in remaining
                if not (result.get(idx) and cyrillic_ratio(result[idx]) >= MIN_QUALITY)
            ]
        if remaining and self.next:
            self.next.execute(remaining)

Google (2:30) means: attempt 1 with 30s timeout → attempt 2 with 60s → attempt 3 with 90s. So stubborn lines get more time on each retry instead of failing the same way.
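The quality gate inside `execute` relies on cyrillic_ratio(), which the listings don't show. A minimal version consistent with how it is used (a reconstruction; the real helper may differ):

```python
import re

MIN_QUALITY = 0.70  # accept threshold from the pipeline (assumed constant name)

def cyrillic_ratio(text: str) -> float:
    """Share of Cyrillic characters among all letters in the text.

    A line that comes back mostly Latin means the translator echoed the
    English input (or failed silently), so it counts as a failure.
    """
    letters = re.findall(r'[A-Za-zА-Яа-яІіЇїЄєҐґ]', text)
    if not letters:
        return 0.0
    cyr = re.findall(r'[А-Яа-яІіЇїЄєҐґ]', text)
    return len(cyr) / len(letters)
```

Counting only letters (not punctuation or digits) keeps timestamps and tags from diluting the ratio.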

Building the chain dynamically

# Filter out unavailable services, build linked list in config order
resolved = [s for s in _parse_chain_config(TRANSLATOR_CHAIN)
            if _check_service(s['name'], has_argos)[0]]

steps = [
    TranslatorStep(name=s['name'], fn=_FN_MAP[s['name']],
                   fallbacks=s['fallbacks'], timeout_inc=s['timeout_inc'])
    for s in resolved
]
for i in range(len(steps) - 1):
    steps[i].set_next(steps[i + 1])

# Phase 1: primary batch pass (the first service in the chain)
primary_fn = _FN_MAP[resolved[0]['name']]
primary_fn(all_entries, timeout=30)

# Phase 2: CoR for failures
steps[0].execute(failed_entries)

Why Google, not Argos, as primary?

We initially used Argos Translate as the primary translator (offline, unlimited). It worked for about 6 episodes before we noticed the quality was unacceptable for TV dialogue:

| Original                     | Argos                 | Google              |
|------------------------------|-----------------------|---------------------|
| “I’m sorry”                  | “Я шкода”             | “Вибачте”           |
| “Previously on The Fosters”  | “Попередньо на піни”  | “Раніше у Фостерів” |

Argos translates word-by-word without context. Google handles idioms and dialogue naturally. The fix: Google becomes primary, Argos stays in the CoR chain as an offline fallback for when Google is unavailable.

LibreTranslate local server

Public Lingva Translate instances are mostly dead (403 errors). Instead, we run LibreTranslate locally:

docker run -p 6455:5000 libretranslate/libretranslate

The script uses the LibreTranslate REST API:

POST http://192.168.1.147:6455/translate
{"q": "text", "source": "en", "target": "uk", "format": "text", "api_key": ""}
# → {"translatedText": "переклад"}

If LIBRETRANSLATE_URL = "" — the LibreTranslate step is automatically skipped.


Step 4 — Claude Token Budget Control

Claude is used only for entries that failed every previous step. To avoid burning the entire monthly limit in a single run, we read the current token usage:

# ~/.claude/fetch-claude-usage.swift returns "84|2026-03-28T15:00:00Z"
# where 84 = % USED tokens, so free = 100 - 84 = 16%
def get_free_tokens_pct() -> float | None:
    ...

def claude_allowed(min_pct: float) -> bool:
    if _token_pct is None:
        return True   # unknown → allow optimistically
    return _token_pct >= min_pct

Run with a 30% threshold:

python3 translate_subtitles.py 11 --min-tokens 30

File Structure

The Fosters - srt - ua/
├── translate_subtitles.py      # main script
├── subtitles_source/           # original .srt files
│   ├── Season 1/
│   │   ├── S01E01.srt
│   │   ├── S01E01.en.srt      # copy used for translation
│   │   └── ...
│   └── Season 2/
├── subtitles_output/           # results
│   ├── Season 1/
│   │   ├── S01E01.uk.srt      # finished translation
│   │   └── ...
├── Log.md                      # structured run log (markdown table)
└── Log_console.txt             # full console output

Run Log

Log.md is a table updated after every file (not just at the end of the run). Per-translator phrase counts are tracked separately:

| Date & Time         | Run Result                                                               |
|---------------------|--------------------------------------------------------------------------|
| 2026-03-28 21:15:33 | ✓ 4/4 \| OS×2 \| Google×340 \| Claude×12 \| OS:86/100 \| tokens: 63%→51% |
| 2026-03-28 19:02:11 | ✓ 11/11 \| Google×8102 \| Argos×44 \| OS:98/100 \| tokens: 84%→71%       |

Setup & Usage

# Dependencies
pip3 install deep-translator argostranslate --break-system-packages

# Download Argos offline model (optional — used as fallback)
argospm install translate-en_uk

# Local LibreTranslate (optional — used as fallback)
docker run -p 6455:5000 libretranslate/libretranslate

# Run
python3 translate_subtitles.py              # default: 30 files, 30% token threshold
python3 translate_subtitles.py 10           # process 10 files per run
python3 translate_subtitles.py --min-tokens 50    # stricter token limit
python3 translate_subtitles.py --debug-tokens     # diagnose token source

Ctrl+C interrupts cleanly — the log is saved and no files are corrupted.
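One common way to get that guarantee is to write each output through a temp file plus an atomic rename, so an interrupt can never leave a half-written .srt on disk. A sketch of the pattern, not necessarily the script's exact mechanism:

```python
import os
import tempfile

def write_atomic(path: str, text: str) -> None:
    """Write via temp file + rename: a reader never sees a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)   # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)      # clean up the temp file on any failure
        raise
```

Combined with a try/finally that flushes Log.md, Ctrl+C between files leaves every already-written subtitle intact and the log current.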


Results

| Metric               | Value                                 |
|----------------------|---------------------------------------|
| Total episodes       | 106                                   |
| Translated in 1 day  | 25                                    |
| Primary translator   | Google Translate (free, best quality) |
| Average file size    | ~70 KB                                |
| Average coverage     | 97–100%                               |
| Claude usage         | ~3% of phrases (last resort only)     |

Key Takeaways

  • OpenSubtitles first — if subtitles already exist, why translate?
  • Configurable chain — one line at the top of the file controls the entire pipeline; add, remove, or reorder translators instantly
  • Google as primary — better quality than offline models for TV dialogue; Argos stays as offline fallback
  • Timeout escalation — Google (2:30) means each retry gets 30 more seconds, giving stubborn lines a fair chance
  • ї/є/ґ language check — simple but reliable detector to filter Russian-tagged-as-Ukrainian subtitles
  • Linear timing sync — solves fps drift without any external libraries
  • Token budget control — Claude as a precision tool, not a first resort

The script is available in the repository and continues to evolve.
