---
description: ""
draft: true
cascade:
  type: docs
date: 2026-03-28
---
Author: Mykola Dementiev · Date: 2026-03-28
## TL;DR

We had ~106 episodes of a TV series with no Ukrainian subtitles. We built a Python script that:

- Searches for ready-made Ukrainian subtitles on OpenSubtitles.com (free, 100 downloads/day)
- If none are found, translates through a configurable chain of free translators: Google → Argos (offline) → MyMemory → LibreTranslate
- Uses Claude AI only as a last resort, with token budget control
- Keeps a full run log in `Log.md` and `Log_console.txt`
- Automatically synchronizes subtitle timings between different video releases

Result: 25 episodes translated on the first day; the rest are processed incrementally.
## The Problem

We have the complete run of "The Fosters" — 106 episodes in .mkv format. Subtitles are either missing entirely or only available as English .srt files. Watching with English subtitles is fine, but we wanted our native language.

Options we ruled out:

- Manual translation — 106 episodes × ~45 minutes each = not realistic
- A paid service — expensive, and quality for Ukrainian is inconsistent
- Plain `claude` — burns through tokens fast; the monthly limit runs out quickly

We needed a solution that:

- Runs autonomously (fire and forget)
- Minimizes spending on paid resources
- Verifies translation quality
- Can be interrupted and resumed
## Solution Architecture

### Overall Pipeline
```plantuml
@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13
skinparam ArrowColor #444
title Subtitle Translation Pipeline
start
:Find episode without subtitles;
:Search on **OpenSubtitles.com**\n(uk, 100 requests/day);
if (Found?) then (yes)
  :Download subtitles;
  :Verify language\n(uk-exclusive chars: ї, є, ґ);
  if (Ukrainian?) then (yes)
    :Sync timings\n(2-point linear calibration);
    :✅ Done;
    stop
  else (no — it's Russian)
    :Discard file;
  endif
else (no)
endif
:Translate text\n(Chain of Responsibility);
:Check quality\n(Cyrillic ratio ≥ 70%);
if (Quality OK?) then (yes)
  :✅ Save .uk.srt;
else (no)
  :❌ Delete file;
  note right: Will retry\non next run
endif
stop
@enduml
```

### Step 1 — OpenSubtitles.com
The first step is to look for already-translated Ukrainian subtitles. OpenSubtitles has a huge database and often has exactly what you need.
API: REST, free tier = 100 downloads/day.
```http
# Search by title and season/episode number
GET /api/v1/subtitles?query=The+Fosters&season_number=1&episode_number=3&languages=uk
```

Important catch: subtitles tagged as `uk` (Ukrainian) are often actually Russian. We detect the real language using Ukrainian-exclusive characters:
```python
import re

def is_ukrainian(text):
    uk_exclusive = len(re.findall(r'[їЇєЄґҐ]', text))
    uk_i = len(re.findall(r'[іІ]', text))
    ru_exclusive = len(re.findall(r'[ыЫэЭъЪ]', text))
    total_cyr = len(re.findall(r'[а-яА-ЯіІїЇєЄґҐёЁыЫэЭъЪ]', text))
    if total_cyr < 20:
        return True   # too little text — don't reject
    if ru_exclusive > total_cyr * 0.02:
        return False  # has "ы/э/ъ" — this is RU
    return (uk_exclusive + uk_i) > total_cyr * 0.03  # has "ї/є/і" — this is UK
```

### Step 2 — Timing Synchronization
Subtitles from OpenSubtitles may be cut for a different video release (different fps, different start offset). We synchronize using 2-point linear calibration:
```plantuml
@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13
title Subtitle Timing Synchronization (2-point calibration)
participant "EN file\n(our timing)" as en #EEF3FF
participant "UK file\n(from OpenSubtitles)" as uk #FFF3EE
participant "compute_sync()" as sync #EEFFEE
participant "apply_sync()" as apply #EEFFEE
participant "verify_sync()" as verify #EEFFEE
en -> sync : t0, t1 (first/last timestamp)
uk -> sync : t0, t1 (first/last timestamp)
sync -> sync : scale = (en_t1 - en_t0) / (uk_t1 - uk_t0)
sync -> sync : offset = en_t0 - scale × uk_t0
sync --> apply : (offset_ms, scale)
apply -> apply : new_t = scale × t + offset_ms
apply --> verify : synchronized subtitles
verify -> verify : |en_t - uk_t| < 2000ms\nfor first 5 entries
alt check passed
  verify --> en : ✅ accepted
else drift > 2 sec
  verify --> en : ❌ discard file
end
@enduml
```

The math is simple: two equations, two unknowns (scale and offset). This handles both a start offset and an fps difference (e.g. 23.976 vs 25) simultaneously.
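To see the two equations at work, here is a standalone sketch with made-up timestamps (the numbers are for illustration only, not from a real episode):

```python
# Two-point linear calibration: new_t = scale * t + offset
# Timestamps in milliseconds; values are invented for this example.
en_t0, en_t1 = 10_000, 2_410_000   # EN release: first and last cue
uk_t0, uk_t1 = 12_000, 2_512_500   # UK release: same cues, shifted and stretched

scale = (en_t1 - en_t0) / (uk_t1 - uk_t0)   # compensates fps drift
offset = en_t0 - scale * uk_t0              # anchors the first cue

# Both calibration points now map exactly onto the EN timeline:
assert abs(scale * uk_t0 + offset - en_t0) < 1e-6
assert abs(scale * uk_t1 + offset - en_t1) < 1e-6
```

Every other cue in between is mapped with the same `scale * t + offset`, which is why two points are enough.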
```python
def compute_sync(en_blocks, uk_blocks):
    en_t0 = parse_time(en_blocks[0][1].split('-->')[0])
    en_t1 = parse_time(en_blocks[-1][1].split('-->')[0])
    uk_t0 = parse_time(uk_blocks[0][1].split('-->')[0])
    uk_t1 = parse_time(uk_blocks[-1][1].split('-->')[0])
    scale = (en_t1 - en_t0) / (uk_t1 - uk_t0)
    offset = en_t0 - scale * uk_t0
    return offset, scale
```

### Step 3 — Configurable Chain of Responsibility for Translation
If no subtitles are found on OpenSubtitles, we translate ourselves. We used the Chain of Responsibility pattern — and made the entire chain configurable from the top of the file:
```python
# ─── Translation chain (Chain of Responsibility) ──────────────────
# Format: "Service (fallbacks:timeout_inc) -> ..."
#   fallbacks   — retries after the first attempt (0 = one attempt only)
#   timeout_inc — seconds added to the timeout per retry
# First service = primary (Phase 1 batch). The rest = CoR fallback for failures.
# An unavailable service (not installed) is skipped automatically.
TRANSLATOR_CHAIN = "Google (2:30) -> Argos (1) -> MyMemory (1) -> LibreTranslate (1) -> Claude"

# ─── Local LibreTranslate server ──────────────────────────────────
# POST /translate {"q": text, "source": "en", "target": "uk", "format": "text"}
# Leave "" to skip LibreTranslate entirely.
LIBRETRANSLATE_URL = "http://192.168.1.147:6455"
```

At startup, the script analyses the chain and prints what will actually run:
```text
════════════════════════════════════════════════════════════════
🔗 TRANSLATOR_CHAIN: Google (2:30) -> Argos (1) -> MyMemory (1) -> LibreTranslate (1) -> Claude
   Phase 1 (primary): Google
   Phase 2 (CoR fallback):
     ├─ Google          3 attempts, +30s/retry ✓
     ├─ Argos           2 attempts ✓ offline model ready
     ├─ MyMemory        2 attempts ✓
     ├─ LibreTranslate  2 attempts ✓ POST http://192.168.1.147:6455/translate
     └─ Claude          1 attempt  ✓
════════════════════════════════════════════════════════════════
```

If a service is unavailable (e.g. Argos is not installed), it is automatically removed from the chain and the next service takes over. No crashes, no config changes needed.
#### How the chain config is parsed
```python
import re

def _parse_chain_config(chain_str: str) -> list:
    steps = []
    for part in re.split(r'\s*->\s*', chain_str.strip()):
        # Allows spaces around the colon: (1:30), (1 : 30) or (1: 30)
        m = re.match(r'^(\w+)\s*(?:\(\s*(\d+)\s*(?::\s*(\d+)\s*)?\))?$', part.strip())
        if not m:
            continue
        steps.append({
            'name': m.group(1),
            'fallbacks': int(m.group(2)) if m.group(2) else 0,
            'timeout_inc': int(m.group(3)) if m.group(3) else 0,
        })
    return steps
```

#### Two-phase execution
Phase 1 — the first service in the chain processes all entries in batches, showing progress:
```text
[Google 1/12] 80 entries... ✓ 77/80 (3 → retry)
[Google 2/12] 80 entries... ✓ 80/80
...
```

Phase 2 — failed entries go through the full CoR chain:
```plantuml
@startuml
!theme plain
skinparam backgroundColor #FAFAFA
skinparam defaultFontSize 13
skinparam objectBorderColor #666
skinparam objectBackgroundColor #EEF3FF
title Chain of Responsibility — Configurable Fallback Translators
object "🌐 **Google Translate**\nonline · free\nPhase 1 + 2 retries (+30s each)" as google {
  fallbacks = 2
  timeout_inc = 30s
}
object "🦜 **Argos Translate**\noffline · unlimited" as argos {
  fallbacks = 1
}
object "💬 **MyMemory**\nonline · free" as mymemory {
  fallbacks = 1
}
object "🖥️ **LibreTranslate**\nlocal server · POST /translate" as libre {
  fallbacks = 1
}
object "🤖 **Claude AI**\ntoken budget control" as claude {
  fallbacks = 0
}
google --> argos : failed entries
argos --> mymemory : failed entries
mymemory --> libre : failed entries
libre --> claude : last resort
@enduml
```

#### TranslatorStep — with timeout increment support
```python
class TranslatorStep:
    def __init__(self, name, fn, fallbacks=0, timeout_inc=0, timeout_base=30):
        self.name = name
        self.fn = fn
        self.fallbacks = fallbacks        # retries after the first attempt
        self.timeout_inc = timeout_inc    # seconds added to the timeout per retry
        self.timeout_base = timeout_base
        self.next: 'TranslatorStep | None' = None

    def set_next(self, step: 'TranslatorStep') -> 'TranslatorStep':
        self.next = step
        return step

    def execute(self, entries):
        remaining = entries
        for attempt in range(self.fallbacks + 1):
            if not remaining:
                return
            timeout = self.timeout_base + attempt * self.timeout_inc
            label = self.name if attempt == 0 else f"{self.name}↺{attempt}"
            print(f"  [{label}] {len(remaining)} entries...", end='', flush=True)
            result = self.fn(remaining, timeout=timeout)
            # Keep only entries still missing or below the quality bar
            remaining = [
                (idx, clean) for idx, clean in remaining
                if not (result.get(idx) and cyrillic_ratio(result[idx]) >= MIN_QUALITY)
            ]
        if remaining and self.next:
            self.next.execute(remaining)
```

`Google (2:30)` means: attempt 1 with a 30 s timeout → attempt 2 with 60 s → attempt 3 with 90 s. Stubborn lines get more time on each retry instead of failing the same way every time.
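Spelled out, the schedule that `execute()` derives for `Google (2:30)` (with the default 30 s base timeout) is:

```python
# Escalation schedule for "Google (2:30)": fallbacks=2, timeout_inc=30, base=30
fallbacks, timeout_inc, timeout_base = 2, 30, 30

schedule = [timeout_base + attempt * timeout_inc for attempt in range(fallbacks + 1)]
# schedule == [30, 60, 90] → attempt 1: 30s, attempt 2: 60s, attempt 3: 90s
```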
#### Building the chain dynamically
```python
# Filter out unavailable services, build the linked list in config order
resolved = [s for s in _parse_chain_config(TRANSLATOR_CHAIN)
            if _check_service(s['name'], has_argos)[0]]

steps = [
    TranslatorStep(name=s['name'], fn=_FN_MAP[s['name']],
                   fallbacks=s['fallbacks'], timeout_inc=s['timeout_inc'])
    for s in resolved
]
for i in range(len(steps) - 1):
    steps[i].set_next(steps[i + 1])

# Phase 1: primary batch pass
primary_fn(all_entries, timeout=30)

# Phase 2: CoR for failures
steps[0].execute(failed_entries)
```

### Why Google, not Argos, as primary?
We initially used Argos Translate as the primary translator (offline, unlimited). It worked for about 6 episodes before we noticed the quality was unacceptable for TV dialogue:
| Original | Argos | Should be |
|---|---|---|
| "I'm sorry" | "Я шкода" | "Вибачте" |
| "Previously on The Fosters" | "Попередньо на піни" | "Раніше у Фостерів" |
Argos translates word-by-word without context. Google handles idioms and dialogue naturally. The fix: Google becomes primary, Argos stays in the CoR chain as an offline fallback for when Google is unavailable.
### LibreTranslate local server

Public Lingva Translate instances are mostly dead (403 errors). Instead, we run LibreTranslate locally:

```shell
docker run -p 6455:5000 libretranslate/libretranslate
```

The script uses the LibreTranslate REST API:
```http
POST http://192.168.1.147:6455/translate
{"q": "text", "source": "en", "target": "uk", "format": "text", "api_key": ""}
# → {"translatedText": "переклад"}
```

If `LIBRETRANSLATE_URL = ""`, the LibreTranslate step is automatically skipped.
### Step 4 — Claude Token Budget Control
Claude is used only for entries that failed every previous step. To avoid burning the entire monthly limit in a single run, we read the current token usage:
```python
# ~/.claude/fetch-claude-usage.swift returns "84|2026-03-28T15:00:00Z",
# where 84 = % of tokens USED, so free = 100 - 84 = 16%
def get_free_tokens_pct() -> float | None:
    ...

def claude_allowed(min_pct: float) -> bool:
    if _token_pct is None:
        return True               # unknown → allow optimistically
    return _token_pct >= min_pct  # _token_pct caches the free percentage
```

Run with a 30% threshold:
```shell
python3 translate_subtitles.py 11 --min-tokens 30
```

## File Structure
```text
The forest - srt - ua/
├── translate_subtitles.py    # main script
├── subtitles_source/         # original .srt files
│   ├── Season 1/
│   │   ├── S01E01.srt
│   │   ├── S01E01.en.srt     # copy used for translation
│   │   └── ...
│   └── Season 2/
├── subtitles_output/         # results
│   ├── Season 1/
│   │   ├── S01E01.uk.srt     # finished translation
│   │   └── ...
├── Log.md                    # structured run log (markdown table)
└── Log_console.txt           # full console output
```

## Run Log
`Log.md` is a table updated after every file (not just at the end of the run). Per-translator phrase counts are tracked separately:
| Date & Time | Run Result |
|---------------------|--------------------------------------------------------------------------|
| 2026-03-28 21:15:33 | ✓ 4/4 \| OS×2 \| Google×340 \| Claude×12 \| OS:86/100 \| tokens: 63%→51% |
| 2026-03-28 19:02:11 | ✓ 11/11 \| Google×8102 \| Argos×44 \| OS:98/100 \| tokens: 84%→71% |

## Setup & Usage
```shell
# Dependencies
pip3 install deep-translator argostranslate --break-system-packages

# Download the Argos offline model (optional — used as fallback)
argospm install translate-en_uk

# Local LibreTranslate (optional — used as fallback)
docker run -p 6455:5000 libretranslate/libretranslate

# Run
python3 translate_subtitles.py                  # default: 30 files, 30% token threshold
python3 translate_subtitles.py 10               # process 10 files per run
python3 translate_subtitles.py --min-tokens 50  # stricter token limit
python3 translate_subtitles.py --debug-tokens   # diagnose token source
```

Ctrl+C interrupts cleanly — the log is saved and no files are corrupted.
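One common way to get that behaviour is to defer the interrupt until the current file is finished. A sketch of the pattern — `handle` and `flush_log` are stand-ins, not the script's actual functions:

```python
import signal

def process_files(files, handle, flush_log):
    """Process files one at a time; on Ctrl+C, finish the current file first."""
    interrupted = False

    def on_sigint(signum, frame):
        nonlocal interrupted
        interrupted = True  # remember the request instead of aborting mid-file

    previous = signal.signal(signal.SIGINT, on_sigint)
    done = []
    try:
        for f in files:
            handle(f)       # translate one file
            flush_log(f)    # log is written after every file, so nothing is lost
            done.append(f)
            if interrupted:
                break
    finally:
        signal.signal(signal.SIGINT, previous)  # restore the previous handler
    return done
```

Because the handler only sets a flag, the file being translated and the log entry for it are always written out completely before the script exits.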
## Results
| Metric | Value |
|---|---|
| Total episodes | 106 |
| Translated in 1 day | 25 |
| Primary translator | Google Translate (free, best quality) |
| Average file size | ~70 KB |
| Average coverage | 97–100% |
| Claude usage | ~3% of phrases (last resort only) |
## Key Takeaways
- OpenSubtitles first — if subtitles already exist, why translate?
- Configurable chain — one line at the top of the file controls the entire pipeline; add, remove, or reorder translators instantly
- Google as primary — better quality than offline models for TV dialogue; Argos stays as offline fallback
- Timeout escalation — `Google (2:30)` means each retry gets 30 more seconds, giving stubborn lines a fair chance
- ї/є/ґ language check — a simple but reliable detector that filters Russian subtitles tagged as Ukrainian
- Linear timing sync — solves fps drift without any external libraries
- Token budget control — Claude as a precision tool, not a first resort
The script is available in the repository and continues to evolve.