153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Sütterlin documents.
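
A minimal sketch of how a cutoff such as min_freq=20 could yield a word
list like this. The tokenizer, the function name, and the sample text are
assumptions for illustration, not the script that built the list.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: a token is any run of Unicode letters (umlauts, ß, long s included).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def build_wordlist(lines: Iterable[str], min_freq: int = 20) -> List[str]:
    """Return the sorted words that occur at least min_freq times."""
    counts = Counter()
    for line in lines:
        # Case is preserved because German nouns are capitalised.
        counts.update(TOKEN_RE.findall(line))
    return sorted(word for word, n in counts.items() if n >= min_freq)
```

Raising min_freq shrinks the list and drops rare OCR noise, at the cost of
losing genuinely rare historical spellings.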
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken's -f pdf mode tries to write output next to the input file,
which fails on read-only mounts. Instead, extract pages as PNGs via
pypdfium2 (already installed), then run kraken on each image.
Both models run in a single container per PDF to avoid repeated
container startup overhead.
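
A rough sketch of the extract-then-recognize flow described above. The
function names, output naming scheme, and 300 dpi render scale are
assumptions; the kraken subcommand chain (segment -bl ocr -m) should be
checked against the installed kraken version.

```python
import subprocess
from pathlib import Path
from typing import List


def extract_pages(pdf_path: Path, out_dir: Path, scale: float = 300 / 72) -> List[Path]:
    """Render every PDF page to a PNG inside out_dir, which must be writable."""
    import pypdfium2 as pdfium  # lazy import: the helpers below work without it

    out_dir.mkdir(parents=True, exist_ok=True)
    pdf = pdfium.PdfDocument(str(pdf_path))
    pages = []
    for index in range(len(pdf)):
        # scale=1 corresponds to 72 dpi, so 300/72 renders at about 300 dpi.
        image = pdf[index].render(scale=scale).to_pil()
        target = out_dir / f"{pdf_path.stem}-{index + 1:04d}.png"
        image.save(target)
        pages.append(target)
    return pages


def kraken_command(image: Path, output: Path, model: Path) -> List[str]:
    """Build the kraken call for one page image and one model."""
    return ["kraken", "-i", str(image), str(output),
            "segment", "-bl", "ocr", "-m", str(model)]


def ocr_pdf(pdf_path: Path, work_dir: Path, models: List[Path]) -> None:
    """Run every model on every page; all output goes to work_dir."""
    for page in extract_pages(pdf_path, work_dir):
        for model in models:
            output = work_dir / f"{page.stem}.{model.stem}.txt"
            subprocess.run(kraken_command(page, output, model), check=True)
```

Because pages and results are written to work_dir only, the directory
holding the input PDF can stay mounted read-only.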
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous approach ran find across the htrmopo cache, which failed
because the -newer /tmp check ran in a separate container. The script
now parses the 'Model dir: <path>' line from the kraken get output
directly.
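
The parsing step could look roughly like this. Only the
'Model dir: <path>' line format comes from the change above; the function
name and the sample path in the usage note are hypothetical.

```python
import re
from typing import Optional

# Matches a line of the form "Model dir: <path>" anywhere in the output.
MODEL_DIR_RE = re.compile(r"^\s*Model dir:\s*(?P<path>.+?)\s*$", re.MULTILINE)


def parse_model_dir(output: str) -> Optional[str]:
    """Return the path from the 'Model dir:' line, or None if it is absent."""
    match = MODEL_DIR_RE.search(output)
    return match.group("path") if match else None
```

For example, parse_model_dir("Model dir: /cache/example\n") gives
"/cache/example", and output without such a line gives None, so the caller
can fail loudly instead of copying from a guessed location.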
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 uses DOIs (not short names) to identify models from Zenodo.
Updated to use actual DOIs:
- 10.5281/zenodo.7933463 — German handwriting HTR
- 10.5281/zenodo.13788177 — McCATMuS generic handwritten/printed/typed
Added the -f pdf flag for PDF input, volume mounts for the import dir,
and a post-download copy from the htrmopo cache to the models volume.
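
A sketch of the post-download copy, assuming the downloaded model files
end in .mlmodel and that both directories are visible from the same
container. The function name and glob pattern are illustrative, not taken
from the script.

```python
import shutil
from pathlib import Path
from typing import List


def copy_models(model_dir: Path, target_dir: Path, pattern: str = "*.mlmodel") -> List[Path]:
    """Copy model files from the download cache into the models volume."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(model_dir.glob(pattern)):
        destination = target_dir / source.name
        shutil.copy2(source, destination)  # keeps timestamps along with content
        copied.append(destination)
    return copied
```

An empty return value means nothing matched the pattern, which is worth
treating as an error in the calling script.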
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runbook script to download both HTR-United Kurrent model candidates
(german_kurrent_manu_9, kurrent-de) into the ocr_models Docker volume,
test them against sample documents, and activate the winner.
Usage:
./scripts/download-kraken-models.sh # download both
./scripts/download-kraken-models.sh --activate 1 # pick model 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stops the container, removes the stale node_modules volume, and
rebuilds the image. Run this after adding or updating npm dependencies.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>