- entrypoint.sh: replace "cross-job ground-truth leakage" with plain
"Remove stale partial downloads left by a previous docker-kill"
- test_tmpdir_is_inside_persistent_cache_volume: add docker exec command
so future developers know how to run this deployment-contract test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_entrypoint_removes_day_old_orphans and test_entrypoint_preserves_fresh_files
verify the find -mtime +1 -delete logic using os.utime() to fabricate old mtimes
without mocking system time. Also extracts _run_entrypoint helper to remove
subprocess setup duplication.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A silent non-zero exit would previously cause the test to pass incorrectly
because only directory creation was checked. Exit code is now the first
assertion, catching regressions before the filesystem check runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_validate_zip_entry has no ML-stack dependency; importing it via main.py
pulled in surya/torch and caused the test to be skipped in CI. Moving it
to utils.py (fastapi only) and adding fastapi to the CI lightweight install
lets test_zipslip_still_anchors_under_custom_tmpdir run on every push.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On a fresh ocr_cache volume /app/cache/.tmp does not exist yet. The mkdir
ensures the first Surya model download can proceed without ENOSPC on the
512 MB /tmp tmpfs. The find cleanup removes fragments left by docker-kill
mid-download, preventing cross-job ground-truth leakage.
Fixes#614. See ADR-021.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, running the image outside compose loses the TMPDIR redirect
and Surya model downloads fall back to the 512 MB /tmp tmpfs (ENOSPC).
See issue #614, ADR-021.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
os.environ.get(key, default) returns "" when the key exists but is blank —
the default is only used when the key is absent. The or-fallback treats both
absence and blank values as "use the default".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is
on the read-only container filesystem — fixes Nora's blocker. Also
restores the explanatory comment on the ocr_cache volume mount.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a canary log line if os.getuid() == 0. Produces an observable
signal in container logs if the USER directive is ever removed from
the Dockerfile, without requiring an external audit tool.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With --no-create-home, os.path.expanduser("~") resolves to "/" causing
kraken get to write to /.local/share/htrmopo. Replace with
os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is
explicit and override-friendly without a home directory.
Adds two tests verifying env-var resolution and ~-free default.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root.
Creates /home/ocr and /app/cache with correct ownership so named
volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME
so ~ expansion and Hugging Face caching resolve under /app, not /root.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/README.md: remove duplicate infrastructure/ entry at end of folder tree
- ocr-service/CLAUDE.md: add **LLM reminder:** prefix to ALLOWED_PDF_HOSTS
SSRF warning (consistent with all other machine-readable instructions)
- backend/CLAUDE.md: restore ResponseStatusException note for simple controller
validation — avoids LLMs reaching for DomainException for trivial checks
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
extract_page_blocks() walked `record.boundary` and `record.baseline`
unconditionally, so a record that arrived without either (malformed
kraken output, or a MagicMock in tests that iterates to nothing)
crashed with "min() arg is an empty sequence".
Coerce both attributes through list(), require at least 3 points for
the polygon path, fall back to the baseline path when the polygon is
missing, and skip the record entirely when neither is usable —
emitting no block is safer than emitting one with garbage coordinates.
The test helper now sets `boundary` and `baseline` explicitly to
mirror real Kraken 7.0 records (and so the happy-path test exercises
the polygon branch). A new regression test covers the skip path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
main.py unifies the call to both engines and always passes
`sender_model_path` (None for non-Kurrent scripts). Surya's
extract_region_text / extract_page_blocks accepted one fewer positional
arg than Kraken's, so every guided-OCR run on a TYPEWRITER or
HANDWRITING_LATIN document raised "takes 5 positional arguments but 6
were given" and the stream returned 0 blocks / 1 skipped page.
Add an ignored `sender_model_path` kwarg to both Surya functions so the
signatures match Kraken's, and guard the regression with two signature
tests in test_engines.py that compare both engines' parameter lists.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 503/403 auth tests for the /train-sender endpoint, matching the pattern
already used for /train and /segtrain. Also surface test_sender_registry.py
in CI (it needs no ML stack) and add pytest-asyncio to the install step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace exact-string assertions in test_correctable_ocr_error_gets_corrected
and test_sentence_with_multiple_corrections with structural assertions that
verify behavior (correction attempted, marker present, expected stem) without
coupling to a specific pyspellchecker version's frequency weights.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strict greater-than avoids non-determinism: if multiple candidates share
the minimum frequency value, pyspellchecker's ranking is undefined.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop underscore prefix — the helper is part of confidence.py's effective
public API since spell_check.py imports and calls it directly.
Fixes reviewer concern: importing a _-prefixed name across module boundaries
contradicts Python's private-by-convention signal.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
153K words from dtak+dtae 1800-1899 corpora (min_freq=20),
covering pre-reform spellings common in Kurrent/Süterlin documents.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add test for 1×1 image (sub-tile-size) resilience and narrow preprocess_page
fallback from except Exception to (cv2.error, ValueError, MemoryError) so
programming errors propagate instead of being silently swallowed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _model_is_loadable: narrow bare except to (RuntimeError, OSError, ValueError)
with DEBUG-level fallback for unexpected exceptions — prevents silent masking
of missing kraken install or AttributeError on vgsl
- _run_segtrain: replace bare except:pass with log.warning so height-check
fallback is visible in container logs
- New test_ensure_blla_model.py: covers model-OK early return, incompatible
model rename+replace, and missing model download paths
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three issues fixed:
1. --resize both was removed in ketos 7; replaced with --resize union
which extends the model's class mapping to include training data classes.
2. ketos ignores -s when -i is present, so the 1800px blla model caused
7+ GB peak RAM and OOM-killed the host (no swap, 5 GB free).
Now checks the loaded model's input height: only uses the base model
when it was already fine-tuned at 800px; otherwise trains from scratch
at 800px (~200 MB peak). After the first run the trained 800px model
becomes the base for all subsequent fine-tuning runs.
3. segtrain now computes and returns cer = 1 - accuracy, matching the
recognition training path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ensure_blla_model.py which loads the blla segmentation model with
ketos on every container start. If the model is missing or in the legacy
PyTorch ZIP format (incompatible with ketos 7), it re-downloads the
correct CoreML protobuf model from Zenodo (DOI 10.5281/zenodo.14602569).
The Dockerfile now uses entrypoint.sh which runs this check before
starting uvicorn.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_check_training_token previously skipped auth when TRAINING_TOKEN was
empty, allowing unauthenticated requests to reach /train and /segtrain.
Now returns 503 ("Training not configured on this node") when the token
is absent, so missing configuration fails closed rather than open.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two workers × ~5 GB Surya model load = ~10 GB required, exceeding the
8 GB memory cap and causing OOM on the first /train call. Two OS
processes also cause model-state divergence after training, contradicting
the single-node constraint documented in ADR-001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ketos 7 defaults to safetensors output, but kraken's load_any() only
handles CoreML (.mlmodel). Adding --weights-format coreml ensures the
hot-swap after training produces a file that load_any() can parse.
Also fixed _find_best_model to look for best_<score>.mlmodel (produced
by --weights-format coreml) in addition to the previous checkpoint_*
pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kraken 7 removed support for the legacy `path` format (image + .gt.txt
pairs) in VGSLRecognitionDataModule despite the CLI still advertising it.
Switching to PAGE XML (-f page) format which is the supported standard.
- Java export now writes .xml alongside .png (PAGE XML with TextLine,
Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (& < >)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on
OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ketos segtrain has no batch-size flag (-B), so with the default 1800px
input height the intermediate CNN feature maps consume ~500 MB+ per
image, causing the kernel OOM-killer (exit -9) to terminate the process.
On first run (no existing blla.mlmodel), override the VGSL spec to use
800px height instead. Subsequent runs load the saved model with
--resize both, preserving incremental fine-tuning.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force CPU-only training (--device cpu), cap OpenMP/BLAS thread pool at 2
(--threads 2), and reduce epochs from 50 to 10 (-N 10). 50 epochs on a
laptop OOM-killed the container. 10 epochs is sufficient for incremental
fine-tuning runs; more data is added over time and training re-run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DataLoader worker subprocesses crash inside Docker due to multiprocessing
fork restrictions. Pass --workers 0 to both ketos train and ketos segtrain
so data loading runs in the main process.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>