Files

Marcel e31dac5c9c test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir

A silent non-zero exit would previously cause the test to pass incorrectly
because only directory creation was checked. Exit code is now the first
assertion, catching regressions before the filesystem check runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-18 11:18:14 +02:00

__pycache__

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

.venv

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

dictionaries

feat(ocr): add DTA-derived historical German wordlist and generation script

2026-04-17 16:48:26 +02:00

engines

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

.dockerignore

fix(docker): soften ocr-service dependency and clean up compose

2026-04-13 12:29:21 +02:00

CLAUDE.md

docs(ocr): document TMPDIR convention and add ADR-021

2026-05-18 10:58:10 +02:00

confidence.py

refactor(ocr): make collapse_adjacent_markers a public function

2026-04-17 17:20:31 +02:00

Dockerfile

build(ocr): set ENV TMPDIR=/app/cache/.tmp so docker run uses SSD staging

2026-05-18 10:53:15 +02:00

ensure_blla_model.py

fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback

2026-05-17 18:53:26 +02:00

entrypoint.sh

fix(ocr): create TMPDIR on startup and clear day-old orphans

2026-05-18 10:54:17 +02:00

main.py

refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI

2026-05-18 11:17:15 +02:00

models.py

feat(ocr): per-sender model registry and /train-sender endpoint

2026-04-17 18:05:39 +02:00

preprocessing.py

test(ocr): add resilience tests for tiny image and unexpected exception propagation

2026-04-17 15:16:17 +02:00

README.md

docs(ocr): document TMPDIR convention and add ADR-021

2026-05-18 10:58:10 +02:00

requirements.txt

feat(ocr): add pyspellchecker dependency

2026-04-17 16:41:24 +02:00

spell_check.py

refactor(ocr): document > 50 frequency threshold rationale

2026-04-17 17:21:37 +02:00

test_confidence.py

feat(ocr): per-script-type confidence thresholds

2026-04-12 20:50:59 +02:00

test_engines.py

fix(ocr): guard Kraken block extraction against missing boundary/baseline

2026-04-23 09:33:03 +02:00

test_ensure_blla_model.py

fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback

2026-05-17 18:53:26 +02:00

test_main.py

test(ocr): add startup root canary tests for main.py lifespan

2026-05-17 17:29:47 +02:00

test_preprocessing.py

test(ocr): add resilience tests for tiny image and unexpected exception propagation

2026-04-17 15:16:17 +02:00

test_sender_registry.py

refactor(ocr): mark _SenderModelRegistry.contains as private (_contains)

2026-04-17 21:26:46 +02:00

test_spell_check.py

test(ocr): decouple correction tests from exact library dictionary state

2026-04-17 17:23:09 +02:00

test_stream.py

feat(ocr): integrate preprocessing into stream and batch endpoints

2026-04-17 14:16:47 +02:00

test_tmpdir.py

test(ocr): assert entrypoint.sh exit code in test_entrypoint_creates_tmpdir

2026-05-18 11:18:14 +02:00

test_training_auth.py

test(ocr): add /train-sender auth tests and run sender registry tests in CI

2026-04-17 21:14:27 +02:00

utils.py

refactor(ocr): extract _validate_zip_entry to utils.py so ZIP Slip test runs in CI

2026-05-18 11:17:15 +02:00

README.md

ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
Baseline layout analysis: Kraken BLLA model
Sender recognition: trained per-archive sender models
HTTP API on port 8000 (reachable only on the internal Docker network; no external port is published)

What this service does NOT own

Job lifecycle — tracked in the backend's ocr/ domain
MinIO storage — the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL

API endpoints

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
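
The `X-Training-Token` guard on `/train` and `/segtrain` could be checked roughly as follows. This is a minimal sketch, not the code in `main.py`: the function names are hypothetical, and the fail-closed handling of an empty token is an assumption based on the guidance not to leave `TRAINING_TOKEN` empty in production.

```python
import hmac
import os


def token_from_env(environ=os.environ):
    """Read TRAINING_TOKEN; an empty string counts as 'not configured'."""
    # `or None` treats an empty value the same as an unset variable.
    return environ.get("TRAINING_TOKEN") or None


def is_training_request_authorized(header_value, expected_token):
    """Return True only if the header matches a non-empty configured token."""
    # Assumption: fail closed, so a missing token never authorizes a request.
    if not expected_token or not header_value:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(
        header_value.encode("utf-8"), expected_token.encode("utf-8")
    )
```

In a FastAPI handler, a check like this would typically run before any training work starts, with a 401 or 403 response when it returns `False`.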

Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
| --- | --- | --- | --- | --- |
| `TRAINING_TOKEN` | - | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | - | SSRF protection: comma-separated list of allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | - | - | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`). |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | - | - | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
| `HF_HOME` | `/app/cache` | - | - | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| `XDG_CACHE_HOME` | `/app/cache` | - | - | XDG cache root (used by some Surya components alongside `HF_HOME`). |
| `TORCH_HOME` | `/app/models/torch` | - | - | PyTorch model cache. Kept on the persistent models volume. |
| `TMPDIR` | `/app/cache/.tmp` | - | - | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs (see ADR-021). |
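
To illustrate what the `ALLOWED_PDF_HOSTS` allowlist protects against, here is a minimal sketch of a host check using only the standard library. The helper names are hypothetical and the actual validation in `main.py` may differ; rejecting `*` and non-HTTP schemes are assumptions that follow from the table's description.

```python
from urllib.parse import urlsplit

DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw=DEFAULT_ALLOWED_HOSTS):
    """Turn the comma-separated setting into a set of lower-cased host names."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        # A wildcard would disable the SSRF protection entirely.
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_pdf_url_allowed(url, allowed_hosts):
    """Accept only http(s) URLs whose host is on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are rejected.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lower-cased and excludes the port and any userinfo part,
    # so "http://minio@evil.example/" resolves to "evil.example".
    host = parts.hostname
    return host is not None and host in allowed_hosts
```

Comparing the parsed host name rather than searching the raw URL string matters here: a substring test would accept URLs that merely mention an allowed host in the userinfo or path.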

Key files

| File | Purpose |
| --- | --- |
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
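
The commit history for `entrypoint.sh` mentions creating `TMPDIR` on startup and clearing day-old orphans. The real logic lives in that shell script; the following is only a Python sketch of the same idea, with hypothetical function names and the assumption that only plain files at the top level of the staging directory are cleaned up.

```python
import os
import time
from pathlib import Path

ONE_DAY_SECONDS = 24 * 60 * 60


def staging_dir_from_env(environ=os.environ):
    """Resolve the staging directory, falling back to the documented default."""
    # `or` also covers TMPDIR being set to an empty string.
    return environ.get("TMPDIR") or "/app/cache/.tmp"


def prepare_staging_dir(path, max_age_seconds=ONE_DAY_SECONDS, now=None):
    """Create the directory and delete top-level files older than the cutoff.

    Returns the sorted names of the files that were removed.
    """
    staging = Path(path)
    staging.mkdir(parents=True, exist_ok=True)
    cutoff = (time.time() if now is None else now) - max_age_seconds
    removed = []
    for entry in staging.iterdir():
        # Assumption: directories and symlinks are left untouched.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Python's `tempfile` module consults the `TMPDIR` environment variable when choosing its default directory, which is why the directory has to exist before anything in the process stages a download there.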

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md