# ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

## What this service owns

- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API at port 8000 (internal Docker network — no external port)

## What this service does NOT own

- Job lifecycle — tracked in the backend's `ocr/` domain
- MinIO storage — the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL

## API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |

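The `X-Training-Token` check on `/train` and `/segtrain` can be pictured roughly as follows. This is a minimal stdlib sketch, not the code in `main.py`: it assumes the header value is compared against the `TRAINING_TOKEN` environment variable, and both the function name `is_training_request_authorized` and the reject-when-empty policy are hypothetical.

```python
import hmac
import os
from typing import Mapping, Optional


def is_training_request_authorized(
    headers: Mapping[str, str],
    expected_token: Optional[str] = None,
) -> bool:
    """Return True only if X-Training-Token matches the configured token.

    Hypothetical helper: the real service may structure this differently.
    """
    if expected_token is None:
        # Assumption: the token is read from the TRAINING_TOKEN variable.
        expected_token = os.environ.get("TRAINING_TOKEN", "")
    if not expected_token:
        # Assumed policy: an unset token rejects every request rather than
        # silently leaving the training endpoints open.
        return False
    supplied = headers.get("X-Training-Token", "")
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

In a FastAPI app a check like this would typically sit in a dependency shared by both training routes, so that `/ocr` and `/health` stay unauthenticated as the table describes.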
## Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | — | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | — | SSRF protection — comma-separated allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | — | — | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | — | — | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
| `HF_HOME` | `/app/cache` | — | — | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| `XDG_CACHE_HOME` | `/app/cache` | — | — | XDG cache root (used by some Surya components alongside `HF_HOME`). |
| `TORCH_HOME` | `/app/models/torch` | — | — | PyTorch model cache. Kept on the persistent models volume. |
| `TMPDIR` | `/app/cache/.tmp` | — | — | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs — see ADR-021. |

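To illustrate what an `ALLOWED_PDF_HOSTS` allowlist implies for incoming PDF URLs, here is a minimal sketch using only the standard library. It is not the validation code in `main.py`; the helper names `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical, and restricting schemes to `http`/`https` is an assumption.

```python
from urllib.parse import urlsplit


def parse_allowed_hosts(raw: str) -> frozenset:
    """Parse a comma-separated host list such as 'minio,localhost,127.0.0.1'."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        # A wildcard would defeat the SSRF protection entirely.
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_allowed_pdf_url(url: str, allowed_hosts: frozenset) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        # Assumption: file://, ftp:// and similar schemes are never valid here.
        return False
    # .hostname drops userinfo and port, so 'http://minio@evil.example/'
    # is judged by 'evil.example', not by 'minio'.
    host = parts.hostname
    return host is not None and host in allowed_hosts
```

With the default value, `http://minio:9000/...` would pass while a URL pointing at any other host would be rejected before the service attempts a download.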
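The `TMPDIR` requirement can be sanity-checked from Python, because the standard `tempfile` module consults `TMPDIR` when it picks its temporary directory. The sketch below is an illustration only: the function name `staging_dir_is_under` is hypothetical, and whether the model downloaders stage files through `tempfile` is an assumption (ADR-021 holds the actual reasoning).

```python
import os
import tempfile


def staging_dir_is_under(root: str) -> bool:
    """Report whether Python's temp directory resolves to a path under root."""
    # tempfile caches its choice in the module-level `tempdir` variable;
    # resetting it forces TMPDIR to be read again.
    tempfile.tempdir = None
    staging = os.path.realpath(tempfile.gettempdir())
    root = os.path.realpath(root)
    return os.path.commonpath([staging, root]) == root
```

A startup check along the lines of `staging_dir_is_under("/app/cache")` would flag a container where `TMPDIR` was dropped and staging fell back to the `/tmp` tmpfs. Note that `tempfile` skips a `TMPDIR` that does not exist or is not writable, so the directory has to be created first.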
## Key files

| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint — runs `ensure_blla_model.py` then starts the server |

## Backend counterpart

`backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md`