marcel/familienarchiv

Marcel cfd49ff69e

docs(ocr): document TMPDIR convention and add ADR-021

- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-18 10:58:10 +02:00

ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API at port 8000 (internal Docker network, no external port)

What this service does NOT own

- Job lifecycle: tracked in the backend's ocr/ domain
- MinIO storage: the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage: results are streamed back to the backend, which writes them to PostgreSQL
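
Because the PDF location arrives as a URL in the request body, the host has to be checked before anything is fetched. The following is a minimal sketch of such an allowlist check driven by `ALLOWED_PDF_HOSTS`; it is not the code in `main.py`, and the function names `allowed_pdf_hosts` and `is_allowed_pdf_url` are hypothetical.

```python
import os
from urllib.parse import urlparse

# Default taken from the environment variables table in this README.
DEFAULT_ALLOWED = "minio,localhost,127.0.0.1"


def allowed_pdf_hosts(env=None):
    """Parse ALLOWED_PDF_HOSTS into a set of lowercase host names (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED)
    return {h.strip().lower() for h in raw.split(",") if h.strip()}


def is_allowed_pdf_url(url, hosts):
    """Accept only http(s) URLs whose host matches the allowlist exactly."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops the port and any userinfo and is already lowercased.
    host = parsed.hostname
    return host is not None and host in hosts
```

In this sketch matching is exact, so a `*` entry would be treated as a literal host name and never act as a wildcard.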

API endpoints

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
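
For a quick manual check from inside the Docker network, the requests can be assembled with the standard library. This is only a sketch: the host name `ocr-service`, the JSON field name `pdf_url`, and the empty JSON body for `/train` are assumptions, so check `models.py` for the real request schema.

```python
import json
import urllib.request

# Assumed service host name on the internal Docker network; port 8000 is from this README.
BASE_URL = "http://ocr-service:8000"


def build_ocr_request(pdf_url):
    """Build a POST /ocr request. The 'pdf_url' field name is an assumption."""
    body = json.dumps({"pdf_url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ocr",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_train_request(token):
    """Build a POST /train request carrying the X-Training-Token header."""
    return urllib.request.Request(
        f"{BASE_URL}/train",
        data=b"{}",  # assumed empty JSON body
        headers={"X-Training-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing either object to `urllib.request.urlopen` sends the request; nothing is sent by the builders themselves.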

Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
| --- | --- | --- | --- | --- |
| `TRAINING_TOKEN` | — | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | — | SSRF protection: comma-separated allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | — | — | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | — | — | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
| `HF_HOME` | `/app/cache` | — | — | HuggingFace model cache root. Keeps model downloads on the persistent cache volume. |
| `XDG_CACHE_HOME` | `/app/cache` | — | — | XDG cache root (used by some Surya components alongside `HF_HOME`). |
| `TORCH_HOME` | `/app/models/torch` | — | — | PyTorch model cache. Kept on the persistent models volume. |
| `TMPDIR` | `/app/cache/.tmp` | — | — | Download-staging directory for GB-scale Surya model files. Must point to a disk-backed path, not the 512 MB `/tmp` tmpfs; see ADR-021. |
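
One detail worth knowing about `TMPDIR`: Python's `tempfile` module only honours it if the directory already exists and is writable, and otherwise falls back to other candidates such as `/tmp`. The sketch below shows one way to create the staging directory up front and check its free space. The helper names are hypothetical and this is not the service's actual startup code.

```python
import os
import shutil

# Default taken from the environment variables table in this README.
DEFAULT_TMPDIR = "/app/cache/.tmp"


def ensure_staging_dir(env=None):
    """Create the TMPDIR staging directory if needed and return its path."""
    env = os.environ if env is None else env
    path = env.get("TMPDIR", DEFAULT_TMPDIR)
    # tempfile skips a TMPDIR that does not exist, so create it before first use.
    os.makedirs(path, exist_ok=True)
    return path


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

A startup check along the lines of `has_free_space(ensure_staging_dir(), 2 * 1024**3)` would flag a staging directory that accidentally landed on a small tmpfs; the 2 GiB threshold is illustrative only.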

Key files

| File | Purpose |
| --- | --- |
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
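
The "download if missing" step that runs before the server starts can be pictured roughly as follows. This is an illustrative sketch, not the contents of `ensure_blla_model.py`; `ensure_model` and the `download` callable are hypothetical stand-ins for whatever fetch logic the real script uses.

```python
import os


def ensure_model(path, download):
    """Call download(path) only when no non-empty file exists at `path`.

    Returns True if a download was triggered, False if the model was already present.
    """
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return False
    # Make sure the target directory exists before the download writes into it.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    download(path)
    return True
```

Treating an empty file as missing guards against a previous download that was interrupted before any bytes were written.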

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md
