familienarchiv/ocr-service/CLAUDE.md
Marcel cfd49ff69e
docs(ocr): document TMPDIR convention and add ADR-021
- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows
  to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder — TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision,
  trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-18 10:58:10 +02:00


OCR Service

→ See ocr-service/README.md for tech stack, architecture, endpoints, environment variables, local development, testing, and training.

LLM reminder: the OCR service is a single-node container — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.

LLM reminder: ALLOWED_PDF_HOSTS must never be set to * — that opens SSRF. The default (minio,localhost,127.0.0.1) is correct for dev.
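A minimal sketch of how such an allowlist could be enforced, assuming the variable holds a comma-separated host list as the default suggests. The function names `load_allowed_hosts` and `is_allowed_pdf_url` are hypothetical and not taken from the service code; the actual implementation may differ.

```python
import os
from typing import FrozenSet, Optional
from urllib.parse import urlparse

# Default taken from the reminder above.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def load_allowed_hosts(raw: Optional[str] = None) -> FrozenSet[str]:
    """Parse a comma-separated host list and refuse wildcard entries.

    Hypothetical helper: reads ALLOWED_PDF_HOSTS when no value is passed.
    """
    if raw is None:
        raw = os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    hosts = frozenset(h.strip().lower() for h in raw.split(",") if h.strip())
    if not hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must list at least one host")
    if any("*" in h for h in hosts):
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*' (SSRF risk)")
    return hosts


def is_allowed_pdf_url(url: str, hosts: FrozenSet[str]) -> bool:
    """Accept only http(s) URLs whose hostname is an exact allowlist match."""
    parsed = urlparse(url)
    hostname = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and hostname in hosts
```

Exact hostname matching is used on purpose: any pattern matching reintroduces the risk that the wildcard ban is meant to prevent.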

LLM reminder: TMPDIR points to /app/cache/.tmp (persistent SSD volume). Never redirect it back to /tmp or any RAM-backed path: /tmp is limited to 512 MB and cannot stage GB-scale Surya model downloads, so they fail with ENOSPC. The ocr-volume-init container creates the directory on fresh volumes; entrypoint.sh ensures it exists as a fallback. See ADR-021.
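The TMPDIR rule can be expressed as a stdlib-only guard. This is a sketch under assumptions: `check_tmpdir` is a hypothetical name, and `/app/cache` is assumed to be the mount point of the cache volume (inferred from the TMPDIR path above, not confirmed by the source). It is not the content of the repository's test_tmpdir.py.

```python
import posixpath
from pathlib import PurePosixPath

# Assumption: the persistent cache volume is mounted at /app/cache.
CACHE_VOLUME = PurePosixPath("/app/cache")


def check_tmpdir(value: str, volume: PurePosixPath = CACHE_VOLUME) -> PurePosixPath:
    """Return the normalized TMPDIR if it lies on the cache volume, else raise."""
    if not value:
        raise ValueError("TMPDIR is unset; expected a path under %s" % volume)
    # Collapse '..' segments so '/app/cache/../../tmp' is judged as '/tmp'.
    path = PurePosixPath(posixpath.normpath(value))
    if path != volume and volume not in path.parents:
        raise ValueError("TMPDIR=%s is not on the cache volume %s" % (path, volume))
    return path
```

A service could call `check_tmpdir(os.environ.get("TMPDIR", ""))` at startup to fail fast instead of hitting ENOSPC midway through a model download.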