- ocr-service/README.md: add HF_HOME, XDG_CACHE_HOME, TORCH_HOME, TMPDIR rows to the environment variables table
- ocr-service/CLAUDE.md: LLM reminder: TMPDIR must stay on the cache volume
- docs/adr/021-tmpdir-persistent-volume-staging.md: records the decision, trade-offs, and rejected alternatives (Approach B / C) for issue #614
- ci.yml: add test_tmpdir.py to the OCR CI run (stdlib-only tests, no ML stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR Service
→ See ocr-service/README.md for tech stack, architecture, endpoints, environment variables, local development, testing, and training.
LLM reminder: the OCR service is a single-node container — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.
LLM reminder: ALLOWED_PDF_HOSTS must never be set to * — that opens SSRF. The default (minio,localhost,127.0.0.1) is correct for dev.
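As an illustration of the rule above, here is a minimal sketch of how an allowlist like this could be parsed and enforced. The function names `load_allowed_hosts` and `is_allowed_pdf_url` are hypothetical and are not taken from the service code; only the variable name `ALLOWED_PDF_HOSTS` and its default value come from this file.

```python
import os
from urllib.parse import urlparse

# Default taken from the reminder above; everything else here is illustrative.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def load_allowed_hosts(raw=None):
    """Parse a comma-separated host list and refuse a wildcard entry."""
    if raw is None:
        raw = os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        # A wildcard would let callers make the service fetch any URL (SSRF).
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return hosts


def is_allowed_pdf_url(url, allowed_hosts):
    """Return True only if the URL's hostname is an exact allowlist match."""
    host = urlparse(url).hostname
    return host is not None and host.lower() in allowed_hosts
```

Failing at startup on `*` rather than silently ignoring it is a design choice in this sketch: a misconfiguration is surfaced immediately instead of at the first request.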
LLM reminder: TMPDIR points to /app/cache/.tmp (persistent SSD volume). Never redirect it back to /tmp or any RAM-backed path — /tmp is 512 MB and cannot stage GB-scale Surya model downloads (causes ENOSPC). The ocr-volume-init container creates the directory on fresh volumes; entrypoint.sh ensures it exists as a fallback. See ADR-021.
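A stdlib-only guard for the TMPDIR rule could look like the sketch below. It is an assumption about how such a check might be written, not the contents of test_tmpdir.py or entrypoint.sh. The helper name `tmpdir_is_on_cache_volume` is hypothetical, and `/app/cache` as the volume root is inferred from the `/app/cache/.tmp` path above.

```python
import os
from pathlib import Path

# Assumed cache volume root, inferred from TMPDIR=/app/cache/.tmp.
CACHE_ROOT = Path("/app/cache")


def tmpdir_is_on_cache_volume(tmpdir, cache_root=CACHE_ROOT):
    """Return True if tmpdir is the cache root or a directory beneath it.

    This is a lexical path check only. It does not verify that the
    directory exists or that the cache root is a persistent mount.
    """
    if not tmpdir:
        return False
    # normpath collapses ".." segments without touching the filesystem,
    # so "/app/cache/../tmp" is correctly treated as "/app/tmp".
    candidate = Path(os.path.normpath(tmpdir))
    root = Path(os.path.normpath(cache_root))
    return candidate == root or root in candidate.parents
```

A startup check might call `tmpdir_is_on_cache_volume(os.environ.get("TMPDIR"))` and abort when it returns False, so a regression back to `/tmp` is caught before a model download starts.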