Marcel 86c13a230c docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7
Processes all 7 CLAUDE.md files according to the 3-bucket classification.
Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md,
domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md
New `scripts/README.md` with full script documentation (preserving the
⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md`
reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md
New `.devcontainer/README.md` with all configuration, usage, and limitations.
`.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md
New `docs/README.md` covering the folder structure, ADR guide, infrastructure
docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md
Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6).
Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md
- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md
- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client,
  Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md
- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 07:41:02 +02:00

ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

  • Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
  • Baseline layout analysis: Kraken BLLA model
  • Sender recognition: trained per-archive sender models
  • HTTP API on port 8000 (reachable only on the internal Docker network; no port is published externally)

What this service does NOT own

  • Job lifecycle — tracked in the backend's ocr/ domain
  • MinIO storage — the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
  • Transcription block storage — results are streamed back to the backend, which writes them to PostgreSQL

API endpoints

| Endpoint | Auth | Purpose |
|---|---|---|
| `POST /ocr` | None (internal network only) | Run OCR on a PDF (presigned MinIO URL in request body) |
| `POST /train` | `X-Training-Token` header | Trigger sender-model training |
| `POST /segtrain` | `X-Training-Token` header | Trigger segmentation training |
| `GET /health` | None | Health check |
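
The token guard on /train and /segtrain could be sketched roughly as below. This is a standalone illustration using only the standard library, not the code in main.py: the function names are hypothetical, and the choice to reject every request when TRAINING_TOKEN is unset or empty is an assumption of this sketch that may differ from the service's actual behavior.

```python
import hmac
import os
from typing import Mapping, Optional


def training_token_is_valid(presented: Optional[str], expected: Optional[str]) -> bool:
    """Return True only if a non-empty expected token matches the presented one."""
    # Assumption of this sketch: an unset or empty token authorizes nobody.
    if not expected or not presented:
        return False
    # compare_digest avoids leaking how many leading characters matched via timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


def authorize_training_request(headers: Mapping[str, str]) -> bool:
    """Check the X-Training-Token header against the TRAINING_TOKEN env var."""
    expected = os.environ.get("TRAINING_TOKEN", "")
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return training_token_is_valid(normalized.get("x-training-token"), expected)
```

In a FastAPI app, a check like this would typically live in a dependency attached to the two training routes and respond with 401 or 403 on failure.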

Environment variables

| Variable | Default | Required? | Sensitive? | Purpose |
|---|---|---|---|---|
| `TRAINING_TOKEN` | | YES (prod) | YES | Guards `/train` and `/segtrain`. Do not leave empty in production. |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | YES | | SSRF protection: comma-separated allowed PDF source hosts. Never set to `*`. |
| `KRAKEN_MODEL_PATH` | `/app/models/` | | | Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`) |
| `BLLA_MODEL_PATH` | `/app/models/blla.mlmodel` | | | Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing. |
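
To illustrate how a comma-separated ALLOWED_PDF_HOSTS value can protect against SSRF, here is a minimal sketch of an allow-list check. The helper names are hypothetical and the real validation in main.py may work differently. The sketch assumes that only http and https URLs are acceptable and that hosts are compared exactly, with no wildcard or suffix matching.

```python
import os
from urllib.parse import urlsplit

# Default taken from the environment variable table.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str) -> frozenset:
    """Split a comma-separated host list, ignoring blanks and surrounding spaces."""
    hosts = {host.strip().lower() for host in raw.split(",") if host.strip()}
    # A wildcard would defeat the allow-list entirely, so refuse it outright.
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return frozenset(hosts)


def is_allowed_pdf_url(url: str, allowed_hosts: frozenset) -> bool:
    """Return True if the URL uses http(s) and its host is on the allow-list."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lowercased and excludes port and userinfo, so a URL such as
    # http://minio@evil.example/ is judged by its real host, evil.example.
    host = parts.hostname
    return host is not None and host in allowed_hosts


def allowed_hosts_from_env() -> frozenset:
    """Read ALLOWED_PDF_HOSTS, falling back to the documented default."""
    return parse_allowed_hosts(os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS))
```

A check like this only covers the URL as submitted. If the HTTP client follows redirects, the redirect target needs the same validation.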

Key files

| File | Purpose |
|---|---|
| `main.py` | FastAPI app, endpoint definitions, SSRF validation |
| `engines/` | Surya and Kraken engine wrappers |
| `models.py` | Pydantic request/response models |
| `preprocessing.py` | PDF-to-image conversion before OCR |
| `confidence.py` | Per-block confidence scoring |
| `spell_check.py` | Post-OCR spell correction using historical dictionaries |
| `ensure_blla_model.py` | Startup script that downloads the BLLA model if missing |
| `entrypoint.sh` | Docker entrypoint: runs `ensure_blla_model.py`, then starts the server |
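
The "download if missing" step that runs before the server starts might look roughly like the sketch below. It is not the contents of ensure_blla_model.py. The function name is hypothetical, and the download itself is passed in as a callable because the real model source is not documented here. Writing to a temporary ".part" file first is a design choice of this sketch, so that an interrupted download is not mistaken for a complete model on the next start.

```python
from pathlib import Path
from typing import Callable


def ensure_model(model_path: Path, download: Callable[[Path], None]) -> bool:
    """Make sure model_path exists; return True if a download was performed."""
    # A non-empty file is treated as a usable model and left untouched.
    if model_path.is_file() and model_path.stat().st_size > 0:
        return False

    model_path.parent.mkdir(parents=True, exist_ok=True)

    # Download to a sibling temp file, then rename it into place in one step.
    partial = model_path.with_suffix(model_path.suffix + ".part")
    download(partial)
    partial.replace(model_path)
    return True
```

An entrypoint would call this with the path from BLLA_MODEL_PATH and only then start the server.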

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md