Files

Marcel 86c13a230c docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7

Processes all 7 CLAUDE.md files according to the 3-bucket classification.
Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md,
domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md
New `scripts/README.md` with full script documentation (preserving the
⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md`
reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md
New `.devcontainer/README.md` with all configuration, usage, and limitations.
`.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md
New `docs/README.md` covering the folder structure, ADR guide, infrastructure
docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md
Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6).
Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md
- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md
- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client,
  Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md
- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-06 07:41:02 +02:00

__pycache__

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

.venv

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

dictionaries

feat(ocr): add DTA-derived historical German wordlist and generation script

2026-04-17 16:48:26 +02:00

engines

refactor(document): move document domain core to document/ package

2026-05-05 12:39:20 +02:00

.dockerignore

fix(docker): soften ocr-service dependency and clean up compose

2026-04-13 12:29:21 +02:00

CLAUDE.md

docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7

2026-05-06 07:41:02 +02:00

confidence.py

refactor(ocr): make collapse_adjacent_markers a public function

2026-04-17 17:20:31 +02:00

Dockerfile

chore(ocr): add opencv-python-headless, libglib2.0-0, and CLAHE env vars

2026-04-17 14:14:47 +02:00

ensure_blla_model.py

fix(ocr): narrow exception handling and add unit tests for ensure_blla_model

2026-04-14 21:17:53 +02:00

entrypoint.sh

fix(ocr-service): add entrypoint that validates blla model format on startup

2026-04-14 21:17:53 +02:00

main.py

feat(ocr): per-sender model registry and /train-sender endpoint

2026-04-17 18:05:39 +02:00

models.py

feat(ocr): per-sender model registry and /train-sender endpoint

2026-04-17 18:05:39 +02:00

preprocessing.py

test(ocr): add resilience tests for tiny image and unexpected exception propagation

2026-04-17 15:16:17 +02:00

README.md

docs(legibility): add 18 per-domain README.md files (DOC-6)

2026-05-06 07:36:38 +02:00

requirements.txt

feat(ocr): add pyspellchecker dependency

2026-04-17 16:41:24 +02:00

spell_check.py

refactor(ocr): document > 50 frequency threshold rationale

2026-04-17 17:21:37 +02:00

test_confidence.py

feat(ocr): per-script-type confidence thresholds

2026-04-12 20:50:59 +02:00

test_engines.py

fix(ocr): guard Kraken block extraction against missing boundary/baseline

2026-04-23 09:33:03 +02:00

test_ensure_blla_model.py

fix(ocr): narrow exception handling and add unit tests for ensure_blla_model

2026-04-14 21:17:53 +02:00

test_preprocessing.py

test(ocr): add resilience tests for tiny image and unexpected exception propagation

2026-04-17 15:16:17 +02:00

test_sender_registry.py

refactor(ocr): mark _SenderModelRegistry.contains as private (_contains)

2026-04-17 21:26:46 +02:00

test_spell_check.py

test(ocr): decouple correction tests from exact library dictionary state

2026-04-17 17:23:09 +02:00

test_stream.py

feat(ocr): integrate preprocessing into stream and batch endpoints

2026-04-17 14:16:47 +02:00

test_training_auth.py

test(ocr): add /train-sender auth tests and run sender registry tests in CI

2026-04-17 21:14:27 +02:00

README.md

ocr-service

Python FastAPI microservice that performs the actual handwritten text recognition (HTR) and OCR. The Spring Boot backend orchestrates jobs; this service executes them.

What this service owns

- Text recognition: Surya (typewritten text) and Kraken (Kurrent/Sütterlin historical handwriting)
- Baseline layout analysis: Kraken BLLA model
- Sender recognition: trained per-archive sender models
- HTTP API on port 8000 (internal Docker network only; no external port)

What this service does NOT own

- Job lifecycle: tracked in the backend's ocr/ domain
- MinIO storage: the service fetches PDFs via presigned URLs generated by the backend; it does not hold credentials
- Transcription block storage: results are streamed back to the backend, which writes them to PostgreSQL

API endpoints

Endpoint	Auth	Purpose
`POST /ocr`	None (internal network only)	Run OCR on a PDF (presigned MinIO URL in request body)
`POST /train`	`X-Training-Token` header	Trigger sender-model training
`POST /segtrain`	`X-Training-Token` header	Trigger segmentation training
`GET /health`	None	Health check
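
The token-guarded endpoints in the table above could be protected by a check along the following lines. This is a minimal sketch, not the code in `main.py`: the function name `training_request_allowed` is hypothetical, and failing closed when no token is configured is an assumption of this sketch that the service itself may not share.

```python
import hmac
from typing import Mapping, Optional


def training_request_allowed(headers: Mapping[str, str],
                             expected_token: Optional[str]) -> bool:
    """Return True only when X-Training-Token matches the configured token.

    Hypothetical helper for illustration; not taken from main.py.
    """
    # Assumption: fail closed, so an unset or empty token authorizes nothing.
    if not expected_token:
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-training-token", "")
    # compare_digest keeps the comparison time independent of where
    # the two values first differ.
    return hmac.compare_digest(supplied.encode("utf-8"),
                               expected_token.encode("utf-8"))
```

In a FastAPI app a check like this would typically live in a dependency attached to the training routes, with the expected value read from `TRAINING_TOKEN` at startup.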

Environment variables

Variable	Default	Required?	Sensitive?	Purpose
`TRAINING_TOKEN`	—	YES (prod)	YES	Guards `/train` and `/segtrain`. Do not leave empty in production.
`ALLOWED_PDF_HOSTS`	`minio,localhost,127.0.0.1`	YES	—	SSRF protection — comma-separated allowed PDF source hosts. Never set to `*`.
`KRAKEN_MODEL_PATH`	`/app/models/`	—	—	Directory where Kraken HTR models are stored (populated by `download-kraken-models.sh`)
`BLLA_MODEL_PATH`	`/app/models/blla.mlmodel`	—	—	Kraken baseline layout analysis model. Auto-downloaded via `ensure_blla_model.py` on startup if missing.
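
To illustrate how a host allow-list like `ALLOWED_PDF_HOSTS` can guard against SSRF, here is a minimal sketch using only the standard library. The helper names `load_allowed_hosts` and `is_pdf_url_allowed` are hypothetical, and the real validation in `main.py` may differ (for example in how it treats ports, redirects, or schemes).

```python
import os
from typing import FrozenSet, Optional
from urllib.parse import urlsplit

# Default taken from the environment variable table above.
DEFAULT_ALLOWED_HOSTS = "minio,localhost,127.0.0.1"


def load_allowed_hosts(raw: Optional[str] = None) -> FrozenSet[str]:
    """Parse the comma-separated allow-list (hypothetical helper)."""
    if raw is None:
        raw = os.environ.get("ALLOWED_PDF_HOSTS", DEFAULT_ALLOWED_HOSTS)
    hosts = frozenset(h.strip().lower() for h in raw.split(",") if h.strip())
    # Reject the wildcard outright rather than silently allowing every host.
    if "*" in hosts:
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return hosts


def is_pdf_url_allowed(url: str, allowed_hosts: FrozenSet[str]) -> bool:
    """Return True only for http(s) URLs whose host is on the allow-list."""
    try:
        parts = urlsplit(url)
        # .hostname is lowercased and excludes port and userinfo, so
        # "http://minio@evil.example/" resolves to "evil.example".
        host = parts.hostname
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme not in ("http", "https"):
        return False
    return host is not None and host in allowed_hosts
```

An exact-match check on the parsed hostname, rather than a substring or prefix test on the raw URL, is what keeps lookalike URLs such as `http://minio@evil.example/` from passing.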

Key files

File	Purpose
`main.py`	FastAPI app, endpoint definitions, SSRF validation
`engines/`	Surya and Kraken engine wrappers
`models.py`	Pydantic request/response models
`preprocessing.py`	PDF-to-image conversion before OCR
`confidence.py`	Per-block confidence scoring
`spell_check.py`	Post-OCR spell correction using historical dictionaries
`ensure_blla_model.py`	Startup script that downloads the BLLA model if missing
`entrypoint.sh`	Docker entrypoint — runs `ensure_blla_model.py` then starts the server

Backend counterpart

backend/src/main/java/org/raddatz/familienarchiv/ocr/README.md