audit(ocr-service): score ocr-service/ against legibility rubric C1–C10 #390

Closed
opened 2026-05-04 16:04:10 +02:00 by marcel · 1 comment
Owner

Context

Part of Epic #387 — Codebase Legibility Audit. This is AUDIT-3: read-only assessment of ocr-service/ against the full Legibility Rubric. The OCR service is domain-bearing (it embeds historical script knowledge, confidence calibration, Kurrent/Sütterlin model lifecycle) so it earns its own audit rather than folding into rest-of-repo.

The output is a Markdown report under docs/audits/audit-ocr-report.md following the per-subsystem orient template. No code changes in this issue.

Inputs (read first)

See the first comment of #387 for the complete reference bundle.

Note: the OCR service is a Python FastAPI microservice. Some rubric checks (e.g., C3.4 about Java package structure) are N/A here. C3.5 is also N/A. Apply C5/C6/C7 to the Python module structure on its own terms — the canonical "domain-based packaging" goal applies to the backend Java code, NOT to the internal organization of ocr-service/. The OCR service IS one domain (ocr/) when viewed from the outside; its internal organization is a local concern.

Scope (in)

  • Everything under ocr-service/ — main.py, engines/, supporting modules, requirements, tests
  • ocr-service/Dockerfile, ocr-service/CLAUDE.md (evaluate as documentation surface)
  • The contract between OCR service and backend: HTTP endpoints, request/response shapes, error semantics

Scope (out)

  • ocr-service/.venv/, ocr-service/__pycache__/, ocr-service/engines/__pycache__/
  • Backend's OcrController/OcrService/OcrClient (AUDIT-2 covers those — but note the contract clarity at the boundary in §3)
  • Model files / training data

Rubric checks particularly relevant to this subsystem

C1, C2 (Python prerequisites + bootstrap), C3.1, C3.3 (terminology — Kurrent vs Sütterlin vs HTR vs OCR), C5 (Python conventions), C7 (especially C7.4 on API contract documentation), C8 (test coverage of OCR pipeline), C9.1 + C9.2 (deployment + env vars — single-node constraint per ADR-001 in ocr-service/CLAUDE.md), C10.

Specific things to verify (subsystem-specific)

  • Bootstrap. Can a stranger run ocr-service/ standalone (without backend) for development? Is this documented?
  • Engine pluggability. The Surya/Kraken split is non-obvious. Is the choice between engines documented?
  • Single-node ADR. ocr-service/CLAUDE.md references ADR-001 (single-node only because in-process model reload). This MUST appear in human-readable docs after refactor (C5.3).
  • Contract clarity. Does the OCR service know what backend expects? Is there a shared schema?

Acceptance criteria

  • File docs/audits/audit-ocr-report.md exists on a feature branch
  • Report contains all 10 template sections
  • §3 scorecard explicitly marks Java-specific checks as N/A with reason
  • §4 assigns 🟢 / 🟡 / 🔴
  • §8 explicitly recommends migrating ADR-001 (single-node) into human-readable docs (CONTRIBUTING.md or docs/ARCHITECTURE.md)
  • §10 lists top-5 recommendations
  • No prose contains "ask Marcel," "Claude generated," "TODO," or "non-obvious" without explanation
  • PR opened and merged before this issue is closed

Definition of Done

Report committed under docs/audits/audit-ocr-report.md on main. Top-5 recommendations summarized as a closing comment.

Dispatch

Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit ocr-service/ per the orient template and produce docs/audits/audit-ocr-report.md. Read-only; no code changes.

marcel added this to the Codebase Legibility milestone 2026-05-04 16:04:10 +02:00
marcel added the P1-high, audit, legibility labels 2026-05-04 16:05:42 +02:00
Author
Owner

AUDIT-3 — ocr-service/ Legibility Report

1. Subsystem profile

ocr-service/ is a Python 3.11 / FastAPI microservice that performs OCR (typewritten Latin) and HTR (handwritten Kurrent/Sütterlin) on PDFs delivered via presigned MinIO URLs. It owns: model lifecycle (Kraken + Surya), CLAHE-based image preprocessing, per-character → per-word confidence reconstruction, German spell-check post-processing with a historical wordlist, and in-process model fine-tuning (/train, /train-sender, /segtrain). It is stateless w.r.t. business data — all job orchestration lives in the Spring Boot backend. Single-replica is mandatory because training mutates in-process model state.

ocr-service/
├── main.py                     # FastAPI app + 6 endpoints + training pipeline
├── models.py                   # Pydantic OcrRequest / OcrRegion / OcrBlock
├── preprocessing.py            # CLAHE + grayscale + blur
├── confidence.py               # words_from_characters + [unleserlich] markers
├── spell_check.py              # pyspellchecker + de_historical wordlist
├── ensure_blla_model.py        # startup helper: download/validate blla
├── engines/
│   ├── kraken.py               # Kurrent HTR + per-sender LRU registry
│   └── surya.py                # typewritten / Latin handwriting
├── dictionaries/de_historical.txt
├── Dockerfile, entrypoint.sh, requirements.txt, .dockerignore
├── CLAUDE.md                   # current docs surface
└── test_*.py                   # 7 test modules

Total ~2,717 LOC across 16 Python files (engines + tests included).
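
The per-character → per-word confidence reconstruction named above can be sketched as follows. This is a minimal illustration, not the actual confidence.py implementation: the names words_from_characters and [unleserlich] come from this report, while the grouping rule (split on whitespace, word confidence = minimum of its characters) is an assumption.

```python
# Illustrative sketch of per-character -> per-word confidence
# reconstruction. Function/marker names are from the report; the
# aggregation rule (min over characters) is an assumption.

UNREADABLE_MARKER = "[unleserlich]"

def words_from_characters(chars):
    """Group (char, confidence) pairs into words; a word's confidence
    is the minimum of its characters' confidences."""
    words, current, confs = [], [], []
    for ch, conf in chars:
        if ch.isspace():
            if current:
                words.append(("".join(current), min(confs)))
                current, confs = [], []
        else:
            current.append(ch)
            confs.append(conf)
    if current:
        words.append(("".join(current), min(confs)))
    return words

def apply_confidence_markers(chars, threshold=0.5):
    """Replace words whose confidence falls below the threshold.
    Note: collapses runs of whitespace to single spaces (sketch only)."""
    return " ".join(
        word if conf >= threshold else UNREADABLE_MARKER
        for word, conf in words_from_characters(chars)
    )
```

For example, `apply_confidence_markers([("H", 0.9), ("i", 0.9), (" ", 1.0), ("x", 0.1)])` yields `"Hi [unleserlich]"`.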

2. Inventory

Application modules

| Module | LOC | Responsibility |
|---|---|---|
| `main.py` | 660 | FastAPI app, 6 HTTP endpoints, ZIP-Slip-safe extraction, ketos subprocess driver, training-token guard, SSRF host allowlist |
| `models.py` | 35 | Pydantic models — `OcrRequest`, `OcrRegion`, `OcrBlock` |
| `preprocessing.py` | 50 | `preprocess_page()` — CLAHE + grayscale + Gaussian blur, falls back silently on cv2/value/memory errors |
| `confidence.py` | 100 | `apply_confidence_markers()`, `words_from_characters()`, `get_threshold()`, `[unleserlich]` / `[?]` constants |
| `spell_check.py` | 117 | German spell-check with historical wordlist; collapses adjacent markers |
| `ensure_blla_model.py` | 80 | Container-startup helper to download/validate the blla segmentation model |
| `engines/kraken.py` | 339 | Kraken engine wrapper, LRU per-sender model cache, pure-Python convex hull + min-bounding-rect |
| `engines/surya.py` | 131 | Surya wrapper (lazy-loaded) |

HTTP endpoints (actual)

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | 503 until lifespan finishes; otherwise `{status, surya, kraken}` |
| `/ocr` | POST | Sync OCR → `list[OcrBlock]` |
| `/ocr/stream` | POST | NDJSON stream — supports guided regions and per-sender models |
| `/train` | POST | Fine-tune the global Kurrent model |
| `/train-sender` | POST | Fine-tune a per-sender Kurrent model to a target path under `/app/models/` |
| `/segtrain` | POST | Fine-tune the blla segmentation model |
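
Because the request/response shapes are hand-mirrored between the Python models and the Java client (see §9), even a stdlib sketch of the payload shape is useful documentation. The field names below are the ones listed later in this report; the nesting, types, and example values are assumptions, not the actual models.py schema.

```python
import json

# Hypothetical /ocr request payload. Field names (pdfUrl, scriptType,
# senderModelPath, regions[].annotationId, pageNumber, x, y, width,
# height) are taken from section 9 of this report; types and nesting
# are illustrative assumptions.
request = {
    "pdfUrl": "http://minio:9000/archive/letter.pdf?X-Amz-Signature=example",
    "scriptType": "HANDWRITING_KURRENT",
    "senderModelPath": None,
    "regions": [
        {
            "annotationId": 42,
            "pageNumber": 1,
            "x": 100, "y": 200, "width": 800, "height": 120,
        }
    ],
}

payload = json.dumps(request)

# The Java OcrClient mirrors these names by string convention only, so
# a round-trip key check is the weakest useful contract guard:
assert set(json.loads(payload)) == {"pdfUrl", "scriptType", "senderModelPath", "regions"}
```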

Test modules

| Test file | Coverage | External ML deps required? |
|---|---|---|
| `test_confidence.py` | confidence.py — 22 cases | No |
| `test_spell_check.py` | spell_check.py — 16 cases | pyspellchecker only |
| `test_preprocessing.py` | preprocessing.py — 5 cases incl. cv2 fallback | numpy + cv2 only |
| `test_sender_registry.py` | LRU cache logic | Mocks Kraken loader |
| `test_engines.py` | Mocked Surya/Kraken record extraction + signature parity | Heavy mocking via `sys.modules` |
| `test_stream.py` | `/ocr/stream` end-to-end via ASGITransport | Mocks engines + PDF download |
| `test_training_auth.py` | Token-guard fail-closed semantics | Mocks |
| `test_ensure_blla_model.py` | bootstrap helper | Mocks subprocess |

There are no tests that exercise a real model on a real image. Every "engine" test uses MagicMock. End-to-end correctness against a fixture image is not verified anywhere in the repo.
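
"Heavy mocking via sys.modules" typically means stubbing the ML packages in sys.modules before the module under test imports them, so the test never loads torch or model weights. A minimal sketch of the pattern — the module name `surya` matches the real dependency, but `run_ocr` and its return shape are illustrative, not the actual Surya API:

```python
import sys
import types
from unittest import mock

# Stub a heavyweight dependency before anything imports it. "run_ocr"
# and the record shape are illustrative names, not the real Surya API.
fake_surya = types.ModuleType("surya")
fake_surya.run_ocr = mock.MagicMock(
    return_value=[{"text": "Brief", "confidence": 0.97}]
)

with mock.patch.dict(sys.modules, {"surya": fake_surya}):
    # Inside this block, any `import surya` resolves to the stub, so an
    # engine wrapper imported here would call the mock instead of torch.
    import surya
    records = surya.run_ocr("page.png")
```

This verifies wiring (arguments passed, records unpacked) but, as noted above, says nothing about OCR correctness.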

3. Rubric scorecard

| Check | Result | Severity | Evidence | Recommendation |
|---|---|---|---|---|
| C1.1 Root README purpose | N/A here | — | Root has no `README.md` (only `frontend/README.md`); this is a repo-wide concern covered by AUDIT-1 | — |
| C1.2 README lists 5 components | N/A here | — | Same — repo-level | — |
| C1.3 `docs/ARCHITECTURE.md` w/ diagram | FAIL (referrals exist) | Major | `docs/architecture/c4-diagrams.md` exists; no top-level ARCHITECTURE.md links into ocr-service | Add an OCR section to `docs/ARCHITECTURE.md` summarizing the request flow backend→ocr-service→MinIO |
| C2.1 Single command brings stack up | PASS | — | `docker-compose up -d` brings `ocr-service` together with backend (`docker-compose.yml:78–110`) | — |
| C2.2 Prerequisites with versions | PASS | — | `ocr-service/CLAUDE.md:9–17` lists Python 3.11, FastAPI 0.115.6, kraken==7.0, surya-ocr 0.17.1, torch 2.7.1 (CPU); `requirements.txt` pins | — |
| C2.3 Default credentials documented | N/A | — | Service has no auth besides `TRAINING_TOKEN`; the env table says default is empty (fails closed) | — |
| C2.4 Common failures with fixes | FAIL | Minor | No troubleshooting section. blla 800px-vs-1800px memory fight is documented inline at `ocr-service/main.py:578–602` only | Add a "known issues" subsection covering OOM at 1800px segtrain, the `.incompatible` rename behavior, and 200-DPI vs 300-DPI memory tradeoff |
| C2.5 Per-subsystem README | FAIL | Major | Only `ocr-service/CLAUDE.md` exists; this is an LLM-prefixed file that `COLLABORATING.md` treats as ephemeral | Promote `ocr-service/CLAUDE.md` to `ocr-service/README.md` (or split: README for humans, CLAUDE.md only for agent-specific notes) |
| C3.1 Domains stated in docs | PASS | — | `ocr-service/CLAUDE.md:1–18` cleanly states OCR vs HTR boundary | — |
| C3.2 Per-domain definition | PASS | — | Surya = typewritten + modern Latin handwriting; Kraken = Kurrent/Sütterlin. Docs state which is for what | — |
| C3.3 Glossary disambiguates terms | PARTIAL FAIL | Major | OCR vs HTR is contrasted, but Kurrent vs Sütterlin is conflated (CLAUDE.md uses "Kurrent" only; ADR-001 uses both). Spell-check covers `HANDWRITING_KURRENT` and `HANDWRITING_LATIN` (`main.py:43`) but only `HANDWRITING_KURRENT` triggers Kraken (`main.py:100`) — that asymmetry is undocumented | Add a glossary block: Kurrent (broad German cursive family pre-1941), Sütterlin (1915–1941 specific variant), HTR vs OCR, the four `scriptType` values and which engine each uses, and why `HANDWRITING_LATIN` uses Surya despite being handwriting |
| C3.4 Backend domain packaging | N/A | — | Python service — Java rule does not apply | — |
| C3.5 Frontend lib domain match | N/A | — | Python service has no frontend | — |
| C3.6 `shared/` justified | N/A | — | No `shared/` concept inside ocr-service | — |
| C4.1 Find files for a domain | PASS | — | Tree is small and flat; `engines/` cleanly separates the two engines | — |
| C4.2 Find feature end-to-end | PASS | — | Endpoint → handler → engine wrapper is one hop in `main.py` | — |
| C4.3 No `Helper`/`Utils` names | PASS | — | All filenames concept-named (`preprocessing`, `confidence`, `spell_check`, `ensure_blla_model`) | — |
| C4.4 Routes predictable | PASS | — | `/ocr`, `/ocr/stream`, `/train`, `/train-sender`, `/segtrain`, `/health` — all guessable | — |
| C5.1 How-to add an endpoint/etc. | FAIL | Major | No conventions doc for ocr-service contributors | Add a short "Conventions" section to README covering: how to add a new engine, how to add a new env var, how to add a new endpoint, where to put tests |
| C5.2 One way to do common things | PARTIAL PASS | — | `HTTPException` everywhere; one error idiom; but engine-loading is split (Surya lazy in `surya.py:12`, Kraken eager via lifespan in `main.py:60`) — undocumented asymmetry | Add a note explaining why Kraken is eager (model exists or it doesn't) and Surya is lazy (saves ~5GB at idle) |
| C5.3 CLAUDE.md rules mirrored in human docs | FAIL | Major | The single-replica constraint is in `ocr-service/CLAUDE.md:21` and inline in `docker-compose.yml:75–77`, but the ADR (`docs/adr/001-ocr-python-microservice.md`) does NOT state it as a constraint — it only mentions "model reloading" while rejecting `ProcessBuilder` | Add an explicit "Operational constraint: single-node only" section to ADR-001 (or new ADR-008) and to `docs/ARCHITECTURE.md`. See §8 |
| C5.4 Naming conventions stated | FAIL | Minor | Naming is consistent (snake_case, descriptive) but the convention is not stated anywhere | Add a one-paragraph naming convention to README |
| C6.1 No controller→repo (Java) | N/A | — | Python | — |
| C6.2 No cross-domain repo (Java) | N/A | — | Python | — |
| C6.3 Frontend cross-domain | N/A | — | No frontend | — |
| C6.4 Cross-cutting in `shared/` | PASS | — | Cross-cutting concerns (logging, env reading) live where used; no duplication | — |
| C7.1 Per-domain README | FAIL | Major | `CLAUDE.md` is the only doc surface; should be a `README.md` (see C2.5) | Promote / split |
| C7.2 Why-comments on non-obvious blocks | PASS | — | `main.py:578–602` (segtrain 800px logic), `main.py:654–656` (200 DPI choice), `kraken.py:198–200` (Kraken bounds quirk), `confidence.py:110` strict-`>` rationale all carry strong why-comments | — |
| C7.3 No "ask Marcel" / "non-obvious" / "Claude generated" | PASS | — | grep clean. Single use of "non-obvious" in `confidence.py` neighborhood is actually fine (not a TODO marker) | — |
| C7.4 API contracts documented | PARTIAL FAIL | Major | FastAPI auto-generates `/docs` and `/openapi.json` (default with FastAPI), but this is not mentioned anywhere. The endpoint table in `ocr-service/CLAUDE.md:70–75` lists `/training/submit` which does not exist; actual endpoints are `/train`, `/train-sender`, `/segtrain` (`main.py:358,450,530`). There is no shared schema between the Python `OcrRequest`/`OcrBlock` (`models.py`) and the Java `OcrClient.OcrRegion` (`backend/.../OcrClient.java:19`) — they are hand-mirrored | (1) Fix the stale endpoint table. (2) Document `GET /docs` for live OpenAPI. (3) Add a "Contract with backend" subsection naming `OcrClient` / `RestClientOcrClient` as the consumer and listing the field-by-field mapping |
| C7.5 DB schema commented | N/A | — | Service has no DB | — |
| C8.1 Mutation spot-check viable | PARTIAL PASS | — | `test_confidence.py`, `test_spell_check.py`, `test_sender_registry.py` are tight assertion-based unit tests that would catch mutations. `test_engines.py` and `test_stream.py` are heavily mocked — they verify wiring, not OCR correctness | See C8.2 |
| C8.2 e2e per critical journey | FAIL | Major | No test takes a real fixture image and asserts on extracted text. The OCR pipeline's correctness against actual handwriting/typewriting is unverified by the test suite — only contract shape is tested | Add at least one fixture-image test per engine (small Kurrent crop, small typewritten crop) marked `@pytest.mark.slow`, gated behind a `SKIP_ML_TESTS` env var |
| C8.3 Test names descriptive | PASS | — | Names like `test_kraken_extract_page_blocks_skips_records_without_positional_data` describe the assertion clearly | — |
| C8.4 Test ordering independent | PASS | — | Tests use `patch.object` and ASGI transport per test; no module-level mutable state pollution observed | — |
| C9.1 Deployment documented | PARTIAL PASS | — | `ocr-service/CLAUDE.md:114–127` covers compose; missing volume strategy (`ocr_models`, `ocr_cache`) at the docs/ level; no docs/DEPLOYMENT.md | Cross-reference from ADR-001 / future DEPLOYMENT.md to compose entries and to the volume layout |
| C9.2 Env vars in one place | PARTIAL FAIL | Major | `ocr-service/CLAUDE.md:79–90` documents 10 env vars, but `BLLA_MODEL_PATH` (used in `ensure_blla_model.py:24` and `main.py:552`) is missing. Also `ALLOWED_PDF_HOSTS` lives in CLAUDE.md but not in docker-compose.yml defaults | Add `BLLA_MODEL_PATH` row; align defaults between CLAUDE.md and docker-compose.yml |
| C9.3 Migrations naming | N/A | — | No DB | — |
| C9.4 Logs/observability | PARTIAL FAIL | Minor | Service uses stdlib logging at INFO; healthcheck exists. Where logs go in production (stdout → docker logs) and how to read them is not documented | One-line addition to README |
| C10.1 No dead code | PARTIAL FAIL | Minor | `call` (`from unittest.mock import ... call ...`) is imported but unused in `test_sender_registry.py:2` and `test_ensure_blla_model.py:3`. The `language` parameter is plumbed through both engines (`engines/kraken.py:103,220`, `engines/surya.py:37,118`) but never used by either Kraken or Surya — pure dead plumbing | Remove unused `call` imports; either use `language` (pass to predictors) or remove from signatures with a comment |
| C10.2 No premature abstractions | PASS | — | `_SenderModelRegistry` looks abstract but is justified by concurrent loads and explicit invalidation requirements (tested in `test_sender_registry.py`) | — |
| C10.3 No magic numbers | PARTIAL FAIL | Minor | 200 / 72 DPI scaling in `main.py:656`; `_MIN_SPELL_CHECK_LEN = 4` is named (good); `> 50` frequency floor in `spell_check.py:110` is justified inline (good); `"-N", "10"` epoch count repeated three times across `/train`, `/train-sender`, `/segtrain` (`main.py:405,499,576`) | Lift the ketos epoch / batch arguments into module-level constants (or env vars); they will drift across the three endpoints otherwise |
| C10.4 No half-finished features | PASS | — | No `if False`, no commented-out blocks in production code | — |
| C10.5 No defensive nulls | PASS | — | boundary / baseline defensive coercion in `kraken.py:128–148` is justified by mock + real-world record shape variance and is covered by an explicit test (`test_engines.py:187`) | — |
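
The C8.2 gap could be closed with an env-gated fixture test. A sketch using stdlib unittest — the scorecard suggests `@pytest.mark.slow`; `unittest.skipUnless` shown here is the equivalent stdlib mechanism, and the engine function name, its signature, and the fixture path are hypothetical:

```python
import os
import unittest

# Gate real-model tests behind an env var so CI without models/GPU skips
# them. SKIP_ML_TESTS is the name this report recommends; it defaults to
# skipping unless explicitly set to "0".
RUN_ML_TESTS = os.environ.get("SKIP_ML_TESTS", "1") != "1"

@unittest.skipUnless(RUN_ML_TESTS, "set SKIP_ML_TESTS=0 to run real-model tests")
class KrakenFixtureTest(unittest.TestCase):
    def test_kurrent_crop_extracts_known_text(self):
        # Real import deferred into the gated test so collection stays cheap.
        # extract_page_blocks and the fixture path are placeholder names.
        from engines import kraken
        blocks = kraken.extract_page_blocks("tests/fixtures/kurrent_crop.png")
        self.assertTrue(blocks, "expected at least one block from the fixture")
```

One such test per engine (Kurrent crop for Kraken, typewritten crop for Surya) would give refactors a correctness backstop beyond mocked wiring.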

4. Subsystem health summary

| Category | PASS | Critical | Major | Minor | Cosmetic |
|---|---|---|---|---|---|
| C1 — First Contact | 0/1 applicable | 0 | 1 | 0 | 0 |
| C2 — Local Bootstrap | 2/4 applicable | 0 | 1 | 1 | 0 |
| C3 — Domain Clarity | 2/3 applicable | 0 | 1 | 0 | 0 |
| C4 — Findability | 4/4 | 0 | 0 | 0 | 0 |
| C5 — Conventions | 1/4 | 0 | 2 | 1 | 0 |
| C6 — Layering | 1/1 applicable | 0 | 0 | 0 | 0 |
| C7 — Documentation | 2/4 applicable | 0 | 2 | 0 | 0 |
| C8 — Test Trust | 3/4 | 0 | 1 | 0 | 0 |
| C9 — Operability | 0/3 applicable (partials) | 0 | 1 | 1 | 0 |
| C10 — AI smell | 3/5 | 0 | 0 | 2 | 0 |

Applicable checks: 33. Pass / partial-pass: ~18. Critical: 0. Major: 9.

Overall verdict: 🟡 Yellow — zero Critical findings, fundamentals (layering, naming, why-comments, test independence) are in good shape, but documentation drift and missing per-subsystem README/contract docs put pass-rate around 55%.

5. Domain mapping gaps

N/A — ocr-service/ IS one canonical domain (ocr/) when viewed from outside the repo. Its internal Python module organization is a local concern and is appropriately flat.

6. Cross-cutting candidates

N/A — Python service has no Java shared/ to migrate into. Internally, no module deserves promotion to a common/ subpackage at this size.

7. Dead/suspicious code register

| File:line | Smell | Suggested action |
|---|---|---|
| `ocr-service/test_sender_registry.py:2` | `call` imported but unused | Remove from import |
| `ocr-service/test_ensure_blla_model.py:3` | `call` imported but unused | Remove from import |
| `ocr-service/engines/kraken.py:103,220` | `language` parameter accepted but never passed to Kraken | Either wire it through to recognition (Kraken supports lang hints) or drop the parameter and update `main.py` callsites |
| `ocr-service/engines/surya.py:37,118` | `language` parameter accepted but never used | Same as above |
| `ocr-service/main.py:405,499,576` | `"-N", "10"` epoch count duplicated across three training endpoints | Lift to module constant or env var (drift risk) |
| `ocr-service/main.py:398–410` vs `main.py:492–504` | `/train` and `/train-sender` ketos commands are near-identical but copy-pasted | Extract a `_run_ketos_training(checkpoint_dir, ground_truth, base_model_path)` helper |
| `ocr-service/CLAUDE.md:74` | Documents `/training/submit` endpoint that does not exist in `main.py` | Replace with `/train`, `/train-sender`, `/segtrain` rows |

8. Documentation gaps

  1. No ocr-service/README.md — only CLAUDE.md exists. Per repo convention (CLAUDE.md files are agent-facing), a human-targeted README is missing. Promote or split.
  2. Single-replica constraint not in human-readable architecture docs. ocr-service/CLAUDE.md:21 says it; docker-compose.yml:75–77 says it; but docs/adr/001-ocr-python-microservice.md does not state it as a hard operational constraint. Recommendation: add an explicit "Operational constraint: single-node only — training mutates in-process model state; running >1 replica diverges weights and breaks request routing" section to ADR-001 (or open ADR-008 narrowly scoped to this) and reference it from docs/ARCHITECTURE.md (currently has no OCR section).
  3. No HTTP contract document. FastAPI exposes /docs and /openapi.json for free, but this is not mentioned in any doc. Backend OcrClient.java and RestClientOcrClient.java mirror the schema by hand — no shared contract. Recommend documenting GET /docs and adding a "Contract with backend" subsection that names the consumer and the field-by-field mapping (Java OcrClient.OcrRegion ↔ Python OcrRegion in models.py:4).
  4. Stale endpoint table. ocr-service/CLAUDE.md:74 lists /training/submit — that endpoint does not exist; actuals are /train, /train-sender, /segtrain.
  5. Missing env var. BLLA_MODEL_PATH is referenced in ensure_blla_model.py:24 and main.py:552 but absent from the env-var table in CLAUDE.md:79–90.
  6. Glossary gap. Kurrent vs Sütterlin distinction is not made; the asymmetry that HANDWRITING_LATIN uses Surya (not Kraken) is not explained anywhere.
  7. No troubleshooting section. OOM-at-1800px segtrain and .incompatible model rename behavior are only commented inline, not in docs.
  8. No fixture-image tests. test_engines.py mocks every model call; OCR correctness is not exercised end-to-end. Worth a tests/fixtures/ and a @pytest.mark.slow integration test per engine.
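
The asymmetry in gap 6 boils down to a small dispatch that is worth writing down: only HANDWRITING_KURRENT selects the Kraken HTR engine; everything else, including HANDWRITING_LATIN despite being handwriting, goes to Surya. An illustrative sketch — the function name is hypothetical, and since this report names only two of the four scriptType values explicitly, the mapping is deliberately partial:

```python
# Hypothetical dispatch illustrating the undocumented routing asymmetry
# (main.py:100). choose_engine is not a real function in main.py; only
# the two scriptType values named in this report appear explicitly.

def choose_engine(script_type: str) -> str:
    if script_type == "HANDWRITING_KURRENT":
        return "kraken"  # the only value that triggers the Kraken HTR path
    return "surya"       # everything else, HANDWRITING_LATIN included

assert choose_engine("HANDWRITING_KURRENT") == "kraken"
assert choose_engine("HANDWRITING_LATIN") == "surya"  # handwriting, but Latin script
```

A glossary that states this mapping per scriptType value would resolve gap 6 directly.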

9. Prerequisites for big-bang refactor

REFACTOR-1 / REFACTOR-2 (per the AUDIT-1/AUDIT-2 epic context) reorganize the backend Java packages by domain. They do not touch ocr-service/ directly. The constraint to flag:

  • Contract is HTTP-only and brittle. The Python OcrRequest / OcrBlock shapes are mirrored by hand in RestClientOcrClient (in org.raddatz.familienarchiv.service). If REFACTOR-1 moves the backend OCR code into a new package (e.g. org.raddatz.familienarchiv.ocr), then:
    • URL paths sent to ocr-service (/ocr, /ocr/stream, /train, /train-sender, /segtrain) must remain byte-identical — they are hardcoded in RestClientOcrClient, not derived.
    • Field names in the JSON payloads (pdfUrl, scriptType, senderModelPath, regions[].annotationId, pageNumber, x, y, width, height) must remain identical — the Python Pydantic model (models.py:4–22) and Java records are linked only by string convention.
    • The HTTP-NDJSON event types (start, preprocessing, page, error, done) and their fields are similarly contract-frozen.
  • Ideally before any refactor that touches this contract, add either (a) a generated OpenAPI client on the Java side, or (b) at minimum a contract test that POSTs a known shape to a stubbed ocr-service and asserts the mirror is intact.
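
The minimal contract test suggested in the last bullet could start with the NDJSON event stream, since its event types are enumerated above. A stdlib sketch — the event types are the five named in this report, but the per-event payload fields shown are assumptions:

```python
import json

# The five NDJSON event types this report names as contract-frozen.
FROZEN_EVENT_TYPES = {"start", "preprocessing", "page", "error", "done"}

def validate_stream(ndjson_body: str):
    """Parse an /ocr/stream-style NDJSON body; fail on unknown event types.
    Assumes each event carries its kind in a "type" field (an assumption)."""
    events = [json.loads(line) for line in ndjson_body.splitlines() if line.strip()]
    for event in events:
        if event.get("type") not in FROZEN_EVENT_TYPES:
            raise ValueError(f"unknown stream event: {event!r}")
    return events

sample = "\n".join([
    '{"type": "start"}',
    '{"type": "page", "pageNumber": 1}',
    '{"type": "done"}',
])
events = validate_stream(sample)
```

Run against a stubbed ocr-service on each side of a refactor, a check like this would catch an accidental event rename before the backend does.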

Within ocr-service/ itself, the prerequisite for any internal restructure is: add at least one fixture-image integration test per engine so a refactor cannot silently break OCR correctness while preserving the HTTP shape.

10. Top 5 prioritized recommendations

  1. Promote ocr-service/CLAUDE.md to ocr-service/README.md (or split: README for humans, CLAUDE.md only for agent-specific notes). Same act fixes C2.5 + C5.3 + C7.1.
  2. Fix the stale endpoint table in the docs — replace the non-existent /training/submit row with the three real training endpoints (/train, /train-sender, /segtrain) and add BLLA_MODEL_PATH to the env-var table. Cheap, high-impact (C7.4 + C9.2).
  3. Migrate the single-replica constraint into ADR-001 as an explicit "Operational constraint" section, then link it from docs/ARCHITECTURE.md. Today this load-bearing fact lives only in CLAUDE.md and a comment in docker-compose.yml (C5.3).
  4. Add a glossary block (Kurrent vs Sütterlin, OCR vs HTR, the four scriptType values and which engine handles each, why HANDWRITING_LATIN uses Surya) — and a "Contract with backend" subsection naming RestClientOcrClient as the consumer and pointing readers at GET /docs for live OpenAPI (C3.3 + C7.4).
  5. Add at least one fixture-image test per engine (Kurrent + typewritten), gated behind @pytest.mark.slow or a SKIP_ML_TESTS env var, so OCR correctness is verified somewhere in the suite — and lift the duplicated ketos cmd array into a single _run_ketos_training() helper to cut copy-paste between the three training endpoints (C8.2 + C10.1).
(3) Add a "Contract with backend" subsection naming `OcrClient` / `RestClientOcrClient` as the consumer and listing the field-by-field mapping | | C7.5 DB schema commented | N/A | — | Service has no DB | — | | C8.1 Mutation spot-check viable | PARTIAL PASS | — | `test_confidence.py`, `test_spell_check.py`, `test_sender_registry.py` are tight assertion-based unit tests that would catch mutations. `test_engines.py` and `test_stream.py` are heavily mocked — they verify wiring, not OCR correctness | See C8.2 | | C8.2 e2e per critical journey | FAIL | Major | No test takes a real fixture image and asserts on extracted text. The OCR pipeline's correctness against actual handwriting/typewriting is unverified by the test suite — only contract shape is tested | Add at least one fixture-image test per engine (small Kurrent crop, small typewritten crop) marked `@pytest.mark.slow`, gated behind a `SKIP_ML_TESTS` env var | | C8.3 Test names descriptive | PASS | — | Names like `test_kraken_extract_page_blocks_skips_records_without_positional_data` describe the assertion clearly | — | | C8.4 Test ordering independent | PASS | — | Tests use `patch.object` and ASGI transport per test; no module-level mutable state pollution observed | — | | C9.1 Deployment documented | PARTIAL PASS | — | `ocr-service/CLAUDE.md:114–127` covers compose; missing volume strategy (`ocr_models`, `ocr_cache`) at the `docs/` level; no `docs/DEPLOYMENT.md` | Cross-reference from ADR-001 / future DEPLOYMENT.md to compose entries and to the volume layout | | C9.2 Env vars in one place | PARTIAL FAIL | Major | `ocr-service/CLAUDE.md:79–90` documents 10 env vars, **but `BLLA_MODEL_PATH` (used in `ensure_blla_model.py:24` and `main.py:552`) is missing**. 
Also `ALLOWED_PDF_HOSTS` lives in CLAUDE.md but not in `docker-compose.yml` defaults | Add `BLLA_MODEL_PATH` row; align defaults between CLAUDE.md and `docker-compose.yml` | | C9.3 Migrations naming | N/A | — | No DB | — | | C9.4 Logs/observability | PARTIAL FAIL | Minor | Service uses stdlib logging at INFO; healthcheck exists. Where logs go in production (stdout → docker logs) and how to read them is not documented | One-line addition to README | | C10.1 No dead code | PARTIAL FAIL | Minor | `from unittest.mock import ... call ...` is imported but unused in `test_sender_registry.py:2` and `test_ensure_blla_model.py:3`. The `language` parameter is plumbed through both engines (`engines/kraken.py:103,220`, `engines/surya.py:37,118`) but never used by either Kraken or Surya — pure dead plumbing | Remove unused `call` imports; either use `language` (pass to predictors) or remove from signatures with a comment | | C10.2 No premature abstractions | PASS | — | `_SenderModelRegistry` looks abstract but is justified by concurrent loads and explicit invalidation requirements (tested in `test_sender_registry.py`) | — | | C10.3 No magic numbers | PARTIAL FAIL | Minor | `200 / 72` DPI scaling in `main.py:656`, `_MIN_SPELL_CHECK_LEN = 4` is named (good), `> 50` frequency floor in `spell_check.py:110` is justified inline (good), `"-N", "10"` epoch count repeated three times across `/train`, `/train-sender`, `/segtrain` (`main.py:405,499,576`) | Lift the ketos epoch / batch arguments into module-level constants (or env vars); they will drift across the three endpoints otherwise | | C10.4 No half-finished features | PASS | — | No `if False`, no commented-out blocks in production code | — | | C10.5 No defensive nulls | PASS | — | `boundary` / `baseline` defensive coercion in `kraken.py:128–148` is justified by mock + real-world record shape variance and is covered by an explicit test (`test_engines.py:187`) | — | ## 4. 
Subsystem health summary | Category | PASS | Critical | Major | Minor | Cosmetic | |---|---|---|---|---|---| | C1 — First Contact | 0/1 applicable | 0 | 1 | 0 | 0 | | C2 — Local Bootstrap | 2/4 applicable | 0 | 1 | 1 | 0 | | C3 — Domain Clarity | 2/3 applicable | 0 | 1 | 0 | 0 | | C4 — Findability | 4/4 | 0 | 0 | 0 | 0 | | C5 — Conventions | 1/4 | 0 | 2 | 1 | 0 | | C6 — Layering | 1/1 applicable | 0 | 0 | 0 | 0 | | C7 — Documentation | 2/4 applicable | 0 | 2 | 0 | 0 | | C8 — Test Trust | 3/4 | 0 | 1 | 0 | 0 | | C9 — Operability | 0/3 applicable (partials) | 0 | 1 | 1 | 0 | | C10 — AI smell | 3/5 | 0 | 0 | 2 | 0 | **Applicable checks:** 33. **Pass / partial-pass:** ~18. **Critical:** 0. **Major:** 9. **Overall verdict: 🟡 Yellow** — zero Critical findings, fundamentals (layering, naming, why-comments, test independence) are in good shape, but documentation drift and missing per-subsystem README/contract docs put pass-rate around 55%. ## 5. Domain mapping gaps N/A — `ocr-service/` IS one canonical domain (`ocr/`) when viewed from outside the repo. Its internal Python module organization is local concern and is appropriately flat. ## 6. Cross-cutting candidates N/A — Python service has no Java `shared/` to migrate into. Internally, no module deserves promotion to a `common/` subpackage at this size. ## 7. 
Dead/suspicious code register | File:line | Smell | Suggested action | |---|---|---| | `ocr-service/test_sender_registry.py:2` | `call` imported but unused | Remove from import | | `ocr-service/test_ensure_blla_model.py:3` | `call` imported but unused | Remove from import | | `ocr-service/engines/kraken.py:103,220` | `language` parameter accepted but never passed to Kraken | Either wire it through to recognition (Kraken supports lang hints) or drop the parameter and update `main.py` callsites | | `ocr-service/engines/surya.py:37,118` | `language` parameter accepted but never used | Same as above | | `ocr-service/main.py:405,499,576` | `"-N", "10"` epoch count duplicated across three training endpoints | Lift to module constant or env var (drift risk) | | `ocr-service/main.py:398–410` vs `main.py:492–504` | `/train` and `/train-sender` ketos commands are near-identical but copy-pasted | Extract a `_run_ketos_training(checkpoint_dir, ground_truth, base_model_path)` helper | | `ocr-service/CLAUDE.md:74` | Documents `/training/submit` endpoint that **does not exist** in `main.py` | Replace with `/train`, `/train-sender`, `/segtrain` rows | ## 8. Documentation gaps 1. **No `ocr-service/README.md`** — only `CLAUDE.md` exists. Per repo convention (`CLAUDE.md` files are agent-facing), a human-targeted README is missing. Promote or split. 2. **Single-replica constraint not in human-readable architecture docs.** `ocr-service/CLAUDE.md:21` says it; `docker-compose.yml:75–77` says it; but `docs/adr/001-ocr-python-microservice.md` does not state it as a hard operational constraint. **Recommendation:** add an explicit "Operational constraint: single-node only ��� training mutates in-process model state; running >1 replica diverges weights and breaks request routing" section to ADR-001 (or open ADR-008 narrowly scoped to this) and reference it from `docs/ARCHITECTURE.md` (currently has no OCR section). 3. 
**No HTTP contract document.** FastAPI exposes `/docs` and `/openapi.json` for free, but this is not mentioned in any doc. Backend `OcrClient.java` and `RestClientOcrClient.java` mirror the schema by hand — no shared contract. Recommend documenting `GET /docs` and adding a "Contract with backend" subsection that names the consumer and the field-by-field mapping (Java `OcrClient.OcrRegion` ↔ Python `OcrRegion` in `models.py:4`). 4. **Stale endpoint table.** `ocr-service/CLAUDE.md:74` lists `/training/submit` — that endpoint does not exist; actuals are `/train`, `/train-sender`, `/segtrain`. 5. **Missing env var.** `BLLA_MODEL_PATH` is referenced in `ensure_blla_model.py:24` and `main.py:552` but absent from the env-var table in `CLAUDE.md:79–90`. 6. **Glossary gap.** Kurrent vs Sütterlin distinction is not made; the asymmetry that `HANDWRITING_LATIN` uses Surya (not Kraken) is not explained anywhere. 7. **No troubleshooting section.** OOM-at-1800px segtrain and `.incompatible` model rename behavior are only commented inline, not in docs. 8. **No fixture-image tests.** `test_engines.py` mocks every model call; OCR correctness is not exercised end-to-end. Worth a `tests/fixtures/` and a `@pytest.mark.slow` integration test per engine. ## 9. Prerequisites for big-bang refactor REFACTOR-1 / REFACTOR-2 (per the AUDIT-1/AUDIT-2 epic context) reorganize the **backend** Java packages by domain. They do not touch `ocr-service/` directly. The constraint to flag: - **Contract is HTTP-only and brittle.** The Python `OcrRequest` / `OcrBlock` shapes are mirrored by hand in `RestClientOcrClient` (in `org.raddatz.familienarchiv.service`). If REFACTOR-1 moves the backend OCR code into a new package (e.g. `org.raddatz.familienarchiv.ocr`), then: - URL paths sent to ocr-service (`/ocr`, `/ocr/stream`, `/train`, `/train-sender`, `/segtrain`) must remain **byte-identical** — they are hardcoded in `RestClientOcrClient`, not derived. 
  - Field names in the JSON payloads (`pdfUrl`, `scriptType`, `senderModelPath`, `regions[].annotationId`, `pageNumber`, `x`, `y`, `width`, `height`) must remain identical — the Python Pydantic models (`models.py:4–22`) and the Java records are linked only by string convention.
  - The NDJSON-over-HTTP event types (`start`, `preprocessing`, `page`, `error`, `done`) and their fields are similarly contract-frozen.
- **Ideally, before any refactor that touches this contract,** add either (a) a generated OpenAPI client on the Java side, or (b) at minimum a contract test that POSTs a known shape to a stubbed ocr-service and asserts the mirror is intact.

Within `ocr-service/` itself, the prerequisite for any internal restructure is: **add at least one fixture-image integration test per engine** so a refactor cannot silently break OCR correctness while preserving the HTTP shape.

## 10. Top 5 prioritized recommendations

1. **Promote `ocr-service/CLAUDE.md` to `ocr-service/README.md`** (or split: README for humans, CLAUDE.md only for agent-specific notes). The same act fixes C2.5 + C5.3 + C7.1.
2. **Fix the stale endpoint table in the docs** — replace the non-existent `/training/submit` row with the three real training endpoints (`/train`, `/train-sender`, `/segtrain`) and add `BLLA_MODEL_PATH` to the env-var table. Cheap, high-impact (C7.4 + C9.2).
3. **Migrate the single-replica constraint into ADR-001** as an explicit "Operational constraint" section, then link it from `docs/ARCHITECTURE.md`. Today this load-bearing fact lives only in `CLAUDE.md` and a comment in `docker-compose.yml` (C5.3).
4. **Add a glossary block** (Kurrent vs Sütterlin, OCR vs HTR, the four `scriptType` values and which engine handles each, why `HANDWRITING_LATIN` uses Surya) — plus a "Contract with backend" subsection naming `RestClientOcrClient` as the consumer and pointing readers at `GET /docs` for live OpenAPI (C3.3 + C7.4).
5. **Add at least one fixture-image test per engine** (Kurrent + typewritten), gated behind `@pytest.mark.slow` or a `SKIP_ML_TESTS` env var, so OCR correctness is verified somewhere in the suite — and lift the duplicated ketos `cmd` array into a single `_run_ketos_training()` helper to cut copy-paste between the three training endpoints (C8.2 + C10.1).
Reference: marcel/familienarchiv#390