audit(ocr-service): score ocr-service/ against legibility rubric C1–C10 #390

Closed
opened 2026-05-04 16:04:10 +02:00 by marcel · 1 comment
Owner

Context

Part of Epic #387 — Codebase Legibility Audit. This is AUDIT-3: read-only assessment of ocr-service/ against the full Legibility Rubric. The OCR service is domain-bearing (it embeds historical script knowledge, confidence calibration, Kurrent/Sütterlin model lifecycle) so it earns its own audit rather than folding into rest-of-repo.

The output is a Markdown report under docs/audits/audit-ocr-report.md following the per-subsystem orient template. No code changes in this issue.

Inputs (read first)

See the first comment of #387 for the complete reference bundle.

Note: the OCR service is a Python FastAPI microservice. Some rubric checks (e.g., C3.4 about Java package structure) are N/A here. C3.5 is also N/A. Apply C5/C6/C7 to the Python module structure on its own terms — the canonical "domain-based packaging" goal applies to the backend Java code, NOT to the internal organization of ocr-service/. The OCR service IS one domain (ocr/) when viewed from the outside; its internal organization is a local concern.

Scope (in)

  • Everything under ocr-service/ — main.py, engines/, supporting modules, requirements, tests
  • ocr-service/Dockerfile, ocr-service/CLAUDE.md (evaluate as documentation surface)
  • The contract between OCR service and backend: HTTP endpoints, request/response shapes, error semantics

Scope (out)

  • ocr-service/.venv/, ocr-service/__pycache__/, ocr-service/engines/__pycache__/
  • Backend's OcrController/OcrService/OcrClient (AUDIT-2 covers those — but note the contract clarity at the boundary in §3)
  • Model files / training data

Rubric checks particularly relevant to this subsystem

C1, C2 (Python prerequisites + bootstrap), C3.1, C3.3 (terminology — Kurrent vs Sütterlin vs HTR vs OCR), C5 (Python conventions), C7 (especially C7.4 on API contract documentation), C8 (test coverage of OCR pipeline), C9.1 + C9.2 (deployment + env vars — single-node constraint per ADR-001 in ocr-service/CLAUDE.md), C10.

Specific things to verify (subsystem-specific)

  • Bootstrap. Can a stranger run ocr-service/ standalone (without backend) for development? Is this documented?
  • Engine pluggability. The Surya/Kraken split is non-obvious. Is the choice between engines documented?
  • Single-node ADR. ocr-service/CLAUDE.md references ADR-001 (single-node only because in-process model reload). This MUST appear in human-readable docs after refactor (C5.3).
  • Contract clarity. Does the OCR service know what backend expects? Is there a shared schema?

Acceptance criteria

  • File docs/audits/audit-ocr-report.md exists on a feature branch
  • Report contains all 10 template sections
  • §3 scorecard explicitly marks Java-specific checks as N/A with reason
  • §4 assigns 🟢 / 🟡 / 🔴
  • §8 explicitly recommends migrating ADR-001 (single-node) into human-readable docs (CONTRIBUTING.md or docs/ARCHITECTURE.md)
  • §10 lists top-5 recommendations
  • No prose contains "ask Marcel," "Claude generated," "TODO," or "non-obvious" without explanation
  • PR opened and merged before this issue is closed

Definition of Done

Report committed under docs/audits/audit-ocr-report.md on main. Top-5 recommendations summarized as a closing comment.

Dispatch

Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit ocr-service/ per the orient template and produce docs/audits/audit-ocr-report.md. Read-only; no code changes.

marcel added this to the Codebase Legibility milestone 2026-05-04 16:04:10 +02:00
marcel added the P1-high, audit, legibility labels 2026-05-04 16:05:42 +02:00
Author
Owner

AUDIT-3 — ocr-service/ Legibility Report

1. Subsystem profile

ocr-service/ is a Python 3.11 / FastAPI microservice that performs OCR (typewritten Latin) and HTR (handwritten Kurrent/Sütterlin) on PDFs delivered via presigned MinIO URLs. It owns: model lifecycle (Kraken + Surya), CLAHE-based image preprocessing, per-character → per-word confidence reconstruction, German spell-check post-processing with a historical wordlist, and in-process model fine-tuning (/train, /train-sender, /segtrain). It is stateless w.r.t. business data — all job orchestration lives in the Spring Boot backend. Single-replica is mandatory because training mutates in-process model state.

ocr-service/
├── main.py                     # FastAPI app + 6 endpoints + training pipeline
├── models.py                   # Pydantic OcrRequest / OcrRegion / OcrBlock
├── preprocessing.py            # CLAHE + grayscale + blur
├── confidence.py               # words_from_characters + [unleserlich] markers
├── spell_check.py              # pyspellchecker + de_historical wordlist
├── ensure_blla_model.py        # startup helper: download/validate blla
├── engines/
│   ├── kraken.py               # Kurrent HTR + per-sender LRU registry
│   └── surya.py                # typewritten / Latin handwriting
├── dictionaries/de_historical.txt
├── Dockerfile, entrypoint.sh, requirements.txt, .dockerignore
├── CLAUDE.md                   # current docs surface
└── test_*.py                   # 7 test modules

Total ~2,717 LOC across 16 Python files (engines + tests included).
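
The per-character → per-word confidence reconstruction named above can be sketched as follows. This is a minimal illustration, not the actual confidence.py implementation: the names words_from_characters and [unleserlich] come from this report, while the grouping rule (split on whitespace, word confidence = minimum of its characters) is an assumption.

```python
# Illustrative sketch of per-character -> per-word confidence
# reconstruction. Function/marker names are from the report; the
# aggregation rule (min over characters) is an assumption.

UNREADABLE_MARKER = "[unleserlich]"

def words_from_characters(chars):
    """Group (char, confidence) pairs into words; a word's confidence
    is the minimum of its characters' confidences."""
    words, current, confs = [], [], []
    for ch, conf in chars:
        if ch.isspace():
            if current:
                words.append(("".join(current), min(confs)))
                current, confs = [], []
        else:
            current.append(ch)
            confs.append(conf)
    if current:
        words.append(("".join(current), min(confs)))
    return words

def apply_confidence_markers(chars, threshold=0.5):
    """Replace words whose confidence falls below the threshold.
    Note: collapses runs of whitespace to single spaces (sketch only)."""
    return " ".join(
        word if conf >= threshold else UNREADABLE_MARKER
        for word, conf in words_from_characters(chars)
    )
```

For example, `apply_confidence_markers([("H", 0.9), ("i", 0.9), (" ", 1.0), ("x", 0.1)])` yields `"Hi [unleserlich]"`.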

2. Inventory

Application modules

| Module | LOC | Responsibility |
|---|---|---|
| `main.py` | 660 | FastAPI app, 6 HTTP endpoints, ZIP-Slip-safe extraction, ketos subprocess driver, training-token guard, SSRF host allowlist |
| `models.py` | 35 | Pydantic models — `OcrRequest`, `OcrRegion`, `OcrBlock` |
| `preprocessing.py` | 50 | `preprocess_page()` — CLAHE + grayscale + Gaussian blur, falls back silently on cv2/value/memory errors |
| `confidence.py` | 100 | `apply_confidence_markers()`, `words_from_characters()`, `get_threshold()`, `[unleserlich]` / `[?]` constants |
| `spell_check.py` | 117 | German spell-check with historical wordlist; collapses adjacent markers |
| `ensure_blla_model.py` | 80 | Container-startup helper to download/validate the blla segmentation model |
| `engines/kraken.py` | 339 | Kraken engine wrapper, LRU per-sender model cache, pure-Python convex hull + min-bounding-rect |
| `engines/surya.py` | 131 | Surya wrapper (lazy-loaded) |

HTTP endpoints (actual)

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | 503 until lifespan finishes; otherwise `{status, surya, kraken}` |
| `/ocr` | POST | Sync OCR → `list[OcrBlock]` |
| `/ocr/stream` | POST | NDJSON stream — supports guided regions and per-sender models |
| `/train` | POST | Fine-tune the global Kurrent model |
| `/train-sender` | POST | Fine-tune a per-sender Kurrent model to a target path under `/app/models/` |
| `/segtrain` | POST | Fine-tune the blla segmentation model |
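
Because the request/response shapes are hand-mirrored between the Python models and the Java client (see §9), even a stdlib sketch of the payload shape is useful documentation. The field names below are the ones listed later in this report; the nesting, types, and example values are assumptions, not the actual models.py schema.

```python
import json

# Hypothetical /ocr request payload. Field names (pdfUrl, scriptType,
# senderModelPath, regions[].annotationId, pageNumber, x, y, width,
# height) are taken from section 9 of this report; types and nesting
# are illustrative assumptions.
request = {
    "pdfUrl": "http://minio:9000/archive/letter.pdf?X-Amz-Signature=example",
    "scriptType": "HANDWRITING_KURRENT",
    "senderModelPath": None,
    "regions": [
        {
            "annotationId": 42,
            "pageNumber": 1,
            "x": 100, "y": 200, "width": 800, "height": 120,
        }
    ],
}

payload = json.dumps(request)

# The Java OcrClient mirrors these names by string convention only, so
# a round-trip key check is the weakest useful contract guard:
assert set(json.loads(payload)) == {"pdfUrl", "scriptType", "senderModelPath", "regions"}
```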

Test modules

| Test file | Coverage | External ML deps required? |
|---|---|---|
| `test_confidence.py` | confidence.py — 22 cases | No |
| `test_spell_check.py` | spell_check.py — 16 cases | pyspellchecker only |
| `test_preprocessing.py` | preprocessing.py — 5 cases incl. cv2 fallback | numpy + cv2 only |
| `test_sender_registry.py` | LRU cache logic | Mocks Kraken loader |
| `test_engines.py` | Mocked Surya/Kraken record extraction + signature parity | Heavy mocking via `sys.modules` |
| `test_stream.py` | `/ocr/stream` end-to-end via ASGITransport | Mocks engines + PDF download |
| `test_training_auth.py` | Token-guard fail-closed semantics | Mocks |
| `test_ensure_blla_model.py` | bootstrap helper | Mocks subprocess |

There are no tests that exercise a real model on a real image. Every "engine" test uses MagicMock. End-to-end correctness against a fixture image is not verified anywhere in the repo.
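
"Heavy mocking via sys.modules" typically means stubbing the ML packages in sys.modules before the module under test imports them, so the test never loads torch or model weights. A minimal sketch of the pattern — the module name `surya` matches the real dependency, but `run_ocr` and its return shape are illustrative, not the actual Surya API:

```python
import sys
import types
from unittest import mock

# Stub a heavyweight dependency before anything imports it. "run_ocr"
# and the record shape are illustrative names, not the real Surya API.
fake_surya = types.ModuleType("surya")
fake_surya.run_ocr = mock.MagicMock(
    return_value=[{"text": "Brief", "confidence": 0.97}]
)

with mock.patch.dict(sys.modules, {"surya": fake_surya}):
    # Inside this block, any `import surya` resolves to the stub, so an
    # engine wrapper imported here would call the mock instead of torch.
    import surya
    records = surya.run_ocr("page.png")
```

This verifies wiring (arguments passed, records unpacked) but, as noted above, says nothing about OCR correctness.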

3. Rubric scorecard

| Check | Result | Severity | Evidence | Recommendation |
|---|---|---|---|---|
| C1.1 Root README purpose | N/A here | — | Root has no `README.md` (only `frontend/README.md`); this is a repo-wide concern covered by AUDIT-1 | — |
| C1.2 README lists 5 components | N/A here | — | Same — repo-level | — |
| C1.3 `docs/ARCHITECTURE.md` w/ diagram | FAIL (referrals exist) | Major | `docs/architecture/c4-diagrams.md` exists; no top-level ARCHITECTURE.md links into ocr-service | Add an OCR section to `docs/ARCHITECTURE.md` summarizing the request flow backend→ocr-service→MinIO |
| C2.1 Single command brings stack up | PASS | — | `docker-compose up -d` brings `ocr-service` together with backend (`docker-compose.yml:78–110`) | — |
| C2.2 Prerequisites with versions | PASS | — | `ocr-service/CLAUDE.md:9–17` lists Python 3.11, FastAPI 0.115.6, kraken==7.0, surya-ocr 0.17.1, torch 2.7.1 (CPU); `requirements.txt` pins | — |
| C2.3 Default credentials documented | N/A | — | Service has no auth besides `TRAINING_TOKEN`; the env table says default is empty (fails closed) | — |
| C2.4 Common failures with fixes | FAIL | Minor | No troubleshooting section. blla 800px-vs-1800px memory fight is documented inline at `ocr-service/main.py:578–602` only | Add a "known issues" subsection covering OOM at 1800px segtrain, the `.incompatible` rename behavior, and 200-DPI vs 300-DPI memory tradeoff |
| C2.5 Per-subsystem README | FAIL | Major | Only `ocr-service/CLAUDE.md` exists; this is an LLM-prefixed file that `COLLABORATING.md` treats as ephemeral | Promote `ocr-service/CLAUDE.md` to `ocr-service/README.md` (or split: README for humans, CLAUDE.md only for agent-specific notes) |
| C3.1 Domains stated in docs | PASS | — | `ocr-service/CLAUDE.md:1–18` cleanly states OCR vs HTR boundary | — |
| C3.2 Per-domain definition | PASS | — | Surya = typewritten + modern Latin handwriting; Kraken = Kurrent/Sütterlin. Docs state which is for what | — |
| C3.3 Glossary disambiguates terms | PARTIAL FAIL | Major | OCR vs HTR is contrasted, but Kurrent vs Sütterlin is conflated (CLAUDE.md uses "Kurrent" only; ADR-001 uses both). Spell-check covers `HANDWRITING_KURRENT` and `HANDWRITING_LATIN` (`main.py:43`) but only `HANDWRITING_KURRENT` triggers Kraken (`main.py:100`) — that asymmetry is undocumented | Add a glossary block: Kurrent (broad German cursive family pre-1941), Sütterlin (1915–1941 specific variant), HTR vs OCR, the four `scriptType` values and which engine each uses, and why `HANDWRITING_LATIN` uses Surya despite being handwriting |
| C3.4 Backend domain packaging | N/A | — | Python service — Java rule does not apply | — |
| C3.5 Frontend lib domain match | N/A | — | Python service has no frontend | — |
| C3.6 `shared/` justified | N/A | — | No `shared/` concept inside ocr-service | — |
| C4.1 Find files for a domain | PASS | — | Tree is small and flat; `engines/` cleanly separates the two engines | — |
| C4.2 Find feature end-to-end | PASS | — | Endpoint → handler → engine wrapper is one hop in `main.py` | — |
| C4.3 No `Helper`/`Utils` names | PASS | — | All filenames concept-named (`preprocessing`, `confidence`, `spell_check`, `ensure_blla_model`) | — |
| C4.4 Routes predictable | PASS | — | `/ocr`, `/ocr/stream`, `/train`, `/train-sender`, `/segtrain`, `/health` — all guessable | — |
| C5.1 How-to add an endpoint/etc. | FAIL | Major | No conventions doc for ocr-service contributors | Add a short "Conventions" section to README covering: how to add a new engine, how to add a new env var, how to add a new endpoint, where to put tests |
| C5.2 One way to do common things | PARTIAL PASS | — | `HTTPException` everywhere; one error idiom; but engine-loading is split (Surya lazy in `surya.py:12`, Kraken eager via lifespan in `main.py:60`) — undocumented asymmetry | Add a note explaining why Kraken is eager (model exists or it doesn't) and Surya is lazy (saves ~5GB at idle) |
| C5.3 CLAUDE.md rules mirrored in human docs | FAIL | Major | The single-replica constraint is in `ocr-service/CLAUDE.md:21` and inline in `docker-compose.yml:75–77`, but the ADR (`docs/adr/001-ocr-python-microservice.md`) does NOT state it as a constraint — it only mentions "model reloading" while rejecting `ProcessBuilder` | Add an explicit "Operational constraint: single-node only" section to ADR-001 (or new ADR-008) and to `docs/ARCHITECTURE.md`. See §8 |
| C5.4 Naming conventions stated | FAIL | Minor | Naming is consistent (snake_case, descriptive) but the convention is not stated anywhere | Add a one-paragraph naming convention to README |
| C6.1 No controller→repo (Java) | N/A | — | Python | — |
| C6.2 No cross-domain repo (Java) | N/A | — | Python | — |
| C6.3 Frontend cross-domain | N/A | — | No frontend | — |
| C6.4 Cross-cutting in `shared/` | PASS | — | Cross-cutting concerns (logging, env reading) live where used; no duplication | — |
| C7.1 Per-domain README | FAIL | Major | `CLAUDE.md` is the only doc surface; should be a `README.md` (see C2.5) | Promote / split |
| C7.2 Why-comments on non-obvious blocks | PASS | — | `main.py:578–602` (segtrain 800px logic), `main.py:654–656` (200 DPI choice), `kraken.py:198–200` (Kraken bounds quirk), `confidence.py:110` strict-`>` rationale all carry strong why-comments | — |
| C7.3 No "ask Marcel" / "non-obvious" / "Claude generated" | PASS | — | grep clean. Single use of "non-obvious" in `confidence.py` neighborhood is actually fine (not a TODO marker) | — |
| C7.4 API contracts documented | PARTIAL FAIL | Major | FastAPI auto-generates `/docs` and `/openapi.json` (default with FastAPI), but this is not mentioned anywhere. The endpoint table in `ocr-service/CLAUDE.md:70–75` lists `/training/submit` which does not exist; actual endpoints are `/train`, `/train-sender`, `/segtrain` (`main.py:358,450,530`). There is no shared schema between the Python `OcrRequest`/`OcrBlock` (`models.py`) and the Java `OcrClient.OcrRegion` (`backend/.../OcrClient.java:19`) — they are hand-mirrored | (1) Fix the stale endpoint table. (2) Document `GET /docs` for live OpenAPI. (3) Add a "Contract with backend" subsection naming `OcrClient` / `RestClientOcrClient` as the consumer and listing the field-by-field mapping |
| C7.5 DB schema commented | N/A | — | Service has no DB | — |
| C8.1 Mutation spot-check viable | PARTIAL PASS | — | `test_confidence.py`, `test_spell_check.py`, `test_sender_registry.py` are tight assertion-based unit tests that would catch mutations. `test_engines.py` and `test_stream.py` are heavily mocked — they verify wiring, not OCR correctness | See C8.2 |
| C8.2 e2e per critical journey | FAIL | Major | No test takes a real fixture image and asserts on extracted text. The OCR pipeline's correctness against actual handwriting/typewriting is unverified by the test suite — only contract shape is tested | Add at least one fixture-image test per engine (small Kurrent crop, small typewritten crop) marked `@pytest.mark.slow`, gated behind a `SKIP_ML_TESTS` env var |
| C8.3 Test names descriptive | PASS | — | Names like `test_kraken_extract_page_blocks_skips_records_without_positional_data` describe the assertion clearly | — |
| C8.4 Test ordering independent | PASS | — | Tests use `patch.object` and ASGI transport per test; no module-level mutable state pollution observed | — |
| C9.1 Deployment documented | PARTIAL PASS | — | `ocr-service/CLAUDE.md:114–127` covers compose; missing volume strategy (`ocr_models`, `ocr_cache`) at the docs/ level; no docs/DEPLOYMENT.md | Cross-reference from ADR-001 / future DEPLOYMENT.md to compose entries and to the volume layout |
| C9.2 Env vars in one place | PARTIAL FAIL | Major | `ocr-service/CLAUDE.md:79–90` documents 10 env vars, but `BLLA_MODEL_PATH` (used in `ensure_blla_model.py:24` and `main.py:552`) is missing. Also `ALLOWED_PDF_HOSTS` lives in CLAUDE.md but not in docker-compose.yml defaults | Add `BLLA_MODEL_PATH` row; align defaults between CLAUDE.md and docker-compose.yml |
| C9.3 Migrations naming | N/A | — | No DB | — |
| C9.4 Logs/observability | PARTIAL FAIL | Minor | Service uses stdlib logging at INFO; healthcheck exists. Where logs go in production (stdout → docker logs) and how to read them is not documented | One-line addition to README |
| C10.1 No dead code | PARTIAL FAIL | Minor | `call` (`from unittest.mock import ... call ...`) is imported but unused in `test_sender_registry.py:2` and `test_ensure_blla_model.py:3`. The `language` parameter is plumbed through both engines (`engines/kraken.py:103,220`, `engines/surya.py:37,118`) but never used by either Kraken or Surya — pure dead plumbing | Remove unused `call` imports; either use `language` (pass to predictors) or remove from signatures with a comment |
| C10.2 No premature abstractions | PASS | — | `_SenderModelRegistry` looks abstract but is justified by concurrent loads and explicit invalidation requirements (tested in `test_sender_registry.py`) | — |
| C10.3 No magic numbers | PARTIAL FAIL | Minor | 200 / 72 DPI scaling in `main.py:656`; `_MIN_SPELL_CHECK_LEN = 4` is named (good); `> 50` frequency floor in `spell_check.py:110` is justified inline (good); `"-N", "10"` epoch count repeated three times across `/train`, `/train-sender`, `/segtrain` (`main.py:405,499,576`) | Lift the ketos epoch / batch arguments into module-level constants (or env vars); they will drift across the three endpoints otherwise |
| C10.4 No half-finished features | PASS | — | No `if False`, no commented-out blocks in production code | — |
| C10.5 No defensive nulls | PASS | — | boundary / baseline defensive coercion in `kraken.py:128–148` is justified by mock + real-world record shape variance and is covered by an explicit test (`test_engines.py:187`) | — |
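
The C8.2 gap could be closed with an env-gated fixture test. A sketch using stdlib unittest — the scorecard suggests `@pytest.mark.slow`; `unittest.skipUnless` shown here is the equivalent stdlib mechanism, and the engine function name, its signature, and the fixture path are hypothetical:

```python
import os
import unittest

# Gate real-model tests behind an env var so CI without models/GPU skips
# them. SKIP_ML_TESTS is the name this report recommends; it defaults to
# skipping unless explicitly set to "0".
RUN_ML_TESTS = os.environ.get("SKIP_ML_TESTS", "1") != "1"

@unittest.skipUnless(RUN_ML_TESTS, "set SKIP_ML_TESTS=0 to run real-model tests")
class KrakenFixtureTest(unittest.TestCase):
    def test_kurrent_crop_extracts_known_text(self):
        # Real import deferred into the gated test so collection stays cheap.
        # extract_page_blocks and the fixture path are placeholder names.
        from engines import kraken
        blocks = kraken.extract_page_blocks("tests/fixtures/kurrent_crop.png")
        self.assertTrue(blocks, "expected at least one block from the fixture")
```

One such test per engine (Kurrent crop for Kraken, typewritten crop for Surya) would give refactors a correctness backstop beyond mocked wiring.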

4. Subsystem health summary

| Category | PASS | Critical | Major | Minor | Cosmetic |
|---|---|---|---|---|---|
| C1 — First Contact | 0/1 applicable | 0 | 1 | 0 | 0 |
| C2 — Local Bootstrap | 2/4 applicable | 0 | 1 | 1 | 0 |
| C3 — Domain Clarity | 2/3 applicable | 0 | 1 | 0 | 0 |
| C4 — Findability | 4/4 | 0 | 0 | 0 | 0 |
| C5 — Conventions | 1/4 | 0 | 2 | 1 | 0 |
| C6 — Layering | 1/1 applicable | 0 | 0 | 0 | 0 |
| C7 — Documentation | 2/4 applicable | 0 | 2 | 0 | 0 |
| C8 — Test Trust | 3/4 | 0 | 1 | 0 | 0 |
| C9 — Operability | 0/3 applicable (partials) | 0 | 1 | 1 | 0 |
| C10 — AI smell | 3/5 | 0 | 0 | 2 | 0 |

Applicable checks: 33. Pass / partial-pass: ~18. Critical: 0. Major: 9.

Overall verdict: 🟡 Yellow — zero Critical findings, fundamentals (layering, naming, why-comments, test independence) are in good shape, but documentation drift and missing per-subsystem README/contract docs put pass-rate around 55%.

5. Domain mapping gaps

N/A — ocr-service/ IS one canonical domain (ocr/) when viewed from outside the repo. Its internal Python module organization is a local concern and is appropriately flat.

6. Cross-cutting candidates

N/A — Python service has no Java shared/ to migrate into. Internally, no module deserves promotion to a common/ subpackage at this size.

7. Dead/suspicious code register

| File:line | Smell | Suggested action |
|---|---|---|
| `ocr-service/test_sender_registry.py:2` | `call` imported but unused | Remove from import |
| `ocr-service/test_ensure_blla_model.py:3` | `call` imported but unused | Remove from import |
| `ocr-service/engines/kraken.py:103,220` | `language` parameter accepted but never passed to Kraken | Either wire it through to recognition (Kraken supports lang hints) or drop the parameter and update `main.py` callsites |
| `ocr-service/engines/surya.py:37,118` | `language` parameter accepted but never used | Same as above |
| `ocr-service/main.py:405,499,576` | `"-N", "10"` epoch count duplicated across three training endpoints | Lift to module constant or env var (drift risk) |
| `ocr-service/main.py:398–410` vs `main.py:492–504` | `/train` and `/train-sender` ketos commands are near-identical but copy-pasted | Extract a `_run_ketos_training(checkpoint_dir, ground_truth, base_model_path)` helper |
| `ocr-service/CLAUDE.md:74` | Documents `/training/submit` endpoint that does not exist in `main.py` | Replace with `/train`, `/train-sender`, `/segtrain` rows |

8. Documentation gaps

  1. No ocr-service/README.md — only CLAUDE.md exists. Per repo convention (CLAUDE.md files are agent-facing), a human-targeted README is missing. Promote or split.
  2. Single-replica constraint not in human-readable architecture docs. ocr-service/CLAUDE.md:21 says it; docker-compose.yml:75–77 says it; but docs/adr/001-ocr-python-microservice.md does not state it as a hard operational constraint. Recommendation: add an explicit "Operational constraint: single-node only — training mutates in-process model state; running >1 replica diverges weights and breaks request routing" section to ADR-001 (or open ADR-008 narrowly scoped to this) and reference it from docs/ARCHITECTURE.md (currently has no OCR section).
  3. No HTTP contract document. FastAPI exposes /docs and /openapi.json for free, but this is not mentioned in any doc. Backend OcrClient.java and RestClientOcrClient.java mirror the schema by hand — no shared contract. Recommend documenting GET /docs and adding a "Contract with backend" subsection that names the consumer and the field-by-field mapping (Java OcrClient.OcrRegion ↔ Python OcrRegion in models.py:4).
  4. Stale endpoint table. ocr-service/CLAUDE.md:74 lists /training/submit — that endpoint does not exist; actuals are /train, /train-sender, /segtrain.
  5. Missing env var. BLLA_MODEL_PATH is referenced in ensure_blla_model.py:24 and main.py:552 but absent from the env-var table in CLAUDE.md:79–90.
  6. Glossary gap. Kurrent vs Sütterlin distinction is not made; the asymmetry that HANDWRITING_LATIN uses Surya (not Kraken) is not explained anywhere.
  7. No troubleshooting section. OOM-at-1800px segtrain and .incompatible model rename behavior are only commented inline, not in docs.
  8. No fixture-image tests. test_engines.py mocks every model call; OCR correctness is not exercised end-to-end. Worth a tests/fixtures/ and a @pytest.mark.slow integration test per engine.
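
The asymmetry in gap 6 boils down to a small dispatch that is worth writing down: only HANDWRITING_KURRENT selects the Kraken HTR engine; everything else, including HANDWRITING_LATIN despite being handwriting, goes to Surya. An illustrative sketch — the function name is hypothetical, and since this report names only two of the four scriptType values explicitly, the mapping is deliberately partial:

```python
# Hypothetical dispatch illustrating the undocumented routing asymmetry
# (main.py:100). choose_engine is not a real function in main.py; only
# the two scriptType values named in this report appear explicitly.

def choose_engine(script_type: str) -> str:
    if script_type == "HANDWRITING_KURRENT":
        return "kraken"  # the only value that triggers the Kraken HTR path
    return "surya"       # everything else, HANDWRITING_LATIN included

assert choose_engine("HANDWRITING_KURRENT") == "kraken"
assert choose_engine("HANDWRITING_LATIN") == "surya"  # handwriting, but Latin script
```

A glossary that states this mapping per scriptType value would resolve gap 6 directly.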

9. Prerequisites for big-bang refactor

REFACTOR-1 / REFACTOR-2 (per the AUDIT-1/AUDIT-2 epic context) reorganize the backend Java packages by domain. They do not touch ocr-service/ directly. The constraint to flag:

  • Contract is HTTP-only and brittle. The Python OcrRequest / OcrBlock shapes are mirrored by hand in RestClientOcrClient (in org.raddatz.familienarchiv.service). If REFACTOR-1 moves the backend OCR code into a new package (e.g. org.raddatz.familienarchiv.ocr), then:
    • URL paths sent to ocr-service (/ocr, /ocr/stream, /train, /train-sender, /segtrain) must remain byte-identical — they are hardcoded in RestClientOcrClient, not derived.
    • Field names in the JSON payloads (pdfUrl, scriptType, senderModelPath, regions[].annotationId, pageNumber, x, y, width, height) must remain identical — the Python Pydantic model (models.py:4–22) and Java records are linked only by string convention.
    • The HTTP-NDJSON event types (start, preprocessing, page, error, done) and their fields are similarly contract-frozen.
  • Ideally before any refactor that touches this contract, add either (a) a generated OpenAPI client on the Java side, or (b) at minimum a contract test that POSTs a known shape to a stubbed ocr-service and asserts the mirror is intact.
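
The minimal contract test suggested in the last bullet could start with the NDJSON event stream, since its event types are enumerated above. A stdlib sketch — the event types are the five named in this report, but the per-event payload fields shown are assumptions:

```python
import json

# The five NDJSON event types this report names as contract-frozen.
FROZEN_EVENT_TYPES = {"start", "preprocessing", "page", "error", "done"}

def validate_stream(ndjson_body: str):
    """Parse an /ocr/stream-style NDJSON body; fail on unknown event types.
    Assumes each event carries its kind in a "type" field (an assumption)."""
    events = [json.loads(line) for line in ndjson_body.splitlines() if line.strip()]
    for event in events:
        if event.get("type") not in FROZEN_EVENT_TYPES:
            raise ValueError(f"unknown stream event: {event!r}")
    return events

sample = "\n".join([
    '{"type": "start"}',
    '{"type": "page", "pageNumber": 1}',
    '{"type": "done"}',
])
events = validate_stream(sample)
```

Run against a stubbed ocr-service on each side of a refactor, a check like this would catch an accidental event rename before the backend does.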

Within ocr-service/ itself, the prerequisite for any internal restructure is: add at least one fixture-image integration test per engine so a refactor cannot silently break OCR correctness while preserving the HTTP shape.

10. Top 5 prioritized recommendations

  1. Promote ocr-service/CLAUDE.md to ocr-service/README.md (or split: README for humans, CLAUDE.md only for agent-specific notes). Same act fixes C2.5 + C5.3 + C7.1.
  2. Fix the stale endpoint table in the docs — replace the non-existent /training/submit row with the three real training endpoints (/train, /train-sender, /segtrain) and add BLLA_MODEL_PATH to the env-var table. Cheap, high-impact (C7.4 + C9.2).
  3. Migrate the single-replica constraint into ADR-001 as an explicit "Operational constraint" section, then link it from docs/ARCHITECTURE.md. Today this load-bearing fact lives only in CLAUDE.md and a comment in docker-compose.yml (C5.3).
  4. Add a glossary block (Kurrent vs Sütterlin, OCR vs HTR, the four scriptType values and which engine handles each, why HANDWRITING_LATIN uses Surya) — and a "Contract with backend" subsection naming RestClientOcrClient as the consumer and pointing readers at GET /docs for live OpenAPI (C3.3 + C7.4).
  5. Add at least one fixture-image test per engine (Kurrent + typewritten), gated behind @pytest.mark.slow or a SKIP_ML_TESTS env var, so OCR correctness is verified somewhere in the suite — and lift the duplicated ketos cmd array into a single _run_ketos_training() helper to cut copy-paste between the three training endpoints (C8.2 + C10.1).
(3) Add a "Contract with backend" subsection naming `OcrClient` / `RestClientOcrClient` as the consumer and listing the field-by-field mapping | | C7.5 DB schema commented | N/A | — | Service has no DB | — | | C8.1 Mutation spot-check viable | PARTIAL PASS | — | `test_confidence.py`, `test_spell_check.py`, `test_sender_registry.py` are tight assertion-based unit tests that would catch mutations. `test_engines.py` and `test_stream.py` are heavily mocked — they verify wiring, not OCR correctness | See C8.2 | | C8.2 e2e per critical journey | FAIL | Major | No test takes a real fixture image and asserts on extracted text. The OCR pipeline's correctness against actual handwriting/typewriting is unverified by the test suite — only contract shape is tested | Add at least one fixture-image test per engine (small Kurrent crop, small typewritten crop) marked `@pytest.mark.slow`, gated behind a `SKIP_ML_TESTS` env var | | C8.3 Test names descriptive | PASS | — | Names like `test_kraken_extract_page_blocks_skips_records_without_positional_data` describe the assertion clearly | — | | C8.4 Test ordering independent | PASS | — | Tests use `patch.object` and ASGI transport per test; no module-level mutable state pollution observed | — | | C9.1 Deployment documented | PARTIAL PASS | — | `ocr-service/CLAUDE.md:114–127` covers compose; missing volume strategy (`ocr_models`, `ocr_cache`) at the `docs/` level; no `docs/DEPLOYMENT.md` | Cross-reference from ADR-001 / future DEPLOYMENT.md to compose entries and to the volume layout | | C9.2 Env vars in one place | PARTIAL FAIL | Major | `ocr-service/CLAUDE.md:79–90` documents 10 env vars, **but `BLLA_MODEL_PATH` (used in `ensure_blla_model.py:24` and `main.py:552`) is missing**. 
Also `ALLOWED_PDF_HOSTS` lives in CLAUDE.md but not in `docker-compose.yml` defaults | Add `BLLA_MODEL_PATH` row; align defaults between CLAUDE.md and `docker-compose.yml` | | C9.3 Migrations naming | N/A | — | No DB | — | | C9.4 Logs/observability | PARTIAL FAIL | Minor | Service uses stdlib logging at INFO; healthcheck exists. Where logs go in production (stdout → docker logs) and how to read them is not documented | One-line addition to README | | C10.1 No dead code | PARTIAL FAIL | Minor | `from unittest.mock import ... call ...` is imported but unused in `test_sender_registry.py:2` and `test_ensure_blla_model.py:3`. The `language` parameter is plumbed through both engines (`engines/kraken.py:103,220`, `engines/surya.py:37,118`) but never used by either Kraken or Surya — pure dead plumbing | Remove unused `call` imports; either use `language` (pass to predictors) or remove from signatures with a comment | | C10.2 No premature abstractions | PASS | — | `_SenderModelRegistry` looks abstract but is justified by concurrent loads and explicit invalidation requirements (tested in `test_sender_registry.py`) | — | | C10.3 No magic numbers | PARTIAL FAIL | Minor | `200 / 72` DPI scaling in `main.py:656`, `_MIN_SPELL_CHECK_LEN = 4` is named (good), `> 50` frequency floor in `spell_check.py:110` is justified inline (good), `"-N", "10"` epoch count repeated three times across `/train`, `/train-sender`, `/segtrain` (`main.py:405,499,576`) | Lift the ketos epoch / batch arguments into module-level constants (or env vars); they will drift across the three endpoints otherwise | | C10.4 No half-finished features | PASS | — | No `if False`, no commented-out blocks in production code | — | | C10.5 No defensive nulls | PASS | — | `boundary` / `baseline` defensive coercion in `kraken.py:128–148` is justified by mock + real-world record shape variance and is covered by an explicit test (`test_engines.py:187`) | — | ## 4. 
Subsystem health summary | Category | PASS | Critical | Major | Minor | Cosmetic | |---|---|---|---|---|---| | C1 — First Contact | 0/1 applicable | 0 | 1 | 0 | 0 | | C2 — Local Bootstrap | 2/4 applicable | 0 | 1 | 1 | 0 | | C3 — Domain Clarity | 2/3 applicable | 0 | 1 | 0 | 0 | | C4 — Findability | 4/4 | 0 | 0 | 0 | 0 | | C5 — Conventions | 1/4 | 0 | 2 | 1 | 0 | | C6 — Layering | 1/1 applicable | 0 | 0 | 0 | 0 | | C7 — Documentation | 2/4 applicable | 0 | 2 | 0 | 0 | | C8 — Test Trust | 3/4 | 0 | 1 | 0 | 0 | | C9 — Operability | 0/3 applicable (partials) | 0 | 1 | 1 | 0 | | C10 — AI smell | 3/5 | 0 | 0 | 2 | 0 | **Applicable checks:** 33. **Pass / partial-pass:** ~18. **Critical:** 0. **Major:** 9. **Overall verdict: 🟡 Yellow** — zero Critical findings, fundamentals (layering, naming, why-comments, test independence) are in good shape, but documentation drift and missing per-subsystem README/contract docs put pass-rate around 55%. ## 5. Domain mapping gaps N/A — `ocr-service/` IS one canonical domain (`ocr/`) when viewed from outside the repo. Its internal Python module organization is local concern and is appropriately flat. ## 6. Cross-cutting candidates N/A — Python service has no Java `shared/` to migrate into. Internally, no module deserves promotion to a `common/` subpackage at this size. ## 7. 
Dead/suspicious code register | File:line | Smell | Suggested action | |---|---|---| | `ocr-service/test_sender_registry.py:2` | `call` imported but unused | Remove from import | | `ocr-service/test_ensure_blla_model.py:3` | `call` imported but unused | Remove from import | | `ocr-service/engines/kraken.py:103,220` | `language` parameter accepted but never passed to Kraken | Either wire it through to recognition (Kraken supports lang hints) or drop the parameter and update `main.py` callsites | | `ocr-service/engines/surya.py:37,118` | `language` parameter accepted but never used | Same as above | | `ocr-service/main.py:405,499,576` | `"-N", "10"` epoch count duplicated across three training endpoints | Lift to module constant or env var (drift risk) | | `ocr-service/main.py:398–410` vs `main.py:492–504` | `/train` and `/train-sender` ketos commands are near-identical but copy-pasted | Extract a `_run_ketos_training(checkpoint_dir, ground_truth, base_model_path)` helper | | `ocr-service/CLAUDE.md:74` | Documents `/training/submit` endpoint that **does not exist** in `main.py` | Replace with `/train`, `/train-sender`, `/segtrain` rows | ## 8. Documentation gaps 1. **No `ocr-service/README.md`** — only `CLAUDE.md` exists. Per repo convention (`CLAUDE.md` files are agent-facing), a human-targeted README is missing. Promote or split. 2. **Single-replica constraint not in human-readable architecture docs.** `ocr-service/CLAUDE.md:21` says it; `docker-compose.yml:75–77` says it; but `docs/adr/001-ocr-python-microservice.md` does not state it as a hard operational constraint. **Recommendation:** add an explicit "Operational constraint: single-node only ��� training mutates in-process model state; running >1 replica diverges weights and breaks request routing" section to ADR-001 (or open ADR-008 narrowly scoped to this) and reference it from `docs/ARCHITECTURE.md` (currently has no OCR section). 3. 
**No HTTP contract document.** FastAPI exposes `/docs` and `/openapi.json` for free, but this is not mentioned in any doc. Backend `OcrClient.java` and `RestClientOcrClient.java` mirror the schema by hand — no shared contract. Recommend documenting `GET /docs` and adding a "Contract with backend" subsection that names the consumer and the field-by-field mapping (Java `OcrClient.OcrRegion` ↔ Python `OcrRegion` in `models.py:4`). 4. **Stale endpoint table.** `ocr-service/CLAUDE.md:74` lists `/training/submit` — that endpoint does not exist; actuals are `/train`, `/train-sender`, `/segtrain`. 5. **Missing env var.** `BLLA_MODEL_PATH` is referenced in `ensure_blla_model.py:24` and `main.py:552` but absent from the env-var table in `CLAUDE.md:79–90`. 6. **Glossary gap.** Kurrent vs Sütterlin distinction is not made; the asymmetry that `HANDWRITING_LATIN` uses Surya (not Kraken) is not explained anywhere. 7. **No troubleshooting section.** OOM-at-1800px segtrain and `.incompatible` model rename behavior are only commented inline, not in docs. 8. **No fixture-image tests.** `test_engines.py` mocks every model call; OCR correctness is not exercised end-to-end. Worth a `tests/fixtures/` and a `@pytest.mark.slow` integration test per engine. ## 9. Prerequisites for big-bang refactor REFACTOR-1 / REFACTOR-2 (per the AUDIT-1/AUDIT-2 epic context) reorganize the **backend** Java packages by domain. They do not touch `ocr-service/` directly. The constraint to flag: - **Contract is HTTP-only and brittle.** The Python `OcrRequest` / `OcrBlock` shapes are mirrored by hand in `RestClientOcrClient` (in `org.raddatz.familienarchiv.service`). If REFACTOR-1 moves the backend OCR code into a new package (e.g. `org.raddatz.familienarchiv.ocr`), then: - URL paths sent to ocr-service (`/ocr`, `/ocr/stream`, `/train`, `/train-sender`, `/segtrain`) must remain **byte-identical** — they are hardcoded in `RestClientOcrClient`, not derived. 
  - Field names in the JSON payloads (`pdfUrl`, `scriptType`, `senderModelPath`, `regions[].annotationId`, `pageNumber`, `x`, `y`, `width`, `height`) must remain identical — the Python Pydantic models (`models.py:4–22`) and the Java records are linked only by string convention.
  - The NDJSON-over-HTTP event types (`start`, `preprocessing`, `page`, `error`, `done`) and their fields are similarly contract-frozen.
- **Ideally, before any refactor that touches this contract,** add either (a) a generated OpenAPI client on the Java side, or (b) at minimum a contract test that POSTs a known shape to a stubbed ocr-service and asserts the mirror is intact.

Within `ocr-service/` itself, the prerequisite for any internal restructure is: **add at least one fixture-image integration test per engine** so a refactor cannot silently break OCR correctness while preserving the HTTP shape.

## 10. Top 5 prioritized recommendations

1. **Promote `ocr-service/CLAUDE.md` to `ocr-service/README.md`** (or split: README for humans, CLAUDE.md only for agent-specific notes). The same act fixes C2.5 + C5.3 + C7.1.
2. **Fix the stale endpoint table in the docs** — replace the non-existent `/training/submit` row with the three real training endpoints (`/train`, `/train-sender`, `/segtrain`) and add `BLLA_MODEL_PATH` to the env-var table. Cheap, high-impact (C7.4 + C9.2).
3. **Migrate the single-replica constraint into ADR-001** as an explicit "Operational constraint" section, then link it from `docs/ARCHITECTURE.md`. Today this load-bearing fact lives only in `CLAUDE.md` and a comment in `docker-compose.yml` (C5.3).
4. **Add a glossary block** (Kurrent vs Sütterlin, OCR vs HTR, the four `scriptType` values and which engine handles each, why `HANDWRITING_LATIN` uses Surya) — plus a "Contract with backend" subsection naming `RestClientOcrClient` as the consumer and pointing readers at `GET /docs` for live OpenAPI (C3.3 + C7.4).
5. **Add at least one fixture-image test per engine** (Kurrent + typewritten), gated behind `@pytest.mark.slow` or a `SKIP_ML_TESTS` env var, so OCR correctness is verified somewhere in the suite — and lift the duplicated ketos `cmd` array into a single `_run_ketos_training()` helper to cut copy-paste between the three training endpoints (C8.2 + C10.1).
Reference: marcel/familienarchiv#390