audit(ocr-service): score ocr-service/ against legibility rubric C1–C10 #390
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Context
Part of Epic #387 — Codebase Legibility Audit. This is AUDIT-3: read-only assessment of
ocr-service/against the full Legibility Rubric. The OCR service is domain-bearing (it embeds historical script knowledge, confidence calibration, Kurrent/Sütterlin model lifecycle) so it earns its own audit rather than folding into rest-of-repo.The output is a Markdown report under
docs/audits/audit-ocr-report.mdfollowing the per-subsystem orient template. No code changes in this issue.Inputs (read first)
See the first comment of #387 for the complete reference bundle.
Note: the OCR service is a Python FastAPI microservice. Some rubric checks (e.g., C3.4 about Java package structure) are N/A here. C3.5 also N/A. Apply C5/C6/C7 to the Python module structure on its own terms — the canonical "domain-based packaging" goal applies to the backend Java code, NOT to the internal organization of
ocr-service/. The OCR service IS one domain (ocr/) when viewed from the outside; its internal organization is local concern.Scope (in)
ocr-service/—main.py,engines/, supporting modules, requirements, testsocr-service/Dockerfile,ocr-service/CLAUDE.md(evaluate as documentation surface)Scope (out)
ocr-service/.venv/,ocr-service/__pycache__/,ocr-service/engines/__pycache__/OcrController/OcrService/OcrClient(AUDIT-2 covers those — but note the contract clarity at the boundary in §3)Rubric checks particularly relevant to this subsystem
C1, C2 (Python prerequisites + bootstrap), C3.1, C3.3 (terminology — Kurrent vs Sütterlin vs HTR vs OCR), C5 (Python conventions), C7 (especially C7.4 on API contract documentation), C8 (test coverage of OCR pipeline), C9.1 + C9.2 (deployment + env vars — single-node constraint per ADR-001 in
ocr-service/CLAUDE.md), C10.Specific things to verify (subsystem-specific)
ocr-service/standalone (without backend) for development? Is this documented?ocr-service/CLAUDE.mdreferences ADR-001 (single-node only because in-process model reload). This MUST appear in human-readable docs after refactor (C5.3).Acceptance criteria
docs/audits/audit-ocr-report.mdexists on a feature branchdocs/ARCHITECTURE.md)Definition of Done
Report committed under
docs/audits/audit-ocr-report.mdonmain. Top-5 recommendations summarized as a closing comment.Dispatch
AUDIT-3 —
ocr-service/Legibility Report1. Subsystem profile
ocr-service/is a Python 3.11 / FastAPI microservice that performs OCR (typewritten Latin) and HTR (handwritten Kurrent/Sütterlin) on PDFs delivered via presigned MinIO URLs. It owns: model lifecycle (Kraken + Surya), CLAHE-based image preprocessing, per-character → per-word confidence reconstruction, German spell-check post-processing with a historical wordlist, and in-process model fine-tuning (/train,/train-sender,/segtrain). It is stateless w.r.t. business data — all job orchestration lives in the Spring Boot backend. Single-replica is mandatory because training mutates in-process model state.Total ~2,717 LOC across 16 Python files (engines + tests included).
2. Inventory
Application modules
main.pymodels.pyOcrRequest,OcrRegion,OcrBlockpreprocessing.pypreprocess_page()— CLAHE + grayscale + Gaussian blur, falls back silently on cv2/value/memory errorsconfidence.pyapply_confidence_markers(),words_from_characters(),get_threshold(),[unleserlich]/[?]constantsspell_check.pyensure_blla_model.pyengines/kraken.pyengines/surya.pyHTTP endpoints (actual)
/health{status, surya, kraken}/ocrlist[OcrBlock]/ocr/stream/train/train-sender/app/models//segtrainTest modules
test_confidence.pytest_spell_check.pytest_preprocessing.pytest_sender_registry.pytest_engines.pysys.modulestest_stream.py/ocr/streamend-to-end via ASGITransporttest_training_auth.pytest_ensure_blla_model.pyThere are no tests that exercise a real model on a real image. Every "engine" test uses
MagicMock. End-to-end correctness against a fixture image is not verified anywhere in the repo.3. Rubric scorecard
README.md(onlyfrontend/README.md); this is a repo-wide concern covered by AUDIT-1docs/ARCHITECTURE.mdw/ diagramdocs/architecture/c4-diagrams.mdexists; no top-level ARCHITECTURE.md links into ocr-servicedocs/ARCHITECTURE.mdsummarizing the request flow backend→ocr-service→MinIOdocker-compose up -dbringsocr-servicetogether with backend (docker-compose.yml:78–110)ocr-service/CLAUDE.md:9–17lists Python 3.11, FastAPI 0.115.6, kraken==7.0, surya-ocr 0.17.1, torch 2.7.1 (CPU);requirements.txtpinsTRAINING_TOKEN; the env table says default is empty (fails closed)ocr-service/main.py:578–602only.incompatiblerename behavior, and 200-DPI vs 300-DPI memory tradeoffocr-service/CLAUDE.mdexists; this is an LLM-prefixed file thatCOLLABORATING.mdtreats as ephemeralocr-service/CLAUDE.mdtoocr-service/README.md(or split: README for humans, CLAUDE.md only for agent-specific notes)ocr-service/CLAUDE.md:1–18cleanly states OCR vs HTR boundaryHANDWRITING_KURRENTandHANDWRITING_LATIN(main.py:43) but onlyHANDWRITING_KURRENTtriggers Kraken (main.py:100) — that asymmetry is undocumentedscriptTypevalues and which engine each uses, and whyHANDWRITING_LATINuses Surya despite being handwritingshared/justifiedshared/concept inside ocr-serviceengines/cleanly separates the two enginesmain.pyHelper/Utilsnamespreprocessing,confidence,spell_check,ensure_blla_model)/ocr,/ocr/stream,/train,/train-sender,/segtrain,/health— all guessableHTTPExceptioneverywhere; one error idiom; but engine-loading is split (Surya lazy insurya.py:12, Kraken eager via lifespan inmain.py:60) — undocumented asymmetryocr-service/CLAUDE.md:21and inline indocker-compose.yml:75–77, but the ADR (docs/adr/001-ocr-python-microservice.md) does NOT state it as a constraint — it only mentions "model reloading" while rejectingProcessBuilderdocs/ARCHITECTURE.md. See §8shared/CLAUDE.mdis the only doc surface; should be aREADME.md(see C2.5)main.py:578–602(segtrain 800px logic),main.py:654–656(200 DPI choice),kraken.py:198–200(Kraken bounds quirk),confidence.py:110strict->rationale all carry strong why-commentsconfidence.pyneighborhood is actually fine (not a TODO marker)/docsand/openapi.json(default with FastAPI), but this is not mentioned anywhere. The endpoint table inocr-service/CLAUDE.md:70–75lists/training/submitwhich does not exist; actual endpoints are/train,/train-sender,/segtrain(main.py:358,450,530). There is no shared schema between the PythonOcrRequest/OcrBlock(models.py) and the JavaOcrClient.OcrRegion(backend/.../OcrClient.java:19) — they are hand-mirroredGET /docsfor live OpenAPI. (3) Add a "Contract with backend" subsection namingOcrClient/RestClientOcrClientas the consumer and listing the field-by-field mappingtest_confidence.py,test_spell_check.py,test_sender_registry.pyare tight assertion-based unit tests that would catch mutations.test_engines.pyandtest_stream.pyare heavily mocked — they verify wiring, not OCR correctness@pytest.mark.slow, gated behind aSKIP_ML_TESTSenv vartest_kraken_extract_page_blocks_skips_records_without_positional_datadescribe the assertion clearlypatch.objectand ASGI transport per test; no module-level mutable state pollution observedocr-service/CLAUDE.md:114–127covers compose; missing volume strategy (ocr_models,ocr_cache) at thedocs/level; nodocs/DEPLOYMENT.mdocr-service/CLAUDE.md:79–90documents 10 env vars, butBLLA_MODEL_PATH(used inensure_blla_model.py:24andmain.py:552) is missing. AlsoALLOWED_PDF_HOSTSlives in CLAUDE.md but not indocker-compose.ymldefaultsBLLA_MODEL_PATHrow; align defaults between CLAUDE.md anddocker-compose.ymlfrom unittest.mock import ... call ...is imported but unused intest_sender_registry.py:2andtest_ensure_blla_model.py:3. Thelanguageparameter is plumbed through both engines (engines/kraken.py:103,220,engines/surya.py:37,118) but never used by either Kraken or Surya — pure dead plumbingcallimports; either uselanguage(pass to predictors) or remove from signatures with a comment_SenderModelRegistrylooks abstract but is justified by concurrent loads and explicit invalidation requirements (tested intest_sender_registry.py)200 / 72DPI scaling inmain.py:656,_MIN_SPELL_CHECK_LEN = 4is named (good),> 50frequency floor inspell_check.py:110is justified inline (good),"-N", "10"epoch count repeated three times across/train,/train-sender,/segtrain(main.py:405,499,576)if False, no commented-out blocks in production codeboundary/baselinedefensive coercion inkraken.py:128–148is justified by mock + real-world record shape variance and is covered by an explicit test (test_engines.py:187)4. Subsystem health summary
Applicable checks: 33. Pass / partial-pass: ~18. Critical: 0. Major: 9.
Overall verdict: 🟡 Yellow — zero Critical findings, fundamentals (layering, naming, why-comments, test independence) are in good shape, but documentation drift and missing per-subsystem README/contract docs put pass-rate around 55%.
5. Domain mapping gaps
N/A —
ocr-service/IS one canonical domain (ocr/) when viewed from outside the repo. Its internal Python module organization is local concern and is appropriately flat.6. Cross-cutting candidates
N/A — Python service has no Java
shared/to migrate into. Internally, no module deserves promotion to acommon/subpackage at this size.7. Dead/suspicious code register
ocr-service/test_sender_registry.py:2callimported but unusedocr-service/test_ensure_blla_model.py:3callimported but unusedocr-service/engines/kraken.py:103,220languageparameter accepted but never passed to Krakenmain.pycallsitesocr-service/engines/surya.py:37,118languageparameter accepted but never usedocr-service/main.py:405,499,576"-N", "10"epoch count duplicated across three training endpointsocr-service/main.py:398–410vsmain.py:492–504/trainand/train-senderketos commands are near-identical but copy-pasted_run_ketos_training(checkpoint_dir, ground_truth, base_model_path)helperocr-service/CLAUDE.md:74/training/submitendpoint that does not exist inmain.py/train,/train-sender,/segtrainrows8. Documentation gaps
ocr-service/README.md— onlyCLAUDE.mdexists. Per repo convention (CLAUDE.mdfiles are agent-facing), a human-targeted README is missing. Promote or split.ocr-service/CLAUDE.md:21says it;docker-compose.yml:75–77says it; butdocs/adr/001-ocr-python-microservice.mddoes not state it as a hard operational constraint. Recommendation: add an explicit "Operational constraint: single-node only ��� training mutates in-process model state; running >1 replica diverges weights and breaks request routing" section to ADR-001 (or open ADR-008 narrowly scoped to this) and reference it fromdocs/ARCHITECTURE.md(currently has no OCR section)./docsand/openapi.jsonfor free, but this is not mentioned in any doc. BackendOcrClient.javaandRestClientOcrClient.javamirror the schema by hand — no shared contract. Recommend documentingGET /docsand adding a "Contract with backend" subsection that names the consumer and the field-by-field mapping (JavaOcrClient.OcrRegion↔ PythonOcrRegioninmodels.py:4).ocr-service/CLAUDE.md:74lists/training/submit— that endpoint does not exist; actuals are/train,/train-sender,/segtrain.BLLA_MODEL_PATHis referenced inensure_blla_model.py:24andmain.py:552but absent from the env-var table inCLAUDE.md:79–90.HANDWRITING_LATINuses Surya (not Kraken) is not explained anywhere..incompatiblemodel rename behavior are only commented inline, not in docs.test_engines.pymocks every model call; OCR correctness is not exercised end-to-end. Worth atests/fixtures/and a@pytest.mark.slowintegration test per engine.9. Prerequisites for big-bang refactor
REFACTOR-1 / REFACTOR-2 (per the AUDIT-1/AUDIT-2 epic context) reorganize the backend Java packages by domain. They do not touch
ocr-service/directly. The constraint to flag:OcrRequest/OcrBlockshapes are mirrored by hand inRestClientOcrClient(inorg.raddatz.familienarchiv.service). If REFACTOR-1 moves the backend OCR code into a new package (e.g.org.raddatz.familienarchiv.ocr), then:/ocr,/ocr/stream,/train,/train-sender,/segtrain) must remain byte-identical — they are hardcoded inRestClientOcrClient, not derived.pdfUrl,scriptType,senderModelPath,regions[].annotationId,pageNumber,x,y,width,height) must remain identical — the PythonPydanticmodel (models.py:4–22) and Java records are linked only by string convention.start,preprocessing,page,error,done) and their fields are similarly contract-frozen.Within
ocr-service/itself, the prerequisite for any internal restructure is: add at least one fixture-image integration test per engine so a refactor cannot silently break OCR correctness while preserving the HTTP shape.10. Top 5 prioritized recommendations
ocr-service/CLAUDE.mdtoocr-service/README.md(or split: README for humans, CLAUDE.md only for agent-specific notes). Same act fixes C2.5 + C5.3 + C7.1./training/submitrow with the three real training endpoints (/train,/train-sender,/segtrain) and addBLLA_MODEL_PATHto the env-var table. Cheap, high-impact (C7.4 + C9.2).docs/ARCHITECTURE.md. Today this load-bearing fact lives only inCLAUDE.mdand a comment indocker-compose.yml(C5.3).scriptTypevalues and which engine handles each, whyHANDWRITING_LATINuses Surya) — and a "Contract with backend" subsection namingRestClientOcrClientas the consumer and pointing readers atGET /docsfor live OpenAPI (C3.3 + C7.4).@pytest.mark.slowor aSKIP_ML_TESTSenv var, so OCR correctness is verified somewhere in the suite — and lift the duplicated ketoscmdarray into a single_run_ketos_training()helper to cut copy-paste between the three training endpoints (C8.2 + C10.1).