fix(ocr): route Surya model staging to SSD via TMPDIR + add volume-init service #615
Reference in New Issue
Block a user
Delete Branch "feat/issue-614-tmpdir-persistent-volume"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Fixes the ENOSPC failure on staging where Surya's 1.34 GB
text_recognitionmodel exhausted the 512 MB/tmptmpfs during download. Root cause:tempfile.TemporaryDirectory()defaults to/tmp; the tmpfs was sized for training ZIPs, not GB-scale model staging.TMPDIR=/app/cache/.tmpin Dockerfile + both compose files — routes all model staging to the SSD-backed cache volume;/tmpkeeps its 512 MB RAM budget for training ZIPsocr-volume-initone-shot service to both compose files — automating the manualchown -R 1000:1000from 2026-05-18 and creating/app/cache/.tmpon fresh volumes (resolves AC #6)mkdir -p "$TMPDIR"+ day-old orphan cleanup toentrypoint.shas a fallback layertest_tmpdir.pywith a proper red/green TDD regression test for the entrypoint mkdirCloses #614
Test plan
test_tmpdir.py test_spell_check.py test_confidence.py test_sender_registry.py)test_entrypoint_creates_tmpdirwas confirmed RED before entrypoint change, GREEN afterdocker compose configoutput containsTMPDIR: /app/cache/.tmpin both dev and proddocker volume rm archiv-staging_ocr-cache→ restart → trigger guided OCR → Surya model downloads fully, inference returns textdocker exec archiv-ocr stat -c %u /app/cache/.tmpreturns1000(uid ofocruser)🤖 Generated with Claude Code
🏗️ Markus Keller (@mkeller) — Senior Application Architect
Verdict: ⚠️ Approved with concerns
Blockers
New Docker service missing from documentation
ocr-volume-initis a new service in both compose files. Per our doc-update rules, this triggers:docs/architecture/c4/l2-containers.puml— init container should appeardocs/DEPLOYMENT.md— needs a note about the bootstrap step and how to verify it succeeded (docker logs archiv-ocr-volume-init)ADR-021 covers the why — the C4 container diagram and deployment doc cover the what, and they're absent.
Counter-argument:
ocr-volume-initis a one-shot infrastructure step, not a software system with a port or API surface — closer to adocker runcall than an application container. If that framing is accepted, a single sentence inDEPLOYMENT.md("volume ownership is bootstrapped byocr-volume-init, see ADR-021") would be the minimum, and the C4 update could be waived. Decide and document the reasoning — don't leave the ambiguity open.What's done well
mem_limit, same-filesystemrename(2)benefit). This is the load-bearing reference document this change needs.ocr-volume-initcreates/app/cache/.tmp,entrypoint.shcreates it again as a fallback. Neither layer depends solely on the other — a baredocker runstill works.depends_on: condition: service_completed_successfullyis the correct dependency declaration for a one-shot init pattern.👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
Suggestions (non-blocking)
entrypoint.sh: misleading comment"Cross-job ground-truth leakage" is NLP/ML terminology for labeled training data bleeding between runs. The orphaned files here are binary model staging blobs, not ground-truth labels. A cleaner phrasing:
# Remove stale partial downloads left by a previous docker-kill. Minor, but comments should reveal intent precisely.test_entrypoint_creates_tmpdir: exit code not assertedIf the entrypoint exits non-zero after creating TMPDIR, the test still passes. Given the
python3anduvicornstubs bothexit 0, this won't bite today — but a future entrypoint change could introduce a silent regression. Add:find -mtime +1 -deletebehavior has no testThe orphan cleanup is exercised by the entrypoint test, but the behavior — old files deleted, new files preserved — has no dedicated test.
os.utime()can fabricate an old mtime without mocking time; this is a stdlib-only test. Not a blocker, but this is the core of the cleanup feature.What's done well
test_entrypoint_creates_tmpdir(fakepython3+uvicorn) is clean and keeps the test out of the ML stack dependency.${TMPDIR:-/tmp}fallback correctly handles the baredocker runcase where the env var may not be set.test_zipslip_still_anchors_under_custom_tmpdiris exactly the right regression test: verifying that relocating the extraction directory doesn't breakos.path.realpath()anchoring. Testing the security-adjacent consequence of an infrastructure change is good practice.test_tempfile_uses_tmpdir_when_setdocuments the Python mechanism the entire fix relies on — if Python ever changed this behavior, the test catches it before production.🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
Blockers
alpine:3is unpinned — both compose filesalpine:3is a moving pointer — currently resolves to3.21but bumps with every Alpine minor release. Adocker compose pullin 6 months pullsalpine:3.22with potentially different busybox behavior. Builds are non-reproducible and rollbacks are impossible. Fix:Add a Renovate rule to auto-bump the patch version:
DEPLOYMENT.mdnot updatedocr-volume-initis a new deployment step operators need to understand. When deploying to staging or production on a fresh host, they should know:ocr-volume-initdoes and why it runs firstdocker logs archiv-ocr-volume-initchowndenied on a read-only volume)One paragraph in
DEPLOYMENT.mdis enough; ADR-021 is the detail reference.Concerns (non-blocking)
Init container joins
archiv-netunnecessarilyThe init container only needs volume access, not network access. By default, Compose adds it to the project network. Explicitly exclude it:
Least-privilege at the container level costs nothing and reduces the network's exposed surface by one container.
Duplication between compose files
The
ocr-volume-initblock is copy-pasted betweendocker-compose.yml(volume names:ocr_models,ocr_cache) anddocker-compose.prod.yml(ocr-models,ocr-cache). If the init command changes, it requires two identical edits. Acceptable given the existing overlay structure, but worth noting for the next time this block is touched.What's done well
restart: "no"is correct — init containers must not restart. A failedchownlooping forever would be worse than a failed start.depends_on: condition: service_completed_successfullyis the right gate;service_startedwould allow a race.find -mtime +1 -delete 2>/dev/null || truepattern is safe:-mindepth 1prevents deleting TMPDIR itself,2>/dev/nullhandles a missing directory gracefully,|| truepreventsset -efrom aborting the entrypoint.ci.ymlcorrectly appendstest_tmpdir.pyto the lightweight test command — no new dependencies, no ML stack required.🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
What I checked
/app/cache/.tmp) — not user-controlled, no injection vector.find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -delete:-mindepth 1prevents deleting TMPDIR root; runs as uid1000(theocruser), so it can only delete files that user owns;2>/dev/null || truemeans a failed find cannot crash the container. No path traversal risk.chown -R 1000:1000 /app/cache /app/models: Runs as root in a one-shot init container, on internal Docker volumes not reachable by external actors. Acceptable._validate_zip_entry()usesos.path.realpath()anchored to the extraction directory — relocating that directory to/app/cache/.tmpdoesn't change the anchor logic. Correctly documented in ADR-021.Concerns (non-blocking)
Security regression test
test_zipslip_still_anchors_under_custom_tmpdirskips in CIHAS_MAIN = Falsein CI becausefrom main import _validate_zip_entryfails when the ML stack isn't installed. However,_validate_zip_entryis a pure path-manipulation function — it has no ML dependency itself. TheImportErroris a transitive import throughmain.py's module-level ML imports.Options (neither is a blocker):
_validate_zip_entryto autils.pywith no ML imports → test runs in CI, security regression is continuously enforced._validate_zip_entryis already covered by other CI tests that pre-date this change.Option 1 is the stronger guarantee. Option 2 is acceptable if the existing coverage is verified.
alpine:3supply chain concern (shared with @tobiwendt)Unpinned tags make the init container's exact behavior non-reproducible and introduce a low-probability supply chain risk. Pin to
alpine:3.21.What's done well
test_zipslip_still_anchors_under_custom_tmpdirtest exists and correctly exercises the boundary case — the intent is right even if CI coverage is incomplete.🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Concerns (non-blocking)
Orphan cleanup behavior is untested
find "${TMPDIR:-/tmp}" -mindepth 1 -mtime +1 -deleteis exercised bytest_entrypoint_creates_tmpdirbut its behavior is not tested:This is the core logic of the cleanup feature.
os.utime()can fabricate an old mtime without mocking system time — a stdlib-only test:test_entrypoint_creates_tmpdir: exit code not assertedA test that passes even when the process exits with code 1 is a test that can mask regressions. First assertion should be:
test_tmpdir_is_inside_persistent_cache_volumeis permanently skipped in CI without documentationThe skip is intentional and correct (deployment contract test, not a unit test). But there's no comment explaining how this test is expected to run — e.g., inside the OCR container during a staging smoke test. Without that, a future developer may wonder if this test is dead code. A one-line comment like
# run: docker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -vwould close the loop.TDD evidence
The PR description states: "test_entrypoint_creates_tmpdir was confirmed RED before entrypoint change, GREEN after." That's the discipline this project requires. ✓
What's done well
test_tempfile_uses_tmpdir_when_set), (2) entrypoint behavior (test_entrypoint_creates_tmpdir), (3) deployment contract (test_tmpdir_is_inside_persistent_cache_volume).@pytest.mark.skipifconditions are explicit and include the reason — no guesswork about why they skip.test_tmpdir.pyadded to the CI test command inci.ymlcloses the coverage gap immediately. ✓python3+uvicorn) keeps the test in the lightweight CI tier, no ML stack dependency.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Requirements coverage assessment
Issue #614 ACs fully addressed
The PR description maps directly to the acceptance criteria from #614. AC #6 ("creating
/app/cache/.tmpon fresh volumes") is explicitly resolved viaocr-volume-initand the entrypoint fallback. All five test plan items are specific and testable — no vague "it works" language.NFR completeness
The fix correctly preserves two competing NFRs:
/tmpretains its 512 MB DoS cap for attacker-influenceable training endpoints (post-auth only, but still intentional).ADR-021 documents both explicitly, including the threat model for why the 512 MB cap must stay.
Alternatives evaluated
Approach B (enlarge
/tmpto 4 GB) is rejected with quantified reasoning: combined Surya resident set + tmpfs would trigger OOMKill on CX32 hosts withOCR_MEM_LIMIT=6g. Approach C (belt-and-suspenders) is rejected as unnecessary complexity. This is good requirements discipline — non-chosen alternatives with documented rejection reasons prevent the same debate from recurring.Scope discipline
The PR is tightly scoped to the root cause (TMPDIR routing). No scope creep identified.
Minor observation (non-blocking)
The staging dry-run acceptance criterion — "docker volume rm archiv-staging_ocr-cache → restart → trigger guided OCR → Surya model downloads fully" — is a manual test. For a solo-developer workflow, documenting the expected observable outcome (specific log lines, or
stat -c %u /app/cache/.tmpreturning1000) would make future regression verification faster and less ambiguous. The PR already includesdocker exec archiv-ocr stat -c %u /app/cache/.tmpas AC #5 — extending that pattern to the download verification step would complete the picture.🎨 Leonie Voss (@leonievoss) — UX Design Lead & Accessibility Strategist
Verdict: ✅ Approved
No UI, frontend, or accessibility changes in this PR. The changes are confined to:
entrypoint.sh, Python tests)docker-compose.yml,docker-compose.prod.yml)Nothing to review from a UX or accessibility perspective.
The fix indirectly benefits the transcription experience — fewer
ENOSPCfailures during Surya model downloads means more reliable OCR results for the people transcribing letters — but no user-facing changes are introduced and no design decisions are affected.Review concerns addressed
All open concerns from the review round resolved across 5 commits on this branch.
Blockers fixed
alpine:3pinned toalpine:3.21— bothdocker-compose.ymlanddocker-compose.prod.yml(3182da8d)DEPLOYMENT.mdupdated — §8 Upgrade notes now documentsocr-volume-init: what it does (chown volumes + create TMPDIR), how to verify it succeeded (docker logs archiv-ocr-volume-init), and what failure looks like (chowndenied → non-zero exit →ocr-servicewon't start) (193a4d6e)Non-blocking concerns resolved
_validate_zip_entryextracted toutils.py— no ML-stack imports;test_zipslip_still_anchors_under_custom_tmpdirnow runs in CI without the skip (c2bd1b34)assert result.returncode == 0added totest_entrypoint_creates_tmpdir— exit code is now the first assertion (e31dac5c)Orphan cleanup behavior tested —
test_entrypoint_removes_day_old_orphans(fabricates 2-day-old mtime viaos.utime(), asserts file deleted) andtest_entrypoint_preserves_fresh_files(current mtime, asserts file survives) added (775b5c06)Manual run comment added to
test_tmpdir_is_inside_persistent_cache_volume—docker exec archiv-ocr python -m pytest test_tmpdir.py::test_tmpdir_is_inside_persistent_cache_volume -v(6839cf2a)entrypoint.shcomment fixed — "cross-job ground-truth leakage" → "Remove stale partial downloads left by a previous docker-kill" (6839cf2a)networks: []added to bothocr-volume-initdefinitions (3182da8d)Final test run
(1 skip =
test_tmpdir_is_inside_persistent_cache_volume, intentional — deployment-contract test that only runs inside the OCR container)🏛️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
What I checked
This PR resolves a storage-tier routing mistake, not a capacity problem — that framing is exactly right, and the ADR reflects it.
Positives
ocr-volume-initis minimal and correctly scoped.restart: "no",condition: service_completed_successfully, andnetworks: []are all correct. The init container cannot loop, cannot startocr-serviceon failure, and is not reachable from the app network.Concern (non-blocking)
docs/architecture/c4/l2-containers.puml— the doc requirements table says a new Docker service triggers an update to the L2 container diagram.ocr-volume-initis a new Docker service in both compose files. I'd argue a transient init container is more of an operational detail than an architectural component, so I'm not flagging this as a blocker — but it's worth a decision: either add it to the diagram with a "«init»" stereotype, or document explicitly that ephemeral bootstrap services are excluded from the L2 diagram. Right now the policy is silent on this case.Summary
The architecture is sound. The change routes model downloads to the right storage tier without touching the security boundary, the
/tmpDoS cap, or the ZIP Slip protection. ADR-021 is load-bearing documentation — the CLAUDE.md reminder points future LLMs at it. Approve.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
What I checked
TDD evidence, naming, function size, Python patterns, and the
utils.pyextraction.Positives
_validate_zip_entryextraction toutils.pyis clean. The function was buried inmain.pyalongside FastAPI app setup and business logic. It now lives in a module with a single purpose and no ML-stack imports — this is why it can be tested in CI without PyTorch. Good judgment call.test_entrypoint_creates_tmpdir,test_entrypoint_removes_day_old_orphans,test_entrypoint_preserves_fresh_files,test_zipslip_still_anchors_under_custom_tmpdir— each name tells me exactly what behavior is under test and what the expected outcome is.test_tmpdir_is_inside_persistent_cache_volumeskip annotation is correctly written — skip reason explains why the test doesn't run in CI and includes the manual docker exec command. Not a disabled test; a conditionally enabled one.Suggestions (non-blocking)
Missing return type hint on
_run_entrypoint(test_tmpdir.py):The existing code uses
pathlib.Pathobjects (fromtmp_path) but the type is unannounced. Minor — this is a test helper, not production code — but the rest of the codebase is consistently typed.test_tempfile_uses_tmpdir_when_setdoesn't test the fallback path. The entrypoint uses${TMPDIR:-/tmp}, so whenTMPDIRis unset it falls back to/tmp. There's no test assertingtempfile.TemporaryDirectory()uses/tmpwhenTMPDIRis absent. Not a blocker, but a mirror test would complete the contract.Summary
Clean extraction, good TDD evidence, well-named tests. The two suggestions are polish, not correctness issues. Approve.
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ✅ Approved
What I checked
Compose changes, image pinning, service ordering, CI pipeline, and startup behavior.
Positives
alpine:3.21pinned. Not:latest. Renovate will bump it. ✅restart: "no"onocr-volume-init. If chown fails, the container exits non-zero and stays dead — it doesn't loop.ocr-serviceis blocked bycondition: service_completed_successfully, so the dependency chain fails loudly. This is the correct failure posture.networks: []drops the init container fromarchiv-net. It only needs volume access, not network access. Minimal exposure surface. ✅fastapi==0.115.6to the test pip install is required becauseutils.pyimportsfrom fastapi import HTTPExceptionandtest_tmpdir.pyimportsfrom utils import _validate_zip_entry. Without this, the OCR CI job would fail on import. The pinned version is good practice.DEPLOYMENT.mdupgrade notes are complete — log command, expected output, and failure mode description are all present. An operator following this guide cold can verify the init container succeeded without guessing.Concern (non-blocking)
chown -R 1000:1000 /app/cache /app/modelsruns on everydocker compose up. On a fully-populated cache volume with the Suryatext_recognitionmodel (~1.34 GB in many small files), this recursive chown adds 1–4 seconds to every startup. Not a problem today; worth monitoring if startup time becomes a friction point. One mitigation when the time comes: run chown only if the volume is newly created (e.g. check a sentinel file). Pre-existing issue context: this replaces a manual one-time step, so running it idempotently on every up is acceptable for now.Volume naming inconsistency between files (pre-existing, not introduced here). Dev uses
ocr_models/ocr_cache(underscores); prod usesocr-models/ocr-cache(hyphens). This PR correctly follows the existing convention in each file. Not a problem this PR introduced, just noting it for awareness.Summary
Compose changes are minimal and correct. Init container is properly scoped. CI pipeline is updated with the right dependency. The recurring
chown -Ris an acceptable trade-off at current scale. Approve.🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
What I checked
TMPDIR redirect attack surface,
ocr-volume-initprivilege model,_validate_zip_entryextraction, ZIP Slip regression, and entrypoint orphan cleanup.Positives
ZIP Slip protection is maintained and explicitly regression-tested.
_validate_zip_entrywas extracted frommain.pytoutils.pywithout modification — the logic is identical:The critical question with a TMPDIR redirect is: does
os.path.realpath()still anchor correctly whenextract_diris/app/cache/.tmp/tmpXXXinstead of/tmp/tmpXXX? Yes —realpath()resolves the full canonical path regardless of where the extraction directory lives.test_zipslip_still_anchors_under_custom_tmpdirexplicitly tests this with the new TMPDIR path. ✅ADR-021 explicitly documents the security non-impact:
This is the right thing to document — a future developer seeing TMPDIR pointing somewhere unusual shouldn't have to reason through the ZIP Slip logic from scratch.
ocr-volume-initprivilege model is appropriate. The init container runs as Alpine's default root user, which is required tochownvolumes created by Docker (owned by root). The container has no network access (networks: []), no capabilities beyond what Alpine provides by default, and exits immediately after chown + mkdir. The attack surface of a container that exits in <1 second with no network and no entrypoint persistence is minimal.Entrypoint orphan cleanup is safe.
-mindepth 1prevents deleting TMPDIR itself. The cleanup is scoped to the TMPDIR subtree — it cannot traverse upward.2>/dev/null || trueensures a find error (e.g. permission denied on a subdirectory) doesn't abort the entrypoint. Safe.No findings.
This PR does not touch authentication, authorization, SSRF protection, or any security boundary. The change is purely about which storage tier receives model download staging. Security posture is unchanged or improved (the ZIP Slip test now has broader coverage). Approve.
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
What I checked
Traceability from issue #614 to acceptance criteria, completeness of the PR test plan, and whether the stated ACs map to the implemented behavior.
Requirements Traceability
test_tmpdir.pyadded to CI;fastapidep addeddocker compose configcontainsTMPDIR: /app/cache/.tmpstat -c %u /app/cache/.tmpreturns1000ocr-volume-initchowns/app/cache/.tmpto 1000:1000/app/cache/.tmpcreated on fresh volumesocr-volume-initrunsmkdir -p /app/cache/.tmp;entrypoint.shas fallbackThe root cause (routing problem, not capacity) is correctly identified and the issue title accurately reflects the fix.
Observation (non-blocking)
The failure mode when
mkdir -p /app/cache/.tmpfails insideocr-volume-initis undocumented in DEPLOYMENT.md. The chown failure mode is documented ("if chown is denied, container exits non-zero"). The mkdir step is in the samesh -ccommand joined by&&, so a mkdir failure would also cause a non-zero exit — but this isn't stated explicitly. A reader following the upgrade notes might not realize the mkdir step can fail independently of chown (e.g., if the volume is nearly full at the filesystem level). Minor documentation gap; not a correctness issue.Summary
All automated acceptance criteria are implemented and testable. The PR description is precise and includes the manual staging AC that can't be automated. Requirements are fully traceable from issue → implementation → test. Approve.
🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
What I checked
Test pyramid placement, test naming, TDD evidence, CI pipeline update, coverage of the mtime boundary, skip annotation quality, and missing edge cases.
Positives
TDD evidence is explicit and credible. The PR description states "confirmed RED before entrypoint change, GREEN after" — that's the red proof I want to see. The failing test existed before the entrypoint
mkdir -pline was added. ✅Five distinct behaviors, five distinct tests. No bundled assertions, no multi-behavior test functions:
test_tempfile_uses_tmpdir_when_settest_entrypoint_creates_tmpdirtest_tmpdir_is_inside_persistent_cache_volumetest_entrypoint_removes_day_old_orphanstest_entrypoint_preserves_fresh_filestest_zipslip_still_anchors_under_custom_tmpdirMtime boundary is tested on both sides.
test_entrypoint_removes_day_old_orphansusesos.utime()to backdate by 2 days (clearly over threshold).test_entrypoint_preserves_fresh_filesuses the current mtime (clearly under threshold). Together they fully specify thefind -mtime +1behavior. ✅Skip annotation is correct.
@pytest.mark.skipif(not os.environ.get("TMPDIR", "").startswith("/app/cache"), reason=...)is a conditional skip, not a disabled test. The skip reason is precise, and the manual run command is documented in the docstring. This is the right pattern. ✅CI pipeline updated.
test_tmpdir.pyis added to the pytest invocation.fastapi==0.115.6is pinned in the pip install step — required becauseutils.pyimportsHTTPException. ✅Suggestions (non-blocking)
Missing test: unset TMPDIR fallback path. The entrypoint uses
${TMPDIR:-/tmp}, which means when TMPDIR is unset the mkdir and find operate on/tmp. There's no test covering this path. A test stub:This would catch a regression if someone changes
${TMPDIR:-/tmp}to${TMPDIR}(which would causemkdir -p ""or a bash error)._run_entrypointhelper duplicates setup fromtest_entrypoint_creates_tmpdir. The stub-bin creation logic appears twice (once inline in the first long test, once via_run_entrypoint). The first test doesn't use_run_entrypoint— it duplicates the setup manually. Extracting to_run_entrypointwas the right call; applying it to the first test too would reduce duplication. Minor refactor, not a correctness issue.Summary
Test coverage is solid for the implemented behaviors. TDD discipline was followed. CI is updated. The two suggestions are completeness items, not correctness failures. Approve.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
What I checked
All 12 changed files for any user-facing UI, frontend component, or accessibility impact.
This PR is entirely infrastructure and DevOps:
There are no frontend files, no Svelte components, no routes, no Tailwind classes, and no i18n strings in this diff. No user-facing behavior is affected.
The
docs/DEPLOYMENT.mdprose additions are clear and followable by an operator — the command examples use proper code blocks and the expected outputs are explicit.LGTM from a UX perspective.