security(ocr): run OCR container as non-root user (CIS Docker §4.1) #611
Reference in New Issue
Block a user
Delete Branch "feat/issue-459-ocr-non-root"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #459
Summary
ocruser (UID 1000), creates/home/ocr+/app/cachewith correct ownership, setsENV HOME=/home/ocrandENV HF_HOME=/app/cache, addsUSER ocrdirective/root/.cache→/app/cache), addsHF_HOMEenv var, adds runtime hardening (read_only,tmpfs /tmp:size=512m,cap_drop: ALL,no-new-privileges)ensure_blla_model.py: replacesos.path.expanduser("~/.local/share/htrmopo")withos.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")— prevents silent~→/resolution failure under--no-create-homemain.py: startup canary logs a warning ifos.getuid() == 0Test plan
docker compose build ocr-service && docker compose up -d ocr-servicedocker exec archive-ocr id→uid=1000(ocr)docker exec archive-ocr ls -la /app/models→ocr:ocrownershipdocker compose ps ocr-service→Up (healthy)blla model OKin logs (no re-download, noPermissionError)/train,/train-sender,/segtrain) still respond correctly🤖 Generated with Claude Code
With --no-create-home, os.path.expanduser("~") resolves to "/" causing kraken get to write to /.local/share/htrmopo. Replace with os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is explicit and override-friendly without a home directory. Adds two tests verifying env-var resolution and ~-free default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
No layer violations, no new services, no domain boundary issues — this is a self-contained hardening of an existing sidecar. The architectural posture is correct.
Blocker
docs/DEPLOYMENT.mdmust document the volume migration step.The one-time upgrade step —
docker volume rm familienarchiv_ocr_cache— is currently buried in the PR description and disappears once this merges. Any operator running the standardgit pull && docker compose up -dupgrade path will hit aPermissionErroron startup with no guidance.The doc update table in our process requires DEPLOYMENT.md updates for Docker infrastructure changes. Suggested addition:
Suggestion
ADR for the container hardening baseline.
The combination of
read_only: true,cap_drop: [ALL],no-new-privileges, and per-path tmpfs/volume carve-outs establishes a precedent for the project's container security baseline. Future contributors adding a new OCR feature that writes to a new path need to know why the container is read-only — and that they must add a volume or tmpfs mount rather than writing to the container filesystem.A brief
docs/adr/ADR-XXX-container-hardening-baseline.md(Context / Decision / Consequences) would preserve this intent. The PR description already contains most of the content; it just needs to be committed to the repo rather than living only in Gitea.What's done well
read_only + cap_drop: ALL + no-new-privileges + tmpfsis the right combination ✅main.pyis a good fail-loud pattern ✅HTRMOPO_DIRenv-var fix correctly removes the~dependency ✅👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ⚠️ Approved with concerns
TDD evidence is present — two tests for the
HTRMOPO_DIRchange landed before (or alongside) the implementation. The Dockerfile ordering is correct. One assertion is broken.Blocker
test_htrmopo_dir_default_is_fixed_path— second assertion is vacuous.The default is
/app/models/.htrmopo. It starts with/a, not/.— this assertion would pass for any path that doesn't start with/., including"","/tmp/wrong", or even if the function returned something completely unrelated. The test name promises "no-create-home safe" but doesn't verify the expected value.Replace with:
If you want to keep the
"~"check for documentation value, keep it alongside the equality check. Two assertions per test is fine when they express complementary invariants about the same value.Suggestion
test_htrmopo_dir_reads_from_env_var— assert the actual value, not just the mechanism.Currently the test correctly patches the env var and reloads, but confirming the round-trip value makes the test self-documenting.
Minor
No test for the root canary (
os.getuid() == 0) inmain.py.It's a diagnostic warning so it's low risk, but it is new behavior. One test:
What's done well
importlib.reload()+ cleanup-reload pattern is the right way to test module-level constants ✅patch.dict(os.environ, clean_env, clear=True)correctly isolates the env for the default-path test ✅--no-create-home --shell /usr/sbin/nologin— no shell noise ✅chmod +x /app/entrypoint.shafterchown -Ris correctly ordered ✅🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
The hardening work is solid and the config is production-grade. One ops documentation gap needs fixing before merge.
Blocker
docs/DEPLOYMENT.mdmust capture the volume migration step.The PR description contains:
PR descriptions don't survive as operational runbooks. Anyone running
git pull && docker compose up -din six months — or deploying from scratch using DEPLOYMENT.md — will hit a permission error with no explanation. Add an upgrade notes section to DEPLOYMENT.md with this command and the reason.What's done well
/root/.cache→/app/cache— the old path was broken for non-root, this is the right fix ✅read_only: true— correct posture for a service that has no legitimate reason to write to the container filesystem ✅tmpfs /tmp:size=512mwith the inline comment explaining why 512 MB — exactly the right documentation style per our conventions ✅cap_drop: [ALL]— Python/FastAPI on port 8000+ needs no capabilities; this is safe ✅no-new-privileges: true— prevents any setuid binary from elevating, belt-and-suspenders with cap_drop ✅HF_HOMEin both Dockerfile and compose — Dockerfile sets the build-time default, compose provides the runtime override. Correct layering ✅ocruser — consistent with Linux convention and avoids conflict with common host UIDs on Hetzner VPS ✅ocr_models,ocr_cache) survive container recreation — no data loss on upgrade, correct ✅Note
The removed inline comment from the volume mount line:
was useful for ops. Consider adding it back on the new line:
Minor, not a blocker.
📋 Elicit — Requirements Engineer
Verdict: ⚠️ Approved with concerns
Issue #459 (non-root OCR container) is fully addressed. The implementation covers the stated requirements. One non-functional requirement gap needs tracking.
Concern: Undocumented Upgrade NFR
The PR introduces a breaking change for existing installs — the
ocr_cachenamed volume must be deleted before upgrading because its contents were written by root and are now inaccessible to the non-rootocruser. This creates an implicit operational requirement:This requirement is currently documented only in the PR description, which is an ephemeral artifact. Once the PR merges, this information is invisible to anyone following the standard upgrade path (
git pull && docker compose up -d).Recommendation: Capture this in
docs/DEPLOYMENT.mdunder upgrade notes. This is a one-time migration step but it is a hard blocker for operators upgrading an existing installation.Requirements Coverage (✅ complete)
ocruser UID 1000/home/ocrand/app/cache~ocr_cache:/app/cachemain.pyWhat's done well
The PR description test plan (7 checklist items) maps directly to the acceptance criteria and is concrete and verifiable. This is the right level of specificity for infrastructure changes.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
This is textbook CIS Docker §4.1 hardening — non-root user, read-only filesystem, dropped capabilities, no-new-privileges, proper tmpfs. The implementation is correct in intent. One runtime concern needs investigation before merge.
Blocker:
/home/ocris on the read-only container filesystemread_only: truemakes every path not explicitly mounted as a volume or tmpfs read-only at runtime. The declared writable paths are:/app/models/app/cache/tmp/home/ocrHOME=/home/ocris set. Any library that writes runtime config or cache files to~will fail withRead-only file systemorPermissionError. Known offenders in this stack:~/.cache/torchfor model cache (Surya depends on Torch)~/.cache/matplotlib~/.config/ketosIf these writes happen at import time (not build time), the service either crashes on startup or silently degrades. This needs to be verified before merge.
Remediation — pick one:
Option A (preferred): Add
XDG_CACHE_HOMEto redirect the XDG cache base, which covers Torch, Matplotlib, and many other libraries:Option B: Mount
/home/ocras a small tmpfs:Option A is cleaner and doesn't add a tmpfs mount. Option B is the safety net if unknown libraries write to HOME paths you haven't mapped yet.
How to verify (before merging):
If this command fails with a filesystem error, the concern is confirmed.
What's done well
--no-create-home --shell /usr/sbin/nologin— no shell access, no profile scripts, correct ✅USER ocrafterchown -R— enforces non-root at runtime, not just ownership ✅read_only: true— limits persistence of any exploitation ✅cap_drop: [ALL]— port 8000+ requires no capabilities; this is safe and correct ✅no-new-privileges: true— prevents setuid escalation even if a SUID binary exists in deps ✅os.getuid() == 0) — fail-loud pattern, catches misconfiguration before it becomes a silent vulnerability ✅HTRMOPO_DIRfix — removes the~→/silent failure path that was the original impetus for this PR ✅CWE reference: CWE-732 (Incorrect Permission Assignment for Critical Resource) — if the read-only restriction silently causes a write failure that is swallowed by a library, it could mask model-loading errors and produce incorrect OCR results rather than a clean failure.
🧪 Sara Holt — Senior QA Engineer
Verdict: ⚠️ Approved with concerns
Tests are present, isolation is correct, cleanup reloads are in place. One test assertion is functionally broken and the container-level behavior has no automated coverage.
Blocker:
test_htrmopo_dir_default_is_fixed_path— second assertion is vacuousThe default value
/app/models/.htrmopostarts with/a, not/.. This assertion passes for any path including"","/wrong", orNone(would error first). The test name documents "no-create-home safe" but the assertion doesn't enforce the contract.Fix:
One equality check is clearer, faster to read, and catches any future change to the default. The
"~" not in resultcheck is then implied (unless someone puts a~in a hardcoded path, which would also be caught).Concern: No automated coverage for the root warning in
main.pyThe
os.getuid() == 0→logger.warning(...)path is new behavior with zero test coverage. It's a security diagnostic — missing coverage is low risk operationally, but it's easy to verify and easy to regress:Suggestion: Turn manual step 2 of the test plan into a CI gate
The PR test plan includes:
docker exec archive-ocr id→uid=1000(ocr). This is a manual verification that can be automated:If someone accidentally removes
USER ocrfrom the Dockerfile in a future PR, this catches it automatically rather than relying on a manual check.What's done well
importlib.reload()with cleanup reload — correct pattern for testing module-level constants ✅patch.dict(os.environ, clean_env, clear=True)for the default-path test — clean env isolation ✅Thread.sleep(), no ordering dependencies between tests ✅🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No
.sveltefiles, no CSS, no frontend components, no routes, no i18n strings. This PR touches only the OCR container configuration and one Python module-level constant.The only user-facing output is the startup log line
"Running as root — CIS Docker §4.1 violation"— this is an operator/developer message that never reaches end users, so no UX consideration applies.LGTM from a UI/UX and accessibility perspective.
Review concerns addressed
All open blockers and suggestions from the review have been resolved. Summary:
✅ @Felix / @Sara — Blocker: vacuous assertion in
test_htrmopo_dir_default_is_fixed_pathReplaced
assert not result.startswith("/.")(passes trivially — the default path starts with/a) withassert result == "/app/models/.htrmopo", which enforces the exact contract.→ commit
53bd5746—test(ocr): replace vacuous startswith assertion with equality check✅ @Felix / @Sara — Minor/Concern: no test for root canary in
main.pyAdded
test_main.pywith two tests:test_startup_logs_warning_when_running_as_root— patchesos.getuid→ 0, asserts "Running as root" in caplogtest_startup_does_not_warn_when_running_as_non_root— patchesos.getuid→ 1000, asserts warning absentBoth follow the
ASGITransport/caplogpattern established intest_stream.pyandtest_training_auth.py; they require full deps (fastapi, pypdfium2) so they run in CI only, consistent with the existing main-importing tests.→ commit
eb63df20—test(ocr): add startup root canary tests for main.py lifespan✅ @Nora — Blocker:
/home/ocris on the read-only container filesystemAdded
XDG_CACHE_HOME=/app/cacheandTORCH_HOME=/app/models/torch— Option A from Nora's remediation. This redirects PyTorch (~/.cache/torch), Matplotlib (~/.cache/matplotlib), Ketos and any other XDG-aware library away from the read-only/home/ocrto the named writable volumes. Set asENVdefaults in the Dockerfile (same layering asHF_HOME) and confirmed in the compose environment block. Also restored the explanatory comment on theocr_cachevolume mount line.→ commit
fc8b4b16—security(ocr): redirect XDG cache and Torch home away from read-only HOME✅ @Markus / @Tobias / @Elicit — Blocker: volume migration step not documented
Added
§8 Upgrade notestodocs/DEPLOYMENT.mdwith the one-timedocker volume rm familienarchiv_ocr_cachestep, the reason (root-written volume inaccessible to non-root user), and a note on the re-download cost. Also addedXDG_CACHE_HOMEandTORCH_HOMEto the OCR env var table in §2 (required by the doc's own policy).→ commit
38973a01—docs: add XDG_CACHE_HOME/TORCH_HOME to OCR env table and upgrade notes for PR #611✅ @Markus — Suggestion: ADR for container hardening baseline
Created
docs/adr/019-container-hardening-baseline.mdcovering: Context (pre-baseline risks), Decision (5-point baseline: non-root user, read-only FS, per-path carve-outs, cap_drop + no-new-privileges, startup canary), and Consequences (what it prevents, operational cost for future contributors adding new write paths, upgrade note pointer).→ commit
74ca5ee3—docs(adr): ADR-019 — container hardening baseline (non-root + read-only)Deferred
uid=1000check — this requires changes to the Gitea Actions workflow. Captured as a follow-up; not a blocker for this PR.🏗️ Markus Keller — Senior Application Architect
Verdict: ⚠️ Approved with concerns
This PR is a self-contained infrastructure hardening of an existing sidecar. No layer boundaries are crossed, no domain logic is touched, no new services are introduced. The architectural posture is correct.
Blocker
docker-compose.prod.ymlis standalone and has not been updated.I checked the prod compose file. Its header says explicitly:
That means every change made to
docker-compose.ymlmust be mirrored indocker-compose.prod.yml— they share no config. The prod OCR service still has:The
ENV HF_HOME=/app/cachein the Dockerfile redirects HuggingFace to/app/cacheat runtime. In production,/app/cacheis not a named volume — it lives in the container filesystem, which means model cache is lost on every container restart. The oldocr-cachevolume at/root/.cacheis mounted but unused. The entire security hardening also doesn't exist in production.This is the actual deployment path for the feature. The security improvements merged here have zero effect in production until the prod compose is updated.
Suggestion
ADR-019 documents the decision well. The five-point baseline (non-root, read-only FS, carve-outs, cap_drop, startup canary) is the right structure. The Consequences section correctly flags what future contributors must do when adding a new write path — this is exactly the kind of forward-looking constraint that belongs in an ADR.
One small gap: the ADR doesn't mention
XDG_CACHE_HOME/TORCH_HOMEas part of the decision. These are a consequence ofread_only: truecombined with the way Python ML libraries behave, and future contributors will need to know this pattern. Consider adding a line in "3. Per-path write carve-outs" noting that library-specific cache env vars (XDG_CACHE_HOME,TORCH_HOME,HF_HOME) are part of the carve-out strategy, not just volume/tmpfs mounts.What's done well
🛠️ Tobias Wendt — DevOps & Platform Engineer
Verdict: 🚫 Changes requested
The dev hardening is production-grade. The problem is it never reaches production.
Blocker:
docker-compose.prod.ymlis standalone and not updatedThe prod compose header says explicitly:
I read the prod OCR service config. It has:
The Dockerfile now runs as
ocruser withENV HF_HOME=/app/cache. In production:ocr-cacheis mounted at/root/.cache— butHF_HOMEpoints to/app/cache, so the volume is unused/app/cacheis just a container filesystem directory (not a volume) — model cache is lost on every restartread_only,cap_drop, etc.) applies to productionRequired fix: update
docker-compose.prod.ymlocr-service with identical env vars and security settings asdocker-compose.yml.Blocker: Upgrade notes reference the wrong volume name for production
The
§8 Upgrade notesinstruct operators to run:In development that's correct (project name =
familienarchiv, volume name =familienarchiv_ocr_cache).In production the project name is
archiv-production, so the volume isarchiv-production_ocr-cache(note also the hyphen vs underscore — prod volumes use hyphens). The upgrade notes need to cover both environments or usedocker volume ls | grep ocrso operators find the right name.Minor:
cap_drop: [ALL]— flow notationThe rest of the compose file uses block sequence notation for arrays. The
cap_dropline uses inline flow notation:Preferred for consistency:
Not a blocker but will look odd to the next person touching the file.
What's done well
/tmp:size=512mwith the inline comment explaining why 512 MB — correct documentation style ✅HF_HOMEin both Dockerfile (build-time default) and compose (runtime confirmation) — correct layering ✅ocr_models,ocr_cache) survive container recreation ✅XDG_CACHE_HOME/TORCH_HOMEenv vars solve Nora's/home/ocrread-only concern cleanly ✅🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: 🚫 Changes requested
The security controls are correct and well-implemented in the dev compose. The problem is that
docker-compose.prod.ymlis standalone (not an overlay) and has not been updated — so nothing in this PR's security hardening actually runs in production.Blocker: Security hardening absent from prod compose
docker-compose.prod.ymlis self-contained. Its ocr-service block has neither the volume path fix nor any of the security directives:After this PR merges, the image built for production will run as
ocr(UID 1000, from the Dockerfile), but:HF_HOMEdefaults to/app/cache(Dockerfile ENV), yet the volume is mounted at/root/.cache— the non-root user can write to/root/.cache(it's a named volume owned by root, but any user can write to a Docker volume unless it's specifically restricted) but HuggingFace won't look there/app/cache(where HF actually looks) is an ephemeral container filesystem path — model re-download on every restartread_only,cap_drop: [ALL],no-new-privileges— absent in prodThe CIS Docker §4.1 requirement (non-root user) is met at the image layer. The §4.6 capability drop and filesystem restrictions are not.
Concern:
XDG_CONFIG_HOMEandXDG_DATA_HOMEstill default to~/.configand~/.local/shareThe PR correctly sets:
HF_HOME=/app/cache✅XDG_CACHE_HOME=/app/cache✅TORCH_HOME=/app/models/torch✅Not set:
XDG_CONFIG_HOME— defaults to/home/ocr/.config(read-only in dev)XDG_DATA_HOME— defaults to/home/ocr/.local/share(read-only in dev)Known risk: ketos stores config in
~/.config/ketoson first run. Matplotlib writes~/.config/matplotlib/matplotlibrc. If these writes happen at import time or on first use, the service may crash withReadOnlyFileSystemor silently produce incorrect output.Remediation: add to both compose files and Dockerfile:
/tmpis tmpfs (512 MB) — sufficient for ephemeral config writes./app/cacheis the named volume if you want persistence.This is a concern rather than a confirmed blocker, but given
read_only: truein dev, the first run of ketos after this PR may reveal it.What's done well
--no-create-home --shell /usr/sbin/nologin— no shell access, no profile scripts ✅read_only: true— any unexpected write path fails immediately and visibly ✅cap_drop: [ALL]— correct for port 8000+, safe ✅no-new-privileges: true— belt-and-suspenders with cap_drop ✅HTRMOPO_DIRfix removes the~→/silent failure path ✅XDG_CACHE_HOME/TORCH_HOMEaddress the PyTorch/Matplotlib read-only HOME concern ✅👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
TDD evidence is present and the implementation code is clean. The previous review cycle addressed all blockers.
What's verified
test_ensure_blla_model.pytest_htrmopo_dir_default_is_fixed_path— assertion is nowassert result == "/app/models/.htrmopo". Exact equality, not the previous vacuousstartswithcheck. One meaningful assertion. ✅test_htrmopo_dir_reads_from_env_var—assert result == "/custom/htrmopo"round-trip is confirmed. ✅importlib.reload()+ cleanup reload pattern is the correct way to test module-level constants. ✅patch.dict(os.environ, clean_env, clear=True)correctly isolates the default-path test from the host env. ✅test_main.py@pytest.mark.asynciopresent — required by asyncio mode=STRICT. ✅kraken_engine.load_models,load_spell_checker— exactly what's needed to let the lifespan run without heavy deps. ✅caplog.at_level(logging.WARNING, logger="main")— scoped to the right logger and level. ✅test_stream.pyandtest_training_auth.py. ✅ensure_blla_model.pyos.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")— clean one-liner, no side effects. ✅~. ✅main.pyif os.getuid() == 0: logger.warning(...)— two lines, single responsibility, uses SLF4J-style parameterized logging pattern. ✅Dockerfile
RUN useradd ... && mkdir -p ... && chown -R ocr:ocr /app /home/ocrthenRUN chmod +x /app/entrypoint.sh— ordering is correct: chown before chmod, both run as root beforeUSER ocr. ✅ENVblock afterchownand beforeUSER ocr— correct placement. ✅Note
The production compose is flagged by Tobias and Markus as a blocker. That's infrastructure, not code quality — no action required on this review.
🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
Test quality is significantly improved over the first review cycle. The previous assertion blocker is resolved.
What's verified
test_htrmopo_dir_default_is_fixed_path— fixed correctly:Single exact equality check. The name ("default is fixed path") and the assertion agree. If the implementation returns anything other than this exact string, the test fails. Clean. ✅
test_htrmopo_dir_reads_from_env_var— unchanged, already correct:Round-trip verification. ✅
test_main.py— new root canary tests:Both tests follow the same structure as the existing
test_stream.py/test_training_auth.pyCI-only pattern. They require full OCR deps (fastapi, pypdfium2) to run, which is consistent with the established test infrastructure. In CI where all deps are installed, both tests are meaningful:test_startup_logs_warning_when_running_as_root— would fail if theif os.getuid() == 0:check was removed ✅test_startup_does_not_warn_when_running_as_non_root— covers the negative path ✅Both use
@pytest.mark.asyncio(STRICT mode requirement met) and mock only what's necessary. ✅Remaining deferred
CI gate for
uid=1000check — explicitly deferred in the previous review summary. Noted. A follow-up issue would be the right place to track this.One observation
The
caplog.at_level(logging.WARNING, logger="main")is scoped to the"main"logger. The assertion isassert "Running as root" in caplog.text. This is correct —caplog.textcontains all captured log records, and since the logger ismain(which islogging.getLogger(__name__)inmain.py), the scoping is accurate.One edge case: if a second test runs and the
_models_readyglobal is already True in themainmodule, the lifespan still runs in sequence and the warning is still emitted. The shared module state doesn't affect these tests. ✅📋 Elicit — Requirements Engineer
Verdict: 🚫 Changes requested
Issue #459 requirements are met. One non-functional requirement gap has been introduced.
Blocker: Production upgrade path is broken
docker-compose.prod.ymlis standalone (self-contained, confirmed by its own header comment). This PR updates onlydocker-compose.yml. As a result:NFR-OPS-001 (from the previous review cycle): "The operator shall delete
familienarchiv_ocr_cachebefore starting the updated container stack."This NFR is documented in
DEPLOYMENT.md §8— but only for development. In production:archiv-production_ocr-cache(notfamilienarchiv_ocr_cache) — different project name, different separator (hyphens not underscores)/root/.cachewith noHF_HOMEenv var — HuggingFace would write to/app/cache(Dockerfile ENV default), which is an ephemeral container path in prod, so every restart re-downloads modelsRequirement gap: The upgrade notes document the dev case. The production case is both different and broken.
Requirements coverage check (updated)
HF_HOMEoverride~ocr_cache:/app/cacheocr-cache:/root/.cache(wrong path)What's done well
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
No
.sveltefiles, no CSS or design tokens, no frontend routes, no i18n strings, no user-facing components or interactions.Changed files are:
docker-compose.yml,docker-compose.prod.yml(indirectly flagged as missing),docs/DEPLOYMENT.md,docs/adr/019-container-hardening-baseline.md,ocr-service/Dockerfile,ocr-service/ensure_blla_model.py,ocr-service/main.py,ocr-service/test_*.py.The only operator-visible output is the startup log line
"Running as root — CIS Docker §4.1 violation"— this surfaces in Docker logs and Loki, never in the user interface.LGTM from a UI/UX and accessibility perspective.
Review concerns addressed — commit
7769dbc9All blockers from the second-round review (Tobias 🚫, Nora 🚫, Elicit 🚫) have been resolved.
docker-compose.prod.yml— OCR service fully hardeneddocker-compose.prod.ymlis a standalone file (not an overlay), so none of the security changes fromdocker-compose.ymlpropagated to it automatically. This commit brings it to parity:ocr-cache:/root/.cacheocr-cache:/app/cacheHF_HOME/app/cacheXDG_CACHE_HOME/app/cacheTORCH_HOME/app/models/torchread_onlytruetmpfs/tmp:size=512mcap_drop[ALL](block notation)no-new-privilegestruedocs/DEPLOYMENT.md §8— Upgrade notes cover all environmentsThe previous upgrade notes only gave the dev volume name (
familienarchiv_ocr_cache). The section now lists all three project-namespaced names:🏛️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
What I checked
ADR-019 is well-formed and follows the established format: context, decision, consequences, applicability. It correctly documents all five hardening elements and links back to DEPLOYMENT.md §8. This is exactly the kind of documentation that prevents future developers from removing the
USER ocrdirective out of ignorance. Good work.DEPLOYMENT.md is updated with two new env vars (
XDG_CACHE_HOME,TORCH_HOME) and a new §8 with per-environment volume removal commands. The upgrade notes distinguish dev, production, and staging compose project names — this will prevent silent failures on existing installs.C4 diagrams: No update required. This PR does not add a new container, a new external system, or change the service boundary of the OCR service. Runtime hardening flags are implementation detail, not container topology.
One concern (non-blocking)
ADR-019 §Applicability states: "This baseline MUST be applied to any new container added to the project." The backend and frontend service containers are not yet hardened. The ADR establishes the expectation but the rest of the fleet is not yet compliant. Consider creating a tracking issue for backend + frontend container hardening so the ADR's "must" doesn't become an unfulfilled promise that rots.
What was done well
chmod +xout as a separateRUNstep (afterchown -R ocr:ocr /app) keeps the layer order correct without forcing a combined RUN that's hard to read.TORCH_HOMEto/app/models/torchrather than/app/cacheis the right call — model weights are persistent, not ephemeral cache.docker volume rmcommands per environment. No guesswork for the operator.👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
TDD evidence
Commits confirm the red/green order:
test(ocr): add startup root canary testsprecedessecurity(ocr): redirect XDG cache and Torch homein the branch history. The root canary tests were written before the lifespan change. Good discipline.Code quality
ensure_blla_model.pychange is minimal and correct:One line, no new indirection, self-documenting. The default
/app/models/.htrmopois a fixed path that doesn't depend on~— exactly right for a--no-create-homecontainer user.main.pycanary is two lines and doesn't need a comment:The message in the warning string is sufficient documentation.
test_main.pyis clean: proper@pytest.mark.asyncio, mocks both heavyweight startup functions (load_models,load_spell_checker), asserts viacaplogat the right level. Both happy and unhappy paths are covered.Minor observation (non-blocking)
test_ensure_blla_model.pyusesimportlib.reload(ensure_blla_model)to reset module-level state. This is the correct technique when testing module constants that are evaluated at import time — no better option exists given the current module structure. The cleanup reload at the end of each test is essential and present. Flag this if parallel test execution is ever introduced (pytest-xdist), as global module state won't be process-isolated.Naming
All new test function names read as sentences describing behavior. No abbreviations. No
test_update()anti-patterns.🚀 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
What was done well
The hardening block is correct:
Comments explain why — 512m covers typical training batches — not just what. The volume path change from
/root/.cacheto/app/cacheis clean. Named volumes unchanged (ocr_cache/ocr-cache), so Docker still manages lifecycle correctly.The upgrade notes in DEPLOYMENT.md §8 are exactly what I'd want to hand to an operator at 11pm: exact
docker volume rmcommands, correct project name variants per environment (dev/prod/staging), and an explicit cost estimate (1–2 GB re-download, one-time).Blocker
None.
Concerns
1. Inconsistent list syntax between compose files (cosmetic but worth fixing)
docker-compose.ymluses inline YAML list:docker-compose.prod.ymluses block list:Both are valid YAML but the inconsistency will confuse someone doing a diff. Pick one style and apply it to both files.
2.
tmpfshas nonoexec,nosuidoptionsThe current mount:
Consider:
Training writes ZIPs to
/tmp, not executables, sonoexecshould be safe.nosuidprevents SUID escalation from files written to/tmp. This is defense-in-depth — if a dependency ever writes an executable to/tmpand tries to run it,noexecblocks it. Worth verifying thatketosand the training pipeline don't execute anything from/tmpbefore adding.3.
TORCH_HOMEpoints to/app/models/torch— subdirectory will not exist on first runThe
/app/modelsdirectory is a named volume. Thetorchsubdirectory doesn't exist until PyTorch creates it. PyTorch'storch.hubcreates this directory lazily on first use, so this should work. Worth confirming with a quickdocker execafter first startup:ls /app/models/torch.What's correct that I specifically verified
HF_HOME: /app/cachein both the DockerfileENVand composeenvironment— the compose value takes precedence at runtime; the Dockerfile value is the image fallback.useradd --uid 1000fixed UID ensures consistent volume ownership across host volume mounts.start_period: 120son the healthcheck — unchanged, still appropriate for Surya model loading.📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Traceability assessment
The PR closes issue #459. Based on the stated requirements for non-root OCR container operation, I traced each requirement through to the implementation:
USER ocrin Dockerfile; startup canary warns if violatedocr_cachemounted at/app/cacheHF_HOME,XDG_CACHE_HOME,TORCH_HOMEall point to persistent volumesread_only: true+ named volume carve-outsOne open question (non-blocking)
The test plan in the PR description includes manual steps (
docker exec archive-ocr id). There is no automated E2E or CI verification that the container is actually running as UID 1000. This is accepted as a manual gate for a security property, but worth noting as a gap if CI automation is ever added for container hardening checks.Out-of-scope addition (noted, not blocking)
The
HTRMOPO_DIRenv var change was not mentioned in the original issue #459 but is clearly necessary to prevent the~→/resolution failure under--no-create-home. This is a correct and minimal addition that belongs in this PR.Upgrade notes completeness
The DEPLOYMENT.md §8 covers dev, production (with project name
-p archiv-production), and staging (with project name-p archiv-staging). The 1–2 GB re-download cost is quantified. The instruction is unambiguous: drop the volume before starting the updated stack.No requirements gaps found.
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
This PR implements CIS Docker Benchmark §4.1 (non-root) and §4.6 (read-only filesystem). I reviewed it against the full hardening checklist.
Verified controls
useradd --uid 1000 --no-create-home --shell /usr/sbin/nologinread_only: truewith explicit writable carve-outscap_drop: ALL, no capabilities added backno-new-privileges:trueWARNINGifos.getuid() == 0--shell /usr/sbin/nologinHTRMOPO_DIRpath safety~→/resolution under--no-create-homeSuggestion (non-blocking)
The tmpfs mount could be hardened further:
noexecprevents executing binaries written to/tmp— relevant because the training pipeline extracts ZIPs there, and a malicious ZIP entry could include an executable.nosuidprevents SUID escalation from files landing in/tmp. Verify first thatketos traindoes not execute scripts extracted from/tmpbefore enablingnoexec. If ketos is clean, this is a free defense-in-depth layer.What was done particularly well
chown -R ocr:ocr /app /home/ocrfollowed byread_only: trueat runtime means theocruser owns its files but cannot create new paths outside the declared volumes. If a dependency tries to write to/usr/local/lib/python3.12/site-packages/, it gets aPermissionErrorimmediately — not a silent privilege escalation.The startup canary at
WARNINGlevel (notERROR) is the right severity: it notifies operators without aborting the container, which allows existing unpatched deployments to start and emit a visible signal rather than crash-looping with no context.The
HTRMOPO_DIRfix is security-relevant:os.path.expanduser("~")whenHOMEpoints to/(which happens under--no-create-homewithout an explicitHOMEenv var) silently expands to/— a path-traversal waiting to happen ifhtrmopowere ever to write files relative to that base. The env-var override with a fixed/app/models/.htrmopodefault closes this.No vulnerabilities found. The hardening is correct and complete for the OCR service's threat model.
🧪 Sara Holt — QA Engineer & Test Strategist
Verdict: ⚠️ Approved with concerns
Test coverage summary
Two new test files, four new tests:
test_ensure_blla_model.pyimportlib.reload+patch.dicttest_main.pyAsyncClient(ASGITransport)+caplogWhat's good
@pytest.mark.asynciocorrectly applied to async test functions ✅ASGITransport(app=app)used instead of a real server — in-process, fast, correct ✅importlib.reload(ensure_blla_model)call at the end of each HTRMOPO test restores module state ✅Concern 1 — Untested edge case: empty string
HTRMOPO_DIR(minor)os.environ.get("HTRMOPO_DIR", default)returns""if the variable is set to an empty string — the default is only used when the key is absent. An operator who setsHTRMOPO_DIR=in their.envfile (empty value) will get an empty string as the path, which will cause an obscure failure whenhtrmopotries to use it.Suggested additional test:
This test will currently fail — exposing the real behavior of
os.environ.get. The fix would be:Concern 2 —
importlib.reloadis fragile under parallel test execution (minor)The
patch.dict+reloadpattern mutates global module state. Ifpytest-xdistis ever added for parallel runs, tests intest_ensure_blla_model.pywill race each other. This is acceptable today but should be flagged before any parallelization work.No coverage regression
The existing
test_ensure_blla_model.pytests (model loadable, model missing, download success, etc.) are unchanged. The root canary tests are isolated viacaplogand don't interact with OCR model state.No E2E test exists to verify
docker exec archive-ocr id— this is expected for infrastructure-level assertions and the manual test plan in the PR description covers it.🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved — No frontend changes
This PR is entirely infrastructure and Python backend work: Dockerfile hardening, Docker Compose runtime flags, env var redirection, and a startup canary log line. There are no changes to Svelte components, SvelteKit routes, Tailwind classes, i18n strings, or any user-facing interface.
What I checked
.sveltefiles modifiedfrontend/Indirect UX note (informational only)
The
read_only: true+tmpfs /tmp:size=512mconstraint means that if a training batch ever exceeds 512 MB of scratch space, the OCR service will produce a visibleENOSPCerror. If this manifests in the UI (e.g., a failed training response surfaced to an admin), confirm that the error is mapped to a user-friendly message viagetErrorMessage()rather than exposing the rawOSErrorfrom the Python backend. This is not introduced by this PR — it applies to any existing training error path — but worth checking once the non-root container is running in production.No UI/UX concerns with this PR.
Third-round review concerns addressed
All open concerns from the third review round (Sara ⚠️, Tobias ⚠️) have been resolved.
✅ @Sara — Concern 1: empty-string
HTRMOPO_DIRfalls back to empty string, not the defaultos.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")returns""when the variable is present but blank — the default only kicks in when the key is absent. An operator who writesHTRMOPO_DIR=in.envwould get an empty path and a cryptic failure.Fix:
os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"— theorfallback treats both absent and blank as "use the default".Test added (
test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string) — was red against the oldget()form, green after the fix, consistent with theimportlib.reload+patch.dictpattern used by the existing HTRMOPO_DIR tests.→ commit
bead6f18—fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback✅ @Tobias — Concern 1:
cap_drop: [ALL]inline notation inconsistent withdocker-compose.prod.ymldocker-compose.ymlused inline flow notation ([ALL]);docker-compose.prod.ymlused block notation (- ALL). Normalizeddocker-compose.ymlto block notation to match.→ commit
f1e0b92f—style(ocr): normalize cap_drop to block notation in docker-compose.ymlDeferred
noexec,nosuid,nodevon tmpfs: both reviewers explicitly say "verify first that ketos training does not execute anything from/tmp". Cannot verify without a running container. Deferred to a follow-up once the container is running in dev — if ketos is clean, adding the flags is a one-liner.🏗️ Markus Keller (@mkeller) — Application Architect
Verdict: ✅ Approved
What I checked
ADR-019 — well-structured, covers context, decision, alternatives, consequences, and applicability. The threat model in the "Context" section is concrete and specific (path-traversal on ZIP training endpoints, dependency persistence to image layers). This is exactly what an ADR should capture. ✅
DEPLOYMENT.md — Upgrade notes section added with volume drop commands for all three deployment contexts (dev, prod, staging). This is the correct operational runbook entry. ✅
Documentation checklist — No new container services, no new routes, no DB migrations, no new domains. The hardened OCR service is the same topology. No changes required to
l2-containers.pumlor other architecture diagrams.TORCH_HOME → /app/models/torch — Correct decision. Models volume is persistent (named volume, survives container recreation), cache volume is also persistent. Torch model files won't be re-downloaded on recreates. ✅
Suggestion (non-blocker)
ocr-service/Dockerfilehas two separateRUNlayers:The
chmodcreates a separate image layer. SinceCOPY . .already runs as root, andentrypoint.shneeds to be executable, thechmodshould ideally be part of the initial RUN block (or moved before thechownso it's combined). Minor — won't block merge.ADR applicability gap (suggestion, not blocker)
ADR-019 states: "This baseline applies to the OCR service (PR #611). It should be applied to any new container added to the project."
Worth opening a follow-up issue to track applying this to the
backendandfrontendcontainers in a future milestone. Right now only OCR is hardened. The ADR is the right place to record this intent, but a tracking issue ensures it doesn't get forgotten.👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
TDD check
Tests exist and cover the changed behavior:
test_htrmopo_dir_reads_from_env_var— ✅ verifies env var takes precedencetest_htrmopo_dir_default_is_fixed_path— ✅ verifies default withoutHTRMOPO_DIRsettest_htrmopo_dir_falls_back_to_default_when_set_to_empty_string— ✅ regression test for theor-fallback (thebead6f18fix)test_startup_logs_warning_when_running_as_root— ✅ canary behavior verifiedtest_startup_does_not_warn_when_running_as_non_root— ✅ happy path verifiedUsing
importlib.reloadto test module-level constants is the correct approach here — there's no cleaner way to reset a module global between tests without restructuring the production code. The cleanup reload after thewithblock is the right pattern. ✅@pytest.mark.asyncioon both async tests. ✅AsyncClient+ASGITransportfor in-process API testing. ✅ Both patches isolate the engine (kraken_engine.load_models,load_spell_checker) cleanly. ✅Code quality
os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"— idiomatic Python, handles empty-string correctly. Clear improvement overos.path.expanduser("~/.local/share/htrmopo"). ✅Root canary:
if os.getuid() == 0:— one line, does exactly one thing, triggers exactly once at startup. ✅Minor suggestion (non-blocker)
The
chmodcould be folded into theCOPY . .step or added to the firstRUNblock to save a layer. Functionally correct as-is — not a blocker.🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: ✅ Approved
Solid infrastructure change. Let me run through my checklist:
What's correct
Named volumes — both compose files continue to use named volumes (
ocr_cache,ocr-cache), not bind mounts. ✅Volume mount path corrected —
/root/.cache→/app/cachematches the non-root user's environment. The named volume name itself stays the same; only the container-side mount path changes. ✅read_only: true+ carve-outs — correct pattern:Named volume mounts remain writable even with
read_only: true(Docker mounts them r/w by default). Only the container rootfs is locked. This is exactly right. ✅cap_drop: [ALL]+no-new-privileges:true— defense in depth. A Python/FastAPI service on port 8000+ needs zero Linux capabilities. ✅Both dev and prod compose updated — not just one environment. ✅
DEPLOYMENT.md upgrade notes — the three-command block for dev/prod/staging environments is exactly what an ops runbook needs. Volume names match the actual project/service names. ✅
ENV vars in compose duplicate Dockerfile ENV —
HF_HOME,XDG_CACHE_HOME,TORCH_HOMEare set in both Dockerfile and compose. This is the right approach: Dockerfile sets build-time defaults, compose can override or reinforce at runtime. ✅Suggestion (non-blocker)
The upgrade note could optionally include a migration path for teams who want to avoid the 1–2 GB re-download:
This
chownapproach would allow the existing cache to be reused under the newocruser (uid=1000), avoiding the re-download. Not a blocker — re-downloading is acceptable — but worth knowing the option exists. The current docs are correct as written; this is purely additive.🔐 Nora "NullX" Steiner — Application Security Engineer (OSWE, BSCP)
Verdict: ✅ Approved
Good security work. Let me give it the adversarial read-through.
What this PR gets right
CIS Docker §4.1 — non-root ✅
Fixed UID (1000), no login shell, no home directory created by the OS (home is manually provisioned and owned). This is correct hardening, not just the minimum.
CIS Docker §4.6 — read-only filesystem ✅
Write surface is explicitly declared and bounded:
/app/models(volume),/app/cache(volume),/tmp(tmpfs, size-capped at 512 MB). Any unexpected write attempt becomes a visiblePermissionError. Theread_only: trueflag is a security control, not just a hygiene measure.Privilege escalation prevention ✅
cap_drop: ALL+no-new-privileges:truecloses the capability regain vector. If a SUID binary exists in a dependency, it cannot escalate. Both applied consistently across dev and prod compose files.HTRMOPO_DIR fix ✅
os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"eliminates the~expansion failure under--no-create-home. The old code could resolve~to/(or fail silently) when HOME is not what the library expects. The new code has a predictable, auditable fallback.Root canary ✅
The startup
os.getuid() == 0check means a misconfiguredUSERdirective (e.g., accidentally dropped in a future Dockerfile edit) is caught at startup via log monitoring, not discovered during a post-incident review.Security observation (not a blocker, requires separate PR)
ADR-019 explicitly names this risk:
This PR reduces the blast radius of a ZIP Slip attack significantly — with
read_only: true, a successful exploit can only write to/tmp(ephemeral, 512 MB cap),/app/models, or/app/cache. It cannot overwrite Python source files or escalate further into the container.However, the root cause (ZIP Slip in training ZIP extraction) is not patched in this PR. I'd recommend a follow-up issue to audit
_validate_zip_entry()(or introduce one if it doesn't exist) for all training endpoints. The tmpfs size cap on/tmpprovides an effective denial-of-disk boundary even if the path traversal check is incomplete.This is not a blocker for this PR — the defense-in-depth here is correct. The ADR acknowledges it. Just flag it for the next security cycle.
🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ✅ Approved
Test coverage assessment
New tests added: 5 across two files. All cover changed behavior directly.
test_ensure_blla_model.py(3 new tests):test_htrmopo_dir_reads_from_env_vartest_htrmopo_dir_default_is_fixed_pathclear=True— correct approachtest_htrmopo_dir_falls_back_to_default_when_set_to_empty_stringor-fallbacktest_main.py(2 new tests):test_startup_logs_warning_when_running_as_roottest_startup_does_not_warn_when_running_as_non_rootTest names are sentences — tells you what broke without reading the body. ✅
@pytest.mark.asynciopresent on both async tests. ✅ASGITransportfor in-process testing (no real server). ✅Potential fragility (minor, non-blocker)
The
importlib.reloadpattern is correct but has an ordering sensitivity: if a test fails between thewithblock and the cleanup reload, subsequent tests may see dirty module state:In practice,
importlib.reloadon this module is unlikely to raise. And pytest runs these sequentially, not in parallel. Acceptable risk — just worth knowing. Apytest.fixture(autouse=True)that wraps cleanup in afinallyblock would be the robust version for a future refactor.Missing (acceptable for this PR scope)
read_only: trueenforcement at runtime. Infrastructure-level testing. Acceptable.The unit tests cover exactly the changed Python code. ✅
📋 Elicit — Requirements Engineer
Verdict: ⚠️ Approved with concerns
Reviewing from a requirements and operational completeness standpoint.
Requirement coverage
Issue #459 (run OCR container as non-root, CIS Docker §4.1) is fully addressed. All stated acceptance criteria in the PR description are testable and specific. ✅
ADR-019 correctly documents the why — the threat model section is unusually concrete for an ADR (names specific attack scenarios). This is the right way to write an architectural decision that involves security trade-offs. ✅
Concern: upgrade path requires full model re-download (~1–2 GB one-time cost)
What happens: The
ocr_cachevolume was written as root (uid=0). The newocruser (uid=1000) cannot read these files. The upgrade note correctly instructs operators to drop and recreate the volume.The gap: The documentation frames this as "the volume is recreated automatically on
docker compose up" — which is true, but the model re-download takes time and bandwidth. There is an alternative that avoids the re-download entirely:This would make the upgrade path lower-friction for deployments with limited bandwidth or download-rate limits on model hosting.
Recommendation: Add a note to DEPLOYMENT.md §8 offering this as an alternative to the
docker volume rmpath. Operators on slow connections (VPS in a bandwidth-limited region, for example) would benefit from having both options documented. This is not a blocker — the current instructions are correct and complete. It's a usability improvement to the runbook.What's well-specified ✅
🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead
Verdict: ✅ Approved
No frontend, UI, or Svelte changes in this PR. I checked the full diff — all changes are confined to:
ocr-service/Dockerfiledocker-compose.yml/docker-compose.prod.ymlocr-service/main.py,ocr-service/ensure_blla_model.pyocr-service/test_*.pydocs/DEPLOYMENT.md,docs/adr/019-*.mdNothing touches the SvelteKit frontend, Tailwind styles, Svelte components, or any user-facing feature. No accessibility, brand, or responsive design concerns apply to this PR.
LGTM from a UI perspective. 🟢