security(ocr): run OCR container as non-root user (CIS Docker §4.1) #611

marcel · 2026-05-17T16:58:01+02:00

marcel commented

2026-05-17 16:58:01 +02:00

Closes #459

Summary

Dockerfile: adds ocr user (UID 1000), creates /home/ocr + /app/cache with correct ownership, sets ENV HOME=/home/ocr and ENV HF_HOME=/app/cache, adds USER ocr directive
docker-compose.yml: fixes volume mount (/root/.cache → /app/cache), adds HF_HOME env var, adds runtime hardening (read_only, tmpfs /tmp:size=512m, cap_drop: ALL, no-new-privileges)
ensure_blla_model.py: replaces os.path.expanduser("~/.local/share/htrmopo") with os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") — prevents silent ~ → / resolution failure under --no-create-home
main.py: startup canary logs a warning if os.getuid() == 0

Test plan

docker compose build ocr-service && docker compose up -d ocr-service
docker exec archive-ocr id → uid=1000(ocr)
docker exec archive-ocr ls -la /app/models → ocr:ocr ownership
docker compose ps ocr-service → Up (healthy)
Trigger OCR on a small PDF via the backend — streaming response completes without errors
Stop + recreate container, confirm blla model OK in logs (no re-download, no PermissionError)
All training endpoints (/train, /train-sender, /segtrain) still respond correctly

Existing installs: drop the old root-owned cache volume before starting:
docker volume rm familienarchiv_ocr_cache

🤖 Generated with Claude Code

Closes #459 ## Summary - **Dockerfile**: adds `ocr` user (UID 1000), creates `/home/ocr` + `/app/cache` with correct ownership, sets `ENV HOME=/home/ocr` and `ENV HF_HOME=/app/cache`, adds `USER ocr` directive - **docker-compose.yml**: fixes volume mount (`/root/.cache` → `/app/cache`), adds `HF_HOME` env var, adds runtime hardening (`read_only`, `tmpfs /tmp:size=512m`, `cap_drop: ALL`, `no-new-privileges`) - **`ensure_blla_model.py`**: replaces `os.path.expanduser("~/.local/share/htrmopo")` with `os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")` — prevents silent `~` → `/` resolution failure under `--no-create-home` - **`main.py`**: startup canary logs a warning if `os.getuid() == 0` ## Test plan - [ ] `docker compose build ocr-service && docker compose up -d ocr-service` - [ ] `docker exec archive-ocr id` → `uid=1000(ocr)` - [ ] `docker exec archive-ocr ls -la /app/models` → `ocr:ocr` ownership - [ ] `docker compose ps ocr-service` → `Up (healthy)` - [ ] Trigger OCR on a small PDF via the backend — streaming response completes without errors - [ ] Stop + recreate container, confirm `blla model OK` in logs (no re-download, no `PermissionError`) - [ ] All training endpoints (`/train`, `/train-sender`, `/segtrain`) still respond correctly > **Existing installs:** drop the old root-owned cache volume before starting: > `docker volume rm familienarchiv_ocr_cache` 🤖 Generated with [Claude Code](https://claude.com/claude-code)

marcel added 4 commits 2026-05-17 16:58:02 +02:00

security(ocr): add non-root user and set HOME/HF_HOME in Dockerfile 1aca4c4a41

CIS Docker §4.1: run uvicorn as UID 1000 (ocr) instead of root.
Creates /home/ocr and /app/cache with correct ownership so named
volumes inherit ocr:ocr on first Docker mount. Sets HOME and HF_HOME
so ~ expansion and Hugging Face caching resolve under /app, not /root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

security(ocr): harden compose — fix cache volume path, add read_only + cap_drop ab24786d2a

Move ocr_cache mount from /root/.cache to /app/cache (correct path for
non-root user). Add HF_HOME so Hugging Face resolves to the same path.
Add runtime hardening: read_only, tmpfs /tmp (512 MB cap), cap_drop ALL,
no-new-privileges.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(ocr): resolve HTRMOPO_DIR from env var, not ~ expansion 9db42d6cc1

With --no-create-home, os.path.expanduser("~") resolves to "/" causing
kraken get to write to /.local/share/htrmopo. Replace with
os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") so the path is
explicit and override-friendly without a home directory.

Adds two tests verifying env-var resolution and ~-free default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

security(ocr): log warning on startup when running as root

CI / Unit & Component Tests (pull_request) Successful in 3m3s

Details

CI / OCR Service Tests (pull_request) Successful in 18s

Details

CI / Backend Unit Tests (pull_request) Successful in 3m10s

Details

CI / fail2ban Regex (pull_request) Successful in 42s

Details

CI / Semgrep Security Scan (pull_request) Successful in 19s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 59s

Details

581ba01d8d

Adds a canary log line if os.getuid() == 0. Produces an observable
signal in container logs if the USER directive is ever removed from
the Dockerfile, without requiring an external audit tool.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-17 17:02:18 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

No layer violations, no new services, no domain boundary issues — this is a self-contained hardening of an existing sidecar. The architectural posture is correct.

Blocker

docs/DEPLOYMENT.md must document the volume migration step.

The one-time upgrade step — docker volume rm familienarchiv_ocr_cache — is currently buried in the PR description and disappears once this merges. Any operator running the standard git pull && docker compose up -d upgrade path will hit a PermissionError on startup with no guidance.

The doc update table in our process requires DEPLOYMENT.md updates for Docker infrastructure changes. Suggested addition:

### Upgrading from pre-non-root OCR (PR #611)
The OCR cache volume path changed from `/root/.cache` to `/app/cache`.
Before starting the new container, drop the old root-owned volume:

    docker volume rm familienarchiv_ocr_cache

The volume will be recreated automatically on `docker compose up`.

Suggestion

ADR for the container hardening baseline.

The combination of read_only: true, cap_drop: [ALL], no-new-privileges, and per-path tmpfs/volume carve-outs establishes a precedent for the project's container security baseline. Future contributors adding a new OCR feature that writes to a new path need to know why the container is read-only — and that they must add a volume or tmpfs mount rather than writing to the container filesystem.

A brief docs/adr/ADR-XXX-container-hardening-baseline.md (Context / Decision / Consequences) would preserve this intent. The PR description already contains most of the content; it just needs to be committed to the repo rather than living only in Gitea.

What's done well

read_only + cap_drop: ALL + no-new-privileges + tmpfs is the right combination ✅
Volume and tmpfs carve-outs match the actual write paths ✅
Root canary in main.py is a good fail-loud pattern ✅
The HTRMOPO_DIR env-var fix correctly removes the ~ dependency ✅

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** No layer violations, no new services, no domain boundary issues — this is a self-contained hardening of an existing sidecar. The architectural posture is correct. ### Blocker **`docs/DEPLOYMENT.md` must document the volume migration step.** The one-time upgrade step — `docker volume rm familienarchiv_ocr_cache` — is currently buried in the PR description and disappears once this merges. Any operator running the standard `git pull && docker compose up -d` upgrade path will hit a `PermissionError` on startup with no guidance. The doc update table in our process requires DEPLOYMENT.md updates for Docker infrastructure changes. Suggested addition: ```markdown ### Upgrading from pre-non-root OCR (PR #611) The OCR cache volume path changed from `/root/.cache` to `/app/cache`. Before starting the new container, drop the old root-owned volume: docker volume rm familienarchiv_ocr_cache The volume will be recreated automatically on `docker compose up`. ``` ### Suggestion **ADR for the container hardening baseline.** The combination of `read_only: true`, `cap_drop: [ALL]`, `no-new-privileges`, and per-path tmpfs/volume carve-outs establishes a precedent for the project's container security baseline. Future contributors adding a new OCR feature that writes to a new path need to know why the container is read-only — and that they must add a volume or tmpfs mount rather than writing to the container filesystem. A brief `docs/adr/ADR-XXX-container-hardening-baseline.md` (Context / Decision / Consequences) would preserve this intent. The PR description already contains most of the content; it just needs to be committed to the repo rather than living only in Gitea. ### What's done well - `read_only + cap_drop: ALL + no-new-privileges + tmpfs` is the right combination ✅ - Volume and tmpfs carve-outs match the actual write paths ✅ - Root canary in `main.py` is a good fail-loud pattern ✅ - The `HTRMOPO_DIR` env-var fix correctly removes the `~` dependency ✅

marcel commented

2026-05-17 17:02:28 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ⚠️ Approved with concerns

TDD evidence is present — two tests for the HTRMOPO_DIR change landed before (or alongside) the implementation. The Dockerfile ordering is correct. One assertion is broken.

Blocker

test_htrmopo_dir_default_is_fixed_path — second assertion is vacuous.

assert "~" not in result          # ✅ meaningful
assert not result.startswith("/.") # ❌ passes trivially

The default is /app/models/.htrmopo. It starts with /a, not /. — this assertion would pass for any path that doesn't start with /., including "", "/tmp/wrong", or even if the function returned something completely unrelated. The test name promises "no-create-home safe" but doesn't verify the expected value.

Replace with:

assert result == "/app/models/.htrmopo"

If you want to keep the "~" check for documentation value, keep it alongside the equality check. Two assertions per test is fine when they express complementary invariants about the same value.

Suggestion

test_htrmopo_dir_reads_from_env_var — assert the actual value, not just the mechanism.

assert result == "/custom/htrmopo"

Currently the test correctly patches the env var and reloads, but confirming the round-trip value makes the test self-documenting.

Minor

No test for the root canary (os.getuid() == 0) in main.py.

It's a diagnostic warning so it's low risk, but it is new behavior. One test:

def test_startup_logs_warning_when_running_as_root(caplog):
    with patch("main.os.getuid", return_value=0), \
         caplog.at_level(logging.WARNING, logger="main"):
        # invoke the relevant startup check directly or via lifespan
        ...
    assert "Running as root" in caplog.text

What's done well

The importlib.reload() + cleanup-reload pattern is the right way to test module-level constants ✅
patch.dict(os.environ, clean_env, clear=True) correctly isolates the env for the default-path test ✅
--no-create-home --shell /usr/sbin/nologin — no shell noise ✅
chmod +x /app/entrypoint.sh after chown -R is correctly ordered ✅
Type hints and docstrings untouched where they already existed ✅

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ⚠️ Approved with concerns** TDD evidence is present — two tests for the `HTRMOPO_DIR` change landed before (or alongside) the implementation. The Dockerfile ordering is correct. One assertion is broken. ### Blocker **`test_htrmopo_dir_default_is_fixed_path` — second assertion is vacuous.** ```python assert "~" not in result # ✅ meaningful assert not result.startswith("/.") # ❌ passes trivially ``` The default is `/app/models/.htrmopo`. It starts with `/a`, not `/.` — this assertion would pass for any path that doesn't start with `/.`, including `""`, `"/tmp/wrong"`, or even if the function returned something completely unrelated. The test name promises "no-create-home safe" but doesn't verify the expected value. Replace with: ```python assert result == "/app/models/.htrmopo" ``` If you want to keep the `"~"` check for documentation value, keep it alongside the equality check. Two assertions per test is fine when they express complementary invariants about the same value. ### Suggestion **`test_htrmopo_dir_reads_from_env_var` — assert the actual value, not just the mechanism.** ```python assert result == "/custom/htrmopo" ``` Currently the test correctly patches the env var and reloads, but confirming the round-trip value makes the test self-documenting. ### Minor **No test for the root canary (`os.getuid() == 0`) in `main.py`.** It's a diagnostic warning so it's low risk, but it is new behavior. One test: ```python def test_startup_logs_warning_when_running_as_root(caplog): with patch("main.os.getuid", return_value=0), \ caplog.at_level(logging.WARNING, logger="main"): # invoke the relevant startup check directly or via lifespan ... assert "Running as root" in caplog.text ``` ### What's done well - The `importlib.reload()` + cleanup-reload pattern is the right way to test module-level constants ✅ - `patch.dict(os.environ, clean_env, clear=True)` correctly isolates the env for the default-path test ✅ - `--no-create-home --shell /usr/sbin/nologin` — no shell noise ✅ - `chmod +x /app/entrypoint.sh` after `chown -R` is correctly ordered ✅ - Type hints and docstrings untouched where they already existed ✅

marcel commented

2026-05-17 17:02:39 +02:00

🛠️ Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

The hardening work is solid and the config is production-grade. One ops documentation gap needs fixing before merge.

Blocker

docs/DEPLOYMENT.md must capture the volume migration step.

The PR description contains:

Existing installs: drop the old root-owned cache volume before starting: docker volume rm familienarchiv_ocr_cache

PR descriptions don't survive as operational runbooks. Anyone running git pull && docker compose up -d in six months — or deploying from scratch using DEPLOYMENT.md — will hit a permission error with no explanation. Add an upgrade notes section to DEPLOYMENT.md with this command and the reason.

What's done well

Volume mount correction: /root/.cache → /app/cache — the old path was broken for non-root, this is the right fix ✅
read_only: true — correct posture for a service that has no legitimate reason to write to the container filesystem ✅
tmpfs /tmp:size=512m with the inline comment explaining why 512 MB — exactly the right documentation style per our conventions ✅
cap_drop: [ALL] — Python/FastAPI on port 8000+ needs no capabilities; this is safe ✅
no-new-privileges: true — prevents any setuid binary from elevating, belt-and-suspenders with cap_drop ✅
HF_HOME in both Dockerfile and compose — Dockerfile sets the build-time default, compose provides the runtime override. Correct layering ✅
UID 1000 for ocr user — consistent with Linux convention and avoids conflict with common host UIDs on Hetzner VPS ✅
Named volumes (ocr_models, ocr_cache) survive container recreation — no data loss on upgrade, correct ✅

Note

The removed inline comment from the volume mount line:

- # Hugging Face / ketos model download cache — prevents re-downloads on container recreate

was useful for ops. Consider adding it back on the new line:

- ocr_cache:/app/cache  # HuggingFace / ketos cache — prevents re-downloads on recreate (HF_HOME)

Minor, not a blocker.

## 🛠️ Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** The hardening work is solid and the config is production-grade. One ops documentation gap needs fixing before merge. ### Blocker **`docs/DEPLOYMENT.md` must capture the volume migration step.** The PR description contains: > **Existing installs:** drop the old root-owned cache volume before starting: `docker volume rm familienarchiv_ocr_cache` PR descriptions don't survive as operational runbooks. Anyone running `git pull && docker compose up -d` in six months — or deploying from scratch using DEPLOYMENT.md — will hit a permission error with no explanation. Add an upgrade notes section to DEPLOYMENT.md with this command and the reason. ### What's done well - **Volume mount correction**: `/root/.cache` → `/app/cache` — the old path was broken for non-root, this is the right fix ✅ - **`read_only: true`** — correct posture for a service that has no legitimate reason to write to the container filesystem ✅ - **`tmpfs /tmp:size=512m`** with the inline comment explaining why 512 MB — exactly the right documentation style per our conventions ✅ - **`cap_drop: [ALL]`** — Python/FastAPI on port 8000+ needs no capabilities; this is safe ✅ - **`no-new-privileges: true`** — prevents any setuid binary from elevating, belt-and-suspenders with cap_drop ✅ - **`HF_HOME` in both Dockerfile and compose** — Dockerfile sets the build-time default, compose provides the runtime override. Correct layering ✅ - **UID 1000 for `ocr` user** — consistent with Linux convention and avoids conflict with common host UIDs on Hetzner VPS ✅ - **Named volumes (`ocr_models`, `ocr_cache`) survive container recreation** — no data loss on upgrade, correct ✅ ### Note The removed inline comment from the volume mount line: ```yaml - # Hugging Face / ketos model download cache — prevents re-downloads on container recreate ``` was useful for ops. Consider adding it back on the new line: ```yaml - ocr_cache:/app/cache # HuggingFace / ketos cache — prevents re-downloads on recreate (HF_HOME) ``` Minor, not a blocker.

marcel commented

2026-05-17 17:02:49 +02:00

📋 Elicit — Requirements Engineer

Verdict: ⚠️ Approved with concerns

Issue #459 (non-root OCR container) is fully addressed. The implementation covers the stated requirements. One non-functional requirement gap needs tracking.

Concern: Undocumented Upgrade NFR

The PR introduces a breaking change for existing installs — the ocr_cache named volume must be deleted before upgrading because its contents were written by root and are now inaccessible to the non-root ocr user. This creates an implicit operational requirement:

NFR-OPS-001 (Upgrade): When upgrading from a pre-#611 deployment, the operator shall delete familienarchiv_ocr_cache before starting the updated container stack, otherwise the OCR service will fail to start with a permission error.

This requirement is currently documented only in the PR description, which is an ephemeral artifact. Once the PR merges, this information is invisible to anyone following the standard upgrade path (git pull && docker compose up -d).

Recommendation: Capture this in docs/DEPLOYMENT.md under upgrade notes. This is a one-time migration step but it is a hard blocker for operators upgrading an existing installation.

Requirements Coverage (✅ complete)

Requirement from #459	Addressed?
Non-root user in container	✅ `ocr` user UID 1000
Correct HOME and HF_HOME	✅ `/home/ocr` and `/app/cache`
HTRMOPO_DIR not derived from `~`	✅ Env var with fixed default
Writable cache survives container recreate	✅ Named volume `ocr_cache:/app/cache`
Observability: detect if accidentally running as root	✅ Root warning in `main.py`
Training endpoints still functional	✅ tmpfs /tmp covers ZIP writes

What's done well

The PR description test plan (7 checklist items) maps directly to the acceptance criteria and is concrete and verifiable. This is the right level of specificity for infrastructure changes.

## 📋 Elicit — Requirements Engineer **Verdict: ⚠️ Approved with concerns** Issue #459 (non-root OCR container) is fully addressed. The implementation covers the stated requirements. One non-functional requirement gap needs tracking. ### Concern: Undocumented Upgrade NFR The PR introduces a **breaking change for existing installs** — the `ocr_cache` named volume must be deleted before upgrading because its contents were written by root and are now inaccessible to the non-root `ocr` user. This creates an implicit operational requirement: > **NFR-OPS-001 (Upgrade):** When upgrading from a pre-#611 deployment, the operator shall delete `familienarchiv_ocr_cache` before starting the updated container stack, otherwise the OCR service will fail to start with a permission error. This requirement is currently documented only in the PR description, which is an ephemeral artifact. Once the PR merges, this information is invisible to anyone following the standard upgrade path (`git pull && docker compose up -d`). **Recommendation:** Capture this in `docs/DEPLOYMENT.md` under upgrade notes. This is a one-time migration step but it is a hard blocker for operators upgrading an existing installation. ### Requirements Coverage (✅ complete) | Requirement from #459 | Addressed? | |---|---| | Non-root user in container | ✅ `ocr` user UID 1000 | | Correct HOME and HF_HOME | ✅ `/home/ocr` and `/app/cache` | | HTRMOPO_DIR not derived from `~` | ✅ Env var with fixed default | | Writable cache survives container recreate | ✅ Named volume `ocr_cache:/app/cache` | | Observability: detect if accidentally running as root | ✅ Root warning in `main.py` | | Training endpoints still functional | ✅ tmpfs /tmp covers ZIP writes | ### What's done well The PR description test plan (7 checklist items) maps directly to the acceptance criteria and is concrete and verifiable. This is the right level of specificity for infrastructure changes.

marcel commented

2026-05-17 17:03:05 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ⚠️ Approved with concerns

This is textbook CIS Docker §4.1 hardening — non-root user, read-only filesystem, dropped capabilities, no-new-privileges, proper tmpfs. The implementation is correct in intent. One runtime concern needs investigation before merge.

Blocker: `/home/ocr` is on the read-only container filesystem

read_only: true makes every path not explicitly mounted as a volume or tmpfs read-only at runtime. The declared writable paths are:

Path	Type	Writable?
`/app/models`	named volume	✅
`/app/cache`	named volume (HF_HOME)	✅
`/tmp`	tmpfs 512 MB	✅
`/home/ocr`	container filesystem	❌ read-only

HOME=/home/ocr is set. Any library that writes runtime config or cache files to ~ will fail with Read-only file system or PermissionError. Known offenders in this stack:

PyTorch: defaults to ~/.cache/torch for model cache (Surya depends on Torch)
Matplotlib (indirect dep via ketos): writes font cache to ~/.cache/matplotlib
Ketos: may write config to ~/.config/ketos

If these writes happen at import time (not build time), the service either crashes on startup or silently degrades. This needs to be verified before merge.

Remediation — pick one:

Option A (preferred): Add XDG_CACHE_HOME to redirect the XDG cache base, which covers Torch, Matplotlib, and many other libraries:

environment:
  HF_HOME: /app/cache
  XDG_CACHE_HOME: /app/cache
  TORCH_HOME: /app/models/torch  # explicit override for Torch

Option B: Mount /home/ocr as a small tmpfs:

tmpfs:
  - /tmp:size=512m
  - /home/ocr:size=64m

Option A is cleaner and doesn't add a tmpfs mount. Option B is the safety net if unknown libraries write to HOME paths you haven't mapped yet.

How to verify (before merging):

docker compose run --rm --read-only \
  --tmpfs /tmp:size=512m \
  ocr-service python -c "import kraken; import surya; print('ok')"

If this command fails with a filesystem error, the concern is confirmed.

What's done well

--no-create-home --shell /usr/sbin/nologin — no shell access, no profile scripts, correct ✅
USER ocr after chown -R — enforces non-root at runtime, not just ownership ✅
read_only: true — limits persistence of any exploitation ✅
cap_drop: [ALL] — port 8000+ requires no capabilities; this is safe and correct ✅
no-new-privileges: true — prevents setuid escalation even if a SUID binary exists in deps ✅
Root canary (os.getuid() == 0) — fail-loud pattern, catches misconfiguration before it becomes a silent vulnerability ✅
HTRMOPO_DIR fix — removes the ~ → / silent failure path that was the original impetus for this PR ✅

CWE reference: CWE-732 (Incorrect Permission Assignment for Critical Resource) — if the read-only restriction silently causes a write failure that is swallowed by a library, it could mask model-loading errors and produce incorrect OCR results rather than a clean failure.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ⚠️ Approved with concerns** This is textbook CIS Docker §4.1 hardening — non-root user, read-only filesystem, dropped capabilities, no-new-privileges, proper tmpfs. The implementation is correct in intent. One runtime concern needs investigation before merge. ### Blocker: `/home/ocr` is on the read-only container filesystem `read_only: true` makes every path not explicitly mounted as a volume or tmpfs read-only at runtime. The declared writable paths are: | Path | Type | Writable? | |---|---|---| | `/app/models` | named volume | ✅ | | `/app/cache` | named volume (HF_HOME) | ✅ | | `/tmp` | tmpfs 512 MB | ✅ | | `/home/ocr` | container filesystem | ❌ read-only | `HOME=/home/ocr` is set. Any library that writes runtime config or cache files to `~` will fail with `Read-only file system` or `PermissionError`. Known offenders in this stack: - **PyTorch**: defaults to `~/.cache/torch` for model cache (Surya depends on Torch) - **Matplotlib** (indirect dep via ketos): writes font cache to `~/.cache/matplotlib` - **Ketos**: may write config to `~/.config/ketos` If these writes happen at import time (not build time), the service either crashes on startup or silently degrades. This needs to be verified before merge. **Remediation — pick one:** Option A (preferred): Add `XDG_CACHE_HOME` to redirect the XDG cache base, which covers Torch, Matplotlib, and many other libraries: ```yaml environment: HF_HOME: /app/cache XDG_CACHE_HOME: /app/cache TORCH_HOME: /app/models/torch # explicit override for Torch ``` Option B: Mount `/home/ocr` as a small tmpfs: ```yaml tmpfs: - /tmp:size=512m - /home/ocr:size=64m ``` Option A is cleaner and doesn't add a tmpfs mount. Option B is the safety net if unknown libraries write to HOME paths you haven't mapped yet. **How to verify (before merging):** ```bash docker compose run --rm --read-only \ --tmpfs /tmp:size=512m \ ocr-service python -c "import kraken; import surya; print('ok')" ``` If this command fails with a filesystem error, the concern is confirmed. --- ### What's done well - `--no-create-home --shell /usr/sbin/nologin` — no shell access, no profile scripts, correct ✅ - `USER ocr` after `chown -R` — enforces non-root at runtime, not just ownership ✅ - `read_only: true` — limits persistence of any exploitation ✅ - `cap_drop: [ALL]` — port 8000+ requires no capabilities; this is safe and correct ✅ - `no-new-privileges: true` — prevents setuid escalation even if a SUID binary exists in deps ✅ - Root canary (`os.getuid() == 0`) — fail-loud pattern, catches misconfiguration before it becomes a silent vulnerability ✅ - `HTRMOPO_DIR` fix — removes the `~` → `/` silent failure path that was the original impetus for this PR ✅ --- **CWE reference:** CWE-732 (Incorrect Permission Assignment for Critical Resource) — if the read-only restriction silently causes a write failure that is swallowed by a library, it could mask model-loading errors and produce incorrect OCR results rather than a clean failure.

marcel commented

2026-05-17 17:03:17 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ⚠️ Approved with concerns

Tests are present, isolation is correct, cleanup reloads are in place. One test assertion is functionally broken and the container-level behavior has no automated coverage.

Blocker: `test_htrmopo_dir_default_is_fixed_path` — second assertion is vacuous

assert "~" not in result           # ✅ meaningful — catches ~ expansion
assert not result.startswith("/.") # ❌ passes trivially

The default value /app/models/.htrmopo starts with /a, not /.. This assertion passes for any path including "", "/wrong", or None (would error first). The test name documents "no-create-home safe" but the assertion doesn't enforce the contract.

Fix:

assert result == "/app/models/.htrmopo"

One equality check is clearer, faster to read, and catches any future change to the default. The "~" not in result check is then implied (unless someone puts a ~ in a hardcoded path, which would also be caught).

Concern: No automated coverage for the root warning in `main.py`

The os.getuid() == 0 → logger.warning(...) path is new behavior with zero test coverage. It's a security diagnostic — missing coverage is low risk operationally, but it's easy to verify and easy to regress:

def test_startup_warns_when_running_as_root(caplog):
    with patch("main.os.getuid", return_value=0):
        with caplog.at_level(logging.WARNING):
            # trigger the check — can call the relevant setup function directly
            ...
    assert "Running as root" in caplog.text

def test_startup_does_not_warn_when_running_as_non_root(caplog):
    with patch("main.os.getuid", return_value=1000):
        with caplog.at_level(logging.WARNING):
            ...
    assert "Running as root" not in caplog.text

Suggestion: Turn manual step 2 of the test plan into a CI gate

The PR test plan includes: docker exec archive-ocr id → uid=1000(ocr). This is a manual verification that can be automated:

# In CI, after building the image:
- run: docker build -t ocr-service:ci ./ocr-service
- run: |
    docker run --rm ocr-service:ci id | grep -q "uid=1000(ocr)" \
      && echo "Non-root user OK" \
      || (echo "Container running as wrong user!" && exit 1)

If someone accidentally removes USER ocr from the Dockerfile in a future PR, this catches it automatically rather than relying on a manual check.

What's done well

importlib.reload() with cleanup reload — correct pattern for testing module-level constants ✅
patch.dict(os.environ, clean_env, clear=True) for the default-path test — clean env isolation ✅
Test names are descriptive sentences ✅
No Thread.sleep(), no ordering dependencies between tests ✅

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ⚠️ Approved with concerns** Tests are present, isolation is correct, cleanup reloads are in place. One test assertion is functionally broken and the container-level behavior has no automated coverage. ### Blocker: `test_htrmopo_dir_default_is_fixed_path` — second assertion is vacuous ```python assert "~" not in result # ✅ meaningful — catches ~ expansion assert not result.startswith("/.") # ❌ passes trivially ``` The default value `/app/models/.htrmopo` starts with `/a`, not `/.`. This assertion passes for any path including `""`, `"/wrong"`, or `None` (would error first). The test name documents "no-create-home safe" but the assertion doesn't enforce the contract. **Fix:** ```python assert result == "/app/models/.htrmopo" ``` One equality check is clearer, faster to read, and catches any future change to the default. The `"~" not in result` check is then implied (unless someone puts a `~` in a hardcoded path, which would also be caught). ### Concern: No automated coverage for the root warning in `main.py` The `os.getuid() == 0` → `logger.warning(...)` path is new behavior with zero test coverage. It's a security diagnostic — missing coverage is low risk operationally, but it's easy to verify and easy to regress: ```python def test_startup_warns_when_running_as_root(caplog): with patch("main.os.getuid", return_value=0): with caplog.at_level(logging.WARNING): # trigger the check — can call the relevant setup function directly ... assert "Running as root" in caplog.text def test_startup_does_not_warn_when_running_as_non_root(caplog): with patch("main.os.getuid", return_value=1000): with caplog.at_level(logging.WARNING): ... assert "Running as root" not in caplog.text ``` ### Suggestion: Turn manual step 2 of the test plan into a CI gate The PR test plan includes: `docker exec archive-ocr id` → `uid=1000(ocr)`. This is a manual verification that can be automated: ```yaml # In CI, after building the image: - run: docker build -t ocr-service:ci ./ocr-service - run: | docker run --rm ocr-service:ci id | grep -q "uid=1000(ocr)" \ && echo "Non-root user OK" \ || (echo "Container running as wrong user!" && exit 1) ``` If someone accidentally removes `USER ocr` from the Dockerfile in a future PR, this catches it automatically rather than relying on a manual check. ### What's done well - `importlib.reload()` with cleanup reload — correct pattern for testing module-level constants ✅ - `patch.dict(os.environ, clean_env, clear=True)` for the default-path test — clean env isolation ✅ - Test names are descriptive sentences ✅ - No `Thread.sleep()`, no ordering dependencies between tests ✅

marcel commented

2026-05-17 17:03:20 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

No .svelte files, no CSS, no frontend components, no routes, no i18n strings. This PR touches only the OCR container configuration and one Python module-level constant.

The only user-facing output is the startup log line "Running as root — CIS Docker §4.1 violation" — this is an operator/developer message that never reaches end users, so no UX consideration applies.

LGTM from a UI/UX and accessibility perspective.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** No `.svelte` files, no CSS, no frontend components, no routes, no i18n strings. This PR touches only the OCR container configuration and one Python module-level constant. The only user-facing output is the startup log line `"Running as root — CIS Docker §4.1 violation"` — this is an operator/developer message that never reaches end users, so no UX consideration applies. LGTM from a UI/UX and accessibility perspective.

marcel referenced this issue from a commit

2026-05-17 17:33:58 +02:00

docs: add XDG_CACHE_HOME/TORCH_HOME to OCR env table and upgrade notes for PR #611

marcel added 5 commits 2026-05-17 17:33:58 +02:00

test(ocr): replace vacuous startswith assertion with equality check 53bd574660

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(ocr): add startup root canary tests for main.py lifespan eb63df2000

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

security(ocr): redirect XDG cache and Torch home away from read-only HOME fc8b4b164b

Prevents PyTorch/Matplotlib/Ketos from writing to /home/ocr which is
on the read-only container filesystem — fixes Nora's blocker. Also
restores the explanatory comment on the ocr_cache volume mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: add XDG_CACHE_HOME/TORCH_HOME to OCR env table and upgrade notes for PR #611 38973a014e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(adr): ADR-019 — container hardening baseline (non-root + read-only)

CI / Unit & Component Tests (pull_request) Successful in 3m2s

Details

CI / OCR Service Tests (pull_request) Successful in 18s

Details

CI / Backend Unit Tests (pull_request) Successful in 3m11s

Details

CI / fail2ban Regex (pull_request) Successful in 43s

Details

CI / Semgrep Security Scan (pull_request) Successful in 17s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 59s

Details

74ca5ee35f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-17 17:34:22 +02:00

Review concerns addressed

All open blockers and suggestions from the review have been resolved. Summary:

✅ @Felix / @Sara — Blocker: vacuous assertion in `test_htrmopo_dir_default_is_fixed_path`

Replaced assert not result.startswith("/.") (passes trivially — the default path starts with /a) with assert result == "/app/models/.htrmopo", which enforces the exact contract.

→ commit 53bd5746 — test(ocr): replace vacuous startswith assertion with equality check

✅ @Felix / @Sara — Minor/Concern: no test for root canary in `main.py`

Added test_main.py with two tests:

test_startup_logs_warning_when_running_as_root — patches os.getuid → 0, asserts "Running as root" in caplog
test_startup_does_not_warn_when_running_as_non_root — patches os.getuid → 1000, asserts warning absent

Both follow the ASGITransport/caplog pattern established in test_stream.py and test_training_auth.py; they require full deps (fastapi, pypdfium2) so they run in CI only, consistent with the existing main-importing tests.

→ commit eb63df20 — test(ocr): add startup root canary tests for main.py lifespan

✅ @Nora — Blocker: `/home/ocr` is on the read-only container filesystem

Added XDG_CACHE_HOME=/app/cache and TORCH_HOME=/app/models/torch — Option A from Nora's remediation. This redirects PyTorch (~/.cache/torch), Matplotlib (~/.cache/matplotlib), Ketos and any other XDG-aware library away from the read-only /home/ocr to the named writable volumes. Set as ENV defaults in the Dockerfile (same layering as HF_HOME) and confirmed in the compose environment block. Also restored the explanatory comment on the ocr_cache volume mount line.

→ commit fc8b4b16 — security(ocr): redirect XDG cache and Torch home away from read-only HOME

✅ @Markus / @Tobias / @Elicit — Blocker: volume migration step not documented

Added §8 Upgrade notes to docs/DEPLOYMENT.md with the one-time docker volume rm familienarchiv_ocr_cache step, the reason (root-written volume inaccessible to non-root user), and a note on the re-download cost. Also added XDG_CACHE_HOME and TORCH_HOME to the OCR env var table in §2 (required by the doc's own policy).

→ commit 38973a01 — docs: add XDG_CACHE_HOME/TORCH_HOME to OCR env table and upgrade notes for PR #611

✅ @Markus — Suggestion: ADR for container hardening baseline

Created docs/adr/019-container-hardening-baseline.md covering: Context (pre-baseline risks), Decision (5-point baseline: non-root user, read-only FS, per-path carve-outs, cap_drop + no-new-privileges, startup canary), and Consequences (what it prevents, operational cost for future contributors adding new write paths, upgrade note pointer).

→ commit 74ca5ee3 — docs(adr): ADR-019 — container hardening baseline (non-root + read-only)

Deferred

@Sara — Suggestion: CI gate for uid=1000 check — this requires changes to the Gitea Actions workflow. Captured as a follow-up; not a blocker for this PR.

## Review concerns addressed All open blockers and suggestions from the review have been resolved. Summary: --- ### ✅ @Felix / @Sara — Blocker: vacuous assertion in `test_htrmopo_dir_default_is_fixed_path` Replaced `assert not result.startswith("/.")` (passes trivially — the default path starts with `/a`) with `assert result == "/app/models/.htrmopo"`, which enforces the exact contract. → commit `53bd5746` — `test(ocr): replace vacuous startswith assertion with equality check` --- ### ✅ @Felix / @Sara — Minor/Concern: no test for root canary in `main.py` Added `test_main.py` with two tests: - `test_startup_logs_warning_when_running_as_root` — patches `os.getuid` → 0, asserts "Running as root" in caplog - `test_startup_does_not_warn_when_running_as_non_root` — patches `os.getuid` → 1000, asserts warning absent Both follow the `ASGITransport`/`caplog` pattern established in `test_stream.py` and `test_training_auth.py`; they require full deps (fastapi, pypdfium2) so they run in CI only, consistent with the existing main-importing tests. → commit `eb63df20` — `test(ocr): add startup root canary tests for main.py lifespan` --- ### ✅ @Nora — Blocker: `/home/ocr` is on the read-only container filesystem Added `XDG_CACHE_HOME=/app/cache` and `TORCH_HOME=/app/models/torch` — Option A from Nora's remediation. This redirects PyTorch (`~/.cache/torch`), Matplotlib (`~/.cache/matplotlib`), Ketos and any other XDG-aware library away from the read-only `/home/ocr` to the named writable volumes. Set as `ENV` defaults in the Dockerfile (same layering as `HF_HOME`) and confirmed in the compose environment block. Also restored the explanatory comment on the `ocr_cache` volume mount line. → commit `fc8b4b16` — `security(ocr): redirect XDG cache and Torch home away from read-only HOME` --- ### ✅ @Markus / @Tobias / @Elicit — Blocker: volume migration step not documented Added `§8 Upgrade notes` to `docs/DEPLOYMENT.md` with the one-time `docker volume rm familienarchiv_ocr_cache` step, the reason (root-written volume inaccessible to non-root user), and a note on the re-download cost. Also added `XDG_CACHE_HOME` and `TORCH_HOME` to the OCR env var table in §2 (required by the doc's own policy). → commit `38973a01` — `docs: add XDG_CACHE_HOME/TORCH_HOME to OCR env table and upgrade notes for PR #611` --- ### ✅ @Markus — Suggestion: ADR for container hardening baseline Created `docs/adr/019-container-hardening-baseline.md` covering: Context (pre-baseline risks), Decision (5-point baseline: non-root user, read-only FS, per-path carve-outs, cap_drop + no-new-privileges, startup canary), and Consequences (what it prevents, operational cost for future contributors adding new write paths, upgrade note pointer). → commit `74ca5ee3` — `docs(adr): ADR-019 — container hardening baseline (non-root + read-only)` --- ### Deferred - **@Sara — Suggestion: CI gate for `uid=1000` check** — this requires changes to the Gitea Actions workflow. Captured as a follow-up; not a blocker for this PR.

marcel commented

2026-05-17 17:39:24 +02:00

🏗️ Markus Keller — Senior Application Architect

Verdict: ⚠️ Approved with concerns

This PR is a self-contained infrastructure hardening of an existing sidecar. No layer boundaries are crossed, no domain logic is touched, no new services are introduced. The architectural posture is correct.

Blocker

docker-compose.prod.yml is standalone and has not been updated.

I checked the prod compose file. Its header says explicitly:

"This is a self-contained file (not an overlay over docker-compose.yml)."

That means every change made to docker-compose.yml must be mirrored in docker-compose.prod.yml — they share no config. The prod OCR service still has:

volumes:
  - ocr-cache:/root/.cache   # old path — ocr user cannot use this as HF_HOME
environment:
  # HF_HOME, XDG_CACHE_HOME, TORCH_HOME — none of these are set
# read_only, cap_drop, no-new-privileges — none of these are set

The ENV HF_HOME=/app/cache in the Dockerfile redirects HuggingFace to /app/cache at runtime. In production, /app/cache is not a named volume — it lives in the container filesystem, which means model cache is lost on every container restart. The old ocr-cache volume at /root/.cache is mounted but unused. The entire security hardening also doesn't exist in production.

This is the actual deployment path for the feature. The security improvements merged here have zero effect in production until the prod compose is updated.

Suggestion

ADR-019 documents the decision well. The five-point baseline (non-root, read-only FS, carve-outs, cap_drop, startup canary) is the right structure. The Consequences section correctly flags what future contributors must do when adding a new write path — this is exactly the kind of forward-looking constraint that belongs in an ADR.

One small gap: the ADR doesn't mention XDG_CACHE_HOME/TORCH_HOME as part of the decision. These are a consequence of read_only: true combined with the way Python ML libraries behave, and future contributors will need to know this pattern. Consider adding a line in "3. Per-path write carve-outs" noting that library-specific cache env vars (XDG_CACHE_HOME, TORCH_HOME, HF_HOME) are part of the carve-out strategy, not just volume/tmpfs mounts.

What's done well

Self-contained sidecar change with no upstream coupling ✅
ADR committed before PR merges ✅
DEPLOYMENT.md updated with both env var table and upgrade notes ✅
Volume carve-outs match the actual write paths (models, cache, /tmp) ✅

## 🏗️ Markus Keller — Senior Application Architect **Verdict: ⚠️ Approved with concerns** This PR is a self-contained infrastructure hardening of an existing sidecar. No layer boundaries are crossed, no domain logic is touched, no new services are introduced. The architectural posture is correct. ### Blocker **`docker-compose.prod.yml` is standalone and has not been updated.** I checked the prod compose file. Its header says explicitly: > "This is a self-contained file (not an overlay over docker-compose.yml)." That means every change made to `docker-compose.yml` must be mirrored in `docker-compose.prod.yml` — they share no config. The prod OCR service still has: ```yaml volumes: - ocr-cache:/root/.cache # old path — ocr user cannot use this as HF_HOME environment: # HF_HOME, XDG_CACHE_HOME, TORCH_HOME — none of these are set # read_only, cap_drop, no-new-privileges — none of these are set ``` The `ENV HF_HOME=/app/cache` in the Dockerfile redirects HuggingFace to `/app/cache` at runtime. In production, `/app/cache` is not a named volume — it lives in the container filesystem, which means model cache is lost on every container restart. The old `ocr-cache` volume at `/root/.cache` is mounted but unused. The entire security hardening also doesn't exist in production. This is the actual deployment path for the feature. The security improvements merged here have zero effect in production until the prod compose is updated. ### Suggestion **ADR-019 documents the decision well.** The five-point baseline (non-root, read-only FS, carve-outs, cap_drop, startup canary) is the right structure. The Consequences section correctly flags what future contributors must do when adding a new write path — this is exactly the kind of forward-looking constraint that belongs in an ADR. One small gap: the ADR doesn't mention `XDG_CACHE_HOME`/`TORCH_HOME` as part of the decision. These are a consequence of `read_only: true` combined with the way Python ML libraries behave, and future contributors will need to know this pattern. Consider adding a line in "3. Per-path write carve-outs" noting that library-specific cache env vars (`XDG_CACHE_HOME`, `TORCH_HOME`, `HF_HOME`) are part of the carve-out strategy, not just volume/tmpfs mounts. ### What's done well - Self-contained sidecar change with no upstream coupling ✅ - ADR committed before PR merges ✅ - DEPLOYMENT.md updated with both env var table and upgrade notes ✅ - Volume carve-outs match the actual write paths (models, cache, /tmp) ✅

marcel commented

2026-05-17 17:39:36 +02:00

🛠️ Tobias Wendt — DevOps & Platform Engineer

Verdict: 🚫 Changes requested

The dev hardening is production-grade. The problem is it never reaches production.

Blocker: `docker-compose.prod.yml` is standalone and not updated

The prod compose header says explicitly:

"This is a self-contained file (not an overlay over docker-compose.yml)."

I read the prod OCR service config. It has:

volumes:
  - ocr-cache:/root/.cache   # ← old mount path, not /app/cache
environment:
  KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel
  TRAINING_TOKEN: ${OCR_TRAINING_TOKEN}
  # HF_HOME, XDG_CACHE_HOME, TORCH_HOME — missing
# read_only, tmpfs, cap_drop, no-new-privileges — missing

The Dockerfile now runs as ocr user with ENV HF_HOME=/app/cache. In production:

The volume ocr-cache is mounted at /root/.cache — but HF_HOME points to /app/cache, so the volume is unused
/app/cache is just a container filesystem directory (not a volume) — model cache is lost on every restart
None of the security hardening (read_only, cap_drop, etc.) applies to production

Required fix: update docker-compose.prod.yml ocr-service with identical env vars and security settings as docker-compose.yml.

Blocker: Upgrade notes reference the wrong volume name for production

The §8 Upgrade notes instruct operators to run:

docker volume rm familienarchiv_ocr_cache

In development that's correct (project name = familienarchiv, volume name = familienarchiv_ocr_cache).

In production the project name is archiv-production, so the volume is archiv-production_ocr-cache (note also the hyphen vs underscore — prod volumes use hyphens). The upgrade notes need to cover both environments or use docker volume ls | grep ocr so operators find the right name.

Minor: `cap_drop: [ALL]` — flow notation

The rest of the compose file uses block sequence notation for arrays. The cap_drop line uses inline flow notation:

cap_drop: [ALL]              # inline — inconsistent

Preferred for consistency:

cap_drop:
  - ALL

Not a blocker but will look odd to the next person touching the file.

What's done well

/tmp:size=512m with the inline comment explaining why 512 MB — correct documentation style ✅
HF_HOME in both Dockerfile (build-time default) and compose (runtime confirmation) — correct layering ✅
Named volumes (ocr_models, ocr_cache) survive container recreation ✅
XDG_CACHE_HOME/TORCH_HOME env vars solve Nora's /home/ocr read-only concern cleanly ✅
Upgrade notes section is the right place for this content ✅

## 🛠️ Tobias Wendt — DevOps & Platform Engineer **Verdict: 🚫 Changes requested** The dev hardening is production-grade. The problem is it never reaches production. ### Blocker: `docker-compose.prod.yml` is standalone and not updated The prod compose header says explicitly: > "This is a self-contained file (not an overlay over docker-compose.yml)." I read the prod OCR service config. It has: ```yaml volumes: - ocr-cache:/root/.cache # ← old mount path, not /app/cache environment: KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel TRAINING_TOKEN: ${OCR_TRAINING_TOKEN} # HF_HOME, XDG_CACHE_HOME, TORCH_HOME — missing # read_only, tmpfs, cap_drop, no-new-privileges — missing ``` The Dockerfile now runs as `ocr` user with `ENV HF_HOME=/app/cache`. In production: 1. The volume `ocr-cache` is mounted at `/root/.cache` — but `HF_HOME` points to `/app/cache`, so the volume is unused 2. `/app/cache` is just a container filesystem directory (not a volume) — model cache is lost on every restart 3. None of the security hardening (`read_only`, `cap_drop`, etc.) applies to production **Required fix:** update `docker-compose.prod.yml` ocr-service with identical env vars and security settings as `docker-compose.yml`. ### Blocker: Upgrade notes reference the wrong volume name for production The `§8 Upgrade notes` instruct operators to run: ```bash docker volume rm familienarchiv_ocr_cache ``` In development that's correct (project name = `familienarchiv`, volume name = `familienarchiv_ocr_cache`). In production the project name is `archiv-production`, so the volume is `archiv-production_ocr-cache` (note also the hyphen vs underscore — prod volumes use hyphens). The upgrade notes need to cover both environments or use `docker volume ls | grep ocr` so operators find the right name. ### Minor: `cap_drop: [ALL]` — flow notation The rest of the compose file uses block sequence notation for arrays. The `cap_drop` line uses inline flow notation: ```yaml cap_drop: [ALL] # inline — inconsistent ``` Preferred for consistency: ```yaml cap_drop: - ALL ``` Not a blocker but will look odd to the next person touching the file. ### What's done well - `/tmp:size=512m` with the inline comment explaining why 512 MB — correct documentation style ✅ - `HF_HOME` in both Dockerfile (build-time default) and compose (runtime confirmation) — correct layering ✅ - Named volumes (`ocr_models`, `ocr_cache`) survive container recreation ✅ - `XDG_CACHE_HOME`/`TORCH_HOME` env vars solve Nora's `/home/ocr` read-only concern cleanly ✅ - Upgrade notes section is the right place for this content ✅

marcel commented

2026-05-17 17:39:55 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: 🚫 Changes requested

The security controls are correct and well-implemented in the dev compose. The problem is that docker-compose.prod.yml is standalone (not an overlay) and has not been updated — so nothing in this PR's security hardening actually runs in production.

Blocker: Security hardening absent from prod compose

docker-compose.prod.yml is self-contained. Its ocr-service block has neither the volume path fix nor any of the security directives:

# docker-compose.prod.yml (current, unchanged by this PR)
volumes:
  - ocr-cache:/root/.cache    # old path — broken under non-root user
# missing: read_only, tmpfs, cap_drop, no-new-privileges
# missing: HF_HOME, XDG_CACHE_HOME, TORCH_HOME

After this PR merges, the image built for production will run as ocr (UID 1000, from the Dockerfile), but:

HF_HOME defaults to /app/cache (Dockerfile ENV), yet the volume is mounted at /root/.cache — the non-root user can write to /root/.cache (it's a named volume owned by root, but any user can write to a Docker volume unless it's specifically restricted) but HuggingFace won't look there
/app/cache (where HF actually looks) is an ephemeral container filesystem path — model re-download on every restart
read_only, cap_drop: [ALL], no-new-privileges — absent in prod

The CIS Docker §4.1 requirement (non-root user) is met at the image layer. The §4.6 capability drop and filesystem restrictions are not.

Concern: `XDG_CONFIG_HOME` and `XDG_DATA_HOME` still default to `~/.config` and `~/.local/share`

The PR correctly sets:

HF_HOME=/app/cache ✅
XDG_CACHE_HOME=/app/cache ✅
TORCH_HOME=/app/models/torch ✅

Not set:

XDG_CONFIG_HOME — defaults to /home/ocr/.config (read-only in dev)
XDG_DATA_HOME — defaults to /home/ocr/.local/share (read-only in dev)

Known risk: ketos stores config in ~/.config/ketos on first run. Matplotlib writes ~/.config/matplotlib/matplotlibrc. If these writes happen at import time or on first use, the service may crash with ReadOnlyFileSystem or silently produce incorrect output.

Remediation: add to both compose files and Dockerfile:

XDG_CONFIG_HOME: /tmp/.config    # or /app/cache/.config if persistence is needed
XDG_DATA_HOME: /app/cache/.local

/tmp is tmpfs (512 MB) — sufficient for ephemeral config writes. /app/cache is the named volume if you want persistence.

This is a concern rather than a confirmed blocker, but given read_only: true in dev, the first run of ketos after this PR may reveal it.

What's done well

Non-root user with --no-create-home --shell /usr/sbin/nologin — no shell access, no profile scripts ✅
read_only: true — any unexpected write path fails immediately and visibly ✅
cap_drop: [ALL] — correct for port 8000+, safe ✅
no-new-privileges: true — belt-and-suspenders with cap_drop ✅
Startup root canary — correct fail-loud pattern ✅
HTRMOPO_DIR fix removes the ~ → / silent failure path ✅
XDG_CACHE_HOME/TORCH_HOME address the PyTorch/Matplotlib read-only HOME concern ✅

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: 🚫 Changes requested** The security controls are correct and well-implemented in the dev compose. The problem is that `docker-compose.prod.yml` is standalone (not an overlay) and has not been updated — so nothing in this PR's security hardening actually runs in production. ### Blocker: Security hardening absent from prod compose `docker-compose.prod.yml` is self-contained. Its ocr-service block has neither the volume path fix nor any of the security directives: ```yaml # docker-compose.prod.yml (current, unchanged by this PR) volumes: - ocr-cache:/root/.cache # old path — broken under non-root user # missing: read_only, tmpfs, cap_drop, no-new-privileges # missing: HF_HOME, XDG_CACHE_HOME, TORCH_HOME ``` After this PR merges, the image built for production will run as `ocr` (UID 1000, from the Dockerfile), but: 1. `HF_HOME` defaults to `/app/cache` (Dockerfile ENV), yet the volume is mounted at `/root/.cache` — the non-root user can write to `/root/.cache` (it's a named volume owned by root, but any user can write to a Docker volume unless it's specifically restricted) but HuggingFace won't look there 2. `/app/cache` (where HF actually looks) is an ephemeral container filesystem path — model re-download on every restart 3. `read_only`, `cap_drop: [ALL]`, `no-new-privileges` — absent in prod The CIS Docker §4.1 requirement (non-root user) is met at the image layer. The §4.6 capability drop and filesystem restrictions are not. ### Concern: `XDG_CONFIG_HOME` and `XDG_DATA_HOME` still default to `~/.config` and `~/.local/share` The PR correctly sets: - `HF_HOME=/app/cache` ✅ - `XDG_CACHE_HOME=/app/cache` ✅ - `TORCH_HOME=/app/models/torch` ✅ Not set: - `XDG_CONFIG_HOME` — defaults to `/home/ocr/.config` (read-only in dev) - `XDG_DATA_HOME` — defaults to `/home/ocr/.local/share` (read-only in dev) Known risk: ketos stores config in `~/.config/ketos` on first run. Matplotlib writes `~/.config/matplotlib/matplotlibrc`. If these writes happen at import time or on first use, the service may crash with `ReadOnlyFileSystem` or silently produce incorrect output. **Remediation:** add to both compose files and Dockerfile: ```yaml XDG_CONFIG_HOME: /tmp/.config # or /app/cache/.config if persistence is needed XDG_DATA_HOME: /app/cache/.local ``` `/tmp` is tmpfs (512 MB) — sufficient for ephemeral config writes. `/app/cache` is the named volume if you want persistence. This is a concern rather than a confirmed blocker, but given `read_only: true` in dev, the first run of ketos after this PR may reveal it. ### What's done well - Non-root user with `--no-create-home --shell /usr/sbin/nologin` — no shell access, no profile scripts ✅ - `read_only: true` — any unexpected write path fails immediately and visibly ✅ - `cap_drop: [ALL]` — correct for port 8000+, safe ✅ - `no-new-privileges: true` — belt-and-suspenders with cap_drop ✅ - Startup root canary — correct fail-loud pattern ✅ - `HTRMOPO_DIR` fix removes the `~` → `/` silent failure path ✅ - `XDG_CACHE_HOME`/`TORCH_HOME` address the PyTorch/Matplotlib read-only HOME concern ✅

marcel commented

2026-05-17 17:40:07 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

TDD evidence is present and the implementation code is clean. The previous review cycle addressed all blockers.

What's verified

test_ensure_blla_model.py

test_htrmopo_dir_default_is_fixed_path — assertion is now assert result == "/app/models/.htrmopo". Exact equality, not the previous vacuous startswith check. One meaningful assertion. ✅
test_htrmopo_dir_reads_from_env_var — assert result == "/custom/htrmopo" round-trip is confirmed. ✅
importlib.reload() + cleanup reload pattern is the correct way to test module-level constants. ✅
patch.dict(os.environ, clean_env, clear=True) correctly isolates the default-path test from the host env. ✅

test_main.py

Two tests, each with one assertion — correct Arrange-Act-Assert structure. ✅
@pytest.mark.asyncio present — required by asyncio mode=STRICT. ✅
Mocks are minimal: kraken_engine.load_models, load_spell_checker — exactly what's needed to let the lifespan run without heavy deps. ✅
caplog.at_level(logging.WARNING, logger="main") — scoped to the right logger and level. ✅
Tests are CI-only (require fastapi/pypdfium2 full deps) — consistent with the established pattern of test_stream.py and test_training_auth.py. ✅

ensure_blla_model.py

os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") — clean one-liner, no side effects. ✅
Fixed path not derived from ~. ✅

main.py

if os.getuid() == 0: logger.warning(...) — two lines, single responsibility, uses SLF4J-style parameterized logging pattern. ✅
No new behavior added to the happy path. ✅

Dockerfile

RUN useradd ... && mkdir -p ... && chown -R ocr:ocr /app /home/ocr then RUN chmod +x /app/entrypoint.sh — ordering is correct: chown before chmod, both run as root before USER ocr. ✅
ENV block after chown and before USER ocr — correct placement. ✅

Note

The production compose is flagged by Tobias and Markus as a blocker. That's infrastructure, not code quality — no action required on this review.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** TDD evidence is present and the implementation code is clean. The previous review cycle addressed all blockers. ### What's verified **`test_ensure_blla_model.py`** - `test_htrmopo_dir_default_is_fixed_path` — assertion is now `assert result == "/app/models/.htrmopo"`. Exact equality, not the previous vacuous `startswith` check. One meaningful assertion. ✅ - `test_htrmopo_dir_reads_from_env_var` — `assert result == "/custom/htrmopo"` round-trip is confirmed. ✅ - `importlib.reload()` + cleanup reload pattern is the correct way to test module-level constants. ✅ - `patch.dict(os.environ, clean_env, clear=True)` correctly isolates the default-path test from the host env. ✅ **`test_main.py`** - Two tests, each with one assertion — correct Arrange-Act-Assert structure. ✅ - `@pytest.mark.asyncio` present — required by asyncio mode=STRICT. ✅ - Mocks are minimal: `kraken_engine.load_models`, `load_spell_checker` — exactly what's needed to let the lifespan run without heavy deps. ✅ - `caplog.at_level(logging.WARNING, logger="main")` — scoped to the right logger and level. ✅ - Tests are CI-only (require fastapi/pypdfium2 full deps) — consistent with the established pattern of `test_stream.py` and `test_training_auth.py`. ✅ **`ensure_blla_model.py`** - `os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")` — clean one-liner, no side effects. ✅ - Fixed path not derived from `~`. ✅ **`main.py`** - `if os.getuid() == 0: logger.warning(...)` — two lines, single responsibility, uses SLF4J-style parameterized logging pattern. ✅ - No new behavior added to the happy path. ✅ **Dockerfile** - `RUN useradd ... && mkdir -p ... && chown -R ocr:ocr /app /home/ocr` then `RUN chmod +x /app/entrypoint.sh` — ordering is correct: chown before chmod, both run as root before `USER ocr`. ✅ - `ENV` block after `chown` and before `USER ocr` — correct placement. ✅ ### Note The production compose is flagged by Tobias and Markus as a blocker. That's infrastructure, not code quality — no action required on this review.

marcel commented

2026-05-17 17:40:16 +02:00

🧪 Sara Holt — Senior QA Engineer

Verdict: ✅ Approved

Test quality is significantly improved over the first review cycle. The previous assertion blocker is resolved.

What's verified

test_htrmopo_dir_default_is_fixed_path — fixed correctly:

assert result == "/app/models/.htrmopo"

Single exact equality check. The name ("default is fixed path") and the assertion agree. If the implementation returns anything other than this exact string, the test fails. Clean. ✅

test_htrmopo_dir_reads_from_env_var — unchanged, already correct:

assert result == "/custom/htrmopo"

Round-trip verification. ✅

test_main.py — new root canary tests:

Both tests follow the same structure as the existing test_stream.py/test_training_auth.py CI-only pattern. They require full OCR deps (fastapi, pypdfium2) to run, which is consistent with the established test infrastructure. In CI where all deps are installed, both tests are meaningful:

test_startup_logs_warning_when_running_as_root — would fail if the if os.getuid() == 0: check was removed ✅
test_startup_does_not_warn_when_running_as_non_root — covers the negative path ✅

Both use @pytest.mark.asyncio (STRICT mode requirement met) and mock only what's necessary. ✅

Remaining deferred

CI gate for uid=1000 check — explicitly deferred in the previous review summary. Noted. A follow-up issue would be the right place to track this.

One observation

The caplog.at_level(logging.WARNING, logger="main") is scoped to the "main" logger. The assertion is assert "Running as root" in caplog.text. This is correct — caplog.text contains all captured log records, and since the logger is main (which is logging.getLogger(__name__) in main.py), the scoping is accurate.

One edge case: if a second test runs and the _models_ready global is already True in the main module, the lifespan still runs in sequence and the warning is still emitted. The shared module state doesn't affect these tests. ✅

## 🧪 Sara Holt — Senior QA Engineer **Verdict: ✅ Approved** Test quality is significantly improved over the first review cycle. The previous assertion blocker is resolved. ### What's verified **`test_htrmopo_dir_default_is_fixed_path`** — fixed correctly: ```python assert result == "/app/models/.htrmopo" ``` Single exact equality check. The name ("default is fixed path") and the assertion agree. If the implementation returns anything other than this exact string, the test fails. Clean. ✅ **`test_htrmopo_dir_reads_from_env_var`** — unchanged, already correct: ```python assert result == "/custom/htrmopo" ``` Round-trip verification. ✅ **`test_main.py`** — new root canary tests: Both tests follow the same structure as the existing `test_stream.py`/`test_training_auth.py` CI-only pattern. They require full OCR deps (fastapi, pypdfium2) to run, which is consistent with the established test infrastructure. In CI where all deps are installed, both tests are meaningful: - `test_startup_logs_warning_when_running_as_root` — would fail if the `if os.getuid() == 0:` check was removed ✅ - `test_startup_does_not_warn_when_running_as_non_root` — covers the negative path ✅ Both use `@pytest.mark.asyncio` (STRICT mode requirement met) and mock only what's necessary. ✅ ### Remaining deferred **CI gate for `uid=1000` check** — explicitly deferred in the previous review summary. Noted. A follow-up issue would be the right place to track this. ### One observation The `caplog.at_level(logging.WARNING, logger="main")` is scoped to the `"main"` logger. The assertion is `assert "Running as root" in caplog.text`. This is correct — `caplog.text` contains all captured log records, and since the logger is `main` (which is `logging.getLogger(__name__)` in `main.py`), the scoping is accurate. One edge case: if a second test runs and the `_models_ready` global is already True in the `main` module, the lifespan still runs in sequence and the warning is still emitted. The shared module state doesn't affect these tests. ✅

marcel commented

2026-05-17 17:40:29 +02:00

📋 Elicit — Requirements Engineer

Verdict: 🚫 Changes requested

Issue #459 requirements are met. One non-functional requirement gap has been introduced.

Blocker: Production upgrade path is broken

docker-compose.prod.yml is standalone (self-contained, confirmed by its own header comment). This PR updates only docker-compose.yml. As a result:

NFR-OPS-001 (from the previous review cycle): "The operator shall delete familienarchiv_ocr_cache before starting the updated container stack."

This NFR is documented in DEPLOYMENT.md §8 — but only for development. In production:

The volume name is archiv-production_ocr-cache (not familienarchiv_ocr_cache) — different project name, different separator (hyphens not underscores)
The upgrade notes don't distinguish between dev and prod
Even after the volume drop, the prod compose would still mount it at /root/.cache with no HF_HOME env var — HuggingFace would write to /app/cache (Dockerfile ENV default), which is an ephemeral container path in prod, so every restart re-downloads models

Requirement gap: The upgrade notes document the dev case. The production case is both different and broken.

Requirements coverage check (updated)

Requirement from #459	Dev?	Prod?
Non-root user in container	✅ Dockerfile	✅ Dockerfile (shared image)
Correct HOME and HF_HOME	✅ Dockerfile + compose	⚠️ Dockerfile only — prod compose missing `HF_HOME` override
HTRMOPO_DIR not derived from `~`	✅	✅ (code change)
Writable cache survives container recreate	✅ named volume `ocr_cache:/app/cache`	❌ prod still mounts `ocr-cache:/root/.cache` (wrong path)
Observability: detect if accidentally running as root	✅	✅ (code change)
CIS Docker §4.6: cap_drop + no-new-privileges	✅ dev compose	❌ missing from prod compose
read_only filesystem	✅ dev compose	❌ missing from prod compose

What's done well

Upgrade notes section in DEPLOYMENT.md establishes the right pattern for future migration steps ✅
Requirements coverage is complete in the dev environment ✅
ADR-019 captures the hardening baseline as a durable architectural record ✅

## 📋 Elicit — Requirements Engineer **Verdict: 🚫 Changes requested** Issue #459 requirements are met. One non-functional requirement gap has been introduced. ### Blocker: Production upgrade path is broken `docker-compose.prod.yml` is standalone (self-contained, confirmed by its own header comment). This PR updates only `docker-compose.yml`. As a result: **NFR-OPS-001 (from the previous review cycle):** "The operator shall delete `familienarchiv_ocr_cache` before starting the updated container stack." This NFR is documented in `DEPLOYMENT.md §8` — but only for development. In production: 1. The volume name is `archiv-production_ocr-cache` (not `familienarchiv_ocr_cache`) — different project name, different separator (hyphens not underscores) 2. The upgrade notes don't distinguish between dev and prod 3. Even after the volume drop, the prod compose would still mount it at `/root/.cache` with no `HF_HOME` env var — HuggingFace would write to `/app/cache` (Dockerfile ENV default), which is an ephemeral container path in prod, so every restart re-downloads models **Requirement gap:** The upgrade notes document the dev case. The production case is both different and broken. ### Requirements coverage check (updated) | Requirement from #459 | Dev? | Prod? | |---|---|---| | Non-root user in container | ✅ Dockerfile | ✅ Dockerfile (shared image) | | Correct HOME and HF_HOME | ✅ Dockerfile + compose | ⚠️ Dockerfile only — prod compose missing `HF_HOME` override | | HTRMOPO_DIR not derived from `~` | ✅ | ✅ (code change) | | Writable cache survives container recreate | ✅ named volume `ocr_cache:/app/cache` | ❌ prod still mounts `ocr-cache:/root/.cache` (wrong path) | | Observability: detect if accidentally running as root | ✅ | ✅ (code change) | | CIS Docker §4.6: cap_drop + no-new-privileges | ✅ dev compose | ❌ missing from prod compose | | read_only filesystem | ✅ dev compose | ❌ missing from prod compose | ### What's done well - Upgrade notes section in DEPLOYMENT.md establishes the right pattern for future migration steps ✅ - Requirements coverage is complete in the dev environment ✅ - ADR-019 captures the hardening baseline as a durable architectural record ✅

marcel commented

2026-05-17 17:40:33 +02:00

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Verdict: ✅ Approved

No .svelte files, no CSS or design tokens, no frontend routes, no i18n strings, no user-facing components or interactions.

Changed files are: docker-compose.yml, docker-compose.prod.yml (indirectly flagged as missing), docs/DEPLOYMENT.md, docs/adr/019-container-hardening-baseline.md, ocr-service/Dockerfile, ocr-service/ensure_blla_model.py, ocr-service/main.py, ocr-service/test_*.py.

The only operator-visible output is the startup log line "Running as root — CIS Docker §4.1 violation" — this surfaces in Docker logs and Loki, never in the user interface.

LGTM from a UI/UX and accessibility perspective.

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist **Verdict: ✅ Approved** No `.svelte` files, no CSS or design tokens, no frontend routes, no i18n strings, no user-facing components or interactions. Changed files are: `docker-compose.yml`, `docker-compose.prod.yml` (indirectly flagged as missing), `docs/DEPLOYMENT.md`, `docs/adr/019-container-hardening-baseline.md`, `ocr-service/Dockerfile`, `ocr-service/ensure_blla_model.py`, `ocr-service/main.py`, `ocr-service/test_*.py`. The only operator-visible output is the startup log line `"Running as root — CIS Docker §4.1 violation"` — this surfaces in Docker logs and Loki, never in the user interface. LGTM from a UI/UX and accessibility perspective.

marcel added 1 commit 2026-05-17 17:43:59 +02:00

security(ocr): apply container hardening baseline to docker-compose.prod.yml

CI / Unit & Component Tests (pull_request) Successful in 3m3s

Details

CI / OCR Service Tests (pull_request) Successful in 18s

Details

CI / Backend Unit Tests (pull_request) Successful in 3m4s

Details

CI / fail2ban Regex (pull_request) Successful in 43s

Details

CI / Semgrep Security Scan (pull_request) Successful in 18s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 59s

Details

7769dbc9f4

Mirror the CIS Docker §4.1/§4.6 hardening from docker-compose.yml to the
production/staging compose file, which is standalone (not an overlay).

- Fix cache volume mount path: ocr-cache:/root/.cache → /app/cache (matches
  the non-root user's HF_HOME/XDG_CACHE_HOME, avoids PermissionError)
- Add HF_HOME, XDG_CACHE_HOME, TORCH_HOME env vars so HuggingFace, ketos,
  and PyTorch all write to the declared writable volumes, not HOME
- Add read_only: true, tmpfs (/tmp:512m), cap_drop: [ALL],
  no-new-privileges:true — matching the dev baseline

Also extend DEPLOYMENT.md §8 upgrade notes to cover all three environments
(dev/production/staging), each with its correct project-namespaced volume name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-17 17:44:08 +02:00

Review concerns addressed — commit `7769dbc9`

All blockers from the second-round review (Tobias 🚫, Nora 🚫, Elicit 🚫) have been resolved.

`docker-compose.prod.yml` — OCR service fully hardened

docker-compose.prod.yml is a standalone file (not an overlay), so none of the security changes from docker-compose.yml propagated to it automatically. This commit brings it to parity:

Change	Before	After
Cache volume mount path	`ocr-cache:/root/.cache`	`ocr-cache:/app/cache`
`HF_HOME`	missing	`/app/cache`
`XDG_CACHE_HOME`	missing	`/app/cache`
`TORCH_HOME`	missing	`/app/models/torch`
`read_only`	missing	`true`
`tmpfs`	missing	`/tmp:size=512m`
`cap_drop`	missing	`[ALL]` (block notation)
`no-new-privileges`	missing	`true`

`docs/DEPLOYMENT.md §8` — Upgrade notes cover all environments

The previous upgrade notes only gave the dev volume name (familienarchiv_ocr_cache). The section now lists all three project-namespaced names:

# Dev
docker volume rm familienarchiv_ocr_cache

# Production (-p archiv-production)
docker volume rm archiv-production_ocr-cache

# Staging (-p archiv-staging)
docker volume rm archiv-staging_ocr-cache

## Review concerns addressed — commit `7769dbc9` All blockers from the second-round review (Tobias 🚫, Nora 🚫, Elicit 🚫) have been resolved. ### `docker-compose.prod.yml` — OCR service fully hardened `docker-compose.prod.yml` is a standalone file (not an overlay), so none of the security changes from `docker-compose.yml` propagated to it automatically. This commit brings it to parity: | Change | Before | After | |--------|--------|-------| | Cache volume mount path | `ocr-cache:/root/.cache` | `ocr-cache:/app/cache` | | `HF_HOME` | missing | `/app/cache` | | `XDG_CACHE_HOME` | missing | `/app/cache` | | `TORCH_HOME` | missing | `/app/models/torch` | | `read_only` | missing | `true` | | `tmpfs` | missing | `/tmp:size=512m` | | `cap_drop` | missing | `[ALL]` (block notation) | | `no-new-privileges` | missing | `true` | ### `docs/DEPLOYMENT.md §8` — Upgrade notes cover all environments The previous upgrade notes only gave the dev volume name (`familienarchiv_ocr_cache`). The section now lists all three project-namespaced names: ```bash # Dev docker volume rm familienarchiv_ocr_cache # Production (-p archiv-production) docker volume rm archiv-production_ocr-cache # Staging (-p archiv-staging) docker volume rm archiv-staging_ocr-cache ```

marcel commented

2026-05-17 18:28:01 +02:00

🏛️ Markus Keller — Senior Application Architect

Verdict: ✅ Approved

What I checked

ADR-019 is well-formed and follows the established format: context, decision, consequences, applicability. It correctly documents all five hardening elements and links back to DEPLOYMENT.md §8. This is exactly the kind of documentation that prevents future developers from removing the USER ocr directive out of ignorance. Good work.

DEPLOYMENT.md is updated with two new env vars (XDG_CACHE_HOME, TORCH_HOME) and a new §8 with per-environment volume removal commands. The upgrade notes distinguish dev, production, and staging compose project names — this will prevent silent failures on existing installs.

C4 diagrams: No update required. This PR does not add a new container, a new external system, or change the service boundary of the OCR service. Runtime hardening flags are implementation detail, not container topology.

One concern (non-blocking)

ADR-019 §Applicability states: "This baseline MUST be applied to any new container added to the project." The backend and frontend service containers are not yet hardened. The ADR establishes the expectation but the rest of the fleet is not yet compliant. Consider creating a tracking issue for backend + frontend container hardening so the ADR's "must" doesn't become an unfulfilled promise that rots.

What was done well

The decision to break chmod +x out as a separate RUN step (after chown -R ocr:ocr /app) keeps the layer order correct without forcing a combined RUN that's hard to read.
Redirecting TORCH_HOME to /app/models/torch rather than /app/cache is the right call — model weights are persistent, not ephemeral cache.
Upgrade notes include the exact docker volume rm commands per environment. No guesswork for the operator.

## 🏛️ Markus Keller — Senior Application Architect **Verdict: ✅ Approved** ### What I checked **ADR-019** is well-formed and follows the established format: context, decision, consequences, applicability. It correctly documents all five hardening elements and links back to DEPLOYMENT.md §8. This is exactly the kind of documentation that prevents future developers from removing the `USER ocr` directive out of ignorance. Good work. **DEPLOYMENT.md** is updated with two new env vars (`XDG_CACHE_HOME`, `TORCH_HOME`) and a new §8 with per-environment volume removal commands. The upgrade notes distinguish dev, production, and staging compose project names — this will prevent silent failures on existing installs. **C4 diagrams**: No update required. This PR does not add a new container, a new external system, or change the service boundary of the OCR service. Runtime hardening flags are implementation detail, not container topology. ### One concern (non-blocking) ADR-019 §Applicability states: *"This baseline MUST be applied to any new container added to the project."* The backend and frontend service containers are not yet hardened. The ADR establishes the expectation but the rest of the fleet is not yet compliant. Consider creating a tracking issue for backend + frontend container hardening so the ADR's "must" doesn't become an unfulfilled promise that rots. ### What was done well - The decision to break `chmod +x` out as a separate `RUN` step (after `chown -R ocr:ocr /app`) keeps the layer order correct without forcing a combined RUN that's hard to read. - Redirecting `TORCH_HOME` to `/app/models/torch` rather than `/app/cache` is the right call — model weights are persistent, not ephemeral cache. - Upgrade notes include the exact `docker volume rm` commands per environment. No guesswork for the operator.

marcel commented

2026-05-17 18:28:13 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: ✅ Approved

TDD evidence

Commits confirm the red/green order: test(ocr): add startup root canary tests precedes security(ocr): redirect XDG cache and Torch home in the branch history. The root canary tests were written before the lifespan change. Good discipline.

Code quality

ensure_blla_model.py change is minimal and correct:

# Before
HTRMOPO_DIR = os.path.expanduser("~/.local/share/htrmopo")

# After
HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")

One line, no new indirection, self-documenting. The default /app/models/.htrmopo is a fixed path that doesn't depend on ~ — exactly right for a --no-create-home container user.

main.py canary is two lines and doesn't need a comment:

if os.getuid() == 0:
    logger.warning("Running as root — CIS Docker §4.1 violation")

The message in the warning string is sufficient documentation.

test_main.py is clean: proper @pytest.mark.asyncio, mocks both heavyweight startup functions (load_models, load_spell_checker), asserts via caplog at the right level. Both happy and unhappy paths are covered.

Minor observation (non-blocking)

test_ensure_blla_model.py uses importlib.reload(ensure_blla_model) to reset module-level state. This is the correct technique when testing module constants that are evaluated at import time — no better option exists given the current module structure. The cleanup reload at the end of each test is essential and present. Flag this if parallel test execution is ever introduced (pytest-xdist), as global module state won't be process-isolated.

Naming

All new test function names read as sentences describing behavior. No abbreviations. No test_update() anti-patterns.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** ### TDD evidence Commits confirm the red/green order: `test(ocr): add startup root canary tests` precedes `security(ocr): redirect XDG cache and Torch home` in the branch history. The root canary tests were written before the lifespan change. Good discipline. ### Code quality **`ensure_blla_model.py` change** is minimal and correct: ```python # Before HTRMOPO_DIR = os.path.expanduser("~/.local/share/htrmopo") # After HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") ``` One line, no new indirection, self-documenting. The default `/app/models/.htrmopo` is a fixed path that doesn't depend on `~` — exactly right for a `--no-create-home` container user. **`main.py` canary** is two lines and doesn't need a comment: ```python if os.getuid() == 0: logger.warning("Running as root — CIS Docker §4.1 violation") ``` The message in the warning string is sufficient documentation. **`test_main.py`** is clean: proper `@pytest.mark.asyncio`, mocks both heavyweight startup functions (`load_models`, `load_spell_checker`), asserts via `caplog` at the right level. Both happy and unhappy paths are covered. ### Minor observation (non-blocking) `test_ensure_blla_model.py` uses `importlib.reload(ensure_blla_model)` to reset module-level state. This is the correct technique when testing module constants that are evaluated at import time — no better option exists given the current module structure. The cleanup reload at the end of each test is essential and present. Flag this if parallel test execution is ever introduced (`pytest-xdist`), as global module state won't be process-isolated. ### Naming All new test function names read as sentences describing behavior. No abbreviations. No `test_update()` anti-patterns.

marcel commented

2026-05-17 18:28:29 +02:00

🚀 Tobias Wendt — DevOps & Platform Engineer

Verdict: ⚠️ Approved with concerns

What was done well

The hardening block is correct:

read_only: true
tmpfs:
  - /tmp:size=512m
cap_drop: [ALL]
security_opt:
  - no-new-privileges:true

Comments explain why — 512m covers typical training batches — not just what. The volume path change from /root/.cache to /app/cache is clean. Named volumes unchanged (ocr_cache/ocr-cache), so Docker still manages lifecycle correctly.

The upgrade notes in DEPLOYMENT.md §8 are exactly what I'd want to hand to an operator at 11pm: exact docker volume rm commands, correct project name variants per environment (dev/prod/staging), and an explicit cost estimate (1–2 GB re-download, one-time).

Blocker

None.

Concerns

1. Inconsistent list syntax between compose files (cosmetic but worth fixing)

docker-compose.yml uses inline YAML list:

cap_drop: [ALL]

docker-compose.prod.yml uses block list:

cap_drop:
  - ALL

Both are valid YAML but the inconsistency will confuse someone doing a diff. Pick one style and apply it to both files.

2. tmpfs has no noexec,nosuid options

The current mount:

tmpfs:
  - /tmp:size=512m

Consider:

tmpfs:
  - /tmp:size=512m,noexec,nosuid,nodev

Training writes ZIPs to /tmp, not executables, so noexec should be safe. nosuid prevents SUID escalation from files written to /tmp. This is defense-in-depth — if a dependency ever writes an executable to /tmp and tries to run it, noexec blocks it. Worth verifying that ketos and the training pipeline don't execute anything from /tmp before adding.

3. TORCH_HOME points to /app/models/torch — subdirectory will not exist on first run

The /app/models directory is a named volume. The torch subdirectory doesn't exist until PyTorch creates it. PyTorch's torch.hub creates this directory lazily on first use, so this should work. Worth confirming with a quick docker exec after first startup: ls /app/models/torch.

What's correct that I specifically verified

HF_HOME: /app/cache in both the Dockerfile ENV and compose environment — the compose value takes precedence at runtime; the Dockerfile value is the image fallback.
The useradd --uid 1000 fixed UID ensures consistent volume ownership across host volume mounts.
start_period: 120s on the healthcheck — unchanged, still appropriate for Surya model loading.

## 🚀 Tobias Wendt — DevOps & Platform Engineer **Verdict: ⚠️ Approved with concerns** ### What was done well The hardening block is correct: ```yaml read_only: true tmpfs: - /tmp:size=512m cap_drop: [ALL] security_opt: - no-new-privileges:true ``` Comments explain *why* — 512m covers typical training batches — not just what. The volume path change from `/root/.cache` to `/app/cache` is clean. Named volumes unchanged (`ocr_cache`/`ocr-cache`), so Docker still manages lifecycle correctly. The upgrade notes in DEPLOYMENT.md §8 are exactly what I'd want to hand to an operator at 11pm: exact `docker volume rm` commands, correct project name variants per environment (dev/prod/staging), and an explicit cost estimate (1–2 GB re-download, one-time). ### Blocker None. ### Concerns **1. Inconsistent list syntax between compose files (cosmetic but worth fixing)** `docker-compose.yml` uses inline YAML list: ```yaml cap_drop: [ALL] ``` `docker-compose.prod.yml` uses block list: ```yaml cap_drop: - ALL ``` Both are valid YAML but the inconsistency will confuse someone doing a diff. Pick one style and apply it to both files. **2. `tmpfs` has no `noexec,nosuid` options** The current mount: ```yaml tmpfs: - /tmp:size=512m ``` Consider: ```yaml tmpfs: - /tmp:size=512m,noexec,nosuid,nodev ``` Training writes ZIPs to `/tmp`, not executables, so `noexec` should be safe. `nosuid` prevents SUID escalation from files written to `/tmp`. This is defense-in-depth — if a dependency ever writes an executable to `/tmp` and tries to run it, `noexec` blocks it. Worth verifying that `ketos` and the training pipeline don't execute anything from `/tmp` before adding. **3. `TORCH_HOME` points to `/app/models/torch` — subdirectory will not exist on first run** The `/app/models` directory is a named volume. The `torch` subdirectory doesn't exist until PyTorch creates it. PyTorch's `torch.hub` creates this directory lazily on first use, so this should work. Worth confirming with a quick `docker exec` after first startup: `ls /app/models/torch`. ### What's correct that I specifically verified - `HF_HOME: /app/cache` in both the Dockerfile `ENV` and compose `environment` — the compose value takes precedence at runtime; the Dockerfile value is the image fallback. - The `useradd --uid 1000` fixed UID ensures consistent volume ownership across host volume mounts. - `start_period: 120s` on the healthcheck — unchanged, still appropriate for Surya model loading.

marcel commented

2026-05-17 18:28:40 +02:00

📋 Elicit — Requirements Engineer

Verdict: ✅ Approved

Traceability assessment

The PR closes issue #459. Based on the stated requirements for non-root OCR container operation, I traced each requirement through to the implementation:

Requirement	Addressed?	Evidence
OCR process must not run as root	✅	`USER ocr` in Dockerfile; startup canary warns if violated
Cache must persist across container restarts	✅	Named volume `ocr_cache` mounted at `/app/cache`
Model cache must not be re-downloaded on each restart	✅	`HF_HOME`, `XDG_CACHE_HOME`, `TORCH_HOME` all point to persistent volumes
Existing installs must have a documented upgrade path	✅	DEPLOYMENT.md §8 with exact commands per environment
Runtime must be isolated from container filesystem writes	✅	`read_only: true` + named volume carve-outs

One open question (non-blocking)

The test plan in the PR description includes manual steps (docker exec archive-ocr id). There is no automated E2E or CI verification that the container is actually running as UID 1000. This is accepted as a manual gate for a security property, but worth noting as a gap if CI automation is ever added for container hardening checks.

Out-of-scope addition (noted, not blocking)

The HTRMOPO_DIR env var change was not mentioned in the original issue #459 but is clearly necessary to prevent the ~ → / resolution failure under --no-create-home. This is a correct and minimal addition that belongs in this PR.

Upgrade notes completeness

The DEPLOYMENT.md §8 covers dev, production (with project name -p archiv-production), and staging (with project name -p archiv-staging). The 1–2 GB re-download cost is quantified. The instruction is unambiguous: drop the volume before starting the updated stack.

No requirements gaps found.

## 📋 Elicit — Requirements Engineer **Verdict: ✅ Approved** ### Traceability assessment The PR closes issue #459. Based on the stated requirements for non-root OCR container operation, I traced each requirement through to the implementation: | Requirement | Addressed? | Evidence | |---|---|---| | OCR process must not run as root | ✅ | `USER ocr` in Dockerfile; startup canary warns if violated | | Cache must persist across container restarts | ✅ | Named volume `ocr_cache` mounted at `/app/cache` | | Model cache must not be re-downloaded on each restart | ✅ | `HF_HOME`, `XDG_CACHE_HOME`, `TORCH_HOME` all point to persistent volumes | | Existing installs must have a documented upgrade path | ✅ | DEPLOYMENT.md §8 with exact commands per environment | | Runtime must be isolated from container filesystem writes | ✅ | `read_only: true` + named volume carve-outs | ### One open question (non-blocking) The test plan in the PR description includes manual steps (`docker exec archive-ocr id`). There is no automated E2E or CI verification that the container is actually running as UID 1000. This is accepted as a manual gate for a security property, but worth noting as a gap if CI automation is ever added for container hardening checks. ### Out-of-scope addition (noted, not blocking) The `HTRMOPO_DIR` env var change was not mentioned in the original issue #459 but is clearly necessary to prevent the `~` → `/` resolution failure under `--no-create-home`. This is a correct and minimal addition that belongs in this PR. ### Upgrade notes completeness The DEPLOYMENT.md §8 covers dev, production (with project name `-p archiv-production`), and staging (with project name `-p archiv-staging`). The 1–2 GB re-download cost is quantified. The instruction is unambiguous: drop the volume *before* starting the updated stack. No requirements gaps found.

marcel commented

2026-05-17 18:28:55 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer

Verdict: ✅ Approved

This PR implements CIS Docker Benchmark §4.1 (non-root) and §4.6 (read-only filesystem). I reviewed it against the full hardening checklist.

Verified controls

Control	CIS Ref	Status	Notes
Non-root user	§4.1	✅	`useradd --uid 1000 --no-create-home --shell /usr/sbin/nologin`
Read-only filesystem	§4.6	✅	`read_only: true` with explicit writable carve-outs
Capability drop	§5.3	✅	`cap_drop: ALL`, no capabilities added back
Privilege escalation block	§5.4	✅	`no-new-privileges:true`
Startup canary	custom	✅	Logs `WARNING` if `os.getuid() == 0`
No SUID shell	§4.1	✅	`--shell /usr/sbin/nologin`
`HTRMOPO_DIR` path safety	custom	✅	Eliminates `~` → `/` resolution under `--no-create-home`

Suggestion (non-blocking)

The tmpfs mount could be hardened further:

# Current
tmpfs:
  - /tmp:size=512m

# Suggested
tmpfs:
  - /tmp:size=512m,noexec,nosuid,nodev

noexec prevents executing binaries written to /tmp — relevant because the training pipeline extracts ZIPs there, and a malicious ZIP entry could include an executable. nosuid prevents SUID escalation from files landing in /tmp. Verify first that ketos train does not execute scripts extracted from /tmp before enabling noexec. If ketos is clean, this is a free defense-in-depth layer.

What was done particularly well

chown -R ocr:ocr /app /home/ocr followed by read_only: true at runtime means the ocr user owns its files but cannot create new paths outside the declared volumes. If a dependency tries to write to /usr/local/lib/python3.12/site-packages/, it gets a PermissionError immediately — not a silent privilege escalation.

The startup canary at WARNING level (not ERROR) is the right severity: it notifies operators without aborting the container, which allows existing unpatched deployments to start and emit a visible signal rather than crash-looping with no context.

The HTRMOPO_DIR fix is security-relevant: os.path.expanduser("~") when HOME points to / (which happens under --no-create-home without an explicit HOME env var) silently expands to / — a path-traversal waiting to happen if htrmopo were ever to write files relative to that base. The env-var override with a fixed /app/models/.htrmopo default closes this.

No vulnerabilities found. The hardening is correct and complete for the OCR service's threat model.

## 🔐 Nora "NullX" Steiner — Application Security Engineer **Verdict: ✅ Approved** This PR implements CIS Docker Benchmark §4.1 (non-root) and §4.6 (read-only filesystem). I reviewed it against the full hardening checklist. ### Verified controls | Control | CIS Ref | Status | Notes | |---|---|---|---| | Non-root user | §4.1 | ✅ | `useradd --uid 1000 --no-create-home --shell /usr/sbin/nologin` | | Read-only filesystem | §4.6 | ✅ | `read_only: true` with explicit writable carve-outs | | Capability drop | §5.3 | ✅ | `cap_drop: ALL`, no capabilities added back | | Privilege escalation block | §5.4 | ✅ | `no-new-privileges:true` | | Startup canary | custom | ✅ | Logs `WARNING` if `os.getuid() == 0` | | No SUID shell | §4.1 | ✅ | `--shell /usr/sbin/nologin` | | `HTRMOPO_DIR` path safety | custom | ✅ | Eliminates `~` → `/` resolution under `--no-create-home` | ### Suggestion (non-blocking) The tmpfs mount could be hardened further: ```yaml # Current tmpfs: - /tmp:size=512m # Suggested tmpfs: - /tmp:size=512m,noexec,nosuid,nodev ``` `noexec` prevents executing binaries written to `/tmp` — relevant because the training pipeline extracts ZIPs there, and a malicious ZIP entry could include an executable. `nosuid` prevents SUID escalation from files landing in `/tmp`. **Verify first** that `ketos train` does not execute scripts extracted from `/tmp` before enabling `noexec`. If ketos is clean, this is a free defense-in-depth layer. ### What was done particularly well **`chown -R ocr:ocr /app /home/ocr`** followed by **`read_only: true`** at runtime means the `ocr` user owns its files but cannot create new paths outside the declared volumes. If a dependency tries to write to `/usr/local/lib/python3.12/site-packages/`, it gets a `PermissionError` immediately — not a silent privilege escalation. **The startup canary** at `WARNING` level (not `ERROR`) is the right severity: it notifies operators without aborting the container, which allows existing unpatched deployments to start and emit a visible signal rather than crash-looping with no context. **The `HTRMOPO_DIR` fix** is security-relevant: `os.path.expanduser("~")` when `HOME` points to `/` (which happens under `--no-create-home` without an explicit `HOME` env var) silently expands to `/` — a path-traversal waiting to happen if `htrmopo` were ever to write files relative to that base. The env-var override with a fixed `/app/models/.htrmopo` default closes this. No vulnerabilities found. The hardening is correct and complete for the OCR service's threat model.

marcel commented

2026-05-17 18:29:11 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Verdict: ⚠️ Approved with concerns

Test coverage summary

Two new test files, four new tests:

File	Tests	Layer	Technique
`test_ensure_blla_model.py`	2 (HTRMOPO_DIR env resolution)	Unit	`importlib.reload` + `patch.dict`
`test_main.py`	2 (root canary warning)	Unit	`AsyncClient(ASGITransport)` + `caplog`

What's good

All four test names read as sentences describing behavior ✅
@pytest.mark.asyncio correctly applied to async test functions ✅
ASGITransport(app=app) used instead of a real server — in-process, fast, correct ✅
Both happy path (non-root, no warning) and unhappy path (root, warning emitted) covered ✅
Cleanup importlib.reload(ensure_blla_model) call at the end of each HTRMOPO test restores module state ✅

Concern 1 — Untested edge case: empty string `HTRMOPO_DIR` (minor)

HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")

os.environ.get("HTRMOPO_DIR", default) returns "" if the variable is set to an empty string — the default is only used when the key is absent. An operator who sets HTRMOPO_DIR= in their .env file (empty value) will get an empty string as the path, which will cause an obscure failure when htrmopo tries to use it.

Suggested additional test:

def test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string():
    """HTRMOPO_DIR='' should not produce an empty path."""
    with patch.dict(os.environ, {"HTRMOPO_DIR": ""}):
        importlib.reload(ensure_blla_model)
        result = ensure_blla_model.HTRMOPO_DIR
    importlib.reload(ensure_blla_model)
    assert result != ""

This test will currently fail — exposing the real behavior of os.environ.get. The fix would be:

HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"

Concern 2 — `importlib.reload` is fragile under parallel test execution (minor)

The patch.dict + reload pattern mutates global module state. If pytest-xdist is ever added for parallel runs, tests in test_ensure_blla_model.py will race each other. This is acceptable today but should be flagged before any parallelization work.

No coverage regression

The existing test_ensure_blla_model.py tests (model loadable, model missing, download success, etc.) are unchanged. The root canary tests are isolated via caplog and don't interact with OCR model state.

No E2E test exists to verify docker exec archive-ocr id — this is expected for infrastructure-level assertions and the manual test plan in the PR description covers it.

## 🧪 Sara Holt — QA Engineer & Test Strategist **Verdict: ⚠️ Approved with concerns** ### Test coverage summary Two new test files, four new tests: | File | Tests | Layer | Technique | |---|---|---|---| | `test_ensure_blla_model.py` | 2 (HTRMOPO_DIR env resolution) | Unit | `importlib.reload` + `patch.dict` | | `test_main.py` | 2 (root canary warning) | Unit | `AsyncClient(ASGITransport)` + `caplog` | ### What's good - All four test names read as sentences describing behavior ✅ - `@pytest.mark.asyncio` correctly applied to async test functions ✅ - `ASGITransport(app=app)` used instead of a real server — in-process, fast, correct ✅ - Both happy path (non-root, no warning) and unhappy path (root, warning emitted) covered ✅ - Cleanup `importlib.reload(ensure_blla_model)` call at the end of each HTRMOPO test restores module state ✅ ### Concern 1 — Untested edge case: empty string `HTRMOPO_DIR` (minor) ```python HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") ``` `os.environ.get("HTRMOPO_DIR", default)` returns `""` if the variable is set to an empty string — the default is only used when the key is **absent**. An operator who sets `HTRMOPO_DIR=` in their `.env` file (empty value) will get an empty string as the path, which will cause an obscure failure when `htrmopo` tries to use it. Suggested additional test: ```python def test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string(): """HTRMOPO_DIR='' should not produce an empty path.""" with patch.dict(os.environ, {"HTRMOPO_DIR": ""}): importlib.reload(ensure_blla_model) result = ensure_blla_model.HTRMOPO_DIR importlib.reload(ensure_blla_model) assert result != "" ``` This test will currently **fail** — exposing the real behavior of `os.environ.get`. The fix would be: ```python HTRMOPO_DIR = os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo" ``` ### Concern 2 — `importlib.reload` is fragile under parallel test execution (minor) The `patch.dict` + `reload` pattern mutates global module state. If `pytest-xdist` is ever added for parallel runs, tests in `test_ensure_blla_model.py` will race each other. This is acceptable today but should be flagged before any parallelization work. ### No coverage regression The existing `test_ensure_blla_model.py` tests (model loadable, model missing, download success, etc.) are unchanged. The root canary tests are isolated via `caplog` and don't interact with OCR model state. No E2E test exists to verify `docker exec archive-ocr id` — this is expected for infrastructure-level assertions and the manual test plan in the PR description covers it.

marcel commented

2026-05-17 18:29:19 +02:00

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Verdict: ✅ Approved — No frontend changes

This PR is entirely infrastructure and Python backend work: Dockerfile hardening, Docker Compose runtime flags, env var redirection, and a startup canary log line. There are no changes to Svelte components, SvelteKit routes, Tailwind classes, i18n strings, or any user-facing interface.

What I checked

No .svelte files modified
No changes to frontend/
No new routes, error pages, or user-visible error messages
No design token or CSS changes
No i18n keys added or removed

Indirect UX note (informational only)

The read_only: true + tmpfs /tmp:size=512m constraint means that if a training batch ever exceeds 512 MB of scratch space, the OCR service will produce a visible ENOSPC error. If this manifests in the UI (e.g., a failed training response surfaced to an admin), confirm that the error is mapped to a user-friendly message via getErrorMessage() rather than exposing the raw OSError from the Python backend. This is not introduced by this PR — it applies to any existing training error path — but worth checking once the non-root container is running in production.

No UI/UX concerns with this PR.

## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist **Verdict: ✅ Approved — No frontend changes** This PR is entirely infrastructure and Python backend work: Dockerfile hardening, Docker Compose runtime flags, env var redirection, and a startup canary log line. There are no changes to Svelte components, SvelteKit routes, Tailwind classes, i18n strings, or any user-facing interface. ### What I checked - No `.svelte` files modified - No changes to `frontend/` - No new routes, error pages, or user-visible error messages - No design token or CSS changes - No i18n keys added or removed ### Indirect UX note (informational only) The `read_only: true` + `tmpfs /tmp:size=512m` constraint means that if a training batch ever exceeds 512 MB of scratch space, the OCR service will produce a visible `ENOSPC` error. If this manifests in the UI (e.g., a failed training response surfaced to an admin), confirm that the error is mapped to a user-friendly message via `getErrorMessage()` rather than exposing the raw `OSError` from the Python backend. This is not introduced by this PR — it applies to any existing training error path — but worth checking once the non-root container is running in production. No UI/UX concerns with this PR.

marcel added 2 commits 2026-05-17 18:55:09 +02:00

fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback bead6f1811

os.environ.get(key, default) returns "" when the key exists but is blank —
the default is only used when the key is absent. The or-fallback treats both
absence and blank values as "use the default".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

style(ocr): normalize cap_drop to block notation in docker-compose.yml

CI / Unit & Component Tests (pull_request) Successful in 3m3s

Details

CI / OCR Service Tests (pull_request) Successful in 18s

Details

CI / Backend Unit Tests (pull_request) Successful in 3m10s

Details

CI / fail2ban Regex (pull_request) Successful in 44s

Details

CI / Semgrep Security Scan (pull_request) Successful in 19s

Details

CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s

Details

CI / Unit & Component Tests (push) Successful in 3m2s

Details

CI / OCR Service Tests (push) Successful in 18s

Details

CI / Backend Unit Tests (push) Successful in 3m0s

Details

CI / fail2ban Regex (push) Successful in 42s

Details

CI / Semgrep Security Scan (push) Successful in 18s

Details

CI / Compose Bucket Idempotency (push) Successful in 1m1s

Details

f1e0b92f47

Aligns with the block sequence style used in docker-compose.prod.yml and
the rest of the compose file, removing the inline [ALL] inconsistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

marcel commented

2026-05-17 18:55:33 +02:00

Third-round review concerns addressed

All open concerns from the third review round (Sara ⚠️, Tobias ⚠️) have been resolved.

✅ @Sara — Concern 1: empty-string `HTRMOPO_DIR` falls back to empty string, not the default

os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo") returns "" when the variable is present but blank — the default only kicks in when the key is absent. An operator who writes HTRMOPO_DIR= in .env would get an empty path and a cryptic failure.

Fix: os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo" — the or fallback treats both absent and blank as "use the default".

Test added (test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string) — was red against the old get() form, green after the fix, consistent with the importlib.reload + patch.dict pattern used by the existing HTRMOPO_DIR tests.

→ commit bead6f18 — fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback

✅ @Tobias — Concern 1: `cap_drop: [ALL]` inline notation inconsistent with `docker-compose.prod.yml`

docker-compose.yml used inline flow notation ([ALL]); docker-compose.prod.yml used block notation (- ALL). Normalized docker-compose.yml to block notation to match.

→ commit f1e0b92f — style(ocr): normalize cap_drop to block notation in docker-compose.yml

Deferred

@Tobias / @Nora — noexec,nosuid,nodev on tmpfs: both reviewers explicitly say "verify first that ketos training does not execute anything from /tmp". Cannot verify without a running container. Deferred to a follow-up once the container is running in dev — if ketos is clean, adding the flags is a one-liner.
@Markus — tracking issue for backend/frontend container hardening: management/planning task, not a code change on this PR.

## Third-round review concerns addressed All open concerns from the third review round (Sara ⚠️, Tobias ⚠️) have been resolved. --- ### ✅ @Sara — Concern 1: empty-string `HTRMOPO_DIR` falls back to empty string, not the default `os.environ.get("HTRMOPO_DIR", "/app/models/.htrmopo")` returns `""` when the variable is present but blank — the default only kicks in when the key is absent. An operator who writes `HTRMOPO_DIR=` in `.env` would get an empty path and a cryptic failure. **Fix:** `os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"` — the `or` fallback treats both absent and blank as "use the default". **Test added** (`test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string`) — was red against the old `get()` form, green after the fix, consistent with the `importlib.reload` + `patch.dict` pattern used by the existing HTRMOPO_DIR tests. → commit `bead6f18` — `fix(ocr): handle empty-string HTRMOPO_DIR env var with or-fallback` --- ### ✅ @Tobias — Concern 1: `cap_drop: [ALL]` inline notation inconsistent with `docker-compose.prod.yml` `docker-compose.yml` used inline flow notation (`[ALL]`); `docker-compose.prod.yml` used block notation (`- ALL`). Normalized `docker-compose.yml` to block notation to match. → commit `f1e0b92f` — `style(ocr): normalize cap_drop to block notation in docker-compose.yml` --- ### Deferred - **@Tobias / @Nora — `noexec,nosuid,nodev` on tmpfs**: both reviewers explicitly say "verify first that ketos training does not execute anything from `/tmp`". Cannot verify without a running container. Deferred to a follow-up once the container is running in dev — if ketos is clean, adding the flags is a one-liner. - **@Markus — tracking issue for backend/frontend container hardening**: management/planning task, not a code change on this PR.

marcel commented

2026-05-17 19:01:42 +02:00

🏗️ Markus Keller (@mkeller) — Application Architect

Verdict: ✅ Approved

What I checked

ADR-019 — well-structured, covers context, decision, alternatives, consequences, and applicability. The threat model in the "Context" section is concrete and specific (path-traversal on ZIP training endpoints, dependency persistence to image layers). This is exactly what an ADR should capture. ✅

DEPLOYMENT.md — Upgrade notes section added with volume drop commands for all three deployment contexts (dev, prod, staging). This is the correct operational runbook entry. ✅

Documentation checklist — No new container services, no new routes, no DB migrations, no new domains. The hardened OCR service is the same topology. No changes required to l2-containers.puml or other architecture diagrams.

TORCH_HOME → /app/models/torch — Correct decision. Models volume is persistent (named volume, survives container recreation), cache volume is also persistent. Torch model files won't be re-downloaded on recreates. ✅

Suggestion (non-blocker)

ocr-service/Dockerfile has two separate RUN layers:

RUN useradd ... && mkdir -p ... && chown -R ocr:ocr /app /home/ocr
RUN chmod +x /app/entrypoint.sh

The chmod creates a separate image layer. Since COPY . . already runs as root, and entrypoint.sh needs to be executable, the chmod should ideally be part of the initial RUN block (or moved before the chown so it's combined). Minor — won't block merge.

ADR applicability gap (suggestion, not blocker)

ADR-019 states: "This baseline applies to the OCR service (PR #611). It should be applied to any new container added to the project."

Worth opening a follow-up issue to track applying this to the backend and frontend containers in a future milestone. Right now only OCR is hardened. The ADR is the right place to record this intent, but a tracking issue ensures it doesn't get forgotten.

## 🏗️ Markus Keller (@mkeller) — Application Architect **Verdict: ✅ Approved** ### What I checked **ADR-019** — well-structured, covers context, decision, alternatives, consequences, and applicability. The threat model in the "Context" section is concrete and specific (path-traversal on ZIP training endpoints, dependency persistence to image layers). This is exactly what an ADR should capture. ✅ **DEPLOYMENT.md** — Upgrade notes section added with volume drop commands for all three deployment contexts (dev, prod, staging). This is the correct operational runbook entry. ✅ **Documentation checklist** — No new container services, no new routes, no DB migrations, no new domains. The hardened OCR service is the same topology. No changes required to `l2-containers.puml` or other architecture diagrams. **TORCH_HOME → /app/models/torch** — Correct decision. Models volume is persistent (named volume, survives container recreation), cache volume is also persistent. Torch model files won't be re-downloaded on recreates. ✅ ### Suggestion (non-blocker) `ocr-service/Dockerfile` has two separate `RUN` layers: ```dockerfile RUN useradd ... && mkdir -p ... && chown -R ocr:ocr /app /home/ocr RUN chmod +x /app/entrypoint.sh ``` The `chmod` creates a separate image layer. Since `COPY . .` already runs as root, and `entrypoint.sh` needs to be executable, the `chmod` should ideally be part of the initial RUN block (or moved before the `chown` so it's combined). Minor — won't block merge. ### ADR applicability gap (suggestion, not blocker) ADR-019 states: "This baseline applies to the OCR service (PR #611). It should be applied to any new container added to the project." Worth opening a follow-up issue to track applying this to the `backend` and `frontend` containers in a future milestone. Right now only OCR is hardened. The ADR is the right place to record this intent, but a tracking issue ensures it doesn't get forgotten.

marcel commented

2026-05-17 19:01:50 +02:00

👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer

Verdict: ✅ Approved

TDD check

Tests exist and cover the changed behavior:

test_htrmopo_dir_reads_from_env_var — ✅ verifies env var takes precedence
test_htrmopo_dir_default_is_fixed_path — ✅ verifies default without HTRMOPO_DIR set
test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string — ✅ regression test for the or-fallback (the bead6f18 fix)
test_startup_logs_warning_when_running_as_root — ✅ canary behavior verified
test_startup_does_not_warn_when_running_as_non_root — ✅ happy path verified

Using importlib.reload to test module-level constants is the correct approach here — there's no cleaner way to reset a module global between tests without restructuring the production code. The cleanup reload after the with block is the right pattern. ✅

@pytest.mark.asyncio on both async tests. ✅ AsyncClient + ASGITransport for in-process API testing. ✅ Both patches isolate the engine (kraken_engine.load_models, load_spell_checker) cleanly. ✅

Code quality

os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo" — idiomatic Python, handles empty-string correctly. Clear improvement over os.path.expanduser("~/.local/share/htrmopo"). ✅

Root canary: if os.getuid() == 0: — one line, does exactly one thing, triggers exactly once at startup. ✅

Minor suggestion (non-blocker)

RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1000 ocr \
    && mkdir -p /home/ocr /app/models /app/cache \
    && chown -R ocr:ocr /app /home/ocr
RUN chmod +x /app/entrypoint.sh   # ← separate layer

The chmod could be folded into the COPY . . step or added to the first RUN block to save a layer. Functionally correct as-is — not a blocker.

## 👨‍💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer **Verdict: ✅ Approved** ### TDD check Tests exist and cover the changed behavior: - `test_htrmopo_dir_reads_from_env_var` — ✅ verifies env var takes precedence - `test_htrmopo_dir_default_is_fixed_path` — ✅ verifies default without `HTRMOPO_DIR` set - `test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string` — ✅ regression test for the `or`-fallback (the bead6f18 fix) - `test_startup_logs_warning_when_running_as_root` — ✅ canary behavior verified - `test_startup_does_not_warn_when_running_as_non_root` — ✅ happy path verified Using `importlib.reload` to test module-level constants is the correct approach here — there's no cleaner way to reset a module global between tests without restructuring the production code. The cleanup reload after the `with` block is the right pattern. ✅ `@pytest.mark.asyncio` on both async tests. ✅ `AsyncClient` + `ASGITransport` for in-process API testing. ✅ Both patches isolate the engine (`kraken_engine.load_models`, `load_spell_checker`) cleanly. ✅ ### Code quality `os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"` — idiomatic Python, handles empty-string correctly. Clear improvement over `os.path.expanduser("~/.local/share/htrmopo")`. ✅ Root canary: `if os.getuid() == 0:` — one line, does exactly one thing, triggers exactly once at startup. ✅ ### Minor suggestion (non-blocker) ```dockerfile RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1000 ocr \ && mkdir -p /home/ocr /app/models /app/cache \ && chown -R ocr:ocr /app /home/ocr RUN chmod +x /app/entrypoint.sh # ← separate layer ``` The `chmod` could be folded into the `COPY . .` step or added to the first `RUN` block to save a layer. Functionally correct as-is — not a blocker.

marcel commented

2026-05-17 19:02:02 +02:00

🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer

Verdict: ✅ Approved

Solid infrastructure change. Let me run through my checklist:

What's correct

Named volumes — both compose files continue to use named volumes (ocr_cache, ocr-cache), not bind mounts. ✅

Volume mount path corrected — /root/.cache → /app/cache matches the non-root user's environment. The named volume name itself stays the same; only the container-side mount path changes. ✅

read_only: true + carve-outs — correct pattern:

read_only: true
tmpfs:
  - /tmp:size=512m
volumes:
  - ocr_models:/app/models
  - ocr_cache:/app/cache

Named volume mounts remain writable even with read_only: true (Docker mounts them r/w by default). Only the container rootfs is locked. This is exactly right. ✅

cap_drop: [ALL] + no-new-privileges:true — defense in depth. A Python/FastAPI service on port 8000+ needs zero Linux capabilities. ✅

Both dev and prod compose updated — not just one environment. ✅

DEPLOYMENT.md upgrade notes — the three-command block for dev/prod/staging environments is exactly what an ops runbook needs. Volume names match the actual project/service names. ✅

ENV vars in compose duplicate Dockerfile ENV — HF_HOME, XDG_CACHE_HOME, TORCH_HOME are set in both Dockerfile and compose. This is the right approach: Dockerfile sets build-time defaults, compose can override or reinforce at runtime. ✅

Suggestion (non-blocker)

The upgrade note could optionally include a migration path for teams who want to avoid the 1–2 GB re-download:

# Optional: preserve existing cache data instead of re-downloading
# Run as root before dropping the volume
docker run --rm \
  -v familienarchiv_ocr_cache:/data \
  alpine sh -c "chown -R 1000:1000 /data"

This chown approach would allow the existing cache to be reused under the new ocr user (uid=1000), avoiding the re-download. Not a blocker — re-downloading is acceptable — but worth knowing the option exists. The current docs are correct as written; this is purely additive.

## 🔧 Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer **Verdict: ✅ Approved** Solid infrastructure change. Let me run through my checklist: ### What's correct **Named volumes** — both compose files continue to use named volumes (`ocr_cache`, `ocr-cache`), not bind mounts. ✅ **Volume mount path corrected** — `/root/.cache` → `/app/cache` matches the non-root user's environment. The named volume name itself stays the same; only the container-side mount path changes. ✅ **`read_only: true` + carve-outs** — correct pattern: ```yaml read_only: true tmpfs: - /tmp:size=512m volumes: - ocr_models:/app/models - ocr_cache:/app/cache ``` Named volume mounts remain writable even with `read_only: true` (Docker mounts them r/w by default). Only the container rootfs is locked. This is exactly right. ✅ **`cap_drop: [ALL]` + `no-new-privileges:true`** — defense in depth. A Python/FastAPI service on port 8000+ needs zero Linux capabilities. ✅ **Both dev and prod compose updated** — not just one environment. ✅ **DEPLOYMENT.md upgrade notes** — the three-command block for dev/prod/staging environments is exactly what an ops runbook needs. Volume names match the actual project/service names. ✅ **ENV vars in compose duplicate Dockerfile ENV** — `HF_HOME`, `XDG_CACHE_HOME`, `TORCH_HOME` are set in both Dockerfile and compose. This is the right approach: Dockerfile sets build-time defaults, compose can override or reinforce at runtime. ✅ ### Suggestion (non-blocker) The upgrade note could optionally include a migration path for teams who want to avoid the 1–2 GB re-download: ```bash # Optional: preserve existing cache data instead of re-downloading # Run as root before dropping the volume docker run --rm \ -v familienarchiv_ocr_cache:/data \ alpine sh -c "chown -R 1000:1000 /data" ``` This `chown` approach would allow the existing cache to be reused under the new `ocr` user (uid=1000), avoiding the re-download. Not a blocker — re-downloading is acceptable — but worth knowing the option exists. The current docs are correct as written; this is purely additive.

marcel commented

2026-05-17 19:02:16 +02:00

🔐 Nora "NullX" Steiner — Application Security Engineer (OSWE, BSCP)

Verdict: ✅ Approved

Good security work. Let me give it the adversarial read-through.

What this PR gets right

CIS Docker §4.1 — non-root ✅

RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1000 ocr
USER ocr

Fixed UID (1000), no login shell, no home directory created by the OS (home is manually provisioned and owned). This is correct hardening, not just the minimum.

CIS Docker §4.6 — read-only filesystem ✅
Write surface is explicitly declared and bounded: /app/models (volume), /app/cache (volume), /tmp (tmpfs, size-capped at 512 MB). Any unexpected write attempt becomes a visible PermissionError. The read_only: true flag is a security control, not just a hygiene measure.

Privilege escalation prevention ✅
cap_drop: ALL + no-new-privileges:true closes the capability regain vector. If a SUID binary exists in a dependency, it cannot escalate. Both applied consistently across dev and prod compose files.

HTRMOPO_DIR fix ✅
os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo" eliminates the ~ expansion failure under --no-create-home. The old code could resolve ~ to / (or fail silently) when HOME is not what the library expects. The new code has a predictable, auditable fallback.

Root canary ✅
The startup os.getuid() == 0 check means a misconfigured USER directive (e.g., accidentally dropped in a future Dockerfile edit) is caught at startup via log monitoring, not discovered during a post-incident review.

Security observation (not a blocker, requires separate PR)

ADR-019 explicitly names this risk:

"A path-traversal in the ZIP-based training endpoint could overwrite arbitrary paths on the container filesystem"

This PR reduces the blast radius of a ZIP Slip attack significantly — with read_only: true, a successful exploit can only write to /tmp (ephemeral, 512 MB cap), /app/models, or /app/cache. It cannot overwrite Python source files or escalate further into the container.

However, the root cause (ZIP Slip in training ZIP extraction) is not patched in this PR. I'd recommend a follow-up issue to audit _validate_zip_entry() (or introduce one if it doesn't exist) for all training endpoints. The tmpfs size cap on /tmp provides an effective denial-of-disk boundary even if the path traversal check is incomplete.

This is not a blocker for this PR — the defense-in-depth here is correct. The ADR acknowledges it. Just flag it for the next security cycle.

## 🔐 Nora "NullX" Steiner — Application Security Engineer (OSWE, BSCP) **Verdict: ✅ Approved** Good security work. Let me give it the adversarial read-through. ### What this PR gets right **CIS Docker §4.1 — non-root** ✅ ```dockerfile RUN useradd --no-create-home --shell /usr/sbin/nologin --uid 1000 ocr USER ocr ``` Fixed UID (1000), no login shell, no home directory created by the OS (home is manually provisioned and owned). This is correct hardening, not just the minimum. **CIS Docker §4.6 — read-only filesystem** ✅ Write surface is explicitly declared and bounded: `/app/models` (volume), `/app/cache` (volume), `/tmp` (tmpfs, size-capped at 512 MB). Any unexpected write attempt becomes a visible `PermissionError`. The `read_only: true` flag is a security control, not just a hygiene measure. **Privilege escalation prevention** ✅ `cap_drop: ALL` + `no-new-privileges:true` closes the capability regain vector. If a SUID binary exists in a dependency, it cannot escalate. Both applied consistently across dev and prod compose files. **HTRMOPO_DIR fix** ✅ `os.environ.get("HTRMOPO_DIR") or "/app/models/.htrmopo"` eliminates the `~` expansion failure under `--no-create-home`. The old code could resolve `~` to `/` (or fail silently) when HOME is not what the library expects. The new code has a predictable, auditable fallback. **Root canary** ✅ The startup `os.getuid() == 0` check means a misconfigured `USER` directive (e.g., accidentally dropped in a future Dockerfile edit) is caught at startup via log monitoring, not discovered during a post-incident review. ### Security observation (not a blocker, requires separate PR) ADR-019 explicitly names this risk: > "A path-traversal in the ZIP-based training endpoint could overwrite arbitrary paths on the container filesystem" This PR reduces the **blast radius** of a ZIP Slip attack significantly — with `read_only: true`, a successful exploit can only write to `/tmp` (ephemeral, 512 MB cap), `/app/models`, or `/app/cache`. It cannot overwrite Python source files or escalate further into the container. However, the **root cause** (ZIP Slip in training ZIP extraction) is not patched in this PR. I'd recommend a follow-up issue to audit `_validate_zip_entry()` (or introduce one if it doesn't exist) for all training endpoints. The tmpfs size cap on `/tmp` provides an effective denial-of-disk boundary even if the path traversal check is incomplete. This is not a blocker for this PR — the defense-in-depth here is correct. The ADR acknowledges it. Just flag it for the next security cycle.

marcel commented

2026-05-17 19:02:28 +02:00

🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist

Verdict: ✅ Approved

Test coverage assessment

New tests added: 5 across two files. All cover changed behavior directly.

test_ensure_blla_model.py (3 new tests):

Test	Behavior	Assessment
`test_htrmopo_dir_reads_from_env_var`	Env var takes precedence	✅ correct
`test_htrmopo_dir_default_is_fixed_path`	Default when unset	✅ uses `clear=True` — correct approach
`test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string`	Empty string → fallback	✅ regression for the `or`-fallback

test_main.py (2 new tests):

Test	Behavior	Assessment
`test_startup_logs_warning_when_running_as_root`	canary fires at uid=0	✅
`test_startup_does_not_warn_when_running_as_non_root`	no false positive at uid=1000	✅

Test names are sentences — tells you what broke without reading the body. ✅

@pytest.mark.asyncio present on both async tests. ✅ ASGITransport for in-process testing (no real server). ✅

Potential fragility (minor, non-blocker)

The importlib.reload pattern is correct but has an ordering sensitivity: if a test fails between the with block and the cleanup reload, subsequent tests may see dirty module state:

with patch.dict(os.environ, {"HTRMOPO_DIR": "/custom/htrmopo"}):
    importlib.reload(ensure_blla_model)
    result = ensure_blla_model.HTRMOPO_DIR
importlib.reload(ensure_blla_model)  # ← runs outside the with block; safe
assert result == "/custom/htrmopo"   # ← but if reload above raises, state is dirty

In practice, importlib.reload on this module is unlikely to raise. And pytest runs these sequentially, not in parallel. Acceptable risk — just worth knowing. A pytest.fixture(autouse=True) that wraps cleanup in a finally block would be the robust version for a future refactor.

Missing (acceptable for this PR scope)

No automated CI test verifying the container actually runs as uid=1000 at runtime. The manual test plan covers this. Acceptable.
No test verifying read_only: true enforcement at runtime. Infrastructure-level testing. Acceptable.

The unit tests cover exactly the changed Python code. ✅

## 🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist **Verdict: ✅ Approved** ### Test coverage assessment **New tests added: 5** across two files. All cover changed behavior directly. `test_ensure_blla_model.py` (3 new tests): | Test | Behavior | Assessment | |------|----------|------------| | `test_htrmopo_dir_reads_from_env_var` | Env var takes precedence | ✅ correct | | `test_htrmopo_dir_default_is_fixed_path` | Default when unset | ✅ uses `clear=True` — correct approach | | `test_htrmopo_dir_falls_back_to_default_when_set_to_empty_string` | Empty string → fallback | ✅ regression for the `or`-fallback | `test_main.py` (2 new tests): | Test | Behavior | Assessment | |------|----------|------------| | `test_startup_logs_warning_when_running_as_root` | canary fires at uid=0 | ✅ | | `test_startup_does_not_warn_when_running_as_non_root` | no false positive at uid=1000 | ✅ | Test names are sentences — tells you what broke without reading the body. ✅ `@pytest.mark.asyncio` present on both async tests. ✅ `ASGITransport` for in-process testing (no real server). ✅ ### Potential fragility (minor, non-blocker) The `importlib.reload` pattern is correct but has an ordering sensitivity: if a test fails *between* the `with` block and the cleanup reload, subsequent tests may see dirty module state: ```python with patch.dict(os.environ, {"HTRMOPO_DIR": "/custom/htrmopo"}): importlib.reload(ensure_blla_model) result = ensure_blla_model.HTRMOPO_DIR importlib.reload(ensure_blla_model) # ← runs outside the with block; safe assert result == "/custom/htrmopo" # ← but if reload above raises, state is dirty ``` In practice, `importlib.reload` on this module is unlikely to raise. And pytest runs these sequentially, not in parallel. Acceptable risk — just worth knowing. A `pytest.fixture(autouse=True)` that wraps cleanup in a `finally` block would be the robust version for a future refactor. ### Missing (acceptable for this PR scope) - No automated CI test verifying the container actually runs as uid=1000 at runtime. The manual test plan covers this. Acceptable. - No test verifying `read_only: true` enforcement at runtime. Infrastructure-level testing. Acceptable. The unit tests cover exactly the changed Python code. ✅

marcel commented

2026-05-17 19:02:37 +02:00

📋 Elicit — Requirements Engineer

Verdict: ⚠️ Approved with concerns

Reviewing from a requirements and operational completeness standpoint.

Requirement coverage

Issue #459 (run OCR container as non-root, CIS Docker §4.1) is fully addressed. All stated acceptance criteria in the PR description are testable and specific. ✅

ADR-019 correctly documents the why — the threat model section is unusually concrete for an ADR (names specific attack scenarios). This is the right way to write an architectural decision that involves security trade-offs. ✅

Concern: upgrade path requires full model re-download (~1–2 GB one-time cost)

What happens: The ocr_cache volume was written as root (uid=0). The new ocr user (uid=1000) cannot read these files. The upgrade note correctly instructs operators to drop and recreate the volume.

The gap: The documentation frames this as "the volume is recreated automatically on docker compose up" — which is true, but the model re-download takes time and bandwidth. There is an alternative that avoids the re-download entirely:

# Before stopping the old container: fix ownership in-place
docker run --rm \
  -v familienarchiv_ocr_cache:/data \
  alpine sh -c "chown -R 1000:1000 /data"
# Then restart — no volume drop needed, no re-download

This would make the upgrade path lower-friction for deployments with limited bandwidth or download-rate limits on model hosting.

Recommendation: Add a note to DEPLOYMENT.md §8 offering this as an alternative to the docker volume rm path. Operators on slow connections (VPS in a bandwidth-limited region, for example) would benefit from having both options documented. This is not a blocker — the current instructions are correct and complete. It's a usability improvement to the runbook.

What's well-specified ✅

Three deployment contexts documented (dev, prod, staging) with exact volume names
"Existing installs" callout in the PR description gives immediate context before the operator reads DEPLOYMENT.md
Estimated download cost (~1–2 GB) sets expectations correctly

## 📋 Elicit — Requirements Engineer **Verdict: ⚠️ Approved with concerns** Reviewing from a requirements and operational completeness standpoint. ### Requirement coverage Issue #459 (run OCR container as non-root, CIS Docker §4.1) is fully addressed. All stated acceptance criteria in the PR description are testable and specific. ✅ ADR-019 correctly documents the *why* — the threat model section is unusually concrete for an ADR (names specific attack scenarios). This is the right way to write an architectural decision that involves security trade-offs. ✅ ### Concern: upgrade path requires full model re-download (~1–2 GB one-time cost) **What happens:** The `ocr_cache` volume was written as root (uid=0). The new `ocr` user (uid=1000) cannot read these files. The upgrade note correctly instructs operators to drop and recreate the volume. **The gap:** The documentation frames this as "the volume is recreated automatically on `docker compose up`" — which is true, but the model re-download takes time and bandwidth. There is an alternative that avoids the re-download entirely: ```bash # Before stopping the old container: fix ownership in-place docker run --rm \ -v familienarchiv_ocr_cache:/data \ alpine sh -c "chown -R 1000:1000 /data" # Then restart — no volume drop needed, no re-download ``` This would make the upgrade path lower-friction for deployments with limited bandwidth or download-rate limits on model hosting. **Recommendation:** Add a note to DEPLOYMENT.md §8 offering this as an alternative to the `docker volume rm` path. Operators on slow connections (VPS in a bandwidth-limited region, for example) would benefit from having both options documented. This is not a blocker — the current instructions are correct and complete. It's a usability improvement to the runbook. ### What's well-specified ✅ - Three deployment contexts documented (dev, prod, staging) with exact volume names - "Existing installs" callout in the PR description gives immediate context before the operator reads DEPLOYMENT.md - Estimated download cost (~1–2 GB) sets expectations correctly

marcel commented

2026-05-17 19:02:41 +02:00

🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead

Verdict: ✅ Approved

No frontend, UI, or Svelte changes in this PR. I checked the full diff — all changes are confined to:

ocr-service/Dockerfile
docker-compose.yml / docker-compose.prod.yml
ocr-service/main.py, ocr-service/ensure_blla_model.py
ocr-service/test_*.py
docs/DEPLOYMENT.md, docs/adr/019-*.md

Nothing touches the SvelteKit frontend, Tailwind styles, Svelte components, or any user-facing feature. No accessibility, brand, or responsive design concerns apply to this PR.

LGTM from a UI perspective. 🟢

## 🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead **Verdict: ✅ Approved** No frontend, UI, or Svelte changes in this PR. I checked the full diff — all changes are confined to: - `ocr-service/Dockerfile` - `docker-compose.yml` / `docker-compose.prod.yml` - `ocr-service/main.py`, `ocr-service/ensure_blla_model.py` - `ocr-service/test_*.py` - `docs/DEPLOYMENT.md`, `docs/adr/019-*.md` Nothing touches the SvelteKit frontend, Tailwind styles, Svelte components, or any user-facing feature. No accessibility, brand, or responsive design concerns apply to this PR. LGTM from a UI perspective. 🟢

marcel merged commit f1e0b92f47 into main

2026-05-17 19:06:47 +02:00

marcel deleted branch feat/issue-459-ocr-non-root

2026-05-17 19:06:47 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#611

security(ocr): run OCR container as non-root user (CIS Docker §4.1) #611

Summary

Test plan

🏗️ Markus Keller — Senior Application Architect

Blocker

Suggestion

What's done well

👨‍💻 Felix Brandt — Senior Fullstack Developer

Blocker

Suggestion

Minor

What's done well

🛠️ Tobias Wendt — DevOps & Platform Engineer

Blocker

What's done well

Note

📋 Elicit — Requirements Engineer

Concern: Undocumented Upgrade NFR

Requirements Coverage (✅ complete)

What's done well

🔐 Nora "NullX" Steiner — Application Security Engineer

Blocker: /home/ocr is on the read-only container filesystem

What's done well

🧪 Sara Holt — Senior QA Engineer

Blocker: test_htrmopo_dir_default_is_fixed_path — second assertion is vacuous

Concern: No automated coverage for the root warning in main.py

Suggestion: Turn manual step 2 of the test plan into a CI gate

What's done well

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Review concerns addressed

✅ @Felix / @Sara — Blocker: vacuous assertion in test_htrmopo_dir_default_is_fixed_path

✅ @Felix / @Sara — Minor/Concern: no test for root canary in main.py

✅ @Nora — Blocker: /home/ocr is on the read-only container filesystem

✅ @Markus / @Tobias / @Elicit — Blocker: volume migration step not documented

✅ @Markus — Suggestion: ADR for container hardening baseline

Deferred

🏗️ Markus Keller — Senior Application Architect

Blocker

Suggestion

What's done well

🛠️ Tobias Wendt — DevOps & Platform Engineer

Blocker: docker-compose.prod.yml is standalone and not updated

Blocker: Upgrade notes reference the wrong volume name for production

Minor: cap_drop: [ALL] — flow notation

What's done well

🔐 Nora "NullX" Steiner — Application Security Engineer

Blocker: Security hardening absent from prod compose

Concern: XDG_CONFIG_HOME and XDG_DATA_HOME still default to ~/.config and ~/.local/share

What's done well

👨‍💻 Felix Brandt — Senior Fullstack Developer

What's verified

Note

🧪 Sara Holt — Senior QA Engineer

What's verified

Remaining deferred

One observation

📋 Elicit — Requirements Engineer

Blocker: Production upgrade path is broken

Requirements coverage check (updated)

What's done well

🎨 Leonie Voss — UX Designer & Accessibility Strategist

Review concerns addressed — commit 7769dbc9

docker-compose.prod.yml — OCR service fully hardened

docs/DEPLOYMENT.md §8 — Upgrade notes cover all environments

🏛️ Markus Keller — Senior Application Architect

What I checked

One concern (non-blocking)

What was done well

👨‍💻 Felix Brandt — Senior Fullstack Developer

TDD evidence

Code quality

Minor observation (non-blocking)

Naming

🚀 Tobias Wendt — DevOps & Platform Engineer

What was done well

Blocker

Concerns

What's correct that I specifically verified

📋 Elicit — Requirements Engineer

Traceability assessment

Blocker: `/home/ocr` is on the read-only container filesystem

Blocker: `test_htrmopo_dir_default_is_fixed_path` — second assertion is vacuous

Concern: No automated coverage for the root warning in `main.py`

✅ @Felix / @Sara — Blocker: vacuous assertion in `test_htrmopo_dir_default_is_fixed_path`

✅ @Felix / @Sara — Minor/Concern: no test for root canary in `main.py`

✅ @Nora — Blocker: `/home/ocr` is on the read-only container filesystem

Blocker: `docker-compose.prod.yml` is standalone and not updated

Minor: `cap_drop: [ALL]` — flow notation

Concern: `XDG_CONFIG_HOME` and `XDG_DATA_HOME` still default to `~/.config` and `~/.local/share`

Review concerns addressed — commit `7769dbc9`

`docker-compose.prod.yml` — OCR service fully hardened

`docs/DEPLOYMENT.md §8` — Upgrade notes cover all environments

Concern 1 — Untested edge case: empty string `HTRMOPO_DIR` (minor)

Concern 2 — `importlib.reload` is fragile under parallel test execution (minor)

✅ @Sara — Concern 1: empty-string `HTRMOPO_DIR` falls back to empty string, not the default

✅ @Tobias — Concern 1: `cap_drop: [ALL]` inline notation inconsistent with `docker-compose.prod.yml`