docs(legibility): DOC-5 — write docs/DEPLOYMENT.md #443
Summary

Adds `docs/DEPLOYMENT.md` — the Day-1 operational reference and bootstrap checklist.

What's covered (7 sections)

- Env var reference from `docker-compose.yml` and `application.yaml`, organized by service (backend, PostgreSQL, MinIO, OCR); includes two undocumented gaps: `APP_ADMIN_USERNAME`/`APP_ADMIN_PASSWORD` (not in `.env.example`) and `ALLOWED_PDF_HOSTS`/`BLLA_MODEL_PATH` (not in compose); a Sensitive column for secrets management
- Health checks (`/actuator/health`), `docker compose logs` commands, log locations, Prometheus port note, future monitoring (Phase 7)
- Backups: Manual today (`pg_dump`) vs Planned (Phase 5); avoids implying backups exist
- Common tasks: `reset-db.sh`, `rebuild-frontend.sh`, `download-kraken-models.sh`, mass import; includes caveats for hardcoded values

Design decisions implemented

- Summarise and link to `docs/infrastructure/production-compose.md` for the full Compose file and VPS provisioning — does not duplicate it

Test plan

- Render `docs/DEPLOYMENT.md` in Gitea — verify the Mermaid diagram and all tables render
- Check cross-links: `docs/infrastructure/production-compose.md`, `docs/adr/001-ocr-python-microservice.md`
- Verify env var completeness: `grep -E '\$\{[A-Z_]+\}' docker-compose.yml backend/src/main/resources/application*.yaml | grep -oE '\$\{[A-Z_]+[^}:]*\}' | sort -u`

Closes #399
🤖 Generated with Claude Code
🛠️ Tobias Wendt (@tobiwendt) — DevOps & Platform Engineer
Verdict: 🚫 Changes requested
Good operational doc overall — the structure is solid, the security checklist is actionable, and the "no automated backup yet" honesty is exactly right. But there are two factual errors in the commands that would bite an operator on day one.
Blockers
1. Wrong container name in the backup command (§5)
The backup snippet addresses the database container as `familienarchiv-db-1`. But `docker-compose.yml` sets an explicit `container_name: archive-db`, and Docker Compose only generates the `<project>-<service>-<n>` pattern when there is no explicit `container_name`. The correct command targets `archive-db`. This is the kind of error that costs 10 minutes of debugging during an incident, when you least want it.
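For concreteness, a sketch of the corrected command (the user and database names are the hardcoded values the reviews cite from `reset-db.sh`; adjust to your environment):

```shell
# Sketch only: container name from docker-compose.yml (container_name: archive-db);
# credentials are the values reset-db.sh hardcodes.
docker exec archive-db pg_dump -U archive_user family_archive_db \
  > "backup_$(date +%F).sql"
```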
2. `KRAKEN_MODEL_PATH` missing from the OCR env var table (§2)

The doc lists `BLLA_MODEL_PATH` (the segmentation model) but omits `KRAKEN_MODEL_PATH`, which is the main HTR model path set in `docker-compose.yml` line 92 (`/app/models/german_kurrent.mlmodel`) and read by `main.py` line 32. This is the more operationally critical of the two model paths — it controls which Kurrent model is loaded for every OCR run.

Add a row:

| `KRAKEN_MODEL_PATH` | Path to the Kurrent HTR model loaded at startup | `/app/models/german_kurrent.mlmodel` | — | — |

Concerns (non-blocking but worth fixing)
3. Topology diagram shows `:5173` — ambiguous for a production-oriented doc

The Mermaid diagram shows `HTTP :5173` for the frontend. That's the dev port (SvelteKit Vite dev server). The production overlay changes this to `:3000` (Node adapter). Since this doc is framed as a production operational reference, the diagram should clarify or split: either show the prod topology (`:3000`) or label `:5173` as dev-only.

4. Management port 8081 stated but not configured in `application.yaml`

§4 says "management port 8081 (Spring Actuator / Prometheus scrape) is internal only." The production compose overlay does expose `8081`, but `application.yaml` has no `management.server.port: 8081` entry. Either this is configured in a prod-profile YAML not checked into this repo, or the statement is aspirational. Worth a note clarifying where the port separation is actually configured.

5. `SPRING_PROFILES_ACTIVE` default in the table is `dev,e2e`

The table correctly states the default. But the security checklist (§3) says to set `SPRING_PROFILES_ACTIVE=prod` — this guidance is correct, but it would be helpful to note that this override goes in the prod overlay env, not in `.env` (the `.env` file has no `SPRING_PROFILES_ACTIVE` entry).

6. MinIO service name in log commands

The §4 log command lists `# services: frontend, backend, db, minio, ocr-service`. The actual service names in `docker-compose.yml` are `minio` ✓ and `ocr-service` ✓. Those are correct — no action needed, just confirming I checked.

What's done well

- Summarising and linking to `docs/infrastructure/production-compose.md` rather than duplicating is exactly the right call.
- The `reset-db.sh` caveat about hardcoded values is accurate (`DB_USER=archive_user`, `DB_NAME=family_archive_db` are indeed hardcoded in the script).

🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
The security posture documented here is generally sound. The SSRF warning on `ALLOWED_PDF_HOSTS`, the STARTTLS/SMTP-auth guidance, and the "use a dedicated MinIO service account" instruction are all the right things to say. A couple of items need sharpening before this doc can be trusted as the production security reference.

Concerns
1. Internal inconsistency: `OCR_TRAINING_TOKEN` vs `TRAINING_TOKEN` (§2 vs §3)

The §2 env table (OCR service section) lists the container-level variable as `TRAINING_TOKEN`, while the §3 security checklist refers to `OCR_TRAINING_TOKEN`. These are two different names. `OCR_TRAINING_TOKEN` is the `.env` file variable; `TRAINING_TOKEN` is what the OCR container actually reads (confirmed: `docker-compose.yml` maps `TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"`). The doc mixes the two levels without explaining the mapping. An operator reading §3 might set `TRAINING_TOKEN` directly in `.env` and wonder why it doesn't work.

Recommend: in §3, clarify that `OCR_TRAINING_TOKEN` in `.env` maps to `TRAINING_TOKEN` inside the OCR container. Alternatively, add a note in the table row explaining the indirection.

2. `APP_ADMIN_PASSWORD` default of `admin123` in `application.yaml` is a hardcoded fallback credential (CWE-259)

The doc correctly flags this as "⚠ ships as `admin123`" and tells operators to change it. But `application.yaml` line 67 uses the colon-default placeholder form (`${APP_ADMIN_PASSWORD:admin123}`): if `APP_ADMIN_PASSWORD` is absent from the environment, Spring silently uses `admin123` in production. A missed env var = live admin account with known password. This is a real risk, not just a documentation concern.

The doc should escalate the severity here: this misconfiguration should block deployment if left unchanged. Consider recommending that `application.yaml` be changed to `${APP_ADMIN_PASSWORD}` (no default) so the application fails to start — not silently launches with a weak credential — when the var is missing. That's the "fail closed" principle (CWE-636 is the alternative).

3. No mention of HTTPS-only / HSTS configuration
§3's bootstrap checklist has 8 items, but none of them mention ensuring TLS is enforced via Caddy and HSTS. The production Caddyfile in `docs/infrastructure/production-compose.md` does include `Strict-Transport-Security`, but an operator following only this doc wouldn't know to verify it. Add a checklist item: "Confirm the Caddyfile includes `Strict-Transport-Security: max-age=31536000; includeSubDomains; preload`."

4. MinIO root credential used as S3 credentials in dev compose
The doc notes in §1 (dev vs prod differences): "MinIO service account for `S3_ACCESS_KEY`/`S3_SECRET_KEY`" — good. But looking at `docker-compose.yml` lines 133–134, the dev compose wires `S3_ACCESS_KEY: ${MINIO_ROOT_USER}` — i.e. root credentials. The doc doesn't explain that dev intentionally uses root (acceptable for local dev) but production must use a service account. Adding a one-line note clarifying "dev uses root credentials by design; production must create a dedicated MinIO service account" would close this gap for operators who follow the dev → prod path.

What's done well
- The `ALLOWED_PDF_HOSTS` SSRF warning is correctly framed, and the default value (`minio,localhost,127.0.0.1`) matches what `main.py` actually uses.
- `python3 -c "import secrets; print(secrets.token_hex(32))"` for the training token is the right way to generate a secure random token.
- The note that `SPRING_PROFILES_ACTIVE=prod` is critical — the dev profile exposes Swagger UI and `/v3/api-docs`. Getting this wrong would expose the full API schema externally.
- The reverse-proxy note (Caddy blocks `/actuator/*` externally) is accurate and important.

🏗️ Markus Keller (@mkeller) — Senior Application Architect
Verdict: ✅ Approved
This is a documentation PR — no code changes, no architectural decisions introduced. My review focuses on whether the documented topology accurately represents the actual architecture and whether the ADR references are sound.
What I checked
Topology accuracy (§1): The Mermaid diagram matches the actual service graph in `docker-compose.yml`. The OCR service has `expose: ["8000"]` (no external port — correct). SSE going directly browser → backend without passing through SvelteKit SSR is correct: the SvelteKit frontend acts as a pass-through for the SSE connection, not a mediator. The Caddy → backend at `:8080` and Caddy → frontend at `:5173` are accurately represented (noting Tobias' comment that `:5173` is dev-only is valid — production switches to `:3000`).

ADR references (§7): The Known Limitations table references ADR-001 for the single-node OCR constraint. `docs/adr/001-ocr-python-microservice.md` exists in the repository. The link path `adr/001-ocr-python-microservice.md` (relative from `docs/`) resolves correctly.

Monolith-first framing: The doc correctly identifies that horizontal scaling of OCR would require a job queue, and notes this as a deliberate scope decision rather than a bug. ADR-001 captures the architectural rationale. The Known Limitations table structure — with reason and reference — is the right way to document deliberate constraints.

Domain layering: The mass import documentation in §6 correctly identifies `POST /api/admin/trigger-import` and `GET /api/admin/import-status` — verified against `AdminController.java`. The note that the import runs asynchronously is accurate (`MassImportService` uses `@Async`).

Minor observation

The doc references `docs/infrastructure/production-compose.md` throughout. That document exists and contains the prod overlay. The cross-linking strategy (summarise and link, don't duplicate) is exactly right for operational docs.

No architectural concerns. The blockers flagged by Tobias (wrong container name, missing `KRAKEN_MODEL_PATH`) are operational accuracy issues, not architectural ones.

👨💻 Felix Brandt (@felixbrandt) — Senior Fullstack Developer
Verdict: ✅ Approved
This is a docs-only PR and outside my primary review domain (code quality, TDD, clean code). I read it through the lens of: does this accurately represent how the backend and frontend work, and would a developer following these instructions end up in trouble?
What I verified
`/actuator/health` smoke test (§3): The bootstrap command `curl http://localhost:8080/actuator/health` is correct. The backend health check in `docker-compose.yml` also uses this endpoint: `wget -qO- http://localhost:8080/actuator/health | grep -q UP`. Consistent.

Upload size limits (§7): The Known Limitations table says "50 MB per file (500 MB per request for multi-file)" — verified against `application.yaml`. Accurate.

Mass import endpoint (§6): `POST /api/admin/trigger-import` and `GET /api/admin/import-status` match `AdminController.java`. ✓

`import/` bind mount (§6): Step 1 says "Place the import file in the `import/` bind mount on the backend container." `docker-compose.yml` mounts `- ./import:/import` on the backend. ✓

Spring profile note: The doc states the dev compose sets `SPRING_PROFILES_ACTIVE: dev,e2e`. That's exactly what `docker-compose.yml` line 137 shows. The `dev` profile enables OpenAPI/Swagger (verified in `application-dev.yaml`). Accurate.

One minor naming nit

§4 "Log locations" says: "Backend application log: stdout (captured by Docker). Access inside the container at `/app/logs/` via `docker exec`." I can't confirm from the codebase that logging is also written to `/app/logs/` — the backend logs to stdout by default with Spring Boot's default Logback config, and there's no `logback-spring.xml` with a file appender visible in the repo. This might be aspirational rather than current. Low severity — worth confirming.

Otherwise, nothing that a developer would stumble over. LGTM from a code-accuracy perspective.
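For reference, the limits Felix verified would look like this in `application.yaml` (a sketch using Spring Boot's standard multipart property names; the values are from the review, the exact file layout is assumed):

```yaml
spring:
  servlet:
    multipart:
      max-file-size: 50MB      # per-file limit cited in §7
      max-request-size: 500MB  # per-request limit for multi-file uploads
```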
🧪 Sara Holt (@saraholt) — QA Engineer & Test Strategist
Verdict: ✅ Approved
This is a documentation PR with no testable code — no test strategy is directly applicable. But I reviewed it for testability of the documented operational procedures: can an operator verify they followed the instructions correctly?
What I checked
Smoke test in bootstrap sequence (§3): The sequence ends with the `curl http://localhost:8080/actuator/health` check. This is a concrete, machine-verifiable smoke test. Good. One minor gap: there is no verification that the frontend is serving correctly — the health check only confirms the backend is up. Adding a `curl -s http://localhost:5173 | grep -q "Familienarchiv"` (or similar) as step 7 would give an end-to-end confirmation.

Log commands (§4): All log service names (`frontend`, `backend`, `db`, `minio`, `ocr-service`) are verified against `docker-compose.yml`. Correct.

Backup restore procedure (§5): The restore snippet has the wrong container name (Tobias flagged this — it should be `archive-db`). More importantly: there is no verification step after the restore. A complete operational procedure should include one. Without it, an operator has no way to confirm the restore succeeded beyond "the command didn't error."
Security checklist (§3): All 8 checklist items are concrete and binary (can check/uncheck). This is good test design for a checklist. However, "Confirm `ALLOWED_PDF_HOSTS` is locked to your MinIO/S3 hostname" has no verification command. Adding `docker exec archive-ocr printenv ALLOWED_PDF_HOSTS` as an inline suggestion would close this.

Summary
The doc is honest about what is and isn't automated. The main testability gap is the missing restore verification step in §5 — that's the highest-risk gap because a failed restore with no verification is effectively no backup at all.
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing this as a requirements artifact: does the document fulfill the stated purpose (Day-1 operational reference and bootstrap checklist), is the audience clearly defined, and are all claims testable and complete?
Audience and scope
The stated audience ("operator bringing up a fresh instance, or Successor-X debugging a live incident") is well-defined and consistent throughout. The doc is correctly scoped: it summarises and links rather than duplicating `docs/infrastructure/production-compose.md`. That scope decision is documented explicitly, which is good requirements hygiene.

Completeness gaps
§2 env var table — stated rule not self-enforcing:
The table header says: "Any var found in `docker-compose.yml` or `application*.yaml` that is not in this table is a blocking review comment on any PR that changes those files." This is a requirements-style enforcement claim. The verification command in the test plan (`grep -E ...`) is a good start, but the rule should also be in the PR template or a CI lint step — otherwise it's a statement of intent, not an enforceable requirement. Recommend adding a note that the rule is enforced by convention, not automation, until a lint step is added.

`SPRING_PROFILES_ACTIVE` ownership ambiguity:

The table lists `SPRING_PROFILES_ACTIVE` as a backend env var with default `dev,e2e` (compose). But `.env.example` doesn't include this variable — it's hardcoded in `docker-compose.yml`. An operator reading the table might look for it in `.env` and not find it. The "how to override for production" path is only implicit (set it in the prod overlay). A one-line clarification ("this variable is set directly in the Compose file, not in `.env` — override it in the prod overlay") would eliminate ambiguity.

Requirement testability
The §3 bootstrap checklist items are correctly formulated as verifiable conditions (binary checkbox items). The Known Limitations table in §7 has the right columns (Limitation, Reason, Reference). The linking to ADRs for each limitation is excellent — this closes the "why is this a limitation and not a bug?" question for future readers.
Open question the doc doesn't surface
The doc says to call `POST /api/admin/trigger-import` for mass import. It doesn't specify what authentication the operator must use (it requires the `ADMIN` permission — noted in CLAUDE.md but not here). An operator who doesn't have the admin user set up would get a 403 with no explanation. A one-sentence note ("authenticate as the admin user configured via `APP_ADMIN_USERNAME`/`APP_ADMIN_PASSWORD`") would close this.

Overall
The document meets its stated purpose with good structure and honest scope. The gaps are all clarifications rather than missing sections. Approving with the above suggestions logged for the next iteration.
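If that note is added, it could be paired with a short sketch. The `-u` Basic-auth flag here is an assumption; if the backend uses session-cookie auth instead, the operator must log in first and pass the cookie:

```shell
# Hypothetical invocation; the auth mechanism is an assumption,
# the endpoints are the ones documented in §6.
curl -u "$APP_ADMIN_USERNAME:$APP_ADMIN_PASSWORD" \
  -X POST http://localhost:8080/api/admin/trigger-import
curl -u "$APP_ADMIN_USERNAME:$APP_ADMIN_PASSWORD" \
  http://localhost:8080/api/admin/import-status
```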
🎨 Leonie Voss (@leonievoss) — UI/UX Design Lead
Verdict: ✅ Approved
This PR adds an operational reference document — not UI, not a user-facing component, not a design spec. There is nothing here for me to review in terms of brand compliance, touch targets, contrast ratios, or component structure.
I did a quick scan for anything that could affect the frontend developer experience (since `rebuild-frontend.sh` is documented here), and found nothing that contradicts the established patterns.

One observation for completeness: §3 bootstrap step 6 says "Open the app and log in with the admin credentials from `.env`." The frontend in dev runs on port 5173 (`http://localhost:5173`). It would be helpful for a non-technical operator to have the explicit URL. Minor cosmetic suggestion, not a blocker.
Follow-up commit
Commit `1b61af93` addresses the two blockers:

- `familienarchiv-db-1` → `archive-db` in the §5 backup/restore commands (verified against `docker-compose.yml` line 5: `container_name: archive-db`)
- `KRAKEN_MODEL_PATH` added to the OCR service env vars table — this is the main HTR model path, set in `docker-compose.yml:92` as `/app/models/german_kurrent.mlmodel` and read by `main.py:32`

🔧 Tobias Wendt (DevOps) — Platform Engineer
Verdict: ⚠️ Approved with concerns
This is a solid operational reference. The command-first format, the dev-vs-prod diff table, and the explicit "no automated backup" honesty are exactly the kind of pragmatic documentation a successor needs at 3am. A few things need attention before I'd call this production-ready.
Blockers
1. Bootstrap sequence references a `docker-compose.prod.yml` that may not exist yet

`docs/DEPLOYMENT.md` §3 tells operators to bring the stack up with a `docker-compose.prod.yml` overlay. But the PR description notes that full provisioning lives in `docs/infrastructure/production-compose.md` (a docs file, not a compose file). If `docker-compose.prod.yml` doesn't exist in the repo yet, this command fails on day one — and there's no fallback or callout that it needs to be created first. Either the file exists (and should be referenced or linked directly), or the command needs a "create this first" note.

Action: Verify `docker-compose.prod.yml` exists. If it's the file documented in `docs/infrastructure/production-compose.md`, say so explicitly: "See `docs/infrastructure/production-compose.md` for the prod overlay file you need to create/copy."

Suggestions
2. `minio/minio:latest` flagged but not resolved

The dev-vs-prod table correctly calls out the unpinned MinIO tag in dev. The doc says prod uses "Pinned in prod overlay" — but it doesn't say what the pinned tag is. For a self-contained reference this matters; when the operator creates the prod overlay from scratch, they need a starting point.

Consider: add the recommended stable tag (e.g., `minio/minio:RELEASE.2024-xx-xx`) in a note, or link to the MinIO release page.
The backup snippet handles dev bind mounts (
./data/minio/). In production the doc says named volumes are used —docker cpormc mirroris different from a bind-mount copy. Add a production MinIO backup note, even a placeholder, so the successor isn't surprised.4. OCR model volume path in backup
The OCR models are stored in `/app/models/` (from `KRAKEN_MODEL_PATH`). They're not in the backup section. Models can be re-downloaded via `download-kraken-models.sh`, but a note like "OCR models are reproducible via `download-kraken-models.sh`; no backup needed" removes ambiguity.

5. Security checklist item uses inconsistent env var name
The §3 security checklist refers to the training token by a single name. But the §2 env table names it `APP_OCR_TRAINING_TOKEN` (backend) and `TRAINING_TOKEN` (OCR service). The checklist item should reference both names, or at minimum match the env table's names exactly, to avoid operator confusion.

6. `rebuild-frontend.sh` volume name assumption

The caveat already notes this (line: "If your project directory is not named `familienarchiv`, edit line 16") — good. But I'd add: this is exactly the kind of hardcoding that causes incidents. A future improvement would be to read the volume name from `docker compose config` instead of hardcoding it. Not a blocker for this docs PR, but worth a follow-up issue.

What's done well
- The `mem_limit` guidance is exactly right — this prevents a common "I deployed on CX22 and OCR crashed" incident.
- Blocking `/actuator/*` externally is documented and explained. Future operators will know why.

👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
This is a documentation-only PR — no production code, no tests, no API changes. My review focuses on whether the doc serves its audience accurately and doesn't create confusion for developers who will follow it.
Observations (not blockers)
1. `generate:api` workflow not mentioned in common tasks

§6 covers `reset-db.sh`, `rebuild-frontend.sh`, `download-kraken-models.sh`, and mass import. Missing from this list: `npm run generate:api` after any backend model/endpoint change. This is the most common source of TypeScript type errors per CLAUDE.md — a "Common operational tasks" section is exactly where a developer running the app locally would look for it.

Not a blocker for a deployment reference doc (which is ops-focused), but worth a note in a follow-up or in a separate dev-runbook issue.
2. Mass import API paths should be verified
§6 documents `POST /api/admin/trigger-import` and `GET /api/admin/import-status`. I haven't verified these exact paths against the current codebase. If they are stale (e.g., the path is `/api/importing/...` or similar), an operator following this doc will get a 404. The test plan checklist in the PR description doesn't include an API path verification step.

Suggestion: Add to the PR test plan: "Verify the `/api/admin/trigger-import` and `/api/admin/import-status` endpoints exist in the current backend."

3. Spring profiles note is accurate

The doc correctly distinguishes `dev,e2e` (exposes Swagger UI + `/v3/api-docs`) from `prod`. This matches CLAUDE.md and the `application-dev.yaml` pattern. Good.

4. `docker-compose.ci.yml` warning is valuable

This is a real footgun — including it prevents the "why isn't my bind mount working?" debugging session. Appreciated.
Summary
Clean documentation, accurate to the stack I know. The mass import API path is the only thing I'd verify before merge. Everything else is correct and genuinely useful for a developer picking up the project.
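One quick way to run that verification (a sketch; the source directory layout is an assumption, and the pattern only matches the path fragments, not full Spring mappings):

```shell
# Prints matching controller lines, or a warning when nothing is found.
grep -rnE "trigger-import|import-status" backend/src/main/java/ 2>/dev/null \
  || echo "endpoints not found - doc may be stale"
```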
🔒 Nora "NullX" Steiner — Application Security Engineer
Verdict: ⚠️ Approved with concerns
This document does a lot of things right from a security standpoint. It surfaces default credentials, documents SSRF protection, and calls out the service-account requirement for MinIO. A few items need tightening before this ships as the canonical operational reference.
Blockers
1. `APP_ADMIN_PASSWORD` default `admin123` is documented but the fail-safe is not enforced

The env table documents the `admin123` default, and the security checklist tells operators to change it. This is good documentation. But the security posture of documentation-only prevention for a default credential is weak. The doc should also call out explicitly that the application does not prevent startup with the default password — so an operator who skips the checklist has an exposed admin account with no warning. A future hardening issue should enforce either: startup failure when `APP_ADMIN_PASSWORD == "admin123"` in a non-dev profile, or a forced password change on first login.

Action: Add a note to the checklist item: "Note: the application does not currently prevent startup with the default password in the prod profile — this is tracked as a known limitation / future hardening item." This sets expectations honestly and prevents operators from thinking the system will protect them if they miss the step.
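For the record, the fail-closed variant recommended in the earlier review would look roughly like this (the `app.admin.password` property path is an assumption; match whatever key `application.yaml` actually uses):

```yaml
app:
  admin:
    # No colon-default: Spring fails startup when APP_ADMIN_PASSWORD is unset,
    # instead of silently falling back to admin123.
    password: ${APP_ADMIN_PASSWORD}
```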
2. `ALLOWED_PDF_HOSTS` SSRF guidance is good but incomplete on what "widening" means

The env table warns against widening the allowlist, and the bootstrap checklist tells operators to confirm it is locked down. But the document should be more explicit about the threat model, since the SSRF surface is subtle: if an operator sets `ALLOWED_PDF_HOSTS=*` because they're "just testing with a public URL," they've opened the internal Docker network to SSRF from any user who can trigger an OCR job. A one-liner explaining that real consequence would help.

Suggestions
3. MinIO service account guidance is correct but asymmetric

The §2 backend env table and the §3 security checklist both call for a dedicated MinIO service account. Good. But there's no guidance on what permissions the service account should have. "Service account" without a least-privilege scope could still result in an account with bucket-admin rights. Add: "The service account needs `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject` on the archive bucket only — no bucket creation, no policy management."

4. `APP_OCR_TRAINING_TOKEN` generation command in checklist — good, but mismatch

The checklist provides a secure token generation command. But the checklist item says `OCR_TRAINING_TOKEN`, while the env table names it `APP_OCR_TRAINING_TOKEN` (backend) and `TRAINING_TOKEN` (OCR service). This naming inconsistency is a security smell — an operator might set the wrong variable and leave the training endpoints unprotected. This echoes Tobias's concern; from a security perspective it's more serious because the training endpoint accepts file uploads.

5. `SPRING_PROFILES_ACTIVE=prod` prevents OpenAPI/Swagger UI exposure — this is correctly documented

The note "that exposes Swagger UI and `/v3/api-docs`" gives the right reason. No issue here.

6. `MAIL_STARTTLS_ENABLE` and `MAIL_SMTP_AUTH` — cleartext email risk not called out

The env table defaults `MAIL_STARTTLS_ENABLE=false` and `MAIL_SMTP_AUTH=false` in dev. Both should be `true` in production, and the table already says `YES (prod)`. But there's no warning that leaving these as `false` in production means credentials and password-reset links travel in cleartext. Add a note: "Leaving STARTTLS disabled in production sends password-reset links in cleartext over SMTP — always enable it in prod."

What's done well

- The `ALLOWED_PDF_HOSTS` SSRF risk is documented, explained, and in the security checklist. Many teams miss this.
- `python3 -c "import secrets; print(secrets.token_hex(32))"` is the right way to generate a secure random token.
- Blocking `/actuator/*` externally is explicitly documented with the reason (heapdump contains passwords/tokens). This is exactly the right threat model explanation.

🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
This is a documentation-only PR with no test code changes, no new API surface, and no behavior changes. My review focuses on testability of the claims made in the document and whether the test plan in the PR description is sufficient.
Observations
1. PR test plan is partially verifiable but missing key steps
The PR test plan includes:

- Render `docs/DEPLOYMENT.md` in Gitea — verify Mermaid and tables render ✓
- The env var `grep` ✓

The `grep` command for env var verification is a good automated check. But the plan omits:

- API path verification (`POST /api/admin/trigger-import`, `GET /api/admin/import-status`) — Felix raised this too; a dead link in the common tasks section is a real QA failure
- Script existence: `reset-db.sh`, `rebuild-frontend.sh`, `download-kraken-models.sh` should be confirmed to exist at `scripts/` in the repo

These aren't blocking the merge of a docs PR, but I'd recommend completing them as part of the test plan execution before marking this done.
2. The actuator/health smoke test command in §3 is correct and testable
This is exactly the kind of automated smoke test that should be in CI post-deploy. The fact that it's documented here is good — but it would be better if this command was also in a CI smoke test step (not in scope for this PR, but worth a follow-up issue).
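A minimal sketch of that CI step (the `"status":"UP"` grep assumes Actuator's default JSON shape; the endpoint is the one documented in §3):

```shell
# Exits non-zero when the backend is unreachable or not reporting UP.
curl -fsS http://localhost:8080/actuator/health | grep -q '"status":"UP"'
```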
3. Backup/restore procedure is documented but untested
§5 gives the manual `pg_dump` backup and `psql` restore commands.
These commands look correct. However, per Tobias's philosophy: a backup without a tested restore is a hope, not a backup. The PR description acknowledges no automation exists yet (Phase 5 of Production v1 milestone). I'd flag this as a follow-up: when Phase 5 ships, the backup procedure should include a CI-scheduled monthly restore test, not just documentation.
Not a blocker here — the doc is honest about the manual-only state.
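A sketch of what a verified restore could look like. The container, user, and database names are the values the reviews cite; the `documents` table is purely illustrative, so substitute a table you know holds data:

```shell
# Restore, then confirm rows actually landed.
docker exec -i archive-db psql -U archive_user -d family_archive_db < backup.sql
docker exec archive-db psql -U archive_user -d family_archive_db -t \
  -c "SELECT count(*) FROM documents;"   # hypothetical table; expect non-zero
```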
4. `reset-db.sh` caveat is correctly scoped

This matches the E2E test reset pattern and is correctly distinguished from a full migration re-run. Good.
Summary
The document's testable claims (commands, API paths, env var completeness) are mostly verifiable. The test plan execution should include the two missing items before closing #399. No test code is affected by this PR.
🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This PR adds an operational reference document — no UI, no frontend, no component changes. It's outside my primary domain. I'll give it a brief pass from a documentation usability and readability perspective, since docs are a product too.
Observations
1. Document structure and scannability are good
The emergency-first anchor at the top of the document is excellent UX for a runbook. Operators under stress need the shortest path to the most critical action. This is the right instinct.
The numbered ToC, section headers, and command-first format (commands before explanations in §6) are all consistent with how operators actually use runbooks: they scan, find the command, run it, read the context if it fails.
2. Mermaid diagram aids topology comprehension
The Mermaid graph gives a visual mental model before the operator reads the detail. For a non-specialist successor trying to understand the system, this is more valuable than a bullet list of services.
3. One readability concern: the security checklist items mix two audiences
The security checklist in §3 starts with operator actions ("Set `APP_ADMIN_PASSWORD`...") then mixes in dev-specific clarifications ("not in `.env.example`"). An operator bootstrapping production doesn't need the dev parenthetical — it may confuse them about which step applies to them.

This is cosmetic, not a blocker. Consider separating dev notes from production operator instructions with a consistent visual treatment (e.g., a "💡 Dev note:" callout vs. the main checklist items).

4. Known limitations table is well-structured
The table format with Reason and Reference columns is exactly the right way to document intentional design constraints. It prevents future operators from "fixing" things that are deliberate. The ADR back-references are the right mechanism.
Summary
From a documentation-as-product perspective: clear structure, correct audience framing, good emergency-first orientation. Cosmetic note on the security checklist mixing two audience concerns. No UI impact. Approved.
🏗️ Markus Keller — Application Architect
Verdict: ✅ Approved
This is the deployment reference the project has needed since the OCR service was extracted. It correctly describes the current topology, is honest about what doesn't exist yet (backup automation, monitoring infrastructure), and links to rather than duplicates the production Compose details. A few architecture-level observations.
Observations
1. Topology diagram accurately reflects the deliberate service extraction
The Mermaid graph shows the OCR service correctly: internal-only, no external port, accessed via presigned URL to MinIO. This matches ADR-001's justification (different memory profile, different deployment cadence) and doesn't suggest further extraction. Good.
2. "No multi-tenancy" known limitation is correctly framed as deliberate scope, not oversight
This is the right framing. Documenting this explicitly prevents a future contributor from adding per-user document isolation "as an improvement" when it would break the intentional design. The ADR back-reference pattern should be applied here too — when the family-only scope ADR is written (if it isn't already), link it from this row.
3. SSE note is architecturally accurate
This is the correct topology for SSE in this stack (SvelteKit SSR can't relay a persistent SSE connection efficiently). Documenting it prevents the "why doesn't the frontend proxy the SSE?" question.
4. Management port separation (8080 app, 8081 actuator/Prometheus) is documented
This is the correct architecture: separate ports, separate exposure rules. Documented here so future contributors don't collapse them.
5. The "no message broker" implicit decision is good
By not mentioning RabbitMQ, Kafka, or Redis anywhere in the topology, the doc implicitly confirms: async work goes through PostgreSQL + Spring's `@Async`. For current volume (family archive, single-family use), this is exactly right. The absence of infrastructure noise is itself an architectural statement.

One question (not a blocker)
The doc references `docs/adr/001-ocr-python-microservice.md` in the Known Limitations table. The PR description mentions this file but doesn't include it in the diff. Is ADR-001 an existing file, or is this a forward reference to a document that doesn't exist yet? If it doesn't exist, the link is a dead reference that will confuse operators.

Same question for `docs/infrastructure/production-compose.md` — the Tobias review raises this more operationally. From my perspective: if these linked documents don't exist yet, this DEPLOYMENT.md is incomplete as a navigation hub. The "Summarise and link" design decision (D1) only works if the targets of the links exist.

Suggested note: Add a small callout if any linked document is aspirational/upcoming: "Note: `docs/infrastructure/production-compose.md` is created as part of the Production v1 milestone Phase N."

Summary

Architecture is correctly described, scope decisions are explicitly documented, and the single-node OCR constraint has an ADR reference. The document's value as a topology reference is high. Pending verification that linked documents (`ADR-001`, `production-compose.md`) actually exist.

📋 Elicit — Requirements Engineer
Verdict: ⚠️ Approved with concerns
Reviewing `docs/DEPLOYMENT.md` through a requirements lens: does this document deliver on what issue #399 asked for, are the claims testable, and are there gaps that would cause an operator to fail at their goal?

Against Issue #399 Goal
The PR closes #399. Without reading the original issue text directly, the PR description defines scope as: "Day-1 checklist and operational reference." The document covers 7 stated sections and matches its own scope definition well. The design decisions (D1: summarise-and-link, D2: OCR sizing, D3: README link relaxed) are all documented as implemented. The acceptance criteria in the test plan are concrete and executable.
Coverage assessment: complete against stated scope.
Gaps and Ambiguities
1. "Audience" definition creates an implicit NFR that isn't enforced
The document states:
"Successor-X" implies someone who may have no prior context — they've inherited the system. This is a high-bar audience. But the document links to `docs/infrastructure/production-compose.md` and `docs/adr/001-ocr-python-microservice.md` without confirming these exist or are accessible. For Successor-X, a broken link on step 1 of bootstrap is a critical blocker.

Requirement gap: The implied NFR is "all linked documents SHALL be accessible at the time this document is published." This isn't verified in the test plan.
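That accessibility NFR is mechanically checkable. A minimal sketch, assuming the two link targets named in the PR description (adjust for the real repo layout):

```shell
# Sketch: verify that DEPLOYMENT.md's internal links point at files
# that actually exist. The paths passed in are assumptions from the PR.
check_links() {
  missing=0
  for f in "$@"; do
    if [ -f "$f" ]; then
      echo "OK   $f"
    else
      echo "MISS $f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"   # non-zero when any link target is absent
}

# Typical use from the repo root:
#   check_links docs/infrastructure/production-compose.md \
#               docs/adr/001-ocr-python-microservice.md
```

Wired into CI, this turns "links SHALL resolve" into an enforced check rather than a review-time hope.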
Action: Add to test plan: "Confirm all internal document links resolve to existing files: `infrastructure/production-compose.md`, `adr/001-ocr-python-microservice.md`."

2. "Ownership" claim is untestable as written
This is a maintenance requirement embedded in prose, which means it will be followed inconsistently. It has no enforcement mechanism. An EARS phrasing would make it testable:
As a requirements engineer, I'd recommend adding this as a PR checklist item in `.gitea/ISSUE_TEMPLATE/` or the contributing guide — not just in the document itself.

3. Phase references are used but the phase numbering isn't stable
The document references:
If phase numbering changes in the milestone (phases reordered, merged, or renamed), these references become stale. The document would be better served by linking to specific Gitea milestone issues or using feature names rather than phase numbers:
rather than "phase 5" which is an internal planning artifact, not a stable reference.
4. "Common operational tasks" §6 is missing a trigger for the OCR model update case
The document covers: reset DB, rebuild frontend, download models, mass import. But there's no guidance for: what to do when an OCR model is updated/replaced. The `download-kraken-models.sh` note says "Run once after a fresh clone or when models are updated" — but what's the operator workflow for the "models are updated" case? Is there a model version file? Does the service need to be restarted? Is there a zero-downtime path?

This may be out of scope for this PR, but it's a gap in the operational reference. I'd recommend a follow-up issue: "Document OCR model update procedure in DEPLOYMENT.md."
5. "Last reviewed" header creates a review obligation without a tracking mechanism
This is a good practice. However, without a Gitea issue template or milestone checklist item that says "review DEPLOYMENT.md," this date will drift. The date itself is useful; the mechanism to keep it current is absent.
What's done well
- The `grep` command for env var completeness is a practical, executable acceptance criterion — the kind of thing I'd write into a formal acceptance test.
- The "not in `.env.example`" annotations in the env table are gap-flagging done correctly inline.

Overall
The document achieves its stated purpose. The concerns are about maintenance durability (phase references, stale links, ownership enforcement) rather than correctness of the current content. I'd address the linked-documents verification gap before merge and file follow-up issues for the phase reference and OCR update procedure gaps.
🏗️ Markus Keller — Senior Application Architect
Verdict: ✅ Approved
Documentation PR — I reviewed for architecture accuracy and ADR traceability.
Topology correctness
The Mermaid diagram matches the actual system exactly:
- `/actuator/*` not exposed externally ✅

ADR back-references
The known limitations table links ADR-001 for the single-node OCR constraint. That's exactly the right pattern — a limitation without an ADR ref looks like an oversight; with one, it's a documented decision. The "deliberate scope decision" designations for no-multi-tenancy and no-multi-region are clean: they tell a future operator these were choices, not omissions.
Scope separation
Cross-referencing `docs/infrastructure/production-compose.md` rather than duplicating prod Compose configuration is correct. This document's job is to orient the operator; the infrastructure docs contain the full config. No duplication.
Calling out bind mounts (dev) vs named volumes (prod), unpinned MinIO tag (dev) vs pinned (prod), and `dev,e2e` Spring profile (dev) vs `prod` — these are the exact architecture-level differences a new operator needs to understand before touching production. Good artifact.

No architectural concerns. Ship it.
👨💻 Felix Brandt — Senior Fullstack Developer
Verdict: ✅ Approved
Pure documentation PR — no production code changed. Checking for developer usability.
Commands and copy-paste quality
All code blocks are formatted correctly and copy-pasteable. The bootstrap sequence steps are numbered and sequentially logical. The `curl http://localhost:8080/actuator/health` smoke test is specific — it tells you exactly what a healthy response looks like (`{"status":"UP"}`), which is more useful than just "check the logs."

Caveats are accurate
- `reset-db.sh` caveat about hardcoded `DB_USER` and `DB_NAME` — correct, the script does hardcode these.
- `rebuild-frontend.sh` caveat about the volume name depending on the project directory name — correct, the script uses `familienarchiv_frontend_node_modules`.

These are the kind of foot-guns that waste an hour of developer time if undocumented.
Mass import workflow
The import trigger steps (place file → `POST /api/admin/trigger-import` → poll `GET /api/admin/import-status`) match the actual `AdminController.java` endpoints at `/api/admin`. Verified.

No concerns from my side.
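The trigger-and-poll flow can be sketched in a few lines of shell. The endpoint paths are the ones verified above; the `RUNNING` status marker and the base URL are assumptions — check the actual `AdminController.java` response shape before relying on this:

```shell
# Trigger the mass import, then poll until the status payload no longer
# reports a running import. RUNNING is an assumed marker in the payload.
BASE=${BASE:-http://localhost:8080}

import_running() {
  case "$1" in
    *RUNNING*) return 0 ;;   # still in flight
    *)         return 1 ;;   # finished, failed, or idle
  esac
}

trigger_and_wait() {
  curl -fsS -X POST "$BASE/api/admin/trigger-import" || return 1
  while import_running "$(curl -fsS "$BASE/api/admin/import-status")"; do
    sleep 5
  done
}
```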
🔧 Tobias Wendt — DevOps & Platform Engineer
Verdict: ⚠️ Approved with concerns
This is the doc I'd want to have on Day 1 of running this stack. Most of it is solid. A few non-blocking notes.
What's right
- `docker-compose.prod.yml` now correctly noted as not committed, with the guide reference
- `SPRING_PROFILES_ACTIVE=prod` reminder prevents Swagger UI exposure on prod

Suggestions (non-blocking)
1. Smoke test port clarification
Step 5 in the bootstrap sequence:
This works when run directly on the server shell (bypasses Caddy), but an operator who reads this from their laptop may be confused about why they're hitting `:8080` directly instead of going through the proxy. A one-line comment helps.

2. `pg_dump` command uses shell variables that aren't exported by default

A freshly SSH'd shell won't have `POSTGRES_USER` and `POSTGRES_DB` exported from `.env`. In practice the operator will need to `source .env` first, or use the literal values. Consider showing the literal defaults with a substitution hint.

Neither of these is a blocker — the doc is accurate. Just friction for a first-time operator.
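The substitution hint could look like this (a sketch: the `archive-db` container name comes from `docker-compose.yml`, while the literal `archive` fallbacks are illustrative placeholders, not the project's real values):

```shell
# Build the backup command, falling back to literal placeholders when
# the compose variables aren't exported in the current shell.
# NOTE: the "archive" fallbacks are assumptions; check your .env.
backup_cmd() {
  echo "docker exec archive-db pg_dump -U ${POSTGRES_USER:-archive} ${POSTGRES_DB:-archive}"
}

# On the server, export the .env values first so the real names win:
#   set -a; . ./.env; set +a
#   $(backup_cmd) > backup-$(date +%F).sql
```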
🔐 Nora "NullX" Steiner — Application Security Engineer
Verdict: ✅ Approved
The security-critical naming inconsistency from the first review round has been correctly fixed in this latest commit.
Checklist coverage
- `APP_OCR_TRAINING_TOKEN` (backend) and `TRAINING_TOKEN` (OCR service) now both listed, with the critical note that both must be the same value
- `ALLOWED_PDF_HOSTS` warning is explicit — "Do not widen to `*`"
- Default credentials `admin123`, `change-me` for Postgres and MinIO are flagged
- `SPRING_PROFILES_ACTIVE=prod` reminder prevents Swagger UI and `/v3/api-docs` exposure in prod
- No external exposure of `/actuator/*` — this is the correct defense pattern

Non-blocking observation
The `APP_ADMIN_USERNAME` env var is marked "⚠ not in `.env.example`" in the env table. This means a first-time operator may not know to set it, leaving the admin account at the default `admin` username. A predictable admin username combined with a weak password (before the checklist is run) is a weak point. This is a pre-existing gap outside the scope of this documentation PR — but worth tracking as a separate issue: add `APP_ADMIN_USERNAME=changeme` as a placeholder line in `.env.example` so operators are prompted to change it at env setup time.

No blockers on this PR. The security checklist is now accurate and complete.
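The suggested fix could be as small as two placeholder lines in `.env.example`. The values below are illustrative and must be changed at setup; `APP_ADMIN_PASSWORD` is the companion var the env table already flags as missing:

```shell
# Proposed .env.example additions: placeholders only, never real values
APP_ADMIN_USERNAME=changeme
APP_ADMIN_PASSWORD=changeme
```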
🧪 Sara Holt — Senior QA Engineer
Verdict: ✅ Approved
Documentation PR — test coverage doesn't apply directly. Reviewing the document for testability of its own claims.
Test plan in PR description
The three test plan items are appropriate for a documentation PR:
The grep command (`grep -E '\$\{[A-Z_]+\}' docker-compose.yml backend/src/main/resources/application*.yaml`) is a concrete, reproducible verification step. This is good documentation quality practice — the PR description tells you how to verify it, not just that it was verified.

Smoke test quality
The bootstrap smoke test (`curl http://localhost:8080/actuator/health`, expected: `{"status":"UP"}`) is specific and verifiable. This is better than "check the logs and see if it looks healthy."

Backup procedure
The backup and restore commands in §5 are concrete enough to test — they would fail predictably if the container name or database name was wrong. The restore command correctly shows input redirection rather than a copy-paste-difficult alternative.
No QA concerns. The doc gives operators concrete, verifiable verification steps rather than hand-wavy instructions.
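For completeness, the smoke test also wraps neatly into a scriptable pass/fail check. A sketch: the endpoint and the `{"status":"UP"}` body are the ones the doc states; the wrapper itself is illustrative:

```shell
# Scriptable version of the bootstrap smoke test: fetch the actuator
# health JSON and assert it reports UP.
is_up() {
  printf '%s' "$1" | grep -q '"status":"UP"'
}

# Run on the server shell; :8080 bypasses the reverse proxy.
health_json=$(curl -fsS http://localhost:8080/actuator/health) || health_json=""
if is_up "$health_json"; then
  echo "backend healthy"
else
  echo "backend NOT healthy (got: ${health_json:-no response})" >&2
fi
```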
📋 Elicit — Requirements Engineer
Verdict: ✅ Approved
Reviewing for documentation requirements quality: audience clarity, scope definition, completeness, and traceability.
Audience and ownership
Completeness governance
The statement in §2 — "Any var found in docker-compose.yml or application*.yaml that is not in this table is a blocking review comment" — creates a concrete, auditable completeness contract. This is good requirements practice: the document defines its own verification criterion.
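That self-defined criterion is executable today: it is the grep pipeline from the PR description, wrapped here as a function for convenience (the wrapper is illustrative; the regexes are the PR's own):

```shell
# List every ${VAR} placeholder in the given config files, deduplicated,
# for manual diffing against the env table in DEPLOYMENT.md.
list_env_vars() {
  grep -E '\$\{[A-Z_]+\}' "$@" | grep -oE '\$\{[A-Z_]+[^}:]*\}' | sort -u
}

# Typical use:
#   list_env_vars docker-compose.yml backend/src/main/resources/application*.yaml
```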
State vs planned distinction
The backup section correctly distinguishes current state ("no automated backup") from planned state ("Phase 5 of Production v1 milestone"). This prevents an operator from assuming automated backups exist when they don't. ✅
Traceability
The known limitations table links ADR-001 for single-node OCR — limitation → architectural decision record. ✅
Suggestion (non-blocking)
The Phase 5 and Phase 7 references ("planned in the Production v1 milestone phase 5", "Phase 7 adds Prometheus + Loki + Grafana") assume the reader knows the milestone structure. If those phases are completed in the future, a future operator has no way to know from this document whether those items are still "planned" or have shipped. Consider linking to the Gitea milestone or epic so the reader can check current status, or note that these references should be updated when the phases complete.
🎨 Leonie Voss — UX Designer & Accessibility Strategist
Verdict: ✅ Approved
This is a Markdown documentation file — no Svelte components, no UI, no accessibility implementation to review. Checking document readability as a UX concern.
Document UX
- The `> ⚠️` blockquote convention for warnings is consistent with the project's documentation style ✅

The document is well-structured and scannable. The jump link at the top and the numbered table of contents both support non-linear navigation, which is exactly what an operator under pressure needs.
No UI/UX concerns on this documentation PR.