docs(legibility): write docs/DEPLOYMENT.md (production runtime + env vars) #399
Context
Part of Epic #394 — Documentation. This is DOC-5: the production-runtime reference. Successor-X reads this when something breaks in production or when they need to redeploy. Anja reads this to understand the operational shape of the system.
Per the Legibility Rubric, this addresses C9.1, C9.2, C9.4 (all Major or Minor).
Note: the existing Production v1 milestone (#1) covers actual deployment work in 7 phases. This issue is about documenting what exists and is planned, not doing the deployment itself.
Required content
A single docs/DEPLOYMENT.md containing:
1. Deployment topology
What runs where in production:
ASCII or Mermaid diagram preferred.
2. Environment variables (per service)
For each container, a table:
Cover at least: DB credentials, MinIO credentials, OCR service URL, Spring profile, frontend public API URL, mailer config.
3. Bootstrap from scratch
The exact sequence to bring a fresh production environment up. References:
docker-compose.yml (dev) and the eventual docker-compose.prod.yml (phase-3 of Production v1)
create-buckets helper container)
4. Logs + observability
Where to look when things break:
docker compose logs <service>
5. Backup + recovery
What gets backed up, how, how often, and how to restore. References the phase-5 work in Production v1.
6. Common operational tasks
scripts/reset-db.sh)
scripts/rebuild-frontend.sh)
scripts/download-kraken-models.sh)
7. Known limitations
Acceptance criteria
docs/DEPLOYMENT.md exists with all 7 sections
application*.yml is listed
README.md (DOC-1) and from each phase-N issue in the Production v1 milestone
Dependency
Soft dependency on AUDIT-5 (#392) for findings about repo hygiene + infra/.
Definition of Done
docs/DEPLOYMENT.md committed on main. Closing comment links to it and notes any deployment phase that this doc reveals as still-undocumented (those become follow-up issues in Production v1).
🏗️ Markus Keller, Senior Application Architect
Observations
docs/infrastructure/production-compose.md (full docker-compose.prod.yml, Caddyfile, cost breakdown) and docs/infrastructure/s3-migration.md. This material must be referenced or synthesised, not duplicated. Two canonical sources of truth for the same topology will diverge.
ARCHITECTURE.md for the single-node OCR limitation (ADR-001). That file doesn't exist yet as a standalone doc, but docs/adr/001-ocr-python-microservice.md does. The link in Section 7 should resolve to the ADR, not to a non-existent ARCHITECTURE.md.
docker-compose.prod.yml bootstrap instructions, but this file doesn't yet exist as a committed file; it is only documented in docs/infrastructure/production-compose.md. Section 3 ("Bootstrap from scratch") should clarify whether readers are following the documented overlay approach or a not-yet-committed prod compose file.
Recommendations
DEPLOYMENT.md with a one-sentence architecture map, then link to docs/infrastructure/production-compose.md for the full Compose file and docs/adr/ for each design decision. Don't repeat what's already in those files.
docs/adr/001-ocr-python-microservice.md so links resolve from docs/DEPLOYMENT.md.
docker-compose.yml sets mem_limit: 12g on the OCR service. Section 1 (topology) should note this, because it is a direct constraint on VPS sizing (CX32 has 8 GB total RAM, which means OCR cannot actually be mem-limited to 12 GB there; that's a prod-sizing gap worth calling out explicitly).
👨💻 Felix Brandt, Senior Fullstack Developer
Observations
scripts/rebuild-frontend.sh in Section 6. That script exists at /scripts/rebuild-frontend.sh, which is good. But scripts/reset-db.sh hardcodes DB_USER=archive_user and DB_NAME=family_archive_db instead of reading from .env. If someone customises those in .env, the script silently operates on the wrong values. Worth noting in the doc as a "gotcha" rather than leaving it to discovery.
.env.example (the canonical dev list), docker-compose.yml (the wiring), and application.yaml (backend resolution). There's also APP_ADMIN_USERNAME/APP_ADMIN_PASSWORD in application.yaml, which don't appear in .env.example at all. This is a gap the doc should surface.
application.yaml has app.admin.username: ${APP_ADMIN_USERNAME:admin} and app.admin.password: ${APP_ADMIN_PASSWORD:admin123} with insecure defaults. The bootstrap procedure must include explicitly overriding these, otherwise new deployments will ship with admin/admin123.
SPRING_PROFILES_ACTIVE: dev,e2e in docker-compose.yml is developer-facing only. Section 3 should note that production uses the prod profile (per production-compose.md). This is documented elsewhere but easy to miss.
Recommendations
.env.example (already well-documented) and application.yaml rather than writing it from scratch. Any var in application.yaml not in .env.example is a documentation gap; APP_ADMIN_USERNAME/APP_ADMIN_PASSWORD is the main one I found.
POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, APP_ADMIN_PASSWORD. These are the three that ship with change-me/admin123.
scripts/reset-db.sh correctly warns before truncating. Document its scope limitation: it truncates data but doesn't drop the schema or re-run Flyway; use it for E2E resets, not full reinstalls.
🛠️ Tobias Wendt, DevOps & Platform Engineer
Observations
docker-compose.yml has two production-unfriendly patterns the doc must flag explicitly: minio/minio:latest (unpinned tag) and bind mounts for both PostgreSQL (./data/postgres) and MinIO (./data/minio). The production overlay in docs/infrastructure/production-compose.md correctly uses named volumes. The doc should make this contrast explicit so an operator doesn't accidentally ship the dev compose to prod.
mem_limit: 12g in the dev compose exceeds the recommended CX32's 8 GB total RAM. This is fine in dev (your home NAS likely has more), but deploying to a CX32 without adjusting this limit will result in the container failing to start or being killed by OOM. Section 1 should flag the OCR service memory requirements explicitly.
docker-compose.yml uses wget to hit /actuator/health. The actuator endpoint isn't exposed beyond the internal Docker network in production (per the overlay), but the doc should mention that Prometheus scrapes port 8081 (management port) internally, not 8080.
create-buckets container (uses minio/mc) runs without a pinned image tag and isn't in the prod overlay (correctly excluded via profiles: ["dev"]). Section 3 should document what replaces it in production: per s3-migration.md, production uses Hetzner Object Storage, where the bucket is pre-created manually in the Hetzner console.
scripts/rebuild-frontend.sh by name but the script assumes the volume is named familienarchiv_frontend_node_modules (hardcoded on line 16). If the project directory isn't named familienarchiv, this silently fails. Worth a doc note.
Recommendations
create-buckets helper in prod.
S3_ACCESS_KEY/S3_SECRET_KEY should be a dedicated MinIO service account in dev and a scoped Hetzner credential in prod, not the root credentials. The current .env.example uses MINIO_ROOT_USER/MINIO_ROOT_PASSWORD wired through to the app, which is the root-credential antipattern (see docs/infrastructure/production-compose.md for the correct prod approach).
docker compose logs --follow --tail=100 backend as the first-response command. Also note that the backend log is at /app/logs/ inside the container, which is useful for docker exec forensics.
🔒 Nora "NullX" Steiner, Application Security Engineer
Observations
application.yaml contains app.admin.password: ${APP_ADMIN_PASSWORD:admin123}. The fallback default admin123 ships to any deployment that doesn't set APP_ADMIN_PASSWORD in .env. The doc must treat this as a security-critical bootstrap step, not just a configuration note. A new operator who skips Section 3 will have an admin/admin123 account in production.
.env.example comment for OCR_TRAINING_TOKEN correctly says "Must not be empty in production." This deserves its own row in the Section 2 env vars table with Required? = YES (prod) and Sensitive? = YES, because it controls model training endpoints that accept file uploads.
ALLOWED_PDF_HOSTS env var on the OCR service (default: minio,localhost,127.0.0.1) is a critical SSRF control. It is not mentioned anywhere in .env.example or docker-compose.yml. The Section 2 env vars table must include it with a note explaining why it exists; an operator who doesn't understand it might widen it to * to unblock a bug.
docs/infrastructure/production-compose.md blocks /actuator/*. This is important and should be referenced in the security context of Section 4 (observability).
BLLA_MODEL_PATH env var in the OCR service (os.environ.get("BLLA_MODEL_PATH", "/app/models/blla.mlmodel")) is not documented anywhere. It controls which baseline layout analysis model Kraken uses, so it is worth including in the env vars table.
Recommendations
APP_ADMIN_PASSWORD, (2) set a strong OCR_TRAINING_TOKEN, (3) rotate POSTGRES_PASSWORD and MINIO_ROOT_PASSWORD from .env.example defaults, (4) confirm ALLOWED_PDF_HOSTS is locked to your MinIO/S3 hostname only.
POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, S3_SECRET_KEY, APP_ADMIN_PASSWORD, OCR_TRAINING_TOKEN, and MAIL_PASSWORD as Sensitive? = YES in the table. This signals to operators to use secrets injection, not hardcode them in env files.
SPRING_PROFILES_ACTIVE: dev,e2e in dev compose enables OpenAPI (/v3/api-docs) and Swagger UI. Section 3 should confirm this is replaced by the prod profile in production, which disables both. An accidental dev profile in production exposes the full API schema publicly.
🧪 Sara Holt, QA Engineer & Test Strategist
Observations
application*.yml is listed." This is testable. I'd suggest a lightweight CI check: grep -E '\$\{[A-Z_]+\}' docker-compose.yml backend/src/main/resources/application*.yaml | grep -oE '\$\{[A-Z_]+[^}:]*\}' | sort -u produces the full list of referenced vars. Running this as part of a PR that touches compose or yaml files catches omissions. The doc should note this pattern so it can be used to audit completeness after the initial write.
README.md (DOC-1)". README.md doesn't exist at the repo root, only in frontend/ and in .pytest_cache/. The acceptance criteria require a link from it, which means either DOC-1 must land first, or this issue's criteria should say "link from DOC-1 when merged." Marking this criterion as unverifiable until DOC-1 exists.
Recommendations
docker compose up, run curl http://localhost:8080/actuator/health and confirm {"status":"UP"} before proceeding. This is the smoke test that confirms the stack came up correctly.
🎨 Leonie Voss, UX Designer & Accessibility Strategist
Observations
reset-db.sh, rebuild-frontend.sh) are well-structured. The doc should show the exact invocation command, not a description of what the script does.
Recommendations
📋 Elicit, Requirements Engineer
Observations
main, closing comment links to it). This is ready to implement, with no major requirements ambiguity.
application*.yml is listed." This is a completeness claim that a reviewer cannot easily verify by reading the PR. The verifiable test is a diff command (see Sara's comment for the grep approach). Without it, this criterion is assessed by trust, not evidence.
README.md (DOC-1)" but no README.md exists at the repo root. If DOC-1 isn't merged first, this acceptance criterion is unachievable. The issue should clarify: either (a) create a minimal docs/README.md stub as part of this issue, or (b) change the AC to "linked from DOC-1 when both are merged."
Recommendations
docker-compose.yml or application*.yaml that is not in the table is a blocking review comment." This makes the completeness criterion actionable for the reviewer.
README.md when DOC-1 is merged, tracked in closing comment."
🗳️ Decision Queue: Action Required
3 decisions need your input before implementation starts.
Architecture
Duplicate vs. reference docs/infrastructure/production-compose.md: The topology and docker-compose.prod.yml are already documented in detail in docs/infrastructure/production-compose.md. Section 1 and Section 3 of DEPLOYMENT.md will either (a) duplicate that content (two sources of truth that will diverge) or (b) summarise and link to it (leaner, stays in sync automatically). Option (b) is strongly preferred architecturally, but changes the scope of this issue: DEPLOYMENT.md becomes a nav/summary doc, not a self-contained reference. (Raised by: Markus)
OCR mem_limit: 12g vs CX32 target VPS (8 GB RAM): The dev compose sets mem_limit: 12g on the OCR service. The recommended production VPS is CX32 (8 GB RAM total). These are incompatible: a CX32 cannot honour a 12 GB mem limit. This is a real sizing gap. Options: (a) lower mem_limit to e.g. 6 GB in prod overlay and accept reduced batch sizes, (b) recommend CX42 (16 GB) as the production target for deployments with OCR, (c) make OCR optional in the prod compose with a note. The DEPLOYMENT.md should document whichever is chosen, so this decision needs to be made first. (Raised by: Tobias, Markus)
Requirements
README.md (DOC-1), but no root-level README.md exists yet. Options: (a) resolve by adding a stub README.md as part of this issue's scope (minimal: title + link to docs/DEPLOYMENT.md), (b) drop the README link from this issue's AC and add it as a task in the DOC-1 issue, (c) create README.md as a separate pre-req commit in the same PR. Without a decision, the PR will have one unverifiable acceptance criterion. (Raised by: Sara, Elicit)
✅ Decision Queue: Resolved
The 3 decisions raised in #399#issuecomment-6340:
1. Duplicate vs reference docs/infrastructure/production-compose.md → summarise and link (Option B)
This is the same answer as epic-level D1: DOC-5 is an entry-point Day-1 checklist, not a duplicate of docs/infrastructure/. Two canonical sources for the same topology will diverge; Markus's concern is right.
DOC-5 owns:
docs/infrastructure/production-compose.md owns: the full Compose file, Caddyfile, and step-by-step VPS provisioning. DOC-5 links to it; does not copy it.
2. OCR mem_limit: 12g vs CX32 (8 GB) → document both options, recommend CX42 for OCR-enabled prod (Option B), with a note that this is the operator's call
The 12g value in dev compose is sized for the home NAS, not a CX32. The doc must not gloss over this; Tobias's flag is real. Recommended treatment in DOC-5 Section 1:
This documents the tradeoff so the operator chooses with eyes open. The doc is descriptive; the actual sizing decision is for the project owner to make. Flag it on the issue as a follow-up if a definitive recommendation is needed before DOC-5 ships.
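The sizing mismatch behind this decision is simple arithmetic, and a small check can make it explicit. The following is a minimal sketch, not part of the repository: the helper names `parse_mem_limit` and `fits_on_host` are hypothetical, the RAM figures (CX32 = 8 GB, CX42 = 16 GB) are taken from this thread, and "GB" is treated as GiB for simplicity. It only compares the configured ceiling with host RAM; it says nothing about how much memory the OCR service actually needs.

```python
import re

# Byte multipliers for Compose-style size suffixes (b, k/kb, m/mb, g/gb).
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_mem_limit(value: str) -> int:
    """Parse a Compose-style size such as '12g' or '512m' into bytes."""
    match = re.fullmatch(r"(\d+)\s*([bkmg])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def fits_on_host(mem_limit: str, host_ram_gib: int) -> bool:
    """Return True if the configured limit does not exceed total host RAM."""
    return parse_mem_limit(mem_limit) <= host_ram_gib * 1024**3


if __name__ == "__main__":
    # RAM figures as quoted in the thread; adjust for the actual target host.
    for name, ram_gib in (("CX32", 8), ("CX42", 16)):
        verdict = "fits" if fits_on_host("12g", ram_gib) else "exceeds host RAM"
        print(f"mem_limit 12g on {name} ({ram_gib} GB): {verdict}")
```

A check like this could run against whichever value ends up in the prod overlay, so the chosen option (lower limit or larger VPS) stays consistent with the documented target host.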
3. README dependency → relax AC; track link as follow-up when DOC-1 lands
Same shape as DOC-3's resolution. DOC-5 is independently writable. Resolution:
docs/DEPLOYMENT.md. Does not require an existing README.md link to merge.
README.md" is recorded but verifiable only after DOC-1 merges. Track in closing comment as a follow-up checkbox.
The "linked from each phase-N issue in the Production v1 milestone" criterion (per Sara): treat as a closing-comment checklist, with explicit follow-up tickets for each phase issue that needs the back-link.
📌 Additional persona feedback to fold into implementation
docs/adr/001-ocr-python-microservice.md (real path). Add ADRs for "no multi-tenancy" and "no multi-region" if they're deliberate constraints (which they are, per family-only project frame).
.env.example + application.yaml, not from scratch. Any var in application.yaml not in .env.example is a doc gap. Confirmed gap: APP_ADMIN_USERNAME/APP_ADMIN_PASSWORD ship with admin/admin123 defaults. Section 3 must include a "change these before first boot" callout listing POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, APP_ADMIN_PASSWORD.
scripts/reset-db.sh scope: truncates data, doesn't drop schema or re-run Flyway. Hardcoded DB_USER=archive_user, DB_NAME=family_archive_db; note as "gotcha" if those are customised in .env.
minio/minio:latest (dev, unpinned) vs pinned tag in prod overlay; bind mounts (./data/postgres, ./data/minio) in dev vs named volumes in prod.
create-buckets MinIO MC helper).
production-compose.md's Caddy block on /actuator/*).
scripts/rebuild-frontend.sh assumes volume name familienarchiv_frontend_node_modules; flag if directory is renamed.
APP_ADMIN_PASSWORD from admin123.
OCR_TRAINING_TOKEN (must not be empty in prod).
POSTGRES_PASSWORD and MINIO_ROOT_PASSWORD from .env.example defaults.
ALLOWED_PDF_HOSTS (default: minio,localhost,127.0.0.1) is locked to your MinIO/S3 hostname only; widening to * is an SSRF.
SPRING_PROFILES_ACTIVE=prod (not dev,e2e); dev exposes Swagger UI and /v3/api-docs.
Sensitive? = YES on POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, S3_SECRET_KEY, APP_ADMIN_PASSWORD, OCR_TRAINING_TOKEN, MAIL_PASSWORD. Recommend secrets injection (Docker secrets / Kubernetes secrets), not env files.
ALLOWED_PDF_HOSTS and BLLA_MODEL_PATH to env vars table (currently undocumented).
pg_dump procedure") and Planned ("phase-5 of Production v1 milestone, link to issue"). Don't imply backups exist when they don't.
grep -E '\$\{[A-Z_]+\}' docker-compose.yml backend/src/main/resources/application*.yaml | grep -oE '\$\{[A-Z_]+[^}:]*\}' | sort -u produces the canonical env-var list. Run as part of any PR touching compose/yaml. Document this pattern.
curl http://localhost:8080/actuator/health should return {"status":"UP"} before continuing.
docker-compose.yml or application*.yaml that is not in the table is a blocking review comment."
Status: Ready for implementation. The OCR mem_limit / VPS sizing item flagged above (D2) is the only thing the project owner may want to nail down before DOC-5 ships.
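The env-var completeness audit described in the feedback above (the grep pipeline) can also be sketched in Python, which makes the "referenced but not in the table" comparison explicit. This is an illustrative sketch only: `referenced_vars` and `missing_from_table` are hypothetical names, and the sample config strings are made up for the demo. One deliberate difference from the grep pattern as quoted: the regex here captures the name immediately after `${`, so references with a default such as `${APP_ADMIN_PASSWORD:admin123}` are included, whereas the quoted grep pattern requires `}` directly after the name and so appears to skip them.

```python
import re

# Capture the variable name immediately after "${", whether or not a
# default (":value") follows.
_VAR_REF = re.compile(r"\$\{([A-Z][A-Z0-9_]*)")


def referenced_vars(text: str) -> set[str]:
    """Return every upper-case ${VAR} name referenced in a config text."""
    return set(_VAR_REF.findall(text))


def missing_from_table(config_texts, documented) -> list[str]:
    """Return referenced variables that the env vars table does not list."""
    referenced = set()
    for text in config_texts:
        referenced |= referenced_vars(text)
    return sorted(referenced - set(documented))


if __name__ == "__main__":
    # Made-up snippets standing in for docker-compose.yml / application.yaml.
    compose = "environment:\n  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}\n"
    app_yaml = "app:\n  admin:\n    password: ${APP_ADMIN_PASSWORD:admin123}\n"
    documented = {"POSTGRES_PASSWORD"}
    for name in missing_from_table([compose, app_yaml], documented):
        print(f"not in env vars table: {name}")
```

In a real check the texts would be read from the compose and application YAML files, and the documented set would be parsed from the table in docs/DEPLOYMENT.md; a non-empty result would then be the blocking review comment the thread proposes.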
DOC-5 implemented — PR #443
docs/DEPLOYMENT.md committed with all 7 sections. Key points:
Undocumented gaps surfaced:
APP_ADMIN_USERNAME/APP_ADMIN_PASSWORD ship with admin/admin123 defaults and are not in .env.example; both are now in the env vars table with Sensitive? = YES
ALLOWED_PDF_HOSTS (SSRF guard) and BLLA_MODEL_PATH (Kraken model) were absent from compose and .env.example; both added to table
Security checklist (8 items), must complete before first boot: APP_ADMIN_PASSWORD, OCR_TRAINING_TOKEN, POSTGRES_PASSWORD, MINIO_ROOT_PASSWORD, ALLOWED_PDF_HOSTS, SPRING_PROFILES_ACTIVE=prod, dedicated S3 service account
Deployment phase follow-ups (still undocumented in Production v1):
pg_dump is the only recovery option
README link: tracked here; will be filled in when PR #440 (DOC-1) merges.
PR: http://heim-nas:3005/marcel/familienarchiv/pulls/443
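The pre-boot security checklist from the closing comment lends itself to an automated check run against the environment before first start. The sketch below is an assumption-laden illustration, not code from PR #443: the function name `preboot_findings` is hypothetical, and the mapping of shipped placeholder values (`admin123`, `change-me`) to variables is inferred from the review comments in this thread. It covers only the checklist items that can be read from environment variables; the "dedicated S3 service account" item is not checked.

```python
# Placeholder values assumed from the thread ("change-me" / "admin123").
SHIPPED_DEFAULTS = {
    "APP_ADMIN_PASSWORD": "admin123",
    "POSTGRES_PASSWORD": "change-me",
    "MINIO_ROOT_PASSWORD": "change-me",
}


def preboot_findings(env: dict) -> list:
    """Return human-readable findings; an empty list means all checks passed."""
    findings = []

    # Unset counts as a finding too: per the thread, an unset
    # APP_ADMIN_PASSWORD falls back to the insecure default.
    for name, default in SHIPPED_DEFAULTS.items():
        if env.get(name, "") in ("", default):
            findings.append(f"{name} is unset or still the shipped default")

    if not env.get("OCR_TRAINING_TOKEN"):
        findings.append("OCR_TRAINING_TOKEN must not be empty in production")

    # A wildcard entry would disable the SSRF allowlist entirely.
    hosts = [h.strip() for h in env.get("ALLOWED_PDF_HOSTS", "").split(",")]
    if "*" in hosts:
        findings.append("ALLOWED_PDF_HOSTS must not contain '*'")

    profiles = {p.strip() for p in env.get("SPRING_PROFILES_ACTIVE", "").split(",")}
    if "prod" not in profiles or profiles & {"dev", "e2e"}:
        findings.append("SPRING_PROFILES_ACTIVE should be prod, without dev or e2e")

    return findings


if __name__ == "__main__":
    import os

    for finding in preboot_findings(dict(os.environ)):
        print(f"FAIL: {finding}")
```

Running it with the production environment loaded prints one line per open checklist item; an operator could wire the same function into a deploy script and abort when the list is non-empty.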