ADR-024 records the deliberate cross-domain link (obs-grafana joins
archiv-net to query archive-db via the SELECT-only grafana_reader role),
the rejected alternatives (Prometheus exporter, read replica, versioned
migration + flyway repair, hardcoded fallback), and the consequences —
specifically that a Grafana compromise gains TCP reach to archive-db
but is bounded by the role's least-privilege grants.
The DEPLOYMENT.md runbook documents the rotation procedure that
R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD,
restart backend (Flyway re-applies because the resolved checksum
changed), restart obs-grafana (datasource picks up the new env var).
Also calls out the fail-closed startup behavior so operators who hit
IllegalStateException know it is deliberate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original 4 tests asserted SELECT existed on the three granted tables
and was absent on app_users. That left two gaps a future migration could
slip through silently:
- INSERT/UPDATE/DELETE on the granted tables — if someone GRANTed write
access on, say, documents to grafana_reader, the SELECT positives stay
green and the boundary is breached invisibly.
- Other PII / sensitive tables — the single app_users negative checks
one table; a wildcard "GRANT SELECT ON ALL TABLES IN SCHEMA public"
would still leave it green by accident if app_users wasn't the only
sensitive table.
Switch to a hasPrivilege(table, privilege) helper, add three write-deny
tests (INSERT/UPDATE/DELETE on each granted table), and replace the
single app_users negative with a parameterized sweep over app_users,
user_groups, persons, notifications, document_comments,
document_annotations, geschichten. New sensitive tables get added to
that list as they appear.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
V68 used to set the role's password in a versioned migration, which Flyway
applies exactly once per database. Rotating GRAFANA_DB_PASSWORD therefore
had no effect on the DB role — operators would need a manual ALTER ROLE
or a `flyway repair` that nobody documented. The shape conflated two
lifecycles: schema migration (one-shot, immutable) and credential
provisioning (rotatable).
Split into:
- V68 (versioned, immutable): creates the role and applies SELECT grants
on audit_log, documents, transcription_blocks.
- R__grafana_reader_password.sql (repeatable): issues ALTER ROLE … PASSWORD
with the placeholder. Flyway computes the checksum on the resolved
content, so any change to GRAFANA_DB_PASSWORD changes the checksum and
re-applies the migration on the next boot. Rotation becomes "bump env
var + restart backend".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
FlywayConfig used to fall back to a hardcoded "changeme-grafana-db-password"
string when the env var was missing. That published a known credential for
the grafana_reader role (SELECT on audit_log, documents, transcription_blocks)
into git history and made silent fail-open the default for any deploy that
forgot the secret. Now resolution goes through Spring's Environment and
throws IllegalStateException at startup when the value is unset or blank —
same shape as UserDataInitializer's refusal to seed default admin creds.
Tests inject via the global GRAFANA_DB_PASSWORD entry in test-resources
application.properties so existing Flyway-loading test classes keep
booting without per-class TestPropertySource boilerplate. FlywayConfigTest
covers both branches against MockEnvironment without a Spring context.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the new GRAFANA_DB_PASSWORD secret through the deploy pipeline:
- docker-compose.prod.yml: backend env now passes GRAFANA_DB_PASSWORD
through so Flyway V68 can resolve the ${grafanaDbPassword} placeholder
in production and staging (it already worked in local dev via
docker-compose.yml).
- release.yml + nightly.yml: declare GRAFANA_DB_PASSWORD as a required
Gitea secret, write it into .env.production / .env.staging (consumed
by archive-backend), and into /opt/familienarchiv/obs-secrets.env
(consumed by obs-grafana's PostgreSQL datasource).
Operator action before the next deploy: add a GRAFANA_DB_PASSWORD value
to the Gitea repo secrets (openssl rand -hex 32).
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the new read-only connection from Grafana to archive-db (via the
grafana_reader role) introduced by the PO Overview dashboard.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds GRAFANA_DB_PASSWORD to the observability-stack env-var table, the
Gitea secrets table, and the obs-secrets.env reference, so operators see
the variable wherever they look for related secrets.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Provisioned dashboard for the product owner's weekly check-in: system
health (Prometheus + Loki), user activity (PostgreSQL audit_log), archive
progress (PostgreSQL transcription_blocks + audit_log), and OCR quality
(Prometheus ocr-service metrics). Default range 7d, manual refresh,
thresholds per the issue spec.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
.env.example: declare GRAFANA_DB_PASSWORD with an openssl rand -hex 32 hint
so a missing value fails loudly (NFR-OPS-02). obs.env: add a comment
explaining that the real value comes from CI's obs-secrets.env, matching
the pattern used for other secrets in that file.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a read-only datasource pointing at archive-db using the grafana_reader
role (provisioned by Flyway V68). The password is interpolated from the
GRAFANA_DB_PASSWORD env var passed to obs-grafana, and the connection is
locked to editable: false so the credential cannot be inspected via the UI.
sslmode=disable is intentional: traffic stays inside archiv-net.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Flyway runs inside the backend container at startup; V68's
${grafanaDbPassword} placeholder is resolved from this env var.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
obs-grafana now joins archiv-net so it can resolve archive-db:5432 for the
PO Overview dashboard's PostgreSQL datasource, and receives GRAFANA_DB_PASSWORD
so provisioning can interpolate it into the datasource config.
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add Flyway V68 migration that provisions a read-only PostgreSQL role
scoped to audit_log, documents, and transcription_blocks. The role's
password is injected via the new ${grafanaDbPassword} Flyway placeholder,
which FlywayConfig reads from the GRAFANA_DB_PASSWORD env var. The
migration is idempotent: CREATE on first run, ALTER on re-run.
Adds a Testcontainers integration test asserting positive grants on the
three intended tables and a negative grant on app_users (NFR-SEC-01).
Refs #651.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sara's cycle-2 S2: clarify the latent (but not actual) cross-test state
risk on the two metrics tests that hit the global REGISTRY instead of
the per-test fresh_metrics fixture. Migrating them would actually break
them — the /metrics endpoint is served by prometheus-fastapi-instrumentator
which binds to the default REGISTRY at app-construction time, and the
http_requests_total assertion only finds counters on that global
registry. Both tests already assert response shape only (status code,
content-type substring, body substrings), not numeric values, so the
shared-registry caveat is documented for future readers rather than
treated as a bug to fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sara's cycle-2 S1: the wall-clock assertion at < 0.05s could trip on a
slow CI runner under load even when the timer correctly excludes
spell-check. Sara's preferred structural fix (patch main.time.monotonic
with a deterministic sequence) proved awkward — the patched attribute is
the *global* time.monotonic which httpx and asyncio consume, exhausting
the sequence before the request reaches the engine loop.
Take the documented fallback: widen the bound to 0.09s and explain why.
The failure mode the test guards against (spell-check inside the timer)
would add 0.1s (2 × 0.05s sleep), so 0.09s catches the bug while leaving
~90ms of headroom for slow CI runners. Verified red→green by temporarily
moving correct_text inside the timer block: bound trips at 0.101s; the
fixed code reads ~0.001s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ADR-023 captures why prometheus-fastapi-instrumentator was chosen,
the build_metrics(registry) factory pattern, and the test rebinding
seam. The glossary gains four ops-aligned terms — illegible word,
models-ready gauge, recognition vs segmentation accuracy — so the
metrics documentation in OBSERVABILITY.md has a vocabulary to lean on.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- L2 container diagram now shows the Prometheus -> ocr:8000 scrape edge
(plus the previously-undrawn Prometheus -> backend edge for symmetry).
- OBSERVABILITY.md gains a full ocr_* metrics table with labels, units,
and the canonical example queries from issue #652.
- New "Internal-only endpoints" subsection captures the unauthenticated
/metrics caveat and provides the Caddy block snippet for the case
where the service ever gets a host port.
- Explicit note that MetricsPathFilter only quiets uvicorn stdout, and
the OCR metrics must never carry PII or document content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three small drops that pay back later:
- Note that main.metrics is import-time bound and tests must
monkeypatch `main.metrics`, not the registry.
- Flag the /metrics endpoint as unauthenticated and cross-link the
Caddy-block snippet in docs/OBSERVABILITY.md.
- Pin prometheus-client to the exact 0.25.0 patch version already
resolved by prometheus-fastapi-instrumentator 7.0.0, so an upstream
bump cannot silently slip in.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each metrics test was repeating the same five-line block — patch
kraken_engine.load_models, patch load_spell_checker, instantiate the
AsyncClient, force _models_ready True, restore it. Lift the lot into a
single async context manager so each test body shrinks to its real
arrange / act / assert intent.
Tests that drive the lifespan directly (models_ready gauge) or stub
asyncio.to_thread for /train (which already patches _models_ready) stay
unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The /train, /train-sender, and /segtrain endpoints each duplicated the
same eight-line try/except + counter + gauge block around the
asyncio.to_thread call. Lift it into _record_training(runner, kind),
which accepts a sync- or async-returning callable for flexibility.
Each endpoint now ends with a single return line. Behaviour preserved —
status codes, error propagation, and metric labels stay identical.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two block-iteration loops (/ocr and /ocr/stream's standard generator)
both ran the same word-total and illegible-word increments. Lift them
into a single helper so each call site becomes one line and the counter
intent reads cleanly. Pure refactor — no behavior change, tests stay green.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The asyncio.to_thread patch stubbed out the entire _run_training call,
hiding the real error path. Replacing it with a failing CompletedProcess
from subprocess.run exercises the actual ketos-failed branch and keeps
the test's intent — error counter bumps, 500 surfaces — intact.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If uvicorn's access log format ever changes (args=None, or shorter
than 3 elements), the filter must keep forwarding records rather than
silently dropping them. Two extra LogRecords cover both edge cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two regression tests:
- /train-sender hitting the success path bumps the recognition counter
(previously only /train and /segtrain were covered).
- A successful run whose result.accuracy is None must not call set() on
ocr_model_accuracy — the gauge stays at its default 0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the guided generator's page_started timer wrapped the entire
region loop including the synchronous correct_text() call, inflating
ocr_processing_seconds with spell-check latency. Sum the per-region
engine.extract_region_text durations instead so the histogram matches
the unguided stream's "engine only" semantic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks in the post-download placement of the counter increment so a
regression that moves it back above _download_and_convert_pdf would fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The TODO was a placeholder for this work — the OCR service now exposes
/metrics so the target will flip from DOWN to UP on next image rebuild.
Refs #652
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a logging.Filter on uvicorn.access that drops records whose request
path is /metrics or /health. Each is hit on a tight schedule (Prometheus
scrape interval and Docker healthcheck), so unfiltered they dominate the
access log without carrying any information about real traffic.
Refs #652 (Nora's recommendation)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors the existing _models_ready bool so Prometheus has a time-series
liveness/readiness signal for future alerting rules (e.g.
ocr_models_ready < 1 for 2m).
Refs #652 (AC7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hits /train then /segtrain through the same test, each with a distinct
mocked accuracy, and asserts the labelled gauges reflect the two values.
Locks down the kind-label separation between recognition and segmentation
accuracy (decision #2).
Refs #652 (AC6)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps the await asyncio.to_thread(_run_*) calls in /train, /train-sender,
and /segtrain with try/except. Recognition training (/train, /train-sender)
shares kind="recognition"; /segtrain uses kind="segmentation". The
ocr_model_accuracy gauge is set per kind on success.
Refs #652 (AC6, decision #2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps every asyncio.to_thread(engine.extract_*) call with time.monotonic()
deltas in /ocr (per document) and in both /ocr/stream generators (per page).
Streaming buckets are the useful operational signal; the non-streaming
observation is a bonus.
Refs #652 (AC5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Walks block["words"] before apply_confidence_markers strips the list, then
increments ocr_words_total by len(words) and ocr_illegible_words_total by
the count below threshold. Same pattern in both /ocr and /ocr/stream so the
ratio illegible/words is a faithful quality signal across endpoints.
Refs #652 (AC4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bumps the counter in both /ocr/stream except blocks (standard and guided
generators) so the existing skipped_pages local variable now also flows
into Prometheus.
Refs #652 (AC3b)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bumps the counter inside both the standard and guided /ocr/stream
generators after a page yields its blocks, before the per-page json line is
emitted. Also moves the ocr_jobs_total increment for /ocr/stream right after
engine selection so the counter still fires when a page later errors out.
Refs #652 (AC3a)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks down AC2 for the non-Kurrent path. The same code branch in /ocr that
sets engine_name from script_type now has explicit coverage for both
HANDWRITING_KURRENT → kraken and TYPEWRITER → surya.
Refs #652 (AC2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pick engine="kraken" for HANDWRITING_KURRENT, engine="surya" otherwise,
then increment after the blocks have been extracted.
Refs #652 (AC2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Encapsulates every custom OCR metric in an OcrMetrics frozen dataclass and
exposes a `build_metrics(registry)` factory. Production main.py binds against
the default REGISTRY; tests construct a fresh CollectorRegistry per case and
monkeypatch main.metrics, so counter values stay isolated between tests
(decision #3 on issue #652, Option A).
Refs #652
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks down AC1: prometheus-fastapi-instrumentator must keep auto-exposing
http_requests_total and http_request_duration_seconds for application
traffic, not just register the /metrics endpoint.
Refs #652
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mount the instrumentator immediately after FastAPI app creation, excluding
/health and /metrics from request metrics to keep http_requests_total focused
on real application traffic.
Refs #652
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A symlink placed inside importDir pointing to a file outside it would pass
isValidImportFilename (no forbidden chars in the symlink name) and be found
by Files.walk. Now checks candidate.getCanonicalPath() against
baseDir.getCanonicalPath() — if the resolved path escapes importDir,
throws DomainException.internal and aborts the import. Adds regression
test using @TempDir + Files.createSymbolicLink.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces MassImportService.SkipReason with all five values —
INVALID_FILENAME_PATH_TRAVERSAL, INVALID_PDF_SIGNATURE, FILE_READ_ERROR,
ALREADY_EXISTS, S3_UPLOAD_FAILED — making the full set of reasons greppable
and type-safe. SkippedFile.reason changes from String to SkipReason;
importSingleDocument return type updated accordingly. JSON serialisation
is unchanged (Jackson serialises enums by name). All tests updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents that .hidden.pdf and "Brief an Oma.pdf" correctly pass the
isValidImportFilename guard — both are valid basenames common in the archive.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds checks for U+2215 DIVISION SLASH (∕), U+FF0F FULLWIDTH SOLIDUS (/),
and U+29F5 REVERSE SOLIDUS OPERATOR (⧵) — all of which bypass the existing
ASCII separator checks on Linux path resolution. Adds a clarifying comment on
the Paths.get().isAbsolute() call explaining its InvalidPathException safety
boundary. Adds 3 regression tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rejects path-traversal filenames before findFileRecursive runs.
Guard runs on the derived filename (after the ternary) as specified.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codifies the path-traversal constraint that was previously safe by
accident (findFileRecursive's getFileName() strip) but had no explicit
guard or test coverage. Fixes issue #530.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All route spec files that mock $lib/shared/api.server were missing
extractErrorCode from the mock factory, causing a vitest "No export defined"
error after the refactor introduced the new export.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>