Adds tags.py module implementing a three-outcome heuristic:
- Individual-to-individual correspondence tags ("Clara an Herbert") → dropped
- Group/collective correspondence ("Clara an Kinder", "Walter an Geschwister") → Briefwechsel/<value>
- Semantic/event tags ("Brautbriefe", "Alltag", "zur Hochzeit") → Themen/<value>
Three correspondence patterns detected: space-an-space, starts-with-"an ",
and abbreviated-sender form ("Maria W.an Clara").
COLLECTIVE_TERMS in config.py extended with 17 plural/group relational terms
(söhne, brüder, schwiegereltern, cousinen, etc.) confirmed against the full Excel.
Also adds two-phase summary mining: every run emits review/tag-candidates.csv;
subsequent runs apply keywords from overrides/approved-themes.csv as Themen tags.
Outputs: canonical-documents.xlsx gets pipe-separated "Parent/Child" tag paths;
canonical-tag-tree.xlsx provides the full tag hierarchy for backend pre-import.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Spanish month names (Mexican-branch letters) to config.MONTHS and let
the month-first matcher accept a hyphen (not just a dot) before the year, so
"Mayo 18-1929"/"Junio 7-904" parse without manual overrides. Also bound
4-digit years to 1700-2100 so gross typos ("23-9003") stay in review instead
of producing a bogus year. Cuts unknown-date rate 9.2% -> 7.9%.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The unmatched list was just non-family correspondents (expected noise);
their count stays in summary.txt and they remain in canonical-persons.xlsx.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
page.server.spec.ts picked up an unrelated reader-dashboard test case via
a cross-session staging race; restore it to match main so this PR only
touches the import-normalizer tool + docs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Recovered from an entangled commit: these files were correct but had been
bundled into an unrelated reader-dashboard commit by a concurrent session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add bool guard before the int branch in _cell_to_str so True/False
cells are preserved as "True"/"False" instead of "1"/"0". Add two
regression tests covering the fix and missing-sheet error.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_preprocess now sets approx=True when a leading marker is stripped; add
_match_year_only so bare years (e.g. "nach 1900" -> "1900") resolve to
1900-01-01/YEAR before being upgraded to APPROX. Strengthen
test_parse_approx_marker_upgrades_precision and add
test_parse_leading_qualifier_is_approx (11 tests, all pass).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Document the raw archive spreadsheet findings (IMP-01..12) and a
requirements spec for an offline normalizer that produces a clean
canonical dataset before import. Local docs only; no Gitea issue yet.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Restore JavaDoc on DocumentSearchResult.of() and .paged() factory methods
- Remove redundant null guards on @Builder.Default collections in toListItem()
- Map DocumentListItem fields explicitly in DocumentMultiSelect before cast
- Add DocumentListItem required fields to docFactory in spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use documentService.getDocumentById() in detail_stillReturnsTrainingLabels
so the Document.full entity graph eager-loads trainingLabels
- Flatten makeItem() factory in DocumentList.svelte.test.ts (nested
document: {} overrides broke item.id / item.documentDate access)
- Remove { document: {} } wrapper from DocumentMultiSelect.svelte.spec.ts
mock responses — component now reads body.items directly as flat items
- Flatten single nested item in page.svelte.test.ts document list test
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All components, specs, and the generated API client now use the new
DocumentListItem shape — flat access (item.title, item.sender) instead of
the removed item.document.* nesting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove trainingLabels from Document.list entity graph now that DocumentListItem
does not touch that association. Integration tests guard against future
LazyInitializationException regressions and confirm Document.full still
loads trainingLabels for the detail endpoint.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminates excessive data exposure (OWASP API3:2023) — transcription,
filePath, fileHash, thumbnailKey, scriptType and other detail-only fields
are no longer serialised in the list API response.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ADR-024 records the deliberate cross-domain link (obs-grafana joins
archiv-net to query archive-db via the SELECT-only grafana_reader role),
the rejected alternatives (Prometheus exporter, read replica, versioned
migration + flyway repair, hardcoded fallback), and the consequences —
specifically that a Grafana compromise gains TCP reach to archive-db
but is bounded by the role's least-privilege grants.
The DEPLOYMENT.md runbook documents the rotation procedure that
R__grafana_reader_password.sql now enables: bump GRAFANA_DB_PASSWORD,
restart backend (Flyway re-applies because the resolved checksum
changed), restart obs-grafana (datasource picks up the new env var).
Also calls out the fail-closed startup behavior so operators who hit
IllegalStateException know it is deliberate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original 4 tests asserted SELECT existed on the three granted tables
and was absent on app_users. That left two gaps a future migration could
slip through silently:
- INSERT/UPDATE/DELETE on the granted tables — if someone GRANTed write
access on, say, documents to grafana_reader, the SELECT positives stay
green and the boundary is breached invisibly.
- Other PII / sensitive tables — the single app_users negative checks
one table; a wildcard "GRANT SELECT ON ALL TABLES IN SCHEMA public"
would still leave it green by accident if app_users wasn't the only
sensitive table.
Switch to a hasPrivilege(table, privilege) helper, add three write-deny
tests (INSERT/UPDATE/DELETE on each granted table), and replace the
single app_users negative with a parameterized sweep over app_users,
user_groups, persons, notifications, document_comments,
document_annotations, geschichten. New sensitive tables get added to
that list as they appear.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
V68 used to set the role's password in a versioned migration, which Flyway
applies exactly once per database. Rotating GRAFANA_DB_PASSWORD therefore
had no effect on the DB role — operators would need a manual ALTER ROLE
or a `flyway repair` that nobody documented. The shape conflated two
lifecycles: schema migration (one-shot, immutable) and credential
provisioning (rotatable).
Split into:
- V68 (versioned, immutable): creates the role and applies SELECT grants
on audit_log, documents, transcription_blocks.
- R__grafana_reader_password.sql (repeatable): issues ALTER ROLE … PASSWORD
with the placeholder. Flyway computes the checksum on the resolved
content, so any change to GRAFANA_DB_PASSWORD changes the checksum and
re-applies the migration on the next boot. Rotation becomes "bump env
var + restart backend".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>