audit(db): score Flyway migrations + DB schema against legibility rubric #391

New Issue

marcel · 2026-05-04T16:04:29+02:00

marcel commented

2026-05-04 16:04:29 +02:00

Context

Part of Epic #387 — Codebase Legibility Audit. This is AUDIT-4: read-only assessment of the database schema and migration history. Tobias (PM-with-CS) will look at this to understand the data model; Anja (Backend Dev) will look to understand evolution and conventions.

The output is a Markdown report under docs/audits/audit-db-report.md following the per-subsystem orient template.

Inputs (read first)

See the first comment of #387 for the complete reference bundle.

Scope (in)

Every Flyway migration under backend/src/main/resources/db/migration/ (V*__*.sql)
The composite/effective schema as a logical whole — what tables exist, what columns, what indexes, what FKs
Migration naming convention compliance
scripts/schema.sql (the snapshot for reference)
scripts/large-data.sql (note its purpose; flag if confusing)

Scope (out)

JPA entity Java code (AUDIT-2 covers that — but flag mismatches between entity field names and column names in §5)
Application data, test fixtures
Database performance tuning (out of scope for legibility)

Rubric checks particularly relevant to this subsystem

C3.3 (table naming consistency, glossary alignment — users vs app_users vs persons is exactly the kind of clarity test Tobias will run), C7.5 (per-table comments OR a domain-grouped schema overview), C9.3 (migration naming and non-trivial migration explanations), C10.4 (any rolled-back / commented-out / abandoned migration patterns).

Specific things to verify (subsystem-specific)

Domain grouping clarity. Group every table under one of the canonical domains (document, person, tag, user, geschichte, notification, ocr, audit). Flag any table whose domain isn't obvious from its name.
Naming consistency. Singular vs plural? snake_case consistent? Junction tables follow a pattern?
Migration narrative. Read migrations in order — does the schema's evolution tell a coherent story, or does it look like an accumulation of fixes?
Junction tables. Document_tags, document_receivers, geschichten_persons, geschichten_documents, group_permissions, transcription_block_mentioned_persons — verify these are documented somewhere or self-evidently named.
Audit log retention. audit_log table — is the append-only constraint enforced at DB level (REVOKE UPDATE/DELETE) and documented?

Acceptance criteria

File docs/audits/audit-db-report.md exists on a feature branch
§2 contains a complete table inventory with apparent domain assignment per the canonical domain set
§3 scorecard covers C3.3, C7.5, C9.3, C10.4 at minimum
§5 lists any table whose domain assignment is ambiguous
§8 recommends either per-table comments OR a domain-grouped docs/SCHEMA.md overview document
§10 lists top-5 recommendations
PR opened and merged before this issue is closed

Definition of Done

Report committed under docs/audits/audit-db-report.md on main. Top-5 recommendations summarized as a closing comment.

Dispatch

Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit backend/src/main/resources/db/migration/ and the effective DB schema per the orient template, producing docs/audits/audit-db-report.md. Read-only.

## Context Part of **Epic #387** — Codebase Legibility Audit. This is **AUDIT-4**: read-only assessment of the database schema and migration history. Tobias (PM-with-CS) will look at this to understand the data model; Anja (Backend Dev) will look to understand evolution and conventions. The output is a Markdown report under `docs/audits/audit-db-report.md` following the per-subsystem orient template. ## Inputs (read first) See **the first comment of #387** for the complete reference bundle. ## Scope (in) - Every Flyway migration under `backend/src/main/resources/db/migration/` (`V*__*.sql`) - The composite/effective schema as a logical whole — what tables exist, what columns, what indexes, what FKs - Migration naming convention compliance - `scripts/schema.sql` (the snapshot for reference) - `scripts/large-data.sql` (note its purpose; flag if confusing) ## Scope (out) - JPA entity Java code (AUDIT-2 covers that — but flag mismatches between entity field names and column names in §5) - Application data, test fixtures - Database performance tuning (out of scope for legibility) ## Rubric checks particularly relevant to this subsystem C3.3 (table naming consistency, glossary alignment — `users` vs `app_users` vs `persons` is exactly the kind of clarity test Tobias will run), C7.5 (per-table comments OR a domain-grouped schema overview), C9.3 (migration naming and non-trivial migration explanations), C10.4 (any rolled-back / commented-out / abandoned migration patterns). ## Specific things to verify (subsystem-specific) - **Domain grouping clarity.** Group every table under one of the canonical domains (document, person, tag, user, geschichte, notification, ocr, audit). Flag any table whose domain isn't obvious from its name. - **Naming consistency.** Singular vs plural? snake_case consistent? Junction tables follow a pattern? - **Migration narrative.** Read migrations in order — does the schema's evolution tell a coherent story, or does it look like an accumulation of fixes? - **Junction tables.** Document_tags, document_receivers, geschichten_persons, geschichten_documents, group_permissions, transcription_block_mentioned_persons — verify these are documented somewhere or self-evidently named. - **Audit log retention.** `audit_log` table — is the append-only constraint enforced at DB level (REVOKE UPDATE/DELETE) and documented? ## Acceptance criteria - [ ] File `docs/audits/audit-db-report.md` exists on a feature branch - [ ] §2 contains a complete table inventory with apparent domain assignment per the canonical domain set - [ ] §3 scorecard covers C3.3, C7.5, C9.3, C10.4 at minimum - [ ] §5 lists any table whose domain assignment is ambiguous - [ ] §8 recommends either per-table comments OR a domain-grouped `docs/SCHEMA.md` overview document - [ ] §10 lists top-5 recommendations - [ ] PR opened and merged before this issue is closed ## Definition of Done Report committed under `docs/audits/audit-db-report.md` on `main`. Top-5 recommendations summarized as a closing comment. ## Dispatch > Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit `backend/src/main/resources/db/migration/` and the effective DB schema per the orient template, producing `docs/audits/audit-db-report.md`. Read-only.

marcel added this to the Codebase Legibility milestone 2026-05-04 16:04:29 +02:00

marcel added the P2-medium audit legibility labels 2026-05-04 16:05:43 +02:00

marcel referenced this issue

2026-05-04 16:05:50 +02:00

audit(rollup): produce global readiness scorecard from subsystem audits #393

marcel referenced this issue

2026-05-04 16:07:41 +02:00

docs(legibility): write docs/ARCHITECTURE.md with diagram and domain list #396

marcel referenced this issue

2026-05-04 16:17:02 +02:00

cleanup(legibility): remove dead code identified by audits #412

marcel commented

2026-05-04 16:29:20 +02:00

AUDIT-4 — Database schema & Flyway migrations

Scope: backend/src/main/resources/db/migration/V*.sql (V1–V59, 57 files), scripts/schema.sql snapshot, scripts/large-data.sql.

Reference bundle: issue #387 first comment (Legibility Rubric + Per-Subsystem Orient Template).

§1 Subsystem profile

The database layer is PostgreSQL 16 managed by Flyway with versioned Vn__description.sql migrations. The schema is owned by the Spring Boot backend; JPA/Hibernate maps entities onto these tables. Migrations are append-only by convention, but the layer also includes:

one DB-level baseline (V1__initial_schema.sql) that is a verbatim pg_dump of an early schema (Hibernate-generated fkXXXXXXXX constraint names, no gen_random_uuid() defaults),
two trigger-driven features (FTS in V34/V35, no other DDL triggers),
one append-only audit table (V46) plus one corrective re-REVOKE (V47),
two explicit data backfills (V35 FTS, V50 comments→audit, V51 annotation_id, V59 grant seed).

backend/src/main/resources/db/migration/
├── V1__initial_schema.sql           — pg_dump baseline
├── V2…V47                            — incremental DDL + small data fixes
├── V48…V59                           — newer features (audit, thumbnails, family tree, mentions, geschichten)
scripts/
├── schema.sql                        — STALE pg_dump (≈V1+V2 only)
└── large-data.sql                    — perf seed (1321 lines of INSERT)

Counts. 57 migration files, V1–V59 with two gaps (V37, V43). V37 was created and later removed (needsExpert rolled back; see §7). V43 was never assigned (silent gap).

§2 Inventory — every table mapped to a canonical domain

Table	Created in	Domain	Notes
`documents`	V1	document	Core entity. Pluralized.
`document_receivers`	V1	document (junction)	`documents ⇄ persons` (M:N)
`document_tags`	V1	document (junction)	`documents ⇄ tag` (M:N)
`document_versions`	V9	document	Versioned snapshot (JSONB)
`document_annotations`	V10	document.annotation	Sub-package
`document_comments`	V12	document.comment	Sub-package; threads via `parent_id`
`comment_mentions`	V17	document.comment (junction)	`document_comments ⇄ users`
`transcription_blocks`	V18	document.transcription	Sub-package
`transcription_block_versions`	V19	document.transcription	History rows
`transcription_block_mentioned_persons`	V56	document.transcription (junction)	No FK on `person_id` by design (V56 header)
`document_training_labels`	V29	document (or ocr?)	Ambiguous — see §5
`document_thumbnails` (cols)	V52/V53	document	Inline cols on `documents`
`persons`	V1	person
`person_name_aliases`	V21	person	Historical names (married/widowed)
`person_relationships`	V54	person.relationship	Family tree edges
`sender_models`	V40	person + ocr	Per-person Kraken HTR model — dual-domain, see §5
`tag`	V1	tag	Singular — outlier; self-FK `parent_id` from V39
`users`	V1	user	Maps to entity `AppUser` — name mismatch, see §5
`user_groups`	V1	user	Roles
`users_groups`	V1	user (junction)	`users ⇄ user_groups`
`group_permissions`	V1	user (junction)	`user_groups ⇄ permission_string` (no FK to a permissions table; `Set<String>`)
`password_reset_tokens`	V8	user
`invite_tokens`	V45	user
`invite_token_group_ids`	V45	user (junction)	Awkward name — see §5
`geschichten`	V58	geschichte	Blog-like stories
`geschichten_persons`	V58	geschichte (junction)	M:N to `persons`
`geschichten_documents`	V58	geschichte (junction)	M:N to `documents`
`notifications`	V16	notification
`ocr_jobs`	V25	ocr
`ocr_job_documents`	V25	ocr (junction)	`ocr_jobs ⇄ documents`
`ocr_training_runs`	V30	ocr
`audit_log`	V46	shared/audit	Append-only (see §3 / C7.5)

Total: 27 tables (1 outlier-named, 1 ambiguous, 1 dual-domain — see §5).

§3 Rubric scorecard (focus C3.3, C7.5, C9.3, C10.4)

Check	Result	Severity	Evidence	Recommendation
C3.3 — table naming/glossary	FAIL	Critical	`users` table is mapped to entity `AppUser` (`backend/src/main/java/.../model/AppUser.java:27`); table name says "users" but the domain glossary explicitly insists `Person ≠ AppUser`. A Tobias-style reader who opens `users` first will conflate it with `persons`.	Rename `users` → `app_users` in a dedicated migration (also rename `users_groups` → `app_users_groups`, `comment_mentions.user_id` → `app_user_id`). OR add a `COMMENT ON TABLE users IS 'AppUser accounts — NOT historical persons. See persons.'`
C3.3 — singular vs plural	FAIL	Major	`public.tag` is singular while every other table is plural (`documents`, `persons`, `tags`-as-junction-prefix in `document_tags`). `Tag.java` has no `@Table` annotation so JPA naming defaults leaked through.	Migration `Vn__rename_tag_to_tags.sql`: `ALTER TABLE tag RENAME TO tags;` then add `@Table(name="tags")` to `Tag.java`. Update `document_tags.tag_id` FK target — implicit, but verify.
C3.3 — junction naming	FAIL	Minor	Junctions are inconsistent: most use `<plural>_<plural>` (`document_tags`, `document_receivers`, `geschichten_persons`, `geschichten_documents`, `users_groups`), but `group_permissions` uses `<singular>_<plural>` (no second table — it's `@ElementCollection`), `comment_mentions` is `<singular>_<plural>` (also implicit), `invite_token_group_ids` ends in `_ids` which no other junction does (`backend/src/main/resources/db/migration/V45__add_invite_tokens.sql:18`). `transcription_block_mentioned_persons` is descriptive but breaks the pattern.	Rename `invite_token_group_ids` → `invite_token_groups`. Document the `@ElementCollection` pattern (`group_permissions`, `comment_mentions`) explicitly so readers don't expect a sister entity.
C7.5 — per-table comment OR overview	FAIL	Minor	Zero `COMMENT ON TABLE` / `COMMENT ON COLUMN` statements anywhere in the migration set. No `docs/SCHEMA.md`. The migration headers are the only documentation, and several creation migrations (V1, V8, V9, V10, V12, V17, V25, V29, V45) ship with no comment at all. Tobias has to read application code to learn what `audit_log.kind` values exist, what `documents.script_type` values are valid, what `transcription_block_mentioned_persons.display_name` is for.	Add `docs/SCHEMA.md` with one paragraph per domain group + an ER sketch. Cheaper than per-table SQL comments, mutates with the codebase. (See §8.)
C7.5 — audit log retention documented	PARTIAL	Major	V46 attempts DB-level append-only via `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` (V46:25). Header explicitly notes the REVOKE is a no-op when the migration runs as the table owner — meaning the application user retains all privileges in dev/local because Flyway runs as the same role. V47 then runs the same REVOKE again "for the actual application role" but still uses `CURRENT_USER` (V47:8) — same no-op problem. The append-only guarantee is not enforced at DB level today. The migration is honest about it ("enforced at the application layer only") but the dual-REVOKE looks like it should work and doesn't.	Either (a) split DB roles so Flyway runs as `archive_owner` and the app runs as `archive_app`, then `REVOKE … FROM archive_app` will bite, or (b) drop both REVOKE statements with a comment "kept at app layer only" so future readers don't think it's enforced. Document the choice in `docs/SCHEMA.md` § audit.
C9.3 — migration naming + non-trivial explanation	PASS (with caveats)	Minor	Naming convention `V<n>__<verb>_<noun>` is consistent (verbs: `add`, `drop`, `backfill`, `make`, `seed`). 25 of 57 migrations carry a header comment explaining intent — strong on the harder ones (V31 nullable transcript, V34 FTS triggers, V46 audit, V51 annotation_id backfill, V54 family tree, V56 mentioned persons, V59 BLOG_WRITE seed). 18 migrations are one-liner `ALTER TABLE … ADD COLUMN` and self-explanatory. Caveat: V13, V20, V32, V36, V41, V44, V47 add cross-cutting columns or fix prior migrations without naming the issue/PR — Anja can't trace why.	Adopt a 1-line convention: every migration starts with `-- #<issue>: <one-line why>`. Backfill is optional.
C9.3 — sequence integrity	FAIL	Minor	V37 and V43 are missing. V37 was rolled back (commit `ca0cf490`, see §7). V43 was never created (no git history). Flyway tolerates gaps but a reader scanning `ls` sees holes and asks "did I lose a file?"	Add a placeholder `V43__skipped.sql` containing only a comment `-- intentionally skipped during issue #XXX rebase; see commit YYYY` and same for `V37__skipped.sql`, OR rebase the sequence (riskier on prod).
C10.4 — half-finished / abandoned migrations	FAIL	Major	(a) `scripts/schema.sql` is a stale snapshot — only contains tables created up to ~V2 (8 tables). It's missing 19 of the 27 current tables (no `transcription_blocks`, `audit_log`, `geschichten`, `notifications`, `ocr_`, `person_`, etc.). `scripts/CLAUDE.md` says it's "for quick reference only" but Tobias will be misled hard. (b) V37 hole + git note "V37 was already applied; rename V37→V38" implies a production rebase that the file tree no longer reflects. (c) `scripts/large-data.sql` has no header explaining what it is or how to safely reset before reloading.	(a) Delete `scripts/schema.sql` or regenerate it on every release via `pg_dump` (CI step). (b) Add a `MIGRATIONS.md` next to the migration directory listing intentional gaps. (c) Add a 5-line header to `large-data.sql`: purpose, prereq, idempotency.
C3.3 — column-name mismatch with entity field names	FAIL	Major	`documents.meta_date` ↔ `Document.documentDate` (field) — naming diverges between layer (V1 baseline). `comment_mentions.user_id` references `users(id)` but the entity is `AppUser`. `audit_log.actor_id` references `users(id)` (V46:9) but is read as "the AppUser who acted". Tobias opens schema, then opens entity, has to translate.	Either rename DB columns to match entity fields (breaking, requires migration) OR add `@Column(name="...")` everywhere and mention the convention in `docs/SCHEMA.md`.

§4 Subsystem health summary

Category	PASS	FAIL-Critical	FAIL-Major	FAIL-Minor
C3 (Domain Clarity)	0	1 (C3.3 users-vs-AppUser)	2 (junction naming asymmetry, column/entity mismatch)	1 (singular `tag`)
C7 (Documentation)	0	0	1 (audit retention claim vs reality)	1 (zero COMMENTs / no SCHEMA.md)
C9 (Operability)	1 (C9.3 naming)	0	0	1 (sequence gaps V37/V43)
C10 (Code Sanity)	0	0	1 (stale `scripts/schema.sql` + V37 ghost)	0

Overall health: 🔴 RED. One Critical (C3.3 users vs AppUser) and three Majors triggers the threshold (🟢 requires 0 Critical AND ≥80% pass — currently 1 of 8 applicable checks passes cleanly).

§5 Domain-mapping gaps (ambiguous tables)

Table	Plausible domains	Blocker question
`document_training_labels` (V29)	`document` (data lives on a Document) or `ocr` (only consumer is the OCR training pipeline)	Should this move to `ocr/` package alongside `OcrTrainingRun`? Today the Document entity carries `Set<String> trainingLabels` so domain ownership splits across `document/` and `ocr/`.
`sender_models` (V40)	`person` (1:1 with Person) or `ocr` (it's a per-sender HTR model)	Same as above. The entity is `SenderModel` in `model/`, not in any `ocr/` subpackage — but it only exists because of OCR.
`users` (table) ↔ `AppUser` (entity)	user	Hard mismatch with the canonical glossary that says "user" domain owns AppUser. The table name says "users" — readers expecting the canonical domain set look for `app_users`.
`audit_log` ↔ `AuditLog` (entity)	shared/audit	Lives under `audit/` package but is a single bare table — the package name implies a domain, the table name implies infrastructure. Acceptable, but worth one sentence in `docs/SCHEMA.md`.
`invite_token_group_ids`	user (junction)	Name suggests it stores raw IDs as scalars (it does — `@ElementCollection<UUID>` style), but the column is actually a UUID FK. Confusing for first-time reader.

§6 Cross-cutting candidates

N/A. Tables belong to exactly one domain (or one sub-package within a domain). The single shared table — audit_log — already lives under the canonical shared/audit umbrella.

§7 Dead / suspicious-code register (C10 findings)

File:line	Smell	Suggested action
`scripts/schema.sql:29-156`	Stale snapshot — covers only V1+V2-era schema (8 of 27 tables) yet `scripts/CLAUDE.md` calls it "complete database schema dump for reference".	Regenerate via `pg_dump` in a CI hook on every release tag, or delete and replace with a one-line note pointing at Flyway.
`backend/src/main/resources/db/migration/V37__*.sql` (deleted)	Rolled-back migration — V37 once existed (`needsExpert` flag), was reverted in commit `ca0cf490`, leaving a hole in the sequence. Successor-X sees `V36 → V38` and assumes a missing file.	Add `V37__skipped_needs_expert_rolled_back.sql` containing only a `SELECT 1;` and a comment explaining the rollback. Same for V43 (silent gap with no git trace).
`backend/src/main/resources/db/migration/V46__add_audit_log.sql:25` and `V47__add_user_color.sql:8`	No-op security control — both `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` calls are documented (V46:22-24) as no-ops because the role is the table owner. Two migrations that look protective but aren't.	Either implement the role split (preferred) or delete the REVOKE lines and replace with a single `COMMENT ON TABLE audit_log IS 'Append-only at application layer only'`.
`backend/src/main/resources/db/migration/V1__initial_schema.sql:202-263`	Hibernate-generated FK names (`fkc99c5qjulwx9gru07yrhicgd2`, `fkl5xhww7es3b4um01vmly4y18m`, etc.). V14 has to reference these unreadable names by hand.	One-time `ALTER TABLE … RENAME CONSTRAINT` migration to give them speaking names (`fk_documents_sender`, `fk_document_tags_document`, …).
`scripts/large-data.sql`	1321 lines of `INSERT` with no header. Idempotency relies on `ON CONFLICT DO NOTHING` but no FK preconditions — running it on an empty DB then on a populated DB will silently produce different results.	Add header: purpose, prerequisites (e.g., reset DB first), expected runtime, dataset size.
`backend/src/main/resources/db/migration/V35__backfill_fts_search_vector.sql:6`	`UPDATE documents SET title = title;` — looks like a typo. The header explains it's intentional (re-fires the BEFORE UPDATE trigger), but a stranger reading without context will rewrite it as a bug.	Header is good; consider `UPDATE documents SET title = title /* trigger re-fire — see V34 */;` for inline clarity.

§8 Documentation gaps

The DB layer needs one of these (recommend the second):

Option A — per-table SQL comments. Add COMMENT ON TABLE/COMMENT ON COLUMN to every table and to every column whose meaning is non-obvious (documents.script_type, documents.metadata_complete, transcription_blocks.source, audit_log.kind, audit_log.payload, notifications.type, etc.). Survives pg_dump. Cost: ~40 statements in a new migration.

Option B — docs/SCHEMA.md overview (recommended). Domain-grouped page with:

one paragraph per table (purpose, key columns, lifecycle if any),
ER sketch per domain (document / person / user / ocr / audit),
explicit list of intentionally-skipped migration numbers (V37, V43),
glossary entry that resolves users ↔ AppUser and tag (singular) ↔ entity Tag,
audit-log retention statement (app-layer only, REVOKE is no-op).

Option B is cheaper, evolves alongside CLAUDE.md, and addresses C3.3 + C7.5 + half of C10.4 with one document.

Per-domain README files (C7.1) would also pull weight here — one paragraph in each backend package README about which tables it owns.

§9 Prerequisites for the big-bang refactor

REFACTOR-1 reorganizes packages, not tables. Therefore the schema layer needs no changes for REFACTOR-1 to land safely, provided:

The package move keeps each entity's @Table(name=...) value (and adds one to Tag.java to lock in the current tag table name even after the entity moves to tag/Tag.java).
TranscriptionBlock, DocumentAnnotation, DocumentComment stay in document.transcription, document.annotation, document.comment sub-packages (canonical D-OQ-2/3/4) — no table renames triggered.
SenderModel and OcrTrainingRun end up under the ocr/ package (D-OQ-5) — table names stay.

Independent prerequisite (not blocking REFACTOR-1): the users ↔ AppUser rename should happen in a separate migration before or after REFACTOR-1, never bundled — it touches FKs in 6+ tables (comment_mentions, audit_log, notifications, users_groups, password_reset_tokens, document_versions.editor_id, transcription_blocks.created_by/updated_by, etc.).

§10 Top 5 prioritized recommendations

Create docs/SCHEMA.md (domain-grouped overview). Single highest-leverage win for C3.3 + C7.5 + C10.4. Resolves the users/AppUser and tag/tags gotchas with a glossary section; documents the 27 tables + 6 junctions + intentional V37/V43 gaps in one read.
Fix or delete scripts/schema.sql. Either regenerate it on every release tag via pg_dump (CI hook) or delete it and replace with a one-line README pointing readers at db/migration/. Today it actively misleads — 19 of 27 tables are missing.
Resolve the audit-log append-only contradiction. Either split DB roles so the V46/V47 REVOKE statements bite, or remove them and add a single inline COMMENT ON TABLE audit_log IS 'append-only enforced at app layer only — see AuditLogRepository'. Today the migration claims protection it doesn't deliver.
Patch the migration sequence holes. Add stub V37__skipped_needs_expert_rolled_back.sql and V43__skipped.sql containing nothing but SELECT 1; and a 1-line -- comment explaining the rollback / accidental skip. Cheap; eliminates "did I lose a file?" friction.
Rename tag → tags in a dedicated migration and add @Table(name = "tags") to Tag.java. Smallest surface-area renames in the schema (one table, one FK in document_tags, one inline reference in V34 trigger function). Removes the only singular-vs-plural outlier and brings the table under the canonical naming convention.

Top blocker: users table maps to entity AppUser and shares lexical space with persons — Tobias will conflate them within 30 seconds of opening the schema.

# AUDIT-4 — Database schema & Flyway migrations **Scope:** `backend/src/main/resources/db/migration/V*.sql` (V1–V59, 57 files), `scripts/schema.sql` snapshot, `scripts/large-data.sql`. **Reference bundle:** issue #387 first comment (Legibility Rubric + Per-Subsystem Orient Template). --- ## §1 Subsystem profile The database layer is **PostgreSQL 16** managed by **Flyway** with versioned `Vn__description.sql` migrations. The schema is owned by the Spring Boot backend; JPA/Hibernate maps entities onto these tables. Migrations are append-only by convention, but the layer also includes: - one **DB-level baseline** (`V1__initial_schema.sql`) that is a verbatim `pg_dump` of an early schema (Hibernate-generated `fkXXXXXXXX` constraint names, no `gen_random_uuid()` defaults), - two **trigger-driven features** (FTS in V34/V35, no other DDL triggers), - one **append-only audit table** (V46) plus one corrective re-REVOKE (V47), - two **explicit data backfills** (V35 FTS, V50 comments→audit, V51 annotation_id, V59 grant seed). ``` backend/src/main/resources/db/migration/ ├── V1__initial_schema.sql — pg_dump baseline ├── V2…V47 — incremental DDL + small data fixes ├── V48…V59 — newer features (audit, thumbnails, family tree, mentions, geschichten) scripts/ ├── schema.sql — STALE pg_dump (≈V1+V2 only) └── large-data.sql — perf seed (1321 lines of INSERT) ``` **Counts.** 57 migration files, V1–V59 with **two gaps** (V37, V43). V37 was created and later removed (`needsExpert` rolled back; see §7). V43 was never assigned (silent gap). --- ## §2 Inventory — every table mapped to a canonical domain | Table | Created in | Domain | Notes | |---|---|---|---| | `documents` | V1 | document | Core entity. Pluralized. | | `document_receivers` | V1 | document (junction) | `documents ⇄ persons` (M:N) | | `document_tags` | V1 | document (junction) | `documents ⇄ tag` (M:N) | | `document_versions` | V9 | document | Versioned snapshot (JSONB) | | `document_annotations` | V10 | document.annotation | Sub-package | | `document_comments` | V12 | document.comment | Sub-package; threads via `parent_id` | | `comment_mentions` | V17 | document.comment (junction) | `document_comments ⇄ users` | | `transcription_blocks` | V18 | document.transcription | Sub-package | | `transcription_block_versions` | V19 | document.transcription | History rows | | `transcription_block_mentioned_persons` | V56 | document.transcription (junction) | **No FK on `person_id`** by design (V56 header) | | `document_training_labels` | V29 | document (or ocr?) | **Ambiguous — see §5** | | `document_thumbnails` (cols) | V52/V53 | document | Inline cols on `documents` | | `persons` | V1 | person | | | `person_name_aliases` | V21 | person | Historical names (married/widowed) | | `person_relationships` | V54 | person.relationship | Family tree edges | | `sender_models` | V40 | person + ocr | Per-person Kraken HTR model — **dual-domain, see §5** | | `tag` | V1 | tag | **Singular — outlier**; self-FK `parent_id` from V39 | | `users` | V1 | user | Maps to entity `AppUser` — **name mismatch, see §5** | | `user_groups` | V1 | user | Roles | | `users_groups` | V1 | user (junction) | `users ⇄ user_groups` | | `group_permissions` | V1 | user (junction) | `user_groups ⇄ permission_string` (no FK to a permissions table; `Set<String>`) | | `password_reset_tokens` | V8 | user | | | `invite_tokens` | V45 | user | | | `invite_token_group_ids` | V45 | user (junction) | Awkward name — see §5 | | `geschichten` | V58 | geschichte | Blog-like stories | | `geschichten_persons` | V58 | geschichte (junction) | M:N to `persons` | | `geschichten_documents` | V58 | geschichte (junction) | M:N to `documents` | | `notifications` | V16 | notification | | | `ocr_jobs` | V25 | ocr | | | `ocr_job_documents` | V25 | ocr (junction) | `ocr_jobs ⇄ documents` | | `ocr_training_runs` | V30 | ocr | | | `audit_log` | V46 | shared/audit | Append-only (see §3 / C7.5) | **Total: 27 tables** (1 outlier-named, 1 ambiguous, 1 dual-domain — see §5). --- ## §3 Rubric scorecard (focus C3.3, C7.5, C9.3, C10.4) | Check | Result | Severity | Evidence | Recommendation | |---|---|---|---|---| | **C3.3 — table naming/glossary** | **FAIL** | **Critical** | `users` table is mapped to entity `AppUser` (`backend/src/main/java/.../model/AppUser.java:27`); table name says "users" but the domain glossary explicitly insists `Person ≠ AppUser`. A Tobias-style reader who opens `users` first will conflate it with `persons`. | Rename `users` → `app_users` in a dedicated migration (also rename `users_groups` → `app_users_groups`, `comment_mentions.user_id` → `app_user_id`). OR add a `COMMENT ON TABLE users IS 'AppUser accounts — NOT historical persons. See persons.'` | | C3.3 — singular vs plural | FAIL | Major | `public.tag` is singular while every other table is plural (`documents`, `persons`, `tags`-as-junction-prefix in `document_tags`). `Tag.java` has no `@Table` annotation so JPA naming defaults leaked through. | Migration `Vn__rename_tag_to_tags.sql`: `ALTER TABLE tag RENAME TO tags;` then add `@Table(name="tags")` to `Tag.java`. Update `document_tags.tag_id` FK target — implicit, but verify. | | C3.3 — junction naming | FAIL | Minor | Junctions are inconsistent: most use `<plural>_<plural>` (`document_tags`, `document_receivers`, `geschichten_persons`, `geschichten_documents`, `users_groups`), but `group_permissions` uses `<singular>_<plural>` (no second table — it's `@ElementCollection`), `comment_mentions` is `<singular>_<plural>` (also implicit), `invite_token_group_ids` ends in `_ids` which no other junction does (`backend/src/main/resources/db/migration/V45__add_invite_tokens.sql:18`). `transcription_block_mentioned_persons` is descriptive but breaks the pattern. | Rename `invite_token_group_ids` → `invite_token_groups`. Document the `@ElementCollection` pattern (`group_permissions`, `comment_mentions`) explicitly so readers don't expect a sister entity. | | **C7.5 — per-table comment OR overview** | **FAIL** | Minor | Zero `COMMENT ON TABLE` / `COMMENT ON COLUMN` statements anywhere in the migration set. No `docs/SCHEMA.md`. The migration *headers* are the only documentation, and several creation migrations (V1, V8, V9, V10, V12, V17, V25, V29, V45) ship with no comment at all. Tobias has to read application code to learn what `audit_log.kind` values exist, what `documents.script_type` values are valid, what `transcription_block_mentioned_persons.display_name` is for. | Add `docs/SCHEMA.md` with one paragraph per domain group + an ER sketch. Cheaper than per-table SQL comments, mutates with the codebase. (See §8.) | | C7.5 — audit log retention documented | PARTIAL | Major | V46 *attempts* DB-level append-only via `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` (V46:25). Header explicitly notes the REVOKE is a **no-op** when the migration runs as the table owner — meaning the application user retains all privileges in dev/local because Flyway runs as the same role. V47 then runs the same REVOKE again "for the actual application role" but still uses `CURRENT_USER` (V47:8) — same no-op problem. **The append-only guarantee is not enforced at DB level today.** The migration is honest about it ("enforced at the application layer only") but the dual-REVOKE looks like it should work and doesn't. | Either (a) split DB roles so Flyway runs as `archive_owner` and the app runs as `archive_app`, then `REVOKE … FROM archive_app` will bite, or (b) drop both REVOKE statements with a comment "kept at app layer only" so future readers don't think it's enforced. Document the choice in `docs/SCHEMA.md` § audit. | | **C9.3 — migration naming + non-trivial explanation** | PASS (with caveats) | Minor | Naming convention `V<n>__<verb>_<noun>` is consistent (verbs: `add`, `drop`, `backfill`, `make`, `seed`). 25 of 57 migrations carry a header comment explaining intent — strong on the harder ones (V31 nullable transcript, V34 FTS triggers, V46 audit, V51 annotation_id backfill, V54 family tree, V56 mentioned persons, V59 BLOG_WRITE seed). 18 migrations are one-liner `ALTER TABLE … ADD COLUMN` and self-explanatory. **Caveat:** V13, V20, V32, V36, V41, V44, V47 add cross-cutting columns or fix prior migrations without naming the issue/PR — Anja can't trace why. | Adopt a 1-line convention: every migration starts with `-- #<issue>: <one-line why>`. Backfill is optional. | | C9.3 — sequence integrity | FAIL | Minor | V37 and V43 are missing. V37 was rolled back (commit `ca0cf490`, see §7). V43 was never created (no git history). Flyway tolerates gaps but a reader scanning `ls` sees holes and asks "did I lose a file?" | Add a placeholder `V43__skipped.sql` containing only a comment `-- intentionally skipped during issue #XXX rebase; see commit YYYY` and same for `V37__skipped.sql`, OR rebase the sequence (riskier on prod). | | **C10.4 — half-finished / abandoned migrations** | FAIL | Major | (a) `scripts/schema.sql` is a **stale snapshot** — only contains tables created up to ~V2 (8 tables). It's missing 19 of the 27 current tables (no `transcription_blocks`, `audit_log`, `geschichten`, `notifications`, `ocr_*`, `person_*`, etc.). `scripts/CLAUDE.md` says it's "for quick reference only" but Tobias will be misled hard. (b) V37 hole + git note "V37 was already applied; rename V37→V38" implies a production rebase that the file tree no longer reflects. (c) `scripts/large-data.sql` has no header explaining what it is or how to safely reset before reloading. | (a) Delete `scripts/schema.sql` or regenerate it on every release via `pg_dump` (CI step). (b) Add a `MIGRATIONS.md` next to the migration directory listing intentional gaps. (c) Add a 5-line header to `large-data.sql`: purpose, prereq, idempotency. | | C3.3 — column-name mismatch with entity field names | FAIL | Major | `documents.meta_date` ↔ `Document.documentDate` (field) — naming diverges between layer (V1 baseline). `comment_mentions.user_id` references `users(id)` but the entity is `AppUser`. `audit_log.actor_id` references `users(id)` (V46:9) but is read as "the AppUser who acted". Tobias opens schema, then opens entity, has to translate. | Either rename DB columns to match entity fields (breaking, requires migration) OR add `@Column(name="...")` everywhere and mention the convention in `docs/SCHEMA.md`. | --- ## §4 Subsystem health summary | Category | PASS | FAIL-Critical | FAIL-Major | FAIL-Minor | |---|---|---|---|---| | C3 (Domain Clarity) | 0 | 1 (C3.3 users-vs-AppUser) | 2 (junction naming asymmetry, column/entity mismatch) | 1 (singular `tag`) | | C7 (Documentation) | 0 | 0 | 1 (audit retention claim vs reality) | 1 (zero COMMENTs / no SCHEMA.md) | | C9 (Operability) | 1 (C9.3 naming) | 0 | 0 | 1 (sequence gaps V37/V43) | | C10 (Code Sanity) | 0 | 0 | 1 (stale `scripts/schema.sql` + V37 ghost) | 0 | **Overall health: 🔴 RED.** One Critical (C3.3 `users` vs `AppUser`) and three Majors triggers the threshold (🟢 requires 0 Critical AND ≥80% pass — currently 1 of 8 applicable checks passes cleanly). --- ## §5 Domain-mapping gaps (ambiguous tables) | Table | Plausible domains | Blocker question | |---|---|---| | `document_training_labels` (V29) | `document` (data lives on a Document) **or** `ocr` (only consumer is the OCR training pipeline) | Should this move to `ocr/` package alongside `OcrTrainingRun`? Today the Document entity carries `Set<String> trainingLabels` so domain ownership splits across `document/` and `ocr/`. | | `sender_models` (V40) | `person` (1:1 with Person) **or** `ocr` (it's a per-sender HTR model) | Same as above. The entity is `SenderModel` in `model/`, not in any `ocr/` subpackage — but it only exists because of OCR. | | `users` (table) ↔ `AppUser` (entity) | user | Hard mismatch with the canonical glossary that says "user" domain owns AppUser. The table name says "users" — readers expecting the canonical domain set look for `app_users`. | | `audit_log` ↔ `AuditLog` (entity) | shared/audit | Lives under `audit/` package but is a single bare table — the package name implies a domain, the table name implies infrastructure. Acceptable, but worth one sentence in `docs/SCHEMA.md`. | | `invite_token_group_ids` | user (junction) | Name suggests it stores raw IDs as scalars (it does — `@ElementCollection<UUID>` style), but the column is actually a UUID FK. Confusing for first-time reader. | --- ## §6 Cross-cutting candidates **N/A.** Tables belong to exactly one domain (or one sub-package within a domain). The single shared table — `audit_log` — already lives under the canonical `shared/audit` umbrella. --- ## §7 Dead / suspicious-code register (C10 findings) | File:line | Smell | Suggested action | |---|---|---| | `scripts/schema.sql:29-156` | **Stale snapshot** — covers only V1+V2-era schema (8 of 27 tables) yet `scripts/CLAUDE.md` calls it "complete database schema dump for reference". | Regenerate via `pg_dump` in a CI hook on every release tag, or delete and replace with a one-line note pointing at Flyway. | | `backend/src/main/resources/db/migration/V37__*.sql` (deleted) | **Rolled-back migration** — V37 once existed (`needsExpert` flag), was reverted in commit `ca0cf490`, leaving a hole in the sequence. Successor-X sees `V36 → V38` and assumes a missing file. | Add `V37__skipped_needs_expert_rolled_back.sql` containing only a `SELECT 1;` and a comment explaining the rollback. Same for V43 (silent gap with no git trace). | | `backend/src/main/resources/db/migration/V46__add_audit_log.sql:25` and `V47__add_user_color.sql:8` | **No-op security control** — both `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` calls are documented (V46:22-24) as no-ops because the role *is* the table owner. Two migrations that look protective but aren't. | Either implement the role split (preferred) or delete the REVOKE lines and replace with a single `COMMENT ON TABLE audit_log IS 'Append-only at application layer only'`. | | `backend/src/main/resources/db/migration/V1__initial_schema.sql:202-263` | **Hibernate-generated FK names** (`fkc99c5qjulwx9gru07yrhicgd2`, `fkl5xhww7es3b4um01vmly4y18m`, etc.). V14 has to reference these unreadable names by hand. | One-time `ALTER TABLE … RENAME CONSTRAINT` migration to give them speaking names (`fk_documents_sender`, `fk_document_tags_document`, …). | | `scripts/large-data.sql` | 1321 lines of `INSERT` with no header. Idempotency relies on `ON CONFLICT DO NOTHING` but **no FK preconditions** — running it on an empty DB then on a populated DB will silently produce different results. | Add header: purpose, prerequisites (e.g., reset DB first), expected runtime, dataset size. | | `backend/src/main/resources/db/migration/V35__backfill_fts_search_vector.sql:6` | `UPDATE documents SET title = title;` — looks like a typo. The header explains it's intentional (re-fires the BEFORE UPDATE trigger), but a stranger reading without context will rewrite it as a bug. | Header is good; consider `UPDATE documents SET title = title /* trigger re-fire — see V34 */;` for inline clarity. | --- ## §8 Documentation gaps The DB layer needs **one** of these (recommend the second): **Option A — per-table SQL comments.** Add `COMMENT ON TABLE`/`COMMENT ON COLUMN` to every table and to every column whose meaning is non-obvious (`documents.script_type`, `documents.metadata_complete`, `transcription_blocks.source`, `audit_log.kind`, `audit_log.payload`, `notifications.type`, etc.). Survives `pg_dump`. Cost: ~40 statements in a new migration. **Option B — `docs/SCHEMA.md` overview (recommended).** Domain-grouped page with: - one paragraph per table (purpose, key columns, lifecycle if any), - ER sketch per domain (document / person / user / ocr / audit), - explicit list of intentionally-skipped migration numbers (V37, V43), - glossary entry that resolves `users` ↔ `AppUser` and `tag` (singular) ↔ entity `Tag`, - audit-log retention statement (app-layer only, REVOKE is no-op). Option B is cheaper, evolves alongside `CLAUDE.md`, and addresses C3.3 + C7.5 + half of C10.4 with one document. Per-domain README files (C7.1) would also pull weight here — one paragraph in each backend package README about which tables it owns. --- ## §9 Prerequisites for the big-bang refactor REFACTOR-1 reorganizes packages, **not tables**. Therefore the schema layer needs **no changes** for REFACTOR-1 to land safely, *provided*: 1. The package move keeps each entity's `@Table(name=...)` value (and adds one to `Tag.java` to lock in the current `tag` table name even after the entity moves to `tag/Tag.java`). 2. `TranscriptionBlock`, `DocumentAnnotation`, `DocumentComment` stay in `document.transcription`, `document.annotation`, `document.comment` sub-packages (canonical D-OQ-2/3/4) — no table renames triggered. 3. `SenderModel` and `OcrTrainingRun` end up under the `ocr/` package (D-OQ-5) — table names stay. **Independent prerequisite (not blocking REFACTOR-1):** the `users` ↔ `AppUser` rename should happen in a *separate* migration before or after REFACTOR-1, never bundled — it touches FKs in 6+ tables (`comment_mentions`, `audit_log`, `notifications`, `users_groups`, `password_reset_tokens`, `document_versions.editor_id`, `transcription_blocks.created_by/updated_by`, etc.). --- ## §10 Top 5 prioritized recommendations 1. **Create `docs/SCHEMA.md` (domain-grouped overview).** Single highest-leverage win for C3.3 + C7.5 + C10.4. Resolves the `users`/`AppUser` and `tag`/`tags` gotchas with a glossary section; documents the 27 tables + 6 junctions + intentional V37/V43 gaps in one read. 2. **Fix or delete `scripts/schema.sql`.** Either regenerate it on every release tag via `pg_dump` (CI hook) or delete it and replace with a one-line README pointing readers at `db/migration/`. Today it actively misleads — 19 of 27 tables are missing. 3. **Resolve the audit-log append-only contradiction.** Either split DB roles so the V46/V47 `REVOKE` statements bite, or remove them and add a single inline `COMMENT ON TABLE audit_log IS 'append-only enforced at app layer only — see AuditLogRepository'`. Today the migration claims protection it doesn't deliver. 4. **Patch the migration sequence holes.** Add stub `V37__skipped_needs_expert_rolled_back.sql` and `V43__skipped.sql` containing nothing but `SELECT 1;` and a 1-line `--` comment explaining the rollback / accidental skip. Cheap; eliminates "did I lose a file?" friction. 5. **Rename `tag` → `tags`** in a dedicated migration and add `@Table(name = "tags")` to `Tag.java`. Smallest surface-area renames in the schema (one table, one FK in `document_tags`, one inline reference in V34 trigger function). Removes the only singular-vs-plural outlier and brings the table under the canonical naming convention. **Top blocker:** `users` table maps to entity `AppUser` and shares lexical space with `persons` — Tobias will conflate them within 30 seconds of opening the schema.

marcel referenced this issue

2026-05-04 16:41:37 +02:00

audit(rollup): produce global readiness scorecard from subsystem audits #393

marcel referenced this issue

2026-05-04 16:41:59 +02:00

decide(backend): resolve users-table / AppUser-entity naming mismatch before REFACTOR-1 #418

marcel referenced this issue

2026-05-04 21:07:13 +02:00

epic(legibility): polish — remove smells surfaced by audits #411

marcel referenced this issue

2026-05-04 21:46:18 +02:00

fix(backend): rename users table to app_users (closes #418) #419