audit(db): score Flyway migrations + DB schema against legibility rubric #391

Closed
opened 2026-05-04 16:04:29 +02:00 by marcel · 1 comment
Owner

Context

Part of Epic #387 — Codebase Legibility Audit. This is AUDIT-4: read-only assessment of the database schema and migration history. Tobias (PM-with-CS) will look at this to understand the data model; Anja (Backend Dev) will look to understand evolution and conventions.

The output is a Markdown report under docs/audits/audit-db-report.md following the per-subsystem orient template.

Inputs (read first)

See the first comment of #387 for the complete reference bundle.

Scope (in)

  • Every Flyway migration under backend/src/main/resources/db/migration/ (V*__*.sql)
  • The composite/effective schema as a logical whole — what tables exist, what columns, what indexes, what FKs
  • Migration naming convention compliance
  • scripts/schema.sql (the snapshot for reference)
  • scripts/large-data.sql (note its purpose; flag if confusing)

Scope (out)

  • JPA entity Java code (AUDIT-2 covers that — but flag mismatches between entity field names and column names in §5)
  • Application data, test fixtures
  • Database performance tuning (out of scope for legibility)

Rubric checks particularly relevant to this subsystem

C3.3 (table naming consistency, glossary alignment — users vs app_users vs persons is exactly the kind of clarity test Tobias will run), C7.5 (per-table comments OR a domain-grouped schema overview), C9.3 (migration naming and non-trivial migration explanations), C10.4 (any rolled-back / commented-out / abandoned migration patterns).

Specific things to verify (subsystem-specific)

  • Domain grouping clarity. Group every table under one of the canonical domains (document, person, tag, user, geschichte, notification, ocr, audit). Flag any table whose domain isn't obvious from its name.
  • Naming consistency. Singular vs plural? snake_case consistent? Junction tables follow a pattern?
  • Migration narrative. Read migrations in order — does the schema's evolution tell a coherent story, or does it look like an accumulation of fixes?
  • Junction tables. Document_tags, document_receivers, geschichten_persons, geschichten_documents, group_permissions, transcription_block_mentioned_persons — verify these are documented somewhere or self-evidently named.
  • Audit log retention. audit_log table — is the append-only constraint enforced at DB level (REVOKE UPDATE/DELETE) and documented?

Acceptance criteria

  • File docs/audits/audit-db-report.md exists on a feature branch
  • §2 contains a complete table inventory with apparent domain assignment per the canonical domain set
  • §3 scorecard covers C3.3, C7.5, C9.3, C10.4 at minimum
  • §5 lists any table whose domain assignment is ambiguous
  • §8 recommends either per-table comments OR a domain-grouped docs/SCHEMA.md overview document
  • §10 lists top-5 recommendations
  • PR opened and merged before this issue is closed

Definition of Done

Report committed under docs/audits/audit-db-report.md on main. Top-5 recommendations summarized as a closing comment.

Dispatch

Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit backend/src/main/resources/db/migration/ and the effective DB schema per the orient template, producing docs/audits/audit-db-report.md. Read-only.

## Context Part of **Epic #387** — Codebase Legibility Audit. This is **AUDIT-4**: read-only assessment of the database schema and migration history. Tobias (PM-with-CS) will look at this to understand the data model; Anja (Backend Dev) will look to understand evolution and conventions. The output is a Markdown report under `docs/audits/audit-db-report.md` following the per-subsystem orient template. ## Inputs (read first) See **the first comment of #387** for the complete reference bundle. ## Scope (in) - Every Flyway migration under `backend/src/main/resources/db/migration/` (`V*__*.sql`) - The composite/effective schema as a logical whole — what tables exist, what columns, what indexes, what FKs - Migration naming convention compliance - `scripts/schema.sql` (the snapshot for reference) - `scripts/large-data.sql` (note its purpose; flag if confusing) ## Scope (out) - JPA entity Java code (AUDIT-2 covers that — but flag mismatches between entity field names and column names in §5) - Application data, test fixtures - Database performance tuning (out of scope for legibility) ## Rubric checks particularly relevant to this subsystem C3.3 (table naming consistency, glossary alignment — `users` vs `app_users` vs `persons` is exactly the kind of clarity test Tobias will run), C7.5 (per-table comments OR a domain-grouped schema overview), C9.3 (migration naming and non-trivial migration explanations), C10.4 (any rolled-back / commented-out / abandoned migration patterns). ## Specific things to verify (subsystem-specific) - **Domain grouping clarity.** Group every table under one of the canonical domains (document, person, tag, user, geschichte, notification, ocr, audit). Flag any table whose domain isn't obvious from its name. - **Naming consistency.** Singular vs plural? snake_case consistent? Junction tables follow a pattern? - **Migration narrative.** Read migrations in order — does the schema's evolution tell a coherent story, or does it look like an accumulation of fixes? - **Junction tables.** Document_tags, document_receivers, geschichten_persons, geschichten_documents, group_permissions, transcription_block_mentioned_persons — verify these are documented somewhere or self-evidently named. - **Audit log retention.** `audit_log` table — is the append-only constraint enforced at DB level (REVOKE UPDATE/DELETE) and documented? ## Acceptance criteria - [ ] File `docs/audits/audit-db-report.md` exists on a feature branch - [ ] §2 contains a complete table inventory with apparent domain assignment per the canonical domain set - [ ] §3 scorecard covers C3.3, C7.5, C9.3, C10.4 at minimum - [ ] §5 lists any table whose domain assignment is ambiguous - [ ] §8 recommends either per-table comments OR a domain-grouped `docs/SCHEMA.md` overview document - [ ] §10 lists top-5 recommendations - [ ] PR opened and merged before this issue is closed ## Definition of Done Report committed under `docs/audits/audit-db-report.md` on `main`. Top-5 recommendations summarized as a closing comment. ## Dispatch > Read issue #387 first comment for the reference bundle. Then read this issue (#TBD). Then audit `backend/src/main/resources/db/migration/` and the effective DB schema per the orient template, producing `docs/audits/audit-db-report.md`. Read-only.
marcel added this to the Codebase Legibility milestone 2026-05-04 16:04:29 +02:00
marcel added the P2-mediumauditlegibility labels 2026-05-04 16:05:43 +02:00
Author
Owner

AUDIT-4 — Database schema & Flyway migrations

Scope: backend/src/main/resources/db/migration/V*.sql (V1–V59, 57 files), scripts/schema.sql snapshot, scripts/large-data.sql.

Reference bundle: issue #387 first comment (Legibility Rubric + Per-Subsystem Orient Template).


§1 Subsystem profile

The database layer is PostgreSQL 16 managed by Flyway with versioned Vn__description.sql migrations. The schema is owned by the Spring Boot backend; JPA/Hibernate maps entities onto these tables. Migrations are append-only by convention, but the layer also includes:

  • one DB-level baseline (V1__initial_schema.sql) that is a verbatim pg_dump of an early schema (Hibernate-generated fkXXXXXXXX constraint names, no gen_random_uuid() defaults),
  • two trigger-driven features (FTS in V34/V35, no other DDL triggers),
  • one append-only audit table (V46) plus one corrective re-REVOKE (V47),
  • two explicit data backfills (V35 FTS, V50 comments→audit, V51 annotation_id, V59 grant seed).
backend/src/main/resources/db/migration/
├── V1__initial_schema.sql           — pg_dump baseline
├── V2…V47                            — incremental DDL + small data fixes
├── V48…V59                           — newer features (audit, thumbnails, family tree, mentions, geschichten)
scripts/
├── schema.sql                        — STALE pg_dump (≈V1+V2 only)
└── large-data.sql                    — perf seed (1321 lines of INSERT)

Counts. 57 migration files, V1–V59 with two gaps (V37, V43). V37 was created and later removed (needsExpert rolled back; see §7). V43 was never assigned (silent gap).


§2 Inventory — every table mapped to a canonical domain

Table Created in Domain Notes
documents V1 document Core entity. Pluralized.
document_receivers V1 document (junction) documents ⇄ persons (M:N)
document_tags V1 document (junction) documents ⇄ tag (M:N)
document_versions V9 document Versioned snapshot (JSONB)
document_annotations V10 document.annotation Sub-package
document_comments V12 document.comment Sub-package; threads via parent_id
comment_mentions V17 document.comment (junction) document_comments ⇄ users
transcription_blocks V18 document.transcription Sub-package
transcription_block_versions V19 document.transcription History rows
transcription_block_mentioned_persons V56 document.transcription (junction) No FK on person_id by design (V56 header)
document_training_labels V29 document (or ocr?) Ambiguous — see §5
document_thumbnails (cols) V52/V53 document Inline cols on documents
persons V1 person
person_name_aliases V21 person Historical names (married/widowed)
person_relationships V54 person.relationship Family tree edges
sender_models V40 person + ocr Per-person Kraken HTR model — dual-domain, see §5
tag V1 tag Singular — outlier; self-FK parent_id from V39
users V1 user Maps to entity AppUsername mismatch, see §5
user_groups V1 user Roles
users_groups V1 user (junction) users ⇄ user_groups
group_permissions V1 user (junction) user_groups ⇄ permission_string (no FK to a permissions table; Set<String>)
password_reset_tokens V8 user
invite_tokens V45 user
invite_token_group_ids V45 user (junction) Awkward name — see §5
geschichten V58 geschichte Blog-like stories
geschichten_persons V58 geschichte (junction) M:N to persons
geschichten_documents V58 geschichte (junction) M:N to documents
notifications V16 notification
ocr_jobs V25 ocr
ocr_job_documents V25 ocr (junction) ocr_jobs ⇄ documents
ocr_training_runs V30 ocr
audit_log V46 shared/audit Append-only (see §3 / C7.5)

Total: 27 tables (1 outlier-named, 1 ambiguous, 1 dual-domain — see §5).


§3 Rubric scorecard (focus C3.3, C7.5, C9.3, C10.4)

Check Result Severity Evidence Recommendation
C3.3 — table naming/glossary FAIL Critical users table is mapped to entity AppUser (backend/src/main/java/.../model/AppUser.java:27); table name says "users" but the domain glossary explicitly insists Person ≠ AppUser. A Tobias-style reader who opens users first will conflate it with persons. Rename usersapp_users in a dedicated migration (also rename users_groupsapp_users_groups, comment_mentions.user_idapp_user_id). OR add a COMMENT ON TABLE users IS 'AppUser accounts — NOT historical persons. See persons.'
C3.3 — singular vs plural FAIL Major public.tag is singular while every other table is plural (documents, persons, tags-as-junction-prefix in document_tags). Tag.java has no @Table annotation so JPA naming defaults leaked through. Migration Vn__rename_tag_to_tags.sql: ALTER TABLE tag RENAME TO tags; then add @Table(name="tags") to Tag.java. Update document_tags.tag_id FK target — implicit, but verify.
C3.3 — junction naming FAIL Minor Junctions are inconsistent: most use <plural>_<plural> (document_tags, document_receivers, geschichten_persons, geschichten_documents, users_groups), but group_permissions uses <singular>_<plural> (no second table — it's @ElementCollection), comment_mentions is <singular>_<plural> (also implicit), invite_token_group_ids ends in _ids which no other junction does (backend/src/main/resources/db/migration/V45__add_invite_tokens.sql:18). transcription_block_mentioned_persons is descriptive but breaks the pattern. Rename invite_token_group_idsinvite_token_groups. Document the @ElementCollection pattern (group_permissions, comment_mentions) explicitly so readers don't expect a sister entity.
C7.5 — per-table comment OR overview FAIL Minor Zero COMMENT ON TABLE / COMMENT ON COLUMN statements anywhere in the migration set. No docs/SCHEMA.md. The migration headers are the only documentation, and several creation migrations (V1, V8, V9, V10, V12, V17, V25, V29, V45) ship with no comment at all. Tobias has to read application code to learn what audit_log.kind values exist, what documents.script_type values are valid, what transcription_block_mentioned_persons.display_name is for. Add docs/SCHEMA.md with one paragraph per domain group + an ER sketch. Cheaper than per-table SQL comments, mutates with the codebase. (See §8.)
C7.5 — audit log retention documented PARTIAL Major V46 attempts DB-level append-only via REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER (V46:25). Header explicitly notes the REVOKE is a no-op when the migration runs as the table owner — meaning the application user retains all privileges in dev/local because Flyway runs as the same role. V47 then runs the same REVOKE again "for the actual application role" but still uses CURRENT_USER (V47:8) — same no-op problem. The append-only guarantee is not enforced at DB level today. The migration is honest about it ("enforced at the application layer only") but the dual-REVOKE looks like it should work and doesn't. Either (a) split DB roles so Flyway runs as archive_owner and the app runs as archive_app, then REVOKE … FROM archive_app will bite, or (b) drop both REVOKE statements with a comment "kept at app layer only" so future readers don't think it's enforced. Document the choice in docs/SCHEMA.md § audit.
C9.3 — migration naming + non-trivial explanation PASS (with caveats) Minor Naming convention V<n>__<verb>_<noun> is consistent (verbs: add, drop, backfill, make, seed). 25 of 57 migrations carry a header comment explaining intent — strong on the harder ones (V31 nullable transcript, V34 FTS triggers, V46 audit, V51 annotation_id backfill, V54 family tree, V56 mentioned persons, V59 BLOG_WRITE seed). 18 migrations are one-liner ALTER TABLE … ADD COLUMN and self-explanatory. Caveat: V13, V20, V32, V36, V41, V44, V47 add cross-cutting columns or fix prior migrations without naming the issue/PR — Anja can't trace why. Adopt a 1-line convention: every migration starts with -- #<issue>: <one-line why>. Backfill is optional.
C9.3 — sequence integrity FAIL Minor V37 and V43 are missing. V37 was rolled back (commit ca0cf490, see §7). V43 was never created (no git history). Flyway tolerates gaps but a reader scanning ls sees holes and asks "did I lose a file?" Add a placeholder V43__skipped.sql containing only a comment -- intentionally skipped during issue #XXX rebase; see commit YYYY and same for V37__skipped.sql, OR rebase the sequence (riskier on prod).
C10.4 — half-finished / abandoned migrations FAIL Major (a) scripts/schema.sql is a stale snapshot — only contains tables created up to ~V2 (8 tables). It's missing 19 of the 27 current tables (no transcription_blocks, audit_log, geschichten, notifications, ocr_*, person_*, etc.). scripts/CLAUDE.md says it's "for quick reference only" but Tobias will be misled hard. (b) V37 hole + git note "V37 was already applied; rename V37→V38" implies a production rebase that the file tree no longer reflects. (c) scripts/large-data.sql has no header explaining what it is or how to safely reset before reloading. (a) Delete scripts/schema.sql or regenerate it on every release via pg_dump (CI step). (b) Add a MIGRATIONS.md next to the migration directory listing intentional gaps. (c) Add a 5-line header to large-data.sql: purpose, prereq, idempotency.
C3.3 — column-name mismatch with entity field names FAIL Major documents.meta_dateDocument.documentDate (field) — naming diverges between layer (V1 baseline). comment_mentions.user_id references users(id) but the entity is AppUser. audit_log.actor_id references users(id) (V46:9) but is read as "the AppUser who acted". Tobias opens schema, then opens entity, has to translate. Either rename DB columns to match entity fields (breaking, requires migration) OR add @Column(name="...") everywhere and mention the convention in docs/SCHEMA.md.

§4 Subsystem health summary

Category PASS FAIL-Critical FAIL-Major FAIL-Minor
C3 (Domain Clarity) 0 1 (C3.3 users-vs-AppUser) 2 (junction naming asymmetry, column/entity mismatch) 1 (singular tag)
C7 (Documentation) 0 0 1 (audit retention claim vs reality) 1 (zero COMMENTs / no SCHEMA.md)
C9 (Operability) 1 (C9.3 naming) 0 0 1 (sequence gaps V37/V43)
C10 (Code Sanity) 0 0 1 (stale scripts/schema.sql + V37 ghost) 0

Overall health: 🔴 RED. One Critical (C3.3 users vs AppUser) and three Majors triggers the threshold (🟢 requires 0 Critical AND ≥80% pass — currently 1 of 8 applicable checks passes cleanly).


§5 Domain-mapping gaps (ambiguous tables)

Table Plausible domains Blocker question
document_training_labels (V29) document (data lives on a Document) or ocr (only consumer is the OCR training pipeline) Should this move to ocr/ package alongside OcrTrainingRun? Today the Document entity carries Set<String> trainingLabels so domain ownership splits across document/ and ocr/.
sender_models (V40) person (1:1 with Person) or ocr (it's a per-sender HTR model) Same as above. The entity is SenderModel in model/, not in any ocr/ subpackage — but it only exists because of OCR.
users (table) ↔ AppUser (entity) user Hard mismatch with the canonical glossary that says "user" domain owns AppUser. The table name says "users" — readers expecting the canonical domain set look for app_users.
audit_logAuditLog (entity) shared/audit Lives under audit/ package but is a single bare table — the package name implies a domain, the table name implies infrastructure. Acceptable, but worth one sentence in docs/SCHEMA.md.
invite_token_group_ids user (junction) Name suggests it stores raw IDs as scalars (it does — @ElementCollection<UUID> style), but the column is actually a UUID FK. Confusing for first-time reader.

§6 Cross-cutting candidates

N/A. Tables belong to exactly one domain (or one sub-package within a domain). The single shared table — audit_log — already lives under the canonical shared/audit umbrella.


§7 Dead / suspicious-code register (C10 findings)

File:line Smell Suggested action
scripts/schema.sql:29-156 Stale snapshot — covers only V1+V2-era schema (8 of 27 tables) yet scripts/CLAUDE.md calls it "complete database schema dump for reference". Regenerate via pg_dump in a CI hook on every release tag, or delete and replace with a one-line note pointing at Flyway.
backend/src/main/resources/db/migration/V37__*.sql (deleted) Rolled-back migration — V37 once existed (needsExpert flag), was reverted in commit ca0cf490, leaving a hole in the sequence. Successor-X sees V36 → V38 and assumes a missing file. Add V37__skipped_needs_expert_rolled_back.sql containing only a SELECT 1; and a comment explaining the rollback. Same for V43 (silent gap with no git trace).
backend/src/main/resources/db/migration/V46__add_audit_log.sql:25 and V47__add_user_color.sql:8 No-op security control — both REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER calls are documented (V46:22-24) as no-ops because the role is the table owner. Two migrations that look protective but aren't. Either implement the role split (preferred) or delete the REVOKE lines and replace with a single COMMENT ON TABLE audit_log IS 'Append-only at application layer only'.
backend/src/main/resources/db/migration/V1__initial_schema.sql:202-263 Hibernate-generated FK names (fkc99c5qjulwx9gru07yrhicgd2, fkl5xhww7es3b4um01vmly4y18m, etc.). V14 has to reference these unreadable names by hand. One-time ALTER TABLE … RENAME CONSTRAINT migration to give them speaking names (fk_documents_sender, fk_document_tags_document, …).
scripts/large-data.sql 1321 lines of INSERT with no header. Idempotency relies on ON CONFLICT DO NOTHING but no FK preconditions — running it on an empty DB then on a populated DB will silently produce different results. Add header: purpose, prerequisites (e.g., reset DB first), expected runtime, dataset size.
backend/src/main/resources/db/migration/V35__backfill_fts_search_vector.sql:6 UPDATE documents SET title = title; — looks like a typo. The header explains it's intentional (re-fires the BEFORE UPDATE trigger), but a stranger reading without context will rewrite it as a bug. Header is good; consider UPDATE documents SET title = title /* trigger re-fire — see V34 */; for inline clarity.

§8 Documentation gaps

The DB layer needs one of these (recommend the second):

Option A — per-table SQL comments. Add COMMENT ON TABLE/COMMENT ON COLUMN to every table and to every column whose meaning is non-obvious (documents.script_type, documents.metadata_complete, transcription_blocks.source, audit_log.kind, audit_log.payload, notifications.type, etc.). Survives pg_dump. Cost: ~40 statements in a new migration.

Option B — docs/SCHEMA.md overview (recommended). Domain-grouped page with:

  • one paragraph per table (purpose, key columns, lifecycle if any),
  • ER sketch per domain (document / person / user / ocr / audit),
  • explicit list of intentionally-skipped migration numbers (V37, V43),
  • glossary entry that resolves usersAppUser and tag (singular) ↔ entity Tag,
  • audit-log retention statement (app-layer only, REVOKE is no-op).

Option B is cheaper, evolves alongside CLAUDE.md, and addresses C3.3 + C7.5 + half of C10.4 with one document.

Per-domain README files (C7.1) would also pull weight here — one paragraph in each backend package README about which tables it owns.


§9 Prerequisites for the big-bang refactor

REFACTOR-1 reorganizes packages, not tables. Therefore the schema layer needs no changes for REFACTOR-1 to land safely, provided:

  1. The package move keeps each entity's @Table(name=...) value (and adds one to Tag.java to lock in the current tag table name even after the entity moves to tag/Tag.java).
  2. TranscriptionBlock, DocumentAnnotation, DocumentComment stay in document.transcription, document.annotation, document.comment sub-packages (canonical D-OQ-2/3/4) — no table renames triggered.
  3. SenderModel and OcrTrainingRun end up under the ocr/ package (D-OQ-5) — table names stay.

Independent prerequisite (not blocking REFACTOR-1): the usersAppUser rename should happen in a separate migration before or after REFACTOR-1, never bundled — it touches FKs in 6+ tables (comment_mentions, audit_log, notifications, users_groups, password_reset_tokens, document_versions.editor_id, transcription_blocks.created_by/updated_by, etc.).


§10 Top 5 prioritized recommendations

  1. Create docs/SCHEMA.md (domain-grouped overview). Single highest-leverage win for C3.3 + C7.5 + C10.4. Resolves the users/AppUser and tag/tags gotchas with a glossary section; documents the 27 tables + 6 junctions + intentional V37/V43 gaps in one read.

  2. Fix or delete scripts/schema.sql. Either regenerate it on every release tag via pg_dump (CI hook) or delete it and replace with a one-line README pointing readers at db/migration/. Today it actively misleads — 19 of 27 tables are missing.

  3. Resolve the audit-log append-only contradiction. Either split DB roles so the V46/V47 REVOKE statements bite, or remove them and add a single inline COMMENT ON TABLE audit_log IS 'append-only enforced at app layer only — see AuditLogRepository'. Today the migration claims protection it doesn't deliver.

  4. Patch the migration sequence holes. Add stub V37__skipped_needs_expert_rolled_back.sql and V43__skipped.sql containing nothing but SELECT 1; and a 1-line -- comment explaining the rollback / accidental skip. Cheap; eliminates "did I lose a file?" friction.

  5. Rename tagtags in a dedicated migration and add @Table(name = "tags") to Tag.java. Smallest surface-area renames in the schema (one table, one FK in document_tags, one inline reference in V34 trigger function). Removes the only singular-vs-plural outlier and brings the table under the canonical naming convention.

Top blocker: users table maps to entity AppUser and shares lexical space with persons — Tobias will conflate them within 30 seconds of opening the schema.

# AUDIT-4 — Database schema & Flyway migrations **Scope:** `backend/src/main/resources/db/migration/V*.sql` (V1–V59, 57 files), `scripts/schema.sql` snapshot, `scripts/large-data.sql`. **Reference bundle:** issue #387 first comment (Legibility Rubric + Per-Subsystem Orient Template). --- ## §1 Subsystem profile The database layer is **PostgreSQL 16** managed by **Flyway** with versioned `Vn__description.sql` migrations. The schema is owned by the Spring Boot backend; JPA/Hibernate maps entities onto these tables. Migrations are append-only by convention, but the layer also includes: - one **DB-level baseline** (`V1__initial_schema.sql`) that is a verbatim `pg_dump` of an early schema (Hibernate-generated `fkXXXXXXXX` constraint names, no `gen_random_uuid()` defaults), - two **trigger-driven features** (FTS in V34/V35, no other DDL triggers), - one **append-only audit table** (V46) plus one corrective re-REVOKE (V47), - two **explicit data backfills** (V35 FTS, V50 comments→audit, V51 annotation_id, V59 grant seed). ``` backend/src/main/resources/db/migration/ ├── V1__initial_schema.sql — pg_dump baseline ├── V2…V47 — incremental DDL + small data fixes ├── V48…V59 — newer features (audit, thumbnails, family tree, mentions, geschichten) scripts/ ├── schema.sql — STALE pg_dump (≈V1+V2 only) └── large-data.sql — perf seed (1321 lines of INSERT) ``` **Counts.** 57 migration files, V1–V59 with **two gaps** (V37, V43). V37 was created and later removed (`needsExpert` rolled back; see §7). V43 was never assigned (silent gap). --- ## §2 Inventory — every table mapped to a canonical domain | Table | Created in | Domain | Notes | |---|---|---|---| | `documents` | V1 | document | Core entity. Pluralized. | | `document_receivers` | V1 | document (junction) | `documents ⇄ persons` (M:N) | | `document_tags` | V1 | document (junction) | `documents ⇄ tag` (M:N) | | `document_versions` | V9 | document | Versioned snapshot (JSONB) | | `document_annotations` | V10 | document.annotation | Sub-package | | `document_comments` | V12 | document.comment | Sub-package; threads via `parent_id` | | `comment_mentions` | V17 | document.comment (junction) | `document_comments ⇄ users` | | `transcription_blocks` | V18 | document.transcription | Sub-package | | `transcription_block_versions` | V19 | document.transcription | History rows | | `transcription_block_mentioned_persons` | V56 | document.transcription (junction) | **No FK on `person_id`** by design (V56 header) | | `document_training_labels` | V29 | document (or ocr?) | **Ambiguous — see §5** | | `document_thumbnails` (cols) | V52/V53 | document | Inline cols on `documents` | | `persons` | V1 | person | | | `person_name_aliases` | V21 | person | Historical names (married/widowed) | | `person_relationships` | V54 | person.relationship | Family tree edges | | `sender_models` | V40 | person + ocr | Per-person Kraken HTR model — **dual-domain, see §5** | | `tag` | V1 | tag | **Singular — outlier**; self-FK `parent_id` from V39 | | `users` | V1 | user | Maps to entity `AppUser` — **name mismatch, see §5** | | `user_groups` | V1 | user | Roles | | `users_groups` | V1 | user (junction) | `users ⇄ user_groups` | | `group_permissions` | V1 | user (junction) | `user_groups ⇄ permission_string` (no FK to a permissions table; `Set<String>`) | | `password_reset_tokens` | V8 | user | | | `invite_tokens` | V45 | user | | | `invite_token_group_ids` | V45 | user (junction) | Awkward name — see §5 | | `geschichten` | V58 | geschichte | Blog-like stories | | `geschichten_persons` | V58 | geschichte (junction) | M:N to `persons` | | `geschichten_documents` | V58 | geschichte (junction) | M:N to `documents` | | `notifications` | V16 | notification | | | `ocr_jobs` | V25 | ocr | | | `ocr_job_documents` | V25 | ocr (junction) | `ocr_jobs ⇄ documents` | | `ocr_training_runs` | V30 | ocr | | | `audit_log` | V46 | shared/audit | Append-only (see §3 / C7.5) | **Total: 27 tables** (1 outlier-named, 1 ambiguous, 1 dual-domain — see §5). --- ## §3 Rubric scorecard (focus C3.3, C7.5, C9.3, C10.4) | Check | Result | Severity | Evidence | Recommendation | |---|---|---|---|---| | **C3.3 — table naming/glossary** | **FAIL** | **Critical** | `users` table is mapped to entity `AppUser` (`backend/src/main/java/.../model/AppUser.java:27`); table name says "users" but the domain glossary explicitly insists `Person ≠ AppUser`. A Tobias-style reader who opens `users` first will conflate it with `persons`. | Rename `users` → `app_users` in a dedicated migration (also rename `users_groups` → `app_users_groups`, `comment_mentions.user_id` → `app_user_id`). OR add a `COMMENT ON TABLE users IS 'AppUser accounts — NOT historical persons. See persons.'` | | C3.3 — singular vs plural | FAIL | Major | `public.tag` is singular while every other table is plural (`documents`, `persons`, `tags`-as-junction-prefix in `document_tags`). `Tag.java` has no `@Table` annotation so JPA naming defaults leaked through. | Migration `Vn__rename_tag_to_tags.sql`: `ALTER TABLE tag RENAME TO tags;` then add `@Table(name="tags")` to `Tag.java`. Update `document_tags.tag_id` FK target — implicit, but verify. | | C3.3 — junction naming | FAIL | Minor | Junctions are inconsistent: most use `<plural>_<plural>` (`document_tags`, `document_receivers`, `geschichten_persons`, `geschichten_documents`, `users_groups`), but `group_permissions` uses `<singular>_<plural>` (no second table — it's `@ElementCollection`), `comment_mentions` is `<singular>_<plural>` (also implicit), `invite_token_group_ids` ends in `_ids` which no other junction does (`backend/src/main/resources/db/migration/V45__add_invite_tokens.sql:18`). `transcription_block_mentioned_persons` is descriptive but breaks the pattern. | Rename `invite_token_group_ids` → `invite_token_groups`. Document the `@ElementCollection` pattern (`group_permissions`, `comment_mentions`) explicitly so readers don't expect a sister entity. | | **C7.5 — per-table comment OR overview** | **FAIL** | Minor | Zero `COMMENT ON TABLE` / `COMMENT ON COLUMN` statements anywhere in the migration set. No `docs/SCHEMA.md`. The migration *headers* are the only documentation, and several creation migrations (V1, V8, V9, V10, V12, V17, V25, V29, V45) ship with no comment at all. Tobias has to read application code to learn what `audit_log.kind` values exist, what `documents.script_type` values are valid, what `transcription_block_mentioned_persons.display_name` is for. | Add `docs/SCHEMA.md` with one paragraph per domain group + an ER sketch. Cheaper than per-table SQL comments, mutates with the codebase. (See §8.) | | C7.5 — audit log retention documented | PARTIAL | Major | V46 *attempts* DB-level append-only via `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` (V46:25). Header explicitly notes the REVOKE is a **no-op** when the migration runs as the table owner — meaning the application user retains all privileges in dev/local because Flyway runs as the same role. V47 then runs the same REVOKE again "for the actual application role" but still uses `CURRENT_USER` (V47:8) — same no-op problem. **The append-only guarantee is not enforced at DB level today.** The migration is honest about it ("enforced at the application layer only") but the dual-REVOKE looks like it should work and doesn't. | Either (a) split DB roles so Flyway runs as `archive_owner` and the app runs as `archive_app`, then `REVOKE … FROM archive_app` will bite, or (b) drop both REVOKE statements with a comment "kept at app layer only" so future readers don't think it's enforced. Document the choice in `docs/SCHEMA.md` § audit. | | **C9.3 — migration naming + non-trivial explanation** | PASS (with caveats) | Minor | Naming convention `V<n>__<verb>_<noun>` is consistent (verbs: `add`, `drop`, `backfill`, `make`, `seed`). 25 of 57 migrations carry a header comment explaining intent — strong on the harder ones (V31 nullable transcript, V34 FTS triggers, V46 audit, V51 annotation_id backfill, V54 family tree, V56 mentioned persons, V59 BLOG_WRITE seed). 18 migrations are one-liner `ALTER TABLE … ADD COLUMN` and self-explanatory. **Caveat:** V13, V20, V32, V36, V41, V44, V47 add cross-cutting columns or fix prior migrations without naming the issue/PR — Anja can't trace why. | Adopt a 1-line convention: every migration starts with `-- #<issue>: <one-line why>`. Backfill is optional. | | C9.3 — sequence integrity | FAIL | Minor | V37 and V43 are missing. V37 was rolled back (commit `ca0cf490`, see §7). V43 was never created (no git history). Flyway tolerates gaps but a reader scanning `ls` sees holes and asks "did I lose a file?" | Add a placeholder `V43__skipped.sql` containing only a comment `-- intentionally skipped during issue #XXX rebase; see commit YYYY` and same for `V37__skipped.sql`, OR rebase the sequence (riskier on prod). | | **C10.4 — half-finished / abandoned migrations** | FAIL | Major | (a) `scripts/schema.sql` is a **stale snapshot** — only contains tables created up to ~V2 (8 tables). It's missing 19 of the 27 current tables (no `transcription_blocks`, `audit_log`, `geschichten`, `notifications`, `ocr_*`, `person_*`, etc.). `scripts/CLAUDE.md` says it's "for quick reference only" but Tobias will be misled hard. (b) V37 hole + git note "V37 was already applied; rename V37→V38" implies a production rebase that the file tree no longer reflects. (c) `scripts/large-data.sql` has no header explaining what it is or how to safely reset before reloading. | (a) Delete `scripts/schema.sql` or regenerate it on every release via `pg_dump` (CI step). (b) Add a `MIGRATIONS.md` next to the migration directory listing intentional gaps. (c) Add a 5-line header to `large-data.sql`: purpose, prereq, idempotency. | | C3.3 — column-name mismatch with entity field names | FAIL | Major | `documents.meta_date` ↔ `Document.documentDate` (field) — naming diverges between layer (V1 baseline). `comment_mentions.user_id` references `users(id)` but the entity is `AppUser`. `audit_log.actor_id` references `users(id)` (V46:9) but is read as "the AppUser who acted". Tobias opens schema, then opens entity, has to translate. | Either rename DB columns to match entity fields (breaking, requires migration) OR add `@Column(name="...")` everywhere and mention the convention in `docs/SCHEMA.md`. | --- ## §4 Subsystem health summary | Category | PASS | FAIL-Critical | FAIL-Major | FAIL-Minor | |---|---|---|---|---| | C3 (Domain Clarity) | 0 | 1 (C3.3 users-vs-AppUser) | 2 (junction naming asymmetry, column/entity mismatch) | 1 (singular `tag`) | | C7 (Documentation) | 0 | 0 | 1 (audit retention claim vs reality) | 1 (zero COMMENTs / no SCHEMA.md) | | C9 (Operability) | 1 (C9.3 naming) | 0 | 0 | 1 (sequence gaps V37/V43) | | C10 (Code Sanity) | 0 | 0 | 1 (stale `scripts/schema.sql` + V37 ghost) | 0 | **Overall health: 🔴 RED.** One Critical (C3.3 `users` vs `AppUser`) and three Majors triggers the threshold (🟢 requires 0 Critical AND ≥80% pass — currently 1 of 8 applicable checks passes cleanly). --- ## §5 Domain-mapping gaps (ambiguous tables) | Table | Plausible domains | Blocker question | |---|---|---| | `document_training_labels` (V29) | `document` (data lives on a Document) **or** `ocr` (only consumer is the OCR training pipeline) | Should this move to `ocr/` package alongside `OcrTrainingRun`? Today the Document entity carries `Set<String> trainingLabels` so domain ownership splits across `document/` and `ocr/`. | | `sender_models` (V40) | `person` (1:1 with Person) **or** `ocr` (it's a per-sender HTR model) | Same as above. The entity is `SenderModel` in `model/`, not in any `ocr/` subpackage — but it only exists because of OCR. | | `users` (table) ↔ `AppUser` (entity) | user | Hard mismatch with the canonical glossary that says "user" domain owns AppUser. The table name says "users" — readers expecting the canonical domain set look for `app_users`. | | `audit_log` ↔ `AuditLog` (entity) | shared/audit | Lives under `audit/` package but is a single bare table — the package name implies a domain, the table name implies infrastructure. Acceptable, but worth one sentence in `docs/SCHEMA.md`. | | `invite_token_group_ids` | user (junction) | Name suggests it stores raw IDs as scalars (it does — `@ElementCollection<UUID>` style), but the column is actually a UUID FK. Confusing for first-time reader. | --- ## §6 Cross-cutting candidates **N/A.** Tables belong to exactly one domain (or one sub-package within a domain). The single shared table — `audit_log` — already lives under the canonical `shared/audit` umbrella. --- ## §7 Dead / suspicious-code register (C10 findings) | File:line | Smell | Suggested action | |---|---|---| | `scripts/schema.sql:29-156` | **Stale snapshot** — covers only V1+V2-era schema (8 of 27 tables) yet `scripts/CLAUDE.md` calls it "complete database schema dump for reference". | Regenerate via `pg_dump` in a CI hook on every release tag, or delete and replace with a one-line note pointing at Flyway. | | `backend/src/main/resources/db/migration/V37__*.sql` (deleted) | **Rolled-back migration** — V37 once existed (`needsExpert` flag), was reverted in commit `ca0cf490`, leaving a hole in the sequence. Successor-X sees `V36 → V38` and assumes a missing file. | Add `V37__skipped_needs_expert_rolled_back.sql` containing only a `SELECT 1;` and a comment explaining the rollback. Same for V43 (silent gap with no git trace). | | `backend/src/main/resources/db/migration/V46__add_audit_log.sql:25` and `V47__add_user_color.sql:8` | **No-op security control** — both `REVOKE UPDATE, DELETE ON audit_log FROM CURRENT_USER` calls are documented (V46:22-24) as no-ops because the role *is* the table owner. Two migrations that look protective but aren't. | Either implement the role split (preferred) or delete the REVOKE lines and replace with a single `COMMENT ON TABLE audit_log IS 'Append-only at application layer only'`. | | `backend/src/main/resources/db/migration/V1__initial_schema.sql:202-263` | **Hibernate-generated FK names** (`fkc99c5qjulwx9gru07yrhicgd2`, `fkl5xhww7es3b4um01vmly4y18m`, etc.). V14 has to reference these unreadable names by hand. | One-time `ALTER TABLE … RENAME CONSTRAINT` migration to give them speaking names (`fk_documents_sender`, `fk_document_tags_document`, …). | | `scripts/large-data.sql` | 1321 lines of `INSERT` with no header. Idempotency relies on `ON CONFLICT DO NOTHING` but **no FK preconditions** — running it on an empty DB then on a populated DB will silently produce different results. | Add header: purpose, prerequisites (e.g., reset DB first), expected runtime, dataset size. | | `backend/src/main/resources/db/migration/V35__backfill_fts_search_vector.sql:6` | `UPDATE documents SET title = title;` — looks like a typo. The header explains it's intentional (re-fires the BEFORE UPDATE trigger), but a stranger reading without context will rewrite it as a bug. | Header is good; consider `UPDATE documents SET title = title /* trigger re-fire — see V34 */;` for inline clarity. | --- ## §8 Documentation gaps The DB layer needs **one** of these (recommend the second): **Option A — per-table SQL comments.** Add `COMMENT ON TABLE`/`COMMENT ON COLUMN` to every table and to every column whose meaning is non-obvious (`documents.script_type`, `documents.metadata_complete`, `transcription_blocks.source`, `audit_log.kind`, `audit_log.payload`, `notifications.type`, etc.). Survives `pg_dump`. Cost: ~40 statements in a new migration. **Option B — `docs/SCHEMA.md` overview (recommended).** Domain-grouped page with: - one paragraph per table (purpose, key columns, lifecycle if any), - ER sketch per domain (document / person / user / ocr / audit), - explicit list of intentionally-skipped migration numbers (V37, V43), - glossary entry that resolves `users` ↔ `AppUser` and `tag` (singular) ↔ entity `Tag`, - audit-log retention statement (app-layer only, REVOKE is no-op). Option B is cheaper, evolves alongside `CLAUDE.md`, and addresses C3.3 + C7.5 + half of C10.4 with one document. Per-domain README files (C7.1) would also pull weight here — one paragraph in each backend package README about which tables it owns. --- ## §9 Prerequisites for the big-bang refactor REFACTOR-1 reorganizes packages, **not tables**. Therefore the schema layer needs **no changes** for REFACTOR-1 to land safely, *provided*: 1. The package move keeps each entity's `@Table(name=...)` value (and adds one to `Tag.java` to lock in the current `tag` table name even after the entity moves to `tag/Tag.java`). 2. `TranscriptionBlock`, `DocumentAnnotation`, `DocumentComment` stay in `document.transcription`, `document.annotation`, `document.comment` sub-packages (canonical D-OQ-2/3/4) — no table renames triggered. 3. `SenderModel` and `OcrTrainingRun` end up under the `ocr/` package (D-OQ-5) — table names stay. **Independent prerequisite (not blocking REFACTOR-1):** the `users` ↔ `AppUser` rename should happen in a *separate* migration before or after REFACTOR-1, never bundled — it touches FKs in 6+ tables (`comment_mentions`, `audit_log`, `notifications`, `users_groups`, `password_reset_tokens`, `document_versions.editor_id`, `transcription_blocks.created_by/updated_by`, etc.). --- ## §10 Top 5 prioritized recommendations 1. **Create `docs/SCHEMA.md` (domain-grouped overview).** Single highest-leverage win for C3.3 + C7.5 + C10.4. Resolves the `users`/`AppUser` and `tag`/`tags` gotchas with a glossary section; documents the 27 tables + 6 junctions + intentional V37/V43 gaps in one read. 2. **Fix or delete `scripts/schema.sql`.** Either regenerate it on every release tag via `pg_dump` (CI hook) or delete it and replace with a one-line README pointing readers at `db/migration/`. Today it actively misleads — 19 of 27 tables are missing. 3. **Resolve the audit-log append-only contradiction.** Either split DB roles so the V46/V47 `REVOKE` statements bite, or remove them and add a single inline `COMMENT ON TABLE audit_log IS 'append-only enforced at app layer only — see AuditLogRepository'`. Today the migration claims protection it doesn't deliver. 4. **Patch the migration sequence holes.** Add stub `V37__skipped_needs_expert_rolled_back.sql` and `V43__skipped.sql` containing nothing but `SELECT 1;` and a 1-line `--` comment explaining the rollback / accidental skip. Cheap; eliminates "did I lose a file?" friction. 5. **Rename `tag` → `tags`** in a dedicated migration and add `@Table(name = "tags")` to `Tag.java`. Smallest surface-area renames in the schema (one table, one FK in `document_tags`, one inline reference in V34 trigger function). Removes the only singular-vs-plural outlier and brings the table under the canonical naming convention. **Top blocker:** `users` table maps to entity `AppUser` and shares lexical space with `persons` — Tobias will conflate them within 30 seconds of opening the schema.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#391