Files
familienarchiv/docs/adr/033-tag-name-resolution-tolerates-case-collisions.md
Marcel 7629e35897
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m15s
CI / OCR Service Tests (pull_request) Successful in 23s
CI / Backend Unit Tests (pull_request) Successful in 3m40s
CI / fail2ban Regex (pull_request) Successful in 44s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
CI / Unit & Component Tests (push) Successful in 3m13s
CI / OCR Service Tests (push) Successful in 23s
CI / Backend Unit Tests (push) Successful in 3m40s
CI / fail2ban Regex (push) Successful in 46s
CI / Semgrep Security Scan (push) Successful in 21s
CI / Compose Bucket Idempotency (push) Successful in 1m7s
docs(adr): renumber tag case-collision ADR 032 → 033 to resolve number clash (#731)
Both #730 (tag case-collision) and #684 (person-delete DB integrity) landed
an ADR-032 on main. Renumber the tag/case-collision one to 033 — it is
referenced only from this PR's person-domain comments and its own file, so the
move is self-contained and touches no Flyway migration. The person-delete
ADR-032 and the V71 migration comment that cites it are deliberately left
untouched (editing an applied migration would drift its Flyway checksum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 13:52:25 +02:00

9.5 KiB

ADR-033 — Tag-name resolution tolerates case-collisions: exact-case first, then a deterministic lowest-id fallback, and never a unique(lower(name)) constraint

Date: 2026-06-06 Status: Accepted Issue: #730 (document with a case-colliding tag cannot be saved — findByNameIgnoreCase NonUniqueResultException) Milestone:


Context

TagService.findOrCreate(name) is the single point that turns a tag name into a Tag row. The document edit form, bulk-edit, and the upload batch all round-trip tag names (the edit form sends tags.map(t => t.name).join(',')) and re-resolve them on every save through resolveTags → findOrCreate. The old implementation resolved with tagRepository.findByNameIgnoreCase(name), a derived query returning Optional<Tag>.

That signature encodes an invariant the data does not hold: that a name is globally unique case-insensitively. The canonical tag tree legitimately contains names that differ only by case — a parent container and its same-named lowercase child (Geburt / Geburt/geburt, Weihnachten / Weihnachten/weihnachten, …), or two siblings (Reise/Reisepläne / Reise/reisepläne). Each is a distinct node with its own source_ref (the stable identity, per ADR-025) and its own document attachments — not an accidental duplicate. When two rows matched case-insensitively, Hibernate threw NonUniqueResultExceptionIncorrectResultSizeDataAccessException → a generic HTTP 500.

The effect was severe and opaque: every document carrying one of ~10 colliding tags (≈180 document-tag attachments on staging) became un-editable — any field change failed on save because the whole tag set is re-resolved — and the user saw only "an unexpected error", with no hint that a tag was the cause.

This is a lookup problem, not a data problem: the collisions are valid canonical nodes and must be preserved.

Decision

1. Resolution is exact-case first, then a non-throwing deterministic fallback

findOrCreate resolves in three ordered steps and never throws on a collision:

  1. findByName(cleanName)exact-case derived query. If present, return it. The edit round-trip replays the stored name verbatim, so the exact-case row is the right binding (typing the bare child name weihnachten binds to the child; Weihnachten binds to the parent container).
  2. else findAllByNameIgnoreCase(cleanName) — the plural case-insensitive list. If non-empty, return the element with the lowest id (min(comparing(Tag::getId))).
  3. else create the tag (an orphan: null sourceRef/parentId).

The two repository methods are deliberately two distinctly-named methods — Spring Data cannot disambiguate an Optional<Tag> from a List<Tag> derived query by return type alone. The throwing Optional<Tag> findByNameIgnoreCase is deleted so the non-unique-throwing call cannot be reintroduced; findOrCreate was its only production caller.

2. The tie-break is id, and it is load-bearing

id is a stable, always-present, unique column, so "lowest id" is a total, deterministic order over the candidates: the same name resolves to the same row on every call, forever, without throwing. This matters only in the free-text authoring path (step 2), where no exact-case row exists yet two case-folding rows do.

3. No unique(lower(name)) constraint — and a load-bearing comment says so

A global case-insensitive uniqueness constraint is wrong: it would reject the legitimate parent/child canonical nodes. It would also fail to apply against the existing rows, turning a code-only deploy into a failed Flyway migration that blocks startup. A comment at both findOrCreate and the repository methods records this so the constraint is not "helpfully" added later.

Consequences

  • Code-only, zero migration, fully reversible (roll back the JAR). No tag data is touched; the colliding rows stay exactly as the canonical importer produced them.
  • One change fixes all three write paths — single-document edit, bulk-edit, and upload batch — because they all funnel through resolveTags → findOrCreate, which stays the single source of truth (resolution logic is not hoisted into DocumentService).
  • Free-text tag semantics under collision are accepted as-is (issue #730, option A): the bare word weihnachten binds to the deep child and Weihnachten to the parent container. Correct for the edit round-trip and acceptable for authoring; making the typeahead show the tree path so an author can tell a container from its same-named child is a separate follow-up.
  • The wire response stays opaque: after the fix this path no longer throws IncorrectResultSizeDataAccessException, and GlobalExceptionHandler's generic handler maps any stray one to INTERNAL_ERROR with no Hibernate/SQL leak — so no dedicated handler was added.
  • The sibling Person path is fixed the same way — see the Person extension below (#731).
  • Postgres LOWER() folding of umlauts (ü/ä) is the actual correctness hinge of the fallback and cannot be proven by a mocked repo, so it is pinned by a Testcontainers postgres:16-alpine test on a Glückwünsche/glückwünsche pair; a plain-ASCII test would stay green while the bug reappeared for umlaut tags.

Person extension (#731)

The Person domain carried the same latent throw on two user-influenced lookup surfaces, and is fixed with the same exact-case-first, non-throwing pattern — but with a deliberately different fallback per surface, because the two paths have different consequences.

  • Alias path — PersonService.findOrCreateByAlias — deterministic lowest-id (mirrors tag). findByAliasIgnoreCase (Optional) is replaced by findByAlias (exact) → findAllByAliasIgnoreCase (plural, lowest id) → the existing create-when-absent branch (INSTITUTION/GROUP and the maiden-name alias are preserved verbatim). There is no human in the importer loop and the path creates-on-absent anyway, so a deterministic guess is the right behaviour — exactly like tags.

  • Name/sender path — PersonService.findByName — bail to null on ambiguity (the new wrinkle). Used only by DocumentService.storeDocument to resolve the upload sender from the parsed filename. findByFirstNameIgnoreCaseAndLastNameIgnoreCase (Optional) is replaced by findByFirstNameAndLastName (exact) → findAllByFirstNameIgnoreCaseAndLastNameIgnoreCase (plural). Resolution returns the exact-case match, else the single case-insensitive match, else — on two or more matches — empty. The sender is left unset rather than guessing.

    Why this diverges from the alias (and tag) decision: the archive's value is correct provenance. A confidently-wrong pre-filled Hans Müller is worse than an empty field, because a senior reviewer will not re-check a value that is already filled in, whereas an empty sender routes the document into the "needs completion" state (metadataComplete=false) for a human to assign. The load-bearing comment at findByName records this so a future "consistency cleanup" does not reintroduce the confidently-wrong-sender bug by switching it to lowest-id.

  • Fail-closed on a null first name. A parsed filename can lack a first name. The two new name methods use explicit HQL equality (= :firstName) rather than a derived …IgnoreCase query, because Spring Data folds a null derived-query argument to first_name IS NULL — which would silently widen the match and pull a last-name-only / institution row in as a "sender" (a quiet provenance-integrity defect). With HQL equality a null binds as = NULL, which never matches, so a null first name resolves to no sender. This is pinned by a real-Postgres repository test.

  • Scope — "ambiguous" is case-insensitive only. Both exact-case lookups (findByAlias, findByFirstNameAndLastName) return Optional, so two byte-identical same-case rows would still throw NonUniqueResultException. That is a true data anomaly, deliberately out of scope (it is not a case-collision), and it surfaces as the opaque INTERNAL_ERROR — never a silently wrong row — so it is no worse than any other unexpected error and needs no extra handling here.

  • Same stance as tags otherwise: no unique(lower(alias)) / unique(lower(name)) constraint (collisions are valid human labels; source_ref is the stable identity per ADR-025), no merge/dedupe, code-only and reversible, and no shared resolveExactThenCi(...) helper — the two Person paths have different fallbacks, so the exact→CI→fallback logic is inlined at each with its load-bearing comment (KISS).

Alternatives considered

  • A unique(lower(name)) index — rejected: the collisions are valid canonical nodes, and the migration would fail against the existing data and block startup.
  • Merging/deduping the colliding tags — rejected: each has a distinct source_ref, tree position, and real document attachments; they are not duplicates.
  • Round-tripping tag IDs instead of names so resolution can't be ambiguous at all — the cleaner long-term shape (removes the name-as-key smell), but a larger change with frontend surface; deferred to #732. The lookup fix here is the minimal correct unblock.
  • Hoisting resolution into DocumentService.resolveTags — rejected: it would duplicate the rule across the edit, bulk-edit, and import paths and let them drift; findOrCreate stays the one owner.