diff --git a/docs/import-migration/01-findings-spreadsheet-analysis.md b/docs/import-migration/01-findings-spreadsheet-analysis.md new file mode 100644 index 00000000..eee9723c --- /dev/null +++ b/docs/import-migration/01-findings-spreadsheet-analysis.md @@ -0,0 +1,313 @@ +# Spreadsheet Analysis — Findings (2026-05-25) + +Analysis of the **real raw archive** spreadsheets against the current `MassImportService` +(`backend/.../importing/MassImportService.java`). Goal: import ~7,600 letter rows + a +163-person register, with PDFs to follow. + +Every issue has an ID (`IMP-NN`), severity, evidence, and a proposed approach. + +--- + +## 0. Context: how the importer reads a row today + +`MassImportService` reads **sheet index 0** and maps columns by configurable indices +(`app.import.col.*`, defaults in the source): + +| Property | Default col | Meaning | +| --- | --- | --- | +| `colIndex` | 0 | Index (→ filename `.pdf`) | +| `colBox` | 1 | Box | +| `colFolder` | 2 | Mappe | +| `colSender` | 3 | Sender (raw) | +| `colReceivers` | 5 | Receivers (raw) | +| `colDate` | 7 | Date | +| `colLocation` | 9 | Location | +| `colTags` | 10 | Tag (single) | +| `colSummary` | 11 | Summary | +| `colTranscription` | 13 | Transcription | + +These defaults match the **ODS** file exactly (`Index, Box, Mappe, Von, BriefeschreiberIn, +An, EmpfängerIn, Datum, Datum Originalformat, Ort, Schlagwort, Inhalt, Zeitlicher Kontext, +Transkript` = 14 cols). The ODS was the development target. The new xlsx is a different beast. + +Per-row pipeline: skip if Index blank → derive filename from Index → validate filename → +look for file on disk (recursive; metadata-only if absent) → check PDF magic bytes → +`importSingleDocument` (upsert by `originalFilename`, dedupe non-placeholders as +`ALREADY_EXISTS`). Date parsing is **ISO-only** (`LocalDate.parse`). 
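The positional mapping described above can be made concrete with a small Python sketch. It is not the Java `MassImportService` code: `map_row` is a hypothetical helper, and the two header lists are taken from this document (the ODS header list above and the new xlsx headers listed under IMP-01). Running the default indices over each header row shows which source column every property would pick up.

```python
# Default positional mapping from the table above (property -> column index).
DEFAULT_COLS = {
    "index": 0, "box": 1, "folder": 2, "sender": 3, "receivers": 5,
    "date": 7, "location": 9, "tags": 10, "summary": 11, "transcription": 13,
}

# Header row of the ODS file (14 columns), as listed above.
ODS_HEADERS = [
    "Index", "Box", "Mappe", "Von", "BriefeschreiberIn", "An", "EmpfängerIn",
    "Datum", "Datum Originalformat", "Ort", "Schlagwort", "Inhalt",
    "Zeitlicher Kontext", "Transkript",
]

# Header row of the new xlsx (12 columns, the last two without a header).
NEW_HEADERS = [
    "Index", "Datei", "Box", "Mappe", "BriefeschreiberIn", "EmpfängerIn",
    "Datum des Briefes", "Ort", "Schlagwort", "Inhalt", None, None,
]


def map_row(cells, cols=DEFAULT_COLS):
    """Pick cells by fixed position; an index past the end of the row yields None."""
    return {name: (cells[i] if i < len(cells) else None) for name, i in cols.items()}


if __name__ == "__main__":
    for label, headers in (("ODS", ODS_HEADERS), ("new xlsx", NEW_HEADERS)):
        print(label)
        for prop, source in map_row(headers).items():
            print(f"  {prop:<14} <- {source}")
```

With the ODS headers every property lands on its intended column; with the new xlsx headers `date` picks up `Ort` and `box` picks up `Datei`, which is the drift detailed in IMP-01.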
+ +--- + +## IMP-01 — New xlsx column layout ≠ importer defaults 🔴 BLOCKER + +The new `…aktuell…xlsx` (sheet `Familienarchiv`, 7,943 rows × 12 cols) has a **denser, +different** layout. There is an extra `Datei` column at index 1, and the normalized +`Von`/`An`/ISO-`Datum` columns from the ODS **do not exist**. + +| col | New xlsx header | Importer default expects | Result with defaults | +| --- | --- | --- | --- | +| 0 | Index | Index | ✅ ok | +| 1 | **Datei** (path) | Box | ❌ Box ← `..\__scan\W-0001.pdf` | +| 2 | Box | Mappe | ❌ Mappe ← `V` | +| 3 | Mappe | Sender | ❌ Sender ← `1` | +| 4 | BriefeschreiberIn (sender) | — (unused) | ❌ sender ignored | +| 5 | EmpfängerIn (receiver) | Receivers | ✅ coincidentally ok | +| 6 | Datum des Briefes | — (unused) | ❌ date ignored | +| 7 | Ort (location) | Date | ❌ Date ← `Rotterdam` → null | +| 8 | Schlagwort (tag) | — (unused) | ❌ tag ignored | +| 9 | Inhalt (summary) | Location | ❌ Location ← summary text | +| 10 | — | Tag | ❌ empty | +| 11 | — | Summary | ❌ empty | +| 13 | — | Transcription | ❌ column doesn't exist | + +**Impact:** importing as-is produces almost entirely garbage metadata. + +**Proposed approach (decide with Marcel):** +- (a) Re-map via the existing `app.import.col.*` properties — fast, no code. New mapping: + `index=0, box=2, folder=3, sender=4, receivers=5, date=6, location=7, tags=8, summary=9`, + and there is **no** transcription column (point it past the end or add a "missing column" + convention). Caveat: tags land in `colTags` but the real per-letter keywords are in + `Inhalt` (col 9) — see IMP-08 note on tags vs summary. +- (b) Make the importer **header-driven** (map by header name, not index) so it survives + layout drift across files. More robust, needs a code change (→ Gitea issue). + +Recommendation: (b) is the durable fix given we have ≥3 different layouts already. + +--- + +## IMP-02 — 90% of dates are free-text the parser can't read 🔴 BLOCKER + +The dates are written **as in the letter**. 
`parseDate()` only does `LocalDate.parse()` +(ISO `yyyy-MM-dd`), so anything non-ISO becomes `null`. + +Of **7,319** rows with a date value (col 6): + +| kind | count | parses today? | +| --- | --- | --- | +| Real Excel date cells (→ ISO via POI) | 748 | ✅ | +| Free-text date strings | 6,571 | ❌ → null | + +→ **90% of dated rows lose their date.** (623 rows have no date at all.) + +Observed free-text formats (counts approximate, from col 6): + +| Format | Count | Examples | +| --- | --- | --- | +| `D.M.YY` | 1,338 | `11.10.08`, `13.5.09` | +| `D.RomanMonth.YY/YYYY` | ~1,527 | `22.III.18`, `19.XII.1954`, `1.III.27` | +| `D.Month YYYY` | 950 | `6.März 1888`, `9.März 1888` (note: **no space** after the dot) | +| `D.M.YYYY` | 358 | `15.2.1888`, `7.3.1888` | +| Approximate / unknown | 146 | `?`, `13.7.18?`, `17.Nov (?) 1887`, `13.Januar ? 1907` | +| `Month YYYY` / season / holiday | 41+27 | `Mai 1895`, `Herbst 1913`, `Pfingsten 1922`, `Ostern 1890` | +| `YYYY` only | 17 | `1905`, `1949` | +| `D.M.` no year | 10 | `8.9.`, `14.3.` | +| Ranges | 5+ | `8.1.1916 - 15.3.1916`, `1881/82`, `1945/46?` | +| Abbrev/English months, no space | many | `29.Sept.1891`, `10.Oct.95`, `9.December1889`, `18.Dez.1916` | +| Slash separator | ~315 | `2/2. 18`, `17/6. 1916`, `10/4. 1917` | +| English `Month D. YYYY` | several | `April 12. 1922`, `Oct.5. 1916`, `Mai 23. 1917` | +| Trailing notes | 5+ | `26.4.1888, 2. Brief`, `31.8.1888,2.Brief` | +| 3-digit year (typo) | 107 | `30.1.889` (→ 1889), `4.3.1023` (in person file → 1923) | +| Day-range within month | several | `7./8. Sept.1923` | + +**Proposed approach:** build a tolerant German/historical date parser (→ Gitea issue, it's +a code change). Requirements: +- Numeric `D.M.YY[YY]` and `D/M. YY[YY]` (slash = dot). +- Roman-numeral months (`I`–`XII`). +- German + English month names, full + abbreviated, with/without separating space + (`März`, `Sept.`, `Dez`, `December`, `Oct.`). +- 2-digit and 3-digit year normalization (`08`→1908? 
needs a century rule; `889`→1889). +- Partial dates → store what's known. The schema only has a single `documentDate + LocalDate`; **decide** whether to (i) store first-of-month/year, (ii) add a + `datePrecision` enum + `dateOriginal` text column, or (iii) keep raw text in a new + `documentDateRaw` field and leave `documentDate` null when imprecise. Recommendation: + preserve the **original string** always (new column) + best-effort parsed date + + precision flag, so nothing is lost and the UI can show "ca. 1916". +- Unparseable/approximate (`?`, `Herbst 1913`) → keep raw, leave parsed date null, **do + not drop the row**. + +**Cross-check:** even after IMP-01 is fixed so the date column is read, IMP-02 still bites. +Both must be solved before a real import. + +--- + +## IMP-03 — New xlsx has no normalized/ISO date or name columns 🔴 BLOCKER + +The ODS had helper columns the importer relied on: `Von`/`An` (normalized names) and +`Datum` (ISO) alongside `Datum Originalformat`. The new xlsx has **only the raw** +`BriefeschreiberIn` / `EmpfängerIn` / `Datum des Briefes`. So: +- Names must be parsed from raw strings (PersonNameParser already does receivers; **sender + is taken raw, never split** — fine for senders, which are single, but no normalization). +- Dates must be parsed from raw (IMP-02). + +This is the root reason IMP-01/02 exist: the new file is the *uncurated* source, not the +hand-normalized ODS. Tie any importer redesign to this reality — we will not get clean +helper columns in the 7k-row file. + +--- + +## IMP-04 — Person register not imported at all 🟠 MAJOR + +`Personendatei 2.xlsx` → sheet `Tabelle1`, **163 people**, columns: +`Generation, Familienname, Vorname, geb als (maiden), Geburtsdatum, Geburtsort, +Todesdatum, Sterbeort, verheiratet mit, Bemerkung`. + +Today `MassImportService` has **no person-register import**. Persons are only +auto-created as bare aliases from the document sender/receiver strings +(`personService.findOrCreateByAlias`). 
All this rich genealogical data is unused: +- birth/death dates + places, +- maiden names (the key to dedup — see IMP-05), +- `verheiratet mit` (marriage links → `PersonRelationship` domain), +- `Bemerkung` relationship hints (`"Schwester v Marie Cram"`, `"Nichte von Herbert"`), +- `Generation` (G 1–G 4), +- nicknames in quotes (`"Tante Lolly"`). + +Data-quality notes in this file too: multi-value `Vorname` (`Charlotte,Meta,Jacobi`); +mixed Excel-date vs text dates; typos (`4.3.1023`); missing-day dates (`.12.1955`); +trailing spaces (`30.8.1862 `). + +**Proposed approach:** a separate **Person import** (→ Gitea issue). Order matters: import +persons *first* so documents can link to real people instead of creating alias stubs. +Use `geb als` + `verheiratet mit` to pre-build the alias/relationship graph. + +--- + +## IMP-05 — Name variations create duplicate Persons 🟠 MAJOR + +The same person appears under several surface forms across the document sheet: +- `Eugenie Müller` (151) vs `Eugenie de Gruyter` (452) — maiden vs married. +- `Clara Cram` (sender 1,284) vs `Clara de Gruyter` (455) vs `Clara de Gruyter sen.` (66). +- `Walter de Gruyter` (589) vs bare `Walter` (78). + +`findOrCreateByAlias` keys on the raw string, so each variant becomes (or matches) a +distinct alias and likely a **distinct Person**. Result: fragmented person records, +broken Briefwechsel pairing, wrong stats. + +**Proposed approach:** drive dedup from the register's `geb als` column (IMP-04) — +`Eugenie de Gruyter geb Müller` tells us the two strings are one person. Build an alias +map (married ↔ maiden ↔ nickname) before/while importing documents. This is partly data +(an alias mapping table/sheet) and partly code (consume it). Likely a Gitea issue once the +mapping format is decided. + +945 distinct sender strings / 274 distinct receiver strings — expect a long-tail of +variants to reconcile. Don't try to be perfect on the first pass; get the high-frequency +names right. 
+ +--- + +## IMP-06 — 93 data rows with blank Index are silently dropped 🟠 MAJOR + +`processRows` does `if (index.isBlank()) continue;`. **93 rows** have a blank Index but +carry other data (sender/receiver/date/etc.). These are silently skipped — they don't even +appear in the `skippedFiles` report (that list only covers rows that *had* an index but +failed file checks). + +**Proposed approach:** before import, triage these 93 rows — are they continuation rows, +section markers, or genuine letters missing an ID? At minimum, surface a count/warning so +nothing vanishes unnoticed. Possibly a small importer change to report blank-index skips. + +--- + +## IMP-07 — 43 duplicate Index values 🟡 MINOR + +43 Index values repeat (e.g. `W-0388`, `Eu-0332`, `C-0234`, `C-0235`, `C-0236`, `J-0175`). +Since the filename is derived from Index, the importer's upsert keys both rows on the same +`originalFilename`: the second occurrence is treated as `ALREADY_EXISTS` (if the first +isn't a placeholder) and **its metadata is lost**, or it overwrites a placeholder. + +**Proposed approach:** list the 43 duplicates, check whether they're true duplicates or +two distinct letters that share an ID by mistake. Fix in the source data, or extend the ID +scheme. Data task first; software only if the ID scheme must change. + +--- + +## IMP-08 — Section/title rows interleaved with data 🟡 MINOR + +Row 2 of the sheet is a section header sitting only in the sender column +(`Brautbriefe von Walter der Gruyter an Eugenie Müller`) with a blank Index — caught by the +blank-Index skip (overlaps IMP-06). There may be more such banners scattered through 7,943 +rows. Also relevant: the per-letter **keywords live in `Inhalt` (col 9)** as comma-joined +values (`Tilburg,Verwandschaft`, `poetisch,Reise nach Breda`), while `Schlagwort` (col 8) +holds a single broad tag (`Brautbriefe`). 
The importer only takes **one** tag column — +decide which column feeds tags vs summary, and whether to split comma-lists into multiple +tags. + +**Proposed approach:** scan for rows where Index is blank but other cells are set (already +have the count: relates to the 93 in IMP-06). Confirm tag vs summary column choice with +Marcel. + +--- + +## IMP-09 — Index ↔ Datei filename mismatches 🟡 MINOR + +The `Datei` column (col 1) holds explicit relative paths (`..\__scan\W-0001.pdf`) but they +don't always agree with the Index. Example: row 20 has Index `W-0010x` but Datei +`..\__scan\W-0011x.pdf`. The importer derives the filename from **Index**, so it will look +for `W-0010x.pdf` and may miss the actual scan. (Note: the `Datei` paths themselves are +Windows-style with `\` and `..` and would be **rejected** by `isValidImportFilename` if anyone +tried to use that column directly — 7,623 rows use backslashes, 7,455 contain `..`.) + +**Proposed approach:** when the PDFs arrive, reconcile Index-derived names against actual +filenames; produce a mismatch report. Keep deriving from Index (stable IDs) but flag +disagreements. Mostly a data/QA task. + +--- + +## IMP-10 — `x`-suffix rows (letter backsides / enclosures) 🟡 MINOR + +**42 rows** have an `x`-suffixed Index (`W-0001x`, `W-0002x`, …). They're sparse — typically +only Index + Datei + sender + receiver, no box/folder/date. They appear to be the reverse +side or an enclosure of the preceding letter. The importer treats each as an independent +Document, and the `metadataComplete` heuristic flags them complete as soon as a sender is +present (date/box/folder all missing). + +**Proposed approach:** decide whether `x` rows should be (a) separate documents, (b) extra +pages/files attached to their parent, or (c) skipped. Affects both the data model and the +`metadataComplete` heuristic. Discuss with Marcel. 
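For the Index ↔ Datei reconciliation proposed in IMP-09, and the `x`-suffix rows discussed in IMP-10, a mismatch check might look like the Python sketch below. The function names are hypothetical, and the case-insensitive comparison is an assumption rather than a confirmed rule. `PureWindowsPath` is used because the `Datei` values are Windows-style paths with backslashes.

```python
from pathlib import PureWindowsPath


def datei_stem(datei):
    """File name without extension from a Windows-style relative path."""
    return PureWindowsPath(datei).stem


def index_file_mismatch(index, datei):
    """True when the Datei path names a different file than the Index implies."""
    if not datei or not datei.strip():
        return False                      # nothing to compare against
    return datei_stem(datei.strip()).casefold() != index.strip().casefold()


def base_index(index):
    """For an x-suffixed Index return the parent letter's Index, else None."""
    index = index.strip()
    return index[:-1] if index.endswith("x") and len(index) > 1 else None


if __name__ == "__main__":
    rows = [
        ("W-0001", r"..\__scan\W-0001.pdf"),
        ("W-0010x", r"..\__scan\W-0011x.pdf"),   # the row 20 example from IMP-09
    ]
    for index, datei in rows:
        print(index, "mismatch" if index_file_mismatch(index, datei) else "ok",
              "parent:", base_index(index))
```

The check only reports disagreements and leaves the Index untouched, in line with the recommendation to keep deriving filenames from the Index.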
+ +--- + +## IMP-11 — Multi-receiver separators include bare `u` / `u.` 🟡 MINOR + +`PersonNameParser.parseReceivers` already handles ` und `, ` u `, `//`, `geb.`, +parenthesised shared surnames, and `Familie` filtering — good. But the real data also uses +the abbreviation in forms the top-receivers list shows are common: +`Eugenie u Walter de Gruyter` (230), `Herbert u Clara` (94), `Juan u Marie Cram` (75), +and space-joined pairs like `Ella Anita` (79) that may be two people. +Raw separator tally on receivers: ` und ` ×70, `,` ×11, `;` ×2, `/` ×1 — plus the many ` u ` +cases above. Senders are **not** parsed at all (taken raw), which is fine unless a sender +cell ever holds two names. + +**Proposed approach:** add `MassImportServiceTest` cases for the real-world strings above; +extend the parser only where it actually fails. `Ella Anita`-style space-joined pairs are +ambiguous — likely leave as one person unless the register says otherwise (ties to IMP-05). + +--- + +## IMP-12 — Importer reads only the first sheet, no validation 🟡 MINOR + +`readXlsx` does `workbook.getSheetAt(0)`. For the new xlsx that's `Familienarchiv` (✅), but +the file also contains `Inhaltsverzeichnis grob`, `Inhaltsverzeichnis WdG`, `Tabelle4`. +There is no header validation: if the wrong file/sheet is dropped in `/import`, the importer +will happily map columns positionally and import nonsense. Also `findSpreadsheetFile()` picks +the **first** spreadsheet found in `/import` — with three spreadsheets present there today, +which one wins is filesystem-order-dependent. + +**Proposed approach:** (a) validate the header row against expected names before importing; +(b) make the target sheet/file explicit (config or header match) rather than "first found". +Ties into the header-driven mapping in IMP-01(b). + +--- + +## Summary of recommended sequencing + +1. **Decide the importer mapping strategy** (IMP-01): positional re-config vs header-driven. 
+ Header-driven is the durable choice and unblocks IMP-03/12. +2. **Build the tolerant date parser** (IMP-02) with original-string preservation + precision. +3. **Import the Person register first** (IMP-04) and build the alias/marriage graph, + which feeds person dedup (IMP-05). +4. **Then import documents**, with reporting for blank-index (IMP-06), duplicates (IMP-07), + and section rows (IMP-08). +5. **Reconcile files** when the ~7,000 PDFs arrive (IMP-09), and decide `x`-row semantics + (IMP-10). + +Code-change items (→ Gitea issues when we get there): IMP-01(b), IMP-02, IMP-04, IMP-05 +(consume side), IMP-06 reporting, IMP-12. Pure-data items stay in this folder. diff --git a/docs/import-migration/02-normalization-spec.md b/docs/import-migration/02-normalization-spec.md new file mode 100644 index 00000000..08ccf1d2 --- /dev/null +++ b/docs/import-migration/02-normalization-spec.md @@ -0,0 +1,384 @@ +# Spec — Import Normalizer + +> Authored in the voice of **"Elicit"**, requirements engineer (see +> `.claude/personas/req_engineer.md`). This is a requirements artifact: it states +> *what* the normalizer must do and *how we'll know it's done*, in problem/behaviour +> language. Technology choices already made during brainstorming (Python, openpyxl, +> overrides-and-rerun) are recorded as **constraints**, not re-litigated here. + +- **Status:** Draft for review +- **Date:** 2026-05-25 +- **Related:** [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) (issues `IMP-01..12`), [`README.md`](./README.md) +- **Scope boundary:** This spec covers the **offline normalizer** that turns the raw + spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical + contract into the Java `MassImportService` and the `Document`/`Person` model is **Phase 2** + and gets its own spec. This spec only *defines the contract* Phase 2 must satisfy. + +--- + +## 1. 
Project Brief + +**Vision.** Turn the family's human-curated, free-form archive spreadsheets into a clean, +canonical dataset that imports deterministically — without hand-editing thousands of rows +and without losing the historical nuance of how things were originally written. + +**Problem.** The real archive (`…aktuell…xlsx`, 7,943 rows) and the person register +(`Personendatei 2.xlsx`, 163 people) were authored for humans to read, not machines to +import. Dates are written as they appeared in each letter (≈90% unparseable by the current +importer), the column layout differs from what the importer expects, and the same person +appears under many names. Importing as-is produces garbage (see `IMP-01..12`). + +**Goal (measurable).** +- G1 — After the automated pass, **≤ 5%** of dated rows remain `UNKNOWN`; after the + overrides-iteration loop, **≤ 0.5%**. +- G2 — **100%** of source rows are represented in the canonical output or in a review file — + *zero silent drops*. +- G3 — **100%** of original values (raw date string, raw name string, source row number) + are preserved. +- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is + **byte-identical** when re-run with unchanged inputs+overrides. + +**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future +agent re-running the pipeline; and the `MassImportService` as the downstream consumer. + +**Non-Goals (explicitly out of scope).** +- NG1 — Changing `MassImportService` or the DB schema (that is Phase 2). +- NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by `index`). +- NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited. +- NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the + long tail stays as provisional persons. +- NG5 — OCR/transcription content (the new xlsx has no transcription column). 
+ +**Key assumptions.** (A1) Sheet `Familienarchiv` is the document source of truth. +(A2) Archive date range is **1873–1957** (drives the 2-digit-year century rule). +(A3) `index` is the stable document key and the basis for future PDF matching. +(A4) `Schlagwort` is a broad tag; `Inhalt` is a short summary/topic. + +**Risks.** (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag ++ overrides. (R2) Name matching false-positives merge distinct people → mitigated by +conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with +layout drift → mitigated by header-name-based mapping, not fixed indices. + +--- + +## 2. Personas + +**Marcel — Data Steward.** Role: solo owner of Familienarchiv. Context: holds the complete +raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently, +not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts. +Frustrations: dates in ~20 formats; one ancestor under 4 name variants. **JTBD:** *"When I +have raw, human-curated archive spreadsheets, I want to transform them into a clean importable +dataset without losing how things were originally written, so I can load the archive and keep +correcting edge cases as they surface."* + +**The Returning Agent.** Role: a future assistant session resuming the work. Goal: re-run the +pipeline deterministically and understand exactly what still needs human input. **JTBD:** +*"When I pick this up cold, I want one command and a clear residue report, so I can continue +without re-deriving context."* + +--- + +## 3. Constraints & Decisions Already Made + +These were settled during brainstorming and are fixed inputs to the requirements below. + +| # | Decision | Rationale | +| --- | --- | --- | +| C1 | **New canonical layout** with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. 
| +| C2 | Dates stored as **parsed (nullable) + raw + precision**. | Historical archive; never lose the original; enable "ca. 1916". | +| C3 | **Include person resolution** (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. | +| C4 | **Overrides-file + re-run** loop for residue. | Deterministic, diffable, repeatable. | +| C5 | Implementation: **Python 3.12 + openpyxl**, standalone tool at `tools/import-normalizer/`. | Fast iteration; no Spring rebuild / coverage gate on transform code. | +| C6 | Century rule for archive **1873–1957**: 2-digit `00–57`→`19YY`, `73–99`→`18YY`, `58–72`→**flag**; 3-digit `DDD`→`1DDD`; never 20xx. | Stated by Marcel. Boundaries live in config. | +| C7 | `Schlagwort`→tag, `Inhalt`→summary. | Matches importer's existing semantics. | +| C8 | Non-register correspondents become **provisional persons**. | ~945 distinct sender strings vs 163 register people. | + +--- + +## 4. Functional Requirements + +Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules +use EARS. Traceability to findings in §8. + +### 4.1 Ingest & layout (`FR-INGEST`, `FR-MAP`) + +**US-MAP-01** — *As the data steward, I want each source column mapped to a named canonical +field regardless of its position, so a re-exported spreadsheet with shifted columns still +imports correctly.* +- AC1 — Given the `Familienarchiv` sheet, when the normalizer reads the header row, then it + maps columns by **header name** (not fixed index) to the canonical fields. +- AC2 — Given a header the normalizer does not recognise, when it runs, then it records the + unknown header in `review/summary.txt` and continues (does not crash). +- AC3 — Given a required source header is **absent**, when it runs, then it aborts with a + clear message naming the missing header (fail loud, before producing partial output). 
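A minimal Python sketch of the three acceptance criteria above, assuming a hypothetical `HEADER_MAP` and `REQUIRED` set (the spec does not yet say which headers are required, so `index` alone is an assumption). Header text is compared case-insensitively with whitespace runs collapsed.

```python
import re

# Hypothetical header -> canonical field map; the real one would live in config.py.
HEADER_MAP = {
    "index": "index",
    "datei": "datei",
    "box": "box",
    "mappe": "folder",
    "briefeschreiberin": "sender",
    "empfängerin": "receivers",
    "datum des briefes": "date",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
REQUIRED = {"index"}  # assumption: only the stable key is mandatory


def norm(header):
    """Collapse whitespace runs and ignore case."""
    return re.sub(r"\s+", " ", str(header or "")).strip().casefold()


def map_headers(header_row):
    """Return ({field: column position}, [unrecognised headers]); abort if a required one is absent."""
    columns, unknown = {}, []
    for pos, header in enumerate(header_row):
        key = norm(header)
        if not key:
            continue                      # unlabeled column, nothing to map
        field = HEADER_MAP.get(key)
        if field is None:
            unknown.append(header)        # AC2: record and continue
        else:
            columns[field] = pos          # AC1: map by name, not position
    missing = REQUIRED - columns.keys()
    if missing:                           # AC3: fail loud before any output
        raise ValueError(f"missing required header(s): {sorted(missing)}")
    return columns, unknown


if __name__ == "__main__":
    header = ["Index", "Datei", "Box", "Mappe", "BriefeschreiberIn", "EmpfängerIn",
              "Datum  des Briefes", "Ort", "Schlagwort", "Inhalt", "Foo"]
    print(map_headers(header))
```

Returning the unknown headers instead of logging them directly leaves it to the caller to write them into `review/summary.txt`.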
+ +- **REQ-INGEST-01** — The normalizer shall read only the `Familienarchiv` sheet of the + document workbook and the `Tabelle1` sheet of the person workbook. +- **REQ-MAP-01** — Header matching shall be case-insensitive and tolerant of internal + multiple spaces (e.g. `"Datum des Briefes"`). + +### 4.2 Row triage (`FR-TRIAGE`) — resolves IMP-06, IMP-07, IMP-08 + +**US-TRIAGE-01** — *As the data steward, I want rows that have data but no index surfaced +rather than dropped, so I never lose a letter silently.* +- AC1 — Given a row whose `index` is blank but which has any other non-empty cell, when the + normalizer runs, then that row is written to `review/blank-index-rows.csv` with its source + row number and is **not** emitted as a canonical document. +- AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not + reported as an anomaly). + +- **REQ-TRIAGE-01** — If two or more rows resolve to the same `index`, then the normalizer + shall emit all of them to `review/duplicate-index.csv` and mark each canonical row + `needs_review = duplicate_index` (it shall **not** silently drop either). +- **REQ-TRIAGE-02** — Where a row is identified as a section/banner row (blank index, text + only in a name column), the normalizer shall classify it as such in the blank-index report. +- **REQ-TRIAGE-03** — Rows whose `index` ends in `x` (a transcription/back-side of the base + letter, not yet independently mappable) shall be **skipped** — not emitted as a canonical + document — and written to `review/skipped-x-suffix.csv` with their source row and base index + (`index` minus the trailing `x`), so they can be linked in a later pass. (Resolves IMP-10.) 
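The triage rules in this section can be sketched as a single classification pass. The Python below is an illustration under assumptions: rows arrive as `(source_row, index, other_cells)` tuples, and the result is an in-memory dict rather than the review CSV files the requirements name.

```python
from collections import Counter


def triage(rows):
    """Classify rows given as (source_row, index, other_cells) tuples."""
    counts = Counter(idx.strip() for _, idx, _ in rows if idx and idx.strip())
    out = {"documents": [], "blank_index": [], "skipped_x": [], "empty": 0}
    for source_row, idx, others in rows:
        idx = (idx or "").strip()
        has_data = any(str(c).strip() for c in others if c is not None)
        if not idx:
            if has_data:
                out["blank_index"].append(source_row)      # US-TRIAGE-01 AC1
            else:
                out["empty"] += 1                          # US-TRIAGE-01 AC2
        elif idx.endswith("x"):
            out["skipped_x"].append((source_row, idx[:-1]))  # REQ-TRIAGE-03
        else:
            flags = ["duplicate_index"] if counts[idx] > 1 else []
            out["documents"].append((source_row, idx, flags))  # REQ-TRIAGE-01
    return out


if __name__ == "__main__":
    sample = [
        (2, "", ["Brautbriefe von Walter der Gruyter an Eugenie Müller"]),
        (3, "W-0001", ["V", "1"]),
        (4, "W-0001x", [None]),
        (5, "W-0388", ["first occurrence"]),
        (6, "W-0388", ["second occurrence"]),
        (7, None, [None, " "]),
    ]
    for key, value in triage(sample).items():
        print(key, value)
```

Every input row ends up in exactly one bucket, which is what makes the zero-silent-drops goal checkable: the bucket sizes must add up to the number of rows read.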
+ +### 4.3 Date normalization (`FR-DATE`) — resolves IMP-02, IMP-03 + +**US-DATE-01** — *As the data steward, I want every date interpreted as precisely as the +source allows, with the original always kept, so I can sort the archive and still see what the +letter actually said.* +- AC1 — Given a parseable date, when normalized, then `date_iso` holds the best-effort ISO + date, `date_raw` holds the verbatim source string, and `date_precision` ∈ + `{DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}`. +- AC2 — Given an unparseable date, when normalized, then `date_iso` is empty, + `date_precision = UNKNOWN`, `date_raw` is preserved, and the value appears in + `review/unparsed-dates.csv`. +- AC3 — Given the same `date_raw` appears in `overrides/dates.csv`, when normalized, then the + override's `(iso, precision)` wins over the automatic parse. + +- **REQ-DATE-01** — The parser shall accept, at minimum, these forms (see §10 examples): + Excel/ISO; `D.M.YYYY`/`D.M.YY`; `D/M. YY[YY]` (slash treated as dot); Roman-numeral months + `I–XII`; German + English month names, full and abbreviated, with or without a separating + space; `Month YYYY`; season/holiday + year; bare `YYYY`; and start-anchored ranges. +- **REQ-DATE-02** — Precision shall be assigned by what is known: full day → `DAY`; month+year + → `MONTH` (day = 1); a **named feast/holiday + year** → resolved to its **actual calendar + date for that year** → `DAY`; a **season + year** → representative mid-season month (day = 1) + → `SEASON`; year only → `YEAR` (month = Jan, day = 1); a range → start date + `RANGE`; a + value carrying an uncertainty marker (`?`, `um`, `ca`, `circa`) → `APPROX` with best-effort date. +- **REQ-DATE-03** — Two-digit and three-digit years shall be expanded per **C6**; a 2-digit + year in `58–72` shall yield `UNKNOWN` + a review entry rather than a guess. +- **REQ-DATE-04** — Trailing editorial notes (e.g. `", 2. 
Brief"`) shall be stripped before + parsing and preserved (kept within `date_raw`; not invented into the date). +- **REQ-DATE-05** — The parser shall be pure and side-effect-free so it can be unit-tested in + isolation (see NFR-TEST-01). +- **REQ-DATE-06** — **Movable feasts are never mapped to a fixed month**; they shall be + computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern = + Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag = + Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed + feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25, + Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul, + Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py` + (NFR-MAINT-01). + +### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11 + +**US-PERS-01** — *As the data steward, I want the genealogical register turned into canonical +people with all their known facts, so documents can link to real persons.* +- AC1 — Given a register row, when parsed, then a canonical person is produced with + `person_id`, name parts, `maiden_name`, birth/death (parsed + raw + place), spouse, + generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates. +- AC2 — Given multi-value given names (`"Charlotte,Meta,Jacobi"`), when parsed, then the + primary given name is the first; the remainder are retained as additional names/aliases. + +**US-PERS-02** — *As the data steward, I want each sender/receiver string matched to a +canonical person where possible and never dropped otherwise, so the correspondence graph is +complete.* +- AC1 — Given a sender/receiver string, when resolved, then it maps to a register + `person_id` via the alias index (exact → normalized/casefold → conservative fuzzy). 
+- AC2 — Given no confident match, when resolved, then a **provisional person** is created from + the cleaned string, linked, and listed in `review/unmatched-names.csv` (occurrence count + + example source rows). +- AC3 — Given the string appears in `overrides/names.csv`, when resolved, then it maps to the + specified `person_id` (override wins). +- AC4 — Given a multi-person receiver cell (`"Eugenie u Walter de Gruyter"`, `"Herbert u + Clara"`, `"…//…"`, `"Hedi und Tutu (Gruber)"`), when resolved, then it is split into + individual people, each resolved independently; ambiguous space-joined pairs + (`"Ella Anita"`) are emitted to `review/ambiguous-receivers.csv` rather than guessed. + +- **REQ-DEDUP-01** — The alias index shall be derived from the register: canonical + "First Last", maiden form (`geb als`), spouse-surname married form, nickname, and + first-name-only **only when unambiguous** across the register. +- **REQ-DEDUP-02** — The normalizer shall not merge two distinct strings into one person on + fuzzy similarity alone above a configured threshold without the match being reported; merges + must be auditable. +- **REQ-PERS-01** — Sender cells shall be parsed for multi-person content using the same rules + as receiver cells (today the importer parses only receivers — IMP-11). + +### 4.5 Overrides & idempotency (`FR-OVR`) — supports the iteration loop + +- **REQ-OVR-01** — When the normalizer runs, then it shall load `overrides/dates.csv` and + `overrides/names.csv` if present and apply them; absence of either file shall not be an error. +- **REQ-OVR-02** — While overrides are unchanged and inputs are unchanged, re-running shall + produce **byte-identical** canonical outputs and review files (NFR-IDEM-01). +- **REQ-OVR-03** — Each override application shall be counted in `review/summary.txt` (how many + dates/names were resolved by override vs automatically). 
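A rough Python sketch of how a date override could take precedence over the automatic parse (US-DATE-01 AC3, REQ-OVR-01). Several things here are assumptions: the override CSV column names (`date_raw`, `iso`, `precision`), all function names, and the automatic step, which is reduced to the numeric `D.M.YY[YY]` form with the C6 century rule and stands in for the full parser.

```python
import csv
import re
from datetime import date
from pathlib import Path

NUMERIC = re.compile(r"^(\d{1,2})\.(\d{1,2})\.(\d{2,4})$")


def expand_year(text):
    """Expand a 2- or 3-digit year per C6; None means 'flag for review'."""
    n = int(text)
    if len(text) == 4:
        return n
    if len(text) == 3:
        return 1000 + n                   # 889 -> 1889
    if n <= 57:
        return 1900 + n                   # 00-57 -> 19YY
    if n >= 73:
        return 1800 + n                   # 73-99 -> 18YY
    return None                           # 58-72 is ambiguous


def parse_numeric(date_raw):
    """Automatic step (numeric forms only): returns (iso, precision)."""
    m = NUMERIC.match(date_raw.strip())
    year = expand_year(m[3]) if m else None
    if year is None:
        return "", "UNKNOWN"
    try:
        return date(year, int(m[2]), int(m[1])).isoformat(), "DAY"
    except ValueError:                    # e.g. day 31 in a 30-day month
        return "", "UNKNOWN"


def load_date_overrides(path):
    """Read overrides if the file exists; a missing file is not an error."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open(newline="", encoding="utf-8") as fh:
        return {r["date_raw"]: (r["iso"], r["precision"]) for r in csv.DictReader(fh)}


def normalize_date(date_raw, overrides):
    """Return (iso, precision, resolved_by); an override wins over the parser."""
    if date_raw in overrides:
        return (*overrides[date_raw], "override")
    return (*parse_numeric(date_raw), "auto")


if __name__ == "__main__":
    overrides = load_date_overrides("overrides/dates.csv")
    for raw in ("11.10.08", "30.1.889", "4.3.65"):
        print(raw, normalize_date(raw, overrides))
```

The third element of the result (`"override"` or `"auto"`) is what REQ-OVR-03 would count in `review/summary.txt`. Because the lookup is keyed on the verbatim `date_raw` and nothing depends on run time or iteration order, the same inputs and overrides give the same result on every run.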
+ +### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12 + +- **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and + `out/canonical-persons.xlsx` with the headered schemas in §6. +- **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number + in the source sheet) so any value can be traced back to the original. +- **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more + flags (`duplicate_index`, `unparsed_date`, `unmatched_sender`, `unmatched_receiver`, + `index_file_mismatch`, …) so the import and the UI can foreground uncertain data. +- **REQ-OUT-02** — Where the source `Datei` path disagrees with the index-derived filename + (IMP-09), the normalizer shall record the discrepancy in `review/index-file-mismatch.csv` + and flag the row; it shall **not** alter the `index` (the stable key). + +--- + +## 5. Non-Functional Requirements + +| ID | Category | Requirement (measurable) | +| --- | --- | --- | +| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. | +| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. | +| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. | +| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. | +| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. | +| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. | +| NFR-TEST-01 | Testability | `dates.py` and `persons.py` have pytest tests covering every format/alias category in §10 with real examples from the archive. 
| +| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in `config.py`, not inline in logic. | +| NFR-OBSERV-01 | Observability | `review/summary.txt` reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. | +| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. | + +--- + +## 6. Data Dictionary (canonical contract) + +This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a +DB schema. + +### 6.1 `canonical-documents.xlsx` + +| Field | Required | Format / values | Notes | +| --- | --- | --- | --- | +| `index` | yes | string | Stable key; basis for PDF matching. | +| `box` | no | string | from `Box`. | +| `folder` | no | string | from `Mappe`. | +| `sender_person_id` | no | person_id | resolved; empty if no sender. | +| `sender_name` | no | string | canonical display name (or cleaned raw if provisional). | +| `receiver_person_ids` | no | `id\|id\|…` | pipe-separated. | +| `receiver_names` | no | `name\|name\|…` | pipe-separated, aligned with ids. | +| `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. | +| `date_raw` | no | string | verbatim source date. | +| `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. | +| `location` | no | string | from `Ort`. | +| `tags` | no | `tag\|tag` | from `Schlagwort`. | +| `summary` | no | string | from `Inhalt`. | +| `source_row` | yes | int | provenance (NFR-DATA-01). | +| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). | + +### 6.2 `canonical-persons.xlsx` + +| Field | Required | Format | Notes | +| --- | --- | --- | --- | +| `person_id` | yes | slug | stable id (e.g. `de-gruyter-eugenie`); collisions suffixed. | +| `last_name` | yes | string | from `Familienname`. | +| `first_name` | no | string | primary given name. 
| +| `maiden_name` | no | string | from `geb als` — drives dedup. | +| `title` | no | string | e.g. honorifics if present. | +| `nickname` | no | string | from quoted `Bemerkung`/spouse field. | +| `birth_date` / `birth_date_raw` / `birth_place` | no | ISO / string / string | §4.3 rules. | +| `death_date` / `death_date_raw` / `death_place` | no | ISO / string / string | §4.3 rules. | +| `spouse` | no | person_id or name | from `verheiratet mit`. | +| `generation` | no | string | `G 1`..`G 4`. | +| `notes` | no | string | from `Bemerkung`. | +| `aliases` | no | `a\|b\|c` | every surface form that maps here. | +| `provisional` | yes | bool | true if created from a document string, not the register. | + +--- + +## 7. Prioritized Backlog (MoSCoW) + +| ID | Item | MoSCoW | Effort | Depends on | +| --- | --- | --- | --- | --- | +| B1 | Project scaffolding + read both workbooks (`FR-INGEST`, header map `FR-MAP`) | Must | S | — | +| B2 | Row triage + blank/duplicate/empty reports (`FR-TRIAGE`) | Must | S | B1 | +| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (`FR-DATE`) | Must | L | B1 | +| B4 | Person register parser → canonical persons (`FR-PERS` US-PERS-01) | Must | M | B1 | +| B5 | Alias index + name resolution + multi-person split (`FR-DEDUP`, US-PERS-02) | Must | L | B4 | +| B6 | Overrides load + apply + idempotency (`FR-OVR`) | Must | S | B3,B5 | +| B7 | Canonical writers + provenance + review summary (`FR-OUT`, `FR-PROV`) | Must | M | B2,B3,B5 | +| B8 | Index↔Datei mismatch report (`REQ-OUT-02`) | Should | XS | B1 | +| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 | +| B10 | Comma-split `Inhalt` into extra tags | Could | XS | B7 | +| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | — | B7 | + +--- + +## 8. 
Traceability — Findings → Requirements + +| Finding | Severity | Addressed by | +| --- | --- | --- | +| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 | +| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 | +| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS | +| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 | +| IMP-05 name variants → dupes | major | C3, FR-DEDUP | +| IMP-06 blank-index dropped | major | US-TRIAGE-01 | +| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 | +| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 | +| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 | +| IMP-10 `x`-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) | +| IMP-11 sender not split / ` u ` sep | minor | REQ-PERS-01, US-PERS-02 AC4 | +| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, FR-MAP AC2/AC3 | + +--- + +## 9. Open Questions / TBD Register + +| ID | Question | Why it matters | Ref | Resolution | +| --- | --- | --- | --- | --- | +| OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) **computed per year from Easter — never a fixed month**; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). | +| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | **Confirmed:** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`. | +| OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. | +| OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. 
| +| OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). | +| OQ-06 ✅ | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | **Confirmed:** conservative — report all fuzzy matches; no silent merge. | + +*All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.* + +--- + +## 10. Glossary & Worked Examples + +**Precision** — how exactly a date is known (`DAY` … `UNKNOWN`). **Provisional person** — a +person created from a document name string with no register match. **Alias index** — map from +every known surface form of a name to a canonical `person_id`. **Override** — a +human-supplied correction applied deterministically on each run. + +**Date examples → expected outcome:** + +| `date_raw` | `date_iso` | `date_precision` | +| --- | --- | --- | +| `15.2.1888` | 1888-02-15 | DAY | +| `6.März 1888` | 1888-03-06 | DAY | +| `22.III.18` | 1918-03-22 | DAY | +| `13.5.09` | 1909-05-13 | DAY | +| `10.Oct.95` | 1895-10-10 | DAY | +| `17/6. 1916` | 1916-06-17 | DAY | +| `Mai 1895` | 1895-05-01 | MONTH | +| `Pfingsten 1922` | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) | +| `Herbst 1913` | 1913-10-01 | SEASON | +| `1905` | 1905-01-01 | YEAR | +| `8.1.1916 - 15.3.1916` | 1916-01-08 | RANGE | +| `17.Nov (?) 
1887` | 1887-11-17 | APPROX | +| `?` | *(empty)* | UNKNOWN | + +**Name examples → expected outcome:** + +| raw cell | resolves to | +| --- | --- | +| `Eugenie Müller` (+ register `geb Müller`) | `de-gruyter-eugenie` (matched via maiden alias) | +| `Eugenie de Gruyter` | `de-gruyter-eugenie` | +| `Herbert u Clara` | `cram-herbert` + `cram-clara` (split, surname distributed) | +| `Hedi und Tutu (Gruber)` | `gruber-hedi` + `gruber-tutu` | +| `Ella Anita` | → `review/ambiguous-receivers.csv` (not auto-split) | +| `Hans Wittkopf` (not in register) | provisional `wittkopf-hans` | diff --git a/docs/import-migration/README.md b/docs/import-migration/README.md new file mode 100644 index 00000000..b478e719 --- /dev/null +++ b/docs/import-migration/README.md @@ -0,0 +1,62 @@ +# Import Migration — Working Folder + +This folder tracks the iterative work of mass-importing the **real, raw family archive** +spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv. + +It is intentionally **local docs, not Gitea issues**. We only open a Gitea issue when a +finding requires a *software* change (e.g. a new date parser). Pure data observations and +the running plan live here so any agent can pick the work up cold. + +## Source files (in `/import`) + +| File | What it is | Importer support today | +| --- | --- | --- | +| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The **real raw archive** — 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does **not** match importer defaults | +| `Personendatei 2.xlsx` | Genealogical **person register** — 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all | +| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, **already-normalized** subset (Walter & Eugenie Brautbriefe). 14 clean columns incl. ISO dates. | ✅ this is what `MassImportService` was built for | + +The PDFs (~7,000) will follow later. 
The importer matches files by the **Index** column +(e.g. `W-0001` → `W-0001.pdf`), and already imports metadata-only when a file is missing — +so we can import all metadata now and the PDFs will attach on a re-run. + +## How to inspect the spreadsheets + +`openpyxl` is installed in the OCR service venv: + +```bash +/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)" +``` + +## Documents in this folder + +- [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) — full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID `IMP-NN`. +- [`02-normalization-spec.md`](./02-normalization-spec.md) — requirements spec for the offline **import normalizer** (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements `FR-*`/`NFR-*`, traceable to the `IMP-NN` findings. +- `WORKLOG.md` — running log of what each session did and what's next. **Start here when resuming.** + +## Strategy (decided 2026-05-25) + +Normalize **before** import. A standalone Python tool (`tools/import-normalizer/`, not yet +built) transforms the raw xlsx + person register into a clean canonical dataset +(`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs. Residual cases +(unparseable dates, unmatched names) are fixed via a version-controlled overrides file and +re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**. +See the spec for the full contract. 
+ +## Status board + +| ID | Issue | Severity | Status | +| --- | --- | --- | --- | +| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open | +| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open | +| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open | +| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open | +| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open | +| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open | +| IMP-07 | 43 duplicate Index values | 🟡 minor | open | +| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open | +| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open | +| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open | +| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open | +| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open | + +See the findings doc for detail and proposed approach per issue. diff --git a/docs/import-migration/WORKLOG.md b/docs/import-migration/WORKLOG.md new file mode 100644 index 00000000..ef7b2e38 --- /dev/null +++ b/docs/import-migration/WORKLOG.md @@ -0,0 +1,62 @@ +# Import Migration — Worklog + +Running log of each working session. **Resume here.** Newest entry on top. + +--- + +## 2026-05-25 (session 2) — Strategy + normalizer spec + +**Did:** +- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher + leverage than making the Java importer tolerate every mess). +- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw + + precision; include person register + dedup in this effort; overrides-file + re-run loop; + Python tool at `tools/import-normalizer/`. +- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY, + `73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx. 
+- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer + persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register). + +**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …) +**computed per year from Easter**, never a fixed month; seasons → mid-season month +(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows +**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable). +OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge. + +**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean +branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv` +pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left +uncommitted, not ours.) + +**Next:** +- Marcel reviews the spec. +- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are + the Musts; B3 date parser incl. Easter computus is the big one). + +--- + +## 2026-05-25 (session 1) — Initial analysis + +**Did:** +- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow. +- Compared the new xlsx layout against `MassImportService` defaults and the old ODS. +- Full statistical scan of all rows: dates, indices, senders/receivers, file column. +- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) + with 12 issues (IMP-01..IMP-12) + recommended sequencing. +- Installed `openpyxl` into the OCR service venv for inspection. + +**Key facts established:** +- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01). +- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02). +- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05). 
+ +**Decisions pending from Marcel (blockers for any code work):** +1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite? +2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy? +3. IMP-04/05: format for the person/alias mapping; import persons before documents? +4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped? + +**Next:** +- Get Marcel's calls on the 4 decisions above. +- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12). +- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.