docs(import): add import-migration analysis + normalizer spec

Document the raw archive spreadsheet findings (IMP-01..12) and a
requirements spec for an offline normalizer that produces a clean
canonical dataset before import. Local docs only; no Gitea issue yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Marcel
2026-05-25 12:32:37 +02:00
parent 8e9e3bba06
commit adfff420a5
4 changed files with 821 additions and 0 deletions


@@ -0,0 +1,313 @@
# Spreadsheet Analysis — Findings (2026-05-25)
Analysis of the **real raw archive** spreadsheets against the current `MassImportService`
(`backend/.../importing/MassImportService.java`). Goal: import ~7,600 letter rows + a
163-person register, with PDFs to follow.
Every issue has an ID (`IMP-NN`), severity, evidence, and a proposed approach.
---
## 0. Context: how the importer reads a row today
`MassImportService` reads **sheet index 0** and maps columns by configurable indices
(`app.import.col.*`, defaults in the source):
| Property | Default col | Meaning |
| --- | --- | --- |
| `colIndex` | 0 | Index (→ filename `<index>.pdf`) |
| `colBox` | 1 | Box |
| `colFolder` | 2 | Mappe |
| `colSender` | 3 | Sender (raw) |
| `colReceivers` | 5 | Receivers (raw) |
| `colDate` | 7 | Date |
| `colLocation` | 9 | Location |
| `colTags` | 10 | Tag (single) |
| `colSummary` | 11 | Summary |
| `colTranscription` | 13 | Transcription |
These defaults match the **ODS** file exactly (`Index, Box, Mappe, Von, BriefeschreiberIn,
An, EmpfängerIn, Datum, Datum Originalformat, Ort, Schlagwort, Inhalt, Zeitlicher Kontext,
Transkript` = 14 cols). The ODS was the development target. The new xlsx is a different beast.
Per-row pipeline: skip if Index blank → derive filename from Index → validate filename →
look for file on disk (recursive; metadata-only if absent) → check PDF magic bytes →
`importSingleDocument` (upsert by `originalFilename`, dedupe non-placeholders as
`ALREADY_EXISTS`). Date parsing is **ISO-only** (`LocalDate.parse`).
---
## IMP-01 — New xlsx column layout ≠ importer defaults 🔴 BLOCKER
The new `…aktuell…xlsx` (sheet `Familienarchiv`, 7,943 rows × 12 cols) has a **denser,
different** layout. There is an extra `Datei` column at index 1, and the normalized
`Von`/`An`/ISO-`Datum` columns from the ODS **do not exist**.
| col | New xlsx header | Importer default expects | Result with defaults |
| --- | --- | --- | --- |
| 0 | Index | Index | ✅ ok |
| 1 | **Datei** (path) | Box | ❌ Box ← `..\__scan\W-0001.pdf` |
| 2 | Box | Mappe | ❌ Mappe ← `V` |
| 3 | Mappe | Sender | ❌ Sender ← `1` |
| 4 | BriefeschreiberIn (sender) | — (unused) | ❌ sender ignored |
| 5 | EmpfängerIn (receiver) | Receivers | ✅ coincidentally ok |
| 6 | Datum des Briefes | — (unused) | ❌ date ignored |
| 7 | Ort (location) | Date | ❌ Date ← `Rotterdam` → null |
| 8 | Schlagwort (tag) | — (unused) | ❌ tag ignored |
| 9 | Inhalt (summary) | Location | ❌ Location ← summary text |
| 10 | — | Tag | ❌ empty |
| 11 | — | Summary | ❌ empty |
| 13 | — | Transcription | ❌ column doesn't exist |
**Impact:** importing as-is produces almost entirely garbage metadata.
**Proposed approach (decide with Marcel):**
- (a) Re-map via the existing `app.import.col.*` properties — fast, no code. New mapping:
`index=0, box=2, folder=3, sender=4, receivers=5, date=6, location=7, tags=8, summary=9`,
and there is **no** transcription column (point it past the end or add a "missing column"
convention). Caveat: tags land in `colTags` but the real per-letter keywords are in
`Inhalt` (col 9) — see IMP-08 note on tags vs summary.
- (b) Make the importer **header-driven** (map by header name, not index) so it survives
layout drift across files. More robust, needs a code change (→ Gitea issue).
Recommendation: (b) is the durable fix given we have ≥3 different layouts already.
---
## IMP-02 — 90% of dates are free-text the parser can't read 🔴 BLOCKER
The dates are written **as in the letter**. `parseDate()` only does `LocalDate.parse()`
(ISO `yyyy-MM-dd`), so anything non-ISO becomes `null`.
Of **7,319** rows with a date value (col 6):
| kind | count | parses today? |
| --- | --- | --- |
| Real Excel date cells (→ ISO via POI) | 748 | ✅ |
| Free-text date strings | 6,571 | ❌ → null |
**90% of dated rows lose their date.** (623 rows have no date at all.)
Observed free-text formats (counts approximate, from col 6):
| Format | Count | Examples |
| --- | --- | --- |
| `D.M.YY` | 1,338 | `11.10.08`, `13.5.09` |
| `D.RomanMonth.YY/YYYY` | ~1,527 | `22.III.18`, `19.XII.1954`, `1.III.27` |
| `D.Month YYYY` | 950 | `6.März 1888`, `9.März 1888` (note: **no space** after the dot) |
| `D.M.YYYY` | 358 | `15.2.1888`, `7.3.1888` |
| Approximate / unknown | 146 | `?`, `13.7.18?`, `17.Nov (?) 1887`, `13.Januar ? 1907` |
| `Month YYYY` / season / holiday | 41+27 | `Mai 1895`, `Herbst 1913`, `Pfingsten 1922`, `Ostern 1890` |
| `YYYY` only | 17 | `1905`, `1949` |
| `D.M.` no year | 10 | `8.9.`, `14.3.` |
| Ranges | 5+ | `8.1.1916 - 15.3.1916`, `1881/82`, `1945/46?` |
| Abbrev/English months, no space | many | `29.Sept.1891`, `10.Oct.95`, `9.December1889`, `18.Dez.1916` |
| Slash separator | ~315 | `2/2. 18`, `17/6. 1916`, `10/4. 1917` |
| English `Month D. YYYY` | several | `April 12. 1922`, `Oct.5. 1916`, `Mai 23. 1917` |
| Trailing notes | 5+ | `26.4.1888, 2. Brief`, `31.8.1888,2.Brief` |
| 3-digit year (typo) | 107 | `30.1.889` (→ 1889), `4.3.1023` (in person file → 1923) |
| Day-range within month | several | `7./8. Sept.1923` |
**Proposed approach:** build a tolerant German/historical date parser (→ Gitea issue, it's
a code change). Requirements:
- Numeric `D.M.YY[YY]` and `D/M. YY[YY]` (slash = dot).
- Roman-numeral months (`I`–`XII`).
- German + English month names, full + abbreviated, with/without separating space
(`März`, `Sept.`, `Dez`, `December`, `Oct.`).
- 2-digit and 3-digit year normalization (`08`→1908? needs a century rule; `889`→1889).
- Partial dates → store what's known. The schema only has a single `documentDate
LocalDate`; **decide** whether to (i) store first-of-month/year, (ii) add a
`datePrecision` enum + `dateOriginal` text column, or (iii) keep raw text in a new
`documentDateRaw` field and leave `documentDate` null when imprecise. Recommendation:
preserve the **original string** always (new column) + best-effort parsed date +
precision flag, so nothing is lost and the UI can show "ca. 1916".
- Unparseable/approximate (`?`, `Herbst 1913`) → keep raw, leave parsed date null, **do
not drop the row**.
**Cross-check:** even after IMP-01 is fixed so the date column is read, IMP-02 still bites.
Both must be solved before a real import.
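
As a starting point for the requirements above, the sketch below shows one way to split the common `D.M.YY`, slash, Roman-numeral and month-name forms into day, month and raw year text. It is Python for illustration only (the importer's `parseDate()` is Java). `split_date`, `ROMAN`, `MONTH_PREFIXES` and `DATE_RE` are hypothetical names, the three-letter month prefixes are an assumption, and year expansion, partial dates, ranges and uncertainty markers are deliberately left out.

```python
import re

ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# Assumed three-letter prefixes for German and English month names.
MONTH_PREFIXES = {
    "jan": 1, "feb": 2, "mär": 3, "mar": 3, "apr": 4, "mai": 5, "may": 5,
    "jun": 6, "jul": 7, "aug": 8, "sep": 9, "okt": 10, "oct": 10,
    "nov": 11, "dez": 12, "dec": 12,
}

DATE_RE = re.compile(
    r"^\s*(?P<day>\d{1,2})\s*[./]\s*"        # day, then dot or slash
    r"(?P<month>\d{1,2}|[IVX]+|[^\W\d_]+)"   # numeric, Roman, or named month
    r"\s*\.?\s*"                             # optional dot and/or space
    r"(?P<year>\d{2,4})\s*$"                 # 2-, 3- or 4-digit year text
)


def split_date(raw: str):
    """Return (day, month, year_text), or None when the form is not recognised."""
    match = DATE_RE.match(raw)
    if not match:
        return None
    token = match.group("month")
    if token.isdigit():
        month = int(token)
    elif token in ROMAN:
        month = ROMAN[token]
    else:
        month = MONTH_PREFIXES.get(token[:3].lower())
    day = int(match.group("day"))
    if month is None or not 1 <= month <= 12 or not 1 <= day <= 31:
        return None
    return day, month, match.group("year")


print(split_date("22.III.18"))    # (22, 3, '18')
print(split_date("6.März 1888"))  # (6, 3, '1888')
print(split_date("Herbst 1913"))  # None, stays in the raw column
```

Returning the year as text keeps the century decision separate, so a value such as `18` is never silently expanded inside the tokenizer.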
---
## IMP-03 — New xlsx has no normalized/ISO date or name columns 🔴 BLOCKER
The ODS had helper columns the importer relied on: `Von`/`An` (normalized names) and
`Datum` (ISO) alongside `Datum Originalformat`. The new xlsx has **only the raw**
`BriefeschreiberIn` / `EmpfängerIn` / `Datum des Briefes`. So:
- Names must be parsed from raw strings (PersonNameParser already does receivers; **sender
is taken raw, never split** — fine for senders, which are single, but no normalization).
- Dates must be parsed from raw (IMP-02).
This is the root reason IMP-01/02 exist: the new file is the *uncurated* source, not the
hand-normalized ODS. Tie any importer redesign to this reality — we will not get clean
helper columns in the 7k-row file.
---
## IMP-04 — Person register not imported at all 🟠 MAJOR
`Personendatei 2.xlsx` → sheet `Tabelle1`, **163 people**, columns:
`Generation, Familienname, Vorname, geb als (maiden), Geburtsdatum, Geburtsort,
Todesdatum, Sterbeort, verheiratet mit, Bemerkung`.
Today `MassImportService` has **no person-register import**. Persons are only
auto-created as bare aliases from the document sender/receiver strings
(`personService.findOrCreateByAlias`). All this rich genealogical data is unused:
- birth/death dates + places,
- maiden names (the key to dedup — see IMP-05),
- `verheiratet mit` (marriage links → `PersonRelationship` domain),
- `Bemerkung` relationship hints (`"Schwester v Marie Cram"`, `"Nichte von Herbert"`),
- `Generation` (G 1–G 4),
- nicknames in quotes (`"Tante Lolly"`).
Data-quality notes in this file too: multi-value `Vorname` (`Charlotte,Meta,Jacobi`);
mixed Excel-date vs text dates; typos (`4.3.1023`); missing-day dates (`.12.1955`);
trailing spaces (`30.8.1862 `).
**Proposed approach:** a separate **Person import** (→ Gitea issue). Order matters: import
persons *first* so documents can link to real people instead of creating alias stubs.
Use `geb als` + `verheiratet mit` to pre-build the alias/relationship graph.
---
## IMP-05 — Name variations create duplicate Persons 🟠 MAJOR
The same person appears under several surface forms across the document sheet:
- `Eugenie Müller` (151) vs `Eugenie de Gruyter` (452) — maiden vs married.
- `Clara Cram` (sender 1,284) vs `Clara de Gruyter` (455) vs `Clara de Gruyter sen.` (66).
- `Walter de Gruyter` (589) vs bare `Walter` (78).
`findOrCreateByAlias` keys on the raw string, so each variant becomes (or matches) a
distinct alias and likely a **distinct Person**. Result: fragmented person records,
broken Briefwechsel pairing, wrong stats.
**Proposed approach:** drive dedup from the register's `geb als` column (IMP-04) —
`Eugenie de Gruyter geb Müller` tells us the two strings are one person. Build an alias
map (married ↔ maiden ↔ nickname) before/while importing documents. This is partly data
(an alias mapping table/sheet) and partly code (consume it). Likely a Gitea issue once the
mapping format is decided.
945 distinct sender strings / 274 distinct receiver strings — expect a long-tail of
variants to reconcile. Don't try to be perfect on the first pass; get the high-frequency
names right.
---
## IMP-06 — 93 data rows with blank Index are silently dropped 🟠 MAJOR
`processRows` does `if (index.isBlank()) continue;`. **93 rows** have a blank Index but
carry other data (sender/receiver/date/etc.). These are silently skipped — they don't even
appear in the `skippedFiles` report (that list only covers rows that *had* an index but
failed file checks).
**Proposed approach:** before import, triage these 93 rows — are they continuation rows,
section markers, or genuine letters missing an ID? At minimum, surface a count/warning so
nothing vanishes unnoticed. Possibly a small importer change to report blank-index skips.
---
## IMP-07 — 43 duplicate Index values 🟡 MINOR
43 Index values repeat (e.g. `W-0388`, `Eu-0332`, `C-0234`, `C-0235`, `C-0236`, `J-0175`).
Since the filename is derived from Index, the importer's upsert keys both rows on the same
`originalFilename`: the second occurrence is treated as `ALREADY_EXISTS` (if the first
isn't a placeholder) and **its metadata is lost**, or it overwrites a placeholder.
**Proposed approach:** list the 43 duplicates, check whether they're true duplicates or
two distinct letters that share an ID by mistake. Fix in the source data, or extend the ID
scheme. Data task first; software only if the ID scheme must change.
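
Listing the duplicates is a small grouping exercise. The sketch below is one possible way to do it in Python, assuming the sheet has already been read into `(source_row_number, index_value)` pairs; `find_duplicate_indexes` is a hypothetical helper, not existing code.

```python
from collections import defaultdict


def find_duplicate_indexes(rows):
    """Map each repeated Index to the source row numbers where it occurs.

    `rows` is an iterable of (source_row_number, index_value) pairs.
    Blank or missing Index values are ignored here.
    """
    seen = defaultdict(list)
    for row_number, index in rows:
        key = (index or "").strip()
        if key:
            seen[key].append(row_number)
    # Sort by Index so the report is stable between runs.
    return {key: nums for key, nums in sorted(seen.items()) if len(nums) > 1}
```

Keeping the row numbers alongside each Index makes it easier to open the sheet and judge whether two rows are true duplicates or distinct letters.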
---
## IMP-08 — Section/title rows interleaved with data 🟡 MINOR
Row 2 of the sheet is a section header sitting only in the sender column
(`Brautbriefe von Walter der Gruyter an Eugenie Müller`) with a blank Index — caught by the
blank-Index skip (overlaps IMP-06). There may be more such banners scattered through 7,943
rows. Also relevant: the per-letter **keywords live in `Inhalt` (col 9)** as comma-joined
values (`Tilburg,Verwandschaft`, `poetisch,Reise nach Breda`), while `Schlagwort` (col 8)
holds a single broad tag (`Brautbriefe`). The importer only takes **one** tag column —
decide which column feeds tags vs summary, and whether to split comma-lists into multiple
tags.
**Proposed approach:** scan for rows where Index is blank but other cells are set (already
have the count: relates to the 93 in IMP-06). Confirm tag vs summary column choice with
Marcel.
---
## IMP-09 — Index ↔ Datei filename mismatches 🟡 MINOR
The `Datei` column (col 1) holds explicit relative paths (`..\__scan\W-0001.pdf`) but they
don't always agree with the Index. Example: row 20 has Index `W-0010x` but Datei
`..\__scan\W-0011x.pdf`. The importer derives the filename from **Index**, so it will look
for `W-0010x.pdf` and may miss the actual scan. (Note: the `Datei` paths themselves are
Windows-style with `\` and `..` and would be **rejected** by `isValidImportFilename` if anyone
tried to use that column directly — 7,623 rows use backslashes, 7,455 contain `..`.)
**Proposed approach:** when the PDFs arrive, reconcile Index-derived names against actual
filenames; produce a mismatch report. Keep deriving from Index (stable IDs) but flag
disagreements. Mostly a data/QA task.
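
A mismatch report can be produced without touching the filesystem by comparing the Index-derived name with the last component of the `Datei` path. The sketch below is a minimal Python illustration; `datei_mismatch` is a hypothetical name, and comparing case-insensitively is an assumption based on the paths being Windows-style.

```python
from pathlib import PureWindowsPath


def datei_mismatch(index: str, datei: str):
    """Return (expected, actual) filenames when they differ, else None."""
    if not index or not datei:
        return None
    expected = f"{index.strip()}.pdf"
    # PureWindowsPath understands backslashes and `..` on any platform.
    actual = PureWindowsPath(datei.strip()).name
    if actual.casefold() == expected.casefold():
        return None
    return expected, actual


print(datei_mismatch("W-0010x", r"..\__scan\W-0011x.pdf"))
# ('W-0010x.pdf', 'W-0011x.pdf')
```

Only the basename is compared, so the backslashes and `..` segments never reach any filename validation.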
---
## IMP-10 — `x`-suffix rows (letter backsides / enclosures) 🟡 MINOR
**42 rows** have an `x`-suffixed Index (`W-0001x`, `W-0002x`, …). They're sparse — typically
only Index + Datei + sender + receiver, no box/folder/date. They appear to be the reverse
side or an enclosure of the preceding letter. The importer treats each as an independent
Document, and the `metadataComplete` heuristic flags them complete as soon as a sender is
present (date/box/folder all missing).
**Proposed approach:** decide whether `x` rows should be (a) separate documents, (b) extra
pages/files attached to their parent, or (c) skipped. Affects both the data model and the
`metadataComplete` heuristic. Discuss with Marcel.
---
## IMP-11 — Multi-receiver separators include bare `u` / `u.` 🟡 MINOR
`PersonNameParser.parseReceivers` already handles ` und `, ` u `, `//`, `geb.`,
parenthesised shared surnames, and `Familie` filtering, which is good. But the real data also uses
the `u` abbreviation in forms that the top-receivers list shows are common:
`Eugenie u Walter de Gruyter` (230), `Herbert u Clara` (94), `Juan u Marie Cram` (75),
and space-joined pairs like `Ella Anita` (79) that may be two people.
Raw separator tally on receivers: ` und ` ×70, `,` ×11, `;` ×2, `/` ×1 — plus the many ` u `
cases above. Senders are **not** parsed at all (taken raw), which is fine unless a sender
cell ever holds two names.
**Proposed approach:** add `MassImportServiceTest` cases for the real-world strings above;
extend the parser only where it actually fails. `Ella Anita`-style space-joined pairs are
ambiguous — likely leave as one person unless the register says otherwise (ties to IMP-05).
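
To make the separator cases concrete, here is a small Python sketch of a splitter for the strings above. It does not describe how `PersonNameParser` works; `SPLIT_RE` and `split_receivers` are hypothetical, commas are deliberately not treated as separators, and shared surnames are not propagated to the first name.

```python
import re

# Separators: "//", ";", and " und " / " u " / " u. " surrounded by whitespace.
SPLIT_RE = re.compile(r"\s*//\s*|\s*;\s*|\s+(?:und|u\.?)\s+")


def split_receivers(raw: str):
    """Split a receiver cell into name fragments without guessing surnames."""
    parts = [part.strip() for part in SPLIT_RE.split(raw or "")]
    return [part for part in parts if part]


print(split_receivers("Eugenie u Walter de Gruyter"))  # ['Eugenie', 'Walter de Gruyter']
print(split_receivers("Ella Anita"))                   # ['Ella Anita']
```

Space-joined pairs such as `Ella Anita` come back as a single fragment, which matches the suggestion to leave them as one person unless the register says otherwise.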
---
## IMP-12 — Importer reads only the first sheet, no validation 🟡 MINOR
`readXlsx` does `workbook.getSheetAt(0)`. For the new xlsx that's `Familienarchiv` (✅), but
the file also contains `Inhaltsverzeichnis grob`, `Inhaltsverzeichnis WdG`, `Tabelle4`.
There is no header validation: if the wrong file/sheet is dropped in `/import`, the importer
will happily map columns positionally and import nonsense. Also `findSpreadsheetFile()` picks
the **first** spreadsheet found in `/import` — with three spreadsheets present there today,
which one wins is filesystem-order-dependent.
**Proposed approach:** (a) validate the header row against expected names before importing;
(b) make the target sheet/file explicit (config or header match) rather than "first found".
Ties into the header-driven mapping in IMP-01(b).
---
## Summary of recommended sequencing
1. **Decide the importer mapping strategy** (IMP-01): positional re-config vs header-driven.
Header-driven is the durable choice and unblocks IMP-03/12.
2. **Build the tolerant date parser** (IMP-02) with original-string preservation + precision.
3. **Import the Person register first** (IMP-04) and build the alias/marriage graph,
which feeds person dedup (IMP-05).
4. **Then import documents**, with reporting for blank-index (IMP-06), duplicates (IMP-07),
and section rows (IMP-08).
5. **Reconcile files** when the ~7,000 PDFs arrive (IMP-09), and decide `x`-row semantics
(IMP-10).
Code-change items (→ Gitea issues when we get there): IMP-01(b), IMP-02, IMP-04, IMP-05
(consume side), IMP-06 reporting, IMP-12. Pure-data items stay in this folder.


@@ -0,0 +1,384 @@
# Spec — Import Normalizer
> Authored in the voice of **"Elicit"**, requirements engineer (see
> `.claude/personas/req_engineer.md`). This is a requirements artifact: it states
> *what* the normalizer must do and *how we'll know it's done*, in problem/behaviour
> language. Technology choices already made during brainstorming (Python, openpyxl,
> overrides-and-rerun) are recorded as **constraints**, not re-litigated here.
- **Status:** Draft for review
- **Date:** 2026-05-25
- **Related:** [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) (issues `IMP-01..12`), [`README.md`](./README.md)
- **Scope boundary:** This spec covers the **offline normalizer** that turns the raw
spreadsheets into a clean, canonical dataset + review artifacts. Wiring the canonical
contract into the Java `MassImportService` and the `Document`/`Person` model is **Phase 2**
and gets its own spec. This spec only *defines the contract* Phase 2 must satisfy.
---
## 1. Project Brief
**Vision.** Turn the family's human-curated, free-form archive spreadsheets into a clean,
canonical dataset that imports deterministically — without hand-editing thousands of rows
and without losing the historical nuance of how things were originally written.
**Problem.** The real archive (`…aktuell…xlsx`, 7,943 rows) and the person register
(`Personendatei 2.xlsx`, 163 people) were authored for humans to read, not machines to
import. Dates are written as they appeared in each letter (≈90% unparseable by the current
importer), the column layout differs from what the importer expects, and the same person
appears under many names. Importing as-is produces garbage (see `IMP-01..12`).
**Goal (measurable).**
- G1 — After the automated pass, **≤ 5%** of dated rows remain `UNKNOWN`; after the
overrides-iteration loop, **≤ 0.5%**.
- G2 — **100%** of source rows are represented in the canonical output or in a review file —
*zero silent drops*.
- G3 — **100%** of original values (raw date string, raw name string, source row number)
are preserved.
- G4 — A full run over the current inputs completes in **< 60 s** on the dev laptop and is
**byte-identical** when re-run with unchanged inputs+overrides.
**Primary actor.** Marcel — solo owner & data steward (tech comfort 4/5). Also: a future
agent re-running the pipeline; and the `MassImportService` as the downstream consumer.
**Non-Goals (explicitly out of scope).**
- NG1 — Changing `MassImportService` or the DB schema (that is Phase 2).
- NG2 — Uploading/attaching the ~7,000 PDFs (they arrive later; import matches by `index`).
- NG3 — A GUI. The interface is spreadsheets in, CSVs out, an overrides file hand-edited.
- NG4 — Perfect genealogical reconstruction. We resolve confidently-matchable people; the
long tail stays as provisional persons.
- NG5 — OCR/transcription content (the new xlsx has no transcription column).
**Key assumptions.** (A1) Sheet `Familienarchiv` is the document source of truth.
(A2) Archive date range is **1873–1957** (drives the 2-digit-year century rule).
(A3) `index` is the stable document key and the basis for future PDF matching.
(A4) `Schlagwort` is a broad tag; `Inhalt` is a short summary/topic.
**Risks.** (R1) 2-digit/partial dates are genuinely ambiguous → mitigated by precision flag
+ overrides. (R2) Name matching false-positives merge distinct people → mitigated by
conservative matching + review before merge. (R3) Source spreadsheet may be re-exported with
layout drift → mitigated by header-name-based mapping, not fixed indices.
---
## 2. Personas
**Marcel — Data Steward.** Role: solo owner of Familienarchiv. Context: holds the complete
raw archive; PDFs follow. Tech comfort: 4/5 (semi-technical, reads CSV/spreadsheets fluently,
not keen to hand-edit 7,600 rows). Primary goal: a clean, importable dataset he trusts.
Frustrations: dates in ~20 formats; one ancestor under 4 name variants. **JTBD:** *"When I
have raw, human-curated archive spreadsheets, I want to transform them into a clean importable
dataset without losing how things were originally written, so I can load the archive and keep
correcting edge cases as they surface."*
**The Returning Agent.** Role: a future assistant session resuming the work. Goal: re-run the
pipeline deterministically and understand exactly what still needs human input. **JTBD:**
*"When I pick this up cold, I want one command and a clear residue report, so I can continue
without re-deriving context."*
---
## 3. Constraints & Decisions Already Made
These were settled during brainstorming and are fixed inputs to the requirements below.
| # | Decision | Rationale |
| --- | --- | --- |
| C1 | **New canonical layout** with explicit headers (not the old positional ODS shape). | Fits the new data; importer becomes header-driven in Phase 2. |
| C2 | Dates stored as **parsed (nullable) + raw + precision**. | Historical archive; never lose the original; enable "ca. 1916". |
| C3 | **Include person resolution** (register + alias/marriage map → canonical persons) in this effort. | Maiden-name dedup needs the register. |
| C4 | **Overrides-file + re-run** loop for residue. | Deterministic, diffable, repeatable. |
| C5 | Implementation: **Python 3.12 + openpyxl**, standalone tool at `tools/import-normalizer/`. | Fast iteration; no Spring rebuild / coverage gate on transform code. |
| C6 | Century rule for archive **1873–1957**: 2-digit `00–57`→`19YY`, `73–99`→`18YY`, `58–72`→**flag**; 3-digit `DDD`→`1DDD`; never 20xx. | Stated by Marcel. Boundaries live in config. |
| C7 | `Schlagwort`→tag, `Inhalt`→summary. | Matches importer's existing semantics. |
| C8 | Non-register correspondents become **provisional persons**. | ~945 distinct sender strings vs 163 register people. |
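
Decision C6 is compact enough to express directly. The following Python sketch is one reading of it, with the window boundaries as module constants standing in for the config values; `expand_year`, `ARCHIVE_START` and `ARCHIVE_END` are hypothetical names. Four-digit years are passed through unchanged, which is an assumption, since C6 does not say how out-of-window 4-digit values should be handled.

```python
from typing import Optional

# Archive window from C6; in the real tool these would come from config.
ARCHIVE_START, ARCHIVE_END = 1873, 1957


def expand_year(year_text: str) -> Optional[int]:
    """Expand a 2-, 3- or 4-digit year; None means 'flag for review'."""
    if not year_text.isdigit():
        return None
    value = int(year_text)
    if len(year_text) == 4:
        return value
    if len(year_text) == 3:
        return 1000 + value                  # 889 -> 1889
    if len(year_text) == 2:
        if value <= ARCHIVE_END % 100:       # 00..57 -> 19YY
            return 1900 + value
        if value >= ARCHIVE_START % 100:     # 73..99 -> 18YY
            return 1800 + value
        return None                          # 58..72 is ambiguous
    return None


print(expand_year("08"), expand_year("88"), expand_year("65"))  # 1908 1888 None
```

Deriving the 2-digit cut-offs from the window keeps the rule consistent if the boundaries are ever adjusted.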
---
## 4. Functional Requirements
Each requirement has a stable ID. User stories use Connextra + Given-When-Then; system rules
use EARS. Traceability to findings in §8.
### 4.1 Ingest & layout (`FR-INGEST`, `FR-MAP`)
**US-MAP-01** — *As the data steward, I want each source column mapped to a named canonical
field regardless of its position, so a re-exported spreadsheet with shifted columns still
imports correctly.*
- AC1 — Given the `Familienarchiv` sheet, when the normalizer reads the header row, then it
maps columns by **header name** (not fixed index) to the canonical fields.
- AC2 — Given a header the normalizer does not recognise, when it runs, then it records the
unknown header in `review/summary.txt` and continues (does not crash).
- AC3 — Given a required source header is **absent**, when it runs, then it aborts with a
clear message naming the missing header (fail loud, before producing partial output).
- **REQ-INGEST-01** — The normalizer shall read only the `Familienarchiv` sheet of the
document workbook and the `Tabelle1` sheet of the person workbook.
- **REQ-MAP-01** — Header matching shall be case-insensitive and tolerant of internal
multiple spaces (e.g. `"Datum des Briefes"`).
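
The acceptance criteria above could be met by a mapping step along these lines. This is a sketch under assumptions: the header names are the ones observed in the `Familienarchiv` sheet, while the canonical field names on the right, `REQUIRED_FIELDS`, and the helper names are hypothetical placeholders, since the spec does not yet say which headers are required.

```python
import re

# Source header (normalized) -> canonical field. Field names are illustrative.
HEADER_TO_FIELD = {
    "index": "index",
    "datei": "file_path",
    "box": "box",
    "mappe": "folder",
    "briefeschreiberin": "sender_raw",
    "empfängerin": "receivers_raw",
    "datum des briefes": "date_raw",
    "ort": "location",
    "schlagwort": "tags",
    "inhalt": "summary",
}
REQUIRED_FIELDS = {"index"}  # assumption: only the key column is mandatory


def normalize_header(cell) -> str:
    """Collapse runs of whitespace and fold case (REQ-MAP-01)."""
    text = "" if cell is None else str(cell)
    return re.sub(r"\s+", " ", text).strip().casefold()


def map_headers(header_row):
    """Return ({field: column_index}, [unknown headers]); fail loud if required ones are absent."""
    columns, unknown = {}, []
    for position, cell in enumerate(header_row):
        key = normalize_header(cell)
        if not key:
            continue
        field = HEADER_TO_FIELD.get(key)
        if field is None:
            unknown.append(str(cell))        # AC2: record and continue
        else:
            columns.setdefault(field, position)
    missing = sorted(REQUIRED_FIELDS - columns.keys())
    if missing:
        raise ValueError(f"missing required header(s) for: {', '.join(missing)}")  # AC3
    return columns, unknown
```

The caller would write the `unknown` list to `review/summary.txt` and let the `ValueError` abort the run before any output is produced.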
### 4.2 Row triage (`FR-TRIAGE`) — resolves IMP-06, IMP-07, IMP-08
**US-TRIAGE-01** — *As the data steward, I want rows that have data but no index surfaced
rather than dropped, so I never lose a letter silently.*
- AC1 — Given a row whose `index` is blank but which has any other non-empty cell, when the
normalizer runs, then that row is written to `review/blank-index-rows.csv` with its source
row number and is **not** emitted as a canonical document.
- AC2 — Given a fully empty row, when it runs, then the row is skipped and counted (not
reported as an anomaly).
- **REQ-TRIAGE-01** — If two or more rows resolve to the same `index`, then the normalizer
shall emit all of them to `review/duplicate-index.csv` and mark each canonical row
`needs_review = duplicate_index` (it shall **not** silently drop either).
- **REQ-TRIAGE-02** — Where a row is identified as a section/banner row (blank index, text
only in a name column), the normalizer shall classify it as such in the blank-index report.
- **REQ-TRIAGE-03** — Rows whose `index` ends in `x` (a transcription/back-side of the base
letter, not yet independently mappable) shall be **skipped** — not emitted as a canonical
document — and written to `review/skipped-x-suffix.csv` with their source row and base index
(`index` minus the trailing `x`), so they can be linked in a later pass. (Resolves IMP-10.)
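
The per-row rules in this section can be read as a small classifier. The sketch below is illustrative only: `classify_row`, `base_index`, the category labels and the field names `sender_raw` / `receivers_raw` are assumptions, and duplicate detection (REQ-TRIAGE-01) is not covered because it needs the whole sheet rather than a single row.

```python
NAME_FIELDS = {"sender_raw", "receivers_raw"}  # hypothetical canonical field names


def _text(value) -> str:
    """Render any cell value as stripped text; None becomes the empty string."""
    return "" if value is None else str(value).strip()


def classify_row(row: dict) -> str:
    """Classify one source row given as {field: cell value}."""
    cells = {field: _text(value) for field, value in row.items()}
    index = cells.get("index", "")
    filled = {field for field, text in cells.items() if field != "index" and text}
    if index:
        # REQ-TRIAGE-03: x-suffixed rows are skipped and reported.
        return "x_suffix" if index.endswith("x") else "document"
    if not filled:
        return "empty"                       # AC2: counted, not an anomaly
    # REQ-TRIAGE-02: blank index with text only in a name column.
    return "banner" if filled <= NAME_FIELDS else "blank_index"


def base_index(index: str) -> str:
    """Return the index without a trailing `x`."""
    return index[:-1] if index.endswith("x") else index
```

Every row lands in exactly one category, which is what makes the "zero silent drops" count in `review/summary.txt` checkable.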
### 4.3 Date normalization (`FR-DATE`) — resolves IMP-02, IMP-03
**US-DATE-01** — *As the data steward, I want every date interpreted as precisely as the
source allows, with the original always kept, so I can sort the archive and still see what the
letter actually said.*
- AC1 — Given a parseable date, when normalized, then `date_iso` holds the best-effort ISO
date, `date_raw` holds the verbatim source string, and `date_precision` ∈
`{DAY, MONTH, SEASON, YEAR, RANGE, APPROX, UNKNOWN}`.
- AC2 — Given an unparseable date, when normalized, then `date_iso` is empty,
`date_precision = UNKNOWN`, `date_raw` is preserved, and the value appears in
`review/unparsed-dates.csv`.
- AC3 — Given the same `date_raw` appears in `overrides/dates.csv`, when normalized, then the
override's `(iso, precision)` wins over the automatic parse.
- **REQ-DATE-01** — The parser shall accept, at minimum, these forms (see §10 examples):
Excel/ISO; `D.M.YYYY`/`D.M.YY`; `D/M. YY[YY]` (slash treated as dot); Roman-numeral months
`I–XII`; German + English month names, full and abbreviated, with or without a separating
space; `Month YYYY`; season/holiday + year; bare `YYYY`; and start-anchored ranges.
- **REQ-DATE-02** — Precision shall be assigned by what is known: full day → `DAY`; month+year
→ `MONTH` (day = 1); a **named feast/holiday + year** → resolved to its **actual calendar
date for that year** → `DAY`; a **season + year** → representative mid-season month (day = 1)
→ `SEASON`; year only → `YEAR` (month = Jan, day = 1); a range → start date + `RANGE`; a
value carrying an uncertainty marker (`?`, `um`, `ca`, `circa`) → `APPROX` with best-effort date.
- **REQ-DATE-03** — Two-digit and three-digit years shall be expanded per **C6**; a 2-digit
year in `58–72` shall yield `UNKNOWN` + a review entry rather than a guess.
- **REQ-DATE-04** — Trailing editorial notes (e.g. `", 2. Brief"`) shall be stripped before
parsing and preserved (kept within `date_raw`; not invented into the date).
- **REQ-DATE-05** — The parser shall be pure and side-effect-free so it can be unit-tested in
isolation (see NFR-TEST-01).
- **REQ-DATE-06** — **Movable feasts are never mapped to a fixed month**; they shall be
computed per year from Easter (Gauss/Butcher computus): Karfreitag = Easter−2, Ostern =
Easter Sunday, Himmelfahrt = Easter+39, Pfingst(sonntag) = Easter+49, Pfingstmontag =
Easter+50, Fronleichnam = Easter+60, 1.–4. Advent = the 4th…1st Sunday before 25 Dec. Fixed
feasts use a lookup table (Neujahr=01-01, Heiligabend=12-24, Weihnachten=12-25,
Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul,
Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py`
(NFR-MAINT-01).
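
REQ-DATE-06 can be illustrated with the anonymous Gregorian computus (the variant commonly attributed to Meeus/Jones/Butcher), which needs only integer arithmetic. The sketch below uses the Python standard library; the function names and the `EASTER_OFFSETS` table layout are hypothetical, and the feast spellings accepted as keys are an assumption.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Offsets in days from Easter Sunday, as listed in REQ-DATE-06.
EASTER_OFFSETS = {
    "karfreitag": -2, "ostern": 0, "himmelfahrt": 39,
    "pfingsten": 49, "pfingstsonntag": 49, "pfingstmontag": 50,
    "fronleichnam": 60,
}


def movable_feast(name: str, year: int):
    """Return the date of an Easter-anchored feast, or None if the name is unknown."""
    offset = EASTER_OFFSETS.get(name.strip().casefold())
    if offset is None:
        return None
    return easter_sunday(year) + timedelta(days=offset)


def advent_sunday(n: int, year: int) -> date:
    """Return the n-th Advent Sunday (1..4); the 4th is the last Sunday before 25 Dec."""
    christmas = date(year, 12, 25)
    # weekday(): Monday=0 ... Sunday=6; step back to the Sunday strictly before 25 Dec.
    back = (christmas.weekday() + 1) % 7 or 7
    fourth = christmas - timedelta(days=back)
    return fourth - timedelta(weeks=4 - n)


print(movable_feast("Pfingsten", 1922))  # 1922-06-04
print(movable_feast("Ostern", 1890))     # 1890-04-06
```

Because the functions are pure, the archive examples `Pfingsten 1922` and `Ostern 1890` can serve directly as unit-test cases.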
### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11
**US-PERS-01** — *As the data steward, I want the genealogical register turned into canonical
people with all their known facts, so documents can link to real persons.*
- AC1 — Given a register row, when parsed, then a canonical person is produced with
`person_id`, name parts, `maiden_name`, birth/death (parsed + raw + place), spouse,
generation, nickname, notes — applying the same date rules as §4.3 to birth/death dates.
- AC2 — Given multi-value given names (`"Charlotte,Meta,Jacobi"`), when parsed, then the
primary given name is the first; the remainder are retained as additional names/aliases.
**US-PERS-02** — *As the data steward, I want each sender/receiver string matched to a
canonical person where possible and never dropped otherwise, so the correspondence graph is
complete.*
- AC1 — Given a sender/receiver string, when resolved, then it maps to a register
`person_id` via the alias index (exact → normalized/casefold → conservative fuzzy).
- AC2 — Given no confident match, when resolved, then a **provisional person** is created from
the cleaned string, linked, and listed in `review/unmatched-names.csv` (occurrence count +
example source rows).
- AC3 — Given the string appears in `overrides/names.csv`, when resolved, then it maps to the
specified `person_id` (override wins).
- AC4 — Given a multi-person receiver cell (`"Eugenie u Walter de Gruyter"`, `"Herbert u
Clara"`, `"…//…"`, `"Hedi und Tutu (Gruber)"`), when resolved, then it is split into
individual people, each resolved independently; ambiguous space-joined pairs
(`"Ella Anita"`) are emitted to `review/ambiguous-receivers.csv` rather than guessed.
- **REQ-DEDUP-01** — The alias index shall be derived from the register: canonical
"First Last", maiden form (`geb als`), spouse-surname married form, nickname, and
first-name-only **only when unambiguous** across the register.
- **REQ-DEDUP-02** — The normalizer shall not merge two distinct strings into one person on
fuzzy similarity unless the score exceeds the configured threshold and the match is reported;
merges must be auditable.
- **REQ-PERS-01** — Sender cells shall be parsed for multi-person content using the same rules
as receiver cells (today the importer parses only receivers — IMP-11).
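
One way to read REQ-DEDUP-01 is as a "claims" table: every person claims a set of alias strings, and an alias is usable only if exactly one person claims it. The sketch below assumes register rows have already been turned into dicts with `person_id`, `first_name`, `last_name` and optional `maiden_name` / `nickname`; `norm` and `build_alias_index` are hypothetical names. It is stricter than the requirement in that any ambiguous alias is dropped, not only first-name-only ones, and it omits the spouse-surname form.

```python
from collections import defaultdict


def norm(name: str) -> str:
    """Collapse whitespace and fold case so lookups are tolerant."""
    return " ".join(name.split()).casefold()


def build_alias_index(persons):
    """Map normalized alias -> person_id, keeping only unambiguous aliases."""
    claims = defaultdict(set)
    for person in persons:
        first, last = person["first_name"], person["last_name"]
        aliases = [f"{first} {last}", first]          # canonical form, first name only
        if person.get("maiden_name"):
            aliases.append(f"{first} {person['maiden_name']}")
        if person.get("nickname"):
            aliases.append(person["nickname"])
        for alias in aliases:
            claims[norm(alias)].add(person["person_id"])
    # Sorted keys keep the result order stable across runs.
    return {alias: next(iter(ids))
            for alias, ids in sorted(claims.items()) if len(ids) == 1}
```

With a register entry for Eugenie de Gruyter whose `geb als` is Müller, both `eugenie de gruyter` and `eugenie müller` would resolve to the same `person_id`, while a first name shared by two register people would resolve to nobody.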
### 4.5 Overrides & idempotency (`FR-OVR`) — supports the iteration loop
- **REQ-OVR-01** — When the normalizer runs, then it shall load `overrides/dates.csv` and
`overrides/names.csv` if present and apply them; absence of either file shall not be an error.
- **REQ-OVR-02** — While overrides are unchanged and inputs are unchanged, re-running shall
produce **byte-identical** canonical outputs and review files (NFR-IDEM-01).
- **REQ-OVR-03** — Each override application shall be counted in `review/summary.txt` (how many
dates/names were resolved by override vs automatically).
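
The override rules can be sketched as a loader that tolerates a missing file plus a resolver that records where each result came from. The column names `date_raw`, `iso` and `precision` in `overrides/dates.csv` are an assumption (the spec has not fixed the file layout), and `load_date_overrides` / `resolve_date` are hypothetical helpers.

```python
import csv
from pathlib import Path


def load_date_overrides(path):
    """Read the overrides CSV into {date_raw: (iso, precision)}; a missing file yields {}."""
    file = Path(path)
    if not file.exists():
        return {}                            # REQ-OVR-01: absence is not an error
    with file.open(newline="", encoding="utf-8") as handle:
        return {row["date_raw"]: (row["iso"], row["precision"])
                for row in csv.DictReader(handle)}


def resolve_date(date_raw, overrides, auto_parse):
    """Return (iso, precision, source); an override wins over the automatic parse."""
    if date_raw in overrides:
        iso, precision = overrides[date_raw]
        return iso, precision, "override"
    iso, precision = auto_parse(date_raw)
    return iso, precision, "auto"
```

Counting the third element of each result gives the override-versus-automatic totals that REQ-OVR-03 asks for.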
### 4.6 Canonical output & provenance (`FR-OUT`, `FR-PROV`) — resolves IMP-01, IMP-09, IMP-12
- **REQ-OUT-01** — The normalizer shall write `out/canonical-documents.xlsx` and
`out/canonical-persons.xlsx` with the headered schemas in §6.
- **REQ-PROV-01** — Every canonical document row shall carry `source_row` (1-based row number
in the source sheet) so any value can be traced back to the original.
- **REQ-PROV-02** — Every canonical row shall carry a `needs_review` field listing zero or more
flags (`duplicate_index`, `unparsed_date`, `unmatched_sender`, `unmatched_receiver`,
`index_file_mismatch`, …) so the import and the UI can foreground uncertain data.
- **REQ-OUT-02** — Where the source `Datei` path disagrees with the index-derived filename
(IMP-09), the normalizer shall record the discrepancy in `review/index-file-mismatch.csv`
and flag the row; it shall **not** alter the `index` (the stable key).
---
## 5. Non-Functional Requirements
| ID | Category | Requirement (measurable) |
| --- | --- | --- |
| NFR-DATA-01 | Data integrity | 100% of source rows are accounted for in output **or** a review file; 100% of original date/name strings preserved verbatim. |
| NFR-IDEM-01 | Determinism | Identical inputs + overrides ⇒ byte-identical outputs across runs and machines. |
| NFR-PERF-01 | Performance | Full run over 7,943 doc rows + 163 person rows completes in < 60 s on the dev laptop. |
| NFR-ACCUR-01 | Date accuracy | After automated pass, `UNKNOWN` dates ≤ 5% of dated rows; after overrides iteration, ≤ 0.5%. |
| NFR-ACCUR-02 | Name coverage | Every sender/receiver occurrence yields a linked person (register or provisional); 0 dropped. |
| NFR-I18N-01 | Encoding | UTF-8 end-to-end; German diacritics and ß round-trip with no mojibake in any output. |
| NFR-TEST-01 | Testability | `dates.py` and `persons.py` have pytest tests covering every format/alias category in §10 with real examples from the archive. |
| NFR-MAINT-01 | Maintainability | Column-name map, century boundaries, season→month map, and fuzzy threshold live in `config.py`, not inline in logic. |
| NFR-OBSERV-01 | Observability | `review/summary.txt` reports per-run stats: rows in, documents out, dates by precision, names matched vs provisional, overrides applied, anomalies by type. |
| NFR-SAFETY-01 | Source safety | Source workbooks are opened read-only and never written. |
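
NFR-DATA-01 could be verified at the end of each run with a simple set difference. The sketch below assumes every canonical row and every review-file row carries its 1-based `source_row`; the function name is hypothetical. An empty result means every source row is accounted for.

```python
from typing import Iterable, List


def unaccounted_rows(source_rows: Iterable[int],
                     canonical_rows: Iterable[int],
                     review_rows: Iterable[int]) -> List[int]:
    """Source row numbers that appear in neither the output nor a review file."""
    accounted = set(canonical_rows) | set(review_rows)
    return sorted(set(source_rows) - accounted)
```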
---
## 6. Data Dictionary (canonical contract)
This is the contract Phase 2 (the importer) must consume. Field-level, format-level — not a
DB schema.
### 6.1 `canonical-documents.xlsx`
| Field | Required | Format / values | Notes |
| --- | --- | --- | --- |
| `index` | yes | string | Stable key; basis for PDF matching. |
| `box` | no | string | from `Box`. |
| `folder` | no | string | from `Mappe`. |
| `sender_person_id` | no | person_id | resolved; empty if no sender. |
| `sender_name` | no | string | canonical display name (or cleaned raw if provisional). |
| `receiver_person_ids` | no | `id\|id\|…` | pipe-separated. |
| `receiver_names` | no | `name\|name\|…` | pipe-separated, aligned with ids. |
| `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. |
| `date_raw` | no | string | verbatim source date. |
| `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. |
| `location` | no | string | from `Ort`. |
| `tags` | no | `tag\|tag` | from `Schlagwort`. |
| `summary` | no | string | from `Inhalt`. |
| `source_row` | yes | int | provenance (REQ-PROV-01). |
| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). |
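
A consumer of this contract needs to read the pipe-separated cells back into lists. The following is a sketch only, assuming an empty cell means "no values" and that a length mismatch between `receiver_person_ids` and `receiver_names` should be treated as an error; the function names and the display names in the usage are hypothetical.

```python
from typing import List, Optional, Tuple


def split_pipe(cell: Optional[str]) -> List[str]:
    """Split a pipe-separated cell; None or blank yields an empty list."""
    if cell is None or str(cell).strip() == "":
        return []
    return [part.strip() for part in str(cell).split("|")]


def receivers(ids_cell: Optional[str], names_cell: Optional[str]) -> List[Tuple[str, str]]:
    """Pair each receiver id with its display name, relying on aligned order."""
    ids = split_pipe(ids_cell)
    names = split_pipe(names_cell)
    if len(ids) != len(names):
        raise ValueError(f"misaligned receivers: {len(ids)} ids vs {len(names)} names")
    return list(zip(ids, names))
```

For example, `receivers("cram-herbert|cram-clara", "Herbert Cram|Clara Cram")` yields two `(id, name)` pairs in source order.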
### 6.2 `canonical-persons.xlsx`
| Field | Required | Format | Notes |
| --- | --- | --- | --- |
| `person_id` | yes | slug | stable id (e.g. `de-gruyter-eugenie`); collisions suffixed. |
| `last_name` | yes | string | from `Familienname`. |
| `first_name` | no | string | primary given name. |
| `maiden_name` | no | string | from `geb als` — drives dedup. |
| `title` | no | string | e.g. honorifics if present. |
| `nickname` | no | string | from quoted `Bemerkung`/spouse field. |
| `birth_date` / `birth_date_raw` / `birth_place` | no | ISO / string / string | §4.3 rules. |
| `death_date` / `death_date_raw` / `death_place` | no | ISO / string / string | §4.3 rules. |
| `spouse` | no | person_id or name | from `verheiratet mit`. |
| `generation` | no | string | `G 1`..`G 4`. |
| `notes` | no | string | from `Bemerkung`. |
| `aliases` | no | `a\|b\|c` | every surface form that maps here. |
| `provisional` | yes | bool | true if created from a document string, not the register. |
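
One possible way to build the `person_id` slug is sketched below. The transliteration of German diacritics (`ü` to `ue`, `ß` to `ss`) and the collision suffix starting at `-2` are assumptions; the spec only fixes the `lastname-firstname` shape with a numeric suffix on collision. Stable ids across runs also depend on processing persons in a deterministic order.

```python
import re
import unicodedata
from typing import Optional, Set

# Assumed transliteration for German letters; not specified in this document.
_GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                         "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})


def slugify(text: str) -> str:
    """Lowercase ASCII slug with hyphens between word groups."""
    text = text.translate(_GERMAN)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_person_id(last_name: str, first_name: Optional[str], taken: Set[str]) -> str:
    """Build `lastname-firstname`; append -2, -3, ... while the id is taken."""
    base = "-".join(p for p in (slugify(last_name), slugify(first_name or "")) if p)
    candidate, n = base, 2
    while candidate in taken:
        candidate = f"{base}-{n}"
        n += 1
    taken.add(candidate)
    return candidate
```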
---
## 7. Prioritized Backlog (MoSCoW)
| ID | Item | MoSCoW | Effort | Depends on |
| --- | --- | --- | --- | --- |
| B1 | Project scaffolding + read both workbooks (`FR-INGEST`, header map `FR-MAP`) | Must | S | — |
| B2 | Row triage + blank/duplicate/empty reports (`FR-TRIAGE`) | Must | S | B1 |
| B3 | Date parser + precision + century rule + Easter/feast computus + season map + tests (`FR-DATE`) | Must | L | B1 |
| B4 | Person register parser → canonical persons (`FR-PERS` US-PERS-01) | Must | M | B1 |
| B5 | Alias index + name resolution + multi-person split (`FR-DEDUP`, US-PERS-02) | Must | L | B4 |
| B6 | Overrides load + apply + idempotency (`FR-OVR`) | Must | S | B3,B5 |
| B7 | Canonical writers + provenance + review summary (`FR-OUT`, `FR-PROV`) | Must | M | B2,B3,B5 |
| B8 | Index↔Datei mismatch report (`REQ-OUT-02`) | Should | XS | B1 |
| B9 | Ambiguous-receiver review path (US-PERS-02 AC4) | Should | S | B5 |
| B10 | Comma-split `Inhalt` into extra tags | Could | XS | B7 |
| B11 | Phase-2 importer wiring (separate spec) | Won't (this spec) | — | B7 |
---
## 8. Traceability — Findings → Requirements
| Finding | Severity | Addressed by |
| --- | --- | --- |
| IMP-01 layout mismatch | blocker | C1, FR-MAP, REQ-OUT-01 |
| IMP-02 free-text dates | blocker | FR-DATE (all), C2, C6 |
| IMP-03 no ISO/normalized cols | blocker | FR-DATE, FR-PERS |
| IMP-04 register unimported | major | C3, US-PERS-01, §6.2 |
| IMP-05 name variants → dupes | major | C3, FR-DEDUP |
| IMP-06 blank-index dropped | major | US-TRIAGE-01 |
| IMP-07 duplicate indices | minor | REQ-TRIAGE-01 |
| IMP-08 section rows / tags vs summary | minor | REQ-TRIAGE-02, C7 |
| IMP-09 index↔file mismatch | minor | REQ-OUT-02, B8 |
| IMP-10 `x`-suffix rows | minor | REQ-TRIAGE-03 (skip + log this pass) |
| IMP-11 sender not split / ` u ` sep | minor | REQ-PERS-01, US-PERS-02 AC4 |
| IMP-12 first-sheet, no validation | minor | REQ-INGEST-01, FR-MAP AC2/AC3 |
---
## 9. Open Questions / TBD Register
| ID | Question | Why it matters | Ref | Resolution |
| --- | --- | --- | --- | --- |
| OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts are **computed per year, never mapped to a fixed month**: Ostern, Himmelfahrt, Pfingsten, … from Easter; Advent from the Sundays before Christmas. Fixed feasts are looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons map to the mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | **Confirmed:** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`. |
| OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. |
| OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. |
| OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). |
| OQ-06 ✅ | Fuzzy-match policy. | False-positive person merges (R2). | REQ-DEDUP-02 | **Confirmed:** conservative — report all fuzzy matches; no silent merge. |
*All open questions resolved as of 2026-05-25. New ambiguities discovered during build go here.*
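
The OQ-01 resolution can be illustrated with the anonymous Gregorian Easter algorithm (Meeus/Jones/Butcher). This is a sketch, not the normalizer's `dates.py`: the function names are hypothetical, only three Easter-based feasts are covered, and mapping a season to the first day of its month follows the `Herbst 1913` example in §10.

```python
from datetime import date, timedelta


def easter(year: int) -> date:
    """Gregorian Easter Sunday (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Day offsets from Easter Sunday for Easter-based feasts.
FEAST_OFFSETS = {"Ostern": 0, "Himmelfahrt": 39, "Pfingsten": 49}

# Mid-season months as resolved in OQ-01.
SEASON_MONTHS = {"Frühling": 4, "Sommer": 7, "Herbst": 10, "Winter": 1}


def feast_date(name: str, year: int) -> date:
    """Date of an Easter-based movable feast in the given year."""
    return easter(year) + timedelta(days=FEAST_OFFSETS[name])


def season_date(name: str, year: int) -> date:
    """First day of the mid-season month (precision SEASON)."""
    return date(year, SEASON_MONTHS[name], 1)
```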
---
## 10. Glossary & Worked Examples
**Precision** — how exactly a date is known (`DAY` … `UNKNOWN`). **Provisional person** — a
person created from a document name string with no register match. **Alias index** — map from
every known surface form of a name to a canonical `person_id`. **Override** — a
human-supplied correction applied deterministically on each run.
**Date examples → expected outcome:**
| `date_raw` | `date_iso` | `date_precision` |
| --- | --- | --- |
| `15.2.1888` | 1888-02-15 | DAY |
| `6.März 1888` | 1888-03-06 | DAY |
| `22.III.18` | 1918-03-22 | DAY |
| `13.5.09` | 1909-05-13 | DAY |
| `10.Oct.95` | 1895-10-10 | DAY |
| `17/6. 1916` | 1916-06-17 | DAY |
| `Mai 1895` | 1895-05-01 | MONTH |
| `Pfingsten 1922` | 1922-06-04 | DAY (computed: Easter 1922 = Apr 16, +49 days) |
| `Herbst 1913` | 1913-10-01 | SEASON |
| `1905` | 1905-01-01 | YEAR |
| `8.1.1916 - 15.3.1916` | 1916-01-08 | RANGE |
| `17.Nov (?) 1887` | 1887-11-17 | APPROX |
| `?` | *(empty)* | UNKNOWN |
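
The two-digit years in the table (`18`, `09`, `95`) follow the century rule recorded in the worklog: the archive spans 1873 to 1957, so `00` to `57` map to 19YY, `73` to `99` map to 18YY, `58` to `72` are flagged, and three-digit years become 1DDD. A minimal sketch of that rule, with a hypothetical function name and an assumed return shape of `(year, needs_review)`:

```python
from typing import Optional, Tuple


def expand_year(raw: str) -> Tuple[Optional[int], bool]:
    """Expand a written year to four digits; the bool marks rows for review."""
    digits = raw.strip()
    if not (digits.isascii() and digits.isdigit()):
        return None, True
    n = int(digits)
    if len(digits) == 4:
        return n, False                # already a full year
    if len(digits) == 3:
        return 1000 + n, False         # 3-digit -> 1DDD
    if len(digits) == 2:
        if n <= 57:
            return 1900 + n, False     # 00-57 -> 19YY
        if n >= 73:
            return 1800 + n, False     # 73-99 -> 18YY
        return None, True              # 58-72 is outside the archive span
    return None, True                  # any other length is not covered by the rule
```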
**Name examples → expected outcome:**
| raw cell | resolves to |
| --- | --- |
| `Eugenie Müller` (+ register `geb Müller`) | `de-gruyter-eugenie` (matched via maiden alias) |
| `Eugenie de Gruyter` | `de-gruyter-eugenie` |
| `Herbert u Clara` | `cram-herbert` + `cram-clara` (split, surname distributed) |
| `Hedi und Tutu (Gruber)` | `gruber-hedi` + `gruber-tutu` |
| `Ella Anita` | → `review/ambiguous-receivers.csv` (not auto-split) |
| `Hans Wittkopf` (not in register) | provisional `wittkopf-hans` |
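
The multi-person split in the examples above might look like the following sketch. It only handles the separators `und`, `u` and `u.` plus a trailing parenthesized surname; distributing a surname that comes from the register (as in `Herbert u Clara`) and the ambiguous-receiver review path are out of scope here. The function name and regexes are illustrative.

```python
import re
from typing import List

# Separators seen in the archive: "und", bare "u", and "u." between names.
_SEP = re.compile(r"\s+(?:und|u\.?)\s+", re.IGNORECASE)
# A trailing "(Surname)" that applies to every name before it.
_TRAILING_SURNAME = re.compile(r"^(?P<names>.*?)\s*\((?P<surname>[^)]+)\)\s*$")


def split_people(raw: str) -> List[str]:
    """Split a raw sender/receiver cell into one string per person."""
    text = raw.strip()
    surname = None
    match = _TRAILING_SURNAME.match(text)
    if match:
        text, surname = match.group("names"), match.group("surname").strip()
    parts = [p.strip() for p in _SEP.split(text) if p.strip()]
    if surname:
        parts = [f"{p} {surname}" for p in parts]
    return parts
```

A cell without a separator, such as `Ella Anita`, comes back as a single string, which leaves the decision to the review file rather than guessing a split.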

@@ -0,0 +1,62 @@
# Import Migration — Working Folder
This folder tracks the iterative work of mass-importing the **real, raw family archive**
spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.
It is intentionally **local docs, not Gitea issues**. We only open a Gitea issue when a
finding requires a *software* change (e.g. a new date parser). Pure data observations and
the running plan live here so any agent can pick the work up cold.
## Source files (in `/import`)
| File | What it is | Importer support today |
| --- | --- | --- |
| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The **real raw archive** — 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does **not** match importer defaults |
| `Personendatei 2.xlsx` | Genealogical **person register** — 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, **already-normalized** subset (the Walter & Eugenie *Brautbriefe*, i.e. betrothal letters). 14 clean columns incl. ISO dates. | ✅ this is what `MassImportService` was built for |
The PDFs (~7,000) will follow later. The importer matches files by the **Index** column
(e.g. `W-0001` → `W-0001.pdf`) and already imports metadata-only when a file is missing,
so we can import all metadata now and the PDFs will attach on a re-run.
## How to inspect the spreadsheets
`openpyxl` is installed in the OCR service venv:
```bash
/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
```
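
To look at the column headers of a workbook, something like the sketch below may be enough. It assumes the header sits in row 1 of the named sheet (`Familienarchiv` or `Tabelle1`); the function names are hypothetical. The workbook is opened read-only, so the source file is never written.

```python
from typing import Dict, Iterable, Optional


def header_index(header_row: Iterable[Optional[object]]) -> Dict[str, int]:
    """Map each non-empty header name to its 0-based column; first occurrence wins."""
    index: Dict[str, int] = {}
    for col, value in enumerate(header_row):
        if value is None:
            continue
        name = str(value).strip()
        if name and name not in index:
            index[name] = col
    return index


def read_header(path: str, sheet_name: str) -> Dict[str, int]:
    """Open the workbook read-only and return the header map of one sheet."""
    from openpyxl import load_workbook  # imported lazily; needs the OCR venv
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        first = next(ws.iter_rows(min_row=1, max_row=1, values_only=True), ())
        return header_index(first)
    finally:
        wb.close()
```

Run it with the venv interpreter shown above, for example `read_header("Personendatei 2.xlsx", "Tabelle1")` from inside `/import`.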
## Documents in this folder
- [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md) — full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID `IMP-NN`.
- [`02-normalization-spec.md`](./02-normalization-spec.md) — requirements spec for the offline **import normalizer** (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements `FR-*`/`NFR-*`, traceable to the `IMP-NN` findings.
- `WORKLOG.md` — running log of what each session did and what's next. **Start here when resuming.**
## Strategy (decided 2026-05-25)
Normalize **before** import. A standalone Python tool (`tools/import-normalizer/`, not yet
built) transforms the raw xlsx + person register into a clean canonical dataset
(`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs. Residual cases
(unparseable dates, unmatched names) are fixed via a version-controlled overrides file and
re-run. The Java importer is adjusted to consume the canonical contract in a later **Phase 2**.
See the spec for the full contract.
## Status board
| ID | Issue | Severity | Status |
| --- | --- | --- | --- |
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |
See the findings doc for detail and proposed approach per issue.

@@ -0,0 +1,62 @@
# Import Migration — Worklog
Running log of each working session. **Resume here.** Newest entry on top.
---
## 2026-05-25 (session 2) — Strategy + normalizer spec
**Did:**
- Decided strategy with Marcel: **normalize the raw sheets first**, then import (higher
leverage than making the Java importer tolerate every mess).
- Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw +
precision; include person register + dedup in this effort; overrides-file + re-run loop;
Python tool at `tools/import-normalizer/`.
- Century rule fixed by Marcel: archive spans **1873–1957**; 2-digit `00–57`→19YY,
`73–99`→18YY, `58–72`→flag; 3-digit→1DDD; never 20xx.
- Wrote [`02-normalization-spec.md`](./02-normalization-spec.md) in the requirements-engineer
persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
**All 6 open questions resolved (spec §9):** OQ-01 — movable feasts (Ostern, Pfingsten, …)
**computed per year from Easter**, never a fixed month; seasons → mid-season month
(Sommer=Jul, Herbst=Oct). OQ-02 ranges → start+RANGE. OQ-03 slug ids. OQ-04 — `x`-suffix rows
**skipped + logged** this pass (they're transcriptions of the base letter, not yet mappable).
OQ-05 → `.xlsx`. OQ-06 → conservative, no silent merge.
**Git:** moved off the unrelated `feat/issue-356-…` branch; pulled `main`; created clean
branch **`docs/import-migration`** and committed these docs there. (The dirty `.venv`
pycache + `skills/implement/SKILL.md` in the tree are pre-existing/environmental noise — left
uncommitted, not ours.)
**Next:**
- Marcel reviews the spec.
- Then writing-plans → build the normalizer at `tools/import-normalizer/` (backlog B1–B7 are
the Musts; B3 date parser incl. Easter computus is the big one).
---
## 2026-05-25 (session 1) — Initial analysis
**Did:**
- Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
- Compared the new xlsx layout against `MassImportService` defaults and the old ODS.
- Full statistical scan of all rows: dates, indices, senders/receivers, file column.
- Wrote [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md)
with 12 issues (IMP-01..IMP-12) + recommended sequencing.
- Installed `openpyxl` into the OCR service venv for inspection.
**Key facts established:**
- Importer defaults match the **ODS**, not the new xlsx → wrong column mapping (IMP-01).
- **90%** of dated rows (6,571 / 7,319) are free-text dates the ISO-only parser drops (IMP-02).
- Person register is rich but **unimported**; holds the maiden-name dedup key (IMP-04/05).
**Decisions pending from Marcel (blockers for any code work):**
1. IMP-01: positional re-config of `app.import.col.*` vs header-driven mapping rewrite?
2. IMP-02: how to store imprecise dates — new `dateOriginal` + `precision` columns, or lossy?
3. IMP-04/05: format for the person/alias mapping; import persons before documents?
4. IMP-10: are `x`-suffix rows separate documents, attachments, or skipped?
**Next:**
- Get Marcel's calls on the 4 decisions above.
- Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
- Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.