diff --git a/docs/superpowers/specs/2026-05-25-personendatei-importer-design.md b/docs/superpowers/specs/2026-05-25-personendatei-importer-design.md new file mode 100644 index 00000000..17637445 --- /dev/null +++ b/docs/superpowers/specs/2026-05-25-personendatei-importer-design.md @@ -0,0 +1,292 @@ +# Personendatei Importer — Design Spec + +**Date:** 2026-05-25 +**Source file:** `import/Personendatei 2.xlsx` +**Output:** `tools/import-normalizer/out/canonical-persons-tree.json` +**Tool location:** `tools/import-normalizer/persons_tree.py` + +--- + +## 1. Purpose + +Normalize the 163-person family register in `Personendatei 2.xlsx` into a machine-readable JSON file that a future backend importer can consume to seed the `persons` and `person_relationships` tables. The tool is offline (no backend required) and produces a reviewable artifact with an explicit `unresolved[]` list for manual follow-up. + +--- + +## 2. Source Data — Column Map + +Sheet: `Tabelle1` (rows 2–164; row 1 is the header). + +| Col | Header | Content | Notes | +|-----|--------|---------|-------| +| A | Generation | `G 0`–`G 5` | Generation relative to Herbert & Clara Cram (G 2). Inconsistent formatting: `"G3"`, `"G 0"`, `"G 2 de Gruyter"` — strip non-digit chars and parse the integer. 
| +| B | Familienname | Last name | Sometimes compound: `"de Gruyter"`, `"Cram Heydrich"`, `"Burkhard- Meier"` | +| C | Vorname | First name | Sometimes multiple: `"Charlotte,Meta,Jacobi"`, nicknames in parens: `"Otto (Herbert)"` | +| D | geb als | Maiden name | Used as a name alias for matching | +| E | Geburtsdatum | Birth date | **Mixed types** — see §4 | +| F | Geburtsort | Birth place | Free-text string, stored verbatim | +| G | Todesdatum | Death date | Same mixed types as col E | +| H | Sterbeort | Death place | Free-text string, stored verbatim | +| I | verheiratet mit | Spouse name | Partial name in either `"Firstname Lastname"` or `"Lastname Firstname"` order | +| J | Bemerkung | German relationship notes | `"Sohn v Clara u Herbert"`, `"Nichte v Herbert"`, free text | + +--- + +## 3. Two-Pass Architecture + +### Pass 1 — Parse & Normalize (rows → person records) + +For each row: +1. Read all 10 columns. +2. Assign a stable `rowId`: `"row_{i:03d}"` where `i` is the 1-based row number (e.g. `row_002`). +3. Normalize fields per §4 and §5. +4. Add the record's name keys to the **name-lookup index** (see §6). +5. Emit a person record. + +### Pass 2 — Resolve Relationships + +Walk every person record: +1. Resolve col I (spouse) → emit `SPOUSE_OF` edge or `unresolved` entry. +2. Parse col J (Bemerkung) for parent/child patterns → emit `PARENT_OF` edges or `unresolved` entries. +3. Append unmatched Bemerkung text to `person.notes`. + +--- + +## 4. Date Parsing + +Both col E (birth) and col G (death) arrive as either an Excel numeric serial or a string. + +### Excel serial conversion +When the cell value is numeric (an `int` or a `float`) rather than a string: +``` +date = datetime(1899, 12, 30) + timedelta(days=int(value)) +year = date.year +``` +Using 1899-12-30 as the epoch compensates for the Lotus 1-2-3 leap-year bug (Excel counts a nonexistent 1900-02-29), so the result is correct for serials of 61 and above, i.e. dates from 1900-03-01 onward. + +### String fallback — reuse existing `dates.parse_date()` +Pass the raw string to the existing `tools/import-normalizer/dates.parse_date()`. 
It already handles: +- `DD.MM.YYYY` and `D.M.YY` +- Year-only (`1930`) +- Month + year (`August 1941`, `Sept. 1913`) +- Partial/approximate markers + +Extract `.year` from the returned `ParsedDate.iso` if `iso` is not `None`. + +### Unresolvable dates +If both paths yield `None` (e.g. `"2.9.196"`, `"4.3.1023"`, `".12.1955"`): +- Set `birthYear`/`deathYear` to `null`. +- Append the raw value to `person.notes` as `"[Geburtsdatum: {raw}]"` or `"[Todesdatum: {raw}]"` for human review. + +--- + +## 5. Person Record Normalization + +### Name fields +- **lastName** = col B, stripped. +- **firstName** = col C. Keep as-is (including multi-name strings and parenthetical nicknames) — the backend can split later. +- **maidenName** = col D, stripped. Stored in the JSON; the backend maps this to a `PersonNameAlias` of type `BIRTH_NAME`. +- **alias** = `null` (the tool does not invent aliases; maiden name is the alias). + +### Generation +Extract the first digit sequence from col A: +```python +import re +m = re.search(r"\d+", raw_generation) +generation = int(m.group()) if m else None +``` +Handles all observed variants: `"G 3"`, `"G3"`, `"G 0"`, `"G 2 de Gruyter"`. +Stored as `generation: int | null` in the JSON (informational; not mapped to a backend field directly). + +### familyMember +Set `true` for all records. Every person in this register is part of the family network. The backend can refine this. + +### notes +Constructed by concatenation: +1. Unmatched Bemerkung text (after relationship pattern is stripped). +2. Unresolvable date raw values (prefixed with field name). + +--- + +## 6. Name Lookup Index + +Pass 1 populates a `dict[str, list[str]]` mapping normalized name keys → list of `rowId`s; the index is complete once every row has been read. + +### Normalization function `_norm(s) -> str` +1. Lowercase. +2. Strip surrounding `"` and `'`. +3. Remove parenthetical substrings: `r"\([^)]*\)"`. +4. Collapse internal whitespace. +5. Strip geographic/honorific suffixes: `aachen`, `mex.`, `mexiko`, `sen`, `jun`, `jr`. +6. 
Strip trailing commas, dots. + +### Keys indexed per person +For a person with firstName `F`, lastName `L`, maidenName `M`: +- `_norm(f"{F} {L}")` — canonical order +- `_norm(f"{L} {F}")` — reversed order (col I uses this heavily) +- `_norm(f"{F} {M}")` if maidenName is set — maiden-name reference +- `_norm(L)` alone — single-token fallback + +### Match resolution +Given a raw name string from col I or col J: +1. `_norm(raw)` → look up in index. +2. **Exactly one hit** → match confirmed, use that `rowId`. +3. **Zero hits** → `reason: "not_found"` → `unresolved[]`. +4. **Multiple hits** → `reason: "ambiguous"` → `unresolved[]`. + +--- + +## 7. Relationship Extraction + +### 7.1 SPOUSE_OF (col I — `verheiratet mit`) + +1. Normalize col I value. +2. Resolve via name index (§6). +3. If matched: emit one edge `{ personId, relatedPersonId, type: "SPOUSE_OF", source: "verheiratet_mit" }`. + - Skip if an identical edge (regardless of direction) already exists in the relationship list. +4. If unresolved: add to `unresolved[]`. + +### 7.2 PARENT_OF (col J — `Bemerkung`) + +Apply these regex patterns in order, case-insensitive, with optional whitespace: + +| Pattern | Direction | Note | +|---------|-----------|------| +| `(Sohn\|Tochter)\s+v(?:on)?\s+(.+)` | Named person(s) → this person | "Sohn v Clara u Herbert" | +| `(Vater\|Mutter)\s+v(?:on)?\s+(.+)` | This person → named person(s) | "Vater v Herbert" | + +**Multi-name extraction:** The captured name list may contain two names joined by `\s+u(?:nd)?\s+` (two parents for `Sohn`/`Tochter`, two children for `Vater`/`Mutter`). Split on this pattern and resolve each part independently. + +**Emit** one `PARENT_OF` edge per resolved parent-child pair: +```json +{ + "personId": "", + "relatedPersonId": "", + "type": "PARENT_OF", + "source": "bemerkung", + "rawBemerkung": "" +} +``` + +**Skip** (do not emit, do not add to `unresolved[]`, leave in notes): +- Patterns starting with `Neffe`, `Nichte`, `Enkel`, `Enkelin`, `Urenkel`, `Urenkelin` — too indirect. 
+- Patterns starting with `Bruder`, `Schwester` — SIBLING_OF is out of scope for this tool. +- Any other Bemerkung text that does not match the parent patterns. + +**After extraction:** the matched portion of the Bemerkung is removed; the remainder goes into `person.notes`. + +--- + +## 8. Output JSON Schema + +File: `tools/import-normalizer/out/canonical-persons-tree.json` + +```json +{ + "generated_at": "", + "source": "Personendatei 2.xlsx", + "stats": { + "persons": 163, + "relationships": 87, + "unresolved": 12 + }, + "persons": [ + { + "rowId": "row_002", + "firstName": "Elsgard", + "lastName": "Allemeyer", + "maidenName": "Wöhler", + "alias": null, + "notes": "Nichte von Herbert", + "birthYear": 1920, + "deathYear": 1999, + "birthPlace": "Garz", + "deathPlace": "Espelkamp", + "generation": 3, + "familyMember": true + } + ], + "relationships": [ + { + "personId": "row_002", + "relatedPersonId": "row_003", + "type": "SPOUSE_OF", + "source": "verheiratet_mit" + }, + { + "personId": "row_019", + "relatedPersonId": "row_021", + "type": "PARENT_OF", + "source": "bemerkung", + "rawBemerkung": "Tochter v Clara u Herbert" + } + ], + "unresolved": [ + { + "rowId": "row_007", + "field": "verheiratet_mit", + "raw": "\"Tante Lolly\"", + "reason": "not_found" + }, + { + "rowId": "row_042", + "field": "bemerkung", + "raw": "Zwillingsbruder v Herbert", + "reason": "not_found" + } + ] +} +``` + +--- + +## 9. 
CLI Interface + +``` +python3 persons_tree.py [--input PATH] [--output PATH] [--dry-run] +``` + +| Flag | Default | Description | +|------|---------|-------------| +| `--input` | `../../import/Personendatei 2.xlsx` | Source Excel file | +| `--output` | `out/canonical-persons-tree.json` | Output JSON file | +| `--dry-run` | off | Print stats + first 5 unresolved entries; do not write file | + +On success, print: +``` +✓ 163 persons parsed +✓ 87 relationships emitted (52 SPOUSE_OF, 35 PARENT_OF) +⚠ 12 unresolved (see unresolved[] in output) +→ out/canonical-persons-tree.json +``` + +--- + +## 10. Module Reuse + +| Existing module | What we reuse | +|-----------------|---------------| +| `dates.parse_date()` | String date parsing — handles DD.MM.YYYY, year-only, month+year, approximate markers | +| `config.MONTHS` | Month name → integer mapping (German + Spanish month names already present) | + +The Excel serial conversion is new logic added directly in `persons_tree.py` (3 lines). + +--- + +## 11. What This Tool Does NOT Do + +- Does not call the backend API or touch the database. +- Does not create `PersonNameAlias` records — it emits `maidenName` as a field; the future backend importer maps it. +- Does not infer SIBLING_OF edges (requires symmetric lookup across multiple rows — deferred). +- Does not deduplicate persons that appear in both this file and `canonical-persons.xlsx` — deduplication is the backend importer's responsibility. +- Does produce `birthPlace` / `deathPlace` as top-level fields in the JSON (see §8) — they are free-text strings and informational only. The `Person` entity has no corresponding columns; the future backend importer decides whether to add columns or fold the values into `notes`. + +--- + +## 12. Open Questions + +| OQ | Question | Blocks | +|----|----------|--------| +| OQ-01 | Some persons appear twice with slightly different data (rows 127/138 — Christa Schütz/Siebert; rows 129/139 — Christoph Seils). 
Deduplicate in the tool or leave as duplicates for the backend to handle? | §8 persons array | +| OQ-02 | `birthPlace` / `deathPlace` are in the source but absent from the `Person` entity. Should they go into `notes`, or should the backend importer add new columns? | §8 persons array, future backend importer | +| OQ-03 | The `firstName` for `"Charlotte,Meta,Jacobi"` (row 7 / row 120) is a comma-separated multi-name. Store verbatim or split into `firstName` + `alias`? | §5 name normalization |