docs(importer): add Personendatei importer design spec
Two-pass Python tool (persons_tree.py) that normalizes import/Personendatei 2.xlsx into canonical-persons-tree.json with persons, SPOUSE_OF/PARENT_OF relationships, and an unresolved[] list for manual review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,292 @@
# Personendatei Importer — Design Spec

**Date:** 2026-05-25
**Source file:** `import/Personendatei 2.xlsx`
**Output:** `tools/import-normalizer/out/canonical-persons-tree.json`
**Tool location:** `tools/import-normalizer/persons_tree.py`

---

## 1. Purpose

Normalize the 163-person family register in `Personendatei 2.xlsx` into a machine-readable JSON file that a future backend importer can consume to seed the `persons` and `person_relationships` tables. The tool is offline (no backend required) and produces a reviewable artifact with an explicit `unresolved[]` list for manual follow-up.

---

## 2. Source Data — Column Map

Sheet: `Tabelle1` (rows 2–164; row 1 is the header).

| Col | Header | Content | Notes |
|-----|--------|---------|-------|
| A | Generation | `G 0`–`G 5` | Generation relative to Herbert & Clara Cram (G 2). Inconsistent formatting: `"G3"`, `"G 0"`, `"G 2 de Gruyter"`. Take the first digit sequence and parse it as an integer (see §5). |
| B | Familienname | Last name | Sometimes compound: `"de Gruyter"`, `"Cram Heydrich"`, `"Burkhard- Meier"` |
| C | Vorname | First name | Sometimes multiple: `"Charlotte,Meta,Jacobi"`, nicknames in parens: `"Otto (Herbert)"` |
| D | geb als | Maiden name | Used as a name alias for matching |
| E | Geburtsdatum | Birth date | **Mixed types**; see §4 |
| F | Geburtsort | Birth place | Free-text string, stored verbatim |
| G | Todesdatum | Death date | Same mixed types as col E |
| H | Sterbeort | Death place | Free-text string, stored verbatim |
| I | verheiratet mit | Spouse name | Partial name in either `"Firstname Lastname"` or `"Lastname Firstname"` order |
| J | Bemerkung | German relationship notes | `"Sohn v Clara u Herbert"`, `"Nichte v Herbert"`, free text |

---

## 3. Two-Pass Architecture

### Pass 1 — Parse & Normalize (rows → person records)
For each row:

1. Read all 10 columns.
2. Assign a stable `rowId`: `"row_{i:03d}"`, where `i` is the 1-based spreadsheet row number. The header occupies row 1, so the first person is `row_002`.
3. Normalize fields per §4 and §5.
4. Add the row to the **name-lookup index** (see §6).
5. Emit a person record.
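
As a rough illustration of step 2, the sketch below pairs data rows with `rowId`s. It assumes the rows have already been read from the sheet into plain tuples; `assign_row_ids`, `first_data_row`, and the sample rows are hypothetical and not part of the tool.

```python
from typing import Any, Iterable


def assign_row_ids(
    rows: Iterable[tuple[Any, ...]], first_data_row: int = 2
) -> list[dict[str, Any]]:
    """Pair each data row with a rowId derived from its spreadsheet row number."""
    records = []
    # Row 1 is the header, so numbering of data rows starts at 2.
    for i, row in enumerate(rows, start=first_data_row):
        records.append({"rowId": f"row_{i:03d}", "cells": row})
    return records


# Made-up sample rows (only the first three columns are shown).
sample = [("G 3", "Allemeyer", "Elsgard"), ("G 2", "Cram", "Herbert")]
records = assign_row_ids(sample)
print([r["rowId"] for r in records])
```

Deriving the id from the spreadsheet row rather than from a running counter keeps ids traceable back to the source file during manual review.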
### Pass 2 — Resolve Relationships
Walk every person record:

1. Resolve col I (spouse) → emit a `SPOUSE_OF` edge or an `unresolved` entry.
2. Parse col J (Bemerkung) for parent/child patterns → emit `PARENT_OF` edges or `unresolved` entries.
3. Append unmatched Bemerkung text to `person.notes`.

---

## 4. Date Parsing

Both col E (birth) and col G (death) arrive as either an Excel numeric serial or a string.

### Excel serial conversion

When the cell value is numeric (an integer or a float) rather than a string:
```
from datetime import datetime, timedelta

date = datetime(1899, 12, 30) + timedelta(days=int(value))
year = date.year
```
Using 1899-12-30 as the epoch compensates for the nonexistent 1900-02-29 that Excel inherited from Lotus 1-2-3, so the conversion is correct for serials from 61 (1900-03-01) onward.

### String fallback — reuse existing `dates.parse_date()`
Pass the raw string to the existing `tools/import-normalizer/dates.parse_date()`. It already handles:
- `DD.MM.YYYY` and `D.M.YY`
- Year-only (`1930`)
- Month + year (`August 1941`, `Sept. 1913`)
- Partial/approximate markers

Extract `.year` from the returned `ParsedDate.iso` if `iso` is not `None`.

### Unresolvable dates

If both paths yield `None` (e.g. `"2.9.196"`, `"4.3.1023"`, `".12.1955"`):
- Set `birthYear`/`deathYear` to `null`.
- Append the raw value to `person.notes` as `"[Geburtsdatum: <raw>]"` or `"[Todesdatum: <raw>]"` for human review.
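
The full flow for one date cell could be sketched as follows. This is a minimal sketch under stated assumptions: the signature of `dates.parse_date()` is not shown in this spec, so the string path is passed in as a callable, and `toy_parse` is only a stand-in with an assumed plausibility window for years. `extract_year` and the treatment of empty cells are likewise hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Callable, Optional, Union

EXCEL_EPOCH = datetime(1899, 12, 30)


def extract_year(
    value: Union[int, float, str, None],
    field_label: str,
    parse_string: Callable[[str], Optional[int]],
) -> tuple[Optional[int], Optional[str]]:
    """Return (year, note); note is set only when a non-empty value is unparseable."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return None, None  # assumption: an empty cell needs no review note
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return (EXCEL_EPOCH + timedelta(days=int(value))).year, None
    year = parse_string(str(value).strip())
    if year is not None:
        return year, None
    return None, f"[{field_label}: {value}]"


def toy_parse(raw: str) -> Optional[int]:
    """Stand-in for the real string parser: D.M.YYYY with a plausible year only."""
    m = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4})", raw)
    if not m:
        return None
    year = int(m.group(3))
    return year if 1500 <= year <= 2100 else None  # assumed plausibility window


print(extract_year(7306, "Geburtsdatum", toy_parse))        # serial for 1920-01-01
print(extract_year("12.03.1920", "Geburtsdatum", toy_parse))
print(extract_year("2.9.196", "Todesdatum", toy_parse))
```

Returning the note alongside the year lets the caller append it to `person.notes` without the date helper needing to know about person records.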

---

## 5. Person Record Normalization

### Name fields
- **lastName** = col B, stripped.
- **firstName** = col C. Keep as-is (including multi-name strings and parenthetical nicknames); the backend can split later.
- **maidenName** = col D, stripped. Stored in the JSON; the backend maps this to a `PersonNameAlias` of type `BIRTH_NAME`.
- **alias** = always `null`. The tool does not invent aliases; the maiden name travels in `maidenName` and is the only alias source.

### Generation
Extract the first digit sequence from col A:
```python
import re

m = re.search(r"\d+", raw_generation)
generation = int(m.group()) if m else None
```
Handles all observed variants: `"G 3"`, `"G3"`, `"G 0"`, `"G 2 de Gruyter"`.
Stored as `generation: int | null` in the JSON (informational; not mapped to a backend field directly).

### familyMember
Set `true` for all records. Every person in this register is part of the family network. The backend can refine this.

### notes

Constructed by concatenation:
1. Unmatched Bemerkung text (after the relationship pattern is stripped).
2. Unresolvable date raw values (prefixed with the field name).

---

## 6. Name Lookup Index

By the end of pass 1, a `dict[str, list[str]]` maps normalized name keys to lists of `rowId`s; pass 2 reads it to resolve names.

### Normalization function `_norm(s) -> str`

1. Lowercase.
2. Strip surrounding `"` and `'`.
3. Remove parenthetical substrings: `r"\([^)]*\)"`.
4. Collapse internal whitespace.
5. Strip geographic/honorific suffixes: `aachen`, `mex.`, `mexiko`, `sen`, `jun`, `jr`.
6. Strip trailing commas, dots.

### Keys indexed per person

For a person with firstName `F`, lastName `L`, maidenName `M`:
- `_norm(f"{F} {L}")`: canonical order
- `_norm(f"{L} {F}")`: reversed order (col I uses this heavily)
- `_norm(f"{F} {M}")` if maidenName is set: maiden-name reference
- `_norm(L)` alone: single-token fallback
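
A possible shape for the index build, assuming person records are plain dicts with the field names from §8. The `_norm` here is a simplified stand-in (lowercase plus whitespace collapse only), and `index_keys`, `build_index`, and the second sample person are hypothetical.

```python
from collections import defaultdict


def _norm(s: str) -> str:
    # Simplified stand-in for the real _norm: lowercase, collapse whitespace.
    return " ".join(s.lower().split())


def index_keys(person: dict) -> set[str]:
    """Return the lookup keys for one person record."""
    f, l, m = person["firstName"], person["lastName"], person.get("maidenName")
    keys = {_norm(f"{f} {l}"), _norm(f"{l} {f}"), _norm(l)}
    if m:
        keys.add(_norm(f"{f} {m}"))
    keys.discard("")  # never index an empty key
    return keys


def build_index(persons: list[dict]) -> dict[str, list[str]]:
    """Map each normalized key to the rowIds of all persons that produce it."""
    index = defaultdict(list)
    for p in persons:
        for key in sorted(index_keys(p)):
            index[key].append(p["rowId"])
    return dict(index)


persons = [
    {"rowId": "row_002", "firstName": "Elsgard", "lastName": "Allemeyer", "maidenName": "Wöhler"},
    {"rowId": "row_003", "firstName": "Karl", "lastName": "Allemeyer", "maidenName": None},  # invented
]
index = build_index(persons)
print(index["allemeyer"])
```

Collecting keys in a set per person avoids listing the same `rowId` twice under one key, which would otherwise turn a unique match into a false "ambiguous" result.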

### Match resolution

Given a raw name string from col I or col J:

1. `_norm(raw)` → look up in index.
2. **Exactly one hit** → match confirmed, use that `rowId`.
3. **Zero hits** → `reason: "not_found"` → `unresolved[]`.
4. **Multiple hits** → `reason: "ambiguous"` → `unresolved[]`.
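
The three outcomes can be expressed as one small function. This is a sketch, not the tool's actual code: `resolve_name` is a hypothetical name, the default `norm` is a simplified stand-in for `_norm`, and the index contents are invented.

```python
from typing import Callable, Optional


def resolve_name(
    raw: str,
    index: dict[str, list[str]],
    norm: Callable[[str], str] = lambda s: " ".join(s.lower().split()),
) -> tuple[Optional[str], Optional[str]]:
    """Return (rowId, None) on a unique hit, otherwise (None, reason)."""
    hits = index.get(norm(raw), [])
    if len(hits) == 1:
        return hits[0], None
    return None, ("not_found" if not hits else "ambiguous")


# Invented index contents for illustration.
index = {"herbert cram": ["row_019"], "cram": ["row_019", "row_020"]}
print(resolve_name("Herbert  Cram", index))
print(resolve_name("Cram", index))
print(resolve_name("Tante Lolly", index))
```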

---

## 7. Relationship Extraction

### 7.1 SPOUSE_OF (col I — `verheiratet mit`)

1. Normalize col I value.
2. Resolve via name index (§6).
3. If matched: emit one edge `{ personId, relatedPersonId, type: "SPOUSE_OF", source: "verheiratet_mit" }`.
   - Skip if an identical edge (regardless of direction) already exists in the relationship list.
4. If unresolved: add to `unresolved[]`.
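
One way to implement the direction-independent duplicate check in step 3 is to remember each pair as a `frozenset`. The helper name `add_spouse_edge` and the `seen` set are assumptions made for this sketch.

```python
def add_spouse_edge(
    relationships: list[dict],
    seen: set[frozenset],
    person_id: str,
    related_id: str,
) -> bool:
    """Append a SPOUSE_OF edge unless the same pair exists in either direction."""
    pair = frozenset((person_id, related_id))  # order-insensitive identity of the pair
    if pair in seen:
        return False
    seen.add(pair)
    relationships.append(
        {
            "personId": person_id,
            "relatedPersonId": related_id,
            "type": "SPOUSE_OF",
            "source": "verheiratet_mit",
        }
    )
    return True


relationships, seen = [], set()
first = add_spouse_edge(relationships, seen, "row_002", "row_003")
second = add_spouse_edge(relationships, seen, "row_003", "row_002")  # mirrored row
print(first, second, len(relationships))
```

Because both spouses usually name each other in col I, the first row processed determines the direction of the stored edge.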
### 7.2 PARENT_OF (col J — `Bemerkung`)

Apply these regex patterns in order, case-insensitive, with optional whitespace:

| Pattern | Direction | Note |
|---------|-----------|------|
| `(Sohn\|Tochter)\s+v(?:on)?\s+(.+)` | Named person(s) → this person | "Sohn v Clara u Herbert" |
| `(Vater\|Mutter)\s+v(?:on)?\s+(.+)` | This person → named person(s) | "Vater v Herbert" |

**Multi-parent extraction:** The parent string may contain two parents joined by `\s+u(?:nd)?\s+`. Split on this pattern, resolve each part independently.

**Emit** one `PARENT_OF` edge per resolved parent:
```json
{
  "personId": "<parent_rowId>",
  "relatedPersonId": "<child_rowId>",
  "type": "PARENT_OF",
  "source": "bemerkung",
  "rawBemerkung": "<original col J value>"
}
```

**Skip** (do not emit, do not add to `unresolved[]`, leave in notes):
- Patterns starting with `Neffe`, `Nichte`, `Enkel`, `Enkelin`, `Urenkel`, `Urenkelin`: too indirect.
- Patterns starting with `Bruder`, `Schwester`: SIBLING_OF is out of scope for this tool.
- Any other Bemerkung text that does not match the parent patterns.

**After extraction:** the matched portion of the Bemerkung is removed; the remainder goes into `person.notes`.
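
The pattern matching and the split on `u`/`und` could be sketched like this. The regexes are the ones from the table, written with a plain `|` instead of the table escape. Anchoring at the start of the text with `match()` is an assumption made here so that compounds such as `Schwiegersohn` cannot match by accident; `parse_bemerkung` and the direction labels are hypothetical.

```python
import re
from typing import Optional

CHILD_PATTERN = re.compile(r"(Sohn|Tochter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
PARENT_PATTERN = re.compile(r"(Vater|Mutter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
JOINER = re.compile(r"\s+u(?:nd)?\s+", re.IGNORECASE)


def parse_bemerkung(raw: str) -> Optional[tuple[str, list[str]]]:
    """Return (direction, names) for a parent/child note, or None if it is skipped.

    direction "named_is_parent": each named person is a parent of this row.
    direction "named_is_child":  this row is a parent of each named person.
    """
    text = raw.strip()
    for direction, pattern in (
        ("named_is_parent", CHILD_PATTERN),
        ("named_is_child", PARENT_PATTERN),
    ):
        m = pattern.match(text)  # anchored at the start of the note
        if m:
            names = [n.strip() for n in JOINER.split(m.group(2)) if n.strip()]
            return direction, names
    return None  # e.g. Nichte, Bruder, free text: stays in notes


print(parse_bemerkung("Sohn v Clara u Herbert"))
print(parse_bemerkung("Vater v Herbert"))
print(parse_bemerkung("Nichte v Herbert"))
```

Each returned name would then go through the name index from §6; a name that fails to resolve becomes an `unresolved[]` entry rather than an edge.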

---

## 8. Output JSON Schema

File: `tools/import-normalizer/out/canonical-persons-tree.json`

```json
{
  "generated_at": "<ISO-8601 timestamp>",
  "source": "Personendatei 2.xlsx",
  "stats": {
    "persons": 163,
    "relationships": 87,
    "unresolved": 12
  },
  "persons": [
    {
      "rowId": "row_002",
      "firstName": "Elsgard",
      "lastName": "Allemeyer",
      "maidenName": "Wöhler",
      "alias": null,
      "notes": "Nichte von Herbert",
      "birthYear": 1920,
      "deathYear": 1999,
      "birthPlace": "Garz",
      "deathPlace": "Espelkamp",
      "generation": 3,
      "familyMember": true
    }
  ],
  "relationships": [
    {
      "personId": "row_002",
      "relatedPersonId": "row_003",
      "type": "SPOUSE_OF",
      "source": "verheiratet_mit"
    },
    {
      "personId": "row_019",
      "relatedPersonId": "row_021",
      "type": "PARENT_OF",
      "source": "bemerkung",
      "rawBemerkung": "Tochter v Clara u Herbert"
    }
  ],
  "unresolved": [
    {
      "rowId": "row_007",
      "field": "verheiratet_mit",
      "raw": "\"Tante Lolly\"",
      "reason": "not_found"
    },
    {
      "rowId": "row_042",
      "field": "bemerkung",
      "raw": "Zwillingsbruder v Herbert",
      "reason": "not_found"
    }
  ]
}
```
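
A reviewer might want a quick consistency check on the generated file. The sketch below assumes that each `stats` counter is meant to equal the length of the matching array and that every edge must reference an existing `rowId`; neither rule is stated in this spec, and `check_tree` is a hypothetical helper.

```python
def check_tree(doc: dict) -> list[str]:
    """Return human-readable problems found in a parsed output document."""
    problems = []
    # Assumed rule: stats counters mirror the array lengths.
    for name in ("persons", "relationships", "unresolved"):
        if doc["stats"][name] != len(doc[name]):
            problems.append(f"stats.{name} does not match len({name})")
    # Assumed rule: both ends of every edge exist in persons[].
    known = {p["rowId"] for p in doc["persons"]}
    for i, edge in enumerate(doc["relationships"]):
        for key in ("personId", "relatedPersonId"):
            if edge[key] not in known:
                problems.append(
                    f"relationships[{i}].{key} references unknown rowId {edge[key]}"
                )
    return problems


doc = {
    "stats": {"persons": 1, "relationships": 1, "unresolved": 0},
    "persons": [{"rowId": "row_002"}],
    "relationships": [
        {"personId": "row_002", "relatedPersonId": "row_999", "type": "SPOUSE_OF"}
    ],
    "unresolved": [],
}
print(check_tree(doc))
```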

---

## 9. CLI Interface

```
python3 persons_tree.py [--input PATH] [--output PATH] [--dry-run]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--input` | `../../import/Personendatei 2.xlsx` | Source Excel file |
| `--output` | `out/canonical-persons-tree.json` | Output JSON file |
| `--dry-run` | off | Print stats + first 5 unresolved entries; do not write file |

On success, print:
```
✓ 163 persons parsed
✓ 87 relationships emitted (52 SPOUSE_OF, 35 PARENT_OF)
⚠ 12 unresolved (see unresolved[] in output)
→ out/canonical-persons-tree.json
```
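
The flag table maps directly onto `argparse` from the standard library. The sketch below only builds the parser; `build_parser` is a hypothetical name, and the argument list is passed explicitly so the example does not depend on the real command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three flags and defaults from the table above."""
    parser = argparse.ArgumentParser(prog="persons_tree.py")
    parser.add_argument(
        "--input",
        default="../../import/Personendatei 2.xlsx",
        help="Source Excel file",
    )
    parser.add_argument(
        "--output",
        default="out/canonical-persons-tree.json",
        help="Output JSON file",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print stats + first 5 unresolved entries; do not write file",
    )
    return parser


args = build_parser().parse_args(["--dry-run"])
print(args.input, args.output, args.dry_run)
```

Both default paths are relative, so they resolve against the current working directory; the defaults in the table suggest the tool is run from `tools/import-normalizer/`.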

---

## 10. Module Reuse

| Existing module | What we reuse |
|-----------------|---------------|
| `dates.parse_date()` | String date parsing: handles DD.MM.YYYY, year-only, month+year, approximate markers |
| `config.MONTHS` | Month name → integer mapping (German + Spanish month names already present) |

The Excel serial conversion is new logic added directly in `persons_tree.py` (3 lines).

---

## 11. What This Tool Does NOT Do

- Does not call the backend API or touch the database.
- Does not create `PersonNameAlias` records; it emits `maidenName` as a field, and the future backend importer maps it.
- Does not infer SIBLING_OF edges (this requires symmetric lookup across multiple rows and is deferred).
- Does not deduplicate persons that appear in both this file and `canonical-persons.xlsx`; deduplication is the backend importer's responsibility.
- Does produce `birthPlace` / `deathPlace` as top-level fields in the JSON (see §8). They are free-text strings and informational only. The `Person` entity has no corresponding columns; the future backend importer decides whether to add columns or fold the values into `notes`.

---

## 12. Open Questions

| OQ | Question | Blocks |
|----|----------|--------|
| OQ-01 | Some persons appear twice with slightly different data (rows 127/138: Christa Schütz/Siebert; rows 129/139: Christoph Seils). Deduplicate in the tool or leave as duplicates for the backend to handle? | §8 persons array |
| OQ-02 | `birthPlace` / `deathPlace` are in the source but absent from the `Person` entity. Should they go into `notes`, or should the backend importer add new columns? | §8 persons array, future backend importer |
| OQ-03 | The `firstName` for `"Charlotte,Meta,Jacobi"` (row 7 / row 120) is a comma-separated multi-name. Store verbatim or split into `firstName` + `alias`? | §5 name normalization |