Personendatei Importer — Design Spec
Date: 2026-05-25
Source file: import/Personendatei 2.xlsx
Output: tools/import-normalizer/out/canonical-persons-tree.json
Tool location: tools/import-normalizer/persons_tree.py
1. Purpose
Normalize the 163-person family register in Personendatei 2.xlsx into a machine-readable JSON file that a future backend importer can consume to seed the persons and person_relationships tables. The tool is offline (no backend required) and produces a reviewable artifact with an explicit unresolved[] list for manual follow-up.
2. Source Data — Column Map
Sheet: Tabelle1 (rows 2–164; row 1 is the header).
| Col | Header | Content | Notes |
|---|---|---|---|
| A | Generation | G 0–G 5 | Generation relative to Herbert & Clara Cram (G 2). Inconsistent formatting: "G3", "G 0", "G 2 de Gruyter"; strip non-digit chars and parse the integer. |
| B | Familienname | Last name | Sometimes compound: "de Gruyter", "Cram Heydrich", "Burkhard- Meier" |
| C | Vorname | First name | Sometimes multiple: "Charlotte,Meta,Jacobi", nicknames in parens: "Otto (Herbert)" |
| D | geb als | Maiden name | Used as a name alias for matching |
| E | Geburtsdatum | Birth date | Mixed types — see §4 |
| F | Geburtsort | Birth place | Free-text string, stored verbatim |
| G | Todesdatum | Death date | Same mixed types as col E |
| H | Sterbeort | Death place | Free-text string, stored verbatim |
| I | verheiratet mit | Spouse name | Partial name in either "Firstname Lastname" or "Lastname Firstname" order |
| J | Bemerkung | German relationship notes | "Sohn v Clara u Herbert", "Nichte v Herbert", free text |
3. Two-Pass Architecture
Pass 1 — Parse & Normalize (rows → person records)
For each row:
- Read all 10 columns.
- Assign a stable `rowId`: `"row_{i:03d}"` where `i` is the 1-based row number (e.g. `row_002`).
- Normalize fields per §4 and §5.
- Build the name-lookup index (see §6).
- Emit a person record.
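As a minimal sketch of the `rowId` format described above: `make_row_id` is a hypothetical helper name, and the range assumes the data rows 2–164 stated in §2.

```python
def make_row_id(row_number: int) -> str:
    """Format a 1-based spreadsheet row number as a stable rowId."""
    return f"row_{row_number:03d}"


# Row 1 is the header, so data rows run from 2 to 164 inclusive.
row_ids = [make_row_id(i) for i in range(2, 165)]
```

Because the id is derived from the sheet row rather than from a counter, re-running the tool on an unchanged file should yield the same ids.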
Pass 2 — Resolve Relationships
Walk every person record:
- Resolve col I (spouse) → emit `SPOUSE_OF` edge or `unresolved` entry.
- Parse col J (Bemerkung) for parent/child patterns → emit `PARENT_OF` edges or `unresolved` entries.
- Append unmatched Bemerkung text to `person.notes`.
4. Date Parsing
Both col E (birth) and col G (death) arrive as either an Excel numeric serial or a string.
Excel serial conversion
When the cell value is numeric (an integer or a float) rather than a string:
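```python
from datetime import datetime, timedelta
# value is the numeric cell content; int() drops any time-of-day fraction
date = datetime(1899, 12, 30) + timedelta(days=int(value))
year = date.year
```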
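Day zero is 1899-12-30 rather than 1899-12-31 because Excel inherited the Lotus 1-2-3 leap-year bug and counts a nonexistent 1900-02-29. With this offset the conversion is correct for every serial from 61 (1900-03-01) onward.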
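The conversion can be wrapped in a small guard so that only numeric cells take this path. This is a sketch, not the tool's actual code: `excel_serial_to_year` and `EXCEL_DAY_ZERO` are hypothetical names.

```python
from datetime import datetime, timedelta

EXCEL_DAY_ZERO = datetime(1899, 12, 30)


def excel_serial_to_year(value):
    """Return the calendar year for a numeric Excel serial, else None."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    # int() truncates a time-of-day fraction such as 25569.75.
    return (EXCEL_DAY_ZERO + timedelta(days=int(value))).year
```

Serial 25569 corresponds to 1970-01-01, which makes a convenient sanity check. Strings fall through with `None` so the caller can try the string parser next.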
String fallback — reuse existing dates.parse_date()
Pass the raw string to the existing tools/import-normalizer/dates.parse_date(). It already handles:
- `DD.MM.YYYY` and `D.M.YY`
- Year-only (`1930`)
- Month + year (`August 1941`, `Sept. 1913`)
- Partial/approximate markers
Extract .year from the returned ParsedDate.iso if iso is not None.
Unresolvable dates
If both paths yield None (e.g. "2.9.196", "4.3.1023", ".12.1955"):
- Set `birthYear`/`deathYear` to `null`.
- Append the raw value to `person.notes` as `"[Geburtsdatum: <raw>]"` or `"[Todesdatum: <raw>]"` for human review.
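The two parsing paths and the note fallback might be combined roughly as follows. Everything here is illustrative: `resolve_year` is a hypothetical name, `parse_year_from_string` stands in for a wrapper around `dates.parse_date()` whose real signature is not shown in this spec, and treating an empty cell as "no note" is an assumption.

```python
from datetime import datetime, timedelta


def resolve_year(raw, label, notes, parse_year_from_string):
    """Try the serial path, then the string path; record failures in notes."""
    if raw is None or raw == "":
        return None  # assumption: an empty cell produces no review note
    year = None
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        year = (datetime(1899, 12, 30) + timedelta(days=int(raw))).year
    elif isinstance(raw, str):
        year = parse_year_from_string(raw)
    if year is None:
        notes.append(f"[{label}: {raw}]")
    return year


def stub_parse_year(text):
    """Toy stand-in for the real parser: accepts 4-digit year-only strings."""
    text = text.strip()
    return int(text) if len(text) == 4 and text.isdigit() else None


notes = []
birth_year = resolve_year(".12.1955", "Geburtsdatum", notes, stub_parse_year)
```

After the last line, `birth_year` is `None` and `notes` holds `"[Geburtsdatum: .12.1955]"`, which is the shape the review step expects.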
5. Person Record Normalization
Name fields
- lastName = col B, stripped.
- firstName = col C. Keep as-is (including multi-name strings and parenthetical nicknames) — the backend can split later.
- maidenName = col D, stripped. Stored in the JSON; the backend maps this to a `PersonNameAlias` of type `BIRTH_NAME`.
- alias = `null` (the tool does not invent aliases; maiden name is the alias).
Generation
Extract the first digit sequence from col A:
```python
import re
m = re.search(r"\d+", raw_generation)
generation = int(m.group()) if m else None
```
Handles all observed variants: "G 3", "G3", "G 0", "G 2 de Gruyter".
Stored as generation: int | null in the JSON (informational; not mapped to a backend field directly).
familyMember
Set true for all records. Every person in this register is part of the family network. The backend can refine this.
notes
Constructed by concatenation:
- Unmatched Bemerkung text (after relationship pattern is stripped).
- Unresolvable date raw values (prefixed with field name).
6. Name Lookup Index
After pass 1, build a dict[str, list[str]] mapping normalized name keys → list of rowIds.
Normalization function _norm(s) -> str
- Lowercase.
- Strip surrounding `"` and `'`.
- Remove parenthetical substrings: `r"\([^)]*\)"`.
- Collapse internal whitespace.
- Strip geographic/honorific suffixes: `aachen`, `mex.`, `mexiko`, `sen`, `jun`, `jr`.
- Strip trailing commas, dots.
Keys indexed per person
For a person with firstName F, lastName L, maidenName M:
- `_norm(f"{F} {L}")`: canonical order
- `_norm(f"{L} {F}")`: reversed order (col I uses this heavily)
- `_norm(f"{F} {M}")` if maidenName is set: maiden-name reference
- `_norm(L)` alone: single-token fallback
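A sketch of how the four keys might be indexed per person. `build_index` is a hypothetical name, and the `_norm` here is a deliberately simplified stand-in (lowercase plus whitespace collapse) so the block runs on its own.

```python
from collections import defaultdict


def _norm(s):
    # Simplified stand-in for the full _norm(): lowercase, collapse whitespace.
    return " ".join(s.lower().split())


def build_index(persons):
    """Map each normalized name key to the rowIds of the persons it names."""
    index = defaultdict(list)
    for p in persons:
        f, l, m = p["firstName"], p["lastName"], p.get("maidenName")
        # A set avoids listing one rowId twice under the same key.
        keys = {_norm(f"{f} {l}"), _norm(f"{l} {f}"), _norm(l)}
        if m:
            keys.add(_norm(f"{f} {m}"))
        for key in keys:
            if key:
                index[key].append(p["rowId"])
    return index
```

Note that the last-name-only key is shared by everyone with that surname, so it will usually resolve as ambiguous and only helps for surnames that occur once.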
Match resolution
Given a raw name string from col I or col J:
- `_norm(raw)` → look up in index.
- Exactly one hit → match confirmed, use that `rowId`.
- Zero hits → `reason: "not_found"` → `unresolved[]`.
- Multiple hits → `reason: "ambiguous"` → `unresolved[]`.
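The three outcomes above reduce to a short lookup. In this sketch `resolve_name` is a hypothetical name and the default `normalize` is a simplified placeholder for `_norm`.

```python
def resolve_name(raw, index, normalize=lambda s: " ".join(s.lower().split())):
    """Return (rowId, None) on a unique hit, else (None, reason)."""
    hits = index.get(normalize(raw), [])
    if len(hits) == 1:
        return hits[0], None
    return None, ("not_found" if not hits else "ambiguous")
```

Returning the reason alongside the id lets the caller build the `unresolved[]` entry without repeating the lookup.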
7. Relationship Extraction
7.1 SPOUSE_OF (col I — verheiratet mit)
- Normalize col I value.
- Resolve via name index (§6).
- If matched: emit one edge `{ personId, relatedPersonId, type: "SPOUSE_OF", source: "verheiratet_mit" }`.
  - Skip if an identical edge (regardless of direction) already exists in the relationship list.
- If unresolved: add to `unresolved[]`.
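One way to make the "regardless of direction" check cheap is to remember each couple as an unordered pair. This is a sketch under that assumption; `add_spouse_edge` and `seen_pairs` are hypothetical names.

```python
def add_spouse_edge(relationships, seen_pairs, person_id, spouse_id):
    """Append a SPOUSE_OF edge unless the same couple was already emitted."""
    pair = frozenset((person_id, spouse_id))  # order-independent key
    if pair in seen_pairs:
        return False
    seen_pairs.add(pair)
    relationships.append({
        "personId": person_id,
        "relatedPersonId": spouse_id,
        "type": "SPOUSE_OF",
        "source": "verheiratet_mit",
    })
    return True
```

Since both spouses usually name each other in col I, the second call for a couple returns `False` and the relationship list keeps a single edge.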
7.2 PARENT_OF (col J — Bemerkung)
Apply these regex patterns in order, case-insensitive, with optional whitespace:
| Pattern | Direction | Note |
|---|---|---|
| `(Sohn\|Tochter)\s+v(?:on)?\s+(.+)` | Named person(s) → this person | "Sohn v Clara u Herbert" |
| `(Vater\|Mutter)\s+v(?:on)?\s+(.+)` | This person → named person(s) | "Vater v Herbert" |
Multi-parent extraction: The parent string may contain two parents joined by \s+u(?:nd)?\s+. Split on this pattern, resolve each part independently.
Emit one PARENT_OF edge per resolved parent:
```json
{
  "personId": "<parent_rowId>",
  "relatedPersonId": "<child_rowId>",
  "type": "PARENT_OF",
  "source": "bemerkung",
  "rawBemerkung": "<original col J value>"
}
```
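The two patterns and the multi-name split can be sketched as below. The names `extract_parent_refs`, `CHILD_OF`, `PARENT_OF_RE`, and `JOINER` are hypothetical, and the leading `\b` is an added guard (an assumption, not part of the spec) so that a compound such as "Großvater" does not match the `Vater` pattern.

```python
import re

CHILD_OF = re.compile(r"\b(Sohn|Tochter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
PARENT_OF_RE = re.compile(r"\b(Vater|Mutter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
JOINER = re.compile(r"\s+u(?:nd)?\s+", re.IGNORECASE)


def extract_parent_refs(bemerkung):
    """Return (direction, [raw names]) for the first matching pattern, else None."""
    for direction, pattern in (
        ("named_is_parent", CHILD_OF),      # "Sohn v X": X is the parent
        ("named_is_child", PARENT_OF_RE),   # "Vater v X": X is the child
    ):
        m = pattern.search(bemerkung)
        if m:
            names = [n.strip() for n in JOINER.split(m.group(2)) if n.strip()]
            return direction, names
    return None
```

Each returned name would then go through the name index separately, so one resolvable parent still yields an edge even when the other lands in `unresolved[]`.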
Skip (do not emit, do not add to unresolved[], leave in notes):
- Patterns starting with `Neffe`, `Nichte`, `Enkel`, `Enkelin`, `Urenkel`, `Urenkelin`: too indirect.
- Patterns starting with `Bruder`, `Schwester`: SIBLING_OF is out of scope for this tool.
- Any other Bemerkung text that does not match the parent patterns.
After extraction: the matched portion of the Bemerkung is removed; the remainder goes into person.notes.
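A rough sketch of the "remove the matched portion, keep the rest" step. `remainder_for_notes` is a hypothetical name, the combined pattern is an assumption, and the input "Arzt, Sohn v Clara u Herbert" mentioned below is an invented example rather than a value from the sheet.

```python
import re

# Assumed combined form of the two parent patterns from §7.2.
PARENT_PATTERN = re.compile(
    r"\b(?:Sohn|Tochter|Vater|Mutter)\s+v(?:on)?\s+.+", re.IGNORECASE
)


def remainder_for_notes(bemerkung):
    """Return the Bemerkung text that should stay in person.notes."""
    m = PARENT_PATTERN.search(bemerkung)
    if not m:
        # Skipped patterns (Nichte, Bruder, free text) stay in notes unchanged.
        return bemerkung.strip()
    remainder = bemerkung[:m.start()] + bemerkung[m.end():]
    return remainder.strip(" ,;")
```

With the invented input "Arzt, Sohn v Clara u Herbert" only "Arzt" would remain for the notes, while "Nichte v Herbert" is kept whole because no parent pattern matches.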
8. Output JSON Schema
File: tools/import-normalizer/out/canonical-persons-tree.json
```json
{
  "generated_at": "<ISO-8601 timestamp>",
  "source": "Personendatei 2.xlsx",
  "stats": {
    "persons": 163,
    "relationships": 87,
    "unresolved": 12
  },
  "persons": [
    {
      "rowId": "row_002",
      "firstName": "Elsgard",
      "lastName": "Allemeyer",
      "maidenName": "Wöhler",
      "alias": null,
      "notes": "Nichte von Herbert",
      "birthYear": 1920,
      "deathYear": 1999,
      "birthPlace": "Garz",
      "deathPlace": "Espelkamp",
      "generation": 3,
      "familyMember": true
    }
  ],
  "relationships": [
    {
      "personId": "row_002",
      "relatedPersonId": "row_003",
      "type": "SPOUSE_OF",
      "source": "verheiratet_mit"
    },
    {
      "personId": "row_019",
      "relatedPersonId": "row_021",
      "type": "PARENT_OF",
      "source": "bemerkung",
      "rawBemerkung": "Tochter v Clara u Herbert"
    }
  ],
  "unresolved": [
    {
      "rowId": "row_007",
      "field": "verheiratet_mit",
      "raw": "\"Tante Lolly\"",
      "reason": "not_found"
    },
    {
      "rowId": "row_042",
      "field": "bemerkung",
      "raw": "Zwillingsbruder v Herbert",
      "reason": "not_found"
    }
  ]
}
```
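Assembling the top-level document from the three lists might look like this. `build_document` and `serialize` are hypothetical names; deriving `stats` from the list lengths and writing UTF-8 without ASCII escaping are assumptions about the intended behavior.

```python
import json
from datetime import datetime, timezone


def build_document(persons, relationships, unresolved):
    """Wrap the three result lists in the top-level schema shown above."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "source": "Personendatei 2.xlsx",
        "stats": {
            "persons": len(persons),
            "relationships": len(relationships),
            "unresolved": len(unresolved),
        },
        "persons": persons,
        "relationships": relationships,
        "unresolved": unresolved,
    }


def serialize(document):
    # ensure_ascii=False keeps names such as "Wöhler" readable in the file.
    return json.dumps(document, ensure_ascii=False, indent=2)
```

Computing the counts from the lists keeps `stats` from drifting out of sync with the arrays a reviewer actually sees.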
9. CLI Interface
```shell
python3 persons_tree.py [--input PATH] [--output PATH] [--dry-run]
```
| Flag | Default | Description |
|---|---|---|
| `--input` | `../../import/Personendatei 2.xlsx` | Source Excel file |
| `--output` | `out/canonical-persons-tree.json` | Output JSON file |
| `--dry-run` | off | Print stats + first 5 unresolved entries; do not write file |
On success, print:
```text
✓ 163 persons parsed
✓ 87 relationships emitted (52 SPOUSE_OF, 35 PARENT_OF)
⚠ 12 unresolved (see unresolved[] in output)
→ out/canonical-persons-tree.json
```
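The three flags map directly onto `argparse`. The sketch below uses the defaults from the table; `build_parser` is a hypothetical function name and the help strings are paraphrased.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="persons_tree.py")
    parser.add_argument(
        "--input",
        default="../../import/Personendatei 2.xlsx",
        help="Source Excel file",
    )
    parser.add_argument(
        "--output",
        default="out/canonical-persons-tree.json",
        help="Output JSON file",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print stats and the first 5 unresolved entries; do not write a file",
    )
    return parser


args = build_parser().parse_args(["--dry-run"])
```

`argparse` exposes `--dry-run` as `args.dry_run`, which is `False` unless the flag is passed.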
10. Module Reuse
| Existing module | What we reuse |
|---|---|
| `dates.parse_date()` | String date parsing: handles DD.MM.YYYY, year-only, month+year, approximate markers |
| `config.MONTHS` | Month name → integer mapping (German + Spanish month names already present) |
The Excel serial conversion is new logic added directly in persons_tree.py (3 lines).
11. What This Tool Does NOT Do
- Does not call the backend API or touch the database.
- Does not create `PersonNameAlias` records: it emits `maidenName` as a field, and the future backend importer maps it.
- Does not infer SIBLING_OF edges (requires symmetric lookup across multiple rows; deferred).
- Does not deduplicate persons that appear in both this file and `canonical-persons.xlsx`: that cross-file deduplication is the backend importer's responsibility.
- Does produce `birthPlace`/`deathPlace` as fields on each person record in the JSON (see §8). They are free-text strings and informational only. The `Person` entity has no corresponding columns; the future backend importer decides whether to add columns or fold the values into `notes`.
12. Resolved Decisions
| OQ | Question | Decision |
|---|---|---|
| OQ-01 | Duplicate rows (127/138 — Christa Schütz; 129/139 — Christoph Seils). | Tool deduplicates. On pass 1, after building the person list, detect rows with identical (firstName, lastName, birthYear) and keep only the first occurrence. Log skipped row ids to stdout. |
| OQ-02 | birthPlace / deathPlace absent from Person entity. | Keep as separate fields on each person record in the JSON (birthPlace, deathPlace). The future backend importer may add columns to the persons table; the fields are preserved here to avoid data loss. |
| OQ-03 | firstName = "Charlotte,Meta,Jacobi" (multi-name comma string). | Store verbatim as firstName. No splitting. |
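The OQ-01 decision amounts to a first-occurrence filter on the `(firstName, lastName, birthYear)` triple. A minimal sketch, with `deduplicate` as a hypothetical name; the birth years of the duplicated rows are not given in this spec, so `None` is used as a placeholder in the sample.

```python
def deduplicate(persons):
    """Keep the first record per (firstName, lastName, birthYear) triple."""
    seen = set()
    kept, skipped = [], []
    for p in persons:
        key = (p["firstName"], p["lastName"], p["birthYear"])
        if key in seen:
            skipped.append(p["rowId"])  # row ids to log to stdout
        else:
            seen.add(key)
            kept.append(p)
    return kept, skipped


sample = [
    {"rowId": "row_127", "firstName": "Christa", "lastName": "Schütz", "birthYear": None},
    {"rowId": "row_129", "firstName": "Christoph", "lastName": "Seils", "birthYear": None},
    {"rowId": "row_138", "firstName": "Christa", "lastName": "Schütz", "birthYear": None},
]
kept, skipped = deduplicate(sample)
```

One caveat of this key: two distinct people who share a name and both lack a parseable birth year would also collapse into one record, so the logged row ids are worth a manual check.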