familienarchiv/docs/superpowers/specs/2026-05-25-personendatei-importer-design.md
Marcel 6103d5d229 docs(importer): resolve open questions in Personendatei importer spec
OQ-01: tool deduplicates rows with identical (firstName, lastName, birthYear)
OQ-02: birthPlace/deathPlace kept as separate JSON fields
OQ-03: multi-name firstName stored verbatim

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 20:28:45 +02:00


Personendatei Importer — Design Spec

Date: 2026-05-25
Source file: import/Personendatei 2.xlsx
Output: tools/import-normalizer/out/canonical-persons-tree.json
Tool location: tools/import-normalizer/persons_tree.py


1. Purpose

Normalize the 163-person family register in Personendatei 2.xlsx into a machine-readable JSON file that a future backend importer can consume to seed the persons and person_relationships tables. The tool is offline (no backend required) and produces a reviewable artifact with an explicit unresolved[] list for manual follow-up.


2. Source Data — Column Map

Sheet: Tabelle1 (rows 2–164; row 1 is the header).

| Col | Header | Content | Notes |
| --- | --- | --- | --- |
| A | Generation | G 0–G 5 | Generation relative to Herbert & Clara Cram (G 2). Inconsistent formatting: "G3", "G 0", "G 2 de Gruyter"; strip non-digit chars and parse the integer. |
| B | Familienname | Last name | Sometimes compound: "de Gruyter", "Cram Heydrich", "Burkhard- Meier" |
| C | Vorname | First name | Sometimes multiple: "Charlotte,Meta,Jacobi", nicknames in parens: "Otto (Herbert)" |
| D | geb als | Maiden name | Used as a name alias for matching |
| E | Geburtsdatum | Birth date | Mixed types; see §4 |
| F | Geburtsort | Birth place | Free-text string, stored verbatim |
| G | Todesdatum | Death date | Same mixed types as col E |
| H | Sterbeort | Death place | Free-text string, stored verbatim |
| I | verheiratet mit | Spouse name | Partial name in either "Firstname Lastname" or "Lastname Firstname" order |
| J | Bemerkung | German relationship notes | "Sohn v Clara u Herbert", "Nichte v Herbert", free text |

3. Two-Pass Architecture

Pass 1 — Parse & Normalize (rows → person records)

For each row:

  1. Read all 10 columns.
  2. Assign a stable rowId: "row_{i:03d}" where i is the 1-based row number (e.g. row_002).
  3. Normalize fields per §4 and §5.
  4. Build the name-lookup index (see §6).
  5. Emit a person record.

Pass 2 — Resolve Relationships

Walk every person record:

  1. Resolve col I (spouse) → emit SPOUSE_OF edge or unresolved entry.
  2. Parse col J (Bemerkung) for parent/child patterns → emit PARENT_OF edges or unresolved entries.
  3. Append unmatched Bemerkung text to person.notes.

4. Date Parsing

Both col E (birth) and col G (death) arrive as either an Excel numeric serial or a string.

Excel serial conversion

When the cell value is numeric (an int or a float rather than a string):

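from datetime import datetime, timedelta
# value is the numeric Excel serial; int() drops any fractional (time-of-day) part
date = datetime(1899, 12, 30) + timedelta(days=int(value))
year = date.year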

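Excel's effective epoch is 1899-12-30. The offset absorbs the Lotus 1-2-3 leap-year bug (a phantom 1900-02-29), so the conversion is exact for serials from 61 (1900-03-01) onward.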

String fallback — reuse existing dates.parse_date()

Pass the raw string to the existing tools/import-normalizer/dates.parse_date(). It already handles:

  • DD.MM.YYYY and D.M.YY
  • Year-only (1930)
  • Month + year (August 1941, Sept. 1913)
  • Partial/approximate markers

Extract .year from the returned ParsedDate.iso if iso is not None.

Unresolvable dates

If both paths yield None (e.g. "2.9.196", "4.3.1023", ".12.1955"):

  • Set birthYear/deathYear to null.
  • Append the raw value to person.notes as "[Geburtsdatum: <raw>]" or "[Todesdatum: <raw>]" for human review.

5. Person Record Normalization

Name fields

  • lastName = col B, stripped.
  • firstName = col C. Keep as-is (including multi-name strings and parenthetical nicknames) — the backend can split later.
  • maidenName = col D, stripped. Stored in the JSON; the backend maps this to a PersonNameAlias of type BIRTH_NAME.
  • alias = null (the tool does not invent aliases; maiden name is the alias).

Generation

Extract the first digit sequence from col A:

import re
m = re.search(r"\d+", raw_generation)
generation = int(m.group()) if m else None

Handles all observed variants: "G 3", "G3", "G 0", "G 2 de Gruyter". Stored as generation: int | null in the JSON (informational; not mapped to a backend field directly).

familyMember

Set true for all records. Every person in this register is part of the family network. The backend can refine this.

notes

Constructed by concatenation:

  1. Unmatched Bemerkung text (after relationship pattern is stripped).
  2. Unresolvable date raw values (prefixed with field name).
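
The concatenation above might look like the sketch below. The spec does not name a separator or say what an empty result should be, so the single-space separator, the None result for empty notes, and the function name `build_notes` are all assumptions.

```python
from typing import Iterable, Optional


def build_notes(bemerkung_rest: Optional[str], date_notes: Iterable[str]) -> Optional[str]:
    """Join leftover Bemerkung text and unresolvable-date markers, in that order."""
    parts = []
    if bemerkung_rest and bemerkung_rest.strip():
        parts.append(bemerkung_rest.strip())
    # Date markers are expected to arrive already prefixed, e.g. "[Geburtsdatum: 2.9.196]".
    parts.extend(n for n in date_notes if n)
    return " ".join(parts) if parts else None
```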

6. Name Lookup Index

After pass 1, build a dict[str, list[str]] mapping normalized name keys → list of rowIds.

Normalization function _norm(s) -> str

  1. Lowercase.
  2. Strip surrounding " and '.
  3. Remove parenthetical substrings: r"\([^)]*\)".
  4. Collapse internal whitespace.
  5. Strip geographic/honorific suffixes: aachen, mex., mexiko, sen, jun, jr.
  6. Strip trailing commas, dots.

Keys indexed per person

For a person with firstName F, lastName L, maidenName M:

  • _norm(f"{F} {L}") — canonical order
  • _norm(f"{L} {F}") — reversed order (col I uses this heavily)
  • _norm(f"{F} {M}") if maidenName is set — maiden-name reference
  • _norm(L) alone — single-token fallback
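
A sketch of how the four keys could be generated and collected into the index. The `_norm` here is a simplified stand-in (lowercase plus whitespace collapse), `index_keys` and `build_index` are hypothetical helper names, and the second sample person is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def _norm(s: str) -> str:
    # Simplified stand-in for the full normalization function.
    return " ".join(s.lower().split())


def index_keys(first: str, last: str, maiden: Optional[str]) -> List[str]:
    keys = [_norm(f"{first} {last}"), _norm(f"{last} {first}")]
    if maiden:
        keys.append(_norm(f"{first} {maiden}"))
    keys.append(_norm(last))
    # Drop empty keys and duplicates while keeping order.
    return list(dict.fromkeys(k for k in keys if k))


def build_index(persons: List[dict]) -> Dict[str, List[str]]:
    index: Dict[str, List[str]] = defaultdict(list)
    for p in persons:
        for key in index_keys(p["firstName"], p["lastName"], p.get("maidenName")):
            index[key].append(p["rowId"])
    return dict(index)


persons = [
    {"rowId": "row_002", "firstName": "Elsgard", "lastName": "Allemeyer", "maidenName": "Wöhler"},
    # Hypothetical second record, only here to show a shared surname key.
    {"rowId": "row_003", "firstName": "Karl", "lastName": "Allemeyer", "maidenName": None},
]
index = build_index(persons)
```

Note that the surname-only key collects every person sharing that surname, so it resolves only when the surname is unique in the register.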

Match resolution

Given a raw name string from col I or col J:

  1. _norm(raw) → look up in index.
  2. Exactly one hit → match confirmed, use that rowId.
  3. Zero hits → reason: "not_found" → unresolved[].
  4. Multiple hits → reason: "ambiguous" → unresolved[].
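
The resolution rules can be expressed in a few lines. In this sketch `resolve_name` is a hypothetical name and the normalization function is passed in as a parameter. Collapsing repeated rowIds under one key is an added assumption: it covers the case where two of a person's keys normalize to the same string.

```python
from typing import Callable, Dict, List, Optional, Tuple


def resolve_name(
    raw: str,
    index: Dict[str, List[str]],
    norm: Callable[[str], str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (rowId, None) on a unique match, else (None, reason)."""
    hits = sorted(set(index.get(norm(raw), [])))
    if len(hits) == 1:
        return hits[0], None
    return None, ("not_found" if not hits else "ambiguous")
```

The caller would turn a non-None reason into an unresolved[] entry carrying the rowId, field, and raw value of the cell being resolved.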

7. Relationship Extraction

7.1 SPOUSE_OF (col I — verheiratet mit)

  1. Normalize col I value.
  2. Resolve via name index (§6).
  3. If matched: emit one edge { personId, relatedPersonId, type: "SPOUSE_OF", source: "verheiratet_mit" }.
    • Skip if an identical edge (regardless of direction) already exists in the relationship list.
  4. If unresolved: add to unresolved[].
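
Because both spouses usually name each other in col I, the direction-independent duplicate check matters. A minimal sketch, assuming edges are plain dicts and using a set of unordered pairs as the lookup; `add_spouse_edge` is a hypothetical name, and skipping self-references is an added assumption.

```python
from typing import FrozenSet, List, Set


def add_spouse_edge(
    edges: List[dict],
    seen: Set[FrozenSet[str]],
    person_id: str,
    related_id: str,
) -> bool:
    """Append a SPOUSE_OF edge unless the same pair already exists in either direction."""
    pair = frozenset((person_id, related_id))
    if person_id == related_id or pair in seen:
        return False
    seen.add(pair)
    edges.append({
        "personId": person_id,
        "relatedPersonId": related_id,
        "type": "SPOUSE_OF",
        "source": "verheiratet_mit",
    })
    return True
```

Tracking pairs in a set keeps the check constant-time instead of scanning the relationship list for every row.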

7.2 PARENT_OF (col J — Bemerkung)

Apply these regex patterns in order, case-insensitive, with optional whitespace:

  • Pattern: (Sohn|Tochter)\s+v(?:on)?\s+(.+)
    Direction: named person(s) → this person
    Example: "Sohn v Clara u Herbert"
  • Pattern: (Vater|Mutter)\s+v(?:on)?\s+(.+)
    Direction: this person → named person(s)
    Example: "Vater v Herbert"

Multi-parent extraction: The parent string may contain two parents joined by \s+u(?:nd)?\s+. Split on this pattern, resolve each part independently.

Emit one PARENT_OF edge per resolved parent:

{
  "personId": "<parent_rowId>",
  "relatedPersonId": "<child_rowId>",
  "type": "PARENT_OF",
  "source": "bemerkung",
  "rawBemerkung": "<original col J value>"
}

Skip (do not emit, do not add to unresolved[], leave in notes):

  • Patterns starting with Neffe, Nichte, Enkel, Enkelin, Urenkel, Urenkelin — too indirect.
  • Patterns starting with Bruder, Schwester — SIBLING_OF is out of scope for this tool.
  • Any other Bemerkung text that does not match the parent patterns.

After extraction: the matched portion of the Bemerkung is removed; the remainder goes into person.notes.
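
The pattern matching and multi-parent split could be sketched as below. The patterns are anchored at the start of the Bemerkung here, which is an assumption drawn from the "patterns starting with" wording of the skip rules. It keeps compounds such as "Schwiegertochter" from matching as "Tochter". `parse_bemerkung` and the direction labels are hypothetical names.

```python
import re
from typing import List, Optional, Tuple

_CHILD_OF = re.compile(r"^\s*(Sohn|Tochter)\s+v(?:on)?\s+(.+)$", re.IGNORECASE)
_PARENT_TO = re.compile(r"^\s*(Vater|Mutter)\s+v(?:on)?\s+(.+)$", re.IGNORECASE)
_AND = re.compile(r"\s+u(?:nd)?\s+", re.IGNORECASE)


def parse_bemerkung(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (direction, names), or None when no parent pattern matches.

    direction is "named_is_parent" for Sohn/Tochter and
    "named_is_child" for Vater/Mutter.
    """
    for direction, pattern in (
        ("named_is_parent", _CHILD_OF),
        ("named_is_child", _PARENT_TO),
    ):
        m = pattern.match(text)
        if m:
            # Split "Clara u Herbert" / "Clara und Herbert" into separate names.
            names = [n.strip() for n in _AND.split(m.group(2)) if n.strip()]
            return direction, names
    return None
```

Each returned name would then go through the name index individually, producing one PARENT_OF edge per resolved name. A None result means the text stays in person.notes untouched.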


8. Output JSON Schema

File: tools/import-normalizer/out/canonical-persons-tree.json

{
  "generated_at": "<ISO-8601 timestamp>",
  "source": "Personendatei 2.xlsx",
  "stats": {
    "persons": 163,
    "relationships": 87,
    "unresolved": 12
  },
  "persons": [
    {
      "rowId": "row_002",
      "firstName": "Elsgard",
      "lastName": "Allemeyer",
      "maidenName": "Wöhler",
      "alias": null,
      "notes": "Nichte von Herbert",
      "birthYear": 1920,
      "deathYear": 1999,
      "birthPlace": "Garz",
      "deathPlace": "Espelkamp",
      "generation": 3,
      "familyMember": true
    }
  ],
  "relationships": [
    {
      "personId": "row_002",
      "relatedPersonId": "row_003",
      "type": "SPOUSE_OF",
      "source": "verheiratet_mit"
    },
    {
      "personId": "row_019",
      "relatedPersonId": "row_021",
      "type": "PARENT_OF",
      "source": "bemerkung",
      "rawBemerkung": "Tochter v Clara u Herbert"
    }
  ],
  "unresolved": [
    {
      "rowId": "row_007",
      "field": "verheiratet_mit",
      "raw": "\"Tante Lolly\"",
      "reason": "not_found"
    },
    {
      "rowId": "row_042",
      "field": "bemerkung",
      "raw": "Zwillingsbruder v Herbert",
      "reason": "not_found"
    }
  ]
}
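
A sketch of how the top-level document might be assembled so that stats always agrees with the lists it summarizes. `build_output` and `to_json` are hypothetical names. The UTC timestamp and the `ensure_ascii=False` setting are assumptions, not requirements from the schema.

```python
import json
from datetime import datetime, timezone
from typing import List


def build_output(persons: List[dict], relationships: List[dict], unresolved: List[dict]) -> dict:
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": "Personendatei 2.xlsx",
        # Derive the counts from the lists instead of tracking them separately.
        "stats": {
            "persons": len(persons),
            "relationships": len(relationships),
            "unresolved": len(unresolved),
        },
        "persons": persons,
        "relationships": relationships,
        "unresolved": unresolved,
    }


def to_json(doc: dict) -> str:
    # ensure_ascii=False keeps umlauts such as "Wöhler" readable in the file.
    return json.dumps(doc, ensure_ascii=False, indent=2)
```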

9. CLI Interface

python3 persons_tree.py [--input PATH] [--output PATH] [--dry-run]

| Flag | Default | Description |
| --- | --- | --- |
| --input | ../../import/Personendatei 2.xlsx | Source Excel file |
| --output | out/canonical-persons-tree.json | Output JSON file |
| --dry-run | off | Print stats + first 5 unresolved entries; do not write file |

On success, print:

✓ 163 persons parsed
✓ 87 relationships emitted (52 SPOUSE_OF, 35 PARENT_OF)
⚠  12 unresolved (see unresolved[] in output)
→  out/canonical-persons-tree.json
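
The three flags map directly onto the standard-library argparse module. This is a sketch of one way to declare them; `build_parser` is a hypothetical name, and relative defaults are resolved against the working directory, which the spec does not pin down.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="persons_tree.py")
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("../../import/Personendatei 2.xlsx"),
        help="Source Excel file",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("out/canonical-persons-tree.json"),
        help="Output JSON file",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Print stats + first 5 unresolved entries; do not write file",
    )
    return parser
```

argparse exposes `--dry-run` as `args.dry_run`, so the write step can be guarded with a plain `if not args.dry_run:`.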

10. Module Reuse

| Existing module | What we reuse |
| --- | --- |
| dates.parse_date() | String date parsing: handles DD.MM.YYYY, year-only, month+year, approximate markers |
| config.MONTHS | Month name → integer mapping (German + Spanish month names already present) |

The Excel serial conversion is new logic added directly in persons_tree.py (3 lines).


11. What This Tool Does NOT Do

  • Does not call the backend API or touch the database.
  • Does not create PersonNameAlias records — it emits maidenName as a field; the future backend importer maps it.
  • Does not infer SIBLING_OF edges (requires symmetric lookup across multiple rows — deferred).
  • Does not deduplicate persons that appear in both this file and canonical-persons.xlsx — deduplication is the backend importer's responsibility.
  • Does produce birthPlace / deathPlace as top-level fields in the JSON (see §8) — they are free-text strings and informational only. The Person entity has no corresponding columns; the future backend importer decides whether to add columns or fold the values into notes.

12. Resolved Decisions

| OQ | Question | Decision |
| --- | --- | --- |
| OQ-01 | Duplicate rows (127/138: Christa Schütz; 129/139: Christoph Seils). | Tool deduplicates. On pass 1, after building the person list, detect rows with identical (firstName, lastName, birthYear) and keep only the first occurrence. Log skipped row ids to stdout. |
| OQ-02 | birthPlace / deathPlace absent from Person entity. | Keep as separate top-level fields in the JSON (birthPlace, deathPlace). The future backend importer may add columns to the persons table; the field is preserved here to avoid data loss. |
| OQ-03 | firstName = "Charlotte,Meta,Jacobi" (multi-name comma string). | Store verbatim as firstName. No splitting. |
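
The OQ-01 decision amounts to a first-occurrence filter keyed on (firstName, lastName, birthYear). A minimal sketch, with `dedupe` as a hypothetical name. The birth year in the sample rows is invented for illustration and is not taken from the register.

```python
from typing import List, Tuple


def dedupe(persons: List[dict]) -> Tuple[List[dict], List[str]]:
    """Keep the first record per (firstName, lastName, birthYear); return (kept, skipped rowIds)."""
    seen = set()
    kept: List[dict] = []
    skipped: List[str] = []
    for p in persons:
        key = (p["firstName"], p["lastName"], p["birthYear"])
        if key in seen:
            skipped.append(p["rowId"])
        else:
            seen.add(key)
            kept.append(p)
    return kept, skipped


rows = [
    # birthYear 1950 is a made-up placeholder value.
    {"rowId": "row_127", "firstName": "Christa", "lastName": "Schütz", "birthYear": 1950},
    {"rowId": "row_138", "firstName": "Christa", "lastName": "Schütz", "birthYear": 1950},
]
kept, skipped = dedupe(rows)
for row_id in skipped:
    print(f"skipped duplicate {row_id}")
```

Running the filter before the name index is built keeps a duplicated row from making its own name ambiguous during relationship resolution.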