Marcel 6103d5d229 docs(importer): resolve open questions in Personendatei importer spec

OQ-01: tool deduplicates rows with identical (firstName, lastName, birthYear)
OQ-02: birthPlace/deathPlace kept as separate JSON fields
OQ-03: multi-name firstName stored verbatim

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 20:28:45 +02:00

Personendatei Importer — Design Spec

Date: 2026-05-25
Source file: import/Personendatei 2.xlsx
Output: tools/import-normalizer/out/canonical-persons-tree.json
Tool location: tools/import-normalizer/persons_tree.py

1. Purpose

Normalize the 163-person family register in Personendatei 2.xlsx into a machine-readable JSON file that a future backend importer can consume to seed the persons and person_relationships tables. The tool is offline (no backend required) and produces a reviewable artifact with an explicit unresolved[] list for manual follow-up.

2. Source Data — Column Map

Sheet: Tabelle1 (rows 2–164; row 1 is the header).

Col	Header	Content	Notes
A	Generation	`G 0`–`G 5`	Generation relative to Herbert & Clara Cram (G 2). Inconsistent formatting: `"G3"`, `"G 0"`, `"G 2 de Gruyter"`. Extract the first digit sequence and parse it as an integer (see §5).
B	Familienname	Last name	Sometimes compound: `"de Gruyter"`, `"Cram Heydrich"`, `"Burkhard- Meier"`
C	Vorname	First name	Sometimes multiple: `"Charlotte,Meta,Jacobi"`, nicknames in parens: `"Otto (Herbert)"`
D	geb als	Maiden name	Used as a name alias for matching
E	Geburtsdatum	Birth date	Mixed types — see §4
F	Geburtsort	Birth place	Free-text string, stored verbatim
G	Todesdatum	Death date	Same mixed types as col E
H	Sterbeort	Death place	Free-text string, stored verbatim
I	verheiratet mit	Spouse name	Partial name in either `"Firstname Lastname"` or `"Lastname Firstname"` order
J	Bemerkung	German relationship notes	`"Sohn v Clara u Herbert"`, `"Nichte v Herbert"`, free text

3. Two-Pass Architecture

Pass 1 — Parse & Normalize (rows → person records)

For each row:

Read all 10 columns.
Assign a stable rowId: "row_{i:03d}" where i is the 1-based row number (e.g. row_002).
Normalize fields per §4 and §5.
Build the name-lookup index (see §6).
Emit a person record.

Pass 2 — Resolve Relationships

Walk every person record:

Resolve col I (spouse) → emit SPOUSE_OF edge or unresolved entry.
Parse col J (Bemerkung) for parent/child patterns → emit PARENT_OF edges or unresolved entries.
Append unmatched Bemerkung text to person.notes.

4. Date Parsing

Both col E (birth) and col G (death) arrive as either an Excel numeric serial or a string.

Excel serial conversion

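When the cell value is numeric (an int or a float) rather than a string:

from datetime import datetime, timedelta

date = datetime(1899, 12, 30) + timedelta(days=int(value))
year = date.year

The base date is 1899-12-30 rather than 1899-12-31 because Excel inherited the Lotus 1-2-3 leap-year bug and counts a nonexistent 1900-02-29. The offset makes the conversion correct for every serial from 61 (1900-03-01) onward.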
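The conversion above can be wrapped in a small guarded helper. This is a minimal sketch, not the tool's actual code: the function name `excel_serial_to_date` and the choice to return `None` for non-numeric, non-finite, or non-positive input are assumptions made for illustration.

```python
import math
from datetime import date, datetime, timedelta
from typing import Optional

EXCEL_BASE = datetime(1899, 12, 30)  # base date stated in this spec


def excel_serial_to_date(value: object) -> Optional[date]:
    """Convert an Excel serial to a date (hypothetical helper).

    Returns None when the value is not a usable numeric serial.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    if not math.isfinite(value):
        return None
    days = int(value)  # drop the time-of-day fraction
    if days <= 0:
        return None
    return (EXCEL_BASE + timedelta(days=days)).date()
```

For example, serial 25569 maps to 1970-01-01 and serial 44197 to 2021-01-01, which makes a quick sanity check against a known spreadsheet.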

String fallback — reuse existing `dates.parse_date()`

Pass the raw string to the existing tools/import-normalizer/dates.parse_date(). It already handles:

DD.MM.YYYY and D.M.YY
Year-only (1930)
Month + year (August 1941, Sept. 1913)
Partial/approximate markers

Extract .year from the returned ParsedDate.iso if iso is not None.

Unresolvable dates

If both paths yield None (e.g. "2.9.196", "4.3.1023", ".12.1955"):

Set birthYear/deathYear to null.
Append the raw value to person.notes as "[Geburtsdatum: <raw>]" or "[Todesdatum: <raw>]" for human review.
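The two paths and the fallback note could be combined roughly as follows. This is a sketch under assumptions: `resolve_year` is a hypothetical name, the string parser is passed in as a callable so that nothing is assumed about the real signature of `dates.parse_date()`, and empty cells are treated as "no date, no note", which the spec does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Optional


def resolve_year(
    raw: object,
    label: str,  # "Geburtsdatum" or "Todesdatum"
    notes: List[str],
    parse_string_year: Callable[[str], Optional[int]],
) -> Optional[int]:
    """Return the year for a date cell, or None plus a review note."""
    if raw is None or (isinstance(raw, str) and not raw.strip()):
        return None  # assumption: an empty cell is not an error
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        # Numeric cell: Excel serial path.
        return (datetime(1899, 12, 30) + timedelta(days=int(raw))).year
    # String cell: delegate to the injected parser.
    year = parse_string_year(str(raw).strip())
    if year is None:
        notes.append(f"[{label}: {raw}]")
    return year
```

In the real tool the callable would presumably wrap `dates.parse_date()` and pull the year out of its result. Keeping it injectable also makes the fallback easy to unit-test with a stub.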

5. Person Record Normalization

Name fields

lastName = col B, stripped.
firstName = col C. Keep as-is (including multi-name strings and parenthetical nicknames) — the backend can split later.
maidenName = col D, stripped. Stored in the JSON; the backend maps this to a PersonNameAlias of type BIRTH_NAME.
alias = null (the tool does not invent aliases; maiden name is the alias).

Generation

Extract the first digit sequence from col A:

import re
m = re.search(r"\d+", raw_generation)
generation = int(m.group()) if m else None

Handles all observed variants: "G 3", "G3", "G 0", "G 2 de Gruyter". Stored as generation: int | null in the JSON (informational; not mapped to a backend field directly).

familyMember

Set true for all records. Every person in this register is part of the family network. The backend can refine this.

notes

Constructed by concatenation:

Unmatched Bemerkung text (after relationship pattern is stripped).
Unresolvable date raw values (prefixed with field name).

6. Name Lookup Index

After pass 1, build a dict[str, list[str]] mapping normalized name keys → list of rowIds.

Normalization function `_norm(s) -> str`

Lowercase.
Strip surrounding " and '.
Remove parenthetical substrings: r"\([^)]*\)".
Collapse internal whitespace.
Strip geographic/honorific suffixes: aachen, mex., mexiko, sen, jun, jr.
Strip trailing commas, dots.

Keys indexed per person

For a person with firstName F, lastName L, maidenName M:

_norm(f"{F} {L}") — canonical order
_norm(f"{L} {F}") — reversed order (col I uses this heavily)
_norm(f"{F} {M}") if maidenName is set — maiden-name reference
_norm(L) alone — single-token fallback
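The four key forms can be generated and collected into the index along these lines. This is an illustrative sketch: `index_keys`, `build_index`, and the stand-in `simple_norm` are hypothetical names, and the normalizer is passed in as a parameter so the block runs on its own.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def simple_norm(s: str) -> str:
    """Stand-in for _norm so this block is self-contained."""
    return " ".join(s.lower().split())


def index_keys(first: str, last: str, maiden: Optional[str],
               norm: Callable[[str], str]) -> List[str]:
    keys = [norm(f"{first} {last}"), norm(f"{last} {first}")]
    if maiden:
        keys.append(norm(f"{first} {maiden}"))
    keys.append(norm(last))
    # Drop empty keys and duplicates while keeping the order.
    return list(dict.fromkeys(k for k in keys if k))


def build_index(persons: List[dict],
                norm: Callable[[str], str]) -> Dict[str, List[str]]:
    index: Dict[str, List[str]] = defaultdict(list)
    for p in persons:
        for key in index_keys(p["firstName"], p["lastName"],
                              p.get("maidenName"), norm):
            if p["rowId"] not in index[key]:
                index[key].append(p["rowId"])
    return dict(index)
```

Note that the last-name-only key will usually map to several rowIds, so it only resolves a reference when a surname is unique in the register.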

Match resolution

Given a raw name string from col I or col J:

_norm(raw) → look up in index.
Exactly one hit → match confirmed, use that rowId.
Zero hits → reason: "not_found" → unresolved[].
Multiple hits → reason: "ambiguous" → unresolved[].
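The three outcomes map naturally onto a small lookup function. The sketch below assumes a hypothetical `resolve_name` helper that returns a `(rowId, reason)` pair. The normalizer is injected, and the sample index is made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple


def resolve_name(
    raw: str,
    index: Dict[str, List[str]],
    norm: Callable[[str], str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (rowId, None) on a unique hit, else (None, reason)."""
    hits = index.get(norm(raw), [])
    if len(hits) == 1:
        return hits[0], None
    return None, "not_found" if not hits else "ambiguous"


# Usage with a made-up index and a trivial normalizer:
demo_index = {"allemeyer elsgard": ["row_002"],
              "allemeyer": ["row_002", "row_003"]}
demo_norm = lambda s: " ".join(s.lower().split())
```

Returning the reason alongside the miss lets the caller build the unresolved[] entry without repeating the lookup.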

7. Relationship Extraction

7.1 SPOUSE_OF (col I — `verheiratet mit`)

Normalize col I value.
Resolve via name index (§6).
If matched: emit one edge { personId, relatedPersonId, type: "SPOUSE_OF", source: "verheiratet_mit" }.
- Skip if an identical edge (regardless of direction) already exists in the relationship list.
If unresolved: add to unresolved[].
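The direction-independent duplicate check can be done by comparing unordered pairs. A minimal sketch, assuming a hypothetical `add_spouse_edge` helper that mutates the relationship list in place:

```python
from typing import Dict, List


def add_spouse_edge(relationships: List[Dict[str, str]],
                    person_id: str, related_id: str) -> bool:
    """Append a SPOUSE_OF edge unless the same pair already exists.

    Returns True if an edge was added, False if it was skipped.
    """
    pair = frozenset((person_id, related_id))  # order-independent key
    for edge in relationships:
        if (edge["type"] == "SPOUSE_OF"
                and frozenset((edge["personId"], edge["relatedPersonId"])) == pair):
            return False
    relationships.append({
        "personId": person_id,
        "relatedPersonId": related_id,
        "type": "SPOUSE_OF",
        "source": "verheiratet_mit",
    })
    return True
```

Since both spouses typically list each other in col I, the second row's edge is the one that gets skipped.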

7.2 PARENT_OF (col J — `Bemerkung`)

Apply these regex patterns in order, case-insensitive, with optional whitespace:

Pattern	Direction	Note
`(Sohn|Tochter)\s+v(?:on)?\s+(.+)`	Named person(s) → this person	"Sohn v Clara u Herbert"
`(Vater|Mutter)\s+v(?:on)?\s+(.+)`	This person → named person(s)	"Vater v Herbert"

Multi-name extraction: the captured name string may contain two names joined by \s+u(?:nd)?\s+. Split on this pattern and resolve each part independently.

Emit one PARENT_OF edge per resolved name, with the parent as personId and the child as relatedPersonId:

{
  "personId": "<parent_rowId>",
  "relatedPersonId": "<child_rowId>",
  "type": "PARENT_OF",
  "source": "bemerkung",
  "rawBemerkung": "<original col J value>"
}

Skip (do not emit, do not add to unresolved[], leave in notes):

Patterns starting with Neffe, Nichte, Enkel, Enkelin, Urenkel, Urenkelin — too indirect.
Patterns starting with Bruder, Schwester — SIBLING_OF is out of scope for this tool.
Any other Bemerkung text that does not match the parent patterns.

After extraction: the matched portion of the Bemerkung is removed; the remainder goes into person.notes.
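The extraction rules in this section could be sketched as below. The helper names and the direction labels are hypothetical. The leading `\b` is an added assumption, not part of the patterns in the table: without it, a case-insensitive search would also match compounds such as "Großmutter" or "Schwiegersohn".

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumption: \b keeps compounds like "Großmutter" from matching "Mutter".
CHILD_RE = re.compile(r"\b(Sohn|Tochter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
PARENT_RE = re.compile(r"\b(Vater|Mutter)\s+v(?:on)?\s+(.+)", re.IGNORECASE)
SPLIT_RE = re.compile(r"\s+u(?:nd)?\s+", re.IGNORECASE)


def parse_bemerkung(text: str) -> Optional[Tuple[str, List[str], str]]:
    """Return (direction, names, remainder), or None if no pattern matches."""
    for direction, pattern in (("named_is_parent", CHILD_RE),
                               ("named_is_child", PARENT_RE)):
        m = pattern.search(text)
        if m:
            names = [n.strip() for n in SPLIT_RE.split(m.group(2)) if n.strip()]
            remainder = (text[:m.start()] + text[m.end():]).strip()
            return direction, names, remainder
    return None


def parent_edges(direction: str, this_id: str, resolved_ids: List[str],
                 raw: str) -> List[Dict[str, str]]:
    """Build PARENT_OF edges with the parent always in personId."""
    edges = []
    for other in resolved_ids:
        parent, child = ((other, this_id) if direction == "named_is_parent"
                         else (this_id, other))
        edges.append({"personId": parent, "relatedPersonId": child,
                      "type": "PARENT_OF", "source": "bemerkung",
                      "rawBemerkung": raw})
    return edges
```

Text that yields `None` (for example "Nichte v Herbert") is left untouched and goes into person.notes, matching the skip rules above.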

8. Output JSON Schema

File: tools/import-normalizer/out/canonical-persons-tree.json

{
  "generated_at": "<ISO-8601 timestamp>",
  "source": "Personendatei 2.xlsx",
  "stats": {
    "persons": 163,
    "relationships": 87,
    "unresolved": 12
  },
  "persons": [
    {
      "rowId": "row_002",
      "firstName": "Elsgard",
      "lastName": "Allemeyer",
      "maidenName": "Wöhler",
      "alias": null,
      "notes": "Nichte von Herbert",
      "birthYear": 1920,
      "deathYear": 1999,
      "birthPlace": "Garz",
      "deathPlace": "Espelkamp",
      "generation": 3,
      "familyMember": true
    }
  ],
  "relationships": [
    {
      "personId": "row_002",
      "relatedPersonId": "row_003",
      "type": "SPOUSE_OF",
      "source": "verheiratet_mit"
    },
    {
      "personId": "row_019",
      "relatedPersonId": "row_021",
      "type": "PARENT_OF",
      "source": "bemerkung",
      "rawBemerkung": "Tochter v Clara u Herbert"
    }
  ],
  "unresolved": [
    {
      "rowId": "row_007",
      "field": "verheiratet_mit",
      "raw": "\"Tante Lolly\"",
      "reason": "not_found"
    },
    {
      "rowId": "row_042",
      "field": "bemerkung",
      "raw": "Zwillingsbruder v Herbert",
      "reason": "not_found"
    }
  ]
}
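Assembling the top-level document from the three lists might look like this. It is a sketch only: `build_output` and `to_json` are hypothetical names, and the UTC timestamp and two-space indent are assumptions. The stats are derived from the lists rather than tracked separately, so they cannot drift from the content.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_output(persons: List[dict], relationships: List[dict],
                 unresolved: List[dict],
                 source: str = "Personendatei 2.xlsx") -> Dict[str, Any]:
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": source,
        "stats": {
            "persons": len(persons),
            "relationships": len(relationships),
            "unresolved": len(unresolved),
        },
        "persons": persons,
        "relationships": relationships,
        "unresolved": unresolved,
    }


def to_json(doc: Dict[str, Any]) -> str:
    # ensure_ascii=False keeps umlauts such as "ö" readable in the artifact.
    return json.dumps(doc, ensure_ascii=False, indent=2)
```

Writing umlauts unescaped matters here because the file is meant to be reviewed by a person before the backend import.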

9. CLI Interface

python3 persons_tree.py [--input PATH] [--output PATH] [--dry-run]

Flag	Default	Description
`--input`	`../../import/Personendatei 2.xlsx`	Source Excel file
`--output`	`out/canonical-persons-tree.json`	Output JSON file
`--dry-run`	off	Print stats + first 5 unresolved entries; do not write file

On success, print:

✓ 163 persons parsed
✓ 87 relationships emitted (52 SPOUSE_OF, 35 PARENT_OF)
⚠  12 unresolved (see unresolved[] in output)
→  out/canonical-persons-tree.json
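The flags and the success summary could be wired up with `argparse` roughly as follows. The defaults are taken from the table above. The function names `build_parser` and `format_summary` are hypothetical, and the help texts are paraphrased.

```python
import argparse
from collections import Counter
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="persons_tree.py")
    parser.add_argument("--input", default="../../import/Personendatei 2.xlsx",
                        help="Source Excel file")
    parser.add_argument("--output", default="out/canonical-persons-tree.json",
                        help="Output JSON file")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print stats and first 5 unresolved entries; "
                             "do not write the file")
    return parser


def format_summary(n_persons: int, relationships: List[Dict[str, str]],
                   n_unresolved: int, output_path: str) -> List[str]:
    """Build the success lines; the per-type breakdown is counted, not stored."""
    counts = Counter(edge["type"] for edge in relationships)
    breakdown = ", ".join(f"{n} {t}" for t, n in counts.most_common())
    return [
        f"✓ {n_persons} persons parsed",
        f"✓ {len(relationships)} relationships emitted ({breakdown})",
        f"⚠  {n_unresolved} unresolved (see unresolved[] in output)",
        f"→  {output_path}",
    ]
```

`argparse` exposes `--dry-run` as `args.dry_run`, so the write step can be guarded with a plain `if not args.dry_run:`.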

10. Module Reuse

Existing module	What we reuse
`dates.parse_date()`	String date parsing — handles DD.MM.YYYY, year-only, month+year, approximate markers
`config.MONTHS`	Month name → integer mapping (German + Spanish month names already present)

The Excel serial conversion is new logic added directly in persons_tree.py (3 lines).

11. What This Tool Does NOT Do

Does not call the backend API or touch the database.
Does not create PersonNameAlias records — it emits maidenName as a field; the future backend importer maps it.
Does not infer SIBLING_OF edges (requires symmetric lookup across multiple rows — deferred).
Does not deduplicate persons that appear in both this file and canonical-persons.xlsx — deduplication is the backend importer's responsibility.
Does produce birthPlace / deathPlace as top-level fields in the JSON (see §8) — they are free-text strings and informational only. The Person entity has no corresponding columns; the future backend importer decides whether to add columns or fold the values into notes.

12. Resolved Decisions

OQ	Question	Decision
OQ-01	Duplicate rows (127/138 — Christa Schütz; 129/139 — Christoph Seils).	Tool deduplicates. On pass 1, after building the person list, detect rows with identical `(firstName, lastName, birthYear)` and keep only the first occurrence. Log skipped row ids to stdout.
OQ-02	`birthPlace` / `deathPlace` absent from `Person` entity.	Keep as separate top-level fields in the JSON (`birthPlace`, `deathPlace`). The future backend importer may add columns to the `persons` table; the fields are preserved here to avoid data loss.
OQ-03	`firstName` = `"Charlotte,Meta,Jacobi"` (multi-name comma string).	Store verbatim as `firstName`. No splitting.
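The OQ-01 rule can be sketched as a single pass that keeps the first occurrence of each key. `dedupe_persons` is a hypothetical name and the sample rows are made up. Applied literally, the rule also collapses rows whose birthYear is null when both names match, which may or may not be intended.

```python
from typing import Dict, List, Tuple


def dedupe_persons(persons: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Keep the first record per (firstName, lastName, birthYear).

    Returns (kept_records, skipped_row_ids).
    """
    seen = set()
    kept: List[Dict] = []
    skipped: List[str] = []
    for p in persons:
        key = (p["firstName"], p["lastName"], p["birthYear"])
        if key in seen:
            skipped.append(p["rowId"])  # to be logged to stdout
        else:
            seen.add(key)
            kept.append(p)
    return kept, skipped


# Made-up rows for illustration only:
sample = [
    {"rowId": "row_010", "firstName": "Anna", "lastName": "Muster", "birthYear": 1950},
    {"rowId": "row_011", "firstName": "Anna", "lastName": "Muster", "birthYear": 1950},
    {"rowId": "row_012", "firstName": "Anna", "lastName": "Muster", "birthYear": 1975},
]
```

Running the name index over the kept list only, rather than over all rows, avoids turning a duplicated person into a spurious "ambiguous" match.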
