familienarchiv/docs/import-migration/04-unresolved-names-plan.md
Marcel 97db718f81
docs(import): add unresolved-names plan + worklog entry
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:01:18 +02:00

Unresolved-Name Classification Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add a focused review/unresolved-names.csv that isolates sender/receiver strings whose name itself is problematic (unknown/illegible, single-token, relational-only, collective/group, prose-in-name-column, or a genuine two-given-name pair), and fix the ambiguous-pair heuristic so a plain First Surname external person (e.g. Mieze Schefold) is no longer falsely flagged.

Architecture: A pure classify_name(raw, given_names) function in persons.py returns a NameClass. ResolutionContext classifies every unmatched name and records the non-RESOLVABLE ones in self.unresolved. A runtime-built given-name set (register first names + a small config supplement) lets the classifier distinguish a two-given-name pair (Ella Anita → two people) from a first+surname single person (Mieze Schefold). The orchestrator writes the aggregated report and per-category stats, replacing the noisy ambiguous-receivers.csv.

Tech Stack: Python 3.12, openpyxl, pytest — extends the existing tools/import-normalizer/.

Context: This builds on the completed normalizer (PR #663). Run all tests with CWD = the tool dir, e.g. cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_X.py -v. Reuse the existing venv at tools/import-normalizer/.venv (do NOT recreate it). Commit on the current branch docs/import-migration (never main, never push). Each commit message ends with a trailing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> line.


File Structure

tools/import-normalizer/
├── config.py        # + RELATIONAL_TERMS, COLLECTIVE_TERMS, UNKNOWN_NAME_MARKERS, PROSE_MAX_LEN, EXTRA_GIVEN_NAMES
├── persons.py       # + NameClass, classify_name(), build_given_names(); ResolutionContext gains given_names + self.unresolved
├── normalize.py     # writes unresolved-names.csv (replaces ambiguous-receivers.csv) + per-category stats
├── README.md        # + unresolved-names.csv row in the review-file table
└── tests/
    ├── test_config.py     # + name-table presence test
    ├── test_persons.py    # + classify_name + build_given_names tests
    ├── test_documents.py  # ambiguous test → unresolved test (+ resolvable-pair test)
    └── test_normalize.py  # integration asserts unresolved-names.csv

Task 1: Config — name-classification tables

Files:

  • Modify: tools/import-normalizer/config.py

  • Modify: tools/import-normalizer/tests/test_config.py

  • Step 1: Add the failing test to tests/test_config.py

def test_name_classification_tables():
    assert "tante" in config.RELATIONAL_TERMS
    assert "familie" in config.COLLECTIVE_TERMS
    assert "unbekannt" in config.UNKNOWN_NAME_MARKERS
    assert config.PROSE_MAX_LEN >= 30
    assert "anita" in config.EXTRA_GIVEN_NAMES
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py::test_name_classification_tables -v; cd -
Expected: FAIL with AttributeError: module 'config' has no attribute 'RELATIONAL_TERMS'.

  • Step 3: Implement. Append the following to config.py at module level, anywhere after the existing tables (its position relative to KNOWN_LAST_NAMES does not matter):
# --- Name classification (unresolved-name review) ---
# Relational reference terms — a sender/receiver named by relation, not a proper name.
RELATIONAL_TERMS = {
    "tante", "onkel", "mutter", "vater", "oma", "opa", "großmutter", "grossmutter",
    "großvater", "grossvater", "schwester", "bruder", "cousin", "cousine", "kusine",
    "neffe", "nichte", "tochter", "sohn", "schwager", "schwägerin", "schwiegermutter",
    "schwiegervater", "enkel", "enkelin", "vetter", "base", "witwe", "witwer",
}
# Collective/group terms — not a single person. Matched against alpha-only word tokens
# (so "Fam.Cram" -> ["fam","cram"] matches "fam"), NOT as substrings/prefixes.
COLLECTIVE_TERMS = {
    "familie", "fam", "kinder", "eltern", "geschwister", "großeltern",
    "grosseltern", "alle", "diverse", "div", "gebrüder", "gebr",
}
# Markers of an unknown/illegible name (the literal "?" is handled separately in code).
# All long enough to be safe as SUBSTRING matches — do NOT add short tokens like "nn"
# (it occurs inside real names: Hanni, Johanna, Anna).
UNKNOWN_NAME_MARKERS = {"unbekannt", "unbek", "unleserlich", "unklar", "unsicher"}
# A name-column value longer than this (chars) is treated as prose/description, not a name.
PROSE_MAX_LEN = 40
# Common given names that may appear in two-given-name pairs (e.g. "Ella Anita") but are not
# in the family register. Only used to detect AMBIGUOUS_PAIR — extend as review surfaces more.
EXTRA_GIVEN_NAMES = {
    "ella", "anita", "kurt", "georg", "hanni", "mieze", "ellen", "leni", "klara",
    "margret", "gustava", "emmy", "minna", "sophie", "helga", "raymonde", "augusta",
}
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v; cd -
Expected: PASS (all config tests).

  • Step 5: Commit
git add tools/import-normalizer/config.py tools/import-normalizer/tests/test_config.py
git commit -m "feat(normalizer): config tables for name classification"
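
The comments in the config tables rest on a matching-mode distinction: collective and relational terms are compared against whole alpha-only word tokens, while unknown markers are substring-matched. A minimal sketch of why that matters, using small subsets of the tables above. The helper names word_match and substring_match are illustrative only and are not part of the tool:

```python
import re

COLLECTIVE_TERMS = {"familie", "fam", "alle"}        # subset of the config table
UNKNOWN_NAME_MARKERS = {"unbekannt", "unleserlich"}  # subset of the config table


def word_match(raw: str, terms: set) -> bool:
    """True if any alpha-only word token of raw is exactly one of the terms."""
    words = re.findall(r"[a-zäöüß]+", raw.lower())
    return any(w in terms for w in words)


def substring_match(raw: str, terms: set) -> bool:
    """True if any term occurs anywhere inside raw (case-insensitive)."""
    low = raw.lower()
    return any(t in low for t in terms)


# Whole-word matching: "Fam.Cram" splits into ["fam", "cram"] and hits "fam",
assert word_match("Fam.Cram", COLLECTIVE_TERMS)
# while the surname "Allerton" does not hit "alle".
assert not word_match("Allerton", COLLECTIVE_TERMS)
# Substring matching would flag "Allerton", which is why it is not used here.
assert substring_match("Allerton", COLLECTIVE_TERMS)

# Long markers are safe as substrings; a short one like "nn" is not.
assert substring_match("Absender unbekannt", UNKNOWN_NAME_MARKERS)
assert substring_match("Johanna", {"nn"})  # false positive on a real name
```

The same reasoning applies when extending the tables later: short abbreviations belong in COLLECTIVE_TERMS or RELATIONAL_TERMS (whole-word), not in UNKNOWN_NAME_MARKERS (substring).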

Task 2: classify_name + NameClass

Files:

  • Modify: tools/import-normalizer/persons.py

  • Modify: tools/import-normalizer/tests/test_persons.py

  • Step 1: Add failing tests to tests/test_persons.py

from persons import NameClass

GIVEN = {"ella", "anita", "kurt", "georg", "clara", "eugenie"}

def test_classify_unknown():
    assert persons.classify_name("?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("A. Kredell?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("unbekannt", GIVEN) is NameClass.UNKNOWN

def test_classify_prose():
    assert persons.classify_name("Adressenliste v Clara Cram zur Kondolenz", GIVEN) is NameClass.PROSE
    assert persons.classify_name("Clara de Gruyter(*1871)", GIVEN) is NameClass.PROSE  # digit
    assert persons.classify_name('"Cramiade" Gedicht', GIVEN) is NameClass.PROSE        # quote

def test_classify_collective():
    assert persons.classify_name("Familie", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Fam.Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Eltern Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("seine Kinder", GIVEN) is NameClass.COLLECTIVE

def test_classify_relational():
    assert persons.classify_name("Cousine Emmy Haniel", GIVEN) is NameClass.RELATIONAL
    assert persons.classify_name("Schwester Hanni", GIVEN) is NameClass.RELATIONAL

def test_classify_single_token():
    assert persons.classify_name("Agnes", GIVEN) is NameClass.SINGLE_TOKEN
    assert persons.classify_name("A.B.", GIVEN) is NameClass.SINGLE_TOKEN

def test_classify_ambiguous_pair():
    assert persons.classify_name("Ella Anita", GIVEN) is NameClass.AMBIGUOUS_PAIR
    assert persons.classify_name("Kurt Georg", GIVEN) is NameClass.AMBIGUOUS_PAIR

def test_classify_resolvable_single_person():
    # first + surname (surname not a given name) -> one real person, NOT ambiguous
    assert persons.classify_name("Mieze Schefold", GIVEN) is NameClass.RESOLVABLE
    assert persons.classify_name("Adolf Butenandt", GIVEN) is NameClass.RESOLVABLE
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k classify -v; cd -
Expected: FAIL because NameClass and classify_name are not defined yet.

  • Step 3: Implement. Add the following to persons.py. Add from enum import StrEnum to the imports if not present, and check that re and config are imported too, since classify_name uses both:
class NameClass(StrEnum):
    RESOLVABLE = "resolvable"
    UNKNOWN = "unknown"
    SINGLE_TOKEN = "single_token"
    RELATIONAL = "relational"
    COLLECTIVE = "collective"
    PROSE = "prose"
    AMBIGUOUS_PAIR = "ambiguous_pair"


_QUOTE_CHARS = "\"'“”„‚‘’"


def classify_name(raw: str, given_names: set[str]) -> NameClass:
    """Classify a (post-split) sender/receiver string by why it may be unresolvable.

    Precedence (first match wins): UNKNOWN -> PROSE -> COLLECTIVE -> RELATIONAL ->
    SINGLE_TOKEN -> AMBIGUOUS_PAIR -> RESOLVABLE.
    """
    s = raw.strip()
    if not s:
        return NameClass.RESOLVABLE
    low = s.lower()
    tokens = s.split()
    # alpha-only word tokens: "Fam.Cram" -> ["fam","cram"], so collective/relational terms
    # are matched as whole words (no substring/prefix false positives like "Allerton").
    alpha_words = re.findall(r"[a-zäöüß]+", low)
    if "?" in s or any(m in low for m in config.UNKNOWN_NAME_MARKERS):
        return NameClass.UNKNOWN
    if (len(s) > config.PROSE_MAX_LEN or any(c.isdigit() for c in s)
            or any(q in s for q in _QUOTE_CHARS) or len(tokens) > 3):
        return NameClass.PROSE
    if any(w in config.COLLECTIVE_TERMS for w in alpha_words):
        return NameClass.COLLECTIVE
    if any(w in config.RELATIONAL_TERMS for w in alpha_words):
        return NameClass.RELATIONAL
    if len(tokens) == 1:
        return NameClass.SINGLE_TOKEN
    if len(tokens) == 2 and all(_norm(t) in given_names for t in tokens):
        return NameClass.AMBIGUOUS_PAIR
    return NameClass.RESOLVABLE


# Known limitation: a name of four or more tokens with no digits/quotes (e.g. "Anna von der Heide")
# is classified PROSE. Such multi-particle names are rare here and usually resolve via the
# register; if they surface in review, treat them as lower priority than the real prose entries.

Note: _norm already exists in persons.py (added in the alias-index task) and strips accents + lowercases. classify_name uses it so given-name matching is accent-insensitive.

  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v; cd -
Expected: PASS (all persons tests, including the 7 new classify tests).

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): classify_name + NameClass"
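
The note in Step 3 says _norm strips accents and lowercases. For readers without persons.py at hand, here is one way such a helper could look. norm_sketch is a hypothetical stand-in written for this plan; the real _norm may differ in detail (for example in how it treats ß or surrounding whitespace):

```python
import unicodedata


def norm_sketch(token: str) -> str:
    """Lowercase and drop combining accents (hypothetical stand-in for _norm)."""
    # NFKD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", token)
    # Keep only the base characters, dropping the combining marks.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


given_names = {"rene", "eugenie"}
assert norm_sketch("René") in given_names
assert norm_sketch("EUGÉNIE") in given_names
assert norm_sketch("Schefold") not in given_names
```

Under this approach umlauts are folded as well ("ä" becomes "a"), so entries in the given-name set and the tokens being tested must pass through the same function.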

Task 3: build_given_names

Files:

  • Modify: tools/import-normalizer/persons.py

  • Modify: tools/import-normalizer/tests/test_persons.py

  • Step 1: Add failing test to tests/test_persons.py

def test_build_given_names():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Eugenie"},
        {"last_name": "Cram", "first_name": "Charlotte,Meta"},  # comma -> primary + extra given
    ])
    g = persons.build_given_names(people, {"Anita"})
    assert "eugenie" in g
    assert "charlotte" in g and "meta" in g   # primary + extra given names
    assert "anita" in g                        # from the extra set, normalized
    assert "schefold" not in g
  • Step 2: Run to verify it fails

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py::test_build_given_names -v; cd -
Expected: FAIL because build_given_names is not defined yet.

  • Step 3: Implement — add to persons.py
def build_given_names(register: list[Person], extra: set[str]) -> set[str]:
    """Set of normalized given names from the register (first + extra given) plus a supplement.

    Used by classify_name to tell a two-given-name pair (two people) from a first+surname.
    """
    names: set[str] = set()
    for p in register:
        if p.first_name:
            names.add(_norm(p.first_name))
        for g in p.extra_given_names:
            names.add(_norm(g))
    for e in extra:
        names.add(_norm(e))
    return names
  • Step 4: Run to verify it passes

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v; cd -
Expected: PASS.

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): build_given_names from register + supplement"
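
To see how the set built in this task feeds the pair decision from Task 2, the following sketch rebuilds a tiny given-name set and applies the two-token rule in isolation. PersonSketch and is_two_given_name_pair are hypothetical stand-ins; only the field names first_name and extra_given_names come from the plan, and plain str.lower() stands in for _norm:

```python
from dataclasses import dataclass, field


@dataclass
class PersonSketch:
    """Hypothetical stand-in for persons.Person with only the fields used here."""
    first_name: str
    extra_given_names: list = field(default_factory=list)


def build_given_names(register: list, extra: set) -> set:
    """Collect lowercased first names, extra given names, and the supplement."""
    names = {p.first_name.lower() for p in register if p.first_name}
    names.update(g.lower() for p in register for g in p.extra_given_names)
    names.update(e.lower() for e in extra)
    return names


def is_two_given_name_pair(raw: str, given_names: set) -> bool:
    """The AMBIGUOUS_PAIR rule alone: exactly two tokens, both given names."""
    tokens = raw.split()
    return len(tokens) == 2 and all(t.lower() in given_names for t in tokens)


register = [PersonSketch("Eugenie"), PersonSketch("Charlotte", ["Meta"])]
given = build_given_names(register, {"Ella", "Anita", "Mieze"})

assert is_two_given_name_pair("Ella Anita", given)          # likely two people
assert not is_two_given_name_pair("Mieze Schefold", given)  # first name + surname
```

"Mieze" being in the set is not enough to flag "Mieze Schefold": the rule needs both tokens to be given names, which is what removes the false positive.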

Task 4: Integrate — ResolutionContext records unresolved; orchestrator writes the report

This task touches persons.py, normalize.py, and two test files together so the whole suite stays green in one commit (removing ctx.ambiguous requires updating its only consumer, normalize.py, in the same change).

Files:

  • Modify: tools/import-normalizer/persons.py (ResolutionContext)

  • Modify: tools/import-normalizer/normalize.py

  • Modify: tools/import-normalizer/tests/test_documents.py

  • Modify: tools/import-normalizer/tests/test_normalize.py

  • Step 1: Update the failing tests first

In tests/test_documents.py, replace the existing test_ambiguous_space_pair_flagged_not_split function entirely with these two functions:

def test_ambiguous_pair_recorded_in_unresolved():
    people = persons.parse_register([{"last_name": "de Gruyter", "first_name": "Walter"}])
    ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={},
                                    given_names={"ella", "anita"})
    raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert len(doc.receiver_person_ids) == 1   # not split — one provisional
    assert any(name == "Ella Anita" and cat == "ambiguous_pair" for name, cat, _ in ctx.unresolved)

def test_resolvable_first_surname_pair_not_unresolved():
    ctx = persons.ResolutionContext(persons.AliasIndex([]), name_overrides={},
                                    given_names={"ella", "anita"})
    ctx.resolve_one("Mieze Schefold", source_row=1)   # surname is not a given name
    assert ctx.unresolved == []                        # RESOLVABLE -> not recorded

In tests/test_normalize.py, in the _doc_wb fixture, change the C-0001 row's receiver from empty to "?" so the run produces an unresolved entry. Find the line that appends the C-0001 row and set its EmpfängerIn cell to "?". For example the row currently reads:

    ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""])

change the 6th cell (EmpfängerIn) from "" to "?":

    ws.append(["C-0001", "", "", "", "Hans Wittkopf", "?", "Freitag 1919", "", "", ""])

Then add these assertions inside test_run_end_to_end, right after the existing assert (review_dir / "unparsed-dates.csv").exists() line:

    assert (out_dir / "canonical-documents.xlsx").exists()  # (keep existing asserts above)
    assert (review_dir / "unresolved-names.csv").exists()
    unresolved_text = (review_dir / "unresolved-names.csv").read_text(encoding="utf-8")
    assert "unknown" in unresolved_text and "?" in unresolved_text   # the "?" receiver
    assert not (review_dir / "ambiguous-receivers.csv").exists()      # replaced
  • Step 2: Run to verify they fail

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py tests/test_normalize.py -v; cd -
Expected: FAIL (ResolutionContext has no given_names/unresolved yet, and unresolved-names.csv is not written).

  • Step 3a: Implement — ResolutionContext in persons.py

Update ResolutionContext.__init__: drop self.ambiguous, accept a new given_names parameter, and add self.unresolved. Then update the methods shown below. The new __init__:

    def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str],
                 given_names: set[str] | None = None):
        self.index = alias_index
        self.name_overrides = name_overrides
        self.given_names = given_names or set()
        self.provisional: dict[str, Person] = {}
        self.unmatched: dict[str, list] = {}
        self.unresolved: list[tuple] = []   # (raw_name, category, source_row) for non-RESOLVABLE names
        self._raw_to_pid: dict[str, str] = {}
        self.override_hits = 0

In resolve_one, the provisional branch must classify the name. Replace this existing block:

        # provisional person (unmatched) — never reuse a register id
        self.unmatched.setdefault(name, []).append(source_row)
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False

with:

        # provisional person (unmatched) — never reuse a register id
        self.unmatched.setdefault(name, []).append(source_row)
        category = classify_name(name, self.given_names)
        if category is not NameClass.RESOLVABLE:
            self.unresolved.append((name, str(category), source_row))
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False

Replace the entire resolve_receivers method (the ambiguous detection now lives in resolve_one via classify_name):

    def resolve_receivers(self, raw: str, source_row: int):
        return [self.resolve_one(part, source_row) for part in split_receivers(raw)]
  • Step 3b: Implement — normalize.py

Find the line that builds the context:

    ctx = persons.ResolutionContext(alias_index, name_overrides)

replace it with (build the given-name set from the register + config supplement):

    given_names = persons.build_given_names(register, config.EXTRA_GIVEN_NAMES)
    ctx = persons.ResolutionContext(alias_index, name_overrides, given_names=given_names)

Replace the ambiguous-receivers.csv write line:

    writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous)

with an aggregated unresolved-names report:

    unresolved_agg: dict[tuple, list] = {}
    for name, category, row in ctx.unresolved:
        unresolved_agg.setdefault((category, name), []).append(row)
    unresolved_rows = sorted(
        ([cat, name, len(rows), " ".join(map(str, sorted(rows)[:5]))]
         for (cat, name), rows in unresolved_agg.items()),
        key=lambda r: (r[0], -r[2], r[1]))
    writers.write_review_csv(review_dir / "unresolved-names.csv",
                             ["category", "raw", "count", "example_rows"], unresolved_rows)

In the stats dict, replace the "ambiguous_receivers" line:

        "ambiguous_receivers": len(ctx.ambiguous),

with a per-category breakdown:

        "unresolved_name_occurrences": len(ctx.unresolved),
        "unresolved_unknown": sum(1 for _, c, _ in ctx.unresolved if c == "unknown"),
        "unresolved_single_token": sum(1 for _, c, _ in ctx.unresolved if c == "single_token"),
        "unresolved_relational": sum(1 for _, c, _ in ctx.unresolved if c == "relational"),
        "unresolved_collective": sum(1 for _, c, _ in ctx.unresolved if c == "collective"),
        "unresolved_prose": sum(1 for _, c, _ in ctx.unresolved if c == "prose"),
        "unresolved_ambiguous_pair": sum(1 for _, c, _ in ctx.unresolved if c == "ambiguous_pair"),
  • Step 4: Run the whole suite to verify green

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q; cd -
Expected: PASS (all tests, no ambiguous references remain).

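Also grep to confirm no dangling references in the top-level modules. The glob deliberately skips tests/, where test_normalize.py now asserts that ambiguous-receivers.csv is absent.
Run: grep -rn "ctx.ambiguous\|ambiguous-receivers\|ambiguous_receivers\|self.ambiguous" tools/import-normalizer/*.py
Expected: no matches.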

  • Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_documents.py tools/import-normalizer/tests/test_normalize.py
git commit -m "feat(normalizer): unresolved-names report + fix ambiguous-pair over-flagging"

Task 5: README — document the new report

Files:

  • Modify: tools/import-normalizer/README.md

  • Step 1: Update the review-file table in README.md. Replace the ambiguous-receivers.csv row with an unresolved-names.csv row. Find the table row referencing ambiguous-receivers.csv and replace it with:

| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv`. |

If the README has no such row (older version), add the row above to the review-file table.

  • Step 2: Add a note to the iteration-loop section of README.md (after the table):
> `unresolved-names.csv` is the focused "names that need a human" list — distinct from
> `unmatched-names.csv` (which is just non-family correspondents that got provisional persons).
> The given-name set that drives `ambiguous_pair` detection is the register's first names plus
> `config.EXTRA_GIVEN_NAMES` — add names there if a real two-person cell isn't being flagged.
  • Step 3: Verify the suite is still green (README-only change, but confirm nothing references the old file)

Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q; cd -
Expected: PASS.

  • Step 4: Commit
git add tools/import-normalizer/README.md
git commit -m "docs(normalizer): document unresolved-names.csv review report"

Self-Review

Spec coverage (against the agreed proposal):

  • Focused report isolating problem name classes → Task 4 writes review/unresolved-names.csv with a category column; categories defined in Task 2 classify_name. ✓
  • Fix ambiguous over-flagging of First Surname → Task 2 AMBIGUOUS_PAIR requires both tokens in the given-name set; Mieze Schefold → RESOLVABLE (tested). ✓
  • Distinguish "not fully known" (unknown/single-token/relational/collective/prose) from "can't split cleanly" (ambiguous_pair) → all are NameClass values, each its own category column value. ✓
  • Per-category counts in summary → Task 4 stats. ✓
  • Senders covered too (not just receivers) → classification happens in resolve_one, which both resolve_sender and resolve_receivers call. ✓

Placeholder scan: No TBD/TODO; every code step has complete code. The README replacement gives the exact row text.

Type consistency: NameClass (StrEnum) defined Task 2; classify_name(raw, given_names) and build_given_names(register, extra) signatures used consistently in Task 4; ResolutionContext(alias_index, name_overrides, given_names=…) matches the new __init__; self.unresolved is list[tuple] of (raw, category, source_row) and read with that shape in both the report and the stats. str(category) yields the StrEnum value (e.g. "ambiguous_pair"), matching the stat comparisons and the test assertions.

Cross-task green: Task 4 deliberately bundles the persons.py + normalize.py + test changes into one commit because removing ctx.ambiguous breaks its consumer otherwise — no red commit is left behind (lesson from the prior build).

Out of scope (future): Spanish month names + Mon DD-YYYY date form (separate date-parser enhancement); promoting unresolved rows into a document-level needs_review flag; auto-splitting confirmed ambiguous_pair entries via overrides.