Unresolved-Name Classification Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Add a focused review/unresolved-names.csv that isolates sender/receiver strings whose name itself is problematic (unknown/illegible, single-token, relational-only, collective/group, prose-in-name-column, or a genuine two-given-name pair), and fix the ambiguous-pair heuristic so a plain First Surname external person (e.g. Mieze Schefold) is no longer falsely flagged.
Architecture: A pure classify_name(raw, given_names) function in persons.py returns a NameClass. ResolutionContext classifies every unmatched name and records the non-RESOLVABLE ones in self.unresolved. A runtime-built given-name set (register first names + a small config supplement) lets the classifier distinguish a two-given-name pair (Ella Anita → two people) from a first+surname single person (Mieze Schefold). The orchestrator writes the aggregated report and per-category stats, replacing the noisy ambiguous-receivers.csv.
Tech Stack: Python 3.12, openpyxl, pytest — extends the existing tools/import-normalizer/.
Context: This builds on the completed normalizer (PR #663). Run all tests with CWD = the tool dir, e.g. cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_X.py -v. Reuse the existing venv at tools/import-normalizer/.venv (do NOT recreate it). Commit on the current branch docs/import-migration (never main, never push). Each commit message ends with a trailing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> line.
File Structure
tools/import-normalizer/
├── config.py # + RELATIONAL_TERMS, COLLECTIVE_TERMS, UNKNOWN_NAME_MARKERS, PROSE_MAX_LEN, EXTRA_GIVEN_NAMES
├── persons.py # + NameClass, classify_name(), build_given_names(); ResolutionContext gains given_names + self.unresolved
├── normalize.py # writes unresolved-names.csv (replaces ambiguous-receivers.csv) + per-category stats
├── README.md # + unresolved-names.csv row in the review-file table
└── tests/
    ├── test_config.py # + name-table presence test
    ├── test_persons.py # + classify_name + build_given_names tests
    ├── test_documents.py # ambiguous test → unresolved test (+ resolvable-pair test)
    └── test_normalize.py # integration asserts unresolved-names.csv
Task 1: Config — name-classification tables
Files:
- Modify: tools/import-normalizer/config.py
- Modify: tools/import-normalizer/tests/test_config.py
- Step 1: Add the failing test to tests/test_config.py
def test_name_classification_tables():
    assert "tante" in config.RELATIONAL_TERMS
    assert "familie" in config.COLLECTIVE_TERMS
    assert "unbekannt" in config.UNKNOWN_NAME_MARKERS
    assert config.PROSE_MAX_LEN >= 30
    assert "anita" in config.EXTRA_GIVEN_NAMES
- Step 2: Run to verify it fails
Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py::test_name_classification_tables -v)
Expected: FAIL — AttributeError: module 'config' has no attribute 'RELATIONAL_TERMS'.
- Step 3: Implement: append to config.py (after the existing tables, before or after KNOWN_LAST_NAMES; anywhere at module level works)
# --- Name classification (unresolved-name review) ---
# Relational reference terms — a sender/receiver named by relation, not a proper name.
RELATIONAL_TERMS = {
    "tante", "onkel", "mutter", "vater", "oma", "opa", "großmutter", "grossmutter",
    "großvater", "grossvater", "schwester", "bruder", "cousin", "cousine", "kusine",
    "neffe", "nichte", "tochter", "sohn", "schwager", "schwägerin", "schwiegermutter",
    "schwiegervater", "enkel", "enkelin", "vetter", "base", "witwe", "witwer",
}
# Collective/group terms — not a single person. Matched against alpha-only word tokens
# (so "Fam.Cram" -> ["fam","cram"] matches "fam"), NOT as substrings/prefixes.
COLLECTIVE_TERMS = {
    "familie", "fam", "kinder", "eltern", "geschwister", "großeltern",
    "grosseltern", "alle", "diverse", "div", "gebrüder", "gebr",
}
# Markers of an unknown/illegible name (the literal "?" is handled separately in code).
# All long enough to be safe as SUBSTRING matches — do NOT add short tokens like "nn"
# (it occurs inside real names: Hanni, Johanna, Anna).
UNKNOWN_NAME_MARKERS = {"unbekannt", "unbek", "unleserlich", "unklar", "unsicher"}
# A name-column value longer than this (chars) is treated as prose/description, not a name.
PROSE_MAX_LEN = 40
# Common given names that may appear in two-given-name pairs (e.g. "Ella Anita") but are not
# in the family register. Only used to detect AMBIGUOUS_PAIR — extend as review surfaces more.
EXTRA_GIVEN_NAMES = {
    "ella", "anita", "kurt", "georg", "hanni", "mieze", "ellen", "leni", "klara",
    "margret", "gustava", "emmy", "minna", "sophie", "helga", "raymonde", "augusta",
}
- Step 4: Run to verify it passes
Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v && cd -
Expected: PASS (all config tests).
- Step 5: Commit
git add tools/import-normalizer/config.py tools/import-normalizer/tests/test_config.py
git commit -m "feat(normalizer): config tables for name classification"
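The comments in the Task 1 tables describe two different matching strategies: collective terms are matched as whole alpha-only words, while unknown-name markers are matched as substrings. The sketch below illustrates the difference under stated assumptions: the two sets are small illustrative copies (the real tables live in config.py), and the helper names alpha_words, has_collective_word, and has_unknown_marker are hypothetical, not part of the tool.

```python
import re

# Illustrative subsets only; the full tables are defined in config.py.
COLLECTIVE_TERMS = {"familie", "fam", "alle", "div"}
UNKNOWN_NAME_MARKERS = {"unbekannt", "unbek", "unleserlich"}


def alpha_words(raw: str) -> list[str]:
    """Lowercased runs of letters; dots and other punctuation act as separators."""
    return re.findall(r"[a-zäöüß]+", raw.lower())


def has_collective_word(raw: str) -> bool:
    """Whole-word match: a term only counts if it is a complete token."""
    return any(word in COLLECTIVE_TERMS for word in alpha_words(raw))


def has_unknown_marker(raw: str) -> bool:
    """Substring match: safe only because every marker is long and distinctive."""
    low = raw.lower()
    return "?" in raw or any(marker in low for marker in UNKNOWN_NAME_MARKERS)


print(alpha_words("Fam.Cram"))             # ['fam', 'cram']
print(has_collective_word("Fam.Cram"))     # True
print(has_collective_word("Allerton"))     # False: "alle" is only a prefix here
print(has_unknown_marker("Name unbek."))   # True
print("nn" in "hanni")                     # True, so "nn" would misfire as a marker
```

Whole-word matching is what keeps a surname such as Allerton out of the collective bucket, and the last line shows why a short token like "nn" cannot be added to the substring-matched markers.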
Task 2: classify_name + NameClass
Files:
- Modify: tools/import-normalizer/persons.py
- Modify: tools/import-normalizer/tests/test_persons.py
- Step 1: Add failing tests to tests/test_persons.py
from persons import NameClass
GIVEN = {"ella", "anita", "kurt", "georg", "clara", "eugenie"}
def test_classify_unknown():
    assert persons.classify_name("?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("A. Kredell?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("unbekannt", GIVEN) is NameClass.UNKNOWN
def test_classify_prose():
    assert persons.classify_name("Adressenliste v Clara Cram zur Kondolenz", GIVEN) is NameClass.PROSE
    assert persons.classify_name("Clara de Gruyter(*1871)", GIVEN) is NameClass.PROSE # digit
    assert persons.classify_name('"Cramiade" Gedicht', GIVEN) is NameClass.PROSE # quote
def test_classify_collective():
    assert persons.classify_name("Familie", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Fam.Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Eltern Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("seine Kinder", GIVEN) is NameClass.COLLECTIVE
def test_classify_relational():
    assert persons.classify_name("Cousine Emmy Haniel", GIVEN) is NameClass.RELATIONAL
    assert persons.classify_name("Schwester Hanni", GIVEN) is NameClass.RELATIONAL
def test_classify_single_token():
    assert persons.classify_name("Agnes", GIVEN) is NameClass.SINGLE_TOKEN
    assert persons.classify_name("A.B.", GIVEN) is NameClass.SINGLE_TOKEN
def test_classify_ambiguous_pair():
    assert persons.classify_name("Ella Anita", GIVEN) is NameClass.AMBIGUOUS_PAIR
    assert persons.classify_name("Kurt Georg", GIVEN) is NameClass.AMBIGUOUS_PAIR
def test_classify_resolvable_single_person():
    # first + surname (surname not a given name) -> one real person, NOT ambiguous
    assert persons.classify_name("Mieze Schefold", GIVEN) is NameClass.RESOLVABLE
    assert persons.classify_name("Adolf Butenandt", GIVEN) is NameClass.RESOLVABLE
- Step 2: Run to verify it fails
Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k classify -v)
Expected: FAIL — NameClass / classify_name not defined.
- Step 3: Implement: add to persons.py. Add from enum import StrEnum to the imports if not present (classify_name also relies on re and config being imported), then add:
class NameClass(StrEnum):
    RESOLVABLE = "resolvable"
    UNKNOWN = "unknown"
    SINGLE_TOKEN = "single_token"
    RELATIONAL = "relational"
    COLLECTIVE = "collective"
    PROSE = "prose"
    AMBIGUOUS_PAIR = "ambiguous_pair"
_QUOTE_CHARS = "\"'“”„‚‘’"
def classify_name(raw: str, given_names: set[str]) -> NameClass:
    """Classify a (post-split) sender/receiver string by why it may be unresolvable.
    Precedence (first match wins): UNKNOWN -> PROSE -> COLLECTIVE -> RELATIONAL ->
    SINGLE_TOKEN -> AMBIGUOUS_PAIR -> RESOLVABLE.
    """
    s = raw.strip()
    if not s:
        return NameClass.RESOLVABLE
    low = s.lower()
    tokens = s.split()
    # alpha-only word tokens: "Fam.Cram" -> ["fam","cram"], so collective/relational terms
    # are matched as whole words (no substring/prefix false positives like "Allerton").
    alpha_words = re.findall(r"[a-zäöüß]+", low)
    if "?" in s or any(m in low for m in config.UNKNOWN_NAME_MARKERS):
        return NameClass.UNKNOWN
    if (len(s) > config.PROSE_MAX_LEN or any(c.isdigit() for c in s)
            or any(q in s for q in _QUOTE_CHARS) or len(tokens) > 3):
        return NameClass.PROSE
    if any(w in config.COLLECTIVE_TERMS for w in alpha_words):
        return NameClass.COLLECTIVE
    if any(w in config.RELATIONAL_TERMS for w in alpha_words):
        return NameClass.RELATIONAL
    if len(tokens) == 1:
        return NameClass.SINGLE_TOKEN
    if len(tokens) == 2 and all(_norm(t) in given_names for t in tokens):
        return NameClass.AMBIGUOUS_PAIR
    return NameClass.RESOLVABLE
# Known limitation: a 4+-token name with no digits/quotes (e.g. "Anna von der Heide") is
# classified PROSE. Such multi-particle names are rare here and usually resolve via the
# register; if they surface in review, lower-priority than the real prose entries.
Note: _norm already exists in persons.py (added in the alias-index task); it strips accents and lowercases. classify_name uses it so given-name matching is accent-insensitive.
- Step 4: Run to verify it passes
Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -
Expected: PASS (all persons tests, including the 7 new classify tests).
- Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): classify_name + NameClass"
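The note in Task 2 says _norm strips accents and lowercases, but this plan does not show its body. The following is a minimal sketch of what such a helper could look like, assuming Unicode decomposition is an acceptable way to drop accents. The name norm_sketch and its implementation are illustrative assumptions, not the code in persons.py.

```python
import unicodedata


def norm_sketch(name: str) -> str:
    """Hypothetical stand-in for persons._norm: strip accents, then lowercase."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name.strip())
    # Dropping the combining marks leaves only the base letters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


print(norm_sketch("Eugénie"))   # eugenie
print(norm_sketch("  ELLA "))   # ella
print(norm_sketch("Eugénie") in {"eugenie"})  # True: accent-insensitive lookup
```

With a helper of this shape, a cell spelled Eugénie and a register entry spelled Eugenie normalize to the same key, which is the property the AMBIGUOUS_PAIR check depends on.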
Task 3: build_given_names
Files:
- Modify: tools/import-normalizer/persons.py
- Modify: tools/import-normalizer/tests/test_persons.py
- Step 1: Add failing test to tests/test_persons.py
def test_build_given_names():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Eugenie"},
        {"last_name": "Cram", "first_name": "Charlotte,Meta"}, # comma -> primary + extra given
    ])
    g = persons.build_given_names(people, {"Anita"})
    assert "eugenie" in g
    assert "charlotte" in g and "meta" in g # primary + extra given names
    assert "anita" in g # from the extra set, normalized
    assert "schefold" not in g
- Step 2: Run to verify it fails
Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py::test_build_given_names -v)
Expected: FAIL — build_given_names not defined.
- Step 3: Implement: add to persons.py
def build_given_names(register: list[Person], extra: set[str]) -> set[str]:
    """Set of normalized given names from the register (first + extra given) plus a supplement.
    Used by classify_name to tell a two-given-name pair (two people) from a first+surname.
    """
    names: set[str] = set()
    for p in register:
        if p.first_name:
            names.add(_norm(p.first_name))
        for g in p.extra_given_names:
            names.add(_norm(g))
    for e in extra:
        names.add(_norm(e))
    return names
- Step 4: Run to verify it passes
Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -
Expected: PASS.
- Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): build_given_names from register + supplement"
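To see how the given-name set from Task 3 feeds the pair check from Task 2 without loading the real register, the flow can be sketched with a stub. Everything named here is an assumption for illustration: PersonStub stands in for persons.Person (only the fields that build_given_names reads), str.lower stands in for _norm, and is_two_given_name_pair isolates the single AMBIGUOUS_PAIR condition.

```python
from dataclasses import dataclass, field


@dataclass
class PersonStub:
    """Hypothetical stand-in for persons.Person with only the fields used below."""
    last_name: str
    first_name: str
    extra_given_names: list[str] = field(default_factory=list)


def build_given_names_sketch(register: list[PersonStub], extra: set[str]) -> set[str]:
    """Collect first names, extra given names, and the supplement, all lowercased."""
    names: set[str] = set()
    for person in register:
        if person.first_name:
            names.add(person.first_name.lower())
        names.update(g.lower() for g in person.extra_given_names)
    names.update(e.lower() for e in extra)
    return names


def is_two_given_name_pair(raw: str, given_names: set[str]) -> bool:
    """True only when the value is exactly two tokens and both are given names."""
    tokens = raw.split()
    return len(tokens) == 2 and all(t.lower() in given_names for t in tokens)


register = [
    PersonStub("de Gruyter", "Eugenie"),
    PersonStub("Cram", "Charlotte", ["Meta"]),
]
given = build_given_names_sketch(register, {"Ella", "Anita", "Mieze"})

print(sorted(given))
print(is_two_given_name_pair("Ella Anita", given))      # True: likely two people
print(is_two_given_name_pair("Mieze Schefold", given))  # False: first name + surname
```

Surnames never enter the set, so Mieze Schefold fails the check even though Mieze is a known given name. The flip side is that a pair is only flagged when both names are in the register or the supplement, which is why the supplement is meant to grow as review surfaces more names.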
Task 4: Integrate — ResolutionContext records unresolved; orchestrator writes the report
This task touches persons.py, normalize.py, and two test files together so the whole suite stays green in one commit (removing ctx.ambiguous requires updating its only consumer, normalize.py, in the same change).
Files:
- Modify: tools/import-normalizer/persons.py (ResolutionContext)
- Modify: tools/import-normalizer/normalize.py
- Modify: tools/import-normalizer/tests/test_documents.py
- Modify: tools/import-normalizer/tests/test_normalize.py
- Step 1: Update the failing tests first
In tests/test_documents.py, replace the existing test_ambiguous_space_pair_flagged_not_split function entirely with these two functions:
def test_ambiguous_pair_recorded_in_unresolved():
    people = persons.parse_register([{"last_name": "de Gruyter", "first_name": "Walter"}])
    ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={},
                                    given_names={"ella", "anita"})
    raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert len(doc.receiver_person_ids) == 1 # not split: one provisional
    assert any(name == "Ella Anita" and cat == "ambiguous_pair" for name, cat, _ in ctx.unresolved)
def test_resolvable_first_surname_pair_not_unresolved():
    ctx = persons.ResolutionContext(persons.AliasIndex([]), name_overrides={},
                                    given_names={"ella", "anita"})
    ctx.resolve_one("Mieze Schefold", source_row=1) # surname is not a given name
    assert ctx.unresolved == [] # RESOLVABLE -> not recorded
In tests/test_normalize.py, in the _doc_wb fixture, change the C-0001 row's receiver from empty to "?" so the run produces an unresolved entry. Find the line that appends the C-0001 row and set its EmpfängerIn cell to "?". For example the row currently reads:
ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""])
change the 6th cell (EmpfängerIn) from "" to "?":
ws.append(["C-0001", "", "", "", "Hans Wittkopf", "?", "Freitag 1919", "", "", ""])
Then add these assertions inside test_run_end_to_end, right after the existing assert (review_dir / "unparsed-dates.csv").exists() line:
assert (out_dir / "canonical-documents.xlsx").exists() # (keep existing asserts above)
assert (review_dir / "unresolved-names.csv").exists()
unresolved_text = (review_dir / "unresolved-names.csv").read_text(encoding="utf-8")
assert "unknown" in unresolved_text and "?" in unresolved_text # the "?" receiver
assert not (review_dir / "ambiguous-receivers.csv").exists() # replaced
- Step 2: Run to verify they fail
Run: (cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py tests/test_normalize.py -v)
Expected: FAIL — ResolutionContext has no given_names/unresolved; unresolved-names.csv not written.
- Step 3a: Implement: ResolutionContext in persons.py
Update ResolutionContext.__init__ (drop self.ambiguous, add the given_names parameter and self.unresolved) and the relevant methods. The new __init__:
def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str],
             given_names: set[str] | None = None):
    self.index = alias_index
    self.name_overrides = name_overrides
    self.given_names = given_names or set()
    self.provisional: dict[str, Person] = {}
    self.unmatched: dict[str, list] = {}
    self.unresolved: list[tuple] = [] # (raw_name, category, source_row) for non-RESOLVABLE names
    self._raw_to_pid: dict[str, str] = {}
    self.override_hits = 0
In resolve_one, the provisional branch must classify the name. Replace this existing block:
# provisional person (unmatched) — never reuse a register id
self.unmatched.setdefault(name, []).append(source_row)
if name in self._raw_to_pid:
    return self._raw_to_pid[name], name, False
with:
# provisional person (unmatched) — never reuse a register id
self.unmatched.setdefault(name, []).append(source_row)
category = classify_name(name, self.given_names)
if category is not NameClass.RESOLVABLE:
    self.unresolved.append((name, str(category), source_row))
if name in self._raw_to_pid:
    return self._raw_to_pid[name], name, False
Replace the entire resolve_receivers method (the ambiguous detection now lives in resolve_one via classify_name):
def resolve_receivers(self, raw: str, source_row: int):
    return [self.resolve_one(part, source_row) for part in split_receivers(raw)]
- Step 3b: Implement: normalize.py
Find the line that builds the context:
ctx = persons.ResolutionContext(alias_index, name_overrides)
replace it with (build the given-name set from the register + config supplement):
given_names = persons.build_given_names(register, config.EXTRA_GIVEN_NAMES)
ctx = persons.ResolutionContext(alias_index, name_overrides, given_names=given_names)
Replace the ambiguous-receivers.csv write line:
writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous)
with an aggregated unresolved-names report:
unresolved_agg: dict[tuple, list] = {}
for name, category, row in ctx.unresolved:
    unresolved_agg.setdefault((category, name), []).append(row)
unresolved_rows = sorted(
    ([cat, name, len(rows), " ".join(map(str, sorted(rows)[:5]))]
     for (cat, name), rows in unresolved_agg.items()),
    key=lambda r: (r[0], -r[2], r[1]))
writers.write_review_csv(review_dir / "unresolved-names.csv",
                         ["category", "raw", "count", "example_rows"], unresolved_rows)
In the stats dict, replace the "ambiguous_receivers" line:
"ambiguous_receivers": len(ctx.ambiguous),
with a per-category breakdown:
"unresolved_name_occurrences": len(ctx.unresolved),
"unresolved_unknown": sum(1 for _, c, _ in ctx.unresolved if c == "unknown"),
"unresolved_single_token": sum(1 for _, c, _ in ctx.unresolved if c == "single_token"),
"unresolved_relational": sum(1 for _, c, _ in ctx.unresolved if c == "relational"),
"unresolved_collective": sum(1 for _, c, _ in ctx.unresolved if c == "collective"),
"unresolved_prose": sum(1 for _, c, _ in ctx.unresolved if c == "prose"),
"unresolved_ambiguous_pair": sum(1 for _, c, _ in ctx.unresolved if c == "ambiguous_pair"),
- Step 4: Run the whole suite to verify green
Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q && cd -
Expected: PASS (all tests, no ambiguous references remain).
Also grep to confirm no dangling references:
Run: grep -rn "ctx.ambiguous\|ambiguous-receivers\|ambiguous_receivers\|self.ambiguous" tools/import-normalizer/*.py
Expected: no matches.
- Step 5: Commit
git add tools/import-normalizer/persons.py tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_documents.py tools/import-normalizer/tests/test_normalize.py
git commit -m "feat(normalizer): unresolved-names report + fix ambiguous-pair over-flagging"
Task 5: README — document the new report
Files:
- Modify: tools/import-normalizer/README.md
- Step 1: Update the review-file table in README.md. Replace the ambiguous-receivers.csv row with an unresolved-names.csv row. Find the table row referencing ambiguous-receivers.csv and replace it with:
| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv`. |
If the README has no such row (older version), add the row above to the review-file table.
- Step 2: Add a note to the iteration-loop section of README.md (after the table):
> `unresolved-names.csv` is the focused "names that need a human" list — distinct from
> `unmatched-names.csv` (which is just non-family correspondents that got provisional persons).
> The given-name set that drives `ambiguous_pair` detection is the register's first names plus
> `config.EXTRA_GIVEN_NAMES` — add names there if a real two-person cell isn't being flagged.
- Step 3: Verify the suite is still green (README-only change, but confirm nothing references the old file)
Run: cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q && cd -
Expected: PASS.
- Step 4: Commit
git add tools/import-normalizer/README.md
git commit -m "docs(normalizer): document unresolved-names.csv review report"
Self-Review
Spec coverage (against the agreed proposal):
- Focused report isolating problem name classes → Task 4 writes review/unresolved-names.csv with a category column; categories defined in Task 2 classify_name. ✓
- Fix ambiguous over-flagging of First Surname → Task 2 AMBIGUOUS_PAIR requires both tokens in the given-name set; Mieze Schefold → RESOLVABLE (tested). ✓
- Distinguish "not fully known" (unknown/single-token/relational/collective/prose) from "can't split cleanly" (ambiguous_pair) → all are NameClass values, each its own category column value. ✓
- Per-category counts in summary → Task 4 stats. ✓
- Senders covered too (not just receivers) → classification happens in resolve_one, which both resolve_sender and resolve_receivers call. ✓
Placeholder scan: No TBD/TODO; every code step has complete code. The README replacement gives the exact row text.
Type consistency: NameClass (StrEnum) defined Task 2; classify_name(raw, given_names) and build_given_names(register, extra) signatures used consistently in Task 4; ResolutionContext(alias_index, name_overrides, given_names=…) matches the new __init__; self.unresolved is list[tuple] of (raw, category, source_row) and read with that shape in both the report and the stats. str(category) yields the StrEnum value (e.g. "ambiguous_pair"), matching the stat comparisons and the test assertions.
Cross-task green: Task 4 deliberately bundles the persons.py + normalize.py + test changes into one commit because removing ctx.ambiguous breaks its consumer otherwise — no red commit is left behind (lesson from the prior build).
Out of scope (future): Spanish month names + Mon DD-YYYY date form (separate date-parser enhancement); promoting unresolved rows into a document-level needs_review flag; auto-splitting confirmed ambiguous_pair entries via overrides.