docs(import): add unresolved-names plan + worklog entry
All checks were successful
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
CI / Backend Unit Tests (pull_request) Successful in 3m52s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Unit & Component Tests (pull_request) Successful in 4m13s
CI / Semgrep Security Scan (pull_request) Successful in 20s
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
502
docs/import-migration/04-unresolved-names-plan.md
Normal file
@@ -0,0 +1,502 @@
|
|||||||
|
# Unresolved-Name Classification Implementation Plan
|
||||||
|
|
||||||
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||||
|
|
||||||
|
**Goal:** Add a focused `review/unresolved-names.csv` that isolates sender/receiver strings whose *name itself* is problematic (unknown/illegible, single-token, relational-only, collective/group, prose-in-name-column, or a genuine two-given-name pair), and fix the ambiguous-pair heuristic so a plain `First Surname` external person (e.g. `Mieze Schefold`) is no longer falsely flagged.
|
||||||
|
|
||||||
|
**Architecture:** A pure `classify_name(raw, given_names)` function in `persons.py` returns a `NameClass`. `ResolutionContext` classifies every *unmatched* name and records the non-`RESOLVABLE` ones in `self.unresolved`. A runtime-built given-name set (register first names + a small config supplement) lets the classifier distinguish a two-given-name pair (`Ella Anita` → two people) from a first+surname single person (`Mieze Schefold`). The orchestrator writes the aggregated report and per-category stats, replacing the noisy `ambiguous-receivers.csv`.
|
||||||
|
|
||||||
|
**Tech Stack:** Python 3.12, openpyxl, pytest — extends the existing `tools/import-normalizer/`.
|
||||||
|
|
||||||
|
**Context:** This builds on the completed normalizer (PR #663). Run all tests with CWD = the tool dir, e.g. `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_X.py -v`. Reuse the existing venv at `tools/import-normalizer/.venv` (do NOT recreate it). Commit on the current branch `docs/import-migration` (never main, never push). Each commit message ends with a trailing `Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>` line.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## File Structure
|
||||||
|
|
||||||
|
```
tools/import-normalizer/
├── config.py        # + RELATIONAL_TERMS, COLLECTIVE_TERMS, UNKNOWN_NAME_MARKERS, PROSE_MAX_LEN, EXTRA_GIVEN_NAMES
├── persons.py       # + NameClass, classify_name(), build_given_names(); ResolutionContext gains given_names + self.unresolved
├── normalize.py     # writes unresolved-names.csv (replaces ambiguous-receivers.csv) + per-category stats
├── README.md        # + unresolved-names.csv row in the review-file table
└── tests/
    ├── test_config.py     # + name-table presence test
    ├── test_persons.py    # + classify_name + build_given_names tests
    ├── test_documents.py  # ambiguous test → unresolved test (+ resolvable-pair test)
    └── test_normalize.py  # integration asserts unresolved-names.csv
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 1: Config — name-classification tables
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `tools/import-normalizer/config.py`
|
||||||
|
- Modify: `tools/import-normalizer/tests/test_config.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add the failing test** to `tests/test_config.py`
|
||||||
|
|
||||||
|
```python
def test_name_classification_tables():
    assert "tante" in config.RELATIONAL_TERMS
    assert "familie" in config.COLLECTIVE_TERMS
    assert "unbekannt" in config.UNKNOWN_NAME_MARKERS
    assert config.PROSE_MAX_LEN >= 30
    assert "anita" in config.EXTRA_GIVEN_NAMES
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run to verify it fails**
|
||||||
|
|
||||||
|
Run: `(cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py::test_name_classification_tables -v)`
|
||||||
|
Expected: FAIL — `AttributeError: module 'config' has no attribute 'RELATIONAL_TERMS'`.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Implement** — append to `config.py`, anywhere at module level (for example after the existing tables such as `KNOWN_LAST_NAMES`)
|
||||||
|
|
||||||
|
```python
# --- Name classification (unresolved-name review) ---
# Relational reference terms — a sender/receiver named by relation, not a proper name.
RELATIONAL_TERMS = {
    "tante", "onkel", "mutter", "vater", "oma", "opa", "großmutter", "grossmutter",
    "großvater", "grossvater", "schwester", "bruder", "cousin", "cousine", "kusine",
    "neffe", "nichte", "tochter", "sohn", "schwager", "schwägerin", "schwiegermutter",
    "schwiegervater", "enkel", "enkelin", "vetter", "base", "witwe", "witwer",
}
# Collective/group terms — not a single person. Matched against alpha-only word tokens
# (so "Fam.Cram" -> ["fam","cram"] matches "fam"), NOT as substrings/prefixes.
COLLECTIVE_TERMS = {
    "familie", "fam", "kinder", "eltern", "geschwister", "großeltern",
    "grosseltern", "alle", "diverse", "div", "gebrüder", "gebr",
}
# Markers of an unknown/illegible name (the literal "?" is handled separately in code).
# All long enough to be safe as SUBSTRING matches — do NOT add short tokens like "nn"
# (it occurs inside real names: Hanni, Johanna, Anna).
UNKNOWN_NAME_MARKERS = {"unbekannt", "unbek", "unleserlich", "unklar", "unsicher"}
# A name-column value longer than this (chars) is treated as prose/description, not a name.
PROSE_MAX_LEN = 40
# Common given names that may appear in two-given-name pairs (e.g. "Ella Anita") but are not
# in the family register. Only used to detect AMBIGUOUS_PAIR — extend as review surfaces more.
EXTRA_GIVEN_NAMES = {
    "ella", "anita", "kurt", "georg", "hanni", "mieze", "ellen", "leni", "klara",
    "margret", "gustava", "emmy", "minna", "sophie", "helga", "raymonde", "augusta",
}
```
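
The comments above describe two different matching strategies: whole-word matching for collective terms and substring matching for unknown markers. The sketch below contrasts them on a few inputs. It is an illustration only: `alpha_words`, `is_collective`, and `is_unknown` are hypothetical helper names, and the term sets are small subsets of the tables above.

```python
import re

# Subsets of the config tables, repeated here so the sketch is self-contained.
COLLECTIVE_TERMS = {"familie", "fam", "alle"}
UNKNOWN_NAME_MARKERS = {"unbekannt", "unleserlich"}


def alpha_words(raw: str) -> list[str]:
    """Lowercase, then split into alphabetic word tokens (hypothetical helper)."""
    return re.findall(r"[a-zäöüß]+", raw.lower())


def is_collective(raw: str) -> bool:
    # Whole-word match: "Allerton" must not match the term "alle".
    return any(w in COLLECTIVE_TERMS for w in alpha_words(raw))


def is_unknown(raw: str) -> bool:
    # Substring match is tolerable only because every marker is long.
    low = raw.lower()
    return any(m in low for m in UNKNOWN_NAME_MARKERS)


print(alpha_words("Fam.Cram"))
print(is_collective("Fam.Cram"), is_collective("Allerton"))
print(is_unknown("Name unbekannt"), "nn" in "hanni")
```

The last line shows why a short marker such as `"nn"` would be unsafe as a substring: it occurs inside `"hanni"`.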
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run to verify it passes**
|
||||||
|
|
||||||
|
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_config.py -v && cd -`
|
||||||
|
Expected: PASS (all config tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
git add tools/import-normalizer/config.py tools/import-normalizer/tests/test_config.py
git commit -m "feat(normalizer): config tables for name classification"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 2: `classify_name` + `NameClass`
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `tools/import-normalizer/persons.py`
|
||||||
|
- Modify: `tools/import-normalizer/tests/test_persons.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add failing tests** to `tests/test_persons.py`
|
||||||
|
|
||||||
|
```python
from persons import NameClass

GIVEN = {"ella", "anita", "kurt", "georg", "clara", "eugenie"}


def test_classify_unknown():
    assert persons.classify_name("?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("A. Kredell?", GIVEN) is NameClass.UNKNOWN
    assert persons.classify_name("unbekannt", GIVEN) is NameClass.UNKNOWN


def test_classify_prose():
    assert persons.classify_name("Adressenliste v Clara Cram zur Kondolenz", GIVEN) is NameClass.PROSE
    assert persons.classify_name("Clara de Gruyter(*1871)", GIVEN) is NameClass.PROSE  # digit
    assert persons.classify_name('"Cramiade" Gedicht', GIVEN) is NameClass.PROSE  # quote


def test_classify_collective():
    assert persons.classify_name("Familie", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Fam.Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("Eltern Cram", GIVEN) is NameClass.COLLECTIVE
    assert persons.classify_name("seine Kinder", GIVEN) is NameClass.COLLECTIVE


def test_classify_relational():
    assert persons.classify_name("Cousine Emmy Haniel", GIVEN) is NameClass.RELATIONAL
    assert persons.classify_name("Schwester Hanni", GIVEN) is NameClass.RELATIONAL


def test_classify_single_token():
    assert persons.classify_name("Agnes", GIVEN) is NameClass.SINGLE_TOKEN
    assert persons.classify_name("A.B.", GIVEN) is NameClass.SINGLE_TOKEN


def test_classify_ambiguous_pair():
    assert persons.classify_name("Ella Anita", GIVEN) is NameClass.AMBIGUOUS_PAIR
    assert persons.classify_name("Kurt Georg", GIVEN) is NameClass.AMBIGUOUS_PAIR


def test_classify_resolvable_single_person():
    # first + surname (surname not a given name) -> one real person, NOT ambiguous
    assert persons.classify_name("Mieze Schefold", GIVEN) is NameClass.RESOLVABLE
    assert persons.classify_name("Adolf Butenandt", GIVEN) is NameClass.RESOLVABLE
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run to verify it fails**
|
||||||
|
|
||||||
|
Run: `(cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -k classify -v)`
|
||||||
|
Expected: FAIL — `NameClass` / `classify_name` not defined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Implement** — add to `persons.py`. Add `from enum import StrEnum` to the imports if not present (the code below also relies on `re` and `config` being imported in `persons.py`), then add:
|
||||||
|
|
||||||
|
```python
class NameClass(StrEnum):
    RESOLVABLE = "resolvable"
    UNKNOWN = "unknown"
    SINGLE_TOKEN = "single_token"
    RELATIONAL = "relational"
    COLLECTIVE = "collective"
    PROSE = "prose"
    AMBIGUOUS_PAIR = "ambiguous_pair"


_QUOTE_CHARS = "\"'“”„‚‘’"


def classify_name(raw: str, given_names: set[str]) -> NameClass:
    """Classify a (post-split) sender/receiver string by why it may be unresolvable.

    Precedence (first match wins): UNKNOWN -> PROSE -> COLLECTIVE -> RELATIONAL ->
    SINGLE_TOKEN -> AMBIGUOUS_PAIR -> RESOLVABLE.
    """
    s = raw.strip()
    if not s:
        return NameClass.RESOLVABLE
    low = s.lower()
    tokens = s.split()
    # alpha-only word tokens: "Fam.Cram" -> ["fam","cram"], so collective/relational terms
    # are matched as whole words (no substring/prefix false positives like "Allerton").
    alpha_words = re.findall(r"[a-zäöüß]+", low)
    if "?" in s or any(m in low for m in config.UNKNOWN_NAME_MARKERS):
        return NameClass.UNKNOWN
    if (len(s) > config.PROSE_MAX_LEN or any(c.isdigit() for c in s)
            or any(q in s for q in _QUOTE_CHARS) or len(tokens) > 3):
        return NameClass.PROSE
    if any(w in config.COLLECTIVE_TERMS for w in alpha_words):
        return NameClass.COLLECTIVE
    if any(w in config.RELATIONAL_TERMS for w in alpha_words):
        return NameClass.RELATIONAL
    if len(tokens) == 1:
        return NameClass.SINGLE_TOKEN
    if len(tokens) == 2 and all(_norm(t) in given_names for t in tokens):
        return NameClass.AMBIGUOUS_PAIR
    return NameClass.RESOLVABLE


# Known limitation: a 4+-token name with no digits/quotes (e.g. "Anna von der Heide") is
# classified PROSE. Such multi-particle names are rare here and usually resolve via the
# register; if they surface in review, lower-priority than the real prose entries.
```
|
||||||
|
|
||||||
|
> Note: `_norm` already exists in `persons.py` (added in the alias-index task) and strips accents + lowercases. `classify_name` uses it so given-name matching is accent-insensitive.
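
For readers without `persons.py` at hand, the following is a minimal sketch of what an accent-stripping, lowercasing normalizer could look like. It is an assumption about `_norm`'s behaviour based only on the note above, not the repository's implementation; `norm_sketch` is a hypothetical name.

```python
import unicodedata


def norm_sketch(s: str) -> str:
    """Strip combining accents and lowercase (assumed behaviour of `_norm`)."""
    # NFKD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", s)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.strip().lower()


# Accent-insensitive membership test against a given-name set.
given_names = {norm_sketch(n) for n in ("Eugénie", "Anita")}
print(norm_sketch("EUGENIE") in given_names)
```

With a normalizer of this shape, `Eugénie` and `Eugenie` map to the same key, which is what makes the `AMBIGUOUS_PAIR` lookup accent-insensitive.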
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run to verify it passes**
|
||||||
|
|
||||||
|
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -`
|
||||||
|
Expected: PASS (all persons tests, including the 7 new classify tests).
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): classify_name + NameClass"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 3: `build_given_names`
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `tools/import-normalizer/persons.py`
|
||||||
|
- Modify: `tools/import-normalizer/tests/test_persons.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Add failing test** to `tests/test_persons.py`
|
||||||
|
|
||||||
|
```python
def test_build_given_names():
    people = persons.parse_register([
        {"last_name": "de Gruyter", "first_name": "Eugenie"},
        {"last_name": "Cram", "first_name": "Charlotte,Meta"},  # comma -> primary + extra given
    ])
    g = persons.build_given_names(people, {"Anita"})
    assert "eugenie" in g
    assert "charlotte" in g and "meta" in g  # primary + extra given names
    assert "anita" in g  # from the extra set, normalized
    assert "schefold" not in g
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run to verify it fails**
|
||||||
|
|
||||||
|
Run: `(cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py::test_build_given_names -v)`
|
||||||
|
Expected: FAIL — `build_given_names` not defined.
|
||||||
|
|
||||||
|
- [ ] **Step 3: Implement** — add to `persons.py`
|
||||||
|
|
||||||
|
```python
def build_given_names(register: list[Person], extra: set[str]) -> set[str]:
    """Set of normalized given names from the register (first + extra given) plus a supplement.

    Used by classify_name to tell a two-given-name pair (two people) from a first+surname.
    """
    names: set[str] = set()
    for p in register:
        if p.first_name:
            names.add(_norm(p.first_name))
        for g in p.extra_given_names:
            names.add(_norm(g))
    for e in extra:
        names.add(_norm(e))
    return names
```
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run to verify it passes**
|
||||||
|
|
||||||
|
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_persons.py -v && cd -`
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/tests/test_persons.py
git commit -m "feat(normalizer): build_given_names from register + supplement"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 4: Integrate — ResolutionContext records unresolved; orchestrator writes the report
|
||||||
|
|
||||||
|
This task touches `persons.py`, `normalize.py`, and two test files together so the whole suite stays green in one commit (removing `ctx.ambiguous` requires updating its only consumer, `normalize.py`, in the same change).
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `tools/import-normalizer/persons.py` (ResolutionContext)
|
||||||
|
- Modify: `tools/import-normalizer/normalize.py`
|
||||||
|
- Modify: `tools/import-normalizer/tests/test_documents.py`
|
||||||
|
- Modify: `tools/import-normalizer/tests/test_normalize.py`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Update the failing tests first**
|
||||||
|
|
||||||
|
In `tests/test_documents.py`, **replace** the existing `test_ambiguous_space_pair_flagged_not_split` function entirely with these two functions:
|
||||||
|
|
||||||
|
```python
def test_ambiguous_pair_recorded_in_unresolved():
    people = persons.parse_register([{"last_name": "de Gruyter", "first_name": "Walter"}])
    ctx = persons.ResolutionContext(persons.AliasIndex(people), name_overrides={},
                                    given_names={"ella", "anita"})
    raw = documents.RawRow(source_row=7, index="C-0200", sender="", receivers="Ella Anita")
    doc = documents.to_canonical(raw, ctx, date_overrides={})
    assert len(doc.receiver_person_ids) == 1  # not split — one provisional
    assert any(name == "Ella Anita" and cat == "ambiguous_pair" for name, cat, _ in ctx.unresolved)


def test_resolvable_first_surname_pair_not_unresolved():
    ctx = persons.ResolutionContext(persons.AliasIndex([]), name_overrides={},
                                    given_names={"ella", "anita"})
    ctx.resolve_one("Mieze Schefold", source_row=1)  # surname is not a given name
    assert ctx.unresolved == []  # RESOLVABLE -> not recorded
```
|
||||||
|
|
||||||
|
In `tests/test_normalize.py`, in the `_doc_wb` fixture, change the `C-0001` row's receiver from empty to `"?"` so the run produces an unresolved entry. Find the line that appends the `C-0001` row and set its `EmpfängerIn` cell to `"?"`. For example the row currently reads:
|
||||||
|
|
||||||
|
```python
|
||||||
|
ws.append(["C-0001", "", "", "", "Hans Wittkopf", "", "Freitag 1919", "", "", ""])
|
||||||
|
```
|
||||||
|
|
||||||
|
change the 6th cell (EmpfängerIn) from `""` to `"?"`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
ws.append(["C-0001", "", "", "", "Hans Wittkopf", "?", "Freitag 1919", "", "", ""])
|
||||||
|
```
|
||||||
|
|
||||||
|
Then add these assertions inside `test_run_end_to_end`, right after the existing `assert (review_dir / "unparsed-dates.csv").exists()` line:
|
||||||
|
|
||||||
|
```python
    assert (out_dir / "canonical-documents.xlsx").exists()  # (keep existing asserts above)
    assert (review_dir / "unresolved-names.csv").exists()
    unresolved_text = (review_dir / "unresolved-names.csv").read_text(encoding="utf-8")
    assert "unknown" in unresolved_text and "?" in unresolved_text  # the "?" receiver
    assert not (review_dir / "ambiguous-receivers.csv").exists()  # replaced
```
|
||||||
|
|
||||||
|
- [ ] **Step 2: Run to verify they fail**
|
||||||
|
|
||||||
|
Run: `(cd tools/import-normalizer && .venv/bin/python -m pytest tests/test_documents.py tests/test_normalize.py -v)`
|
||||||
|
Expected: FAIL — `ResolutionContext` has no `given_names`/`unresolved`; `unresolved-names.csv` not written.
|
||||||
|
|
||||||
|
- [ ] **Step 3a: Implement — `ResolutionContext` in `persons.py`**
|
||||||
|
|
||||||
|
In `ResolutionContext.__init__`, add a `given_names` parameter and replace the `self.ambiguous` list with `self.unresolved`; the affected methods are updated further down. The new `__init__`:
|
||||||
|
|
||||||
|
```python
    def __init__(self, alias_index: AliasIndex, name_overrides: dict[str, str],
                 given_names: set[str] | None = None):
        self.index = alias_index
        self.name_overrides = name_overrides
        self.given_names = given_names or set()
        self.provisional: dict[str, Person] = {}
        self.unmatched: dict[str, list] = {}
        self.unresolved: list[tuple] = []  # (raw_name, category, source_row) for non-RESOLVABLE names
        self._raw_to_pid: dict[str, str] = {}
        self.override_hits = 0
```
|
||||||
|
|
||||||
|
In `resolve_one`, the provisional branch must classify the name. Replace this existing block:
|
||||||
|
|
||||||
|
```python
        # provisional person (unmatched) — never reuse a register id
        self.unmatched.setdefault(name, []).append(source_row)
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False
```
|
||||||
|
|
||||||
|
with:
|
||||||
|
|
||||||
|
```python
        # provisional person (unmatched) — never reuse a register id
        self.unmatched.setdefault(name, []).append(source_row)
        category = classify_name(name, self.given_names)
        if category is not NameClass.RESOLVABLE:
            self.unresolved.append((name, str(category), source_row))
        if name in self._raw_to_pid:
            return self._raw_to_pid[name], name, False
```
|
||||||
|
|
||||||
|
Replace the entire `resolve_receivers` method (the ambiguous detection now lives in `resolve_one` via `classify_name`):
|
||||||
|
|
||||||
|
```python
    def resolve_receivers(self, raw: str, source_row: int):
        return [self.resolve_one(part, source_row) for part in split_receivers(raw)]
```
|
||||||
|
|
||||||
|
- [ ] **Step 3b: Implement — `normalize.py`**
|
||||||
|
|
||||||
|
Find the line that builds the context:
|
||||||
|
|
||||||
|
```python
|
||||||
|
ctx = persons.ResolutionContext(alias_index, name_overrides)
|
||||||
|
```
|
||||||
|
|
||||||
|
replace it with (build the given-name set from the register + config supplement):
|
||||||
|
|
||||||
|
```python
|
||||||
|
given_names = persons.build_given_names(register, config.EXTRA_GIVEN_NAMES)
|
||||||
|
ctx = persons.ResolutionContext(alias_index, name_overrides, given_names=given_names)
|
||||||
|
```
|
||||||
|
|
||||||
|
Replace the `ambiguous-receivers.csv` write line:
|
||||||
|
|
||||||
|
```python
|
||||||
|
writers.write_review_csv(review_dir / "ambiguous-receivers.csv", ["raw", "part", "source_row"], ctx.ambiguous)
|
||||||
|
```
|
||||||
|
|
||||||
|
with an aggregated unresolved-names report:
|
||||||
|
|
||||||
|
```python
    unresolved_agg: dict[tuple, list] = {}
    for name, category, row in ctx.unresolved:
        unresolved_agg.setdefault((category, name), []).append(row)
    unresolved_rows = sorted(
        ([cat, name, len(rows), " ".join(map(str, sorted(rows)[:5]))]
         for (cat, name), rows in unresolved_agg.items()),
        key=lambda r: (r[0], -r[2], r[1]))
    writers.write_review_csv(review_dir / "unresolved-names.csv",
                             ["category", "raw", "count", "example_rows"], unresolved_rows)
```
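
To make the resulting row order concrete, here is the same aggregation and sort run on a handful of made-up tuples. The sample data is invented for illustration, and the CSV writer is left out so the sketch stays self-contained.

```python
# Sample (raw_name, category, source_row) tuples, shaped like ctx.unresolved.
unresolved = [
    ("?", "unknown", 12),
    ("Agnes", "single_token", 40),
    ("?", "unknown", 3),
    ("Ella Anita", "ambiguous_pair", 7),
    ("Agnes", "single_token", 9),
    ("Berta", "single_token", 5),
]

# Group source rows by (category, name).
agg: dict[tuple[str, str], list[int]] = {}
for name, category, row in unresolved:
    agg.setdefault((category, name), []).append(row)

# Sort by category, then by descending count, then by name;
# keep at most five example rows per entry.
rows = sorted(
    ([cat, name, len(r), " ".join(map(str, sorted(r)[:5]))]
     for (cat, name), r in agg.items()),
    key=lambda r: (r[0], -r[2], r[1]),
)
for row in rows:
    print(row)
```

Within each category the most frequent names come first, so a reviewer working top-down inside a category handles the highest-impact entries before the one-offs.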
|
||||||
|
|
||||||
|
In the `stats` dict, replace the `"ambiguous_receivers"` line:
|
||||||
|
|
||||||
|
```python
|
||||||
|
"ambiguous_receivers": len(ctx.ambiguous),
|
||||||
|
```
|
||||||
|
|
||||||
|
with a per-category breakdown:
|
||||||
|
|
||||||
|
```python
        "unresolved_name_occurrences": len(ctx.unresolved),
        "unresolved_unknown": sum(1 for _, c, _ in ctx.unresolved if c == "unknown"),
        "unresolved_single_token": sum(1 for _, c, _ in ctx.unresolved if c == "single_token"),
        "unresolved_relational": sum(1 for _, c, _ in ctx.unresolved if c == "relational"),
        "unresolved_collective": sum(1 for _, c, _ in ctx.unresolved if c == "collective"),
        "unresolved_prose": sum(1 for _, c, _ in ctx.unresolved if c == "prose"),
        "unresolved_ambiguous_pair": sum(1 for _, c, _ in ctx.unresolved if c == "ambiguous_pair"),
```
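
The per-category counts above scan `ctx.unresolved` once per category. As a possible alternative (not part of the plan), the same keys can be derived in a single pass with `collections.Counter`. The sample list and the `CATEGORIES` tuple below are illustrative assumptions.

```python
from collections import Counter

# Category values as defined by NameClass, minus "resolvable".
CATEGORIES = ("unknown", "single_token", "relational",
              "collective", "prose", "ambiguous_pair")

# Sample (raw_name, category, source_row) tuples, shaped like ctx.unresolved.
unresolved = [("?", "unknown", 3), ("Agnes", "single_token", 9), ("?", "unknown", 12)]

# One pass over the list; categories that never occur count as 0.
counts = Counter(category for _, category, _ in unresolved)
stats = {"unresolved_name_occurrences": len(unresolved)}
stats.update({f"unresolved_{c}": counts[c] for c in CATEGORIES})
print(stats)
```

The explicit `sum(...)` lines in the plan are easier to grep for individual stat names; the `Counter` form is shorter if more categories are added later.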
|
||||||
|
|
||||||
|
- [ ] **Step 4: Run the whole suite to verify green**
|
||||||
|
|
||||||
|
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q && cd -`
|
||||||
|
Expected: PASS (all tests, no `ambiguous` references remain).
|
||||||
|
|
||||||
|
Also grep to confirm no dangling references:
|
||||||
|
Run: `grep -rn "ctx.ambiguous\|ambiguous-receivers\|ambiguous_receivers\|self.ambiguous" tools/import-normalizer/*.py`
|
||||||
|
Expected: no matches.
|
||||||
|
|
||||||
|
- [ ] **Step 5: Commit**
|
||||||
|
|
||||||
|
```bash
git add tools/import-normalizer/persons.py tools/import-normalizer/normalize.py tools/import-normalizer/tests/test_documents.py tools/import-normalizer/tests/test_normalize.py
git commit -m "feat(normalizer): unresolved-names report + fix ambiguous-pair over-flagging"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Task 5: README — document the new report
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Modify: `tools/import-normalizer/README.md`
|
||||||
|
|
||||||
|
- [ ] **Step 1: Update the review-file table** in `README.md`. Replace the `ambiguous-receivers.csv` row with an `unresolved-names.csv` row. Find the table row referencing `ambiguous-receivers.csv` and replace it with:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv`. |
|
||||||
|
```
|
||||||
|
|
||||||
|
If the README has no such row (older version), add the row above to the review-file table.
|
||||||
|
|
||||||
|
- [ ] **Step 2: Add a note** to the iteration-loop section of `README.md` (after the table):
|
||||||
|
|
||||||
|
```markdown
> `unresolved-names.csv` is the focused "names that need a human" list — distinct from
> `unmatched-names.csv` (which is just non-family correspondents that got provisional persons).
> The given-name set that drives `ambiguous_pair` detection is the register's first names plus
> `config.EXTRA_GIVEN_NAMES` — add names there if a real two-person cell isn't being flagged.
```
|
||||||
|
|
||||||
|
- [ ] **Step 3: Verify the suite is still green** (README-only change, but confirm nothing references the old file)
|
||||||
|
|
||||||
|
Run: `cd tools/import-normalizer && .venv/bin/python -m pytest tests/ -q && cd -`
|
||||||
|
Expected: PASS.
|
||||||
|
|
||||||
|
- [ ] **Step 4: Commit**
|
||||||
|
|
||||||
|
```bash
git add tools/import-normalizer/README.md
git commit -m "docs(normalizer): document unresolved-names.csv review report"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Self-Review
|
||||||
|
|
||||||
|
**Spec coverage** (against the agreed proposal):
- Focused report isolating problem name classes → Task 4 writes `review/unresolved-names.csv` with a `category` column; categories defined in Task 2 `classify_name`. ✓
- Fix ambiguous over-flagging of `First Surname` → Task 2 `AMBIGUOUS_PAIR` requires *both* tokens in the given-name set; `Mieze Schefold` → `RESOLVABLE` (tested). ✓
- Distinguish "not fully known" (unknown/single-token/relational/collective/prose) from "can't split cleanly" (ambiguous_pair) → all are `NameClass` values, each its own category column value. ✓
- Per-category counts in summary → Task 4 stats. ✓
- Senders covered too (not just receivers) → classification happens in `resolve_one`, which both `resolve_sender` and `resolve_receivers` call. ✓
|
||||||
|
|
||||||
|
**Placeholder scan:** No TBD/TODO; every code step has complete code. The README replacement gives the exact row text.
|
||||||
|
|
||||||
|
**Type consistency:** `NameClass` (StrEnum) defined Task 2; `classify_name(raw, given_names)` and `build_given_names(register, extra)` signatures used consistently in Task 4; `ResolutionContext(alias_index, name_overrides, given_names=…)` matches the new `__init__`; `self.unresolved` is `list[tuple]` of `(raw, category, source_row)` and read with that shape in both the report and the stats. `str(category)` yields the StrEnum value (e.g. `"ambiguous_pair"`), matching the stat comparisons and the test assertions.
|
||||||
|
|
||||||
|
**Cross-task green:** Task 4 deliberately bundles the `persons.py` + `normalize.py` + test changes into one commit because removing `ctx.ambiguous` breaks its consumer otherwise — no red commit is left behind (lesson from the prior build).
|
||||||
|
|
||||||
|
**Out of scope (future):** Spanish month names + `Mon DD-YYYY` date form (separate date-parser enhancement); promoting `unresolved` rows into a document-level `needs_review` flag; auto-splitting confirmed `ambiguous_pair` entries via overrides.
|
||||||
@@ -4,6 +4,27 @@ Running log of each working session. **Resume here.** Newest entry on top.
|
|||||||
|
|
||||||
---
|
||||||
|
|
||||||
|
## 2026-05-25 (session 5) — Unresolved-name classification
|
||||||
|
|
||||||
|
**Did:** Implemented [`04-unresolved-names-plan.md`](./04-unresolved-names-plan.md) subagent-driven
(5 tasks, TDD, per-task spec + code-quality review; 67 tests pass). Added `classify_name` +
`NameClass` + `build_given_names` in `persons.py`; `ResolutionContext` now records non-RESOLVABLE
names in `self.unresolved`; orchestrator writes `review/unresolved-names.csv` (replaces the noisy
`ambiguous-receivers.csv`) with per-category stats.
|
||||||
|
|
||||||
|
**Why:** `unmatched-names.csv` mixes boring non-family correspondents (expected) with genuinely
unresolvable entries. The new report isolates the latter so review focuses on ~440 real cases.
|
||||||
|
|
||||||
|
**Real-run result:** unresolved-names.csv = single_token 191 / prose 103 / unknown 74 /
collective 46 / relational 21 / ambiguous_pair **5** (distinct). The ambiguous over-flagging fix
cut `ambiguous_pair` from 303 → 5 (genuine two-given-name pairs only; `Mieze Schefold` etc. now
correctly RESOLVABLE). given-name set = register first names ∪ `config.EXTRA_GIVEN_NAMES`.
|
||||||
|
|
||||||
|
**Next:** populate `overrides/names.csv` from unresolved-names.csv (highest-count first); extend
`EXTRA_GIVEN_NAMES` if a real pair isn't flagged; still-open date work (Spanish months, 58–72 band).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## 2026-05-25 (session 4) — Built the normalizer (subagent-driven, all 17 tasks)
|
||||||
|
|
||||||
**Did:** Executed the plan subagent-driven (implementer + spec review + code-quality review per
|
||||||
|
|||||||