feat(normalizer): drop unmatched-names.csv; unresolved-names is the names report
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m32s
CI / OCR Service Tests (pull_request) Successful in 19s
CI / Backend Unit Tests (pull_request) Successful in 3m26s
CI / fail2ban Regex (pull_request) Successful in 47s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m0s
The unmatched list was just non-family correspondents (expected noise); their count stays in summary.txt and they remain in canonical-persons.xlsx.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -25,15 +25,16 @@ Outputs:
 | Review file | What to do |
 | --- | --- |
 | `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
-| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
-| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv`. |
+| `unresolved-names.csv` | Names whose value is itself problematic, grouped by `category`: `unknown` (`?`/illegible), `single_token` (first OR last name only), `relational` (`Tante …`), `collective` (`Familie …`), `prose` (a description landed in a name column), `ambiguous_pair` (two given names → likely two people, not auto-split). Review highest-impact categories first; add decisions to `overrides/names.csv` (look up valid ids in `out/canonical-persons.xlsx`). |
 | `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename — reconcile when the PDFs arrive. |
 | `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |
 
-> `unresolved-names.csv` is the focused "names that need a human" list — distinct from
-> `unmatched-names.csv` (which is just non-family correspondents that got provisional persons).
-> The given-name set that drives `ambiguous_pair` detection is the register's first names plus
-> `config.EXTRA_GIVEN_NAMES` — add names there if a real two-person cell isn't being flagged.
+> `unresolved-names.csv` is the focused "names that need a human" list. Non-family
+> correspondents that simply aren't in the register are NOT reported — they just become
+> provisional persons in `out/canonical-persons.xlsx` (the `unmatched_name_strings` count in
+> `summary.txt` tracks how many). The given-name set that drives `ambiguous_pair` detection is
+> the register's first names plus `config.EXTRA_GIVEN_NAMES` — add names there if a real
+> two-person cell isn't being flagged.
 
 **Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.
 
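The `unparsed-dates.csv` row in the table above describes a manual copy step: fill in `suggested_iso` and `suggested_precision`, then paste the three columns into `overrides/dates.csv` under the header `raw,iso,precision`. A minimal sketch of automating that step follows. The function name `promote_date_overrides` and the skip-if-already-present behaviour are assumptions for illustration and are not part of this commit.

```python
import csv
from pathlib import Path


def promote_date_overrides(review_csv: Path, overrides_csv: Path) -> int:
    """Append reviewer-filled rows from the review file to the overrides file.

    Hypothetical helper, not part of the normalizer. Returns the number of
    rows appended.
    """
    # Collect raw values that already have an override, so re-running the
    # helper does not append duplicates.
    existing = set()
    if overrides_csv.exists():
        with overrides_csv.open(newline="", encoding="utf-8") as f:
            existing = {row["raw"] for row in csv.DictReader(f)}

    new_rows = []
    with review_csv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            iso = (row.get("suggested_iso") or "").strip()
            precision = (row.get("suggested_precision") or "").strip()
            # Only promote rows where the reviewer filled in both columns.
            if iso and precision and row["raw"] not in existing:
                new_rows.append([row["raw"], iso, precision])
                existing.add(row["raw"])

    # Write the documented header only when starting a new or empty file.
    write_header = not overrides_csv.exists() or overrides_csv.stat().st_size == 0
    overrides_csv.parent.mkdir(parents=True, exist_ok=True)
    with overrides_csv.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["raw", "iso", "precision"])
        writer.writerows(new_rows)
    return len(new_rows)
```

Rows with either suggestion column left blank are skipped, so the helper can be run repeatedly while a review is still in progress.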
@@ -83,14 +83,6 @@ def run(*, document_workbook, document_sheet, person_workbook, person_sheet,
     writers.write_review_csv(review_dir / "unparsed-dates.csv",
         ["raw", "count", "example_rows", "suggested_iso", "suggested_precision"], unparsed_rows)
 
-    unmatched_rows = []
-    for name, rows in sorted(ctx.unmatched.items()):
-        sid, score = alias_index.suggest(name)
-        unmatched_rows.append([name, len(rows), " ".join(map(str, rows[:5])),
-            sid or "", f"{score:.2f}" if sid else ""])
-    writers.write_review_csv(review_dir / "unmatched-names.csv",
-        ["raw", "count", "example_rows", "suggested_id", "suggested_score"], unmatched_rows)
-
     writers.write_review_csv(review_dir / "duplicate-index.csv", ["source_row", "index"], duplicates)
     writers.write_review_csv(review_dir / "blank-index-rows.csv", ["source_row", "kind", "content"], blank_index)
     writers.write_review_csv(review_dir / "skipped-x-suffix.csv", ["source_row", "index", "base_index"], skipped_x)
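Every review file in the hunk above is emitted through `writers.write_review_csv(path, header, rows)`, whose body is not shown in this diff. The following is a minimal stand-in with the same call shape. It assumes the helper does nothing more than write a header row followed by the data rows; the real implementation may differ.

```python
import csv
from pathlib import Path
from typing import Iterable, Sequence


def write_review_csv(path: Path, header: Sequence[str],
                     rows: Iterable[Sequence[object]]) -> None:
    """Write a header row followed by data rows.

    Hypothetical stand-in for writers.write_review_csv; the behaviour is
    assumed from how the helper is called in the diff.
    """
    # Create the review directory on demand so callers need no setup step.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Under this assumption, dropping a report such as `unmatched-names.csv` only requires deleting its row-building loop and its single call, which matches what the hunk removes.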