From 0f1f9055c391479567d2522460e9161c30fd536a Mon Sep 17 00:00:00 2001
From: Marcel
Date: Mon, 25 May 2026 16:53:03 +0200
Subject: [PATCH] docs(normalizer): add overrides/ README with structure + examples

Co-Authored-By: Claude Opus 4.7
---
 tools/import-normalizer/overrides/README.md | 81 +++++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 tools/import-normalizer/overrides/README.md

diff --git a/tools/import-normalizer/overrides/README.md b/tools/import-normalizer/overrides/README.md
new file mode 100644
index 00000000..f5ee0a9b
--- /dev/null
+++ b/tools/import-normalizer/overrides/README.md
@@ -0,0 +1,81 @@
+# Overrides
+
+Human corrections applied **deterministically on every run**. An override **wins** over the
+automatic date parser / name matcher, so this is how you fix the residue the tool can't resolve
+on its own. Two CSV files live here; both are read by `overrides.load_overrides()`.
+
+- Missing or header-only files are fine — they just contribute zero overrides.
+- Keep these files committed to git (they're your curated corrections); the generated `out/`
+  and `review/` folders are *not* committed.
+- Matching is **exact** on the `raw` value after trimming surrounding whitespace. Copy the
+  `raw` value verbatim from the matching `review/*.csv`.
+
+## The iteration loop
+
+1. Run `python normalize.py`.
+2. Open `review/unparsed-dates.csv` and `review/unresolved-names.csv` (sorted by frequency).
+3. Add correction rows here, then re-run. Repeat until the residue is acceptable.
+
+---
+
+## `dates.csv` — fix unparseable dates
+
+Header: `raw,iso,precision`
+
+| column | meaning |
+| --- | --- |
+| `raw` | the date string exactly as written in the spreadsheet (= the `raw` column in `review/unparsed-dates.csv`). |
+| `iso` | the corrected date as `YYYY-MM-DD`. For partial dates use the 1st: month-only → `YYYY-MM-01`, year-only → `YYYY-01-01`. Leave **empty** if truly unknown. |
+| `precision` | one of `DAY`, `MONTH`, `SEASON`, `YEAR`, `RANGE`, `APPROX`, `UNKNOWN`. |
+
+### Example
+
+```csv
+raw,iso,precision
+23.Juni 58,1958-06-23,DAY
+8.März 60,1960-03-08,DAY
+Mayo 18-1929,1929-05-18,DAY
+Abril 10-929,1929-04-10,DAY
+30.April,1909-04-30,DAY
+Mai 1895,1895-05-01,MONTH
+Herbst 1913,1913-10-01,SEASON
+1945/46,1945-01-01,RANGE
+um 1920,1920-01-01,APPROX
+?,,UNKNOWN
+```
+
+Notes:
+- `23.Juni 58` / `8.März 60` — two-digit years `58`/`60` fall in the parser's ambiguous
+  `58–72` band (just past the 1873–1957 window), so they aren't auto-parsed; here you assert 1958/1960.
+- `Mayo`/`Abril` — Spanish month names (Mexican-branch letters) the parser doesn't know yet.
+- `30.April` — month+day with no year; pick the year from the letter's context.
+- Empty `iso` + `UNKNOWN` records a deliberate "unknown date" (stops it showing up as residue).
+
+---
+
+## `names.csv` — map a name string to a canonical person
+
+Header: `raw,person_id`
+
+| column | meaning |
+| --- | --- |
+| `raw` | the sender/receiver name string exactly as written (= the `raw` column in `review/unresolved-names.csv`). For a multi-name cell that was split (e.g. `"Walter und Eugenie"`), use the **individual** name part. |
+| `person_id` | the canonical id to map it to. **Must be a real id** from the `person_id` column of `out/canonical-persons.xlsx` (a register person or an already-created provisional). |
+
+### Example
+
+```csv
+raw,person_id
+A.Klucke,klucke-anna
+? Hans de Gruyter,de-gruyter-hans
+Eltern Cram,cram-john-james
+Tante Lolly,blomquist-charlotte
+```
+
+Notes:
+- Use this for partial / misspelled / illegible / aliased names that should point at a known person.
+- It maps one string → **one** person. It does **not** split a two-person cell: for genuine
+  pairs like `Ella Anita` (flagged `ambiguous_pair`), there is no split-via-override yet — leave
+  them, or add both given names to `config.EXTRA_GIVEN_NAMES` so they keep getting flagged.
+- Look up valid `person_id` values in `out/canonical-persons.xlsx`. An id that doesn't exist
+  there will create a dangling reference (no validation yet).
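
The lookup rules this README describes for `dates.csv` (exact match on the trimmed `raw` value, override wins over the automatic parser, empty `iso` plus `UNKNOWN` as a deliberate "unknown", missing or header-only files contributing nothing) can be illustrated with a small sketch. This is not the actual `overrides.load_overrides()` code, which the patch does not show; `parse_date_overrides`, `load_date_overrides`, `resolve_date`, and `fake_parser` are hypothetical names chosen for illustration only.

```python
import csv
import io
from pathlib import Path


def parse_date_overrides(lines):
    """Build {trimmed raw: (iso or None, precision)} from CSV lines.

    Hypothetical helper: assumes the header `raw,iso,precision`.
    """
    overrides = {}
    for row in csv.DictReader(lines):
        raw = (row.get("raw") or "").strip()
        if not raw:
            continue  # ignore rows without a raw value
        iso = (row.get("iso") or "").strip() or None  # empty iso -> None
        precision = (row.get("precision") or "").strip()
        overrides[raw] = (iso, precision)
    return overrides


def load_date_overrides(path):
    """A missing file simply contributes zero overrides."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as fh:
        return parse_date_overrides(fh)


def resolve_date(raw, overrides, auto_parse):
    """Exact match on the trimmed raw value; an override wins over the parser."""
    hit = overrides.get(raw.strip())
    return hit if hit is not None else auto_parse(raw)


def fake_parser(raw):
    """Stand-in for the real date parser (assumption): never parses anything."""
    return (None, "UNPARSED")


SAMPLE = io.StringIO(
    "raw,iso,precision\n"
    "23.Juni 58,1958-06-23,DAY\n"
    "?,,UNKNOWN\n"
)
date_overrides = parse_date_overrides(SAMPLE)
print(resolve_date(" 23.Juni 58 ", date_overrides, fake_parser))
print(resolve_date("?", date_overrides, fake_parser))
```

Under these assumptions the first lookup is answered by the override even though the input carries surrounding whitespace, and the `?` row resolves to a recorded unknown rather than falling through to the parser.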