familienarchiv/tools/import-normalizer/overrides/README.md
Marcel 0f1f9055c3
docs(normalizer): add overrides/ README with structure + examples
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:53:03 +02:00

# Overrides
Human corrections applied **deterministically on every run**. An override **wins** over the
automatic date parser / name matcher, so this is how you fix the residue the tool can't resolve
on its own. Two CSV files live here; both are read by `overrides.load_overrides()`.
- Missing or header-only files are fine — they just contribute zero overrides.
- Keep these files committed to git (they're your curated corrections); the generated `out/`
and `review/` folders are *not* committed.
- Matching is **exact** on the `raw` value after trimming surrounding whitespace. Copy the
`raw` value verbatim from the matching `review/*.csv`.
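
The loading rules above can be sketched roughly as follows. This is a minimal illustration, not the actual `overrides.load_overrides()` implementation; the helper name `read_override_rows` and its return shape are assumptions made for this example.

```python
import csv
from pathlib import Path


def read_override_rows(path):
    """Hypothetical helper: return {trimmed raw value: row dict} for one override CSV."""
    path = Path(path)
    if not path.exists():
        return {}  # a missing file contributes zero overrides

    rows = {}
    with path.open(newline="", encoding="utf-8") as handle:
        # A header-only (or empty) file yields no rows, so the result stays empty.
        for row in csv.DictReader(handle):
            raw = (row.get("raw") or "").strip()
            if not raw:
                continue  # skip blank lines and rows without a raw value
            # Exact match on the trimmed value: no case folding, no fuzzy matching.
            rows[raw] = {
                key: (value or "").strip()
                for key, value in row.items()
                if key is not None  # ignore surplus cells beyond the header
            }
    return rows
```

A lookup is then a plain dictionary access on the trimmed string, e.g. `read_override_rows("dates.csv").get(cell.strip())`, which is why a `raw` value that differs by even one character will not match.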
## The iteration loop
1. Run `python normalize.py`.
2. Open `review/unparsed-dates.csv` and `review/unresolved-names.csv` (sorted by frequency).
3. Add correction rows here, then re-run. Repeat until the residue is acceptable.
---
## `dates.csv` — fix unparseable dates
Header: `raw,iso,precision`
| column | meaning |
| --- | --- |
| `raw` | the date string exactly as written in the spreadsheet (= the `raw` column in `review/unparsed-dates.csv`). |
| `iso` | the corrected date as `YYYY-MM-DD`. For partial dates use the 1st: month-only → `YYYY-MM-01`, year-only → `YYYY-01-01`. Leave **empty** if truly unknown. |
| `precision` | one of `DAY`, `MONTH`, `SEASON`, `YEAR`, `RANGE`, `APPROX`, `UNKNOWN`. |
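
As a sanity check before re-running, the column rules in the table could be verified with something like the sketch below. The function `check_date_row` is hypothetical and not part of the tool; it only encodes what the table states (strict `YYYY-MM-DD`, the seven precision values, and the "use the 1st" convention for month-only and year-only dates).

```python
import re
from datetime import date

PRECISIONS = {"DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN"}
ISO_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_date_row(row):
    """Hypothetical check: return a list of problems found in one dates.csv row."""
    problems = []
    iso = (row.get("iso") or "").strip()
    precision = (row.get("precision") or "").strip()

    if precision not in PRECISIONS:
        problems.append(f"unknown precision: {precision!r}")

    if not iso:
        return problems  # an empty iso is allowed for truly unknown dates

    parsed = None
    if ISO_PATTERN.fullmatch(iso):
        try:
            parsed = date.fromisoformat(iso)
        except ValueError:
            parsed = None  # right shape, but not a real calendar date
    if parsed is None:
        problems.append(f"iso is not a valid YYYY-MM-DD date: {iso!r}")
        return problems

    # Partial dates are pinned to the 1st, per the table above.
    if precision == "MONTH" and parsed.day != 1:
        problems.append("MONTH precision expects YYYY-MM-01")
    if precision == "YEAR" and (parsed.month, parsed.day) != (1, 1):
        problems.append("YEAR precision expects YYYY-01-01")
    return problems
```

An empty list means the row is consistent with the table; the check says nothing about whether the date itself is historically right.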
### Example
```csv
raw,iso,precision
23.Juni 58,1958-06-23,DAY
8.März 60,1960-03-08,DAY
Mayo 18-1929,1929-05-18,DAY
Abril 10-929,1929-04-10,DAY
30.April,1909-04-30,DAY
Mai 1895,1895-05-01,MONTH
Herbst 1913,1913-10-01,SEASON
1945/46,1945-01-01,RANGE
um 1920,1920-01-01,APPROX
?,,UNKNOWN
```
Notes:
- `23.Juni 58` / `8.März 60` — two-digit years `58`/`60` fall in the parser's ambiguous
`58-72` band (just past the 1873-1957 window), so they aren't auto-parsed; here you assert 1958/1960.
- `Mayo`/`Abril` — Spanish month names (Mexican-branch letters) the parser doesn't know yet.
- `30.April` — month+day with no year; pick the year from the letter's context.
- Empty `iso` + `UNKNOWN` records a deliberate "unknown date" (stops it showing up as residue).
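
To make the two-digit-year note concrete, here is a small sketch of how such a window might behave. It is inferred from the note alone (a 1873-1957 window with `58`-`72` left unresolved); the function name `expand_two_digit_year` is hypothetical and the real parser may work differently.

```python
def expand_two_digit_year(yy):
    """Hypothetical: expand a two-digit year into 1873-1957, or None if ambiguous."""
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy!r}")
    if yy >= 73:
        return 1800 + yy  # 73..99 -> 1873..1899
    if yy <= 57:
        return 1900 + yy  # 00..57 -> 1900..1957
    # 58..72 lies outside the window; assumed here to be left for a human override.
    return None


print(expand_two_digit_year(95))  # 1895
print(expand_two_digit_year(58))  # None
```

Under this assumption `23.Juni 58` stays unparsed, which is why the example file pins it to `1958-06-23` explicitly.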
---
## `names.csv` — map a name string to a canonical person
Header: `raw,person_id`
| column | meaning |
| --- | --- |
| `raw` | the sender/receiver name string exactly as written (= the `raw` column in `review/unresolved-names.csv`). For a multi-name cell that was split (e.g. `"Walter und Eugenie"`), use the **individual** name part. |
| `person_id` | the canonical id to map it to. **Must be a real id** from the `person_id` column of `out/canonical-persons.xlsx` (a register person or an already-created provisional). |
### Example
```csv
raw,person_id
A.Klucke,klucke-anna
? Hans de Gruyter,de-gruyter-hans
Eltern Cram,cram-john-james
Tante Lolly,blomquist-charlotte
```
Notes:
- Use this for partial / misspelled / illegible / aliased names that should point at a known person.
- It maps one string → **one** person. It does **not** split a two-person cell: for genuine
pairs like `Ella Anita` (flagged `ambiguous_pair`), there is no split-via-override yet — leave
them, or add both given names to `config.EXTRA_GIVEN_NAMES` so they keep getting flagged.
- Look up valid `person_id` values in `out/canonical-persons.xlsx`. An id that doesn't exist
there will create a dangling reference (no validation yet).
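
Since ids are not validated yet, a small pre-flight check along these lines could catch dangling references before a run. This is only a sketch: `find_dangling_ids` is a hypothetical helper, and it assumes you have already collected the `person_id` column of `out/canonical-persons.xlsx` into a Python collection (reading the workbook is left out here).

```python
def find_dangling_ids(name_overrides, known_ids):
    """Hypothetical check: return the {raw: person_id} entries whose id is unknown.

    name_overrides: mapping of raw name string -> person_id (rows of names.csv)
    known_ids: iterable of valid ids taken from out/canonical-persons.xlsx
    """
    known = set(known_ids)
    return {
        raw: person_id
        for raw, person_id in name_overrides.items()
        if person_id not in known
    }


# Illustrative data only; the ids below are taken from the example above.
overrides = {"A.Klucke": "klucke-anna", "Tante Lolly": "blomquist-charlote"}
known = {"klucke-anna", "blomquist-charlotte"}
print(find_dangling_ids(overrides, known))  # {'Tante Lolly': 'blomquist-charlote'}
```

The comparison is an exact string match, so a single typo in an id (as in the misspelled `blomquist-charlote` here) is reported.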