docs(normalizer): add overrides/ README with structure + examples
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m27s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 3m40s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 21s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m3s
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
81
tools/import-normalizer/overrides/README.md
Normal file
@@ -0,0 +1,81 @@
# Overrides

Human corrections applied **deterministically on every run**. An override **wins** over the
automatic date parser / name matcher, so this is how you fix the residue the tool can't resolve
on its own. Two CSV files live here, `dates.csv` and `names.csv`; both are read by
`overrides.load_overrides()`.

- Missing or header-only files are fine; they just contribute zero overrides.
- Keep these files committed to git (they're your curated corrections); the generated `out/`
  and `review/` folders are *not* committed.
- Matching is **exact** on the `raw` value after trimming surrounding whitespace. Copy the
  `raw` value verbatim from the matching `review/*.csv`.

## The iteration loop

1. Run `python normalize.py`.
2. Open `review/unparsed-dates.csv` and `review/unresolved-names.csv` (sorted by frequency).
3. Add correction rows here, then re-run. Repeat until the residue is acceptable.
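To see how much residue is left between runs, something like the following sketch could list the `raw` values in a review file that no override covers yet. It is not part of the tool: `raw_values` and `pending_raws` are hypothetical helpers, and the `count` column in the sample review data is an assumption (only the `raw` column is documented here).

```python
import csv
import io


def raw_values(csv_text: str) -> list[str]:
    """Return the trimmed, non-empty `raw` values of a CSV, in file order."""
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get("raw") or "").strip()
        if raw:
            values.append(raw)
    return values


def pending_raws(review_text: str, overrides_text: str) -> list[str]:
    """Review entries that have no override row yet, in review order."""
    covered = set(raw_values(overrides_text))
    return [raw for raw in raw_values(review_text) if raw not in covered]


# Sample data; the `count` column is assumed for illustration only.
review = "raw,count\nMai 1895,4\n30.April,2\n"
overrides = "raw,iso,precision\nMai 1895,1895-05-01,MONTH\n"
print(pending_raws(review, overrides))
```

In practice the two strings would be read from `review/unparsed-dates.csv` and `dates.csv` (or the names equivalents).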

---

## `dates.csv` — fix unparseable dates

Header: `raw,iso,precision`

| column | meaning |
| --- | --- |
| `raw` | the date string exactly as written in the spreadsheet (= the `raw` column in `review/unparsed-dates.csv`). |
| `iso` | the corrected date as `YYYY-MM-DD`. For partial dates use the 1st: month-only → `YYYY-MM-01`, year-only → `YYYY-01-01`. Leave **empty** if truly unknown. |
| `precision` | one of `DAY`, `MONTH`, `SEASON`, `YEAR`, `RANGE`, `APPROX`, `UNKNOWN`. |

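The column rules in the table can be checked mechanically before a run. The sketch below is an illustration, not something the tool ships: `check_date_row` is a hypothetical helper, and treating `precision` as case-sensitive is an assumption.

```python
import re
from datetime import date

# Allowed values, taken from the `precision` row of the table.
PRECISIONS = {"DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN"}
ISO_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_date_row(raw: str, iso: str, precision: str) -> list[str]:
    """Return a list of problems for one dates.csv row (empty list = OK)."""
    problems = []
    if not raw.strip():
        problems.append("raw is empty")
    if precision not in PRECISIONS:
        problems.append(f"unknown precision: {precision!r}")
    if iso:  # an empty iso is allowed and means "truly unknown"
        if not ISO_RE.match(iso):
            problems.append("iso is not YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(iso)
            except ValueError:
                problems.append("iso is not a real calendar date")
    return problems


print(check_date_row("Mai 1895", "1895-05-01", "MONTH"))
print(check_date_row("Mai 1895", "1895-5-1", "month"))
```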
### Example

```csv
raw,iso,precision
23.Juni 58,1958-06-23,DAY
8.März 60,1960-03-08,DAY
Mayo 18-1929,1929-05-18,DAY
Abril 10-929,1929-04-10,DAY
30.April,1909-04-30,DAY
Mai 1895,1895-05-01,MONTH
Herbst 1913,1913-10-01,SEASON
1945/46,1945-01-01,RANGE
um 1920,1920-01-01,APPROX
?,,UNKNOWN
```

Notes:

- `23.Juni 58` / `8.März 60`: the two-digit years `58`/`60` fall in the parser's ambiguous
  `58–72` band (just past the 1873–1957 window), so they aren't auto-parsed; here you assert 1958/1960.
- `Mayo`/`Abril`: Spanish month names (Mexican-branch letters) the parser doesn't know yet.
  `Abril 10-929` also has a truncated year (`929`), which the override reads as 1929.
- `30.April`: day and month with no year; pick the year from the letter's context.
- Empty `iso` + `UNKNOWN` records a deliberate "unknown date" (stops it showing up as residue).

---

## `names.csv` — map a name string to a canonical person

Header: `raw,person_id`

| column | meaning |
| --- | --- |
| `raw` | the sender/receiver name string exactly as written (= the `raw` column in `review/unresolved-names.csv`). For a multi-name cell that was split (e.g. `"Walter und Eugenie"`), use the **individual** name part. |
| `person_id` | the canonical id to map it to. **Must be a real id** from the `person_id` column of `out/canonical-persons.xlsx` (a register person or an already-created provisional). |

### Example

```csv
raw,person_id
A.Klucke,klucke-anna
? Hans de Gruyter,de-gruyter-hans
Eltern Cram,cram-john-james
Tante Lolly,blomquist-charlotte
```

Notes:

- Use this for partial / misspelled / illegible / aliased names that should point at a known person.
- It maps one string → **one** person. It does **not** split a two-person cell: for genuine
  pairs like `Ella Anita` (flagged `ambiguous_pair`), there is no split-via-override yet. Leave
  them as they are, or add both given names to `config.EXTRA_GIVEN_NAMES` so they keep getting flagged.
- Look up valid `person_id` values in `out/canonical-persons.xlsx`. An id that doesn't exist
  there will create a dangling reference (no validation yet).
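Since the tool does not validate ids yet, a quick manual check can catch dangling references before a run. The sketch below is an illustration only: `dangling_ids` is a hypothetical helper, and it assumes the known ids have already been collected into a set (for example from the `person_id` column of `out/canonical-persons.xlsx`; reading the spreadsheet is left out here because it needs a third-party library).

```python
import csv
import io


def dangling_ids(names_csv_text: str, known_ids: set[str]) -> list[tuple[str, str]]:
    """Return (raw, person_id) pairs whose person_id is not a known id."""
    dangling = []
    for row in csv.DictReader(io.StringIO(names_csv_text)):
        raw = (row.get("raw") or "").strip()
        person_id = (row.get("person_id") or "").strip()
        if raw and person_id not in known_ids:
            dangling.append((raw, person_id))
    return dangling


# Sample data: only one of the two ids is "known" in this illustration.
names_csv = "raw,person_id\nA.Klucke,klucke-anna\nTante Lolly,blomquist-charlotte\n"
print(dangling_ids(names_csv, {"klucke-anna"}))
```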