Overrides
Human corrections applied deterministically on every run. An override wins over the
automatic date parser / name matcher, so this is how you fix the residue the tool can't resolve
on its own. Two CSV files live here; both are read by overrides.load_overrides().
- Missing or header-only files are fine — they just contribute zero overrides.
- Keep these files committed to git (they're your curated corrections); the generated `out/` and `review/` folders are not committed.
- Matching is exact on the `raw` value after trimming surrounding whitespace. Copy the `raw` value verbatim from the matching `review/*.csv`.
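The loading and matching rules above can be sketched roughly as follows. This is a minimal illustration, not the project's `overrides.load_overrides()`: the helper names `load_raw_map` and `lookup` are hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path


def load_raw_map(path, value_columns):
    """Read an override CSV into {trimmed raw value: {column: value}}.

    Hypothetical helper for illustration; the real loader may differ.
    """
    path = Path(path)
    if not path.exists():
        return {}  # a missing file contributes zero overrides
    mapping = {}
    with path.open(newline="", encoding="utf-8") as handle:
        # A header-only file yields no rows, so the result stays empty.
        for row in csv.DictReader(handle):
            raw = (row.get("raw") or "").strip()
            if not raw:
                continue  # ignore rows without a raw value
            mapping[raw] = {col: (row.get(col) or "").strip() for col in value_columns}
    return mapping


def lookup(mapping, cell):
    """Exact, case-sensitive match after trimming surrounding whitespace."""
    return mapping.get(cell.strip())
```

Because the match is exact, a `raw` value that differs only in case or inner spacing from the spreadsheet cell will not be found, which is why copying it verbatim from the review file matters.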
The iteration loop
- Run `python normalize.py`.
- Open `review/unparsed-dates.csv` and `review/unresolved-names.csv` (sorted by frequency).
- Add correction rows here, then re-run. Repeat until the residue is acceptable.
dates.csv — fix unparseable dates
Header: raw,iso,precision
| column | meaning |
|---|---|
| `raw` | the date string exactly as written in the spreadsheet (= the `raw` column in `review/unparsed-dates.csv`). |
| `iso` | the corrected date as `YYYY-MM-DD`. For partial dates use the 1st: month-only → `YYYY-MM-01`, year-only → `YYYY-01-01`. Leave empty if truly unknown. |
| `precision` | one of `DAY`, `MONTH`, `SEASON`, `YEAR`, `RANGE`, `APPROX`, `UNKNOWN`. |
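Before committing new rows, the column rules in the table can be checked with a small script. The sketch below is an assumption about how such a check could look; `check_date_row` is a hypothetical name and is not part of the tool.

```python
import re
from datetime import date

PRECISIONS = {"DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN"}
ISO_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_date_row(raw, iso, precision):
    """Return a list of problems for one dates.csv row (empty list means OK)."""
    problems = []
    if not raw.strip():
        problems.append("raw is empty")
    if precision not in PRECISIONS:
        problems.append(f"unknown precision: {precision!r}")
    if iso:  # an empty iso is allowed for truly unknown dates
        parsed = None
        if ISO_SHAPE.fullmatch(iso):
            try:
                parsed = date.fromisoformat(iso)
            except ValueError:
                parsed = None  # right shape, impossible date (e.g. 02-30)
        if parsed is None:
            problems.append(f"iso is not a valid YYYY-MM-DD date: {iso!r}")
        elif precision == "MONTH" and parsed.day != 1:
            problems.append("MONTH precision should use day 01")
        elif precision == "YEAR" and (parsed.month, parsed.day) != (1, 1):
            problems.append("YEAR precision should use 01-01")
    return problems
```

For instance, `check_date_row("Mai 1895", "1895-05-15", "MONTH")` reports that a month-only date should use day 01.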
Example
raw,iso,precision
23.Juni 58,1958-06-23,DAY
8.März 60,1960-03-08,DAY
Mayo 18-1929,1929-05-18,DAY
Abril 10-929,1929-04-10,DAY
30.April,1909-04-30,DAY
Mai 1895,1895-05-01,MONTH
Herbst 1913,1913-10-01,SEASON
1945/46,1945-01-01,RANGE
um 1920,1920-01-01,APPROX
?,,UNKNOWN
Notes:
- `23.Juni 58` / `8.März 60`: two-digit years `58`/`60` fall in the parser's ambiguous `58–72` band (just past the 1873–1957 window), so they aren't auto-parsed; here you assert 1958/1960.
- `Mayo` / `Abril`: Spanish month names (Mexican-branch letters) the parser doesn't know yet.
- `30.April`: month+day with no year; pick the year from the letter's context.
- Empty `iso` + `UNKNOWN` records a deliberate "unknown date" (stops it showing up as residue).
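The two-digit-year behaviour described in the first note can be pictured as a fixed window. The function below is only an illustration inferred from the 1873–1957 window and the `58–72` band mentioned above; `expand_two_digit_year` is a hypothetical name, and the parser's real logic may differ.

```python
def expand_two_digit_year(yy):
    """Map a two-digit year onto 1873-1957, or return None if ambiguous."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    if yy >= 73:
        return 1800 + yy  # 73..99 -> 1873..1899
    if yy <= 57:
        return 1900 + yy  # 00..57 -> 1900..1957
    return None  # 58..72 could mean 18xx or 19xx, so a human decides
```

Under this reading, `58` and `60` return `None`, which is why those rows end up in the review file and need an explicit override.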
names.csv — map a name string to a canonical person
Header: raw,person_id
| column | meaning |
|---|---|
| `raw` | the sender/receiver name string exactly as written (= the `raw` column in `review/unresolved-names.csv`). For a multi-name cell that was split (e.g. "Walter und Eugenie"), use the individual name part. |
| `person_id` | the canonical id to map it to. Must be a real id from the `person_id` column of `out/canonical-persons.xlsx` (a register person or an already-created provisional). |
Example
raw,person_id
A.Klucke,klucke-anna
? Hans de Gruyter,de-gruyter-hans
Eltern Cram,cram-john-james
Tante Lolly,blomquist-charlotte
Notes:
- Use this for partial / misspelled / illegible / aliased names that should point at a known person.
- It maps one string → one person. It does not split a two-person cell: for genuine pairs like `Ella Anita` (flagged `ambiguous_pair`), there is no split-via-override yet; leave them, or add both given names to `config.EXTRA_GIVEN_NAMES` so they keep getting flagged.
- Look up valid `person_id` values in `out/canonical-persons.xlsx`. An id that doesn't exist there will create a dangling reference (no validation yet).
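Since the tool does not validate ids yet, a manual check can catch dangling references before a run. The sketch below assumes you have already collected the valid ids into a Python set (for example by exporting the `person_id` column of `out/canonical-persons.xlsx`); `find_dangling_ids` is a hypothetical helper, not part of the tool.

```python
import csv


def find_dangling_ids(names_csv_path, known_ids):
    """Return (line_number, raw, person_id) for ids missing from known_ids."""
    dangling = []
    with open(names_csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            person_id = (row.get("person_id") or "").strip()
            if person_id not in known_ids:
                raw = (row.get("raw") or "").strip()
                # reader.line_num counts physical lines, header included
                dangling.append((reader.line_num, raw, person_id))
    return dangling
```

An empty result means every override row points at a known person; anything else lists the file line, the `raw` string, and the offending id.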