Files

Marcel 94a40237f4 feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

Adds tags.py module implementing a three-outcome heuristic:
- Individual-to-individual correspondence tags ("Clara an Herbert") → dropped
- Group/collective correspondence ("Clara an Kinder", "Walter an Geschwister") → Briefwechsel/<value>
- Semantic/event tags ("Brautbriefe", "Alltag", "zur Hochzeit") → Themen/<value>

Three correspondence patterns detected: space-an-space, starts-with-"an ",
and abbreviated-sender form ("Maria W.an Clara").

COLLECTIVE_TERMS in config.py extended with 17 plural/group relational terms
(söhne, brüder, schwiegereltern, cousinen, etc.) confirmed against the full Excel.

Also adds two-phase summary mining: every run emits review/tag-candidates.csv;
subsequent runs apply keywords from overrides/approved-themes.csv as Themen tags.

Outputs: canonical-documents.xlsx gets pipe-separated "Parent/Child" tag paths;
canonical-tag-tree.xlsx provides the full tag hierarchy for backend pre-import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 19:47:36 +02:00

approved-themes.csv

feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

2026-05-25 19:47:36 +02:00

dates.csv

docs(normalizer): README + seed overrides

2026-05-25 14:51:20 +02:00

names.csv

docs(normalizer): README + seed overrides

2026-05-25 14:51:20 +02:00

README.md

docs(normalizer): add overrides/ README with structure + examples

2026-05-25 16:53:03 +02:00

README.md

Overrides

Human corrections applied deterministically on every run. An override wins over the automatic date parser / name matcher, so this is how you fix the residue the tool can't resolve on its own. Two CSV files live here; both are read by overrides.load_overrides().

- Missing or header-only files are fine: they just contribute zero overrides.
- Keep these files committed to git (they're your curated corrections); the generated `out/` and `review/` folders are not committed.
- Matching is exact on the raw value after trimming surrounding whitespace. Copy the raw value verbatim from the matching `review/*.csv`.
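
The loading and matching rules above can be sketched as follows. This is a minimal illustration, not the actual `overrides.load_overrides()`: the helper names `load_override_rows` and `lookup` are hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path


def load_override_rows(path: Path) -> dict:
    """Read one override CSV into a dict keyed by the trimmed `raw` value.

    Hypothetical helper for illustration; the real loader may differ.
    """
    if not path.exists():
        return {}  # a missing file contributes zero overrides
    rows = {}
    with path.open(newline="", encoding="utf-8") as handle:
        # A header-only (or empty) file yields no rows, so the result stays empty.
        for row in csv.DictReader(handle):
            raw = (row.get("raw") or "").strip()
            if not raw:
                continue
            rows[raw] = {
                key: (value or "").strip()
                for key, value in row.items()
                if key is not None and key != "raw"
            }
    return rows


def lookup(overrides: dict, raw_value: str):
    """Exact match on the raw value after trimming surrounding whitespace."""
    return overrides.get(raw_value.strip())
```

Because the match is exact, `23.Juni 58` and `23. Juni 58` would be two different keys; only leading and trailing whitespace is ignored.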

The iteration loop

1. Run `python normalize.py`.
2. Open `review/unparsed-dates.csv` and `review/unresolved-names.csv` (sorted by frequency).
3. Add correction rows here, then re-run. Repeat until the residue is acceptable.

`dates.csv` — fix unparseable dates

Header: raw,iso,precision

| column | meaning |
| --- | --- |
| `raw` | the date string exactly as written in the spreadsheet (= the `raw` column in `review/unparsed-dates.csv`). |
| `iso` | the corrected date as `YYYY-MM-DD`. For partial dates use the 1st: month-only → `YYYY-MM-01`, year-only → `YYYY-01-01`. Leave empty if truly unknown. |
| `precision` | one of `DAY`, `MONTH`, `SEASON`, `YEAR`, `RANGE`, `APPROX`, `UNKNOWN`. |

Example

raw,iso,precision
23.Juni 58,1958-06-23,DAY
8.März 60,1960-03-08,DAY
Mayo 18-1929,1929-05-18,DAY
Abril 10-929,1929-04-10,DAY
30.April,1909-04-30,DAY
Mai 1895,1895-05-01,MONTH
Herbst 1913,1913-10-01,SEASON
1945/46,1945-01-01,RANGE
um 1920,1920-01-01,APPROX
?,,UNKNOWN
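
Before committing new rows, it can help to check them against the column rules above. The following is a minimal sketch, not part of the tool: `check_date_row` is a hypothetical helper, and tying the "use the 1st" convention to the `MONTH` and `YEAR` precisions is an assumption drawn from the column description.

```python
import re
from datetime import date

PRECISIONS = {"DAY", "MONTH", "SEASON", "YEAR", "RANGE", "APPROX", "UNKNOWN"}
ISO_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_date_row(raw: str, iso: str, precision: str) -> list:
    """Return a list of problems for one dates.csv row (empty list = row looks fine)."""
    problems = []
    if not raw.strip():
        problems.append("raw is empty")
    if precision not in PRECISIONS:
        problems.append(f"unknown precision {precision!r}")
    if iso:  # an empty iso is allowed ("leave empty if truly unknown")
        parsed = None
        if ISO_SHAPE.fullmatch(iso):
            try:
                parsed = date.fromisoformat(iso)
            except ValueError:
                parsed = None  # right shape, but not a real calendar date
        if parsed is None:
            problems.append(f"iso is not a valid YYYY-MM-DD date: {iso!r}")
        elif precision == "MONTH" and parsed.day != 1:
            problems.append("month-only dates should use day 01")
        elif precision == "YEAR" and (parsed.month, parsed.day) != (1, 1):
            problems.append("year-only dates should use 01-01")
    return problems
```

For instance, `check_date_row("Mai 1895", "1895-05-01", "MONTH")` returns an empty list, while an `iso` of `1895-05-15` with the same precision is reported.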

Notes:

- `23.Juni 58` / `8.März 60`: the two-digit years 58 and 60 fall in the parser's ambiguous 58–72 band (just past the 1873–1957 window), so they aren't auto-parsed; here you assert 1958 and 1960.
- `Mayo` / `Abril`: Spanish month names (Mexican-branch letters) the parser doesn't know yet.
- `30.April`: month and day with no year; pick the year from the letter's context.
- Empty `iso` + `UNKNOWN` records a deliberate "unknown date" (stops it showing up as residue).
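
The two-digit-year behaviour described in the first note can be sketched like this. It is a reconstruction from the stated 1873–1957 window and 58–72 band only, not the parser's actual code; `expand_two_digit_year` is a hypothetical name.

```python
from typing import Optional


def expand_two_digit_year(yy: int) -> Optional[int]:
    """Map a two-digit year into the 1873-1957 window, or None if ambiguous.

    Assumed rule, inferred from the note above:
      73..99 -> 1873..1899
      00..57 -> 1900..1957
      58..72 -> ambiguous (could be 18xx or 19xx), left for dates.csv
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    if yy >= 73:
        return 1800 + yy
    if yy <= 57:
        return 1900 + yy
    return None
```

Under this assumption, 58 and 60 return `None`, which is why `23.Juni 58` and `8.März 60` end up in `review/unparsed-dates.csv` and need an explicit override.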

`names.csv` — map a name string to a canonical person

Header: raw,person_id

| column | meaning |
| --- | --- |
| `raw` | the sender/receiver name string exactly as written (= the `raw` column in `review/unresolved-names.csv`). For a multi-name cell that was split (e.g. `"Walter und Eugenie"`), use the individual name part. |
| `person_id` | the canonical id to map it to. Must be a real id from the `person_id` column of `out/canonical-persons.xlsx` (a register person or an already-created provisional). |

Example

raw,person_id
A.Klucke,klucke-anna
? Hans de Gruyter,de-gruyter-hans
Eltern Cram,cram-john-james
Tante Lolly,blomquist-charlotte

Notes:

- Use this for partial, misspelled, illegible, or aliased names that should point at a known person.
- It maps one string → one person. It does not split a two-person cell: for genuine pairs like `Ella Anita` (flagged `ambiguous_pair`), there is no split-via-override yet. Leave them, or add both given names to `config.EXTRA_GIVEN_NAMES` so they keep getting flagged.
- Look up valid `person_id` values in `out/canonical-persons.xlsx`. An id that doesn't exist there will create a dangling reference (no validation yet).
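
Since the tool does not validate ids yet, a small manual check can catch dangling references before a run. This is a sketch under assumptions: `find_dangling_ids` is a hypothetical helper, and `known_ids` is assumed to have been collected beforehand from the `person_id` column of `out/canonical-persons.xlsx` (reading the workbook is not shown).

```python
import csv
import io


def find_dangling_ids(names_csv_text: str, known_ids: set) -> list:
    """Return (raw, person_id) pairs whose person_id is not among the known ids."""
    dangling = []
    for row in csv.DictReader(io.StringIO(names_csv_text)):
        raw = (row.get("raw") or "").strip()
        person_id = (row.get("person_id") or "").strip()
        # Skip blank rows; report every mapped name that points at an unknown id.
        if raw and person_id not in known_ids:
            dangling.append((raw, person_id))
    return dangling


sample = "raw,person_id\nA.Klucke,klucke-anna\nTante Lolly,blomquist-charlote\n"
known = {"klucke-anna", "blomquist-charlotte"}
for raw, person_id in find_dangling_ids(sample, known):
    print(f"{raw!r} points at unknown id {person_id!r}")
```

The sample deliberately misspells one id to show what a reported row looks like.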
