feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

Adds tags.py module implementing a three-outcome heuristic:
- Individual-to-individual correspondence tags ("Clara an Herbert") → dropped
- Group/collective correspondence ("Clara an Kinder", "Walter an Geschwister") → Briefwechsel/<value>
- Semantic/event tags ("Brautbriefe", "Alltag", "zur Hochzeit") → Themen/<value>

Three correspondence patterns detected: space-an-space, starts-with-"an ",
and abbreviated-sender form ("Maria W.an Clara").

COLLECTIVE_TERMS in config.py extended with 17 plural/group relational terms
(söhne, brüder, schwiegereltern, cousinen, etc.) confirmed against the full Excel.

Also adds two-phase summary mining: every run emits review/tag-candidates.csv;
subsequent runs apply keywords from overrides/approved-themes.csv as Themen tags.

Outputs: canonical-documents.xlsx gets pipe-separated "Parent/Child" tag paths;
canonical-tag-tree.xlsx provides the full tag hierarchy for backend pre-import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Marcel
2026-05-25 19:47:36 +02:00
parent 5efe3b8a7c
commit 94a40237f4
9 changed files with 405 additions and 6 deletions

View File

@@ -47,6 +47,19 @@ def write_documents_xlsx(docs, path: Path):
_write_xlsx(docs, DOC_COLUMNS, path)
def write_tag_tree_xlsx(tree: list[dict], path: Path):
columns = ["tag_path", "parent_name", "tag_name"]
wb = openpyxl.Workbook()
ws = wb.active
ws.append(columns)
for row in tree:
ws.append([row.get(col, "") for col in columns])
wb.properties.created = _FIXED_TS
wb.properties.modified = _FIXED_TS
Path(path).parent.mkdir(parents=True, exist_ok=True)
wb.save(path)
def write_persons_xlsx(people, path: Path):
_write_xlsx(people, PERSON_COLUMNS, path)