feat(normalizer): generate structured tags from Schlagwort + Inhalt fields

Adds tags.py module implementing a three-outcome heuristic:
- Individual-to-individual correspondence tags ("Clara an Herbert") → dropped
- Group/collective correspondence ("Clara an Kinder", "Walter an Geschwister") → Briefwechsel/<value>
- Semantic/event tags ("Brautbriefe", "Alltag", "zur Hochzeit") → Themen/<value>

Three correspondence patterns are detected: space-an-space ("Clara an Herbert"),
a leading "an ", and the abbreviated-sender form with no space before "an"
("Maria W.an Clara").

COLLECTIVE_TERMS in config.py is extended with 17 plural/group relational terms
(söhne, brüder, schwiegereltern, cousinen, etc.), confirmed against the full Excel file.
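
The three outcomes and three patterns above can be sketched roughly as follows. This is a minimal illustration, not the actual tags.py: the names `recipient_of` and `classify` are hypothetical, the `COLLECTIVE_TERMS` set is only a small sample of the real list in config.py, and it assumes that `<value>` is the full original tag and that only the recipient side is checked against the collective terms.

```python
# Hypothetical sample; the real COLLECTIVE_TERMS lives in config.py.
COLLECTIVE_TERMS = {"kinder", "geschwister", "söhne", "brüder",
                    "schwiegereltern", "cousinen"}


def recipient_of(tag: str):
    """Return the recipient if the tag looks like correspondence, else None."""
    tag = tag.strip()
    # Pattern 2: tag starts with "an " (no sender given).
    if tag.lower().startswith("an "):
        return tag[3:].strip()
    # Pattern 1: "Sender an Recipient"; pattern 3: "Maria W.an Clara".
    for sep in (" an ", ".an "):
        if sep in tag:
            return tag.split(sep, 1)[1].strip()
    return None


def classify(tag: str):
    """Map a raw tag to a "Parent/Child" path, or None if it is dropped."""
    recipient = recipient_of(tag)
    if recipient is None:
        # Not correspondence: treat as a semantic/event tag.
        return f"Themen/{tag.strip()}"
    if recipient.lower() in COLLECTIVE_TERMS:
        # Assumption: only the recipient decides whether the tag is collective.
        return f"Briefwechsel/{tag.strip()}"
    # Individual-to-individual correspondence is dropped.
    return None
```

With this sketch, `classify("Clara an Kinder")` yields a Briefwechsel path, `classify("Alltag")` yields a Themen path, and `classify("Clara an Herbert")` returns `None`.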

Also adds two-phase summary mining: every run emits review/tag-candidates.csv;
subsequent runs apply keywords from overrides/approved-themes.csv as Themen tags.
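
One way the two phases could fit together is sketched below. The commit does not say how candidates are mined, so the word-frequency counting here is purely an assumption, as are the function names and the `keyword`/`count` CSV columns; only the two file paths come from the message above.

```python
import csv
import re
from collections import Counter
from pathlib import Path


def write_candidates(summaries, path, min_count=2):
    """Phase 1: write words seen in at least min_count summaries for review."""
    counts = Counter()
    for text in summaries:
        # Count each word once per summary so long texts do not dominate.
        counts.update(set(re.findall(r"[^\W\d_]{4,}", text.lower())))
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["keyword", "count"])  # hypothetical column names
        for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if n >= min_count:
                writer.writerow([word, n])


def load_approved(path):
    """Phase 2 input: keywords a reviewer copied into the overrides file."""
    path = Path(path)
    if not path.exists():
        return []  # first run: nothing approved yet
    with path.open(newline="", encoding="utf-8") as fh:
        return [row["keyword"].strip() for row in csv.DictReader(fh)
                if (row.get("keyword") or "").strip()]


def theme_tags(summary, approved):
    """Phase 2: emit a Themen tag for every approved keyword in the summary."""
    lowered = summary.lower()
    return [f"Themen/{kw}" for kw in approved if kw.lower() in lowered]
```

The point of splitting the work this way is that a human reviews the candidates between runs, so no mined keyword becomes a tag without approval.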

Outputs: canonical-documents.xlsx gets pipe-separated "Parent/Child" tag paths;
canonical-tag-tree.xlsx provides the full tag hierarchy for backend pre-import.
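
The relationship between the two outputs might look like the sketch below. The column names `tag_path`, `parent_name`, and `tag_name` are taken from the diff in this commit; the helper names `join_tag_paths` and `build_tag_tree`, the sorting, and the choice to emit a row for each parent are assumptions made for illustration.

```python
def join_tag_paths(paths):
    """Cell value for canonical-documents.xlsx: unique paths joined by '|'."""
    return "|".join(sorted(set(paths)))


def build_tag_tree(paths):
    """Rows for canonical-tag-tree.xlsx, one per parent and one per child."""
    rows = {}
    for path in paths:
        parent, _, name = path.partition("/")
        # Assumption: parents get their own row with an empty parent_name.
        rows.setdefault(parent, {"tag_path": parent,
                                 "parent_name": "",
                                 "tag_name": parent})
        if name:
            rows.setdefault(path, {"tag_path": path,
                                   "parent_name": parent,
                                   "tag_name": name})
    # Sorted output keeps the file stable across runs.
    return [rows[key] for key in sorted(rows)]
```

A list built this way has the shape that `write_tag_tree_xlsx(tree, path)` in the diff expects: a list of dicts keyed by the three column names.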

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -47,6 +47,19 @@ def write_documents_xlsx(docs, path: Path):
     _write_xlsx(docs, DOC_COLUMNS, path)
 
 
+def write_tag_tree_xlsx(tree: list[dict], path: Path):
+    columns = ["tag_path", "parent_name", "tag_name"]
+    wb = openpyxl.Workbook()
+    ws = wb.active
+    ws.append(columns)
+    for row in tree:
+        ws.append([row.get(col, "") for col in columns])
+    wb.properties.created = _FIXED_TS
+    wb.properties.modified = _FIXED_TS
+    Path(path).parent.mkdir(parents=True, exist_ok=True)
+    wb.save(path)
+
+
 def write_persons_xlsx(people, path: Path):
     _write_xlsx(people, PERSON_COLUMNS, path)