feat(normalizer): generate structured tags from Schlagwort + Inhalt fields
Adds tags.py module implementing a three-outcome heuristic:
- Individual-to-individual correspondence tags ("Clara an Herbert") → dropped
- Group/collective correspondence ("Clara an Kinder", "Walter an Geschwister") → Briefwechsel/<value>
- Semantic/event tags ("Brautbriefe", "Alltag", "zur Hochzeit") → Themen/<value>
Three correspondence patterns detected: space-an-space, starts-with-"an ",
and abbreviated-sender form ("Maria W.an Clara").
COLLECTIVE_TERMS in config.py extended with 17 plural/group relational terms
(söhne, brüder, schwiegereltern, cousinen, etc.) confirmed against the full Excel.
Also adds two-phase summary mining: every run emits review/tag-candidates.csv;
subsequent runs apply keywords from overrides/approved-themes.csv as Themen tags.
Outputs: canonical-documents.xlsx gets pipe-separated "Parent/Child" tag paths;
canonical-tag-tree.xlsx provides the full tag hierarchy for backend pre-import.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -116,6 +116,10 @@ RELATIONAL_TERMS = {
|
||||
COLLECTIVE_TERMS = {
|
||||
"familie", "fam", "kinder", "eltern", "geschwister", "großeltern",
|
||||
"grosseltern", "alle", "diverse", "div", "gebrüder", "gebr",
|
||||
# Plural/group relational terms — added for tag generation heuristic
|
||||
"söhne", "töchter", "brüder", "schwestern", "schwiegereltern",
|
||||
"vettern", "kusinen", "cousinen", "nichten", "neffen", "tanten",
|
||||
"freunde", "bekannte", "geschw", "enkelkinder", "jungens", "verwandten",
|
||||
}
|
||||
# Markers of an unknown/illegible name (the literal "?" is handled separately in code).
|
||||
# All long enough to be safe as SUBSTRING matches — do NOT add short tokens like "nn"
|
||||
|
||||
Reference in New Issue
Block a user