diff --git a/docs/import-migration/02-normalization-spec.md b/docs/import-migration/02-normalization-spec.md
index b2829d23..b301c42c 100644
--- a/docs/import-migration/02-normalization-spec.md
+++ b/docs/import-migration/02-normalization-spec.md
@@ -176,6 +176,14 @@ letter actually said.*
   Silvester=12-31, …). Seasons map to representative months: Frühling/Frühjahr=Apr, Sommer=Jul,
   Herbst=Oct, Winter=Jan. The feast/season tables and Easter algorithm live in `config.py`
   (NFR-MAINT-01).
+- **REQ-DATE-07** — **Intra-month day ranges carry an end day; half-resolved ranges are
+  flagged.** For a day range like `7./8. Sept.1923`, `date_iso` holds the start day, the end
+  day is resolved against the shared month/year into `date_end`, and `date_precision` =
+  `RANGE`. If the **start** parses but the **end day is impossible** (e.g. `10./40.1.1917`),
+  the row keeps the start and `RANGE` precision, leaves `date_end` **empty**, and is flagged
+  `needs_review = range_end_unparsed` — the unparseable end is dropped honestly (surfaced for
+  review), never silently invented or clamped. A `RANGE` row **may** therefore legitimately
+  have an empty `date_end`; the importer must treat `date_end` as optional even on a `RANGE`.
 
 ### 4.4 Person resolution & dedup (`FR-PERS`, `FR-DEDUP`) — resolves IMP-04, IMP-05, IMP-11
 
@@ -262,6 +270,7 @@ DB schema.
 | Field | Required | Format / values | Notes |
 | --- | --- | --- | --- |
 | `index` | yes | string | Stable key; basis for PDF matching. |
+| `file` | no | string | verbatim `Datei` value (e.g. `H-0730.pdf`); carried through for the importer to link the scanned PDF. |
 | `box` | no | string | from `Box`. |
 | `folder` | no | string | from `Mappe`. |
 | `sender_person_id` | no | person_id | resolved; empty if no sender. |
@@ -271,11 +280,12 @@ DB schema.
 | `date_iso` | no | `YYYY-MM-DD` | best-effort; empty if `UNKNOWN`. |
 | `date_raw` | no | string | verbatim source date. |
 | `date_precision` | yes | enum | `DAY\|MONTH\|SEASON\|YEAR\|RANGE\|APPROX\|UNKNOWN`. |
+| `date_end` | no | `YYYY-MM-DD` or empty | RANGE end day (e.g. `7./8. Sept.1923` → `date_iso` = start, `date_end` = end). Empty for every non-RANGE precision **and** for a half-resolved RANGE whose end did not parse (see REQ-DATE-07). |
 | `location` | no | string | from `Ort`. |
 | `tags` | no | `tag\|tag` | from `Schlagwort`. |
 | `summary` | no | string | from `Inhalt`. |
 | `source_row` | yes | int | provenance (NFR-DATA-01). |
-| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). |
+| `needs_review` | yes | `flag\|flag` or empty | review flags (REQ-PROV-02). Flags include `unparsed_date`, `range_end_unparsed` (half-resolved RANGE, REQ-DATE-07), `unmatched_sender`, `unmatched_receiver`, `multi_sender`, `index_file_mismatch`, `duplicate_index`. |
 
 ### 6.2 `canonical-persons.xlsx`
 
@@ -295,6 +305,27 @@ DB schema.
 | `aliases` | no | `a\|b\|c` | every surface form that maps here. |
 | `provisional` | yes | bool | true if created from a document string, not the register. |
 
+### 6.3 `canonical-persons-tree.json`
+
+The de-duplicated genealogical tree (family members + their relationships) the importer
+uses to seed the family graph. Each `persons[]` entry carries a `personId` that **joins
+1:1 onto** `person_id` in `canonical-persons.xlsx`.
+
+| Field | Required | Format | Notes |
+| --- | --- | --- | --- |
+| `personId` | yes | slug | The register's **verbatim** `person_id` (e.g. `cram-hans-1`), propagated — never re-slugified — so collision suffixes match `canonical-persons.xlsx` exactly. Every tree `personId` exists in the register; the register is the sole slug authority. |
+| `firstName` / `lastName` / `maidenName` | first/last yes | string | name parts. |
+| `birthYear` / `deathYear` | no | int or null | year only (tree granularity). |
+| `birthPlace` / `deathPlace` | no | string or null | from the register. |
+| `generation` | no | int or null | parsed from `G n`. |
+| `notes` | no | string or null | leftover Bemerkung text after relationship extraction. |
+| `familyMember` | yes | bool | always true for tree persons. |
+
+A top-level `generated_at` is pinned to a fixed timestamp (`2020-01-01T00:00:00`) for
+reproducibility (NFR-IDEM-01), not a wall-clock value. `relationships[]` carry `SPOUSE_OF`
+and `PARENT_OF` edges keyed by `rowId`; `unresolved[]` lists relationship strings that did
+not match a tree person.
+
 ---
 
 ## 7. Prioritized Backlog (MoSCoW)
@@ -339,7 +370,7 @@ DB schema.
 | ID | Question | Why it matters | Ref | Resolution |
 | --- | --- | --- | --- | --- |
 | OQ-01 ✅ | Season/holiday → date. | Accuracy of ~70 SEASON/feast rows. | REQ-DATE-06 | **Resolved (2026-05-25):** movable feasts (Ostern, Pfingsten, Himmelfahrt, Advent, …) **computed per year from Easter — never a fixed month**; fixed feasts looked up (Weihnachten=12-25, Neujahr=01-01, …); seasons = mid-season month (Frühling=Apr, Sommer=Jul, Herbst=Oct, Winter=Jan). |
-| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02 | **Confirmed:** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`. |
+| OQ-02 ✅ | Date ranges: start only, or start+end? | Sorting/display of ~315 range values. | REQ-DATE-02, REQ-DATE-07 | **Confirmed (updated #670):** store **start** in `date_iso`, precision `RANGE`, full text in `date_raw`, **and the resolved end day in `date_end`** for intra-month day ranges. A half-resolved range (start parsed, end impossible) keeps `date_end` empty and is flagged `range_end_unparsed`. |
 | OQ-03 ✅ | `person_id` format. | Stability across re-runs; diffability. | §6 | **Confirmed:** readable slug `lastname-firstname`, numeric suffix on collision. |
 | OQ-04 ✅ | `x`-suffix row handling. | 42 rows. | REQ-TRIAGE-03 | **Resolved (2026-05-25):** `x` rows are transcriptions of the base letter but not yet mappable → **skip this pass**, log to `review/skipped-x-suffix.csv` for later linking. |
 | OQ-05 ✅ | Importer output format. | Phase-2 reader. | B11 | **Confirmed:** `.xlsx` (openpyxl-native, headered). |
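
Review note on REQ-DATE-07: the half-resolved range rule may be easier to check against a small sketch. The Python below is only an illustration under stated assumptions, not the project's normalizer. The names `parse_day_range`, `RANGE_RE`, and `MONTHS` are hypothetical, and the month table is a stand-in for whatever `config.py` actually holds. The sketch covers only the two shapes quoted in the spec (`7./8. Sept.1923` and `10./40.1.1917`) and does not decide what happens when the end day precedes the start day, since the spec text is silent on that.

```python
import re
from datetime import date

# Hypothetical month lookup; the real tables are described as living in config.py.
MONTHS = {"jan": 1, "feb": 2, "märz": 3, "apr": 4, "mai": 5, "juni": 6,
          "juli": 7, "aug": 8, "sept": 9, "okt": 10, "nov": 11, "dez": 12}

# "<start>./<end>." then a numeric or abbreviated month, then a four-digit year.
RANGE_RE = re.compile(
    r"^\s*(\d{1,2})\.\s*/\s*(\d{1,2})\.\s*([A-Za-zäÄ]+|\d{1,2})\.?\s*(\d{4})\s*$"
)


def _safe_date(year, month, day):
    """Return a date, or None when the calendar rejects the combination."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def parse_day_range(raw):
    """Return the date fields for an intra-month day range, or None if `raw`
    is not such a range or its start day does not parse."""
    m = RANGE_RE.match(raw)
    if not m:
        return None
    start_day, end_day, month_token, year = m.groups()
    if month_token.isdigit():
        month = int(month_token)
    else:
        month = MONTHS.get(month_token.lower())
    if month is None:
        return None
    start = _safe_date(int(year), month, int(start_day))
    if start is None:
        return None  # no usable start, so this is not a half-resolved range
    end = _safe_date(int(year), month, int(end_day))
    row = {
        "date_iso": start.isoformat(),
        "date_end": "",
        "date_precision": "RANGE",
        "needs_review": "",
    }
    if end is not None:
        row["date_end"] = end.isoformat()
    else:
        # Impossible end day: keep the start, leave date_end empty, flag the row.
        row["needs_review"] = "range_end_unparsed"
    return row


print(parse_day_range("7./8. Sept.1923"))
print(parse_day_range("10./40.1.1917"))
```

The point of the sketch is the last branch: an impossible end day is neither clamped to the month's last day nor guessed, so an importer reading the output has to accept a `RANGE` row whose `date_end` is empty.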