Import normalizer: offline tool to normalize the raw archive spreadsheets #663

marcel merged 172 commits from docs/import-migration into main 2026-05-28 15:05:51 +02:00
Showing only changes of commit 6103d5d229


@@ -283,10 +283,10 @@ The Excel serial conversion is new logic added directly in `persons_tree.py` (3
 ---
-## 12. Open Questions
-| OQ | Question | Blocks |
-|----|----------|--------|
-| OQ-01 | Some persons appear twice with slightly different data (rows 127/138 — Christa Schütz/Siebert; rows 129/139 — Christoph Seils). Deduplicate in the tool or leave as duplicates for the backend to handle? | §8 persons array |
-| OQ-02 | `birthPlace` / `deathPlace` are in the source but absent from the `Person` entity. Should they go into `notes`, or should the backend importer add new columns? | §8 persons array, future backend importer |
-| OQ-03 | The `firstName` for `"Charlotte,Meta,Jacobi"` (row 7 / row 120) is a comma-separated multi-name. Store verbatim or split into `firstName` + `alias`? | §5 name normalization |
+## 12. Resolved Decisions
+| OQ | Question | Decision |
+|----|----------|----------|
+| OQ-01 | Duplicate rows (127/138 — Christa Schütz; 129/139 — Christoph Seils). | **Tool deduplicates.** On pass 1, after building the person list, detect rows with identical `(firstName, lastName, birthYear)` and keep only the first occurrence. Log skipped row ids to stdout. |
+| OQ-02 | `birthPlace` / `deathPlace` absent from `Person` entity. | **Keep as separate top-level fields** in the JSON (`birthPlace`, `deathPlace`). The future backend importer may add columns to the `persons` table; the field is preserved here to avoid data loss. |
+| OQ-03 | `firstName` = `"Charlotte,Meta,Jacobi"` (multi-name comma string). | **Store verbatim as `firstName`.** No splitting. |
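
The OQ-01 decision could be sketched roughly as follows. This is a minimal illustration, not the code in `persons_tree.py`: the function name `dedupe_persons`, the `rowId` key, and the log message format are assumptions made for the example. Keys are compared exactly, so two rows whose names or birth years differ at all are both kept.

```python
from typing import Iterable


def dedupe_persons(rows: Iterable[dict]) -> list[dict]:
    """Keep the first row for each (firstName, lastName, birthYear) key.

    Hypothetical helper: field names follow the decision table, while
    `rowId` is an assumed name for the spreadsheet row identifier.
    """
    seen: set[tuple] = set()
    kept: list[dict] = []
    for row in rows:
        key = (row.get("firstName"), row.get("lastName"), row.get("birthYear"))
        if key in seen:
            # Log the skipped row id to stdout, as the decision asks.
            print(f"skipped duplicate row {row.get('rowId')}")
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Because the first occurrence wins, the output order matches the spreadsheet order, which should keep repeated runs of the tool stable for an unchanged input.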