familienarchiv/docs/import-migration
Marcel 46d1f5c6d8 chore(import): stop tracking real family PII canonical artifacts
The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 10:20:38 +02:00

Import Migration — Working Folder

This folder tracks the iterative work of mass-importing the real, raw family archive spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.

This work is deliberately tracked in local docs, not Gitea issues. We only open a Gitea issue when a finding requires a software change (e.g. a new date parser). Pure data observations and the running plan live here so any agent can pick the work up cold.

Source files (in /import)

| File | What it is | Importer support today |
|---|---|---|
| zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx | The real raw archive: 7,943 rows, sheet Familienarchiv. Human-readable, dates as written in the letters. | layout does not match importer defaults |
| Personendatei 2.xlsx | Genealogical person register: 163 people, sheet Tabelle1 (maiden names, birth/death, marriages, relationships). | no importer at all |
| zzfamilienarchiv Walter und Eugenie 2025-04-10.ods | A small, already-normalized subset (Walter & Eugenie brautbriefe). 14 clean columns incl. ISO dates. | this is what MassImportService was built for |

The PDFs (~7,000) will follow later. The importer matches files by the Index column (e.g. W-0001 → W-0001.pdf) and already imports metadata-only when a file is missing, so we can import all metadata now and the PDFs will attach on a re-run.
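The matching rule can be illustrated with a short Python sketch. This is not the importer's code (the real importer is the Java MassImportService); the function name match_pdfs and the assumption that the file name is exactly the Index value plus ".pdf" are illustrative only.

```python
from pathlib import Path


def match_pdfs(indexes, pdf_dir):
    """Map each Index value to its PDF path, or None if the file is not there yet.

    Assumption: the file name is exactly "<Index>.pdf" inside pdf_dir.
    """
    pdf_dir = Path(pdf_dir)
    matches = {}
    for index in indexes:
        candidate = pdf_dir / f"{index}.pdf"
        # A missing file is not an error: the row is imported metadata-only
        # and the PDF can attach on a later re-run.
        matches[index] = candidate if candidate.is_file() else None
    return matches
```

Rows that map to None correspond to the metadata-only case described above.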

How to inspect the spreadsheets

openpyxl is installed in the OCR service venv:

/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
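For a first look at a sheet, something like the following sketch may help. It assumes the first row of the sheet is the header row, which may not hold for the raw archive, and the helper names header_and_count and load_rows are made up for this example. Only load_rows touches openpyxl.

```python
def header_and_count(rows):
    """Return (header cells, number of non-empty data rows) from row tuples.

    Assumption: the first row yielded is the header row.
    """
    rows = iter(rows)
    header = next(rows, ())
    data_rows = sum(
        1 for row in rows if any(cell not in (None, "") for cell in row)
    )
    return list(header), data_rows


def load_rows(path, sheet_name):
    """Yield rows as tuples of cell values from one sheet of an xlsx workbook."""
    import openpyxl  # imported lazily so header_and_count works without it

    # read_only keeps memory low on large sheets; data_only returns cached
    # formula results instead of formula strings.
    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        yield from workbook[sheet_name].iter_rows(values_only=True)
    finally:
        workbook.close()
```

Run with the venv interpreter above, a call such as header_and_count(load_rows(path, "Familienarchiv")) would return the header cells and a row count for the raw archive sheet, where path points at the xlsx file.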

Documents in this folder

  • 01-findings-spreadsheet-analysis.md — full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID IMP-NN.
  • 02-normalization-spec.md — requirements spec for the offline import normalizer (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements FR-*/NFR-*, traceable to the IMP-NN findings.
  • WORKLOG.md — running log of what each session did and what's next. Start here when resuming.

Strategy (decided 2026-05-25)

Normalize before import. A standalone Python tool (tools/import-normalizer/, not yet built) transforms the raw xlsx + person register into a clean canonical dataset (canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs. Residual cases (unparseable dates, unmatched names) are fixed via a version-controlled overrides file and re-run. The Java importer is adjusted to consume the canonical contract in a later Phase 2. See the spec for the full contract.
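The override-and-re-run loop could look roughly like the sketch below. The actual overrides format is defined in the spec, not here: the row keys ("index", "date_iso"), the overrides mapping, and the function name apply_overrides are all hypothetical.

```python
def apply_overrides(rows, overrides):
    """Return (resolved rows, rows still needing review).

    rows: list of dicts with hypothetical keys "index" and "date_iso"
          (date_iso is None when the date could not be parsed).
    overrides: dict mapping an index to a manually corrected ISO date string.
    """
    resolved, review = [], []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are never mutated
        if fixed.get("date_iso") is None and fixed["index"] in overrides:
            fixed["date_iso"] = overrides[fixed["index"]]
        if fixed.get("date_iso") is not None:
            resolved.append(fixed)
        else:
            review.append(fixed)  # would end up in a review CSV
    return resolved, review
```

Each re-run shrinks the review list as more entries are added to the overrides file, while the raw spreadsheets stay untouched.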

The canonical artifacts themselves (the out/ files) are produced locally and not version-controlled — they contain real family PII. They are synced onto the ops host's IMPORT_HOST_DIR alongside the PDFs, out-of-band. The contract is the header schema in 02-normalization-spec.md §6, not any particular file in out/. See ADR-025 for the full rationale.
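To make "the contract is the header schema" concrete, here is a minimal Python sketch of a fail-closed header check. It is an illustration of the idea only, not the Java CanonicalSheetReader; the header names in REQUIRED_HEADERS are placeholders, since the real list lives in 02-normalization-spec.md §6.

```python
# Placeholder names; the real schema is defined in the normalization spec.
REQUIRED_HEADERS = ("index", "date_iso", "sender", "receiver")


class MissingHeaderError(ValueError):
    """Raised when a canonical sheet lacks a required header."""


def column_positions(header_row, required=REQUIRED_HEADERS):
    """Return {header name: column position} or raise if any header is missing."""
    positions = {
        str(name).strip(): i
        for i, name in enumerate(header_row)
        if name is not None
    }
    missing = [name for name in required if name not in positions]
    if missing:
        # Fail closed: refuse the whole sheet rather than guess at columns.
        raise MissingHeaderError(
            f"missing required header(s): {', '.join(missing)}"
        )
    return {name: positions[name] for name in required}
```

Looking columns up by name rather than by position means column order and extra columns do not matter, while a renamed or dropped header stops the import immediately.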

Status board

| ID | Issue | Severity | Status |
|---|---|---|---|
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (Personendatei 2.xlsx) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | x-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare u/u. | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |

See the findings doc for detail and proposed approach per issue.
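As an example of how findings like IMP-06 and IMP-07 can be re-checked against a sheet, the sketch below scans a column of Index values for blanks and duplicates. The function name index_problems is hypothetical, and the blank-row list will also include section/title rows (IMP-08), so it is an upper bound on the IMP-06 count rather than a reproduction of it.

```python
from collections import Counter


def index_problems(index_values):
    """Return (row numbers with a blank Index, Index values seen more than once).

    Row numbers are 1-based positions within index_values.
    """
    blank_rows = []
    counts = Counter()
    for row_number, value in enumerate(index_values, start=1):
        text = "" if value is None else str(value).strip()
        if text:
            counts[text] += 1
        else:
            blank_rows.append(row_number)
    duplicates = sorted(index for index, n in counts.items() if n > 1)
    return blank_rows, duplicates
```

Feeding it the Index column of the raw sheet gives a quick way to see whether a status-board entry still applies after the data changes.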