
Marcel 46d1f5c6d8 chore(import): stop tracking real family PII canonical artifacts

The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 10:20:38 +02:00

4.1 KiB


Import Migration — Working Folder

This folder tracks the iterative work of mass-importing the real, raw family archive spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.

These notes are intentionally kept as local docs rather than Gitea issues. We open a Gitea issue only when a finding requires a software change (e.g. a new date parser). Pure data observations and the running plan live here so any agent can pick the work up cold.

Source files (in `/import`)

| File | What it is | Importer support today |
| --- | --- | --- |
| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The real raw archive: 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does not match importer defaults |
| `Personendatei 2.xlsx` | Genealogical person register: 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, already-normalized subset (Walter & Eugenie brautbriefe). 14 clean columns incl. ISO dates. | ✅ this is what `MassImportService` was built for |

The PDFs (~7,000) will follow later. The importer matches files by the Index column (e.g. W-0001 → W-0001.pdf), and already imports metadata-only when a file is missing — so we can import all metadata now and the PDFs will attach on a re-run.
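The matching rule above can be sketched in a few lines of Python. This is only an illustration of the Index-to-filename convention, not the importer's actual code; `find_pdf`, `plan_import` and the `pdf_dir` parameter are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def find_pdf(index: str, pdf_dir: Path) -> Optional[Path]:
    """Return the PDF for an Index value, or None if it has not arrived yet."""
    candidate = pdf_dir / f"{index.strip()}.pdf"
    return candidate if candidate.is_file() else None


def plan_import(indexes: Iterable[str], pdf_dir: Path) -> Tuple[List[str], List[str]]:
    """Split Index values into rows that have a file and metadata-only rows."""
    with_file: List[str] = []
    metadata_only: List[str] = []
    for index in indexes:
        if find_pdf(index, pdf_dir) is not None:
            with_file.append(index)
        else:
            metadata_only.append(index)
    return with_file, metadata_only
```

Running `plan_import` again after more PDFs have been copied into the folder moves the affected Index values from the second list to the first, which is the re-run behaviour described above.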

How to inspect the spreadsheets

openpyxl is installed in the OCR service venv:

/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
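A small script along these lines may be a convenient starting point for looking at sheet names and header rows. It is a sketch, run with the venv interpreter above; `clean_header` and `sheet_headers` are hypothetical helper names, and since openpyxl reads xlsx workbooks, the `.ods` file would need a different reader.

```python
from typing import Dict, Iterable, List, Optional


def clean_header(row: Iterable[Optional[object]]) -> List[str]:
    """Turn a raw header row into trimmed strings, dropping trailing empty cells."""
    cells = ["" if cell is None else str(cell).strip() for cell in row]
    while cells and cells[-1] == "":
        cells.pop()
    return cells


def sheet_headers(path: str) -> Dict[str, List[str]]:
    """Map each sheet name in an xlsx workbook to its cleaned first row."""
    import openpyxl  # imported lazily so clean_header works without openpyxl

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        headers: Dict[str, List[str]] = {}
        for sheet in workbook.worksheets:
            rows = sheet.iter_rows(min_row=1, max_row=1, values_only=True)
            headers[sheet.title] = clean_header(next(rows, ()))
        return headers
    finally:
        workbook.close()
```

Calling `sheet_headers` on one of the xlsx files in `/import` and printing the result shows at a glance which sheets exist and what their first row looks like.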

Documents in this folder

01-findings-spreadsheet-analysis.md — full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID IMP-NN.
02-normalization-spec.md — requirements spec for the offline import normalizer (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements FR-*/NFR-*, traceable to the IMP-NN findings.
WORKLOG.md — running log of what each session did and what's next. Start here when resuming.

Strategy (decided 2026-05-25)

Normalize before import. A standalone Python tool (tools/import-normalizer/, not yet built) transforms the raw xlsx + person register into a clean canonical dataset (canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs. Residual cases (unparseable dates, unmatched names) are fixed in a version-controlled overrides file, after which the tool is re-run. The Java importer will be adjusted to consume the canonical contract later, in Phase 2. See the spec for the full contract.
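The override-and-re-run loop could look roughly like the sketch below. It is an assumption-laden illustration, not the spec: it handles only plain `D.M.YYYY` dates, keys overrides by Index, and all function names and sample values are hypothetical.

```python
import re
from datetime import date
from typing import Dict, Optional

# Assumption: only purely numeric day.month.year dates are parsed automatically.
_NUMERIC_DATE = re.compile(r"^\s*(\d{1,2})\.(\d{1,2})\.(\d{4})\s*$")


def parse_date(raw: Optional[str]) -> Optional[str]:
    """Return an ISO date for 'D.M.YYYY' input, or None if it cannot be parsed."""
    match = _NUMERIC_DATE.match(raw or "")
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. 31.2.1923
        return None


def resolve_date(index: str, raw: Optional[str], overrides: Dict[str, str]) -> Optional[str]:
    """Prefer a manual override for this Index; otherwise try the parser."""
    if index in overrides:
        return overrides[index]
    return parse_date(raw)
```

Rows for which `resolve_date` returns `None` would go to a review CSV; adding an entry for that Index to the overrides file and re-running resolves them without touching the raw spreadsheet.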

The canonical artifacts themselves (the out/ files) are produced locally and not version-controlled — they contain real family PII. They are synced onto the ops host's IMPORT_HOST_DIR alongside the PDFs, out-of-band. The contract is the header schema in 02-normalization-spec.md §6, not any particular file in out/. See ADR-025 for the full rationale.
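To make "the contract is the header schema" concrete, here is a minimal Python sketch of a fail-closed header check. It is not the importer's reader; the column names in `REQUIRED_COLUMNS` are placeholders, and the authoritative list is the one in the spec.

```python
from typing import Iterable, List, Sequence

# Hypothetical column names for illustration; the real ones are defined in the spec.
REQUIRED_COLUMNS = ("index", "date_iso", "sender", "receiver")


class HeaderContractError(ValueError):
    """Raised when a canonical sheet lacks a required header."""


def check_header(header: Sequence[str],
                 required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the trimmed header, or raise if any required column is missing."""
    present = [name.strip() for name in header]
    missing = [name for name in required if name not in present]
    if missing:
        # Fail closed: refuse the whole sheet rather than import partial data.
        raise HeaderContractError("missing required columns: " + ", ".join(missing))
    return present
```

Because the check looks only at column names, any locally produced file with the right header satisfies it, which is why the files in out/ do not need to be tracked.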

Status board

| ID | Issue | Severity | Status |
| --- | --- | --- | --- |
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |

See the findings doc for detail and proposed approach per issue.
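Counts like those in IMP-06 and IMP-07 can be re-checked with a short helper once the Index column has been read into a list. This is a sketch under the assumption that blank means empty or whitespace-only; `audit_indexes` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def audit_indexes(values: Iterable[Optional[object]]) -> Tuple[int, List[str]]:
    """Return (number of blank Index cells, sorted list of duplicated Index values)."""
    blank = 0
    seen: Counter = Counter()
    for value in values:
        text = "" if value is None else str(value).strip()
        if text:
            seen[text] += 1
        else:
            blank += 1
    duplicates = sorted(key for key, count in seen.items() if count > 1)
    return blank, duplicates
```

Note that this counts every blank cell, so section/title rows (IMP-08) would need to be filtered out first to reproduce the figure for data rows only.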
