The four files in tools/import-normalizer/out/ contain real names, addresses, and attribution prose for ~163 living/deceased family members and were committed by mistake. They are now removed from the index (kept on disk for local development) and gitignored. The canonical artifacts are produced locally from the Python normalizer and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The contract between normalizer and importer is the header schema, not the file contents — CanonicalSheetReader fails closed on a missing header, which is what locks the contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Import Migration — Working Folder
This folder tracks the iterative work of mass-importing the real, raw family archive spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.
These are intentionally local docs, not Gitea issues. We open a Gitea issue only when a finding requires a software change (e.g. a new date parser). Pure data observations and the running plan live here so any agent can pick the work up cold.
Source files (in /import)
| File | What it is | Importer support today |
|---|---|---|
| zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx | The real raw archive: 7,943 rows, sheet Familienarchiv. Human-readable, dates as written in the letters. | ❌ layout does not match importer defaults |
| Personendatei 2.xlsx | Genealogical person register: 163 people, sheet Tabelle1 (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| zzfamilienarchiv Walter und Eugenie 2025-04-10.ods | A small, already-normalized subset (Walter & Eugenie brautbriefe). 14 clean columns incl. ISO dates. | ✅ this is what MassImportService was built for |
The PDFs (~7,000) will follow later. The importer matches files by the Index column
(e.g. W-0001 → W-0001.pdf), and already imports metadata-only when a file is missing —
so we can import all metadata now and the PDFs will attach on a re-run.
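
The matching rule above can be sketched in a few lines of Python. This is an illustration only, not the Java importer's code: `find_pdf`, `plan_import` and the flat `pdf_dir` layout are assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, List, Optional


def find_pdf(index: str, pdf_dir: Path) -> Optional[Path]:
    """Return the PDF for an Index value, or None if it has not arrived yet."""
    candidate = pdf_dir / f"{index.strip()}.pdf"  # e.g. W-0001 -> W-0001.pdf
    return candidate if candidate.is_file() else None


def plan_import(indexes: List[str], pdf_dir: Path) -> Dict[str, Optional[Path]]:
    """Map every Index to its PDF; None means a metadata-only import for now."""
    return {index: find_pdf(index, pdf_dir) for index in indexes}
```

Running `plan_import` again after more PDFs have been synced turns the `None` entries into paths, which mirrors the "attach on a re-run" behaviour described above.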
How to inspect the spreadsheets
openpyxl is installed in the OCR service venv:
/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
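
For a first look at a sheet, something like the following may be enough. It is a minimal sketch: `read_sheet` and `preview_rows` are hypothetical helper names, and the file path in the usage note is an assumption about where the spreadsheets live. openpyxl reads the .xlsx files only, so the .ods subset needs a different reader.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple

Row = Tuple[Any, ...]


def preview_rows(rows: Iterable[Row], limit: int = 5) -> List[Row]:
    """Return the first `limit` rows that contain at least one non-empty cell."""
    non_empty = (row for row in rows if any(cell not in (None, "") for cell in row))
    return list(islice(non_empty, limit))


def read_sheet(path: str, sheet_name: str) -> Iterator[Row]:
    """Yield rows of plain cell values from one sheet of an .xlsx file."""
    import openpyxl  # imported lazily so preview_rows can be used without it

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        yield from workbook[sheet_name].iter_rows(values_only=True)
    finally:
        workbook.close()
```

Usage, with an assumed path: `preview_rows(read_sheet("import/Personendatei 2.xlsx", "Tabelle1"))`.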
Documents in this folder
- 01-findings-spreadsheet-analysis.md: full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID IMP-NN.
- 02-normalization-spec.md: requirements spec for the offline import normalizer (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements FR-*/NFR-*, traceable to the IMP-NN findings.
- WORKLOG.md: running log of what each session did and what's next. Start here when resuming.
Strategy (decided 2026-05-25)
Normalize before import. A standalone Python tool (tools/import-normalizer/, not yet
built) transforms the raw xlsx + person register into a clean canonical dataset
(canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs. Residual cases
(unparseable dates, unmatched names) are fixed via a version-controlled overrides file and
re-run. The Java importer is adjusted to consume the canonical contract in a later Phase 2.
See the spec for the full contract.
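
The override-and-re-run loop could look roughly like this. The real overrides format is defined in the spec, not here: the keyed-by-Index mapping, the field names (`index`, `date_raw`, `date_iso`) and the sample values below are all made up for illustration.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def apply_overrides(rows: List[Record], overrides: Dict[str, Record]) -> List[Record]:
    """Return copies of the rows with per-Index manual fixes merged on top."""
    return [{**row, **overrides.get(row["index"], {})} for row in rows]


def needs_review(rows: List[Record]) -> List[Record]:
    """Rows that still lack a normalized date and belong in a review CSV."""
    return [row for row in rows if row.get("date_iso") is None]


# Hypothetical data: one date the parser could not read, fixed by an override.
rows = [
    {"index": "W-0001", "date_raw": "im Mai 1912", "date_iso": None},
    {"index": "W-0002", "date_raw": "1912-06-03", "date_iso": "1912-06-03"},
]
overrides = {"W-0001": {"date_iso": "1912-05-01"}}
remaining = needs_review(apply_overrides(rows, overrides))
```

Because the overrides are merged on every run rather than edited into the raw sheets, the source spreadsheets stay untouched and each manual fix remains visible in version control.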
The canonical artifacts themselves (the out/ files) are produced locally and not
version-controlled — they contain real family PII. They are synced onto the ops host's
IMPORT_HOST_DIR alongside the PDFs, out-of-band. The contract is the header schema in
02-normalization-spec.md §6, not any particular file in out/. See ADR-025 for the full
rationale.
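
To make "the contract is the header schema" concrete, here is a Python sketch of a reader that fails closed when a required header is missing. It is not the Java CanonicalSheetReader, and the header names in `REQUIRED_HEADERS` are placeholders: the authoritative list is the one in 02-normalization-spec.md §6.

```python
from typing import Dict, Sequence

# Placeholder header names for illustration; see the spec for the real schema.
REQUIRED_HEADERS = ("index", "date_iso", "sender", "receiver")


class MissingHeaderError(ValueError):
    """Raised when a sheet does not carry every required header."""


def map_headers(header_row: Sequence[object],
                required: Sequence[str] = REQUIRED_HEADERS) -> Dict[str, int]:
    """Map each required header to its column position, or refuse the sheet."""
    positions = {
        str(name).strip(): column
        for column, name in enumerate(header_row)
        if name is not None
    }
    missing = [name for name in required if name not in positions]
    if missing:
        # Fail closed: never guess a column when the contract is not met.
        raise MissingHeaderError("missing required header(s): " + ", ".join(missing))
    return {name: positions[name] for name in required}
```

Looking columns up by name rather than by position means extra or reordered columns are harmless, while a renamed or dropped header stops the import before any row is read.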
Status board
| ID | Issue | Severity | Status |
|---|---|---|---|
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (Personendatei 2.xlsx) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | x-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare u/u. | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |
See the findings doc for detail and proposed approach per issue.
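
Checks like IMP-06 and IMP-07 are easy to reproduce against the Index column. The helper below is a hypothetical sketch, not the method used for the findings: it does not tell real data rows apart from the section/title rows of IMP-08, so its blank count would include both.

```python
from collections import Counter
from typing import Any, Iterable, List, Tuple


def audit_index_column(values: Iterable[Any]) -> Tuple[List[int], List[str]]:
    """Return (1-based positions of blank Index cells, sorted duplicate Index values)."""
    cleaned = ["" if value is None else str(value).strip() for value in values]
    blank_positions = [pos for pos, value in enumerate(cleaned, start=1) if not value]
    counts = Counter(value for value in cleaned if value)
    duplicates = sorted(value for value, seen in counts.items() if seen > 1)
    return blank_positions, duplicates
```

Positions are counted from the first value passed in, so add the header offset when translating them back to spreadsheet row numbers.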