Import Migration — Working Folder
This folder tracks the iterative work of mass-importing the real, raw family archive spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.
These notes are intentionally kept as local docs rather than Gitea issues. We open a Gitea issue only when a finding requires a software change (e.g. a new date parser). Pure data observations and the running plan live here so any agent can pick the work up cold.
Source files (in /import)
| File | What it is | Importer support today |
|---|---|---|
| zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx | The real raw archive: 7,943 rows, sheet Familienarchiv. Human-readable, dates as written in the letters. | ❌ layout does not match importer defaults |
| Personendatei 2.xlsx | Genealogical person register: 163 people, sheet Tabelle1 (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| zzfamilienarchiv Walter und Eugenie 2025-04-10.ods | A small, already-normalized subset (Walter & Eugenie brautbriefe). 14 clean columns incl. ISO dates. | ✅ this is what MassImportService was built for |
The PDFs (~7,000) will follow later. The importer matches files by the Index column
(e.g. W-0001 → W-0001.pdf), and already imports metadata-only when a file is missing —
so we can import all metadata now and the PDFs will attach on a re-run.
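
The matching rule above can be illustrated with a short sketch. It is not the importer's actual code: the helper name `match_pdfs` and the assumption that all PDFs sit in one flat directory are hypothetical; only the `W-0001` to `W-0001.pdf` convention and the metadata-only fallback come from this document.

```python
from pathlib import Path


def match_pdfs(index_values, pdf_dir):
    """Map each Index value to its PDF path, or None if the file is absent.

    Hypothetical helper: assumes every PDF lies directly in pdf_dir and is
    named exactly <Index>.pdf (e.g. W-0001 -> W-0001.pdf).
    """
    pdf_dir = Path(pdf_dir)
    matches = {}
    for index in index_values:
        candidate = pdf_dir / f"{index}.pdf"
        # A missing file is not an error: the row would be imported
        # metadata-only, and the PDF can attach on a later re-run.
        matches[index] = candidate if candidate.is_file() else None
    return matches
```

Rows that map to `None` are the ones expected to pick up their PDF once the files arrive and the import is run again.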
How to inspect the spreadsheets
openpyxl is installed in the OCR service venv:
/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
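
For a first look at a sheet, something like the following may be enough. It is a rough sketch: the function names are made up for illustration, and `index_col=0` (Index in the first column) is an assumption that has to be checked against the real layout. `load_workbook(..., read_only=True)` and `iter_rows(values_only=True)` are standard openpyxl calls.

```python
def summarize_rows(rows, index_col=0):
    """Return (header, data_row_count, blank_index_count) for row tuples.

    index_col is an assumption about where the Index column sits.
    """
    rows = iter(rows)
    header = next(rows, None)
    data_rows = 0
    blank_index = 0
    for row in rows:
        # Skip rows in which every cell is empty.
        if not any(cell not in (None, "") for cell in row):
            continue
        data_rows += 1
        if len(row) <= index_col or row[index_col] in (None, ""):
            blank_index += 1
    return header, data_rows, blank_index


def summarize_sheet(path, sheet_name, index_col=0):
    """Open one sheet in read-only mode and summarize its rows."""
    # Imported lazily so summarize_rows stays usable without openpyxl.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        return summarize_rows(ws.iter_rows(values_only=True), index_col)
    finally:
        wb.close()
```

Run with the venv interpreter shown above, `summarize_sheet(path, "Familienarchiv")` would report the header row plus counts for the raw archive; the path argument is whatever location the xlsx has on the local machine.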
Documents in this folder
- 01-findings-spreadsheet-analysis.md: full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID IMP-NN.
- 02-normalization-spec.md: requirements spec for the offline import normalizer (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements FR-*/NFR-*, traceable to the IMP-NN findings.
- WORKLOG.md: running log of what each session did and what's next. Start here when resuming.
Strategy (decided 2026-05-25)
Normalize before import. A standalone Python tool (tools/import-normalizer/, not yet
built) transforms the raw xlsx + person register into a clean canonical dataset
(canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs. Residual cases
(unparseable dates, unmatched names) are corrected in a version-controlled overrides file, and
the tool is re-run. The Java importer will be adjusted to consume the canonical contract later, in Phase 2.
See the spec for the full contract.
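
The override-and-re-run loop might look roughly like this. Everything concrete here is hypothetical, since the tool is not yet built: the CSV columns (`index`, `field`, `value`), the record field `date_iso`, and both function names are placeholders, and the spec remains the authority on the real contract.

```python
import csv
import io


def load_overrides(csv_text):
    """Parse overrides from CSV text with columns index, field, value.

    Hypothetical format; the real overrides file is defined by the spec.
    """
    overrides = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        overrides.setdefault(row["index"], {})[row["field"]] = row["value"]
    return overrides


def apply_overrides(records, overrides):
    """Patch records by Index and split them into resolved / needs review."""
    resolved, needs_review = [], []
    for record in records:
        patched = {**record, **overrides.get(record["index"], {})}
        # Assumption: a record counts as resolved once it has an ISO date.
        if patched.get("date_iso"):
            resolved.append(patched)
        else:
            needs_review.append(patched)
    return resolved, needs_review
```

Each run would write the `needs_review` rows to a review CSV; adding a line to the overrides file and re-running moves a row into `resolved` without touching the raw sheets.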
Status board
| ID | Issue | Severity | Status |
|---|---|---|---|
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (Personendatei 2.xlsx) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | x-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare u/u. | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |
See the findings doc for detail and proposed approach per issue.
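
As an illustration of how the Index-related findings (IMP-06, IMP-07, IMP-10) could be re-checked against a column of raw values, here is a small sketch. The function name is hypothetical, and treating "x-suffix" as a literal trailing `x` on the Index (e.g. `W-0002x`) is an assumption; the findings doc has the authoritative definitions.

```python
from collections import Counter


def audit_index(index_values):
    """Classify raw Index cells: blank, duplicated, and x-suffixed."""
    cleaned = ["" if v is None else str(v).strip() for v in index_values]
    # Positions (0-based, within the input list) whose Index is empty.
    blank_rows = [i for i, v in enumerate(cleaned) if not v]
    counts = Counter(v for v in cleaned if v)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    # Assumption: an x-suffix row is one whose Index ends in "x" or "X".
    x_suffixed = sorted({v for v in cleaned if v.lower().endswith("x")})
    return {
        "blank_rows": blank_rows,
        "duplicates": duplicates,
        "x_suffixed": x_suffixed,
    }
```

Section/title rows (IMP-08) would also show up under `blank_rows` here, so the two need to be told apart before comparing against the counts on the status board.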