# Import Migration — Working Folder

This folder tracks the iterative work of mass-importing the **real, raw family archive**
spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.

It is intentionally **local docs, not Gitea issues**. We only open a Gitea issue when a
finding requires a *software* change (e.g. a new date parser). Pure data observations and
the running plan live here so any agent can pick the work up cold.

## Source files (in `/import`)

| File | What it is | Importer support today |
| --- | --- | --- |
| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The **real raw archive**: 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does **not** match importer defaults |
| `Personendatei 2.xlsx` | Genealogical **person register**: 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, **already-normalized** subset (Walter & Eugenie *Brautbriefe*, their engagement letters). 14 clean columns incl. ISO dates. | ✅ this is what `MassImportService` was built for |

The PDFs (~7,000) will follow later. The importer matches files by the **Index** column
(e.g. `W-0001` → `W-0001.pdf`) and already imports metadata-only when a file is missing,
so we can import all metadata now and the PDFs will attach on a re-run.

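
The Index-to-file matching described above can be sketched in Python to predict which rows would import as metadata-only before the PDFs arrive. This is a minimal sketch, not the importer's code: the function names are hypothetical, and it assumes an exact, case-sensitive `<Index>.pdf` filename in a single flat directory.

```python
from pathlib import Path


def expected_pdf_name(index: str) -> str:
    """Map an Index value such as 'W-0001' to the PDF name 'W-0001.pdf'."""
    return f"{index.strip()}.pdf"


def split_by_pdf_presence(indexes, pdf_dir):
    """Return (with_pdf, metadata_only) lists for the given Index values.

    Assumption: PDFs sit directly in pdf_dir and are named exactly <Index>.pdf.
    """
    pdf_dir = Path(pdf_dir)
    with_pdf, metadata_only = [], []
    for index in indexes:
        target = pdf_dir / expected_pdf_name(index)
        # Rows whose PDF is not there yet would be imported as metadata-only.
        (with_pdf if target.is_file() else metadata_only).append(index)
    return with_pdf, metadata_only
```

Running this against the Index column before and after a PDF delivery gives a quick count of how many documents a re-run should attach.
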
## How to inspect the spreadsheets

`openpyxl` is installed in the OCR service venv:

```bash
/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
```

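
With that interpreter, a first look at a workbook could go roughly as follows. This is a sketch only: `preview_rows` and `inspect_workbook` are hypothetical helper names, and in read-only mode `max_row` / `max_column` can be `None` when the file does not record its dimensions.

```python
def preview_rows(rows, limit=3):
    """Return up to `limit` rows, with empty cells (None) shown as ''."""
    preview = []
    for row in rows:
        if len(preview) >= limit:
            break
        preview.append(["" if cell is None else str(cell) for cell in row])
    return preview


def inspect_workbook(path, limit=3):
    """Print each sheet's title, reported dimensions, and first rows."""
    import openpyxl  # imported lazily; installed in the OCR service venv

    # read_only avoids loading the whole sheet; data_only returns cached
    # cell values instead of formulas.
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        for ws in wb.worksheets:
            print(ws.title, ws.max_row, ws.max_column)
            for row in preview_rows(ws.iter_rows(values_only=True), limit):
                print("  ", row)
    finally:
        wb.close()
```

Note that `openpyxl` reads the `.xlsx` files only; the `.ods` subset needs a different reader.
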
## Documents in this folder

- [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md): full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID `IMP-NN`.
- [`02-normalization-spec.md`](./02-normalization-spec.md): requirements spec for the offline **import normalizer** (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements `FR-*`/`NFR-*`, traceable to the `IMP-NN` findings.
- `WORKLOG.md`: running log of what each session did and what's next. **Start here when resuming.**

## Strategy (decided 2026-05-25)

Normalize **before** import. A standalone Python tool (`tools/import-normalizer/`, not yet
built) transforms the raw xlsx + person register into a clean canonical dataset
(`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs. Residual cases
(unparseable dates, unmatched names) are fixed via a version-controlled overrides file,
after which the normalizer is re-run. The Java importer is adjusted to consume the
canonical contract in a later **Phase 2**. See the spec for the full contract.

The canonical artifacts themselves (the `out/` files) are **produced locally and not
version-controlled** because they contain real family PII. They are synced onto the ops
host's `IMPORT_HOST_DIR` alongside the PDFs, out-of-band. The contract is the header schema
in `02-normalization-spec.md` §6, not any particular file in `out/`. See ADR-025 for the
full rationale.

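
Because the contract is the header schema and not the file contents, a consumer can enforce it by refusing to read a sheet whose header lacks a required column. The sketch below illustrates that idea in Python; it is not the Java importer's implementation, and the column names are placeholders (the real list is in `02-normalization-spec.md` §6).

```python
class HeaderContractError(ValueError):
    """Raised when a canonical sheet does not carry the agreed header."""


# Hypothetical column names for illustration only; see the spec, §6.
REQUIRED_DOCUMENT_COLUMNS = ("index", "date_iso", "sender", "receiver")


def check_header(header_row, required=REQUIRED_DOCUMENT_COLUMNS):
    """Return a {column name: position} map, or raise if a column is missing.

    Extra columns are tolerated; a missing required column stops the read
    before any data row is touched.
    """
    positions = {}
    for pos, cell in enumerate(header_row):
        if cell is not None and str(cell).strip():
            # Keep the first occurrence if a name appears twice.
            positions.setdefault(str(cell).strip(), pos)
    missing = [name for name in required if name not in positions]
    if missing:
        raise HeaderContractError(f"missing required columns: {missing}")
    return positions
```

Looking columns up by name through the returned map, instead of by fixed position, means a reordered sheet still reads correctly while a renamed or dropped column fails loudly.
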
## Status board

| ID | Issue | Severity | Status |
| --- | --- | --- | --- |
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |

See the findings doc for detail and proposed approach per issue.

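
Counts like those in IMP-06 and IMP-07 can be re-checked after each overrides pass with a small audit of the Index column. This is a rough sketch with a hypothetical function name. It counts every blank cell, so section/title rows (IMP-08) would need to be filtered out first to isolate blank *data* rows.

```python
from collections import Counter


def audit_index_column(values):
    """Return (blank_count, duplicates) for a sequence of Index cells.

    Cells are compared after trimming whitespace; None counts as blank.
    """
    cleaned = ["" if v is None else str(v).strip() for v in values]
    blank_count = sum(1 for v in cleaned if not v)
    counts = Counter(v for v in cleaned if v)
    # Each Index value seen more than once is reported a single time.
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return blank_count, duplicates
```

For example, `audit_index_column(["W-0001", None, "W-0001"])` reports one blank cell and `W-0001` as a duplicate.
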