# Import Migration: Working Folder

This folder tracks the iterative work of mass-importing the **real, raw family archive**
spreadsheets (≈7,600 letter rows + ~7,000 PDFs that arrive later) into Familienarchiv.

It is intentionally **local docs, not Gitea issues**. We only open a Gitea issue when a
finding requires a *software* change (e.g. a new date parser). Pure data observations and
the running plan live here so any agent can pick the work up cold.

## Source files (in `/import`)

| File | What it is | Importer support today |
| --- | --- | --- |
| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The **real raw archive**: 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does **not** match importer defaults |
| `Personendatei 2.xlsx` | Genealogical **person register**: 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, **already-normalized** subset (Walter & Eugenie Brautbriefe). 14 clean columns, including ISO dates. | ✅ this is what `MassImportService` was built for |

The PDFs (~7,000) will follow later. The importer matches files by the **Index** column
(e.g. `W-0001` → `W-0001.pdf`) and already imports metadata-only when a file is missing,
so we can import all metadata now and the PDFs will attach on a re-run.

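
The Index-to-file matching described above can be sketched in a few lines. This is not the Java importer's code, only an illustration of the rule `W-0001` → `W-0001.pdf`; the function names `pdf_for_index` and `plan_import` and the idea of a single flat PDF directory are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Tuple


def pdf_for_index(index: str, pdf_dir: Path) -> Optional[Path]:
    """Return the PDF expected for an Index value, or None if it has not arrived yet."""
    candidate = pdf_dir / f"{index.strip()}.pdf"
    return candidate if candidate.is_file() else None


def plan_import(indexes: Iterable[str], pdf_dir: Path) -> Tuple[List[str], List[str]]:
    """Split Index values into rows that get a file now and metadata-only rows."""
    with_file: List[str] = []
    metadata_only: List[str] = []
    for index in indexes:
        if pdf_for_index(index, pdf_dir) is not None:
            with_file.append(index)
        else:
            # No PDF yet: the row is still imported, the file can attach on a re-run.
            metadata_only.append(index)
    return with_file, metadata_only
```

Running `plan_import` again after more PDFs have been dropped into the directory moves the corresponding Index values from the second list to the first, which mirrors the re-run behaviour described above.
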
## How to inspect the spreadsheets

`openpyxl` is installed in the OCR service venv:

```bash
/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
```

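
With that interpreter, a first look at the raw sheet might go along these lines. It is a minimal sketch: `load_rows` and `summarize_index_column` are hypothetical helpers, and `index_col=0` is an assumed position for the Index column that has to be checked against the real layout. Header and section rows are not filtered out, so the counts are only a rough orientation.

```python
from collections import Counter


def load_rows(path, sheet_name):
    """Read all rows of one sheet as tuples of cell values (needs openpyxl)."""
    import openpyxl  # imported lazily so the summary below also works without it

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        return list(workbook[sheet_name].iter_rows(values_only=True))
    finally:
        workbook.close()


def summarize_index_column(rows, index_col=0):
    """Count blank and duplicated Index values; index_col is an assumed position."""
    values = []
    blank_index = 0
    for row in rows:
        cell = row[index_col] if index_col < len(row) else None
        text = "" if cell is None else str(cell).strip()
        if text:
            values.append(text)
        elif any(c not in (None, "") for c in row):
            blank_index += 1  # the row carries data but has no Index
    duplicates = {v: n for v, n in Counter(values).items() if n > 1}
    return {"rows": len(rows), "blank_index": blank_index, "duplicates": duplicates}
```

For example, `summarize_index_column(load_rows("/import/zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx", "Familienarchiv"))` would give a quick view of blank and repeated Index values in the raw archive.
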
## Documents in this folder

- [`01-findings-spreadsheet-analysis.md`](./01-findings-spreadsheet-analysis.md): full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID `IMP-NN`.
- [`02-normalization-spec.md`](./02-normalization-spec.md): requirements spec for the offline **import normalizer** (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements `FR-*`/`NFR-*`, traceable to the `IMP-NN` findings.
- `WORKLOG.md`: running log of what each session did and what's next. **Start here when resuming.**

## Strategy (decided 2026-05-25)

Normalize **before** import. A standalone Python tool (`tools/import-normalizer/`, not yet
built) transforms the raw xlsx + person register into a clean canonical dataset
(`canonical-documents.xlsx`, `canonical-persons.xlsx`) plus review CSVs. Residual cases
(unparseable dates, unmatched names) are fixed via a version-controlled overrides file, and
the tool is re-run. The Java importer will be adjusted to consume the canonical contract in a
later **Phase 2**. See the spec for the full contract.

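
The normalize, review, override, re-run loop could look roughly like this for dates. The tool is not built yet, so everything here is an assumption for illustration: the two input formats in `KNOWN_FORMATS`, overrides keyed by Index with ISO date values, and the function names. The spec remains the source of truth for the actual contract.

```python
import datetime as dt
from typing import Optional

# Assumed input formats for illustration only; the real list belongs in the spec.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def parse_date(raw: str) -> Optional[dt.date]:
    """Try each known format in turn; return None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return dt.datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_dates(rows, overrides):
    """rows: iterable of (index, raw_date); overrides: {index: ISO date string}."""
    canonical, review = [], []
    for index, raw in rows:
        if index in overrides:
            # A human decision from the overrides file wins over the parser.
            parsed = dt.date.fromisoformat(overrides[index])
        else:
            parsed = parse_date(raw or "")
        if parsed is None:
            review.append((index, raw))  # would end up in a review CSV
        else:
            canonical.append((index, parsed.isoformat()))
    return canonical, review
```

On a first run, rows the parser cannot read land in `review`. Once an entry for such a row is added to the overrides mapping, a second run moves it into `canonical` without touching the raw spreadsheet.
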
## Status board

| ID | Issue | Severity | Status |
| --- | --- | --- | --- |
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |

See the findings doc for detail and proposed approach per issue.

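
As one concrete illustration of a finding, the multi-receiver splitting behind IMP-11 might be approached like this. The separator set is an assumption (`;`, `&`, `und`, plus the bare `u` / `u.` named in the table), and `split_receivers` is a hypothetical helper; the findings doc lists what actually occurs in the data. Commas are deliberately not treated as separators here, since they may also appear inside a single name.

```python
import re

# Assumed separators: ";", "&", "und", and the bare "u" / "u." from IMP-11.
# The word separators must be surrounded by whitespace, so names such as
# "Ute" or "Gudrun" are not cut apart.
_SEPARATOR = re.compile(r"\s*[;&]\s*|\s+(?:und|u\.?)\s+")


def split_receivers(raw):
    """Split a receiver cell into individual names, dropping empty pieces."""
    parts = _SEPARATOR.split(raw or "")
    return [part.strip() for part in parts if part.strip()]
```

For instance, `split_receivers("Walter u. Eugenie")` yields the two names separately, while a cell with a single name comes back as a one-element list.
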