marcel/familienarchiv

Marcel 97db718f81

docs(import): add unresolved-names plan + worklog entry

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 16:01:18 +02:00

README.md

Import Migration — Working Folder

This folder tracks the iterative work of mass-importing the real, raw family archive spreadsheets (≈7,600 letter rows, plus ≈7,000 PDFs that will arrive later) into Familienarchiv.

These notes are intentionally kept as local docs rather than Gitea issues. We open a Gitea issue only when a finding requires a software change (e.g. a new date parser). Pure data observations and the running plan live here so that any agent can pick up the work cold.

Source files (in `/import`)

| File | What it is | Importer support today |
| --- | --- | --- |
| `zzfamilienarchiv aktuell 2 - Kopie 2025-07-05.xlsx` | The real raw archive: 7,943 rows, sheet `Familienarchiv`. Human-readable, dates as written in the letters. | ❌ layout does not match importer defaults |
| `Personendatei 2.xlsx` | Genealogical person register: 163 people, sheet `Tabelle1` (maiden names, birth/death, marriages, relationships). | ❌ no importer at all |
| `zzfamilienarchiv Walter und Eugenie 2025-04-10.ods` | A small, already-normalized subset (Walter & Eugenie brautbriefe). 14 clean columns incl. ISO dates. | ✅ this is what `MassImportService` was built for |

The PDFs (~7,000) will follow later. The importer matches files by the Index column (e.g. W-0001 → W-0001.pdf) and already falls back to a metadata-only import when a file is missing, so we can import all metadata now and the PDFs will attach on a re-run.
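The Index-to-file matching described above can be illustrated with a small Python sketch. This is not the code of `MassImportService`; the function names `find_pdf` and `plan_import` and the flat PDF directory are assumptions made only for illustration.

```python
from pathlib import Path
from typing import Optional


def find_pdf(index: str, pdf_dir: Path) -> Optional[Path]:
    """Return the PDF whose name matches the Index value, or None if absent."""
    # Assumption: PDFs sit directly in pdf_dir and are named "<Index>.pdf".
    candidate = pdf_dir / f"{index.strip()}.pdf"
    return candidate if candidate.is_file() else None


def plan_import(indexes: list[str], pdf_dir: Path) -> dict[str, Optional[Path]]:
    """Map every Index to its PDF; None marks a metadata-only row."""
    return {index: find_pdf(index, pdf_dir) for index in indexes}
```

Rows mapped to `None` would be imported as metadata only. Running the same plan again once the PDFs are present would then resolve them to real paths.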

How to inspect the spreadsheets

openpyxl is installed in the OCR service venv:

/home/marcel/Desktop/familienarchiv/ocr-service/.venv/bin/python3 -c "import openpyxl; print(openpyxl.__version__)"
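For a first look at a sheet, something like the following may be enough. It is a minimal sketch: `load_rows` and `summarize` are hypothetical helper names, and the position of the Index column (`index_col=0`) is an assumption that should be checked against the real header row.

```python
from typing import Iterable, Iterator, Sequence


def load_rows(path: str, sheet: str) -> Iterator[tuple]:
    """Yield each row of one sheet as a tuple of cell values (needs openpyxl)."""
    import openpyxl  # imported lazily so summarize() works without it

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    yield from workbook[sheet].iter_rows(values_only=True)


def _is_blank(cell) -> bool:
    """Treat None and whitespace-only strings as blank."""
    return cell is None or str(cell).strip() == ""


def summarize(rows: Iterable[Sequence], index_col: int = 0) -> dict:
    """Count all rows, fully empty rows, and rows with content but no Index."""
    total = empty = blank_index = 0
    for row in rows:
        total += 1
        if all(_is_blank(cell) for cell in row):
            empty += 1
        elif _is_blank(row[index_col]):  # index_col is an assumed position
            blank_index += 1
    return {"total": total, "empty": empty, "blank_index": blank_index}
```

With the venv interpreter above, `summarize(load_rows(path, "Familienarchiv"))` would give a quick count for the raw archive sheet, where `path` points at the xlsx in `/import`.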

Documents in this folder

01-findings-spreadsheet-analysis.md — full analysis of every data-quality / importer issue found (2026-05-25). Each issue has an ID IMP-NN.
02-normalization-spec.md — requirements spec for the offline import normalizer (the agreed strategy: normalize the raw sheets into a clean canonical dataset before import). Requirements FR-*/NFR-*, traceable to the IMP-NN findings.
WORKLOG.md — running log of what each session did and what's next. Start here when resuming.

Strategy (decided 2026-05-25)

Normalize before import. A standalone Python tool (tools/import-normalizer/, not yet built) transforms the raw xlsx + person register into a clean canonical dataset (canonical-documents.xlsx, canonical-persons.xlsx) plus review CSVs. Residual cases (unparseable dates, unmatched names) are fixed via a version-controlled overrides file and re-run. The Java importer is adjusted to consume the canonical contract in a later Phase 2. See the spec for the full contract.
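The normalize, review, override, re-run loop could look roughly like this. It is only a sketch under stated assumptions: the override format (a mapping from Index to ISO date), the names `parse_date` and `normalize_dates`, and the single supported date form `D.M.YYYY` are all hypothetical. The actual contract is defined in the spec.

```python
import re
from datetime import date
from typing import Optional

# Assumption: only the simple form "D.M.YYYY" is parsed here.
_DMY = re.compile(r"^\s*(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})\s*$")


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Parse 'D.M.YYYY'; return None for anything else, incl. invalid days."""
    match = _DMY.match(raw or "")
    if not match:
        return None
    day, month, year = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31.2.1914
        return None


def normalize_dates(rows: dict[str, str], overrides: dict[str, str]):
    """Return (resolved, review): ISO dates by Index, and Indexes to review."""
    resolved: dict[str, str] = {}
    review: list[str] = []
    for index, raw in rows.items():
        if index in overrides:  # a recorded human decision always wins
            resolved[index] = overrides[index]
            continue
        parsed = parse_date(raw)
        if parsed is None:
            review.append(index)  # would be written to a review CSV
        else:
            resolved[index] = parsed.isoformat()
    return resolved, review
```

Each run is deterministic: adding an entry to the overrides and re-running moves that Index from the review list to the resolved set without touching the raw sheet.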

Status board

| ID | Issue | Severity | Status |
| --- | --- | --- | --- |
| IMP-01 | New xlsx column layout ≠ importer defaults | 🔴 blocker | open |
| IMP-02 | 90% of dates are free-text the parser can't read | 🔴 blocker | open |
| IMP-03 | No ISO/normalized date column in the new xlsx | 🔴 blocker | open |
| IMP-04 | Person register (`Personendatei 2.xlsx`) not imported | 🟠 major | open |
| IMP-05 | Name variations = duplicate Persons (maiden vs married) | 🟠 major | open |
| IMP-06 | 93 data rows with blank Index are silently dropped | 🟠 major | open |
| IMP-07 | 43 duplicate Index values | 🟡 minor | open |
| IMP-08 | Section/title rows interleaved in data | 🟡 minor | open |
| IMP-09 | Index↔Datei filename mismatches | 🟡 minor | open |
| IMP-10 | `x`-suffix rows (letter backsides/enclosures) | 🟡 minor | open |
| IMP-11 | Multi-receiver separators incl. bare `u`/`u.` | 🟡 minor | open |
| IMP-12 | Importer reads only the first sheet, no validation | 🟡 minor | open |

See the findings doc for detail and proposed approach per issue.
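As an illustration of what IMP-11 involves, a receiver cell could be split along these lines. This is a sketch, not the approach proposed in the findings doc: only the bare `u`/`u.` forms are named in the status board, so the rest of the separator set (`und`, `&`, `,`, `;`) and the name `split_receivers` are assumptions.

```python
import re

# Assumed separator set: "und", bare "u" / "u.", "&", "," and ";".
# Only the bare "u"/"u." forms are named in IMP-11; the rest are guesses.
# "und"/"u"/"u." must be surrounded by whitespace so that names which
# merely contain the letter "u" are left intact.
_SEPARATOR = re.compile(r"\s*[,;&]\s*|\s+(?:und|u\.?)\s+")


def split_receivers(raw: str) -> list[str]:
    """Split a receiver cell into individual names, dropping empty pieces."""
    return [part.strip() for part in _SEPARATOR.split(raw or "") if part.strip()]
```

For example, `split_receivers("Walter u. Eugenie")` yields `["Walter", "Eugenie"]`. Cells written as "surname, first name" would be split incorrectly by the comma rule, so real data would need a check for that pattern first.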
