Marcel adfff420a5 docs(import): add import-migration analysis + normalizer spec

Document the raw archive spreadsheet findings (IMP-01..12) and a
requirements spec for an offline normalizer that produces a clean
canonical dataset before import. Local docs only; no Gitea issue yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 12:32:37 +02:00

Import Migration — Worklog

Running log of each working session. Resume here. Newest entry on top.

2026-05-25 (session 2) — Strategy + normalizer spec

Did:

Decided strategy with Marcel: normalize the raw sheets first, then import (higher leverage than making the Java importer tolerate every mess).
Locked design decisions (see spec §3): new canonical layout; dates = parsed + raw + precision; include person register + dedup in this effort; overrides-file + re-run loop; Python tool at tools/import-normalizer/.
Century rule fixed by Marcel: archive spans 1873–1957; 2-digit 00–57→19YY, 73–99→18YY, 58–72→flag; 3-digit→1DDD; never 20xx.
Wrote 02-normalization-spec.md in the requirements-engineer persona (FR/NFR, Given-When-Then ACs, traceability to IMP-NN, TBD register).
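The century rule above can be sketched in a few lines of Python. This is only an illustration, not the normalizer's actual code: the function name `expand_year` and the `(year, needs_review)` return shape are hypothetical, and the handling of 4-digit and 1-digit tokens is an assumption, since the rule as logged only covers 2- and 3-digit years.

```python
from typing import Optional, Tuple

ARCHIVE_FIRST, ARCHIVE_LAST = 1873, 1957  # archive span stated in the worklog


def expand_year(token: str) -> Tuple[Optional[int], bool]:
    """Return (year, needs_review) for a raw year token (hypothetical helper)."""
    digits = token.strip()
    if not (digits.isascii() and digits.isdigit()):
        return None, True
    n = int(digits)
    if len(digits) == 2:
        if n <= 57:
            return 1900 + n, False   # 00-57 -> 19YY
        if n >= 73:
            return 1800 + n, False   # 73-99 -> 18YY
        return None, True            # 58-72 is ambiguous -> flag
    if len(digits) == 3:
        return 1000 + n, False       # 3-digit -> 1DDD
    if len(digits) == 4:
        # Assumption: pass 4-digit years through, but flag anything outside
        # the archive span so a 20xx value is never emitted.
        if ARCHIVE_FIRST <= n <= ARCHIVE_LAST:
            return n, False
        return None, True
    return None, True                # assumption: other lengths get flagged
```

Flagged tokens return no year at all, so the raw value kept alongside stays the only source of truth until an override resolves it.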

All 6 open questions resolved (spec §9). OQ-01: movable feasts (Ostern, Pfingsten, …) are computed per year from Easter, never mapped to a fixed month; seasons map to a mid-season month (Sommer=Jul, Herbst=Oct). OQ-02: ranges → start + RANGE. OQ-03: slug ids. OQ-04: x-suffix rows are skipped and logged in this pass (they are transcriptions of the base letter, not yet mappable). OQ-05: .xlsx. OQ-06: conservative, no silent merge.
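A rough sketch of the OQ-01 resolution, assuming the Gregorian calendar and using the well-known anonymous Gregorian (Meeus/Jones/Butcher) computus. The names `easter_sunday`, `resolve_named_date`, `FEAST_OFFSETS` and `SEASON_MONTH` and the precision labels are hypothetical. Only the feasts and seasons named in this log are included; the spec may define more.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Days after Easter Sunday; Pfingsten (Pentecost) falls 49 days later.
FEAST_OFFSETS = {"Ostern": 0, "Pfingsten": 49}
# Mid-season months as given in the worklog; other seasons are left to the spec.
SEASON_MONTH = {"Sommer": 7, "Herbst": 10}


def resolve_named_date(name: str, year: int) -> Optional[Tuple[date, str]]:
    """Map a feast or season name to (date, precision); None if unknown."""
    if name in FEAST_OFFSETS:
        return easter_sunday(year) + timedelta(days=FEAST_OFFSETS[name]), "DAY"
    if name in SEASON_MONTH:
        # Assumption: the first of the month stands in for a month-level date.
        return date(year, SEASON_MONTH[name], 1), "SEASON"
    return None
```

For example, `resolve_named_date("Pfingsten", 1900)` yields 3 June 1900, since Easter Sunday fell on 15 April that year.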

Git: moved off the unrelated feat/issue-356-… branch; pulled main; created clean branch docs/import-migration and committed these docs there. (The dirty .venv pycache + skills/implement/SKILL.md in the tree are pre-existing/environmental noise — left uncommitted, not ours.)

Next:

Marcel reviews the spec.
Then writing-plans → build the normalizer at tools/import-normalizer/ (backlog B1–B7 are the Musts; B3 date parser incl. Easter computus is the big one).

2026-05-25 (session 1) — Initial analysis

Did:

Got the real raw archive xlsx (7,943 rows) + person register (163 people). PDFs to follow.
Compared the new xlsx layout against MassImportService defaults and the old ODS.
Full statistical scan of all rows: dates, indices, senders/receivers, file column.
Wrote 01-findings-spreadsheet-analysis.md with 12 issues (IMP-01..IMP-12) + recommended sequencing.
Installed openpyxl into the OCR service venv for inspection.

Key facts established:

Importer defaults match the ODS, not the new xlsx → wrong column mapping (IMP-01).
About 90% of dated rows (6,571 of 7,319) carry free-text dates that the ISO-only parser drops (IMP-02).
Person register is rich but unimported; holds the maiden-name dedup key (IMP-04/05).

Decisions pending from Marcel (blockers for any code work):

IMP-01: positional re-config of app.import.col.* vs header-driven mapping rewrite?
IMP-02: how to store imprecise dates — new dateOriginal + precision columns, or lossy?
IMP-04/05: format for the person/alias mapping; import persons before documents?
IMP-10: are x-suffix rows separate documents, attachments, or skipped?

Next:

Get Marcel's calls on the 4 decisions above.
Then split the code-change items into Gitea issues (IMP-01b, IMP-02, IMP-04, IMP-06, IMP-12).
Pure-data tasks (IMP-07 dup list, IMP-09 file reconcile) stay here.
