Marcel adfff420a5 docs(import): add import-migration analysis + normalizer spec

Document the raw archive spreadsheet findings (IMP-01..12) and a
requirements spec for an offline normalizer that produces a clean
canonical dataset before import. Local docs only; no Gitea issue yet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 12:32:37 +02:00

Spreadsheet Analysis — Findings (2026-05-25)

Analysis of the real raw archive spreadsheets against the current MassImportService (backend/.../importing/MassImportService.java). Goal: import ~7,600 letter rows + a 163-person register, with PDFs to follow.

Every issue has an ID (IMP-NN), severity, evidence, and a proposed approach.

0. Context: how the importer reads a row today

MassImportService reads sheet index 0 and maps columns by configurable indices (app.import.col.*, defaults in the source):

Property	Default col	Meaning
`colIndex`	0	Index (→ filename `<index>.pdf`)
`colBox`	1	Box
`colFolder`	2	Mappe
`colSender`	3	Sender (raw)
`colReceivers`	5	Receivers (raw)
`colDate`	7	Date
`colLocation`	9	Location
`colTags`	10	Tag (single)
`colSummary`	11	Summary
`colTranscription`	13	Transcription

These defaults match the ODS file exactly (Index, Box, Mappe, Von, BriefeschreiberIn, An, EmpfängerIn, Datum, Datum Originalformat, Ort, Schlagwort, Inhalt, Zeitlicher Kontext, Transkript = 14 cols). The ODS was the development target. The new xlsx is a different beast.

Per-row pipeline: skip if Index blank → derive filename from Index → validate filename → look for file on disk (recursive; metadata-only if absent) → check PDF magic bytes → importSingleDocument (upsert by originalFilename, dedupe non-placeholders as ALREADY_EXISTS). Date parsing is ISO-only (LocalDate.parse).

IMP-01 — New xlsx column layout ≠ importer defaults 🔴 BLOCKER

The new …aktuell…xlsx (sheet Familienarchiv, 7,943 rows × 12 cols) has a denser, different layout. There is an extra Datei column at index 1, and the normalized Von/An/ISO-Datum columns from the ODS do not exist.

col	New xlsx header	Importer default expects	Result with defaults
0	Index	Index	✅ ok
1	Datei (path)	Box	❌ Box ← `..\__scan\W-0001.pdf`
2	Box	Mappe	❌ Mappe ← `V`
3	Mappe	Sender	❌ Sender ← `1`
4	BriefeschreiberIn (sender)	— (unused)	❌ sender ignored
5	EmpfängerIn (receiver)	Receivers	✅ coincidentally ok
6	Datum des Briefes	— (unused)	❌ date ignored
7	Ort (location)	Date	❌ Date ← `Rotterdam` → null
8	Schlagwort (tag)	— (unused)	❌ tag ignored
9	Inhalt (summary)	Location	❌ Location ← summary text
10	—	Tag	❌ empty
11	—	Summary	❌ empty
13	—	Transcription	❌ column doesn't exist

Impact: importing as-is produces almost entirely garbage metadata.

Proposed approach (decide with Marcel):

(a) Re-map via the existing app.import.col.* properties — fast, no code. New mapping: index=0, box=2, folder=3, sender=4, receivers=5, date=6, location=7, tags=8, summary=9, and there is no transcription column (point it past the end or add a "missing column" convention). Caveat: tags land in colTags but the real per-letter keywords are in Inhalt (col 9) — see IMP-08 note on tags vs summary.
(b) Make the importer header-driven (map by header name, not index) so it survives layout drift across files. More robust, needs a code change (→ Gitea issue).

Recommendation: (b) is the durable fix given we have ≥3 different layouts already.
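A minimal sketch of what option (b) could look like, independent of POI. The class name `HeaderColumnResolver` and the field keys (`index`, `box`, ...) are hypothetical and are not the importer's real property names; the header texts are the ones quoted in this analysis. It only shows the idea of resolving positions from the header row and failing loudly when a header is missing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: resolves column positions from header names instead of fixed indices. */
final class HeaderColumnResolver {

    // Field key -> accepted header spellings. More spellings can be added per layout.
    private static final Map<String, List<String>> ACCEPTED_HEADERS = new LinkedHashMap<>();
    static {
        ACCEPTED_HEADERS.put("index", List.of("Index"));
        ACCEPTED_HEADERS.put("box", List.of("Box"));
        ACCEPTED_HEADERS.put("folder", List.of("Mappe"));
        ACCEPTED_HEADERS.put("sender", List.of("BriefeschreiberIn"));
        ACCEPTED_HEADERS.put("receivers", List.of("EmpfängerIn"));
        ACCEPTED_HEADERS.put("date", List.of("Datum des Briefes"));
        ACCEPTED_HEADERS.put("location", List.of("Ort"));
        ACCEPTED_HEADERS.put("tags", List.of("Schlagwort"));
        ACCEPTED_HEADERS.put("summary", List.of("Inhalt"));
    }

    /** Returns field key -> 0-based column; throws if any expected header is absent. */
    static Map<String, Integer> resolve(List<String> headerRow) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (int col = 0; col < headerRow.size(); col++) {
            String cell = headerRow.get(col);
            if (cell != null) {
                positions.putIfAbsent(normalize(cell), col);
            }
        }
        Map<String, Integer> resolved = new LinkedHashMap<>();
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : ACCEPTED_HEADERS.entrySet()) {
            Integer found = null;
            for (String header : field.getValue()) {
                found = positions.get(normalize(header));
                if (found != null) {
                    break;
                }
            }
            if (found == null) {
                missing.add(field.getKey());
            } else {
                resolved.put(field.getKey(), found);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Missing header(s) for: " + missing);
        }
        return resolved;
    }

    private static String normalize(String header) {
        return header.trim().toLowerCase(Locale.ROOT);
    }
}
```

Fed the new xlsx header row, this yields the same positions as the manual re-mapping in (a), and the extra Datei column no longer shifts anything. Transcription is left out of the required set here because the new file has no such column; how optional columns are declared is an open design choice.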

IMP-02 — 90% of dates are free-text the parser can't read 🔴 BLOCKER

Dates are transcribed exactly as they appear in the letters. parseDate() only calls LocalDate.parse() (ISO yyyy-MM-dd), so any non-ISO value becomes null.

Of 7,319 rows with a date value (col 6):

kind	count	parses today?
Real Excel date cells (→ ISO via POI)	748	✅
Free-text date strings	6,571	❌ → null

→ About 90% of dated rows (6,571 of 7,319) lose their date. (623 rows have no date at all.)

Observed free-text formats (counts approximate, from col 6):

Format	Count	Examples
`D.M.YY`	1,338	`11.10.08`, `13.5.09`
`D.RomanMonth.YY/YYYY`	~1,527	`22.III.18`, `19.XII.1954`, `1.III.27`
`D.Month YYYY`	950	`6.März 1888`, `9.März 1888` (note: no space after the dot)
`D.M.YYYY`	358	`15.2.1888`, `7.3.1888`
Approximate / unknown	146	`?`, `13.7.18?`, `17.Nov (?) 1887`, `13.Januar ? 1907`
`Month YYYY` / season / holiday	41+27	`Mai 1895`, `Herbst 1913`, `Pfingsten 1922`, `Ostern 1890`
`YYYY` only	17	`1905`, `1949`
`D.M.` no year	10	`8.9.`, `14.3.`
Ranges	5+	`8.1.1916 - 15.3.1916`, `1881/82`, `1945/46?`
Abbrev/English months, no space	many	`29.Sept.1891`, `10.Oct.95`, `9.December1889`, `18.Dez.1916`
Slash separator	~315	`2/2. 18`, `17/6. 1916`, `10/4. 1917`
English `Month D. YYYY`	several	`April 12. 1922`, `Oct.5. 1916`, `Mai 23. 1917`
Trailing notes	5+	`26.4.1888, 2. Brief`, `31.8.1888,2.Brief`
3-digit year (typo)	107	`30.1.889` (→ 1889), `4.3.1023` (in person file → 1923)
Day-range within month	several	`7./8. Sept.1923`

Proposed approach: build a tolerant German/historical date parser (→ Gitea issue, it's a code change). Requirements:

Numeric D.M.YY[YY] and D/M. YY[YY] (slash = dot).
Roman-numeral months (I–XII).
German + English month names, full + abbreviated, with/without separating space (März, Sept., Dez, December, Oct.).
2-digit and 3-digit year normalization (08→1908? needs a century rule; 889→1889).
Partial dates → store what's known. The schema only has a single documentDate LocalDate; decide whether to (i) store first-of-month/year, (ii) add a datePrecision enum + dateOriginal text column, or (iii) keep raw text in a new documentDateRaw field and leave documentDate null when imprecise. Recommendation: preserve the original string always (new column) + best-effort parsed date + precision flag, so nothing is lost and the UI can show "ca. 1916".
Unparseable/approximate (?, Herbst 1913) → keep raw, leave parsed date null, do not drop the row.
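The requirements above can be sketched for the day-level formats only. This is an illustration, not the planned implementation: the names `HistoricalDateParser`, `ParsedDate` and `DatePrecision` are hypothetical, and the two-digit century rule (60 and above → 18xx, otherwise 19xx) is an assumption that still has to be agreed. Partial dates, ranges, trailing notes and `?` markers are deliberately left unparsed here; they keep their raw text and a null date.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum DatePrecision { DAY, UNKNOWN }

/** Hypothetical result type: the raw text is always kept; date is null when not interpretable. */
final class ParsedDate {
    final String raw;
    final LocalDate date;
    final DatePrecision precision;

    ParsedDate(String raw, LocalDate date, DatePrecision precision) {
        this.raw = raw;
        this.date = date;
        this.precision = precision;
    }
}

final class HistoricalDateParser {
    // day, dot or slash, month (digits, Roman numeral or name), optional dot, 2-4 digit year
    private static final Pattern DAY_MONTH_YEAR = Pattern.compile(
            "(\\d{1,2})\\s*[./]\\s*([0-9]{1,2}|[IVX]+|\\p{L}+)\\.?\\s*(\\d{2,4})");
    private static final Map<String, Integer> ROMAN = new HashMap<>();
    private static final Map<String, Integer> NAME_PREFIX = new HashMap<>();
    static {
        String[] roman = {"I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII"};
        for (int i = 0; i < roman.length; i++) {
            ROMAN.put(roman[i], i + 1);
        }
        // First three letters of German and English month names.
        String[][] prefixes = {{"jan"}, {"feb"}, {"mär", "mar"}, {"apr"}, {"mai", "may"}, {"jun"},
                {"jul"}, {"aug"}, {"sep"}, {"okt", "oct"}, {"nov"}, {"dez", "dec"}};
        for (int i = 0; i < prefixes.length; i++) {
            for (String prefix : prefixes[i]) {
                NAME_PREFIX.put(prefix, i + 1);
            }
        }
    }

    static ParsedDate parse(String raw) {
        if (raw == null) {
            return new ParsedDate(null, null, DatePrecision.UNKNOWN);
        }
        Matcher m = DAY_MONTH_YEAR.matcher(raw.trim());
        if (!m.matches()) {
            return new ParsedDate(raw, null, DatePrecision.UNKNOWN);
        }
        Integer month = resolveMonth(m.group(2));
        if (month == null) {
            return new ParsedDate(raw, null, DatePrecision.UNKNOWN);
        }
        try {
            LocalDate date = LocalDate.of(normalizeYear(m.group(3)), month, Integer.parseInt(m.group(1)));
            return new ParsedDate(raw, date, DatePrecision.DAY);
        } catch (DateTimeException e) {
            // e.g. month 13 or day 31 in a 30-day month: keep the raw text only
            return new ParsedDate(raw, null, DatePrecision.UNKNOWN);
        }
    }

    private static Integer resolveMonth(String token) {
        if (Character.isDigit(token.charAt(0))) {
            return Integer.parseInt(token);
        }
        Integer roman = ROMAN.get(token);
        if (roman != null) {
            return roman;
        }
        String lower = token.toLowerCase(Locale.ROOT);
        return lower.length() >= 3 ? NAME_PREFIX.get(lower.substring(0, 3)) : null;
    }

    static int normalizeYear(String digits) {
        int value = Integer.parseInt(digits);
        if (digits.length() == 4) {
            return value;
        }
        if (digits.length() == 3) {
            return 1000 + value; // dropped leading digit: 889 -> 1889
        }
        return value >= 60 ? 1800 + value : 1900 + value; // ASSUMED century pivot
    }
}
```

A four-digit typo such as 4.3.1023 passes through unchanged in this sketch, so a plausibility check on the year range would still be needed on top.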

Cross-check: even after IMP-01 is fixed so the date column is read, IMP-02 still bites. Both must be solved before a real import.

IMP-03 — New xlsx has no normalized/ISO date or name columns 🔴 BLOCKER

The ODS had helper columns the importer relied on: Von/An (normalized names) and Datum (ISO) alongside Datum Originalformat. The new xlsx has only the raw BriefeschreiberIn / EmpfängerIn / Datum des Briefes. So:

Names must be parsed from raw strings. PersonNameParser already splits receivers; the sender is taken raw and never split, which is fine because senders are single, but it also means sender names get no normalization.
Dates must be parsed from raw (IMP-02).

This is the root reason IMP-01/02 exist: the new file is the uncurated source, not the hand-normalized ODS. Tie any importer redesign to this reality — we will not get clean helper columns in the 7k-row file.

IMP-04 — Person register not imported at all 🟠 MAJOR

Personendatei 2.xlsx → sheet Tabelle1, 163 people, columns: Generation, Familienname, Vorname, geb als (maiden), Geburtsdatum, Geburtsort, Todesdatum, Sterbeort, verheiratet mit, Bemerkung.

Today MassImportService has no person-register import. Persons are only auto-created as bare aliases from the document sender/receiver strings (personService.findOrCreateByAlias). All this rich genealogical data is unused:

birth/death dates + places,
maiden names (the key to dedup — see IMP-05),
verheiratet mit (marriage links → PersonRelationship domain),
Bemerkung relationship hints ("Schwester v Marie Cram", "Nichte von Herbert"),
Generation (G 1–G 4),
nicknames in quotes ("Tante Lolly").

Data-quality notes in this file too: multi-value Vorname (Charlotte,Meta,Jacobi); mixed Excel-date vs text dates; typos (4.3.1023); missing-day dates (.12.1955); trailing spaces (30.8.1862 ).

Proposed approach: a separate Person import (→ Gitea issue). Order matters: import persons first so documents can link to real people instead of creating alias stubs. Use geb als + verheiratet mit to pre-build the alias/relationship graph.

IMP-05 — Name variations create duplicate Persons 🟠 MAJOR

The same person appears under several surface forms across the document sheet:

Eugenie Müller (151) vs Eugenie de Gruyter (452) — maiden vs married.
Clara Cram (sender 1,284) vs Clara de Gruyter (455) vs Clara de Gruyter sen. (66).
Walter de Gruyter (589) vs bare Walter (78).

findOrCreateByAlias keys on the raw string, so each variant becomes (or matches) a distinct alias and likely a distinct Person. Result: fragmented person records, broken Briefwechsel pairing, wrong stats.

Proposed approach: drive dedup from the register's geb als column (IMP-04) — Eugenie de Gruyter geb Müller tells us the two strings are one person. Build an alias map (married ↔ maiden ↔ nickname) before/while importing documents. This is partly data (an alias mapping table/sheet) and partly code (consume it). Likely a Gitea issue once the mapping format is decided.

945 distinct sender strings / 274 distinct receiver strings — expect a long-tail of variants to reconcile. Don't try to be perfect on the first pass; get the high-frequency names right.
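As a rough illustration of the register-driven alias map, the sketch below derives a married-name and a maiden-name form from each register row and maps both to one canonical name. `RegisterEntry` and `AliasIndex` are hypothetical names, and the sketch ignores nicknames, suffixes such as "sen.", and multi-value Vorname cells, all of which the real mapping would need to cover.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical view of one register row (Vorname, Familienname, geb als). */
final class RegisterEntry {
    final String firstName;
    final String familyName;
    final String maidenName; // may be null or blank

    RegisterEntry(String firstName, String familyName, String maidenName) {
        this.firstName = firstName;
        this.familyName = familyName;
        this.maidenName = maidenName;
    }
}

/** Hypothetical alias lookup built once from the register before documents are imported. */
final class AliasIndex {
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    static AliasIndex fromRegister(List<RegisterEntry> entries) {
        AliasIndex index = new AliasIndex();
        for (RegisterEntry entry : entries) {
            String canonical = entry.firstName.trim() + " " + entry.familyName.trim();
            // putIfAbsent: on a collision the first person wins; real code should report it
            index.canonicalByAlias.putIfAbsent(key(canonical), canonical);
            if (entry.maidenName != null && !entry.maidenName.trim().isEmpty()) {
                index.canonicalByAlias.putIfAbsent(key(entry.firstName + " " + entry.maidenName), canonical);
            }
        }
        return index;
    }

    /** Returns the canonical name for a raw sender/receiver string, or the trimmed input if unknown. */
    String canonicalize(String raw) {
        return canonicalByAlias.getOrDefault(key(raw), raw.trim());
    }

    private static String key(String name) {
        return name.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

With a register row for Eugenie de Gruyter geb Müller, both Eugenie Müller and Eugenie de Gruyter resolve to the same canonical name, while an unknown string such as bare Walter is returned unchanged for later manual reconciliation.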

IMP-06 — 93 data rows with blank Index are silently dropped 🟠 MAJOR

`processRows` does `if (index.isBlank()) continue;`. 93 rows have a blank Index but carry other data (sender/receiver/date/etc.). These are silently skipped: they don't even appear in the skippedFiles report (that list only covers rows that had an index but failed file checks).

Proposed approach: before import, triage these 93 rows — are they continuation rows, section markers, or genuine letters missing an ID? At minimum, surface a count/warning so nothing vanishes unnoticed. Possibly a small importer change to report blank-index skips.
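A small sketch of the "surface a count/warning" idea, operating on rows that have already been read into plain string lists. The class and method names are hypothetical and this is not the existing `processRows` logic; it only shows how blank-Index rows that still carry data could be collected for a report instead of vanishing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-import check for rows that would be skipped silently. */
final class BlankIndexReport {

    /** Returns 0-based positions of rows whose Index cell is blank while some other cell has content. */
    static List<Integer> rowsWithDataButNoIndex(List<List<String>> rows, int indexCol) {
        List<Integer> hits = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = rows.get(r);
            String index = indexCol < row.size() ? row.get(indexCol) : null;
            if (!isBlank(index)) {
                continue; // has an Index, handled by the normal pipeline
            }
            for (int c = 0; c < row.size(); c++) {
                if (c != indexCol && !isBlank(row.get(c))) {
                    hits.add(r);
                    break;
                }
            }
        }
        return hits;
    }

    private static boolean isBlank(String cell) {
        return cell == null || cell.trim().isEmpty();
    }
}
```

Fully empty rows are not reported, so the result should line up with the 93 rows counted above and also catches section banners like the one described in IMP-08.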

IMP-07 — 43 duplicate Index values 🟡 MINOR

43 Index values repeat (e.g. W-0388, Eu-0332, C-0234, C-0235, C-0236, J-0175). Since the filename is derived from Index, the importer's upsert keys both rows on the same originalFilename: the second occurrence is treated as ALREADY_EXISTS (if the first isn't a placeholder) and its metadata is lost, or it overwrites a placeholder.

Proposed approach: list the 43 duplicates, check whether they're true duplicates or two distinct letters that share an ID by mistake. Fix in the source data, or extend the ID scheme. Data task first; software only if the ID scheme must change.
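Listing the duplicates is a simple counting pass over the Index column. The sketch below assumes the Index values have already been extracted into a list; `DuplicateIndexFinder` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports Index values that occur more than once. */
final class DuplicateIndexFinder {

    /** Returns Index value -> occurrence count, only for values seen at least twice, in first-seen order. */
    static Map<String, Integer> duplicates(List<String> indexValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : indexValues) {
            if (value == null || value.trim().isEmpty()) {
                continue; // blank Index rows are a separate finding (IMP-06)
            }
            counts.merge(value.trim(), 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < 2);
        return counts;
    }
}
```

Whether each reported pair is a true duplicate or two letters sharing an ID still has to be checked by hand against the rows.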

IMP-08 — Section/title rows interleaved with data 🟡 MINOR

Row 2 of the sheet is a section header sitting only in the sender column (Brautbriefe von Walter der Gruyter an Eugenie Müller) with a blank Index — caught by the blank-Index skip (overlaps IMP-06). There may be more such banners scattered through 7,943 rows. Also relevant: the per-letter keywords live in Inhalt (col 9) as comma-joined values (Tilburg,Verwandschaft, poetisch,Reise nach Breda), while Schlagwort (col 8) holds a single broad tag (Brautbriefe). The importer only takes one tag column — decide which column feeds tags vs summary, and whether to split comma-lists into multiple tags.

Proposed approach: scan for rows where Index is blank but other cells are set (already have the count: relates to the 93 in IMP-06). Confirm tag vs summary column choice with Marcel.

IMP-09 — Index ↔ Datei filename mismatches 🟡 MINOR

The Datei column (col 1) holds explicit relative paths (..\__scan\W-0001.pdf), but they don't always agree with the Index. Example: row 20 has Index W-0010x but Datei ..\__scan\W-0011x.pdf. The importer derives the filename from Index, so it will look for W-0010x.pdf and may miss the actual scan. (Note: the Datei paths themselves are Windows-style with \ and .. and would be rejected by isValidImportFilename if anyone tried to use that column directly: 7,623 rows use backslashes, 7,455 contain `..`.)

Proposed approach: when the PDFs arrive, reconcile Index-derived names against actual filenames; produce a mismatch report. Keep deriving from Index (stable IDs) but flag disagreements. Mostly a data/QA task.
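The Index versus Datei comparison can be prototyped before the PDFs arrive. The sketch below takes only the last path segment of the Datei value and compares it with the Index-derived name `<index>.pdf`; `DateiIndexCheck` is a hypothetical name, and the case-insensitive comparison is an assumption.

```java
/** Hypothetical check that a Datei path points at the file the Index would derive. */
final class DateiIndexCheck {

    /** Extracts the last segment of a Windows- or Unix-style path. */
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        return path.substring(cut + 1).trim();
    }

    /** True when the Datei file name equals "<index>.pdf" (ASSUMPTION: case is ignored). */
    static boolean agrees(String index, String dateiPath) {
        return baseName(dateiPath).equalsIgnoreCase(index.trim() + ".pdf");
    }
}
```

Rows where `agrees` returns false are candidates for the mismatch report; the row 20 example above (Index W-0010x, Datei ..\__scan\W-0011x.pdf) is one of them.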

IMP-10 — `x`-suffix rows (letter backsides / enclosures) 🟡 MINOR

42 rows have an x-suffixed Index (W-0001x, W-0002x, …). They're sparse — typically only Index + Datei + sender + receiver, no box/folder/date. They appear to be the reverse side or an enclosure of the preceding letter. The importer treats each as an independent Document, and the metadataComplete heuristic flags them complete as soon as a sender is present (date/box/folder all missing).

Proposed approach: decide whether x rows should be (a) separate documents, (b) extra pages/files attached to their parent, or (c) skipped. Affects both the data model and the metadataComplete heuristic. Discuss with Marcel.
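If option (b) is chosen, each x row needs its presumed parent. A minimal sketch, assuming the parent Index is simply the x-suffixed value without the trailing x (an assumption based on the W-0001x pattern, not yet confirmed against all 42 rows); `XSuffixRows` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for relating x-suffixed rows to their presumed parent letter. */
final class XSuffixRows {
    // An Index ending in a digit followed by a lowercase x, e.g. W-0001x
    private static final Pattern X_SUFFIXED = Pattern.compile("(.*\\d)x");

    /** Returns the presumed parent Index, or null when the value has no x suffix. */
    static String parentIndexOf(String index) {
        Matcher m = X_SUFFIXED.matcher(index.trim());
        return m.matches() ? m.group(1) : null;
    }
}
```

An x row whose presumed parent does not exist in the sheet would be worth flagging separately before any attachment logic is built.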

IMP-11 — Multi-receiver separators include bare `u` / `u.` 🟡 MINOR

PersonNameParser.parseReceivers already handles und, u, //, geb., parenthesised shared surnames, and Familie filtering, which is good. But the real data uses the u abbreviation in forms that the top-receivers list shows are common: Eugenie u Walter de Gruyter (230), Herbert u Clara (94), Juan u Marie Cram (75), plus space-joined pairs like Ella Anita (79) that may be two people. Raw separator tally on receivers: `und` ×70, `,` ×11, `;` ×2, `/` ×1, in addition to the many `u` cases above. Senders are not parsed at all (taken raw), which is fine unless a sender cell ever holds two names.

Proposed approach: add MassImportServiceTest cases for the real-world strings above; extend the parser only where it actually fails. Ella Anita-style space-joined pairs are ambiguous — likely leave as one person unless the register says otherwise (ties to IMP-05).
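To make the expected behaviour for these strings concrete, here is a standalone stand-in splitter. It is not `PersonNameParser.parseReceivers` and does not reproduce its geb., shared-surname or Familie handling; `ReceiverSplitSketch` is a hypothetical name, and the sketch only fixes what the raw split should yield for the separators tallied above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in that splits a raw receiver cell on the separators seen in the data. */
final class ReceiverSplitSketch {
    // " und ", " u ", " u. " surrounded by whitespace, or runs of comma / semicolon / slash
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s+(?:und|u\\.?)\\s+|\\s*[,;/]+\\s*");

    static List<String> split(String raw) {
        List<String> parts = new ArrayList<>();
        for (String part : SEPARATOR.split(raw.trim())) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }
}
```

For Eugenie u Walter de Gruyter this yields the parts Eugenie and Walter de Gruyter, so propagating the shared surname to the first part remains a separate step. Ella Anita stays a single part, matching the proposal to leave space-joined pairs alone.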

IMP-12 — Importer reads only the first sheet, no validation 🟡 MINOR

readXlsx does workbook.getSheetAt(0). For the new xlsx that's Familienarchiv (✅), but the file also contains Inhaltsverzeichnis grob, Inhaltsverzeichnis WdG, Tabelle4. There is no header validation: if the wrong file/sheet is dropped in /import, the importer will happily map columns positionally and import nonsense. Also findSpreadsheetFile() picks the first spreadsheet found in /import — with three spreadsheets present there today, which one wins is filesystem-order-dependent.

Proposed approach: (a) validate the header row against expected names before importing; (b) make the target sheet/file explicit (config or header match) rather than "first found". Ties into the header-driven mapping in IMP-01(b).
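Points (a) and (b) can be combined by choosing the sheet from its header row rather than its position. The sketch below works on header rows that have already been read per sheet; `SheetSelector` is a hypothetical name, and requiring exactly one matching sheet is an assumed policy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical selector: picks the one sheet whose header row contains all required names. */
final class SheetSelector {

    static String selectSheet(Map<String, List<String>> headerBySheet, List<String> requiredHeaders) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<String>> sheet : headerBySheet.entrySet()) {
            Set<String> present = new HashSet<>();
            for (String header : sheet.getValue()) {
                if (header != null) {
                    present.add(header.trim());
                }
            }
            if (present.containsAll(requiredHeaders)) {
                matches.add(sheet.getKey());
            }
        }
        if (matches.size() != 1) {
            // zero matches: wrong file or sheet; several matches: ambiguous, refuse to guess
            throw new IllegalStateException(
                    "Expected exactly one sheet with headers " + requiredHeaders + " but found " + matches);
        }
        return matches.get(0);
    }
}
```

The same check could be applied across files in /import, so that the import aborts with a clear message instead of depending on filesystem order.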

Summary of recommended sequencing

Decide the importer mapping strategy (IMP-01): positional re-config vs header-driven. Header-driven is the durable choice and unblocks IMP-03/12.
Build the tolerant date parser (IMP-02) with original-string preservation + precision.
Import the Person register first (IMP-04) and build the alias/marriage graph, which feeds person dedup (IMP-05).
Then import documents, with reporting for blank-index (IMP-06), duplicates (IMP-07), and section rows (IMP-08).
Reconcile files when the ~7,000 PDFs arrive (IMP-09), and decide x-row semantics (IMP-10).

Code-change items (→ Gitea issues when we get there): IMP-01(b), IMP-02, IMP-04, IMP-05 (consume side), IMP-06 reporting, IMP-12. Pure-data items stay in this folder.