Document the raw archive spreadsheet findings (IMP-01..12) and a requirements spec for an offline normalizer that produces a clean canonical dataset before import. Local docs only; no Gitea issue yet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spreadsheet Analysis — Findings (2026-05-25)
Analysis of the real raw archive spreadsheets against the current MassImportService
(backend/.../importing/MassImportService.java). Goal: import ~7,600 letter rows + a
163-person register, with PDFs to follow.
Every issue has an ID (IMP-NN), severity, evidence, and a proposed approach.
0. Context: how the importer reads a row today
MassImportService reads sheet index 0 and maps columns by configurable indices
(app.import.col.*, defaults in the source):
| Property | Default col | Meaning |
|---|---|---|
| colIndex | 0 | Index (→ filename <index>.pdf) |
| colBox | 1 | Box |
| colFolder | 2 | Mappe |
| colSender | 3 | Sender (raw) |
| colReceivers | 5 | Receivers (raw) |
| colDate | 7 | Date |
| colLocation | 9 | Location |
| colTags | 10 | Tag (single) |
| colSummary | 11 | Summary |
| colTranscription | 13 | Transcription |
These defaults match the ODS file exactly (Index, Box, Mappe, Von, BriefeschreiberIn, An, EmpfängerIn, Datum, Datum Originalformat, Ort, Schlagwort, Inhalt, Zeitlicher Kontext, Transkript = 14 cols). The ODS was the development target. The new xlsx is a different beast.
Per-row pipeline: skip if Index blank → derive filename from Index → validate filename →
look for file on disk (recursive; metadata-only if absent) → check PDF magic bytes →
importSingleDocument (upsert by originalFilename, dedupe non-placeholders as
ALREADY_EXISTS). Date parsing is ISO-only (LocalDate.parse).
IMP-01 — New xlsx column layout ≠ importer defaults 🔴 BLOCKER
The new …aktuell…xlsx (sheet Familienarchiv, 7,943 rows × 12 cols) has a denser,
different layout. There is an extra Datei column at index 1, and the normalized
Von/An/ISO-Datum columns from the ODS do not exist.
| col | New xlsx header | Importer default expects | Result with defaults |
|---|---|---|---|
| 0 | Index | Index | ✅ ok |
| 1 | Datei (path) | Box | ❌ Box ← ..\__scan\W-0001.pdf |
| 2 | Box | Mappe | ❌ Mappe ← V |
| 3 | Mappe | Sender | ❌ Sender ← 1 |
| 4 | BriefeschreiberIn (sender) | — (unused) | ❌ sender ignored |
| 5 | EmpfängerIn (receiver) | Receivers | ✅ coincidentally ok |
| 6 | Datum des Briefes | — (unused) | ❌ date ignored |
| 7 | Ort (location) | Date | ❌ Date ← Rotterdam → null |
| 8 | Schlagwort (tag) | — (unused) | ❌ tag ignored |
| 9 | Inhalt (summary) | Location | ❌ Location ← summary text |
| 10 | — | Tag | ❌ empty |
| 11 | — | Summary | ❌ empty |
| 13 | — | Transcription | ❌ column doesn't exist |
Impact: importing as-is produces almost entirely garbage metadata.
Proposed approach (decide with Marcel):
- (a) Re-map via the existing app.import.col.* properties: fast, no code change. New mapping: index=0, box=2, folder=3, sender=4, receivers=5, date=6, location=7, tags=8, summary=9. There is no transcription column (point it past the end or add a "missing column" convention). Caveat: tags land in colTags, but the real per-letter keywords are in Inhalt (col 9); see the IMP-08 note on tags vs summary.
- (b) Make the importer header-driven (map by header name, not index) so it survives layout drift across files. More robust, but needs a code change (→ Gitea issue).
Recommendation: (b) is the durable fix given we have ≥3 different layouts already.
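Option (b) could look roughly like the sketch below. It is a minimal illustration, not the importer's code: HeaderColumnMapper and its methods are hypothetical names, the header row is assumed to have been read into a list of strings already, and matching headers case-insensitively after trimming is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: resolves column indices from header names instead of fixed positions. */
final class HeaderColumnMapper {

    /** Maps each normalized, non-blank header cell to its column index; the first occurrence wins. */
    static Map<String, Integer> indexByHeader(List<String> headerRow) {
        Map<String, Integer> result = new HashMap<>();
        for (int i = 0; i < headerRow.size(); i++) {
            String cell = headerRow.get(i);
            if (cell == null || cell.isBlank()) {
                continue;
            }
            result.putIfAbsent(normalize(cell), i);
        }
        return result;
    }

    /** Returns the column index for a header name, or -1 if the sheet has no such column. */
    static int columnOf(Map<String, Integer> index, String header) {
        return index.getOrDefault(normalize(header), -1);
    }

    // Assumption: headers are compared trimmed and case-insensitively.
    private static String normalize(String header) {
        return header.trim().toLowerCase(Locale.ROOT);
    }
}
```

A -1 result gives the "missing column" convention mentioned in (a) for free: the new xlsx would resolve Box to 2 and a Transkript lookup to -1, so the importer could skip the transcription instead of reading a wrong cell.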
IMP-02 — 90% of dates are free-text the parser can't read 🔴 BLOCKER
The dates are written as in the letter. parseDate() only does LocalDate.parse()
(ISO yyyy-MM-dd), so anything non-ISO becomes null.
Of 7,319 rows with a date value (col 6):
| kind | count | parses today? |
|---|---|---|
| Real Excel date cells (→ ISO via POI) | 748 | ✅ |
| Free-text date strings | 6,571 | ❌ → null |
→ 90% of dated rows lose their date. (623 rows have no date at all.)
Observed free-text formats (counts approximate, from col 6):
| Format | Count | Examples |
|---|---|---|
| D.M.YY | 1,338 | 11.10.08, 13.5.09 |
| D.RomanMonth.YY/YYYY | ~1,527 | 22.III.18, 19.XII.1954, 1.III.27 |
| D.Month YYYY | 950 | 6.März 1888, 9.März 1888 (note: no space after the dot) |
| D.M.YYYY | 358 | 15.2.1888, 7.3.1888 |
| Approximate / unknown | 146 | ?, 13.7.18?, 17.Nov (?) 1887, 13.Januar ? 1907 |
| Month YYYY / season / holiday | 41+27 | Mai 1895, Herbst 1913, Pfingsten 1922, Ostern 1890 |
| YYYY only | 17 | 1905, 1949 |
| D.M. no year | 10 | 8.9., 14.3. |
| Ranges | 5+ | 8.1.1916 - 15.3.1916, 1881/82, 1945/46? |
| Abbrev/English months, no space | many | 29.Sept.1891, 10.Oct.95, 9.December1889, 18.Dez.1916 |
| Slash separator | ~315 | 2/2. 18, 17/6. 1916, 10/4. 1917 |
| English Month D. YYYY | several | April 12. 1922, Oct.5. 1916, Mai 23. 1917 |
| Trailing notes | 5+ | 26.4.1888, 2. Brief, 31.8.1888,2.Brief |
| 3-digit year (typo) | 107 | 30.1.889 (→ 1889), 4.3.1023 (in person file → 1923) |
| Day-range within month | several | 7./8. Sept.1923 |
Proposed approach: build a tolerant German/historical date parser (→ Gitea issue, it's a code change). Requirements:
- Numeric D.M.YY[YY] and D/M. YY[YY] (slash = dot).
- Roman-numeral months (I–XII).
- German + English month names, full + abbreviated, with/without separating space (März, Sept., Dez, December, Oct.).
- 2-digit and 3-digit year normalization (08→1908? needs a century rule; 889→1889).
- Partial dates → store what's known. The schema only has a single documentDate LocalDate; decide whether to (i) store first-of-month/year, (ii) add a datePrecision enum + dateOriginal text column, or (iii) keep raw text in a new documentDateRaw field and leave documentDate null when imprecise. Recommendation: preserve the original string always (new column) + best-effort parsed date + precision flag, so nothing is lost and the UI can show "ca. 1916".
- Unparseable/approximate (?, Herbst 1913) → keep raw, leave parsed date null, do not drop the row.
Cross-check: even after IMP-01 is fixed so the date column is read, IMP-02 still bites. Both must be solved before a real import.
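To make the requirements concrete, here is a minimal sketch of such a parser covering only a subset of the observed formats (numeric, Roman-numeral, named month, year only). Everything in it is an assumption for illustration: the names TolerantDateParser, ParsedDate, and Precision are hypothetical, the century rule is modelled as a caller-supplied pivot (2-digit years below the pivot become 19xx, the rest 18xx), 3-digit years are prefixed with 1, and year-only values are stored as 1 January with YEAR precision. Ranges, seasons, trailing notes, and question marks fall through to UNKNOWN with the raw text kept.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; class, record, and method names are not from the codebase. */
final class TolerantDateParser {

    enum Precision { DAY, YEAR, UNKNOWN }

    /** Keeps the raw cell text next to the best-effort date so nothing is lost. */
    record ParsedDate(String original, LocalDate date, Precision precision) {}

    // Day, then '.' or '/', then a numeric or Roman month, then a 2-4 digit year.
    private static final Pattern NUMERIC_OR_ROMAN =
            Pattern.compile("^(\\d{1,2})[./]\\s*(\\d{1,2}|[IVX]+)\\.?\\s*(\\d{2,4})$");
    // Day, then '.', then a month name (optionally abbreviated with '.'), then the year.
    private static final Pattern NAMED_MONTH =
            Pattern.compile("^(\\d{1,2})\\.\\s*(\\p{L}{3,})\\.?\\s*(\\d{2,4})$");
    private static final Pattern YEAR_ONLY = Pattern.compile("^\\d{4}$");

    private static final List<String> ROMAN = List.of(
            "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII");

    // Keyed by the first three lower-cased letters; the escaped entry is the umlaut spelling of March.
    private static final Map<String, Integer> MONTH_PREFIX = Map.ofEntries(
            Map.entry("jan", 1), Map.entry("feb", 2), Map.entry("m\u00e4r", 3), Map.entry("mar", 3),
            Map.entry("apr", 4), Map.entry("mai", 5), Map.entry("may", 5), Map.entry("jun", 6),
            Map.entry("jul", 7), Map.entry("aug", 8), Map.entry("sep", 9), Map.entry("okt", 10),
            Map.entry("oct", 10), Map.entry("nov", 11), Map.entry("dez", 12), Map.entry("dec", 12));

    /** Never throws for bad input: unparseable text yields a null date with UNKNOWN precision. */
    static ParsedDate parse(String raw, int pivot) {
        String text = raw == null ? "" : raw.trim();
        try {
            for (Pattern pattern : List.of(NUMERIC_OR_ROMAN, NAMED_MONTH)) {
                Matcher m = pattern.matcher(text);
                if (m.matches()) {
                    LocalDate date = LocalDate.of(
                            normalizeYear(m.group(3), pivot), monthOf(m.group(2)), Integer.parseInt(m.group(1)));
                    return new ParsedDate(raw, date, Precision.DAY);
                }
            }
            if (YEAR_ONLY.matcher(text).matches()) {
                return new ParsedDate(raw, LocalDate.of(Integer.parseInt(text), 1, 1), Precision.YEAR);
            }
        } catch (DateTimeException e) {
            // Impossible dates (unknown month, 31.2., ...) are treated like unparseable text.
        }
        return new ParsedDate(raw, null, Precision.UNKNOWN);
    }

    // Assumed rules: 2-digit years use the pivot, 3-digit years get a leading 1 (889 -> 1889).
    private static int normalizeYear(String digits, int pivot) {
        int value = Integer.parseInt(digits);
        return switch (digits.length()) {
            case 2 -> value < pivot ? 1900 + value : 1800 + value;
            case 3 -> 1000 + value;
            default -> value;
        };
    }

    // Returns 0 for an unknown month token, which LocalDate.of rejects.
    private static int monthOf(String token) {
        if (token.chars().allMatch(Character::isDigit)) {
            return Integer.parseInt(token);
        }
        int roman = ROMAN.indexOf(token);
        if (roman >= 0) {
            return roman + 1;
        }
        String key = token.toLowerCase(Locale.ROOT);
        return MONTH_PREFIX.getOrDefault(key.length() > 3 ? key.substring(0, 3) : key, 0);
    }
}
```

With a pivot of 60, 11.10.08 resolves to 1908-10-11 and 10.Oct.95 to 1895-10-10. The pivot value itself is a guess and has to be settled against the archive's real date range before it is trusted.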
IMP-03 — New xlsx has no normalized/ISO date or name columns 🔴 BLOCKER
The ODS had helper columns the importer relied on: Von/An (normalized names) and
Datum (ISO) alongside Datum Originalformat. The new xlsx has only the raw
BriefeschreiberIn / EmpfängerIn / Datum des Briefes. So:
- Names must be parsed from raw strings (PersonNameParser already does receivers; sender is taken raw, never split — fine for senders, which are single, but no normalization).
- Dates must be parsed from raw (IMP-02).
This is the root reason IMP-01/02 exist: the new file is the uncurated source, not the hand-normalized ODS. Tie any importer redesign to this reality — we will not get clean helper columns in the 7k-row file.
IMP-04 — Person register not imported at all 🟠 MAJOR
Personendatei 2.xlsx → sheet Tabelle1, 163 people, columns:
Generation, Familienname, Vorname, geb als (maiden), Geburtsdatum, Geburtsort, Todesdatum, Sterbeort, verheiratet mit, Bemerkung.
Today MassImportService has no person-register import. Persons are only
auto-created as bare aliases from the document sender/receiver strings
(personService.findOrCreateByAlias). All this rich genealogical data is unused:
- birth/death dates + places,
- maiden names (the key to dedup; see IMP-05),
- verheiratet mit (marriage links → PersonRelationship domain),
- Bemerkung relationship hints ("Schwester v Marie Cram", "Nichte von Herbert"),
- Generation (G 1–G 4),
- nicknames in quotes ("Tante Lolly").
Data-quality notes in this file too: multi-value Vorname (Charlotte,Meta,Jacobi);
mixed Excel-date vs text dates; typos (4.3.1023); missing-day dates (.12.1955);
trailing spaces (30.8.1862 ).
Proposed approach: a separate Person import (→ Gitea issue). Order matters: import
persons first so documents can link to real people instead of creating alias stubs.
Use geb als + verheiratet mit to pre-build the alias/relationship graph.
IMP-05 — Name variations create duplicate Persons 🟠 MAJOR
The same person appears under several surface forms across the document sheet:
- Eugenie Müller (151) vs Eugenie de Gruyter (452): maiden vs married.
- Clara Cram (sender 1,284) vs Clara de Gruyter (455) vs Clara de Gruyter sen. (66).
- Walter de Gruyter (589) vs bare Walter (78).
findOrCreateByAlias keys on the raw string, so each variant becomes (or matches) a
distinct alias and likely a distinct Person. Result: fragmented person records,
broken Briefwechsel pairing, wrong stats.
Proposed approach: drive dedup from the register's geb als column (IMP-04) —
Eugenie de Gruyter geb Müller tells us the two strings are one person. Build an alias
map (married ↔ maiden ↔ nickname) before/while importing documents. This is partly data
(an alias mapping table/sheet) and partly code (consume it). Likely a Gitea issue once the
mapping format is decided.
945 distinct sender strings / 274 distinct receiver strings — expect a long-tail of variants to reconcile. Don't try to be perfect on the first pass; get the high-frequency names right.
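One possible shape for the maiden ↔ married part of that alias map is sketched below. It assumes the register has already been read into simple entries. AliasIndex and RegisterEntry are hypothetical names, and how the result plugs into findOrCreateByAlias is left open. Nicknames and bare first names are not covered.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical alias lookup built from the person register; not existing code. */
final class AliasIndex {

    /** One register row, reduced to the three columns this sketch needs. */
    record RegisterEntry(String vorname, String familienname, String gebAls) {}

    private final Map<String, String> canonicalByAlias = new HashMap<>();

    AliasIndex(List<RegisterEntry> register) {
        for (RegisterEntry entry : register) {
            // Assumption: "Vorname Familienname" serves as the canonical display name.
            String canonical = entry.vorname().trim() + " " + entry.familienname().trim();
            canonicalByAlias.putIfAbsent(key(canonical), canonical);
            if (entry.gebAls() != null && !entry.gebAls().isBlank()) {
                // The maiden-name form points at the same canonical person.
                canonicalByAlias.putIfAbsent(key(entry.vorname() + " " + entry.gebAls()), canonical);
            }
        }
    }

    /** Returns the canonical name for a raw sender/receiver string, or the trimmed input if unknown. */
    String resolve(String rawName) {
        return canonicalByAlias.getOrDefault(key(rawName), rawName.trim());
    }

    // Lookup keys ignore case and collapse repeated whitespace.
    private static String key(String name) {
        return name.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

With a register entry for Eugenie de Gruyter geb Müller, both Eugenie Müller and Eugenie de Gruyter resolve to the same canonical name, while an unknown string is returned unchanged so the long tail can be reconciled later.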
IMP-06 — 93 data rows with blank Index are silently dropped 🟠 MAJOR
processRows does if (index.isBlank()) continue;. 93 rows have a blank Index but
carry other data (sender/receiver/date/etc.). These are silently skipped — they don't even
appear in the skippedFiles report (that list only covers rows that had an index but
failed file checks).
Proposed approach: before import, triage these 93 rows — are they continuation rows, section markers, or genuine letters missing an ID? At minimum, surface a count/warning so nothing vanishes unnoticed. Possibly a small importer change to report blank-index skips.
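A pre-import check for these rows could be as small as the following sketch. It assumes the sheet has already been read into lists of cell strings. BlankIndexTriage is a hypothetical helper, not part of MassImportService.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-import check for rows that the blank-Index skip would drop. */
final class BlankIndexTriage {

    /** Returns the 0-based positions of rows whose index cell is blank but which carry other data. */
    static List<Integer> rowsWithDataButNoIndex(List<List<String>> rows, int indexCol) {
        List<Integer> hits = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = rows.get(r);
            String index = indexCol < row.size() ? row.get(indexCol) : null;
            if (!isBlank(index)) {
                continue; // the row has an index, so the importer will process it
            }
            for (int c = 0; c < row.size(); c++) {
                if (c != indexCol && !isBlank(row.get(c))) {
                    hits.add(r);
                    break;
                }
            }
        }
        return hits;
    }

    private static boolean isBlank(String value) {
        return value == null || value.isBlank();
    }
}
```

Fully empty rows are deliberately not reported, so the list stays focused on rows where data would actually be lost.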
IMP-07 — 43 duplicate Index values 🟡 MINOR
43 Index values repeat (e.g. W-0388, Eu-0332, C-0234, C-0235, C-0236, J-0175).
Since the filename is derived from Index, the importer's upsert keys both rows on the same
originalFilename: the second occurrence is treated as ALREADY_EXISTS (if the first
isn't a placeholder) and its metadata is lost, or it overwrites a placeholder.
Proposed approach: list the 43 duplicates, check whether they're true duplicates or two distinct letters that share an ID by mistake. Fix in the source data, or extend the ID scheme. Data task first; software only if the ID scheme must change.
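Listing the duplicates is a simple grouping step. The sketch below assumes the Index column has been extracted into a list of strings. DuplicateIndexReport is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical report helper: finds Index values that occur more than once. */
final class DuplicateIndexReport {

    /** Maps each repeated Index value to the 0-based row positions where it occurs, in sheet order. */
    static Map<String, List<Integer>> find(List<String> indexColumn) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int r = 0; r < indexColumn.size(); r++) {
            String index = indexColumn.get(r);
            if (index == null || index.isBlank()) {
                continue; // blank indices are a separate finding (IMP-06)
            }
            positions.computeIfAbsent(index.trim(), k -> new ArrayList<>()).add(r);
        }
        // Keep only values that appear at least twice.
        positions.values().removeIf(rows -> rows.size() < 2);
        return positions;
    }
}
```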
IMP-08 — Section/title rows interleaved with data 🟡 MINOR
Row 2 of the sheet is a section header sitting only in the sender column
(Brautbriefe von Walter der Gruyter an Eugenie Müller) with a blank Index — caught by the
blank-Index skip (overlaps IMP-06). There may be more such banners scattered through 7,943
rows. Also relevant: the per-letter keywords live in Inhalt (col 9) as comma-joined
values (Tilburg,Verwandschaft, poetisch,Reise nach Breda), while Schlagwort (col 8)
holds a single broad tag (Brautbriefe). The importer only takes one tag column —
decide which column feeds tags vs summary, and whether to split comma-lists into multiple
tags.
Proposed approach: scan for rows where Index is blank but other cells are set (already have the count: relates to the 93 in IMP-06). Confirm tag vs summary column choice with Marcel.
IMP-09 — Index ↔ Datei filename mismatches 🟡 MINOR
The Datei column (col 1) holds explicit relative paths (..\__scan\W-0001.pdf) but they
don't always agree with the Index. Example: row 20 has Index W-0010x but Datei
..\__scan\W-0011x.pdf. The importer derives the filename from Index, so it will look
for W-0010x.pdf and may miss the actual scan. (Note: the Datei paths themselves are
Windows-style with \ and .. and would be rejected by isValidImportFilename if anyone
tried to use that column directly: 7,623 rows use backslashes, 7,455 contain "..".)
Proposed approach: when the PDFs arrive, reconcile Index-derived names against actual filenames; produce a mismatch report. Keep deriving from Index (stable IDs) but flag disagreements. Mostly a data/QA task.
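The Index-vs-Datei part of that check needs no PDFs and can run on the spreadsheet alone. A minimal sketch follows, with a hypothetical IndexDateiCheck helper and the assumption that the comparison ignores case.

```java
/** Hypothetical QA helper: compares the Index-derived filename with the Datei column. */
final class IndexDateiCheck {

    /** Strips any directory part, accepting both Windows and Unix separators. */
    static String baseName(String path) {
        int cut = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        return path.substring(cut + 1);
    }

    /** True if the Datei path points at the file the importer would derive from the Index. */
    static boolean agrees(String index, String dateiPath) {
        String expected = index.trim() + ".pdf";
        // Assumption: case differences are not treated as mismatches.
        return baseName(dateiPath.trim()).equalsIgnoreCase(expected);
    }
}
```

For the row 20 example, agrees("W-0010x", "..\\__scan\\W-0011x.pdf") is false, so that row would land in the mismatch report.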
IMP-10 — x-suffix rows (letter backsides / enclosures) 🟡 MINOR
42 rows have an x-suffixed Index (W-0001x, W-0002x, …). They're sparse — typically
only Index + Datei + sender + receiver, no box/folder/date. They appear to be the reverse
side or an enclosure of the preceding letter. The importer treats each as an independent
Document, and the metadataComplete heuristic flags them complete as soon as a sender is
present (date/box/folder all missing).
Proposed approach: decide whether x rows should be (a) separate documents, (b) extra
pages/files attached to their parent, or (c) skipped. Affects both the data model and the
metadataComplete heuristic. Discuss with Marcel.
IMP-11 — Multi-receiver separators include bare u / u. 🟡 MINOR
PersonNameParser.parseReceivers already handles und, u, //, geb.,
parenthesised shared surnames, and Familie filtering — good. But the real data also uses
the u abbreviation in forms that the top-receivers list shows to be common:
Eugenie u Walter de Gruyter (230), Herbert u Clara (94), Juan u Marie Cram (75),
and space-joined pairs like Ella Anita (79) that may be two people.
Raw separator tally on receivers: und ×70, , ×11, ; ×2, / ×1 — plus the many u
cases above. Senders are not parsed at all (taken raw), which is fine unless a sender
cell ever holds two names.
Proposed approach: add MassImportServiceTest cases for the real-world strings above;
extend the parser only where it actually fails. Ella Anita-style space-joined pairs are
ambiguous — likely leave as one person unless the register says otherwise (ties to IMP-05).
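As a reference point for those test cases, the expected splits can be pinned down with a tiny standalone splitter. This is not PersonNameParser and does not reproduce its shared-surname or Familie handling. ReceiverSeparatorSketch is a hypothetical name, and it only shows how und, u, and u. would behave as separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of the separator rule only; the real parser does more. */
final class ReceiverSeparatorSketch {

    // "und", bare "u", or "u." as a standalone word between two names.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+(?:und|u\\.?)\\s+");

    /** Splits a raw receiver cell on the separators above; space-joined pairs stay together. */
    static List<String> split(String raw) {
        List<String> parts = new ArrayList<>();
        for (String part : SEPARATOR.split(raw.trim())) {
            if (!part.isBlank()) {
                parts.add(part.trim());
            }
        }
        return parts;
    }
}
```

Under this rule Eugenie u Walter de Gruyter yields Eugenie and Walter de Gruyter (the shared surname is not propagated here), while Ella Anita stays a single entry.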
IMP-12 — Importer reads only the first sheet, no validation 🟡 MINOR
readXlsx does workbook.getSheetAt(0). For the new xlsx that's Familienarchiv (✅), but
the file also contains Inhaltsverzeichnis grob, Inhaltsverzeichnis WdG, Tabelle4.
There is no header validation: if the wrong file/sheet is dropped in /import, the importer
will happily map columns positionally and import nonsense. Also findSpreadsheetFile() picks
the first spreadsheet found in /import — with three spreadsheets present there today,
which one wins is filesystem-order-dependent.
Proposed approach: (a) validate the header row against expected names before importing; (b) make the target sheet/file explicit (config or header match) rather than "first found". Ties into the header-driven mapping in IMP-01(b).
Summary of recommended sequencing
- Decide the importer mapping strategy (IMP-01): positional re-config vs header-driven. Header-driven is the durable choice and unblocks IMP-03/12.
- Build the tolerant date parser (IMP-02) with original-string preservation + precision.
- Import the Person register first (IMP-04) and build the alias/marriage graph, which feeds person dedup (IMP-05).
- Then import documents, with reporting for blank-index (IMP-06), duplicates (IMP-07), and section rows (IMP-08).
- Reconcile files when the ~7,000 PDFs arrive (IMP-09), and decide x-row semantics (IMP-10).
Code-change items (→ Gitea issues when we get there): IMP-01(b), IMP-02, IMP-04, IMP-05 (consume side), IMP-06 reporting, IMP-12. Pure-data items stay in this folder.