# Spreadsheet Analysis — Findings (2026-05-25)

Analysis of the **real raw archive** spreadsheets against the current `MassImportService`
(`backend/.../importing/MassImportService.java`). Goal: import ~7,600 letter rows + a
163-person register, with PDFs to follow.

Every issue has an ID (`IMP-NN`), severity, evidence, and a proposed approach.

---

## 0. Context: how the importer reads a row today

`MassImportService` reads **sheet index 0** and maps columns by configurable indices
(`app.import.col.*`, defaults in the source):

| Property | Default col | Meaning |
| --- | --- | --- |
| `colIndex` | 0 | Index (→ filename `<index>.pdf`) |
| `colBox` | 1 | Box |
| `colFolder` | 2 | Mappe |
| `colSender` | 3 | Sender (raw) |
| `colReceivers` | 5 | Receivers (raw) |
| `colDate` | 7 | Date |
| `colLocation` | 9 | Location |
| `colTags` | 10 | Tag (single) |
| `colSummary` | 11 | Summary |
| `colTranscription` | 13 | Transcription |

These defaults match the **ODS** file exactly (`Index, Box, Mappe, Von, BriefeschreiberIn,
An, EmpfängerIn, Datum, Datum Originalformat, Ort, Schlagwort, Inhalt, Zeitlicher Kontext,
Transkript` = 14 cols). The ODS was the development target. The new xlsx uses a different
layout (see IMP-01).

Per-row pipeline: skip if Index blank → derive filename from Index → validate filename →
look for file on disk (recursive; metadata-only if absent) → check PDF magic bytes →
`importSingleDocument` (upsert by `originalFilename`, dedupe non-placeholders as
`ALREADY_EXISTS`). Date parsing is **ISO-only** (`LocalDate.parse`).

---

## IMP-01 — New xlsx column layout ≠ importer defaults 🔴 BLOCKER

The new `…aktuell…xlsx` (sheet `Familienarchiv`, 7,943 rows × 12 cols) has a **denser,
different** layout. There is an extra `Datei` column at index 1, and the normalized
`Von`/`An`/ISO-`Datum` columns from the ODS **do not exist**.

| col | New xlsx header | Importer default expects | Result with defaults |
| --- | --- | --- | --- |
| 0 | Index | Index | ✅ ok |
| 1 | **Datei** (path) | Box | ❌ Box ← `..\__scan\W-0001.pdf` |
| 2 | Box | Mappe | ❌ Mappe ← `V` |
| 3 | Mappe | Sender | ❌ Sender ← `1` |
| 4 | BriefeschreiberIn (sender) | (unused) | ❌ sender ignored |
| 5 | EmpfängerIn (receiver) | Receivers | ✅ coincidentally ok |
| 6 | Datum des Briefes | (unused) | ❌ date ignored |
| 7 | Ort (location) | Date | ❌ Date ← `Rotterdam` → null |
| 8 | Schlagwort (tag) | (unused) | ❌ tag ignored |
| 9 | Inhalt (summary) | Location | ❌ Location ← summary text |
| 10 | (none) | Tag | ❌ empty |
| 11 | (none) | Summary | ❌ empty |
| 13 | (none) | Transcription | ❌ column doesn't exist |

**Impact:** importing as-is produces almost entirely garbage metadata.

**Proposed approach (decide with Marcel):**
- (a) Re-map via the existing `app.import.col.*` properties: fast, no code. New mapping:
  `index=0, box=2, folder=3, sender=4, receivers=5, date=6, location=7, tags=8, summary=9`,
  and there is **no** transcription column (point it past the end or add a "missing column"
  convention). Caveat: tags land in `colTags` but the real per-letter keywords are in
  `Inhalt` (col 9); see the IMP-08 note on tags vs summary.
- (b) Make the importer **header-driven** (map by header name, not index) so it survives
  layout drift across files. More robust, needs a code change (→ Gitea issue).

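Option (b) can be sketched without touching POI: resolve column positions from the header row's text. This is a minimal illustration, not the importer's code. The class `HeaderColumnMap`, the method `resolve`, and the field names (`sender`, `receivers`, ...) are hypothetical; the header strings are the ones listed in the table above.

```java
import static java.util.Map.entry;

import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: resolves column positions from header text instead of fixed indices. */
final class HeaderColumnMap {

    // Normalised header text -> importer field. Field names are illustrative only;
    // "empf\u00e4ngerin" is the escaped form of "empfängerin".
    private static final Map<String, String> HEADER_TO_FIELD = Map.ofEntries(
            entry("index", "index"), entry("box", "box"), entry("mappe", "folder"),
            entry("briefeschreiberin", "sender"), entry("empf\u00e4ngerin", "receivers"),
            entry("datum des briefes", "date"), entry("ort", "location"),
            entry("schlagwort", "tags"), entry("inhalt", "summary"),
            entry("transkript", "transcription"));

    /** Returns field -> zero-based column index; fields without a matching header are absent. */
    static Map<String, Integer> resolve(List<String> headerRow) {
        Map<String, Integer> columns = new TreeMap<>();
        for (int col = 0; col < headerRow.size(); col++) {
            String header = headerRow.get(col);
            if (header == null) {
                continue; // unheaded column
            }
            String field = HEADER_TO_FIELD.get(header.strip().toLowerCase(Locale.ROOT));
            if (field != null) {
                columns.putIfAbsent(field, col); // first matching column wins
            }
        }
        return columns;
    }
}
```

Fed the new xlsx header row, this resolves to the same positions as mapping (a) and simply leaves `transcription` absent, which the caller would have to treat as "no such column".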

Recommendation: (b) is the durable fix given we have ≥3 different layouts already.

---

## IMP-02 — 90% of dates are free-text the parser can't read 🔴 BLOCKER

The date cells reproduce the date **as written in the letter**. `parseDate()` only does
`LocalDate.parse()` (ISO `yyyy-MM-dd`), so anything non-ISO becomes `null`.

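To make the failure mode concrete, here is a minimal sketch of an ISO-only parse. The names `IsoOnlyDateDemo` and `parseIsoOrNull` are hypothetical, and the catch-and-return-`null` behaviour is assumed from the description above rather than copied from `MassImportService`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

/** Hypothetical stand-in for the importer's ISO-only date handling (assumed behaviour). */
final class IsoOnlyDateDemo {

    /** Returns the parsed date, or null when the text is not ISO yyyy-MM-dd. */
    static LocalDate parseIsoOrNull(String text) {
        if (text == null || text.isBlank()) {
            return null;
        }
        try {
            return LocalDate.parse(text); // uses DateTimeFormatter.ISO_LOCAL_DATE
        } catch (DateTimeParseException e) {
            return null; // free-text values such as "11.10.08" end up here
        }
    }
}
```

Under this assumption `1888-03-06` parses, while `11.10.08` and `22.III.18` both come back as `null` without any error being reported.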

Of **7,319** rows with a date value (col 6):

| kind | count | parses today? |
| --- | --- | --- |
| Real Excel date cells (→ ISO via POI) | 748 | ✅ |
| Free-text date strings | 6,571 | ❌ → null |

→ **90% of dated rows lose their date** (6,571 of 7,319 ≈ 89.8%). A further 623 rows have
no date value at all.

Observed free-text formats (counts approximate, from col 6):

| Format | Count | Examples |
| --- | --- | --- |
| `D.M.YY` | 1,338 | `11.10.08`, `13.5.09` |
| `D.RomanMonth.YY/YYYY` | ~1,527 | `22.III.18`, `19.XII.1954`, `1.III.27` |
| `D.Month YYYY` | 950 | `6.März 1888`, `9.März 1888` (note: **no space** after the dot) |
| `D.M.YYYY` | 358 | `15.2.1888`, `7.3.1888` |
| Approximate / unknown | 146 | `?`, `13.7.18?`, `17.Nov (?) 1887`, `13.Januar ? 1907` |
| `Month YYYY` / season / holiday | 41+27 | `Mai 1895`, `Herbst 1913`, `Pfingsten 1922`, `Ostern 1890` |
| `YYYY` only | 17 | `1905`, `1949` |
| `D.M.` no year | 10 | `8.9.`, `14.3.` |
| Ranges | 5+ | `8.1.1916 - 15.3.1916`, `1881/82`, `1945/46?` |
| Abbrev/English months, no space | many | `29.Sept.1891`, `10.Oct.95`, `9.December1889`, `18.Dez.1916` |
| Slash separator | ~315 | `2/2. 18`, `17/6. 1916`, `10/4. 1917` |
| English `Month D. YYYY` | several | `April 12. 1922`, `Oct.5. 1916`, `Mai 23. 1917` |
| Trailing notes | 5+ | `26.4.1888, 2. Brief`, `31.8.1888,2.Brief` |
| 3-digit or mistyped year (typo) | 107 | `30.1.889` (→ 1889), `4.3.1023` (in person file → 1923) |
| Day-range within month | several | `7./8. Sept.1923` |

**Proposed approach:** build a tolerant German/historical date parser (→ Gitea issue, it's
a code change). Requirements:
- Numeric `D.M.YY[YY]` and `D/M. YY[YY]` (slash = dot).
- Roman-numeral months (`I`–`XII`).
- German + English month names, full + abbreviated, with/without separating space
  (`März`, `Sept.`, `Dez`, `December`, `Oct.`).
- 2-digit and 3-digit year normalization (`08`→1908? needs a century rule; `889`→1889).
- Partial dates → store what's known. The schema only has a single `documentDate LocalDate`;
  **decide** whether to (i) store first-of-month/year, (ii) add a `datePrecision` enum +
  `dateOriginal` text column, or (iii) keep raw text in a new `documentDateRaw` field and
  leave `documentDate` null when imprecise. Recommendation: preserve the **original string**
  always (new column) + best-effort parsed date + precision flag, so nothing is lost and the
  UI can show "ca. 1916".
- Unparseable/approximate (`?`, `Herbst 1913`) → keep raw, leave parsed date null, **do
  not drop the row**.

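The first four requirements can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a proposed implementation: `TolerantDateParser` and its methods are hypothetical names, the `twoDigitPivot` century rule (yy ≤ pivot → 19yy, otherwise 18yy) is an assumption that still needs a decision, and only complete day-month-year strings are handled. Partial dates, ranges, trailing notes, the English `Month D. YYYY` order, and precision flags are deliberately left out; such inputs return `Optional.empty()` so the caller can keep the raw string.

```java
import static java.util.Map.entry;

import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a tolerant day/month/year parser; not the project's actual code. */
final class TolerantDateParser {

    // day, "." or "/", month token (digits, Roman numeral or month name), optional ".", year
    private static final Pattern DAY_MONTH_YEAR = Pattern.compile(
            "^(\\d{1,2})\\s*[./]\\s*(\\d{1,2}|[IVX]+|\\p{L}+)\\.?\\s*(\\d{2,4})$");

    private static final List<String> ROMAN = List.of(
            "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII");

    // First three letters of German and English month names ("m\u00e4r" is "mär").
    private static final Map<String, Integer> MONTH_PREFIX = Map.ofEntries(
            entry("jan", 1), entry("feb", 2), entry("m\u00e4r", 3), entry("mar", 3),
            entry("apr", 4), entry("mai", 5), entry("may", 5), entry("jun", 6),
            entry("jul", 7), entry("aug", 8), entry("sep", 9), entry("okt", 10),
            entry("oct", 10), entry("nov", 11), entry("dez", 12), entry("dec", 12));

    /** Empty result means "could not parse reliably"; the caller keeps the raw text. */
    static Optional<LocalDate> parse(String raw, int twoDigitPivot) {
        if (raw == null) {
            return Optional.empty();
        }
        Matcher m = DAY_MONTH_YEAR.matcher(raw.strip());
        if (!m.matches()) {
            return Optional.empty();
        }
        int month = month(m.group(2));
        if (month == 0) {
            return Optional.empty();
        }
        try {
            int day = Integer.parseInt(m.group(1));
            return Optional.of(LocalDate.of(year(m.group(3), twoDigitPivot), month, day));
        } catch (DateTimeException e) {
            return Optional.empty(); // impossible calendar date, e.g. day 31 in February
        }
    }

    /** Returns 1..12 for a recognised month token, 0 for an unknown name or numeral. */
    private static int month(String token) {
        if (Character.isDigit(token.charAt(0))) {
            return Integer.parseInt(token);
        }
        int roman = ROMAN.indexOf(token);
        if (roman >= 0) {
            return roman + 1;
        }
        String lower = token.toLowerCase(Locale.ROOT);
        return lower.length() < 3 ? 0 : MONTH_PREFIX.getOrDefault(lower.substring(0, 3), 0);
    }

    private static int year(String digits, int twoDigitPivot) {
        int value = Integer.parseInt(digits);
        if (digits.length() == 4) {
            return value;
        }
        if (digits.length() == 3) {
            return 1000 + value; // "889" -> 1889, assuming a dropped leading "1"
        }
        return value <= twoDigitPivot ? 1900 + value : 1800 + value; // ASSUMED century rule
    }
}
```

With a pivot of 60, `22.III.18` resolves to 1918-03-22 and `10.Oct.95` to 1895-10-10, while `Herbst 1913` and `13.7.18?` stay unparsed. A mistyped four-digit year such as `4.3.1023` would pass through unchanged, so a plausibility range check would still be needed on top.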

**Cross-check:** even after IMP-01 is fixed so the date column is read, IMP-02 still bites.
Both must be solved before a real import.

---

## IMP-03 — New xlsx has no normalized/ISO date or name columns 🔴 BLOCKER

The ODS had helper columns the importer relied on: `Von`/`An` (normalized names) and
`Datum` (ISO) alongside `Datum Originalformat`. The new xlsx has **only the raw**
`BriefeschreiberIn` / `EmpfängerIn` / `Datum des Briefes`. So:
- Names must be parsed from raw strings. `PersonNameParser` already handles receivers;
  **the sender is taken raw and never split**. That is fine as far as splitting goes,
  since senders are single, but it means no normalization.
- Dates must be parsed from raw (IMP-02).

This is the root reason IMP-01/02 exist: the new file is the *uncurated* source, not the
hand-normalized ODS. Tie any importer redesign to this reality: we will not get clean
helper columns in the 7k-row file.

---

## IMP-04 — Person register not imported at all 🟠 MAJOR

`Personendatei 2.xlsx` → sheet `Tabelle1`, **163 people**, columns:
`Generation, Familienname, Vorname, geb als (maiden), Geburtsdatum, Geburtsort,
Todesdatum, Sterbeort, verheiratet mit, Bemerkung`.

Today `MassImportService` has **no person-register import**. Persons are only
auto-created as bare aliases from the document sender/receiver strings
(`personService.findOrCreateByAlias`). All this rich genealogical data is unused:
- birth/death dates + places,
- maiden names (the key to dedup, see IMP-05),
- `verheiratet mit` (marriage links → `PersonRelationship` domain),
- `Bemerkung` relationship hints (`"Schwester v Marie Cram"`, `"Nichte von Herbert"`),
- `Generation` (G 1–G 4),
- nicknames in quotes (`"Tante Lolly"`).

This file has data-quality issues of its own: multi-value `Vorname`
(`Charlotte,Meta,Jacobi`); mixed Excel-date vs text dates; typos (`4.3.1023`);
missing-day dates (`.12.1955`); trailing spaces (`30.8.1862 `).

**Proposed approach:** a separate **Person import** (→ Gitea issue). Order matters: import
persons *first* so documents can link to real people instead of creating alias stubs.
Use `geb als` + `verheiratet mit` to pre-build the alias/relationship graph.

---

## IMP-05 — Name variations create duplicate Persons 🟠 MAJOR

The same person appears under several surface forms across the document sheet:
- `Eugenie Müller` (151) vs `Eugenie de Gruyter` (452): maiden vs married.
- `Clara Cram` (sender 1,284) vs `Clara de Gruyter` (455) vs `Clara de Gruyter sen.` (66).
- `Walter de Gruyter` (589) vs bare `Walter` (78).

`findOrCreateByAlias` keys on the raw string, so each variant becomes (or matches) a
distinct alias and likely a **distinct Person**. Result: fragmented person records,
broken Briefwechsel pairing, wrong stats.

**Proposed approach:** drive dedup from the register's `geb als` column (IMP-04):
`Eugenie de Gruyter geb Müller` tells us the two strings are one person. Build an alias
map (married ↔ maiden ↔ nickname) before/while importing documents. This is partly data
(an alias mapping table/sheet) and partly code (consume it). Likely a Gitea issue once the
mapping format is decided.

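As a rough illustration of the married ↔ maiden half of that alias map, the sketch below derives alias entries from register rows. Everything here is hypothetical: the class `RegisterAliasMap`, the three-element row shape `{Vorname, Familienname, geb als}`, and the choice of the married name as the canonical form are assumptions, and nicknames or multi-value `Vorname` cells are not handled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: builds alias -> canonical name from person-register rows. */
final class RegisterAliasMap {

    /**
     * Each row is {Vorname, Familienname, geb als}; "geb als" may be null or empty.
     * ASSUMPTION: "Vorname Familienname" is used as the canonical form.
     */
    static Map<String, String> build(List<String[]> registerRows) {
        Map<String, String> aliasToCanonical = new LinkedHashMap<>();
        for (String[] row : registerRows) {
            String firstName = row[0].strip();
            String canonical = firstName + " " + row[1].strip();
            aliasToCanonical.putIfAbsent(canonical, canonical);

            String maiden = row[2] == null ? "" : row[2].strip();
            if (!maiden.isEmpty()) {
                // maiden-name variant points at the same person
                aliasToCanonical.putIfAbsent(firstName + " " + maiden, canonical);
            }
        }
        return aliasToCanonical;
    }
}
```

For a register row `{"Eugenie", "de Gruyter", "Müller"}` both `Eugenie Müller` and `Eugenie de Gruyter` resolve to the same canonical entry, so a lookup before `findOrCreateByAlias` could collapse the two surface forms.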

There are 945 distinct sender strings and 274 distinct receiver strings, so expect a long
tail of variants to reconcile. Don't try to be perfect on the first pass; get the
high-frequency names right.

---

## IMP-06 — 93 data rows with blank Index are silently dropped 🟠 MAJOR

`processRows` does `if (index.isBlank()) continue;`. **93 rows** have a blank Index but
carry other data (sender/receiver/date/etc.). These are silently skipped: they don't even
appear in the `skippedFiles` report (that list only covers rows that *had* an index but
failed file checks).

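One way to make these rows visible is to collect them instead of skipping them silently. The sketch below is hypothetical (`BlankIndexReport` is not an existing class) and works on plain string rows rather than POI cells; it assumes the header is sheet row 1, so the first data row is reported as row 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: lists rows that have data but no Index value. */
final class BlankIndexReport {

    /** Returns 1-based sheet row numbers, assuming the header occupies sheet row 1. */
    static List<Integer> find(List<List<String>> dataRows, int indexCol) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < dataRows.size(); i++) {
            List<String> row = dataRows.get(i);
            String index = indexCol < row.size() ? row.get(indexCol) : "";
            boolean hasOtherData = false;
            for (int col = 0; col < row.size(); col++) {
                if (col != indexCol && !isBlank(row.get(col))) {
                    hasOtherData = true;
                }
            }
            if (isBlank(index) && hasOtherData) {
                hits.add(i + 2); // +1 for 1-based numbering, +1 for the header row
            }
        }
        return hits;
    }

    private static boolean isBlank(String cell) {
        return cell == null || cell.isBlank();
    }
}
```

Fully empty rows are not reported, only rows where something other than the Index is filled in.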

**Proposed approach:** before import, triage these 93 rows: are they continuation rows,
section markers, or genuine letters missing an ID? At minimum, surface a count/warning so
nothing vanishes unnoticed. Possibly a small importer change to report blank-index skips.

---

## IMP-07 — 43 duplicate Index values 🟡 MINOR

43 Index values repeat (e.g. `W-0388`, `Eu-0332`, `C-0234`, `C-0235`, `C-0236`, `J-0175`).
Since the filename is derived from Index, the importer's upsert keys both rows on the same
`originalFilename`: the second occurrence is treated as `ALREADY_EXISTS` (if the first
isn't a placeholder) and **its metadata is lost**, or it overwrites a placeholder.

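Listing the repeats is a small grouping exercise. The following sketch is hypothetical (`DuplicateIndexReport` is an invented name) and takes the Index column as a plain list of strings; it returns each repeated value with the 1-based positions where it occurs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: finds Index values that occur more than once. */
final class DuplicateIndexReport {

    /** Index value -> 1-based positions in the column, only for repeated values. */
    static Map<String, List<Integer>> find(List<String> indexColumn) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < indexColumn.size(); i++) {
            String index = indexColumn.get(i) == null ? "" : indexColumn.get(i).strip();
            if (index.isEmpty()) {
                continue; // blank Index is a separate issue (IMP-06)
            }
            positions.computeIfAbsent(index, key -> new ArrayList<>()).add(i + 1);
        }
        positions.values().removeIf(rows -> rows.size() < 2); // keep only repeats
        return positions;
    }
}
```

The resulting map is the worklist for the manual check: for each entry, someone has to decide whether the rows describe the same letter or two letters sharing an ID.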

**Proposed approach:** list the 43 duplicates, check whether they're true duplicates or
two distinct letters that share an ID by mistake. Fix in the source data, or extend the ID
scheme. Data task first; software only if the ID scheme must change.

---

## IMP-08 — Section/title rows interleaved with data 🟡 MINOR

Row 2 of the sheet is a section header sitting only in the sender column
(`Brautbriefe von Walter der Gruyter an Eugenie Müller`) with a blank Index, so it is caught
by the blank-Index skip (overlaps IMP-06). There may be more such banners scattered through
7,943 rows.

Also relevant: the per-letter **keywords live in `Inhalt` (col 9)** as comma-joined
values (`Tilburg,Verwandschaft`, `poetisch,Reise nach Breda`), while `Schlagwort` (col 8)
holds a single broad tag (`Brautbriefe`). The importer only takes **one** tag column.
Decide which column feeds tags vs summary, and whether to split comma-lists into multiple
tags.

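If the decision goes towards splitting, the mechanics are simple. The sketch below is hypothetical (`KeywordSplitter` does not exist in the importer) and only shows the split itself; it assumes every comma in the cell separates keywords, which would be wrong for cells that contain free-text sentences.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits a comma-joined keyword cell into individual tags. */
final class KeywordSplitter {

    /** Trims each part, drops empty parts and repeated tags, keeps the original order. */
    static List<String> split(String cell) {
        List<String> tags = new ArrayList<>();
        if (cell == null) {
            return tags;
        }
        for (String part : cell.split(",")) {
            String tag = part.strip();
            if (!tag.isEmpty() && !tags.contains(tag)) {
                tags.add(tag);
            }
        }
        return tags;
    }
}
```

Applied to `poetisch,Reise nach Breda` this yields the two tags `poetisch` and `Reise nach Breda`; multi-word keywords stay intact because only commas split.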

**Proposed approach:** scan for rows where Index is blank but other cells are set (the 93
rows counted in IMP-06 are the candidate set). Confirm tag vs summary column choice with
Marcel.

---

## IMP-09 — Index ↔ Datei filename mismatches 🟡 MINOR

The `Datei` column (col 1) holds explicit relative paths (`..\__scan\W-0001.pdf`) but they
don't always agree with the Index. Example: row 20 has Index `W-0010x` but Datei
`..\__scan\W-0011x.pdf`. The importer derives the filename from **Index**, so it will look
for `W-0010x.pdf` and may miss the actual scan. (Note: the `Datei` paths themselves are
Windows-style with `\` and `..` and would be **rejected** by `isValidImportFilename` if anyone
tried to use that column directly: 7,623 rows use backslashes, 7,455 contain `..`.)

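The disagreement can be detected from the spreadsheet alone, before any PDF arrives, by comparing the Index with the file name embedded in `Datei`. The sketch is hypothetical (`DateiIndexCheck` is an invented name) and is pure string handling: it never touches the filesystem and does not use the `Datei` value as a path.

```java
import java.util.Locale;

/** Hypothetical sketch: compares the Index with the file name embedded in the Datei column. */
final class DateiIndexCheck {

    /** Last segment of a Windows- or Unix-style path, without a trailing ".pdf". */
    static String baseName(String dateiPath) {
        String path = dateiPath.strip();
        int cut = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        String file = path.substring(cut + 1);
        boolean isPdf = file.toLowerCase(Locale.ROOT).endsWith(".pdf");
        return isPdf ? file.substring(0, file.length() - 4) : file;
    }

    /** True when the Index equals the Datei base name. */
    static boolean matches(String index, String dateiPath) {
        return index.strip().equals(baseName(dateiPath));
    }
}
```

For the row 20 example, `matches("W-0010x", "..\\__scan\\W-0011x.pdf")` is `false`, which is exactly the kind of row a mismatch report should list.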

**Proposed approach:** when the PDFs arrive, reconcile Index-derived names against actual
filenames; produce a mismatch report. Keep deriving from Index (stable IDs) but flag
disagreements. Mostly a data/QA task.

---

## IMP-10 — `x`-suffix rows (letter backsides / enclosures) 🟡 MINOR

**42 rows** have an `x`-suffixed Index (`W-0001x`, `W-0002x`, …). They're sparse: typically
only Index + Datei + sender + receiver, with no box/folder/date. They appear to be the
reverse side or an enclosure of the preceding letter. The importer treats each as an
independent Document, and the `metadataComplete` heuristic flags them complete as soon as a
sender is present, even though date, box, and folder are all missing.

**Proposed approach:** decide whether `x` rows should be (a) separate documents, (b) extra
pages/files attached to their parent, or (c) skipped. Affects both the data model and the
`metadataComplete` heuristic. Discuss with Marcel.

---

## IMP-11 — Multi-receiver separators include bare `u` / `u.` 🟡 MINOR

`PersonNameParser.parseReceivers` already handles ` und `, ` u `, `//`, `geb.`,
parenthesised shared surnames, and `Familie` filtering, which is good. The top-receivers
list shows how common the abbreviated forms are in the real data:
`Eugenie u Walter de Gruyter` (230), `Herbert u Clara` (94), `Juan u Marie Cram` (75),
plus space-joined pairs like `Ella Anita` (79) that may be two people.
Raw separator tally on receivers: ` und ` ×70, `,` ×11, `;` ×2, `/` ×1, plus the many ` u `
cases above. Senders are **not** parsed at all (taken raw), which is fine unless a sender
cell ever holds two names.

**Proposed approach:** add `MassImportServiceTest` cases for the real-world strings above;
extend the parser only where it actually fails. `Ella Anita`-style space-joined pairs are
ambiguous; likely leave as one person unless the register says otherwise (ties to IMP-05).

---

## IMP-12 — Importer reads only the first sheet, no validation 🟡 MINOR

`readXlsx` does `workbook.getSheetAt(0)`. For the new xlsx that's `Familienarchiv` (✅), but
the file also contains `Inhaltsverzeichnis grob`, `Inhaltsverzeichnis WdG`, `Tabelle4`.
There is no header validation: if the wrong file/sheet is dropped in `/import`, the importer
will happily map columns positionally and import nonsense. Also `findSpreadsheetFile()` picks
the **first** spreadsheet found in `/import`; with three spreadsheets present there today,
which one wins is filesystem-order-dependent.

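A header check that fails fast would catch the wrong-sheet case before any row is mapped. The sketch below is hypothetical (`HeaderValidator` is an invented name) and compares plain header strings; which headers count as required is an assumption left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: reports expected header names that are missing from a sheet. */
final class HeaderValidator {

    /** Returns the expected names not found in the actual header row (case-insensitive). */
    static List<String> missing(List<String> actualHeader, List<String> expected) {
        List<String> seen = new ArrayList<>();
        for (String header : actualHeader) {
            if (header != null) {
                seen.add(normalise(header));
            }
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!seen.contains(normalise(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    private static String normalise(String text) {
        return text.strip().toLowerCase(Locale.ROOT);
    }
}
```

An importer could refuse to run when the returned list is non-empty and name the missing headers in the error, instead of importing positionally mapped nonsense.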

**Proposed approach:** (a) validate the header row against expected names before importing;
(b) make the target sheet/file explicit (config or header match) rather than "first found".
Ties into the header-driven mapping in IMP-01(b).

---

## Summary of recommended sequencing

1. **Decide the importer mapping strategy** (IMP-01): positional re-config vs header-driven.
   Header-driven is the durable choice and unblocks IMP-03/12.
2. **Build the tolerant date parser** (IMP-02) with original-string preservation + precision.
3. **Import the Person register first** (IMP-04) and build the alias/marriage graph,
   which feeds person dedup (IMP-05).
4. **Then import documents**, with reporting for blank-index (IMP-06), duplicates (IMP-07),
   and section rows (IMP-08).
5. **Reconcile files** when the ~7,000 PDFs arrive (IMP-09), and decide `x`-row semantics
   (IMP-10).

Code-change items (→ Gitea issues when we get there): IMP-01(b), IMP-02, IMP-04, IMP-05
(consume side), IMP-06 reporting, IMP-12. Pure-data items stay in this folder.