Import Pipeline: ODS Alignment Plan
Context
The real data source is an ODS spreadsheet (zzfamilienarchiv Walter und Eugenie 2025-04-10.ods) with 1,508 rows and 14 columns, living alongside PDF files (W-0001.pdf, C-0451.pdf, etc.) in familienarchiv_raw/. The existing import pipeline was built speculatively without seeing the actual data. It has several structural mismatches that need to be resolved before any real import can run.
ExcelService (the web-upload import path) will be deleted entirely. The only import path is MassImportService, which reads an ODS file from the /import directory on the filesystem. This simplifies the scope significantly.
What the ODS Actually Contains
| Col | Header | Example value | Action |
|---|---|---|---|
| 0 | Index | W-0001 | → `originalFilename` (+ `.pdf`) |
| 1 | Box | V | → `archiveBox` (new field) |
| 2 | Mappe | 1 | → `archiveFolder` (new field) |
| 3 | Von | Walter de Gruyter | → `sender` (Person) |
| 4 | BriefeschreiberIn | Walter de Gruyter | Ignored (redundant with col 3) |
| 5 | An | Eugenie de Gruyter geb. Müller | → `receivers` (Person, parse multi) |
| 6 | EmpfängerIn | Eugenie Müller | Ignored (redundant with col 5) |
| 7 | Datum | 1888-02-15 (ISO date string) | → `documentDate` |
| 8 | Datum Originalformat | 15.2.1888 | Ignored |
| 9 | Ort | Rotterdam | → `location` |
| 10 | Schlagwort | Brautbriefe | → `tags` |
| 11 | Inhalt | Geschäftsreise | → `summary` |
| 12 | Zeitlicher Kontext | Brautbriefe von Walter... | Skipped (no clear mapping) |
| 13 | Transkript | (mostly empty for now) | → `transcription` |
Changes
1. Delete ExcelService
ExcelService.java is deleted. All references to it (in AdminController or wherever it is injected) are removed. Going forward, MassImportService is the sole import mechanism. The web-upload flow that previously called ExcelService is removed from the controller.
Why: The user confirmed the ODS-from-filesystem path is the only import workflow. Keeping dead code would create maintenance confusion.
2. File Format: ODS support via WorkbookFactory
Current behaviour: MassImportService constructs new XSSFWorkbook(inputStream), which only handles .xlsx. The ODS file throws immediately.
Fix: Replace with `WorkbookFactory.create(fis)`. Apache POI's `WorkbookFactory` auto-detects the format and handles `.xlsx` and `.xls`. Caveat: plain POI does not read OpenDocument `.ods` files, so `WorkbookFactory` alone will still reject the ODS; either the spreadsheet is re-saved as `.xlsx` (one step in LibreOffice) or an ODS-capable reader is added. Also update `findExcelFile()`, which currently filters by `.endsWith(".xlsx")` — change the filter to accept `.ods`, `.xlsx`, and `.xls`.
Why not add odftoolkit? We already have `poi` and `poi-ooxml` at 5.5.0, and a second spreadsheet library is extra maintenance surface. If the archive file can be kept as `.xlsx` going forward, `WorkbookFactory` covers the case; an ODS library (or a conversion step) only becomes necessary if the file must stay `.ods`.
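The relaxed discovery filter is mechanical; a minimal sketch (the helper name `isSupportedSpreadsheet` is illustrative, the real check lives in `findExcelFile()`):

```java
import java.util.Locale;

// Illustrative helper showing the relaxed extension filter for findExcelFile().
// Lower-casing first keeps the check case-insensitive (e.g. "archiv.XLSX" passes).
final class SpreadsheetFilter {
    static boolean isSupportedSpreadsheet(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ods") || lower.endsWith(".xlsx") || lower.endsWith(".xls");
    }
}
```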
3. Column Index Defaults
Current defaults (wrong):
app.import.excel.col.filename=0 date=1 location=2 transcription=3
Correct indices:
filename=0 box=1 folder=2 sender=3 receivers=5 date=7 location=9 tags=10 summary=11 transcription=13
Fix: Update @Value defaults in MassImportService and set explicit values in application.properties. Remove the old defaults from ExcelService (which is deleted). Rename the property prefix from app.import.excel.col.* to app.import.col.* since the format is no longer Excel-specific.
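The resulting `application.properties` block would look like this (a sketch; the property names assume the renamed `app.import.col.*` prefix, with indices taken from the table above):

```properties
app.import.col.filename=0
app.import.col.box=1
app.import.col.folder=2
app.import.col.sender=3
app.import.col.receivers=5
app.import.col.date=7
app.import.col.location=9
app.import.col.tags=10
app.import.col.summary=11
app.import.col.transcription=13
```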
4. Filename Resolution: Index → PDF
Current behaviour: Cell value used directly as originalFilename.
Actual situation: Col 0 is the bare index (e.g., W-0001). PDF files are named W-0001.pdf. The import must append .pdf.
Fix: After reading col 0, append `.pdf` if the value contains no `.`:

```java
if (!filename.contains(".")) filename = filename + ".pdf";
```
5. Document Title: German Date Format
Current behaviour: Title is set to the raw filename, e.g. W-0001.pdf.
Fix: Build title from {Index} – {date in German format} – {location}. Use DateTimeFormatter with locale de:
W-0001 – 15. Februar 1888 – Rotterdam
If date is missing, omit date segment. If location is missing, omit location segment. The index alone is acceptable as a minimum title.
German month formatting: Use DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN).
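A sketch of the title assembly (`TitleBuilder` and `buildTitle` are illustrative names; the segment-joining rule is the one described above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.StringJoiner;

// Illustrative sketch: joins the non-missing segments with " – " (en dash).
final class TitleBuilder {
    private static final DateTimeFormatter GERMAN_DATE =
            DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN);

    static String buildTitle(String index, LocalDate date, String location) {
        StringJoiner title = new StringJoiner(" – ");
        title.add(index); // the index is always present and is the minimum title
        if (date != null) title.add(GERMAN_DATE.format(date));
        if (location != null && !location.isBlank()) title.add(location);
        return title.toString();
    }
}
```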
6. Date Parsing: Add String Fallback
Current behaviour: Only handles numeric date-formatted cells (DateUtil.isCellDateFormatted()).
Actual data: Col 7 contains ISO date strings (1888-02-15) stored as text in LibreOffice ODS. These have CellType.STRING, so the existing code silently produces null dates for every row.
Fix: Extract a helper method `parseDate(Cell)`:

```java
private LocalDate parseDate(Cell cell) {
    if (cell == null) return null;
    if (cell.getCellType() == CellType.NUMERIC && DateUtil.isCellDateFormatted(cell)) {
        return cell.getDateCellValue().toInstant().atZone(ZoneId.systemDefault()).toLocalDate();
    }
    if (cell.getCellType() == CellType.STRING) {
        try {
            return LocalDate.parse(cell.getStringCellValue().trim());
        } catch (DateTimeParseException e) {
            return null;
        }
    }
    return null;
}
```
7. Sender: Text → Person (lookup-or-create)
Current behaviour: Sender is never set.
Actual data: Col 3 (Von) is always a single name string, e.g. Walter de Gruyter, Eugenie de Gruyter geb. Müller.
Fix: Extract a `findOrCreatePerson(String rawName)` helper:
- Look up by `alias`, exact match (case-insensitive). Use a new repository method `findByAliasIgnoreCase(String)` on `PersonRepository`.
- If not found, create with:
  - `alias` = full raw string
  - `firstName` / `lastName` = best-effort split (see §9 below)
- Return the `Person` and set it on `document.setSender(...)`.
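The lookup-or-create shape can be sketched with a plain `Map` standing in for `PersonRepository` (illustrative only: the real code calls `findByAliasIgnoreCase` and `save`, and delegates name splitting to `PersonNameParser` from §9; `Person` here is a stripped-down stand-in for the entity):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for the JPA entity; the real Person has more fields.
record Person(String alias, String firstName, String lastName) {}

final class PersonLookup {
    // Keyed by lower-cased alias to emulate findByAliasIgnoreCase.
    private final Map<String, Person> byAlias = new HashMap<>();

    Person findOrCreatePerson(String rawName) {
        String key = rawName.toLowerCase(Locale.ROOT);
        return byAlias.computeIfAbsent(key, k -> {
            // Crude split for illustration; the real code uses PersonNameParser.split (§9).
            int lastSpace = rawName.lastIndexOf(' ');
            String first = lastSpace < 0 ? rawName : rawName.substring(0, lastSpace);
            String last = lastSpace < 0 ? "?" : rawName.substring(lastSpace + 1);
            return new Person(rawName, first, last);
        });
    }
}
```

The point of the sketch is the reuse semantics: a second row referencing the same alias (in any casing) gets the existing `Person` back instead of creating a duplicate.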
8. Receivers: Text → Person(s) with Normalization
Current behaviour: Receivers are never set.
Actual data (exhaustive set of multi-receiver patterns):
- `Clara Cram u Ellen B-M`
- `Clara u Familie`
- `Clara u Herbert Cram`
- `Ella u Walter Dieckmann`
- `Eugenie u Walter de Gruyter`
- `Hedi und Tutu (Gruber)`
- `Herbert und Clara Cram`
- `Walter und Eugenie`
- `Walter und Eugenie de Gruyter`
Parsing algorithm for col 5 (`An`):
- Strip `geb.` clauses — remove `geb. \w+` from the string (maiden name annotations are not useful for matching).
- Extract parenthesised last name — if the string ends with `(Something)`, capture `Something` as the shared last name and strip it.
- Split on separator — split on `und` or `u` (whole-word match with `\s+u\s+` or `\s+und\s+`).
- Filter — discard any segment that is exactly `Familie` (it's not a person).
- Distribute shared last name — find the last name in the rightmost segment. Known multi-word last-name particles: `de Gruyter`. Known single-word last names: `Cram`, `Dieckmann`, `Gruber`, `Müller`, `Wolff`. These are hardcoded as a lookup list. If the last segment ends with a known last name and an earlier segment has no last name (i.e., it is a single token), append that last name to the earlier segment.
- Handle no-last-name cases — if no last name can be determined at all (e.g., `Walter und Eugenie`), proceed with just the first name and set `lastName` to the placeholder `"?"` (the model declares the column `nullable = false`, so some value is required, and `"?"` reads more clearly than an empty string).
- `findOrCreatePerson` for each resulting name segment, then add all to `document.getReceivers()`.
Examples:
| Raw | Result |
|---|---|
| `Walter und Eugenie de Gruyter` | `[Walter de Gruyter, Eugenie de Gruyter]` |
| `Herbert und Clara Cram` | `[Herbert Cram, Clara Cram]` |
| `Hedi und Tutu (Gruber)` | `[Hedi Gruber, Tutu Gruber]` |
| `Clara Cram u Ellen B-M` | `[Clara Cram, Ellen B-M]` |
| `Clara u Familie` | `[Clara]` |
| `Walter und Eugenie` | `[Walter (?), Eugenie (?)]` |
| `Eugenie de Gruyter geb. Müller` | `[Eugenie de Gruyter]` |
Why normalise? Without normalisation, Herbert und Clara Cram would become one person with a nonsensical name and would never match separate Herbert Cram or Clara Cram entries from other rows. Normalisation means subsequent rows referencing the same individual will reuse the same Person record.
Why hardcode the last names? There are only 6 known family names in this archive. Adding a configurable list would be over-engineering for a one-family archive. If the archive expands, the list can be extended.
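The steps above can be sketched as a pure function (a sketch of the planned `parseReceivers`; class and method names are illustrative, and the hardcoded name list is the one from the distribute step):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the receiver normalisation in §8, assuming the hardcoded name lists.
final class ReceiverParser {
    private static final List<String> KNOWN_LAST_NAMES =
            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "Müller", "Wolff");
    private static final Pattern TRAILING_PARENS = Pattern.compile("\\(([^)]+)\\)\\s*$");

    static List<String> parseReceivers(String raw) {
        // 1. Strip "geb. <name>" maiden-name annotations.
        String s = raw.replaceAll("\\bgeb\\.\\s+\\S+", "").trim();

        // 2. A trailing "(Name)" is a last name shared by all segments.
        String sharedLastName = null;
        Matcher m = TRAILING_PARENS.matcher(s);
        if (m.find()) {
            sharedLastName = m.group(1);
            s = s.substring(0, m.start()).trim();
        }

        // 3. Split on whole-word "u" / "und"; 4. drop bare "Familie".
        List<String> segments = new ArrayList<>();
        for (String seg : s.split("\\s+(?:u|und)\\s+")) {
            if (!seg.isBlank() && !seg.strip().equals("Familie")) segments.add(seg.strip());
        }

        // 5. Distribute a known last name found in the rightmost segment
        //    onto earlier single-token segments.
        String lastName = sharedLastName;
        if (lastName == null && !segments.isEmpty()) {
            String tail = segments.get(segments.size() - 1);
            for (String known : KNOWN_LAST_NAMES) {
                if (tail.endsWith(" " + known)) { lastName = known; break; }
            }
        }
        List<String> result = new ArrayList<>();
        for (String seg : segments) {
            boolean firstNameOnly = !seg.contains(" ");
            result.add(firstNameOnly && lastName != null ? seg + " " + lastName : seg);
        }
        return result; // step 6 (the "?" placeholder) happens later, at Person creation
    }
}
```

Under these rules `parseReceivers("Hedi und Tutu (Gruber)")` yields `[Hedi Gruber, Tutu Gruber]`, matching the examples table.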
9. Name Splitting Helper (firstName / lastName)
Used when creating a new Person who cannot be found by alias.
Algorithm:
- Strip any `geb. \w+` suffix.
- Check if the string ends with a known last name (from the list in §8). If yes, everything before it is `firstName`, and that is `lastName`.
- If `de Gruyter` is detected as the last name, it is multi-word — `firstName` is everything before `de Gruyter`.
- Otherwise, split on the last space: `firstName` = everything before, `lastName` = last word.
- If only one token (no space), `firstName` = token, `lastName` = `"?"`.
This logic lives in a single static utility method PersonNameParser.split(String) returning a record SplitName(String firstName, String lastName). Keeping it static and pure makes it straightforward to unit-test without a Spring context.
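A sketch under these rules (the known-name list is the one from §8; running the multi-word `de Gruyter` through the same ends-with check keeps that special case implicit):

```java
import java.util.List;

// Sketch of PersonNameParser.split; SplitName mirrors the record named in the plan.
record SplitName(String firstName, String lastName) {}

final class PersonNameParser {
    private static final List<String> KNOWN_LAST_NAMES =
            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "Müller", "Wolff");

    static SplitName split(String rawName) {
        // Strip any "geb. <name>" suffix.
        String name = rawName.replaceAll("\\bgeb\\.\\s+\\S+", "").trim();
        // A known last name (including multi-word "de Gruyter") wins over a plain space split.
        for (String known : KNOWN_LAST_NAMES) {
            if (name.endsWith(" " + known)) {
                return new SplitName(name.substring(0, name.length() - known.length()).trim(), known);
            }
        }
        int lastSpace = name.lastIndexOf(' ');
        if (lastSpace < 0) return new SplitName(name, "?"); // single token
        return new SplitName(name.substring(0, lastSpace), name.substring(lastSpace + 1));
    }
}
```

Being static and pure, the method tests directly against the patterns listed in §8.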
10. Tags: Lookup-or-Create
Current behaviour: Tags are never imported.
Fix: Read col 10 (`Schlagwort`). If non-blank:

```java
Tag tag = tagRepository.findByNameIgnoreCase(value)
        .orElseGet(() -> tagRepository.save(Tag.builder().name(value).build()));
document.getTags().add(tag);
```
Tags are imported as-is. The TagRepository already has findByNameIgnoreCase, so deduplication is free.
11. Summary: Map "Inhalt" (Col 11)
Read col 11 (Inhalt) and set on document.setSummary(...). Short content keywords (Geschäftsreise, Reisepläne) are useful for full-text search even if they're terse.
Col 12 (Zeitlicher Kontext) is skipped — it is often a duplicate of context already encoded in sender/receiver/tags.
12. New Model Fields: archiveBox and archiveFolder
Cols 1 and 2 (Box, Mappe) identify the physical storage location of the original document. They have no counterpart in the model today.
Changes:
- Add to `Document.java`:

  ```java
  @Column(name = "archive_box")
  private String archiveBox;

  @Column(name = "archive_folder")
  private String archiveFolder;
  ```

- Flyway migration `V4__add_archive_fields_to_documents.sql`:

  ```sql
  ALTER TABLE documents ADD COLUMN archive_box VARCHAR(255);
  ALTER TABLE documents ADD COLUMN archive_folder VARCHAR(255);
  ```

- Import logic reads col 1 → `archiveBox`, col 2 → `archiveFolder`.
13. PersonRepository: Add findByAliasIgnoreCase
Add one method to PersonRepository:
```java
Optional<Person> findByAliasIgnoreCase(String alias);
```
Spring Data generates the query automatically. No other repository changes are needed.
Overwrite Behaviour (No Change)
The existing skip logic stays: if a document already exists in the DB and its status is not PLACEHOLDER, it is skipped. This prevents accidental data loss on re-runs. The assumption is that if someone has manually enriched a document beyond placeholder stage, that work should not be overwritten by a re-import.
Summary of All File Changes
| File | Change |
|---|---|
| `ExcelService.java` | Deleted |
| `AdminController.java` (or wherever `ExcelService` is injected) | Remove `ExcelService` injection and its endpoint |
| `MassImportService.java` | `WorkbookFactory`, new column indices, `.ods` discovery, filename fix, title, date parsing, sender, receivers, tags, summary, `archiveBox`/`archiveFolder` |
| `PersonNameParser.java` (new) | Static utility: `split(String)` → `SplitName`, `parseReceivers(String)` → `List<String>` |
| `PersonRepository.java` | Add `findByAliasIgnoreCase(String)` |
| `Document.java` | Add `archiveBox`, `archiveFolder` fields |
| `V4__add_archive_fields_to_documents.sql` (new) | `ALTER TABLE` for both new columns |
| `application.properties` | Update/add `app.import.col.*` properties |
What We Are Not Changing
- Col 4 (`BriefeschreiberIn`) — redundant with col 3.
- Col 6 (`EmpfängerIn`) — redundant with col 5.
- Col 8 (`Datum Originalformat`) — ISO date in col 7 is strictly better.
- Col 12 (`Zeitlicher Kontext`) — no clear mapping, often duplicates other fields.
- `persons` table schema — `alias` serves as the full-name store without a schema change.
- `TagRepository` — existing `findByNameIgnoreCase` is sufficient.