# Import Pipeline: ODS Alignment Plan

## Context

The real data source is an ODS spreadsheet (`zzfamilienarchiv Walter und Eugenie 2025-04-10.ods`) with 1,508 rows and 14 columns, living alongside PDF files (`W-0001.pdf`, `C-0451.pdf`, etc.) in `familienarchiv_raw/`. The existing import pipeline was built speculatively without seeing the actual data. It has several structural mismatches that need to be resolved before any real import can run.

`ExcelService` (the web-upload import path) will be deleted entirely. The only import path is `MassImportService`, which reads an ODS file from the `/import` directory on the filesystem. This simplifies the scope significantly.


## What the ODS Actually Contains

| Col | Header | Example value | Action |
|-----|--------|---------------|--------|
| 0 | Index | W-0001 | `originalFilename` (+ `.pdf`) |
| 1 | Box | V | `archiveBox` (new field) |
| 2 | Mappe | 1 | `archiveFolder` (new field) |
| 3 | Von | Walter de Gruyter | `sender` (Person) |
| 4 | BriefeschreiberIn | Walter de Gruyter | Ignored (redundant with col 3) |
| 5 | An | Eugenie de Gruyter geb. Müller | `receivers` (Person, parse multi) |
| 6 | EmpfängerIn | Eugenie Müller | Ignored (redundant with col 5) |
| 7 | Datum | 1888-02-15 (ISO date string) | `documentDate` |
| 8 | Datum Originalformat | 15.2.1888 | Ignored |
| 9 | Ort | Rotterdam | `location` |
| 10 | Schlagwort | Brautbriefe | `tags` |
| 11 | Inhalt | Geschäftsreise | `summary` |
| 12 | Zeitlicher Kontext | Brautbriefe von Walter... | Skipped (no clear mapping) |
| 13 | Transkript | (mostly empty for now) | `transcription` |

## Changes

### 1. Delete ExcelService

`ExcelService.java` is deleted. All references to it (in `AdminController` or wherever it is injected) are removed. Going forward, `MassImportService` is the sole import mechanism. The web-upload flow that previously called `ExcelService` is removed from the controller.

Why: The user confirmed the ODS-from-filesystem path is the only import workflow. Keeping dead code would create maintenance confusion.


### 2. File Format: ODS Support

Current behaviour: `MassImportService` constructs `new XSSFWorkbook(inputStream)`, which only handles `.xlsx`. The ODS file throws immediately.

Fix: Replace the constructor call with `WorkbookFactory.create(fis)`, which auto-detects `.xlsx` vs `.xls`. One caveat: Apache POI has no OpenDocument support, so `WorkbookFactory` will still reject a raw `.ods` file. The ODS therefore needs one extra step: either convert it once to `.xlsx` before import (e.g. `soffice --headless --convert-to xlsx`) or read it with an ODS-capable library such as Apache ODF Toolkit. Also update `findExcelFile()`, which currently filters by `.endsWith(".xlsx")`; the filter should accept `.ods`, `.xlsx`, and `.xls`.

Why not add odftoolkit? We already have `poi` and `poi-ooxml` at 5.5.0, which cover `.xlsx` and `.xls`. Since POI cannot parse ODS itself, a one-time conversion of the spreadsheet to `.xlsx` keeps the pipeline on a single library; a second spreadsheet dependency would only be justified if the ODS has to be read in place on every run.
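
A minimal sketch of the relaxed file discovery, assuming the method keeps its current shape (the method and constant names are illustrative, not the existing code):

```java
import java.io.File;
import java.util.List;
import java.util.Optional;

// Accept .ods alongside .xlsx/.xls when scanning the /import directory.
private static final List<String> SUPPORTED_EXTENSIONS = List.of(".ods", ".xlsx", ".xls");

private Optional<File> findImportFile(File importDir) {
    File[] matches = importDir.listFiles((dir, name) ->
            SUPPORTED_EXTENSIONS.stream().anyMatch(name.toLowerCase()::endsWith));
    return (matches == null || matches.length == 0)
            ? Optional.empty()
            : Optional.of(matches[0]);
}
```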


### 3. Column Index Defaults

Current defaults (wrong):

```
app.import.excel.col.filename=0   date=1   location=2   transcription=3
```

Correct indices:

```
filename=0  box=1  folder=2  sender=3  receivers=5  date=7  location=9  tags=10  summary=11  transcription=13
```

Fix: Update the `@Value` defaults in `MassImportService` and set explicit values in `application.properties`. Remove the old defaults from `ExcelService` (which is deleted). Rename the property prefix from `app.import.excel.col.*` to `app.import.col.*` since the format is no longer Excel-specific.
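
The resulting properties block could look like this; the key names for the new columns (box, folder, sender, receivers, tags, summary) are naming assumptions, not existing keys:

```properties
app.import.col.filename=0
app.import.col.box=1
app.import.col.folder=2
app.import.col.sender=3
app.import.col.receivers=5
app.import.col.date=7
app.import.col.location=9
app.import.col.tags=10
app.import.col.summary=11
app.import.col.transcription=13
```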


### 4. Filename Resolution: Index → PDF

Current behaviour: The cell value is used directly as `originalFilename`.

Actual situation: Col 0 is the bare index (e.g. `W-0001`). PDF files are named `W-0001.pdf`. The import must append `.pdf`.

Fix: After reading col 0, append `.pdf` if the value contains no `.`:

```java
if (!filename.contains(".")) filename = filename + ".pdf";
```

### 5. Document Title: German Date Format

Current behaviour: The title is set to the raw filename, e.g. `W-0001.pdf`.

Fix: Build the title from `{Index} {date in German format} {location}`, using a `DateTimeFormatter` with the German locale:

```
W-0001  15. Februar 1888  Rotterdam
```

If the date is missing, omit the date segment. If the location is missing, omit the location segment. The index alone is acceptable as a minimum title.

German month formatting: `DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN)`.
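
A sketch of the title assembly as a helper on `MassImportService` (the method name is illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

private static final DateTimeFormatter GERMAN_DATE =
        DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN);

// "W-0001" + "15. Februar 1888" + "Rotterdam"; missing parts are omitted.
private String buildTitle(String index, LocalDate date, String location) {
    StringBuilder title = new StringBuilder(index);
    if (date != null) title.append(' ').append(GERMAN_DATE.format(date));
    if (location != null && !location.isBlank()) title.append(' ').append(location);
    return title.toString();
}
```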


### 6. Date Parsing: Add String Fallback

Current behaviour: Only numeric date-formatted cells are handled (`DateUtil.isCellDateFormatted()`).

Actual data: Col 7 contains ISO date strings (`1888-02-15`) stored as text in the LibreOffice ODS. These cells have `CellType.STRING`, so the existing code silently produces null dates for every row.

Fix: Extract a helper method `parseDate(Cell)`:

```java
private LocalDate parseDate(Cell cell) {
    if (cell == null) return null;
    // Numeric cells carrying a date format: the only case the old code handled
    if (cell.getCellType() == CellType.NUMERIC && DateUtil.isCellDateFormatted(cell)) {
        return cell.getDateCellValue().toInstant().atZone(ZoneId.systemDefault()).toLocalDate();
    }
    // Text cells holding ISO dates such as 1888-02-15: the actual ODS data
    if (cell.getCellType() == CellType.STRING) {
        try {
            return LocalDate.parse(cell.getStringCellValue().trim());
        } catch (DateTimeParseException e) {
            return null; // unparseable text is treated as "no date"
        }
    }
    return null;
}
```

(Requires `java.time.LocalDate`, `java.time.ZoneId`, `java.time.format.DateTimeParseException`, and the POI `Cell`/`CellType`/`DateUtil` imports.)

### 7. Sender: Text → Person (lookup-or-create)

Current behaviour: The sender is never set.

Actual data: Col 3 (Von) is always a single name string, e.g. `Walter de Gruyter`, `Eugenie de Gruyter geb. Müller`.

Fix: Extract a `findOrCreatePerson(String rawName)` helper:

1. Look up by alias, exact match, case-insensitive, via a new repository method `findByAliasIgnoreCase(String)` on `PersonRepository`.
2. If not found, create with:
   - `alias` = the full raw string
   - `firstName` / `lastName` = best-effort split (see §9 below)
3. Return the `Person` and set it via `document.setSender(...)`.
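
A minimal sketch of this helper, assuming `Person` exposes plain setters (the real model may use a builder instead):

```java
// Sketch only: the alias lookup is the repository method from §13 and the
// name splitting is the PersonNameParser helper from §9.
private Person findOrCreatePerson(String rawName) {
    String alias = rawName.trim();
    return personRepository.findByAliasIgnoreCase(alias).orElseGet(() -> {
        PersonNameParser.SplitName parts = PersonNameParser.split(alias);
        Person person = new Person();          // assumption: no-args constructor + setters
        person.setAlias(alias);
        person.setFirstName(parts.firstName());
        person.setLastName(parts.lastName());
        return personRepository.save(person);
    });
}
```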

### 8. Receivers: Text → Person(s) with Normalization

Current behaviour: Receivers are never set.

Actual data (the exhaustive set of multi-receiver patterns):

- `Clara Cram u Ellen B-M`
- `Clara u Familie`
- `Clara u Herbert Cram`
- `Ella u Walter Dieckmann`
- `Eugenie u Walter de Gruyter`
- `Hedi und Tutu (Gruber)`
- `Herbert und Clara Cram`
- `Walter und Eugenie`
- `Walter und Eugenie de Gruyter`

Parsing algorithm for col 5 (An), sketched in code at the end of this section:

1. **Strip `geb.` clauses**: remove `geb.` plus the following word (maiden-name annotations are not useful for matching). Note that Java's `\w` is ASCII-only by default, so use `\S+` or `Pattern.UNICODE_CHARACTER_CLASS` to catch names like Müller.
2. **Extract a parenthesised last name**: if the string ends with `(Something)`, capture `Something` as the shared last name and strip it.
3. **Split on the separator**: split on whole-word `und` or `u`, i.e. `\s+(?:und|u)\s+`.
4. **Filter**: discard any segment that is exactly `Familie` (it is not a person).
5. **Distribute the shared last name**: look for a last name in the rightmost segment. The known multi-word last name is `de Gruyter`; the known single-word last names are `Cram`, `Dieckmann`, `Gruber`, `Müller`, and `Wolff`. These are hardcoded as a lookup list. If the last segment ends with a known last name and an earlier segment has no last name (i.e. it is a single token), append that last name to the earlier segment.
6. **Handle no-last-name cases**: if no last name can be determined at all (e.g. `Walter und Eugenie`), proceed with just the first name and set `lastName` to the placeholder `"?"` (the column is `nullable = false`, so some value is required, and `"?"` is clearer than an empty string).
7. **`findOrCreatePerson`** for each resulting name segment, then add all of them to `document.getReceivers()`.

Examples:

| Raw | Result |
|-----|--------|
| `Walter und Eugenie de Gruyter` | [Walter de Gruyter, Eugenie de Gruyter] |
| `Herbert und Clara Cram` | [Herbert Cram, Clara Cram] |
| `Hedi und Tutu (Gruber)` | [Hedi Gruber, Tutu Gruber] |
| `Clara Cram u Ellen B-M` | [Clara Cram, Ellen B-M] |
| `Clara u Familie` | [Clara] |
| `Walter und Eugenie` | [Walter (?), Eugenie (?)] |
| `Eugenie de Gruyter geb. Müller` | [Eugenie de Gruyter] |

Why normalise? Without normalisation, Herbert und Clara Cram would become one person with a nonsensical name and would never match separate Herbert Cram or Clara Cram entries from other rows. Normalisation means subsequent rows referencing the same individual will reuse the same Person record.

Why hardcode the last names? There are only 6 known family names in this archive. Adding a configurable list would be over-engineering for a one-family archive. If the archive expands, the list can be extended.
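
A sketch of the whole parser under the rules above. The regexes and the hardcoded name list follow steps 1 through 5; the `"?"` placeholder of step 6 is applied later by `split(...)`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PersonNameParser {

    // Step 5: the six known family names; "de Gruyter" is the multi-word case.
    static final List<String> KNOWN_LAST_NAMES =
            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "Müller", "Wolff");

    private static final Pattern GEB_CLAUSE = Pattern.compile("\\bgeb\\.\\s+\\S+");
    private static final Pattern TRAILING_PARENS = Pattern.compile("\\(([^)]+)\\)\\s*$");
    private static final Pattern SEPARATOR = Pattern.compile("\\s+(?:und|u)\\s+");

    public static List<String> parseReceivers(String raw) {
        String s = GEB_CLAUSE.matcher(raw).replaceAll("").trim();             // step 1
        String sharedLast = null;
        Matcher parens = TRAILING_PARENS.matcher(s);                          // step 2
        if (parens.find()) {
            sharedLast = parens.group(1);
            s = s.substring(0, parens.start()).trim();
        }
        List<String> segments = new ArrayList<>();
        for (String segment : SEPARATOR.split(s)) {                           // step 3
            segment = segment.trim();
            if (!segment.isEmpty() && !segment.equals("Familie")) {           // step 4
                segments.add(segment);
            }
        }
        if (sharedLast == null && !segments.isEmpty()) {                      // step 5
            String rightmost = segments.get(segments.size() - 1);
            sharedLast = KNOWN_LAST_NAMES.stream()
                    .filter(rightmost::endsWith).findFirst().orElse(null);
        }
        List<String> names = new ArrayList<>();
        for (String segment : segments) {
            // Single-token segments inherit the shared last name; segments that
            // still lack one are left for split(...) to flag with "?".
            boolean singleToken = !segment.contains(" ");
            names.add(singleToken && sharedLast != null ? segment + " " + sharedLast : segment);
        }
        return names;
    }
}
```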


### 9. Name Splitting Helper (firstName / lastName)

Used when creating a new `Person` who cannot be found by alias.

Algorithm:

1. Strip any `geb. \S+` suffix.
2. Check whether the string ends with a known last name (from the list in §8). If yes, that match is `lastName` and everything before it is `firstName`.
3. If `de Gruyter` is detected as the last name, it is multi-word; `firstName` is everything before `de Gruyter`.
4. Otherwise, split on the last space: `firstName` = everything before it, `lastName` = the last word.
5. If there is only one token (no space), `firstName` = the token and `lastName` = `"?"`.

This logic lives in a single static utility method `PersonNameParser.split(String)` returning a record `SplitName(String firstName, String lastName)`; a sketch follows below. Keeping it static and pure makes it straightforward to unit-test without a Spring context.
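
A matching sketch, in the same `PersonNameParser` class as the receiver parser above (`KNOWN_LAST_NAMES` is the list from §8; records require Java 16+):

```java
// Sketch of the splitting helper described above.
public record SplitName(String firstName, String lastName) {}

public static SplitName split(String rawName) {
    String s = rawName.replaceAll("\\bgeb\\.\\s+\\S+", "").trim();        // step 1
    for (String known : KNOWN_LAST_NAMES) {                               // steps 2-3
        if (s.endsWith(known) && s.length() > known.length()) {
            return new SplitName(s.substring(0, s.length() - known.length()).trim(), known);
        }
    }
    int lastSpace = s.lastIndexOf(' ');                                   // step 4
    if (lastSpace < 0) return new SplitName(s, "?");                      // step 5
    return new SplitName(s.substring(0, lastSpace), s.substring(lastSpace + 1));
}
```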


### 10. Tags: Lookup-or-Create

Current behaviour: Tags are never imported.

Fix: Read col 10 (Schlagwort). If non-blank:

```java
Tag tag = tagRepository.findByNameIgnoreCase(value)
        .orElseGet(() -> tagRepository.save(Tag.builder().name(value).build()));
document.getTags().add(tag);
```

Tags are imported as-is. The `TagRepository` already has `findByNameIgnoreCase`, so deduplication is free.


### 11. Summary: Map "Inhalt" (Col 11)

Read col 11 (Inhalt) and set it via `document.setSummary(...)`. Short content keywords (Geschäftsreise, Reisepläne) are useful for full-text search even if they are terse.

Col 12 (Zeitlicher Kontext) is skipped; it is often a duplicate of context already encoded in sender/receiver/tags.


### 12. New Model Fields: archiveBox and archiveFolder

Cols 1 and 2 (Box, Mappe) identify the physical storage location of the original document. They have no counterpart in the model today.

Changes:

1. Add to `Document.java`:

   ```java
   @Column(name = "archive_box")
   private String archiveBox;

   @Column(name = "archive_folder")
   private String archiveFolder;
   ```

2. Flyway migration `V4__add_archive_fields_to_documents.sql`:

   ```sql
   ALTER TABLE documents ADD COLUMN archive_box VARCHAR(255);
   ALTER TABLE documents ADD COLUMN archive_folder VARCHAR(255);
   ```

3. Import logic reads col 1 → `archiveBox` and col 2 → `archiveFolder`.

### 13. PersonRepository: Add findByAliasIgnoreCase

Add one method to `PersonRepository`:

```java
Optional<Person> findByAliasIgnoreCase(String alias);
```

Spring Data generates the query automatically. No other repository changes are needed.


## Overwrite Behaviour (No Change)

The existing skip logic stays: if a document already exists in the DB and its status is not `PLACEHOLDER`, it is skipped. This prevents accidental data loss on re-runs. The assumption is that if someone has manually enriched a document beyond placeholder stage, that work should not be overwritten by a re-import.
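
For reference, the guard reads roughly like this; the repository method, status enum, and logger names are assumptions about the existing code, not confirmed signatures:

```java
// Skip any row whose document was already enriched past PLACEHOLDER.
boolean alreadyEnriched = documentRepository.findByOriginalFilename(filename)
        .map(existing -> existing.getStatus() != DocumentStatus.PLACEHOLDER)
        .orElse(false);
if (alreadyEnriched) {
    log.info("Skipping {}: enriched beyond placeholder, not overwriting", filename);
    return;
}
```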


## Summary of All File Changes

| File | Change |
|------|--------|
| `ExcelService.java` | Deleted |
| `AdminController.java` (or wherever `ExcelService` is injected) | Remove the `ExcelService` injection and its endpoint |
| `MassImportService.java` | `WorkbookFactory`, new column indices, `.ods` discovery, filename fix, title, date parsing, sender, receivers, tags, summary, `archiveBox`/`archiveFolder` |
| `PersonNameParser.java` (new) | Static utility: `split(String)` → `SplitName`, `parseReceivers(String)` → `List<String>` |
| `PersonRepository.java` | Add `findByAliasIgnoreCase(String)` |
| `Document.java` | Add `archiveBox`, `archiveFolder` fields |
| `V4__add_archive_fields_to_documents.sql` (new) | `ALTER TABLE` for both new columns |
| `application.properties` | Update/add `app.import.col.*` properties |

## What We Are Not Changing

- Col 4 (BriefeschreiberIn): redundant with col 3.
- Col 6 (EmpfängerIn): redundant with col 5.
- Col 8 (Datum Originalformat): the ISO date in col 7 is strictly better.
- Col 12 (Zeitlicher Kontext): no clear mapping, often duplicates other fields.
- `persons` table schema: `alias` serves as the full-name store without a schema change.
- `TagRepository`: the existing `findByNameIgnoreCase` is sufficient.