familienarchiv/.agent/PLAN.md
2026-05-05 12:39:20 +02:00

# Import Pipeline: ODS Alignment Plan
## Context
The real data source is an ODS spreadsheet (`zzfamilienarchiv Walter und Eugenie 2025-04-10.ods`) with 1,508 rows and 14 columns, living alongside PDF files (`W-0001.pdf`, `C-0451.pdf`, etc.) in `familienarchiv_raw/`. The existing import pipeline was built speculatively without seeing the actual data. It has several structural mismatches that need to be resolved before any real import can run.
`ExcelService` (the web-upload import path) will be **deleted entirely**. The only import path is `MassImportService`, which reads an ODS file from the `/import` directory on the filesystem. This simplifies the scope significantly.
---
## What the ODS Actually Contains
| Col | Header | Example value | Action |
|-----|----------------------|------------------------------------------|-----------------|
| 0 | Index | `W-0001` | → `originalFilename` (+ `.pdf`) |
| 1 | Box | `V` | → `archiveBox` (new field) |
| 2 | Mappe | `1` | → `archiveFolder` (new field) |
| 3 | Von | `Walter de Gruyter` | → `sender` (Person) |
| 4 | BriefeschreiberIn | `Walter de Gruyter` | Ignored (redundant with col 3) |
| 5 | An | `Eugenie de Gruyter geb. Müller` | → `receivers` (Person, parse multi) |
| 6 | EmpfängerIn | `Eugenie Müller` | Ignored (redundant with col 5) |
| 7 | Datum | `1888-02-15` (ISO date string) | → `documentDate` |
| 8 | Datum Originalformat | `15.2.1888` | Ignored |
| 9 | Ort | `Rotterdam` | → `location` |
| 10 | Schlagwort | `Brautbriefe` | → `tags` |
| 11 | Inhalt | `Geschäftsreise` | → `summary` |
| 12 | Zeitlicher Kontext | `Brautbriefe von Walter...` | Skipped (no clear mapping) |
| 13 | Transkript | (mostly empty for now) | → `transcription` |
---
## Changes
### 1. Delete ExcelService
`ExcelService.java` is deleted. All references to it (in `AdminController` or wherever it is injected) are removed. Going forward, `MassImportService` is the sole import mechanism. The web-upload flow that previously called `ExcelService` is removed from the controller.
**Why:** The user confirmed the ODS-from-filesystem path is the only import workflow. Keeping dead code would create maintenance confusion.
---
### 2. File Format: ODS support via WorkbookFactory
**Current behaviour:** `MassImportService` constructs `new XSSFWorkbook(inputStream)`, which only handles `.xlsx`. The ODS file throws immediately.
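**Fix:** Replace with `WorkbookFactory.create(fis)`, which auto-detects the format for `.xlsx` and `.xls`. One caveat needs checking before this lands: Apache POI does not read OpenDocument spreadsheets, and its OOXML reader reports such input with `ODFNotOfficeXmlFileException`. If that holds for our file, the ODS has to be exported to `.xlsx` once (LibreOffice can do this) or read through an ODF-capable library. Also update `findExcelFile()`, which currently filters by `.endsWith(".xlsx")`, so that it accepts `.ods`, `.xlsx`, and `.xls`.
**Why not add `odftoolkit`?** We already have `poi` and `poi-ooxml` at 5.5.0, and a second spreadsheet library is redundant as long as the file can be opened through `WorkbookFactory`. It only becomes necessary if the ODS must be read natively rather than converted.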
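The widened file filter could look roughly like the sketch below. It is a minimal illustration only: `SpreadsheetFinder`, `isSpreadsheet` and `firstSpreadsheet` are hypothetical names, and the real `findExcelFile()` in `MassImportService` may be structured differently.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; the real findExcelFile() in MassImportService may look different.
final class SpreadsheetFinder {

    // Extensions the plan wants the import directory scan to accept.
    private static final List<String> EXTENSIONS = List.of(".ods", ".xlsx", ".xls");

    private SpreadsheetFinder() {}

    // True if the file name ends with one of the accepted extensions (case-insensitive).
    static boolean isSpreadsheet(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    // Returns the first spreadsheet among the given file names, if any.
    static Optional<String> firstSpreadsheet(List<String> fileNames) {
        return fileNames.stream().filter(SpreadsheetFinder::isSpreadsheet).findFirst();
    }
}
```

Matching on the lower-cased name means `DATA.XLSX` is accepted as well, while the PDFs that sit next to the spreadsheet are ignored.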
---
### 3. Column Index Defaults
**Current defaults (wrong):**
```
app.import.excel.col.filename=0 date=1 location=2 transcription=3
```
**Correct indices:**
```
filename=0 box=1 folder=2 sender=3 receivers=5 date=7 location=9 tags=10 summary=11 transcription=13
```
**Fix:** Update the `@Value` defaults in `MassImportService` and set explicit values in `application.properties`. The old defaults in `ExcelService` disappear together with the class. Rename the property prefix from `app.import.excel.col.*` to `app.import.col.*`, since the format is no longer Excel-specific.
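To make the new defaults concrete, here is a small sketch that resolves the ten indices from `app.import.col.*` keys and falls back to the ODS layout when a key is missing. The `ImportColumns` record is an assumption made for illustration; the plan itself keeps the indices as `@Value` fields.

```java
import java.util.Properties;

// Hypothetical holder for the column indices; the plan uses @Value fields in MassImportService instead.
record ImportColumns(int filename, int box, int folder, int sender, int receivers,
                     int date, int location, int tags, int summary, int transcription) {

    private static final String PREFIX = "app.import.col.";

    // Reads each index from the properties, falling back to the ODS layout listed above.
    static ImportColumns from(Properties props) {
        return new ImportColumns(
                index(props, "filename", 0),
                index(props, "box", 1),
                index(props, "folder", 2),
                index(props, "sender", 3),
                index(props, "receivers", 5),
                index(props, "date", 7),
                index(props, "location", 9),
                index(props, "tags", 10),
                index(props, "summary", 11),
                index(props, "transcription", 13));
    }

    // Looks up a single index; a missing key yields the default.
    private static int index(Properties props, String name, int defaultValue) {
        String raw = props.getProperty(PREFIX + name);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }
}
```

Setting `app.import.col.date=8` in the properties would override only the date column and leave the other nine defaults untouched.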
---
### 4. Filename Resolution: Index → PDF
**Current behaviour:** Cell value used directly as `originalFilename`.
**Actual situation:** Col 0 is the bare index (e.g., `W-0001`). PDF files are named `W-0001.pdf`. The import must append `.pdf`.
**Fix:** After reading col 0, append `.pdf` if the value contains no `.`:
```java
if (!filename.contains(".")) filename = filename + ".pdf";
```
---
### 5. Document Title: German Date Format
**Current behaviour:** Title is set to the raw filename, e.g. `W-0001.pdf`.
**Fix:** Build title from `{Index} {date in German format} {location}`. Use `DateTimeFormatter` with locale `de`:
```
W-0001 15. Februar 1888 Rotterdam
```
If date is missing, omit date segment. If location is missing, omit location segment. The index alone is acceptable as a minimum title.
**German month formatting:** Use `DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN)`.
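A minimal sketch of the title builder, assuming a hypothetical `DocumentTitles.build` helper that receives the bare index, the parsed date and the location (the latter two may be missing):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; in the plan this logic sits inside MassImportService.
final class DocumentTitles {

    private static final DateTimeFormatter GERMAN_DATE =
            DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN);

    private DocumentTitles() {}

    // Joins index, German-formatted date and location, omitting whichever of the last two is missing.
    static String build(String index, LocalDate date, String location) {
        List<String> parts = new ArrayList<>();
        parts.add(index);
        if (date != null) {
            parts.add(GERMAN_DATE.format(date));
        }
        if (location != null && !location.isBlank()) {
            parts.add(location.trim());
        }
        return String.join(" ", parts);
    }
}
```

`DocumentTitles.build("W-0001", LocalDate.of(1888, 2, 15), "Rotterdam")` yields `W-0001 15. Februar 1888 Rotterdam`; with both optional parts missing the result is just `W-0001`.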
---
### 6. Date Parsing: Add String Fallback
**Current behaviour:** Only handles numeric date-formatted cells (`DateUtil.isCellDateFormatted()`).
**Actual data:** Col 7 contains ISO date strings (`1888-02-15`) stored as text in LibreOffice ODS. These have `CellType.STRING`, so the existing code silently produces `null` dates for every row.
**Fix:** Extract a helper method `parseDate(Cell)`:
```java
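// Reads a date from either a date-formatted numeric cell or an ISO-8601 text cell.
private LocalDate parseDate(Cell cell) {
    if (cell == null) return null;
    // Real spreadsheet dates: numeric cells carrying a date format.
    if (cell.getCellType() == CellType.NUMERIC && DateUtil.isCellDateFormatted(cell)) {
        return cell.getDateCellValue().toInstant().atZone(ZoneId.systemDefault()).toLocalDate();
    }
    // Text cells such as "1888-02-15", which is what the real ODS contains.
    if (cell.getCellType() == CellType.STRING) {
        try {
            return LocalDate.parse(cell.getStringCellValue().trim());
        } catch (DateTimeParseException e) {
            return null; // unparseable text is treated as "no date"
        }
    }
    return null;
}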
```
---
### 7. Sender: Text → Person (lookup-or-create)
**Current behaviour:** Sender is never set.
**Actual data:** Col 3 (`Von`) is always a single name string, e.g. `Walter de Gruyter`, `Eugenie de Gruyter geb. Müller`.
**Fix:** Extract a `findOrCreatePerson(String rawName)` helper:
1. Look up by `alias` exact match (case-insensitive). Use a new repository method `findByAliasIgnoreCase(String)` on `PersonRepository`.
2. If not found, create with:
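- `alias` = full raw string, with any `geb.` clause stripped first so that sender and receiver lookups (§8) resolve to the same key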
- `firstName` / `lastName` = best-effort split (see §9 below)
3. Return the `Person` and set on `document.setSender(...)`.
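The lookup-or-create flow can be illustrated without a database. The sketch below replaces `PersonRepository` with an in-memory map and reduces `Person` to its alias; both simplifications, and the `PersonDirectory` name, are assumptions made only for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// In-memory stand-in for PersonRepository, used only to illustrate the lookup-or-create flow.
final class PersonDirectory {

    // Simplified stand-in for the Person entity (the real one also has firstName / lastName).
    record Person(String alias) {}

    // Keyed by lower-cased alias so that lookups are case-insensitive.
    private final Map<String, Person> byAlias = new HashMap<>();

    // Returns the existing person with a matching alias, or registers a new one.
    Person findOrCreate(String rawName) {
        String alias = rawName.trim();
        String key = alias.toLowerCase(Locale.ROOT);
        return byAlias.computeIfAbsent(key, k -> new Person(alias));
    }

    int size() {
        return byAlias.size();
    }
}
```

Against the real repository the same flow would follow the pattern already used for tags: `findByAliasIgnoreCase(alias)` followed by `orElseGet(...)` that saves a new `Person`.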
---
### 8. Receivers: Text → Person(s) with Normalization
**Current behaviour:** Receivers are never set.
**Actual data (exhaustive set of multi-receiver patterns):**
```
'Clara Cram u Ellen B-M'
'Clara u Familie'
'Clara u Herbert Cram'
'Ella u Walter Dieckmann'
'Eugenie u Walter de Gruyter'
'Hedi und Tutu (Gruber)'
'Herbert und Clara Cram'
'Walter und Eugenie'
'Walter und Eugenie de Gruyter'
```
**Parsing algorithm for col 5 (`An`):**
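1. **Strip `geb.` clauses**: remove `\s+geb\.\s+\S+` from the string (maiden name annotations are not useful for matching). Use `\S+` rather than `\w+`: Java's `\w` is ASCII-only by default and would stop before the `ü` in `Müller`.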
2. **Extract parenthesised last name** — if the string ends with `(Something)`, capture `Something` as the shared last name and strip it.
3. **Split on separator** — split on ` und ` or ` u ` (whole-word match with `\s+u\s+` or `\s+und\s+`).
4. **Filter** — discard any segment that is exactly `Familie` (it's not a person).
5. **Distribute shared last name** — find the last name in the rightmost segment. Known multi-word last name particles: `de Gruyter`. Known single-word last names: `Cram`, `Dieckmann`, `Gruber`, `Müller`, `Wolff`. These are hardcoded as a lookup list. If the last segment ends with a known last name and an earlier segment has no last name (i.e., it is a single token), append that last name to the earlier segment.
6. **Handle no-last-name cases**: if no last name can be determined at all (e.g., `Walter und Eugenie`), proceed with just the first name and set `lastName` to the placeholder `"?"`. The model declares `lastName` as `nullable = false`, so some value is required, and `"?"` is clearer than an empty string.
7. **findOrCreatePerson** for each resulting name segment, then add all to `document.getReceivers()`.
**Examples:**
| Raw | Result |
|-----|--------|
| `Walter und Eugenie de Gruyter` | [Walter de Gruyter, Eugenie de Gruyter] |
| `Herbert und Clara Cram` | [Herbert Cram, Clara Cram] |
| `Hedi und Tutu (Gruber)` | [Hedi Gruber, Tutu Gruber] |
| `Clara Cram u Ellen B-M` | [Clara Cram, Ellen B-M] |
| `Clara u Familie` | [Clara (?)] |
| `Walter und Eugenie` | [Walter (?), Eugenie (?)] |
| `Eugenie de Gruyter geb. Müller` | [Eugenie de Gruyter] |
**Why normalise?** Without normalisation, `Herbert und Clara Cram` would become one person with a nonsensical name and would never match separate `Herbert Cram` or `Clara Cram` entries from other rows. Normalisation means subsequent rows referencing the same individual will reuse the same `Person` record.
**Why hardcode the last names?** There are only 6 known family names in this archive. Adding a configurable list would be over-engineering for a one-family archive. If the archive expands, the list can be extended.
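The parsing steps for col 5 can be sketched as a pure function from the raw cell text to a list of name strings. This is an illustration under assumptions, not the final implementation: the class name `ReceiverParser` is hypothetical (the plan places `parseReceivers` in `PersonNameParser`), and the `"?"` placeholder is left to the name-splitting step that runs afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the receiver parsing; returns name strings, not Person entities.
final class ReceiverParser {

    // Hardcoded family names from the plan; "de Gruyter" is the only multi-word entry.
    private static final List<String> KNOWN_LAST_NAMES =
            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "Müller", "Wolff");

    // Matches a trailing "(Something)" and captures the text before it and inside it.
    private static final Pattern TRAILING_PARENS =
            Pattern.compile("^(.*?)\\s*\\(([^)]+)\\)\\s*$");

    private ReceiverParser() {}

    static List<String> parseReceivers(String raw) {
        // Step 1: strip "geb. X" clauses.
        String text = raw.replaceAll("\\s+geb\\.\\s+\\S+", "").trim();

        // Step 2: a trailing "(Name)" is a last name shared by all segments.
        String shared = null;
        Matcher m = TRAILING_PARENS.matcher(text);
        if (m.matches()) {
            text = m.group(1);
            shared = m.group(2).trim();
        }

        // Steps 3 and 4: split on " und " / " u " and drop the non-person "Familie".
        List<String> segments = new ArrayList<>();
        for (String part : text.split("\\s+(?:und|u)\\s+")) {
            String segment = part.trim();
            if (!segment.isEmpty() && !segment.equals("Familie")) {
                segments.add(segment);
            }
        }

        // Step 5: otherwise take the shared last name from the rightmost segment, if it is a known one.
        if (shared == null && !segments.isEmpty()) {
            String last = segments.get(segments.size() - 1);
            for (String known : KNOWN_LAST_NAMES) {
                if (last.endsWith(" " + known)) {
                    shared = known;
                    break;
                }
            }
        }

        // Single-token segments receive the shared last name; step 6 leaves them bare if there is none.
        List<String> result = new ArrayList<>();
        for (String segment : segments) {
            boolean singleToken = !segment.contains(" ");
            result.add(shared != null && singleToken ? segment + " " + shared : segment);
        }
        return result;
    }
}
```

Each returned string would then go through `findOrCreatePerson`, so `Herbert und Clara Cram` ends up as two lookups, one for `Herbert Cram` and one for `Clara Cram`.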
---
### 9. Name Splitting Helper (firstName / lastName)
Used when creating a new `Person` who cannot be found by alias.
**Algorithm:**
1. Strip any `geb.` suffix (pattern `\s+geb\.\s+\S+`).
2. Check if the string ends with a known last name (from the list in §8). If yes, the matched name is `lastName` and everything before it is `firstName`.
3. This check must run before the generic split because `de Gruyter` is multi-word: for `Eugenie de Gruyter`, `firstName` is `Eugenie` and `lastName` is `de Gruyter`.
4. Otherwise, split on the last space: `firstName` = everything before, `lastName` = last word.
5. If only one token (no space), `firstName` = token, `lastName` = `"?"`.
This logic lives in a single static utility method `PersonNameParser.split(String)` returning a record `SplitName(String firstName, String lastName)`. Keeping it static and pure makes it straightforward to unit-test without a Spring context.
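A minimal sketch of `PersonNameParser.split` along these lines is shown below. The method and record names come from the plan; the body is one possible reading of the algorithm and has not been checked against the existing code base.

```java
import java.util.List;

// Sketch of the static, Spring-free name splitter described above.
final class PersonNameParser {

    record SplitName(String firstName, String lastName) {}

    // Same hardcoded family names as in the receiver parsing.
    private static final List<String> KNOWN_LAST_NAMES =
            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "Müller", "Wolff");

    private PersonNameParser() {}

    static SplitName split(String raw) {
        // Step 1: drop a trailing "geb. X" clause.
        String name = raw.replaceAll("\\s+geb\\.\\s+\\S+", "").trim();

        // Steps 2 and 3: a known last name wins, including the multi-word "de Gruyter".
        for (String known : KNOWN_LAST_NAMES) {
            if (name.endsWith(" " + known)) {
                String first = name.substring(0, name.length() - known.length()).trim();
                return new SplitName(first, known);
            }
        }

        // Step 5: a single token has no last name, so use the "?" placeholder.
        int lastSpace = name.lastIndexOf(' ');
        if (lastSpace < 0) {
            return new SplitName(name, "?");
        }

        // Step 4: otherwise split on the last space.
        return new SplitName(name.substring(0, lastSpace).trim(), name.substring(lastSpace + 1));
    }
}
```

Because the method is static and has no dependencies, a plain unit test can call `PersonNameParser.split("Walter")` and compare the result with `new SplitName("Walter", "?")`.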
---
### 10. Tags: Lookup-or-Create
**Current behaviour:** Tags are never imported.
**Fix:** Read col 10 (`Schlagwort`). If non-blank:
```java
Tag tag = tagRepository.findByNameIgnoreCase(value)
.orElseGet(() -> tagRepository.save(Tag.builder().name(value).build()));
document.getTags().add(tag);
```
Tags are imported as-is. The `TagRepository` already has `findByNameIgnoreCase`, so deduplication is free.
---
### 11. Summary: Map "Inhalt" (Col 11)
Read col 11 (`Inhalt`) and set on `document.setSummary(...)`. Short content keywords (`Geschäftsreise`, `Reisepläne`) are useful for full-text search even if they're terse.
Col 12 (`Zeitlicher Kontext`) is skipped — it is often a duplicate of context already encoded in sender/receiver/tags.
---
### 12. New Model Fields: archiveBox and archiveFolder
Cols 1 and 2 (`Box`, `Mappe`) identify the physical storage location of the original document. They have no counterpart in the model today.
**Changes:**
1. Add to `Document.java`:
```java
@Column(name = "archive_box")
private String archiveBox;
@Column(name = "archive_folder")
private String archiveFolder;
```
2. Flyway migration `V4__add_archive_fields_to_documents.sql`:
```sql
ALTER TABLE documents ADD COLUMN archive_box VARCHAR(255);
ALTER TABLE documents ADD COLUMN archive_folder VARCHAR(255);
```
3. Import logic reads col 1 → `archiveBox`, col 2 → `archiveFolder`.
---
### 13. PersonRepository: Add findByAliasIgnoreCase
Add one method to `PersonRepository`:
```java
Optional<Person> findByAliasIgnoreCase(String alias);
```
Spring Data generates the query automatically. No other repository changes are needed.
---
## Overwrite Behaviour (No Change)
The existing skip logic stays: if a document already exists in the DB and its status is not `PLACEHOLDER`, it is skipped. This prevents accidental data loss on re-runs. The assumption is that if someone has manually enriched a document beyond placeholder stage, that work should not be overwritten by a re-import.
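Expressed as a predicate, the rule looks like the sketch below. Only `PLACEHOLDER` is named in this plan; the `ENRICHED` value and the `OverwritePolicy` class are hypothetical stand-ins for whatever the real status enum and service code contain.

```java
import java.util.Optional;

// Hypothetical illustration of the existing skip rule; not the actual service code.
final class OverwritePolicy {

    // Only PLACEHOLDER is taken from the plan; ENRICHED stands in for any other status.
    enum Status { PLACEHOLDER, ENRICHED }

    private OverwritePolicy() {}

    // A row is skipped when a document already exists and has moved beyond the placeholder stage.
    static boolean shouldSkip(Optional<Status> existingStatus) {
        return existingStatus.isPresent() && existingStatus.get() != Status.PLACEHOLDER;
    }
}
```

A missing document and a `PLACEHOLDER` document are both imported; anything else is left alone.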
---
## Summary of All File Changes
| File | Change |
|------|--------|
| `ExcelService.java` | **Deleted** |
| `AdminController.java` (or wherever ExcelService is injected) | Remove ExcelService injection and its endpoint |
| `MassImportService.java` | `WorkbookFactory`, new column indices, `.ods` discovery, filename fix, title, date parsing, sender, receivers, tags, summary, archiveBox/archiveFolder |
| `PersonNameParser.java` (new) | Static utility: `split(String)` → `SplitName`, `parseReceivers(String)` → `List<String>` |
| `PersonRepository.java` | Add `findByAliasIgnoreCase(String)` |
| `Document.java` | Add `archiveBox`, `archiveFolder` fields |
| `V4__add_archive_fields_to_documents.sql` (new) | `ALTER TABLE` for both new columns |
| `application.properties` | Update/add `app.import.col.*` properties |
---
## What We Are Not Changing
- **Col 4 (`BriefeschreiberIn`)** — redundant with col 3.
- **Col 6 (`EmpfängerIn`)** — redundant with col 5.
- **Col 8 (`Datum Originalformat`)** — ISO date in col 7 is strictly better.
- **Col 12 (`Zeitlicher Kontext`)** — no clear mapping, often duplicates other fields.
- **`persons` table schema** — `alias` serves as the full-name store without a schema change.
- **`TagRepository`** — existing `findByNameIgnoreCase` is sufficient.