refactor(document): move document domain core to document/ package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:39:20 +02:00
parent bb7d872a61
commit e85057bed2
2371 changed files with 385726 additions and 1971 deletions
--- a/.agent/PLAN.md
+++ b/.agent/PLAN.md
@@ -0,0 +1,274 @@
+# Import Pipeline: ODS Alignment Plan
+
+## Context
+
+The real data source is an ODS spreadsheet (`zzfamilienarchiv Walter und Eugenie 2025-04-10.ods`) with 1,508 rows and 14 columns, living alongside PDF files (`W-0001.pdf`, `C-0451.pdf`, etc.) in `familienarchiv_raw/`. The existing import pipeline was built speculatively without seeing the actual data. It has several structural mismatches that need to be resolved before any real import can run.
+
+`ExcelService` (the web-upload import path) will be **deleted entirely**. The only import path is `MassImportService`, which reads an ODS file from the `/import` directory on the filesystem. This simplifies the scope significantly.
+
+---
+
+## What the ODS Actually Contains
+
+| Col | Header               | Example value                            | Action          |
+|-----|----------------------|------------------------------------------|-----------------|
+| 0   | Index                | `W-0001`                                 | → `originalFilename` (+ `.pdf`) |
+| 1   | Box                  | `V`                                      | → `archiveBox` (new field) |
+| 2   | Mappe                | `1`                                      | → `archiveFolder` (new field) |
+| 3   | Von                  | `Walter de Gruyter`                      | → `sender` (Person) |
+| 4   | BriefeschreiberIn    | `Walter de Gruyter`                      | Ignored (redundant with col 3) |
+| 5   | An                   | `Eugenie de Gruyter geb. Müller`         | → `receivers` (Person, parse multi) |
+| 6   | EmpfängerIn          | `Eugenie Müller`                         | Ignored (redundant with col 5) |
+| 7   | Datum                | `1888-02-15` (ISO date string)           | → `documentDate` |
+| 8   | Datum Originalformat | `15.2.1888`                              | Ignored |
+| 9   | Ort                  | `Rotterdam`                              | → `location` |
+| 10  | Schlagwort           | `Brautbriefe`                            | → `tags` |
+| 11  | Inhalt               | `Geschäftsreise`                         | → `summary` |
+| 12  | Zeitlicher Kontext   | `Brautbriefe von Walter...`              | Skipped (no clear mapping) |
+| 13  | Transkript           | (mostly empty for now)                   | → `transcription` |
+
+---
+
+## Changes
+
+### 1. Delete ExcelService
+
+`ExcelService.java` is deleted. All references to it (in `AdminController` or wherever it is injected) are removed. Going forward, `MassImportService` is the sole import mechanism. The web-upload flow that previously called `ExcelService` is removed from the controller.
+
+**Why:** The user confirmed the ODS-from-filesystem path is the only import workflow. Keeping dead code would create maintenance confusion.
+
+---
+
+### 2. File Format: ODS support via WorkbookFactory
+
+**Current behaviour:** `MassImportService` constructs `new XSSFWorkbook(inputStream)`, which only handles `.xlsx`. Opening the ODS file throws immediately.
+
+**Fix:** Replace with `WorkbookFactory.create(fis)`, which auto-detects the workbook format instead of assuming `.xlsx`. Note that POI's documented formats are OLE2 (`.xls`) and OOXML (`.xlsx`); whether it opens the `.ods` file must be confirmed against the real archive before the rest of this plan relies on it. Also update `findExcelFile()`, which currently filters by `.endsWith(".xlsx")`: change the filter to accept `.ods`, `.xlsx`, and `.xls`.
+
+**Why not add `odftoolkit`?** We already have `poi` and `poi-ooxml` at 5.5.0. If `WorkbookFactory` opens the file, a second spreadsheet library would be redundant; if it does not, the fallback is to export the ODS to `.xlsx` or to revisit this decision.
+
+---
+
+### 3. Column Index Defaults
+
+**Current defaults (wrong):**
+```
+app.import.excel.col.filename=0   date=1   location=2   transcription=3
+```
+
+**Correct indices:**
+```
+filename=0  box=1  folder=2  sender=3  receivers=5  date=7  location=9  tags=10  summary=11  transcription=13
+```
+
+**Fix:** Update the `@Value` defaults in `MassImportService` and set explicit values in `application.properties`. The old defaults in `ExcelService` disappear with that class. Rename the property prefix from `app.import.excel.col.*` to `app.import.col.*`, since the format is no longer Excel-specific.
+
+---
+
+### 4. Filename Resolution: Index → PDF
+
+**Current behaviour:** Cell value used directly as `originalFilename`.
+
+**Actual situation:** Col 0 is the bare index (e.g., `W-0001`). PDF files are named `W-0001.pdf`. The import must append `.pdf`.
+
+**Fix:** After reading col 0, append `.pdf` if the value contains no `.`:
+```java
+if (!filename.contains(".")) filename = filename + ".pdf";
+```
+
+---
+
+### 5. Document Title: German Date Format
+
+**Current behaviour:** Title is set to the raw filename, e.g. `W-0001.pdf`.
+
+**Fix:** Build title from `{Index} – {date in German format} – {location}`. Use `DateTimeFormatter` with locale `de`:
+```
+W-0001 – 15. Februar 1888 – Rotterdam
+```
+If the date is missing, omit the date segment; if the location is missing, omit the location segment. The index alone is acceptable as a minimum title.
+
+**German month formatting:** Use `DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN)`.
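+
+A minimal sketch of the title rule, assuming a hypothetical helper named `TitleBuilder.buildTitle` (the plan does not name one). It joins whichever segments are present and formats the date with the pattern given above:
+
+```java
+import java.time.LocalDate;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Locale;
+
+final class TitleBuilder {
+    private static final DateTimeFormatter GERMAN_DATE =
+            DateTimeFormatter.ofPattern("d. MMMM yyyy", Locale.GERMAN);
+
+    // " \u2013 " is the en dash separator used in the plan's example title.
+    private static final String SEPARATOR = " \u2013 ";
+
+    /** Joins index, German-formatted date and location, skipping missing segments. */
+    static String buildTitle(String index, LocalDate date, String location) {
+        List<String> parts = new ArrayList<>();
+        parts.add(index);
+        if (date != null) parts.add(date.format(GERMAN_DATE));
+        if (location != null && !location.isBlank()) parts.add(location.trim());
+        return String.join(SEPARATOR, parts);
+    }
+}
+```
+
+With index `W-0001`, date 1888-02-15 and location `Rotterdam`, this yields the example title shown above; with only an index it returns the index unchanged.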
+
+---
+
+### 6. Date Parsing: Add String Fallback
+
+**Current behaviour:** Only handles numeric date-formatted cells (`DateUtil.isCellDateFormatted()`).
+
+**Actual data:** Col 7 contains ISO date strings (`1888-02-15`) stored as text in LibreOffice ODS. These have `CellType.STRING`, so the existing code silently produces `null` dates for every row.
+
+**Fix:** Extract a helper method `parseDate(Cell)`:
+```java
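+// Returns the cell's date, or null when the cell is empty or unparseable.
+private LocalDate parseDate(Cell cell) {
+    if (cell == null) return null;
+    // Native spreadsheet dates are numeric cells with a date format.
+    if (cell.getCellType() == CellType.NUMERIC && DateUtil.isCellDateFormatted(cell))
+        return cell.getLocalDateTimeCellValue().toLocalDate(); // avoids a time-zone round trip
+    // Col 7 holds ISO strings such as "1888-02-15", which LocalDate.parse accepts as-is.
+    if (cell.getCellType() == CellType.STRING) {
+        try { return LocalDate.parse(cell.getStringCellValue().trim()); }
+        catch (DateTimeParseException e) { return null; }
+    }
+    return null;
+}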
+```
+
+---
+
+### 7. Sender: Text → Person (lookup-or-create)
+
+**Current behaviour:** Sender is never set.
+
+**Actual data:** Col 3 (`Von`) is always a single name string, e.g. `Walter de Gruyter`, `Eugenie de Gruyter geb. Müller`.
+
+**Fix:** Extract a `findOrCreatePerson(String rawName)` helper:
+1. Look up by `alias` exact match (case-insensitive). Use a new repository method `findByAliasIgnoreCase(String)` on `PersonRepository`.
+2. If not found, create with:
+   - `alias` = full raw string
+   - `firstName` / `lastName` = best-effort split (see §9 below)
+3. Return the `Person` and set on `document.setSender(...)`.
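+
+The lookup-or-create flow can be sketched without Spring. In the sketch below, the `Person` record and the in-memory map are hypothetical stand-ins for the JPA entity and `PersonRepository`; only the case-insensitive alias match is taken from the plan, and the name split of §9 is left out:
+
+```java
+import java.util.HashMap;
+import java.util.Locale;
+import java.util.Map;
+
+final class PersonLookupSketch {
+    /** Hypothetical stand-in for the Person entity; only the alias is modelled. */
+    record Person(String alias) {}
+
+    // Stand-in for PersonRepository: lower-cased alias -> Person.
+    private final Map<String, Person> byAlias = new HashMap<>();
+
+    /** Returns the existing Person for this alias (ignoring case) or creates one. */
+    Person findOrCreatePerson(String rawName) {
+        String alias = rawName.trim();
+        String key = alias.toLowerCase(Locale.ROOT);
+        // Plays the role of findByAliasIgnoreCase(alias).orElseGet(() -> save(...)).
+        return byAlias.computeIfAbsent(key, k -> new Person(alias));
+    }
+
+    int size() {
+        return byAlias.size();
+    }
+}
+```
+
+Two rows that spell the same sender with different casing resolve to one `Person`, which is the property the import relies on for deduplication.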
+
+---
+
+### 8. Receivers: Text → Person(s) with Normalization
+
+**Current behaviour:** Receivers are never set.
+
+**Actual data (exhaustive set of multi-receiver patterns):**
+```
+'Clara Cram u Ellen B-M'
+'Clara u Familie'
+'Clara u Herbert Cram'
+'Ella u Walter Dieckmann'
+'Eugenie u Walter de Gruyter'
+'Hedi und Tutu (Gruber)'
+'Herbert und Clara Cram'
+'Walter und Eugenie'
+'Walter und Eugenie de Gruyter'
+```
+
+**Parsing algorithm for col 5 (`An`):**
+
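+1. **Strip `geb.` clauses** — remove ` geb. \S+` from the string (maiden name annotations are not useful for matching). Use `\S+` rather than `\w+`: Java's `\w` is ASCII-only by default and would stop at the `ü` in `Müller`.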
+2. **Extract parenthesised last name** — if the string ends with `(Something)`, capture `Something` as the shared last name and strip it.
+3. **Split on separator** — split on ` und ` or ` u ` (whole-word match with `\s+u\s+` or `\s+und\s+`).
+4. **Filter** — discard any segment that is exactly `Familie` (it's not a person).
+5. **Distribute shared last name** — find the last name in the rightmost segment. Known multi-word last name: `de Gruyter`. Known single-word last names: `Cram`, `Dieckmann`, `Gruber`, `Müller`, `Wolff`. These are hardcoded as a lookup list. If the last segment ends with a known last name and an earlier segment has no last name (i.e., it is a single token), append that last name to the earlier segment.
+6. **Handle no-last-name cases** — if no last name can be determined at all (e.g., `Walter und Eugenie`), proceed with just the first name and set `lastName` to the placeholder `"?"`. The model declares `lastName` with `nullable = false`, so some value is required, and `"?"` is clearer than an empty string.
+7. **findOrCreatePerson** for each resulting name segment, then add all to `document.getReceivers()`.
+
+**Examples:**
+| Raw | Result |
+|-----|--------|
+| `Walter und Eugenie de Gruyter` | [Walter de Gruyter, Eugenie de Gruyter] |
+| `Herbert und Clara Cram` | [Herbert Cram, Clara Cram] |
+| `Hedi und Tutu (Gruber)` | [Hedi Gruber, Tutu Gruber] |
+| `Clara Cram u Ellen B-M` | [Clara Cram, Ellen B-M] |
+| `Clara u Familie` | [Clara (?)] |
+| `Walter und Eugenie` | [Walter (?), Eugenie (?)] |
+| `Eugenie de Gruyter geb. Müller` | [Eugenie de Gruyter] |
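+
+Steps 1 to 5 can be sketched as a pure function. The class name `ReceiverParserSketch` is hypothetical, and the sketch returns plain name strings; the `"?"` placeholder from step 6 would be applied later, when each name is split into first and last name:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+final class ReceiverParserSketch {
+    // Family names listed in the plan; the u-umlaut is escaped to keep the source ASCII-only.
+    private static final List<String> LAST_NAMES =
+            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "M\u00fcller", "Wolff");
+    private static final Pattern GEB = Pattern.compile("\\s+geb\\.\\s+\\S+");
+    private static final Pattern TRAILING_PAREN = Pattern.compile("\\s*\\(([^)]+)\\)\\s*$");
+
+    static List<String> parseReceivers(String raw) {
+        // Step 1: strip maiden-name clauses.
+        String s = GEB.matcher(raw.trim()).replaceAll("");
+        // Step 2: a trailing "(Name)" is the shared last name.
+        String shared = null;
+        Matcher m = TRAILING_PAREN.matcher(s);
+        if (m.find()) {
+            shared = m.group(1).trim();
+            s = s.substring(0, m.start());
+        }
+        // Steps 3 and 4: split on " und " / " u " and drop "Familie".
+        List<String> segments = new ArrayList<>();
+        for (String part : s.split("\\s+(?:und|u)\\s+")) {
+            String p = part.trim();
+            if (!p.isEmpty() && !p.equals("Familie")) segments.add(p);
+        }
+        // Step 5: otherwise take the shared last name from the rightmost segment.
+        if (shared == null && !segments.isEmpty()) {
+            String last = segments.get(segments.size() - 1);
+            for (String name : LAST_NAMES) {
+                if (last.endsWith(" " + name)) { shared = name; break; }
+            }
+        }
+        // Single-token segments inherit the shared last name, if there is one.
+        List<String> result = new ArrayList<>();
+        for (String seg : segments) {
+            boolean singleToken = !seg.contains(" ");
+            result.add(singleToken && shared != null ? seg + " " + shared : seg);
+        }
+        return result;
+    }
+}
+```
+
+For `Hedi und Tutu (Gruber)` this returns `Hedi Gruber` and `Tutu Gruber`; for `Clara u Familie` it returns the single name `Clara`.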
+
+**Why normalise?** Without normalisation, `Herbert und Clara Cram` would become one person with a nonsensical name and would never match separate `Herbert Cram` or `Clara Cram` entries from other rows. Normalisation means subsequent rows referencing the same individual will reuse the same `Person` record.
+
+**Why hardcode the last names?** There are only 6 known family names in this archive. Adding a configurable list would be over-engineering for a one-family archive. If the archive expands, the list can be extended.
+
+---
+
+### 9. Name Splitting Helper (firstName / lastName)
+
+Used when creating a new `Person` who cannot be found by alias.
+
+**Algorithm:**
+1. Strip any ` geb. \S+` suffix (`\S+` rather than `\w+`, because Java's `\w` is ASCII-only by default and would not match `Müller` in full).
+2. Check if the string ends with a known last name (from the list in §8). If yes, that name is `lastName` and everything before it is `firstName`.
+3. This also covers the multi-word last name `de Gruyter`: `firstName` is everything before `de Gruyter`.
+4. Otherwise, split on the last space: `firstName` = everything before, `lastName` = last word.
+5. If only one token (no space), `firstName` = token, `lastName` = `"?"`.
+
+This logic lives in a static utility method `PersonNameParser.split(String)` returning a record `SplitName(String firstName, String lastName)`. Keeping it static and pure makes it straightforward to unit-test without a Spring context.
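+
+One way the five steps might look in code. This is a sketch under the plan's own names (`PersonNameParser`, `SplitName`), not the final implementation; the regex and the suffix check are assumptions about how the steps would be realised:
+
+```java
+import java.util.List;
+
+final class PersonNameParser {
+    record SplitName(String firstName, String lastName) {}
+
+    // Family names listed in section 8; the u-umlaut is escaped to keep the source ASCII-only.
+    private static final List<String> LAST_NAMES =
+            List.of("de Gruyter", "Cram", "Dieckmann", "Gruber", "M\u00fcller", "Wolff");
+
+    static SplitName split(String raw) {
+        // Step 1: drop a trailing maiden-name clause.
+        String name = raw.trim().replaceAll("\\s+geb\\.\\s+\\S+$", "");
+        // Steps 2 and 3: a known last name, possibly multi-word, takes precedence.
+        for (String last : LAST_NAMES) {
+            if (name.endsWith(" " + last)) {
+                String first = name.substring(0, name.length() - last.length()).trim();
+                return new SplitName(first, last);
+            }
+        }
+        // Step 4: otherwise split on the last space.
+        int lastSpace = name.lastIndexOf(' ');
+        if (lastSpace >= 0) {
+            return new SplitName(name.substring(0, lastSpace).trim(),
+                                 name.substring(lastSpace + 1));
+        }
+        // Step 5: a single token gets the "?" placeholder as last name.
+        return new SplitName(name, "?");
+    }
+}
+```
+
+`split("Eugenie de Gruyter geb. Müller")` gives `firstName = "Eugenie"` and `lastName = "de Gruyter"`, while `split("Clara")` gives `firstName = "Clara"` and `lastName = "?"`.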
+
+---
+
+### 10. Tags: Lookup-or-Create
+
+**Current behaviour:** Tags are never imported.
+
+**Fix:** Read col 10 (`Schlagwort`). If non-blank:
+```java
+Tag tag = tagRepository.findByNameIgnoreCase(value)
+    .orElseGet(() -> tagRepository.save(Tag.builder().name(value).build()));
+document.getTags().add(tag);
+```
+
+Tags are imported as-is. The `TagRepository` already has `findByNameIgnoreCase`, so deduplication is free.
+
+---
+
+### 11. Summary: Map "Inhalt" (Col 11)
+
+Read col 11 (`Inhalt`) and set on `document.setSummary(...)`. Short content keywords (`Geschäftsreise`, `Reisepläne`) are useful for full-text search even if they're terse.
+
+Col 12 (`Zeitlicher Kontext`) is skipped — it is often a duplicate of context already encoded in sender/receiver/tags.
+
+---
+
+### 12. New Model Fields: archiveBox and archiveFolder
+
+Cols 1 and 2 (`Box`, `Mappe`) identify the physical storage location of the original document. They have no counterpart in the model today.
+
+**Changes:**
+1. Add to `Document.java`:
+   ```java
+   @Column(name = "archive_box")
+   private String archiveBox;
+
+   @Column(name = "archive_folder")
+   private String archiveFolder;
+   ```
+2. Flyway migration `V4__add_archive_fields_to_documents.sql`:
+   ```sql
+   ALTER TABLE documents ADD COLUMN archive_box VARCHAR(255);
+   ALTER TABLE documents ADD COLUMN archive_folder VARCHAR(255);
+   ```
+3. Import logic reads col 1 → `archiveBox`, col 2 → `archiveFolder`.
+
+---
+
+### 13. PersonRepository: Add findByAliasIgnoreCase
+
+Add one method to `PersonRepository`:
+```java
+Optional<Person> findByAliasIgnoreCase(String alias);
+```
+Spring Data generates the query automatically. No other repository changes are needed.
+
+---
+
+## Overwrite Behaviour (No Change)
+
+The existing skip logic stays: if a document already exists in the DB and its status is not `PLACEHOLDER`, it is skipped. This prevents accidental data loss on re-runs. The assumption is that if someone has manually enriched a document beyond placeholder stage, that work should not be overwritten by a re-import.
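+
+The rule reduces to a single predicate. In this sketch the enum values other than `PLACEHOLDER` and the method name `shouldImport` are hypothetical, chosen only to make the example self-contained:
+
+```java
+import java.util.Optional;
+
+final class OverwritePolicySketch {
+    /** Hypothetical status values; only PLACEHOLDER is named in the plan. */
+    enum DocumentStatus { PLACEHOLDER, UPLOADED, REVIEWED }
+
+    /**
+     * Returns true when the row may be written: either no document exists yet,
+     * or the existing document is still a placeholder.
+     */
+    static boolean shouldImport(Optional<DocumentStatus> existingStatus) {
+        return existingStatus
+                .map(status -> status == DocumentStatus.PLACEHOLDER)
+                .orElse(true);
+    }
+}
+```
+
+A re-run therefore touches new rows and placeholders only; anything with another status is left as it is.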
+
+---
+
+## Summary of All File Changes
+
+| File | Change |
+|------|--------|
+| `ExcelService.java` | **Deleted** |
+| `AdminController.java` (or wherever ExcelService is injected) | Remove ExcelService injection and its endpoint |
+| `MassImportService.java` | `WorkbookFactory`, new column indices, `.ods` discovery, filename fix, title, date parsing, sender, receivers, tags, summary, archiveBox/archiveFolder |
+| `PersonNameParser.java` (new) | Static utility: `split(String)` → `SplitName`, `parseReceivers(String)` → `List<String>` |
+| `PersonRepository.java` | Add `findByAliasIgnoreCase(String)` |
+| `Document.java` | Add `archiveBox`, `archiveFolder` fields |
+| `V4__add_archive_fields_to_documents.sql` (new) | `ALTER TABLE` for both new columns |
+| `application.properties` | Update/add `app.import.col.*` properties |
+
+---
+
+## What We Are Not Changing
+
+- **Col 4 (`BriefeschreiberIn`)** — redundant with col 3.
+- **Col 6 (`EmpfängerIn`)** — redundant with col 5.
+- **Col 8 (`Datum Originalformat`)** — ISO date in col 7 is strictly better.
+- **Col 12 (`Zeitlicher Kontext`)** — no clear mapping, often duplicates other fields.
+- **`persons` table schema** — `alias` serves as the full-name store without a schema change.
+- **`TagRepository`** — existing `findByNameIgnoreCase` is sufficient.