feat(import): store SKIP entries in document notes #217

New Issue

marcel · 2026-04-08T21:14:52+02:00

marcel commented

2026-04-08 21:14:52 +02:00

Problem

When PersonTypeClassifier classifies a Von or An entry as SKIP (e.g. "Briefumschlag aus Java", "Kondolenzbriefe zum Tod von Walter de Gruyter", "Hochzeitsgedicht fur Paul u Luise de Gruyter"), PersonService.findOrCreateByAlias() returns null and the entry is silently dropped. The information is lost — there's no record that the original spreadsheet row contained this text.

Solution

When a person entry is classified as SKIP, store the raw text in the document's notes field instead of discarding it. This preserves the original spreadsheet content for manual review.

Processing

raw name → PersonTypeClassifier.classify()
  → if SKIP: append raw text to Document.notes (e.g. "Übersprungen (Von): Briefumschlag aus Java")
  → if PERSON/UNKNOWN: continue to parseReceivers/split

Files

File	Change
`MassImportService.java`	After `findOrCreatePerson()` returns null for a SKIP entry, append the raw value to `doc.notes`
`MassImportServiceTest.java`	Test that SKIP entries appear in document notes

Acceptance criteria

SKIP-classified Von/An entries are stored in the document's notes field
Notes include the original raw text and which column it came from
Non-SKIP entries are unaffected

Found in

Architecture discussion on #214.

## Problem When `PersonTypeClassifier` classifies a Von or An entry as `SKIP` (e.g. "Briefumschlag aus Java", "Kondolenzbriefe zum Tod von Walter de Gruyter", "Hochzeitsgedicht fur Paul u Luise de Gruyter"), `PersonService.findOrCreateByAlias()` returns `null` and the entry is silently dropped. The information is lost — there's no record that the original spreadsheet row contained this text. ## Solution When a person entry is classified as `SKIP`, store the raw text in the document's `notes` field instead of discarding it. This preserves the original spreadsheet content for manual review. ### Processing ``` raw name → PersonTypeClassifier.classify() → if SKIP: append raw text to Document.notes (e.g. "Übersprungen (Von): Briefumschlag aus Java") → if PERSON/UNKNOWN: continue to parseReceivers/split ``` ### Files | File | Change | |---|---| | `MassImportService.java` | After `findOrCreatePerson()` returns null for a SKIP entry, append the raw value to `doc.notes` | | `MassImportServiceTest.java` | Test that SKIP entries appear in document notes | ## Acceptance criteria - [ ] SKIP-classified Von/An entries are stored in the document's notes field - [ ] Notes include the original raw text and which column it came from - [ ] Non-SKIP entries are unaffected ## Found in Architecture discussion on #214.

marcel added the feature label 2026-04-08 21:15:09 +02:00

marcel commented

2026-04-08 21:17:22 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

findOrCreatePerson() return value is ambiguous — Currently findOrCreateByAlias() returns null for SKIP entries. But null is also what you'd get if something went wrong. The caller in MassImportService can't distinguish "this was a SKIP" from "something unexpected returned null." To store the skip reason in notes, the caller needs to know why it got null. Options:
- Have findOrCreateByAlias() throw a specific SkipException that the caller catches
- Return an Optional<Person> or a result type that carries the skip reason
- Call PersonTypeClassifier.classify() directly in MassImportService before calling findOrCreatePerson(), and handle the SKIP branch there
The third option is cleanest — it matches the pipeline order described in #214 and avoids changing PersonService's contract. The classifier is already a static utility, so calling it from MassImportService has no coupling cost.
Append vs overwrite on doc.notes — The issue says "append raw text to Document.notes." If a document has both a SKIP sender and a SKIP receiver (unlikely but possible), both should appear. If the document already has notes from a previous import or manual entry, the append must not destroy existing content. Use a separator like \n between entries.
Test shape — I'd write a parameterized test in MassImportServiceTest:
- Input: a row with Von = "Briefumschlag aus Java" (SKIP) and An = "Walter de Gruyter" (PERSON)
- Assert: document is created, sender is null, receivers contain Walter, notes contain "Übersprungen (Von): Briefumschlag aus Java"
- Second case: An = "Kondolenzbriefe zum Tod von Walter de Gruyter" (SKIP), Von = "Walter de Gruyter"
- Assert: sender is Walter, receivers empty, notes contain the skip text with "(An)" label

Suggestions

Call the classifier directly in MassImportService before findOrCreatePerson(). This is the right place for the routing decision — it keeps PersonService focused on person CRUD.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Questions & Observations - **`findOrCreatePerson()` return value is ambiguous** — Currently `findOrCreateByAlias()` returns `null` for SKIP entries. But `null` is also what you'd get if something went wrong. The caller in `MassImportService` can't distinguish "this was a SKIP" from "something unexpected returned null." To store the skip reason in notes, the caller needs to know *why* it got null. Options: - Have `findOrCreateByAlias()` throw a specific `SkipException` that the caller catches - Return an `Optional<Person>` or a result type that carries the skip reason - Call `PersonTypeClassifier.classify()` directly in `MassImportService` *before* calling `findOrCreatePerson()`, and handle the SKIP branch there The third option is cleanest — it matches the pipeline order described in #214 and avoids changing `PersonService`'s contract. The classifier is already a static utility, so calling it from `MassImportService` has no coupling cost. - **Append vs overwrite on `doc.notes`** — The issue says "append raw text to Document.notes." If a document has both a SKIP sender and a SKIP receiver (unlikely but possible), both should appear. If the document already has notes from a previous import or manual entry, the append must not destroy existing content. Use a separator like `\n` between entries. - **Test shape** — I'd write a parameterized test in `MassImportServiceTest`: - Input: a row with Von = "Briefumschlag aus Java" (SKIP) and An = "Walter de Gruyter" (PERSON) - Assert: document is created, sender is null, receivers contain Walter, notes contain "Übersprungen (Von): Briefumschlag aus Java" - Second case: An = "Kondolenzbriefe zum Tod von Walter de Gruyter" (SKIP), Von = "Walter de Gruyter" - Assert: sender is Walter, receivers empty, notes contain the skip text with "(An)" label ### Suggestions - Call the classifier directly in `MassImportService` before `findOrCreatePerson()`. This is the right place for the routing decision — it keeps `PersonService` focused on person CRUD.

marcel commented

2026-04-08 21:17:33 +02:00

🏗️ Markus Keller — Application Architect

Questions & Observations

Responsibility placement — The issue says to handle this in MassImportService after findOrCreatePerson() returns null. I'd push back slightly: the SKIP classification should happen before calling findOrCreatePerson(), not after. The classifier is a gate in the pipeline — it decides whether to proceed to person creation. Checking the return value of findOrCreatePerson() for null conflates "SKIP entry" with "failed lookup," which is a leaky abstraction.

The clean pipeline:
```
raw → classify()
  → SKIP: append to notes, do not call findOrCreatePerson()
  → PERSON/UNKNOWN: findOrCreatePerson()
```
Notes field semantics — Document.notes is currently a free-text field (nullable text column). Using it for structured skip-reason data means mixing human-authored notes with machine-generated import metadata. Is that acceptable? For this project's scale (family archive, small team), yes — but the format should be consistent enough to be parseable later if needed. I'd suggest a prefix format: [Import] Übersprungen (Von): Briefumschlag aus Java.
No schema change needed — The notes field already exists as a text column with no length limit. No Flyway migration required. This is purely an application logic change.

Suggestions

Move the classifier check into MassImportService before the findOrCreatePerson() call. This makes the pipeline order explicit in code, matching the diagram in #214.
Use a consistent prefix format for import-generated notes so they can be distinguished from manual notes.

## 🏗️ Markus Keller — Application Architect ### Questions & Observations - **Responsibility placement** — The issue says to handle this in `MassImportService` after `findOrCreatePerson()` returns null. I'd push back slightly: the SKIP classification should happen *before* calling `findOrCreatePerson()`, not after. The classifier is a gate in the pipeline — it decides whether to proceed to person creation. Checking the return value of `findOrCreatePerson()` for null conflates "SKIP entry" with "failed lookup," which is a leaky abstraction. The clean pipeline: ``` raw → classify() → SKIP: append to notes, do not call findOrCreatePerson() → PERSON/UNKNOWN: findOrCreatePerson() ``` - **Notes field semantics** — `Document.notes` is currently a free-text field (nullable `text` column). Using it for structured skip-reason data means mixing human-authored notes with machine-generated import metadata. Is that acceptable? For this project's scale (family archive, small team), yes — but the format should be consistent enough to be parseable later if needed. I'd suggest a prefix format: `[Import] Übersprungen (Von): Briefumschlag aus Java`. - **No schema change needed** — The `notes` field already exists as a `text` column with no length limit. No Flyway migration required. This is purely an application logic change. ### Suggestions - Move the classifier check into `MassImportService` before the `findOrCreatePerson()` call. This makes the pipeline order explicit in code, matching the diagram in #214. - Use a consistent prefix format for import-generated notes so they can be distinguished from manual notes.

marcel commented

2026-04-08 21:17:47 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

Test coverage for all SKIP keywords — The classifier currently has three SKIP keywords: "Briefumschlag", "Kondolenzbriefe", "Hochzeitsgedicht". Each should have a test case in MassImportServiceTest verifying the notes are populated. A @ParameterizedTest with @CsvSource containing all three keywords would cover this efficiently.
Acceptance criteria need a "does not store" negative case — The criteria say "Non-SKIP entries are unaffected." This should be an explicit test: a normal person name in the Von column should NOT produce any skip text in notes. Currently the criteria are all positive assertions — add a negative one.
Multi-SKIP document — What happens when both Von and An are SKIP entries? The document would have no sender, no receivers, but notes with two skip entries. Is the document still created? The metadataComplete heuristic (date != null || !senderRaw.isBlank() || !receiversRaw.isBlank()) would still be true because the raw strings are non-blank — they just got classified as SKIP. Is metadataComplete = true correct for a document where all person fields were skipped? This is an edge case worth testing and deciding on.
Re-import behavior — If a document is re-imported (exists with PLACEHOLDER status), the notes from the first import would be overwritten or duplicated. Should re-import clear previous import notes, or append? The current code does doc.setNotes(...) which would overwrite.

Suggestions

Add a test for the multi-SKIP edge case (both Von and An are SKIP keywords)
Add a negative test: normal person entry does not produce skip notes
Decide on re-import behavior for notes: overwrite or append

## 🧪 Sara Holt — QA Engineer & Test Strategist ### Questions & Observations - **Test coverage for all SKIP keywords** — The classifier currently has three SKIP keywords: "Briefumschlag", "Kondolenzbriefe", "Hochzeitsgedicht". Each should have a test case in `MassImportServiceTest` verifying the notes are populated. A `@ParameterizedTest` with `@CsvSource` containing all three keywords would cover this efficiently. - **Acceptance criteria need a "does not store" negative case** — The criteria say "Non-SKIP entries are unaffected." This should be an explicit test: a normal person name in the Von column should NOT produce any skip text in notes. Currently the criteria are all positive assertions — add a negative one. - **Multi-SKIP document** — What happens when both Von and An are SKIP entries? The document would have no sender, no receivers, but notes with two skip entries. Is the document still created? The `metadataComplete` heuristic (`date != null || !senderRaw.isBlank() || !receiversRaw.isBlank()`) would still be true because the raw strings are non-blank — they just got classified as SKIP. Is `metadataComplete = true` correct for a document where all person fields were skipped? This is an edge case worth testing and deciding on. - **Re-import behavior** — If a document is re-imported (exists with PLACEHOLDER status), the notes from the first import would be overwritten or duplicated. Should re-import clear previous import notes, or append? The current code does `doc.setNotes(...)` which would overwrite. ### Suggestions - Add a test for the multi-SKIP edge case (both Von and An are SKIP keywords) - Add a negative test: normal person entry does not produce skip notes - Decide on re-import behavior for notes: overwrite or append

marcel commented

2026-04-08 21:17:56 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

Untrusted input stored in notes — The raw Von/An text from the ODS file is being stored directly in Document.notes. This text comes from an import file uploaded by an authenticated user, so the trust level is moderate. However, if notes is rendered anywhere in the frontend without escaping, a crafted import file could inject HTML/XSS payloads via the Von/An column. Verify that wherever notes is displayed in the UI, it's rendered as text content, not raw HTML. Svelte's {variable} syntax auto-escapes, so this should be safe — but check for any {@html notes} usage.
No new attack surface — The notes field already exists and is already user-writable (manual edits). Storing import metadata in it doesn't change the trust boundary. The only new thing is that import file content now reaches the notes field, which was previously only populated by direct user input through the UI.

Suggestions

Verify no {@html} rendering of the notes field in the frontend. If it's all standard {variable} interpolation, this is safe by default. Quick grep for @html.*notes would confirm.

## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Questions & Observations - **Untrusted input stored in notes** — The raw Von/An text from the ODS file is being stored directly in `Document.notes`. This text comes from an import file uploaded by an authenticated user, so the trust level is moderate. However, if `notes` is rendered anywhere in the frontend without escaping, a crafted import file could inject HTML/XSS payloads via the Von/An column. Verify that wherever `notes` is displayed in the UI, it's rendered as text content, not raw HTML. Svelte's `{variable}` syntax auto-escapes, so this should be safe — but check for any `{@html notes}` usage. - **No new attack surface** — The notes field already exists and is already user-writable (manual edits). Storing import metadata in it doesn't change the trust boundary. The only new thing is that import file content now reaches the notes field, which was previously only populated by direct user input through the UI. ### Suggestions - Verify no `{@html}` rendering of the notes field in the frontend. If it's all standard `{variable}` interpolation, this is safe by default. Quick grep for `@html.*notes` would confirm.

marcel commented

2026-04-08 21:18:03 +02:00

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

No direct UI impact — This is a backend import pipeline change. The notes field is already displayed in the document detail view. Import-generated skip notes will appear alongside any manually written notes — no new component or layout change needed.
Visual distinction for import notes — If import-generated notes (like "[Import] Übersprungen (Von): Briefumschlag aus Java") appear mixed with user-written notes, users might be confused about what they can edit or delete. A consistent prefix like [Import] helps, but it's still plain text in the same field. For the current project scale this is fine — if it becomes confusing later, a separate "import log" field could be added.

Suggestions

No UI changes needed. The [Import] prefix convention is sufficient to distinguish machine-generated notes from manual ones at this scale.

## 🎨 Leonie Voss — UI/UX Design Lead ### Questions & Observations - **No direct UI impact** — This is a backend import pipeline change. The notes field is already displayed in the document detail view. Import-generated skip notes will appear alongside any manually written notes — no new component or layout change needed. - **Visual distinction for import notes** — If import-generated notes (like "[Import] Übersprungen (Von): Briefumschlag aus Java") appear mixed with user-written notes, users might be confused about what they can edit or delete. A consistent prefix like `[Import]` helps, but it's still plain text in the same field. For the current project scale this is fine — if it becomes confusing later, a separate "import log" field could be added. ### Suggestions - No UI changes needed. The `[Import]` prefix convention is sufficient to distinguish machine-generated notes from manual ones at this scale.

marcel commented

2026-04-08 21:18:09 +02:00

⚙️ Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

No infrastructure impact — No new services, no config changes, no migrations, no Docker changes. Pure application logic change within MassImportService. The existing CI pipeline handles this without modification.
Observability — The current code already logs SKIP entries via the classifier path. After this change, the skip information will also be persisted in the database (in notes). That's actually a minor observability improvement — you can query SELECT original_filename, notes FROM documents WHERE notes LIKE '%Übersprungen%' to find all documents with skipped person entries, without needing to search through application logs.

Suggestions

No action needed from my side. Clean, self-contained backend change.

## ⚙️ Tobias Wendt — DevOps & Platform Engineer ### Questions & Observations - **No infrastructure impact** — No new services, no config changes, no migrations, no Docker changes. Pure application logic change within `MassImportService`. The existing CI pipeline handles this without modification. - **Observability** — The current code already logs SKIP entries via the classifier path. After this change, the skip information will also be persisted in the database (in `notes`). That's actually a minor observability improvement — you can query `SELECT original_filename, notes FROM documents WHERE notes LIKE '%Übersprungen%'` to find all documents with skipped person entries, without needing to search through application logs. ### Suggestions - No action needed from my side. Clean, self-contained backend change.

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#217