feat(model): add PersonType enum to distinguish persons from institutions/groups #211

Closed
opened 2026-04-07 18:28:12 +02:00 by marcel · 11 comments
Owner

Problem

The mass import creates bogus Person records for entries that are not persons. Currently only "Familie" is filtered out. The ODS file contains at least 19 non-person entries that create garbage Person records:

Institutions/companies:

  • Reichsfechtschule, Steuerfinanzamt, Firma Auschrath, Arthur Collignon GmbH
  • Mitarbeiter Verlag, Mitarbeiter Druckerei TrebbinClara Cram, Mitarbeiter Kunstverlag Mu

Groups/committees:

  • Comite der Abschiedsfeier, Comite zur Errichtung eines Heine-Denkmals
  • Geschwister de Gruyter, Gesellschafter des Verlages
  • Garde du Corps

Possessive group references:

  • Ella de Gruyters Eltern, Eugenie de Gruyters Kinder, Hilde de Gruyters Schwiegereltern, Eltern Muller

Not a person at all:

  • Briefumschlag aus Java, Kondolenzbriefe zum Tod von Walter de Gruyter, Hochzeitsgedicht fur Paul u Luise de Gruyter

Solution

Part 1 - Add PersonType enum

public enum PersonType {
    PERSON,
    INSTITUTION,
    GROUP,
    UNKNOWN
}

Add a type field to Person entity, defaulting to PERSON.

Part 2 - Migration

Add person_type column to persons table with default 'PERSON'.

Part 3 - Import heuristics

During mass import, classify entries based on keyword heuristics:

INSTITUTION keywords: Firma, GmbH, Verlag, Druckerei, Kunstverlag, Steuerfinanzamt, Reichsfechtschule

GROUP keywords: Comite, Comite, Geschwister, Gesellschafter, Garde, Eltern, Kinder, Schwiegereltern, Mitarbeiter

Skip entirely (not even a record): Briefumschlag, Kondolenzbriefe, Hochzeitsgedicht - these are document descriptions, not sender/receiver names

UNKNOWN: anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.)

Part 4 - UI indication

Person list and detail pages should display the type (icon or badge) so users can review and correct misclassified entries.

Files

File Change
PersonType.java (new) Enum: PERSON, INSTITUTION, GROUP, UNKNOWN
Person.java Add type field with default PERSON
V{n}__add_person_type.sql Migration adding column
MassImportService.java or PersonNameParser.java Heuristic classification during import
PersonNameParserTest.java Tests for classification heuristics
Frontend person list/detail Display type badge

Found in

ODS import file analysis for #190.

## Problem The mass import creates bogus Person records for entries that are not persons. Currently only "Familie" is filtered out. The ODS file contains at least 19 non-person entries that create garbage Person records: **Institutions/companies:** - `Reichsfechtschule`, `Steuerfinanzamt`, `Firma Auschrath`, `Arthur Collignon GmbH` - `Mitarbeiter Verlag`, `Mitarbeiter Druckerei TrebbinClara Cram`, `Mitarbeiter Kunstverlag Mu` **Groups/committees:** - `Comite der Abschiedsfeier`, `Comite zur Errichtung eines Heine-Denkmals` - `Geschwister de Gruyter`, `Gesellschafter des Verlages` - `Garde du Corps` **Possessive group references:** - `Ella de Gruyters Eltern`, `Eugenie de Gruyters Kinder`, `Hilde de Gruyters Schwiegereltern`, `Eltern Muller` **Not a person at all:** - `Briefumschlag aus Java`, `Kondolenzbriefe zum Tod von Walter de Gruyter`, `Hochzeitsgedicht fur Paul u Luise de Gruyter` ## Solution ### Part 1 - Add PersonType enum ```java public enum PersonType { PERSON, INSTITUTION, GROUP, UNKNOWN } ``` Add a `type` field to `Person` entity, defaulting to `PERSON`. ### Part 2 - Migration Add `person_type` column to `persons` table with default `'PERSON'`. ### Part 3 - Import heuristics During mass import, classify entries based on keyword heuristics: **INSTITUTION** keywords: `Firma`, `GmbH`, `Verlag`, `Druckerei`, `Kunstverlag`, `Steuerfinanzamt`, `Reichsfechtschule` **GROUP** keywords: `Comite`, `Comite`, `Geschwister`, `Gesellschafter`, `Garde`, `Eltern`, `Kinder`, `Schwiegereltern`, `Mitarbeiter` **Skip entirely** (not even a record): `Briefumschlag`, `Kondolenzbriefe`, `Hochzeitsgedicht` - these are document descriptions, not sender/receiver names **UNKNOWN**: anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.) ### Part 4 - UI indication Person list and detail pages should display the type (icon or badge) so users can review and correct misclassified entries. ## Files | File | Change | |---|---| | `PersonType.java` (new) | Enum: PERSON, INSTITUTION, GROUP, UNKNOWN | | `Person.java` | Add `type` field with default PERSON | | `V{n}__add_person_type.sql` | Migration adding column | | `MassImportService.java` or `PersonNameParser.java` | Heuristic classification during import | | `PersonNameParserTest.java` | Tests for classification heuristics | | Frontend person list/detail | Display type badge | ## Found in ODS import file analysis for #190.
marcel added the featureperson labels 2026-04-07 18:28:43 +02:00
Author
Owner

Complete Input/Output Table

Every non-person entry from the ODS, classified by expected PersonType.

SKIP - Not a record at all (document descriptions, not sender/receiver)

# Raw input Column Why skip
1 Briefumschlag aus Java Von Envelope description, not a person
2 Kondolenzbriefe zum Tod von Walter de Gruyter Von Document description
3 Hochzeitsgedicht fur Paul u Luise de Gruyter Von Document description (also contains u separator - must not be parsed as persons)

INSTITUTION

# Raw input Column Heuristic keyword Notes
4 Arthur Collignon GmbH Von GmbH Company
5 Firma Auschrath Von Firma Company
6 Reichsfechtschule Von (known institution) Fencing school - no keyword heuristic, may need manual classification
7 Steuerfinanzamt Von (known institution) Tax office - no keyword heuristic, may need manual classification
8 Westermann u Co Von u Co / Co Company

GROUP

# Raw input Column Heuristic keyword Notes
9 Comite der Abschiedsfeier Von, An Comite Committee
10 Comite zur Errichtung eines Heine-Denkmals Von Comite Committee (note: accented e)
11 Garde du Corps Von Garde Military unit
12 Geschwister de Gruyter An Geschwister Sibling group
13 Gesellschafter des Verlages An Gesellschafter Business partners
14 Ella de Gruyters Eltern An Eltern Possessive group reference
15 Eugenie de Gruyters Kinder An Kinder Possessive group reference
16 Hilde de Gruyters Schwiegereltern An Schwiegereltern Possessive group reference
17 Eltern Muller An Eltern Group reference
18 Familie Cram Von Familie Family group (currently filtered in parseReceivers but not in Von)
19 Familie Hasenvlever Von Familie Family group
20 Mitarbeiter Druckerei TrebbinClara Cram Von Mitarbeiter Employee group (also has concatenation error in source data)
21 Mitarbeiter Kunstverlag Mu Von Mitarbeiter Employee group (truncated)
22 Mitarbeiter Verlag Von Mitarbeiter Employee group

Keyword heuristics summary

PersonType Keywords to detect
SKIP Briefumschlag, Kondolenzbriefe, Hochzeitsgedicht
INSTITUTION Firma, GmbH, Co (as word boundary)
GROUP Familie, Comite, Comite, Geschwister, Gesellschafter, Garde, Eltern, Kinder, Schwiegereltern, Mitarbeiter

Entries without keyword heuristics

# Raw input Expected type How to classify
6 Reichsfechtschule INSTITUTION No keyword match - needs manual review or a known-institution list
7 Steuerfinanzamt INSTITUTION Same - compound German nouns with no separator keyword

These two may need to remain UNKNOWN and be corrected by users in the UI, unless we add them to a static known-institution list.

PERSON entries that look ambiguous but ARE persons

Raw input Column Why it's a person
Architekt Korschelt u Renker Von Professional title + names - handle via title stripping (#212), u separator in Von column is a separate issue
Jappe Bakker u Frau Von Person + spouse reference - goes to split() not parseReceivers(), so u isn't handled. "u Frau" becomes part of the name
Oscar Knobloch etc Von Person + trailing noise ("etc")
Milly und Mutter Clara Von Person + relationship reference - goes to split(), und isn't handled
Gerd u Brigitte de Gruyter Von Two persons - goes to split(), u isn't handled

Note: Several Von entries contain und/u separators but go to split() (not parseReceivers()), so multi-person splitting doesn't apply. This is a separate concern - should Von entries also go through parseReceivers() first?

## Complete Input/Output Table Every non-person entry from the ODS, classified by expected PersonType. ### SKIP - Not a record at all (document descriptions, not sender/receiver) | # | Raw input | Column | Why skip | |---|---|---|---| | 1 | `Briefumschlag aus Java` | Von | Envelope description, not a person | | 2 | `Kondolenzbriefe zum Tod von Walter de Gruyter` | Von | Document description | | 3 | `Hochzeitsgedicht fur Paul u Luise de Gruyter` | Von | Document description (also contains `u` separator - must not be parsed as persons) | ### INSTITUTION | # | Raw input | Column | Heuristic keyword | Notes | |---|---|---|---|---| | 4 | `Arthur Collignon GmbH` | Von | `GmbH` | Company | | 5 | `Firma Auschrath` | Von | `Firma` | Company | | 6 | `Reichsfechtschule` | Von | (known institution) | Fencing school - no keyword heuristic, may need manual classification | | 7 | `Steuerfinanzamt` | Von | (known institution) | Tax office - no keyword heuristic, may need manual classification | | 8 | `Westermann u Co` | Von | `u Co` / `Co` | Company | ### GROUP | # | Raw input | Column | Heuristic keyword | Notes | |---|---|---|---|---| | 9 | `Comite der Abschiedsfeier` | Von, An | `Comite` | Committee | | 10 | `Comite zur Errichtung eines Heine-Denkmals` | Von | `Comite` | Committee (note: accented e) | | 11 | `Garde du Corps` | Von | `Garde` | Military unit | | 12 | `Geschwister de Gruyter` | An | `Geschwister` | Sibling group | | 13 | `Gesellschafter des Verlages` | An | `Gesellschafter` | Business partners | | 14 | `Ella de Gruyters Eltern` | An | `Eltern` | Possessive group reference | | 15 | `Eugenie de Gruyters Kinder` | An | `Kinder` | Possessive group reference | | 16 | `Hilde de Gruyters Schwiegereltern` | An | `Schwiegereltern` | Possessive group reference | | 17 | `Eltern Muller` | An | `Eltern` | Group reference | | 18 | `Familie Cram` | Von | `Familie` | Family group (currently filtered in `parseReceivers` but not in Von) | | 19 | `Familie Hasenvlever` | Von | `Familie` | Family group | | 20 | `Mitarbeiter Druckerei TrebbinClara Cram` | Von | `Mitarbeiter` | Employee group (also has concatenation error in source data) | | 21 | `Mitarbeiter Kunstverlag Mu` | Von | `Mitarbeiter` | Employee group (truncated) | | 22 | `Mitarbeiter Verlag` | Von | `Mitarbeiter` | Employee group | ### Keyword heuristics summary | PersonType | Keywords to detect | |---|---| | SKIP | `Briefumschlag`, `Kondolenzbriefe`, `Hochzeitsgedicht` | | INSTITUTION | `Firma`, `GmbH`, `Co` (as word boundary) | | GROUP | `Familie`, `Comite`, `Comite`, `Geschwister`, `Gesellschafter`, `Garde`, `Eltern`, `Kinder`, `Schwiegereltern`, `Mitarbeiter` | ### Entries without keyword heuristics | # | Raw input | Expected type | How to classify | |---|---|---|---| | 6 | `Reichsfechtschule` | INSTITUTION | No keyword match - needs manual review or a known-institution list | | 7 | `Steuerfinanzamt` | INSTITUTION | Same - compound German nouns with no separator keyword | These two may need to remain UNKNOWN and be corrected by users in the UI, unless we add them to a static known-institution list. ### PERSON entries that look ambiguous but ARE persons | Raw input | Column | Why it's a person | |---|---|---| | `Architekt Korschelt u Renker` | Von | Professional title + names - handle via title stripping (#212), `u` separator in Von column is a separate issue | | `Jappe Bakker u Frau` | Von | Person + spouse reference - goes to `split()` not `parseReceivers()`, so `u` isn't handled. "u Frau" becomes part of the name | | `Oscar Knobloch etc` | Von | Person + trailing noise ("etc") | | `Milly und Mutter Clara` | Von | Person + relationship reference - goes to `split()`, `und` isn't handled | | `Gerd u Brigitte de Gruyter` | Von | Two persons - goes to `split()`, `u` isn't handled | Note: Several Von entries contain `und`/`u` separators but go to `split()` (not `parseReceivers()`), so multi-person splitting doesn't apply. This is a separate concern - should Von entries also go through `parseReceivers()` first?
Author
Owner

Design Decision: firstName/lastName for non-persons

Decision: Full name goes into lastName, firstName stays null. Non-persons don't have first/last name structure - forcing them into it produces garbage.

This is consistent with #212 where firstName becomes nullable.

INSTITUTION examples

Raw input type title firstName lastName
Arthur Collignon GmbH INSTITUTION null null Arthur Collignon GmbH
Firma Auschrath INSTITUTION null null Firma Auschrath
Reichsfechtschule INSTITUTION null null Reichsfechtschule
Steuerfinanzamt INSTITUTION null null Steuerfinanzamt
Westermann u Co INSTITUTION null null Westermann u Co

GROUP examples

Raw input type title firstName lastName
Comite der Abschiedsfeier GROUP null null Comite der Abschiedsfeier
Comite zur Errichtung eines Heine-Denkmals GROUP null null Comite zur Errichtung eines Heine-Denkmals
Garde du Corps GROUP null null Garde du Corps
Geschwister de Gruyter GROUP null null Geschwister de Gruyter
Gesellschafter des Verlages GROUP null null Gesellschafter des Verlages
Ella de Gruyters Eltern GROUP null null Ella de Gruyters Eltern
Eugenie de Gruyters Kinder GROUP null null Eugenie de Gruyters Kinder
Hilde de Gruyters Schwiegereltern GROUP null null Hilde de Gruyters Schwiegereltern
Eltern Muller GROUP null null Eltern Muller
Familie Cram GROUP null null Familie Cram
Familie Hasenvlever GROUP null null Familie Hasenvlever
Mitarbeiter Verlag GROUP null null Mitarbeiter Verlag
Mitarbeiter Druckerei TrebbinClara Cram GROUP null null Mitarbeiter Druckerei TrebbinClara Cram
Mitarbeiter Kunstverlag Mu GROUP null null Mitarbeiter Kunstverlag Mu

Implementation note

When the heuristic classifies an entry as INSTITUTION or GROUP, skip split() entirely - don't attempt firstName/lastName extraction. Store the raw string (trimmed) as lastName and set firstName = null.

## Design Decision: firstName/lastName for non-persons **Decision:** Full name goes into `lastName`, `firstName` stays `null`. Non-persons don't have first/last name structure - forcing them into it produces garbage. This is consistent with #212 where `firstName` becomes nullable. ### INSTITUTION examples | Raw input | type | title | firstName | lastName | |---|---|---|---|---| | `Arthur Collignon GmbH` | INSTITUTION | null | null | `Arthur Collignon GmbH` | | `Firma Auschrath` | INSTITUTION | null | null | `Firma Auschrath` | | `Reichsfechtschule` | INSTITUTION | null | null | `Reichsfechtschule` | | `Steuerfinanzamt` | INSTITUTION | null | null | `Steuerfinanzamt` | | `Westermann u Co` | INSTITUTION | null | null | `Westermann u Co` | ### GROUP examples | Raw input | type | title | firstName | lastName | |---|---|---|---|---| | `Comite der Abschiedsfeier` | GROUP | null | null | `Comite der Abschiedsfeier` | | `Comite zur Errichtung eines Heine-Denkmals` | GROUP | null | null | `Comite zur Errichtung eines Heine-Denkmals` | | `Garde du Corps` | GROUP | null | null | `Garde du Corps` | | `Geschwister de Gruyter` | GROUP | null | null | `Geschwister de Gruyter` | | `Gesellschafter des Verlages` | GROUP | null | null | `Gesellschafter des Verlages` | | `Ella de Gruyters Eltern` | GROUP | null | null | `Ella de Gruyters Eltern` | | `Eugenie de Gruyters Kinder` | GROUP | null | null | `Eugenie de Gruyters Kinder` | | `Hilde de Gruyters Schwiegereltern` | GROUP | null | null | `Hilde de Gruyters Schwiegereltern` | | `Eltern Muller` | GROUP | null | null | `Eltern Muller` | | `Familie Cram` | GROUP | null | null | `Familie Cram` | | `Familie Hasenvlever` | GROUP | null | null | `Familie Hasenvlever` | | `Mitarbeiter Verlag` | GROUP | null | null | `Mitarbeiter Verlag` | | `Mitarbeiter Druckerei TrebbinClara Cram` | GROUP | null | null | `Mitarbeiter Druckerei TrebbinClara Cram` | | `Mitarbeiter Kunstverlag Mu` | GROUP | null | null | `Mitarbeiter Kunstverlag Mu` | ### Implementation note When the heuristic classifies an entry as INSTITUTION or GROUP, skip `split()` entirely - don't attempt firstName/lastName extraction. Store the raw string (trimmed) as `lastName` and set `firstName = null`.
Author
Owner

💻 Felix Brandt -- Senior Fullstack Developer

Questions & Observations

  • The issue mentions classification logic in MassImportService.java or a new PersonNameParser.java. Where exactly does the heuristic live? If PersonNameParser already exists and handles split() / parseReceivers(), the classification should happen before those methods are called -- otherwise non-person entries will still be split into bogus firstName/lastName pairs before the type is assigned. The comment 2 confirms this ("skip split() entirely"), but the issue body's file table lists both locations as options. Pick one.

  • The SKIP entries (Briefumschlag, Kondolenzbriefe, Hochzeitsgedicht) -- does "skip entirely" mean we silently drop them, or do we log a warning? Silent drops during import make debugging harder. A log.warn("Skipping non-person entry: {}", rawName) costs nothing and saves hours.

  • The keyword list mixes German-specific terms. What happens when someone imports a French or English document set? The heuristic is hardcoded to German vocabulary. Is that an accepted limitation for now, or should the keyword lists be externalized (e.g., in a config file or database table)?

  • Reichsfechtschule and Steuerfinanzamt have no keyword match. Comment 1 suggests they "remain UNKNOWN." But the issue body says UNKNOWN is for "anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.)." These two are real institutions, not suspicious strings. Should there be a static known-institution list, or do we accept that some entries will be UNKNOWN and corrected manually?

  • The PersonType enum has four values but no FAMILY type. Familie Cram and Familie Hasenvlever are classified as GROUP. Is "Familie" semantically the same as "Comite" or "Geschwister"? If users later want to filter "show me all families," they can't distinguish them from committees. Worth considering whether GROUP is sufficient or whether FAMILY should be a separate type.

Suggestions

  • Single responsibility for classification: Create a PersonTypeClassifier (or method on PersonNameParser) that takes a raw string and returns a PersonType. This is a pure function -- trivially testable, no dependencies. The MassImportService calls it before deciding whether to invoke split().

  • Test-first the keyword table: The comment 1 table is an excellent test fixture. Each row becomes a parameterized test case:

    @ParameterizedTest
    @CsvSource({
        "'Briefumschlag aus Java', SKIP",
        "'Arthur Collignon GmbH', INSTITUTION",
        "'Comite der Abschiedsfeier', GROUP",
        "'Ella de Gruyters Eltern', GROUP",
        "'Reichsfechtschule', UNKNOWN"
    })
    void shouldClassify(String input, PersonType expected) {
        assertThat(classifier.classify(input)).isEqualTo(expected);
    }
    

    This gives 22+ test cases directly from the spec.

  • Guard the enum default: The migration sets default 'PERSON'. Make sure the Java enum has PERSON as the JPA default too (@Column(columnDefinition = "varchar(20) default 'PERSON'") or @Builder.Default private PersonType type = PersonType.PERSON). Both layers must agree.

  • Frontend type badge: The issue says "icon or badge" but does not specify the visual treatment. Before implementing, define which component owns this (PersonCard? PersonListItem?) and what the badge looks like for each type. A PersonTypeBadge.svelte component receiving a type prop would be the clean split.

## :computer: Felix Brandt -- Senior Fullstack Developer ### Questions & Observations - The issue mentions classification logic in `MassImportService.java` or a new `PersonNameParser.java`. Where exactly does the heuristic live? If `PersonNameParser` already exists and handles `split()` / `parseReceivers()`, the classification should happen *before* those methods are called -- otherwise non-person entries will still be split into bogus firstName/lastName pairs before the type is assigned. The comment 2 confirms this ("skip `split()` entirely"), but the issue body's file table lists both locations as options. Pick one. - The SKIP entries (`Briefumschlag`, `Kondolenzbriefe`, `Hochzeitsgedicht`) -- does "skip entirely" mean we silently drop them, or do we log a warning? Silent drops during import make debugging harder. A `log.warn("Skipping non-person entry: {}", rawName)` costs nothing and saves hours. - The keyword list mixes German-specific terms. What happens when someone imports a French or English document set? The heuristic is hardcoded to German vocabulary. Is that an accepted limitation for now, or should the keyword lists be externalized (e.g., in a config file or database table)? - `Reichsfechtschule` and `Steuerfinanzamt` have no keyword match. Comment 1 suggests they "remain UNKNOWN." But the issue body says UNKNOWN is for "anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.)." These two are real institutions, not suspicious strings. Should there be a static known-institution list, or do we accept that some entries will be UNKNOWN and corrected manually? - The `PersonType` enum has four values but no `FAMILY` type. `Familie Cram` and `Familie Hasenvlever` are classified as GROUP. Is "Familie" semantically the same as "Comite" or "Geschwister"? If users later want to filter "show me all families," they can't distinguish them from committees. Worth considering whether GROUP is sufficient or whether FAMILY should be a separate type. ### Suggestions - **Single responsibility for classification**: Create a `PersonTypeClassifier` (or method on `PersonNameParser`) that takes a raw string and returns a `PersonType`. This is a pure function -- trivially testable, no dependencies. The `MassImportService` calls it before deciding whether to invoke `split()`. - **Test-first the keyword table**: The comment 1 table is an excellent test fixture. Each row becomes a parameterized test case: ```java @ParameterizedTest @CsvSource({ "'Briefumschlag aus Java', SKIP", "'Arthur Collignon GmbH', INSTITUTION", "'Comite der Abschiedsfeier', GROUP", "'Ella de Gruyters Eltern', GROUP", "'Reichsfechtschule', UNKNOWN" }) void shouldClassify(String input, PersonType expected) { assertThat(classifier.classify(input)).isEqualTo(expected); } ``` This gives 22+ test cases directly from the spec. - **Guard the enum default**: The migration sets default `'PERSON'`. Make sure the Java enum has `PERSON` as the JPA default too (`@Column(columnDefinition = "varchar(20) default 'PERSON'")` or `@Builder.Default private PersonType type = PersonType.PERSON`). Both layers must agree. - **Frontend type badge**: The issue says "icon or badge" but does not specify the visual treatment. Before implementing, define which component owns this (PersonCard? PersonListItem?) and what the badge looks like for each type. A `PersonTypeBadge.svelte` component receiving a `type` prop would be the clean split.
Author
Owner

🏗️ Markus Keller -- Application Architect

Questions & Observations

  • Domain boundary: The classification logic determines what a Person is. This is core domain logic, not import infrastructure. If it lives in MassImportService, it becomes coupled to the import flow. But users should also be able to change a Person's type from the UI (correct a misclassification). That means the type classification and the type persistence must be accessible from PersonService, not buried in import code.

  • Enum storage strategy: The issue says PersonType enum with a varchar column defaulting to 'PERSON'. This is correct -- @Enumerated(EnumType.STRING) with a varchar column. Never use ordinal storage for enums that may grow. But consider: should the column have a CHECK constraint limiting values to the known enum members? Flyway migration should include:

    ALTER TABLE persons ADD COLUMN person_type VARCHAR(20) NOT NULL DEFAULT 'PERSON'
        CHECK (person_type IN ('PERSON', 'INSTITUTION', 'GROUP', 'UNKNOWN'));
    

    This catches invalid values at the DB layer, not just the app layer.

  • SKIP semantics need clarification: The issue says entries like Briefumschlag aus Java should be skipped entirely -- "not even a record." But these raw strings came from the ODS as sender/receiver. If we skip them, the Document record loses its sender/receiver reference. Is that acceptable? Or should we store them as a note/comment on the Document instead? The data loss concern is real -- an archivist might want to know that the original spreadsheet said "Briefumschlag aus Java" even if it's not a person.

  • Cross-domain impact on DocumentService: Documents reference Person records as sender/receiver. If we now have Person records of type INSTITUTION and GROUP, does DocumentService need to handle them differently? For example, should the document search allow filtering by sender type? The issue is scoped to Person, but the ripple into Document display/search should be acknowledged.

  • The u separator issue in Von column: Comment 1 raises an important adjacent concern -- Von entries containing u/und separators go through split() not parseReceivers(), so multi-person splitting doesn't work. This is a separate issue but the classification logic must not be confused by it. Westermann u Co must be classified as INSTITUTION before any splitting attempt, or the u will be interpreted as a person separator.

Suggestions

  • Classification as a PersonService concern: Add a classifyPersonType(String rawName) method to PersonService (or a dedicated PersonTypeClassifier that PersonService delegates to). MassImportService calls PersonService to create persons -- the classification happens inside that creation flow, not before it.

  • Migration should backfill: Existing Person records created by previous imports include the garbage entries listed in the issue. The migration should not just add the column with default PERSON -- it should also run UPDATE statements to reclassify known non-persons:

    UPDATE persons SET person_type = 'INSTITUTION' WHERE last_name ILIKE '%GmbH%' OR last_name ILIKE 'Firma %';
    UPDATE persons SET person_type = 'GROUP' WHERE last_name ILIKE 'Familie %' OR last_name ILIKE 'Comite %' ...;
    

    Otherwise the existing garbage data stays classified as PERSON until someone manually fixes each one.

  • OpenAPI impact: Adding PersonType to the Person entity means the OpenAPI spec changes. The frontend TypeScript types need regeneration (npm run generate:api). The issue's file table should include frontend/src/lib/api/ or note the regeneration step.

  • Consider a displayName computed field: For non-persons, lastName holds the full name while firstName is null. Every frontend display location currently shows firstName + lastName. Rather than fixing every display site, consider a getDisplayName() method on Person that returns the correct representation based on type. This centralizes the display logic.

## :building_construction: Markus Keller -- Application Architect ### Questions & Observations - **Domain boundary**: The classification logic determines what a Person *is*. This is core domain logic, not import infrastructure. If it lives in `MassImportService`, it becomes coupled to the import flow. But users should also be able to change a Person's type from the UI (correct a misclassification). That means the type classification and the type persistence must be accessible from `PersonService`, not buried in import code. - **Enum storage strategy**: The issue says `PersonType` enum with a varchar column defaulting to `'PERSON'`. This is correct -- `@Enumerated(EnumType.STRING)` with a varchar column. Never use ordinal storage for enums that may grow. But consider: should the column have a CHECK constraint limiting values to the known enum members? Flyway migration should include: ```sql ALTER TABLE persons ADD COLUMN person_type VARCHAR(20) NOT NULL DEFAULT 'PERSON' CHECK (person_type IN ('PERSON', 'INSTITUTION', 'GROUP', 'UNKNOWN')); ``` This catches invalid values at the DB layer, not just the app layer. - **SKIP semantics need clarification**: The issue says entries like `Briefumschlag aus Java` should be skipped entirely -- "not even a record." But these raw strings came from the ODS as sender/receiver. If we skip them, the Document record loses its sender/receiver reference. Is that acceptable? Or should we store them as a note/comment on the Document instead? The data loss concern is real -- an archivist might want to know that the original spreadsheet said "Briefumschlag aus Java" even if it's not a person. - **Cross-domain impact on DocumentService**: Documents reference Person records as sender/receiver. If we now have Person records of type INSTITUTION and GROUP, does `DocumentService` need to handle them differently? For example, should the document search allow filtering by sender type? The issue is scoped to Person, but the ripple into Document display/search should be acknowledged. - **The `u` separator issue in Von column**: Comment 1 raises an important adjacent concern -- Von entries containing `u`/`und` separators go through `split()` not `parseReceivers()`, so multi-person splitting doesn't work. This is a separate issue but the classification logic must not be confused by it. `Westermann u Co` must be classified as INSTITUTION *before* any splitting attempt, or the `u` will be interpreted as a person separator. ### Suggestions - **Classification as a PersonService concern**: Add a `classifyPersonType(String rawName)` method to `PersonService` (or a dedicated `PersonTypeClassifier` that `PersonService` delegates to). `MassImportService` calls `PersonService` to create persons -- the classification happens inside that creation flow, not before it. - **Migration should backfill**: Existing Person records created by previous imports include the garbage entries listed in the issue. The migration should not just add the column with default PERSON -- it should also run UPDATE statements to reclassify known non-persons: ```sql UPDATE persons SET person_type = 'INSTITUTION' WHERE last_name ILIKE '%GmbH%' OR last_name ILIKE 'Firma %'; UPDATE persons SET person_type = 'GROUP' WHERE last_name ILIKE 'Familie %' OR last_name ILIKE 'Comite %' ...; ``` Otherwise the existing garbage data stays classified as PERSON until someone manually fixes each one. - **OpenAPI impact**: Adding `PersonType` to the Person entity means the OpenAPI spec changes. The frontend TypeScript types need regeneration (`npm run generate:api`). The issue's file table should include `frontend/src/lib/api/` or note the regeneration step. - **Consider a `displayName` computed field**: For non-persons, `lastName` holds the full name while `firstName` is null. Every frontend display location currently shows `firstName + lastName`. Rather than fixing every display site, consider a `getDisplayName()` method on Person that returns the correct representation based on type. This centralizes the display logic.
Author
Owner

🧪 Sara Holt -- QA Engineer & Test Strategist

Questions & Observations

  • The keyword heuristic is the core risk: False positives (real persons classified as INSTITUTION/GROUP) and false negatives (institutions slipping through as PERSON) are both data quality bugs. The issue provides 22 specific test cases in comment 1 -- but these are only the known cases from the current ODS. What about future imports with different data?

  • Edge cases not covered in the tables:

    • What about Dr. Firma Mueller -- a real person whose last name happens to be "Firma"? The keyword Firma would misclassify them as INSTITUTION.
    • What about Eltern as a last name (it exists in German)? The keyword match would misclassify.
    • Case sensitivity: firma auschrath vs Firma Auschrath -- does the heuristic do case-insensitive matching?
    • What about entries with leading/trailing whitespace or multiple spaces?
  • The SKIP category has no database record: How do we test that skipped entries truly produce no Person record? This needs an integration test that imports a row with Briefumschlag aus Java as sender and verifies: (a) no Person with that name exists, (b) the Document record's sender field is null or handled gracefully.

  • Migration backfill testability: If the migration includes UPDATE statements to reclassify existing records (as the architect might suggest), those UPDATEs need test coverage. A Testcontainers integration test should: load V(n-1) schema, insert known garbage Person records, run V(n) migration, verify person_type values are correct.

  • Frontend display of type badge: The issue says "icon or badge" on person list and detail pages. This needs visual regression coverage at the E2E layer -- a Playwright test that navigates to a person of type INSTITUTION and verifies the badge is visible.

Suggestions

  • Parameterized unit test suite: The 22 rows from comment 1 are a ready-made @ParameterizedTest data source. Add boundary cases:

    "Dr. Firma Mueller" -> PERSON (false positive guard)
    "  Firma Auschrath  " -> INSTITUTION (whitespace handling)
    "firma auschrath" -> INSTITUTION (case insensitivity)
    "" -> ? (empty string)
    null -> ? (null input)
    
  • Integration test for import flow: A @SpringBootTest that imports a minimal ODS containing one PERSON, one INSTITUTION, one GROUP, and one SKIP entry, then verifies:

    • 3 Person records created (not 4)
    • Each has the correct personType
    • The SKIP entry produced no Person record
    • The Document referencing the SKIP entry has null sender
  • Regression test for existing Familie filter: The issue mentions Familie is currently filtered out. Verify the new heuristic subsumes that existing filter without breaking it. If the old filter is removed, a test should confirm Familie Cram now creates a GROUP-typed Person instead of being silently dropped.

  • Test the enum at the DB layer: If a CHECK constraint is added, write an integration test that attempts to INSERT a Person with person_type = 'INVALID' and verifies the constraint violation.

## :test_tube: Sara Holt -- QA Engineer & Test Strategist ### Questions & Observations - **The keyword heuristic is the core risk**: False positives (real persons classified as INSTITUTION/GROUP) and false negatives (institutions slipping through as PERSON) are both data quality bugs. The issue provides 22 specific test cases in comment 1 -- but these are only the *known* cases from the current ODS. What about future imports with different data? - **Edge cases not covered in the tables**: - What about `Dr. Firma Mueller` -- a real person whose last name happens to be "Firma"? The keyword `Firma` would misclassify them as INSTITUTION. - What about `Eltern` as a last name (it exists in German)? The keyword match would misclassify. - Case sensitivity: `firma auschrath` vs `Firma Auschrath` -- does the heuristic do case-insensitive matching? - What about entries with leading/trailing whitespace or multiple spaces? - **The SKIP category has no database record**: How do we test that skipped entries truly produce no Person record? This needs an integration test that imports a row with `Briefumschlag aus Java` as sender and verifies: (a) no Person with that name exists, (b) the Document record's sender field is null or handled gracefully. - **Migration backfill testability**: If the migration includes UPDATE statements to reclassify existing records (as the architect might suggest), those UPDATEs need test coverage. A Testcontainers integration test should: load V(n-1) schema, insert known garbage Person records, run V(n) migration, verify person_type values are correct. - **Frontend display of type badge**: The issue says "icon or badge" on person list and detail pages. This needs visual regression coverage at the E2E layer -- a Playwright test that navigates to a person of type INSTITUTION and verifies the badge is visible. ### Suggestions - **Parameterized unit test suite**: The 22 rows from comment 1 are a ready-made `@ParameterizedTest` data source. Add boundary cases: ``` "Dr. Firma Mueller" -> PERSON (false positive guard) " Firma Auschrath " -> INSTITUTION (whitespace handling) "firma auschrath" -> INSTITUTION (case insensitivity) "" -> ? (empty string) null -> ? (null input) ``` - **Integration test for import flow**: A `@SpringBootTest` that imports a minimal ODS containing one PERSON, one INSTITUTION, one GROUP, and one SKIP entry, then verifies: - 3 Person records created (not 4) - Each has the correct `personType` - The SKIP entry produced no Person record - The Document referencing the SKIP entry has null sender - **Regression test for existing `Familie` filter**: The issue mentions `Familie` is currently filtered out. Verify the new heuristic subsumes that existing filter without breaking it. If the old filter is removed, a test should confirm `Familie Cram` now creates a GROUP-typed Person instead of being silently dropped. - **Test the enum at the DB layer**: If a CHECK constraint is added, write an integration test that attempts to INSERT a Person with `person_type = 'INVALID'` and verifies the constraint violation.
Author
Owner

🛡️ Nora "NullX" Steiner -- Application Security Engineer

Questions & Observations

  • Mass assignment on PersonType: When the type field is added to the Person entity, it becomes part of the JPA model. If Person objects are accepted as request bodies (or via a DTO that maps to Person), can a user set type to any value via the API? For example, could someone POST {"type": "PERSON"} to reclassify an INSTITUTION back to PERSON, bypassing the heuristic? The type field should be explicitly controlled -- not blindly accepted from user input unless the user has appropriate permissions.

  • Keyword matching and injection: The heuristic matches keywords in raw ODS cell values. These values are user-supplied data (whoever created the spreadsheet). If the keyword matching uses regex, are the patterns safe from ReDoS? Simple String.contains() or startsWith() checks are fine, but if someone decides to use Pattern.compile() with user-derived patterns later, that's a risk. Keep the matching simple.

  • SKIP entries and data integrity: If entries are silently skipped during import, there's no audit trail. An attacker (or a data entry error) that names a real person "Briefumschlag aus Java" would have that person silently dropped from the import. This is a minor concern for a family archive, but logging skipped entries (at minimum) provides an audit trail. Consider storing skipped entries in an import log table rather than silently discarding them.

  • Enum values in the database: The person_type column stores string values. If the application ever reads a value that doesn't match the Java enum (e.g., due to a manual DB edit or a migration error), JPA will throw an IllegalArgumentException. This is a crash, not a graceful error. Consider whether a fallback to UNKNOWN is appropriate for unrecognized values.

  • Permission model: Who can change a Person's type in the UI? The issue mentions users can "review and correct misclassified entries." Does this require WRITE_ALL? ADMIN? If any authenticated user can reclassify a Person, that's probably fine for a family archive. But if the type field influences business logic (e.g., filtering, search), unauthorized type changes could affect data visibility for other users.

Suggestions

  • Whitelist type in the DTO: If PersonUpdateDTO or similar exists, explicitly include type as an allowed field with validation (@NotNull, restricted to valid enum values). Do not rely on JPA's enum parsing alone for input validation -- a malformed string should return 400, not 500.

  • Log all SKIP decisions: During import, log every skipped entry at WARN level with the raw value and the reason:

    log.warn("Import: skipping non-person entry '{}' (matched SKIP keyword '{}')", rawName, matchedKeyword);
    

    This is cheap insurance against silent data loss.

  • CHECK constraint in the migration: As noted by the architect, a DB-level CHECK constraint prevents invalid enum values from being inserted by any path (application, migration script, manual DB access). This is defense in depth.

  • No security blockers: This feature doesn't introduce new attack surface beyond the standard CRUD operations. The main concern is data integrity (silent drops, misclassification) rather than traditional security vulnerabilities.

## :shield: Nora "NullX" Steiner -- Application Security Engineer ### Questions & Observations - **Mass assignment on PersonType**: When the `type` field is added to the Person entity, it becomes part of the JPA model. If Person objects are accepted as request bodies (or via a DTO that maps to Person), can a user set `type` to any value via the API? For example, could someone POST `{"type": "PERSON"}` to reclassify an INSTITUTION back to PERSON, bypassing the heuristic? The `type` field should be explicitly controlled -- not blindly accepted from user input unless the user has appropriate permissions. - **Keyword matching and injection**: The heuristic matches keywords in raw ODS cell values. These values are user-supplied data (whoever created the spreadsheet). If the keyword matching uses regex, are the patterns safe from ReDoS? Simple `String.contains()` or `startsWith()` checks are fine, but if someone decides to use `Pattern.compile()` with user-derived patterns later, that's a risk. Keep the matching simple. - **SKIP entries and data integrity**: If entries are silently skipped during import, there's no audit trail. An attacker (or a data entry error) that names a real person "Briefumschlag aus Java" would have that person silently dropped from the import. This is a minor concern for a family archive, but logging skipped entries (at minimum) provides an audit trail. Consider storing skipped entries in an import log table rather than silently discarding them. - **Enum values in the database**: The `person_type` column stores string values. If the application ever reads a value that doesn't match the Java enum (e.g., due to a manual DB edit or a migration error), JPA will throw an `IllegalArgumentException`. This is a crash, not a graceful error. Consider whether a fallback to UNKNOWN is appropriate for unrecognized values. - **Permission model**: Who can change a Person's type in the UI? The issue mentions users can "review and correct misclassified entries." Does this require `WRITE_ALL`? `ADMIN`? If any authenticated user can reclassify a Person, that's probably fine for a family archive. But if the type field influences business logic (e.g., filtering, search), unauthorized type changes could affect data visibility for other users. ### Suggestions - **Whitelist `type` in the DTO**: If `PersonUpdateDTO` or similar exists, explicitly include `type` as an allowed field with validation (`@NotNull`, restricted to valid enum values). Do not rely on JPA's enum parsing alone for input validation -- a malformed string should return 400, not 500. - **Log all SKIP decisions**: During import, log every skipped entry at WARN level with the raw value and the reason: ```java log.warn("Import: skipping non-person entry '{}' (matched SKIP keyword '{}')", rawName, matchedKeyword); ``` This is cheap insurance against silent data loss. - **CHECK constraint in the migration**: As noted by the architect, a DB-level CHECK constraint prevents invalid enum values from being inserted by any path (application, migration script, manual DB access). This is defense in depth. - **No security blockers**: This feature doesn't introduce new attack surface beyond the standard CRUD operations. The main concern is data integrity (silent drops, misclassification) rather than traditional security vulnerabilities.
Author
Owner

🎨 Leonie Voss -- UI/UX Design Lead

Questions & Observations

  • Part 4 is underspecified: "Person list and detail pages should display the type (icon or badge)" -- but which visual treatment? A text badge, an icon, a colored dot? The choice matters for accessibility and scannability. On the person list, users will scan dozens of entries. The type indicator must be visible at a glance without dominating the layout.

  • Four types, four visual treatments: PERSON, INSTITUTION, GROUP, UNKNOWN each need a distinct, accessible indicator. Color alone is not sufficient (WCAG 1.4.1). Each type needs at minimum: a label and either an icon or a distinct shape/pattern.

  • Person list density impact: Adding a badge to every row in the person list increases visual noise. Most persons are type PERSON -- do we show a badge for PERSON, or only for non-PERSON types? Showing a badge on every row when 90%+ are PERSON adds clutter without information. Consider: badge only for INSTITUTION, GROUP, and UNKNOWN. PERSON is the unmarked default.

  • Senior user consideration (60+): Badges with abbreviated text (INST, GRP) may be unclear. Full words (Institution, Gruppe) are clearer but take more space. An icon + tooltip is compact but tooltips are inaccessible on touch devices. The safest pattern: small text badge with full word, using the brand-sand background and brand-navy text.

  • Detail page placement: On the person detail/edit page, where does the type appear? Next to the name? In a separate metadata row? If the type is editable (users correcting misclassifications), it needs a select/dropdown on the edit page. The edit page already has firstName/lastName fields -- for INSTITUTION/GROUP where firstName is null, the edit page should hide/disable the firstName field and show a single "Name" field instead.

  • The UNKNOWN type is a call to action: An UNKNOWN badge should visually signal "this needs review" -- perhaps with a yellow/amber background or an attention icon. It's different from PERSON/INSTITUTION/GROUP which are settled classifications.

Suggestions

  • Badge component spec:

    • PERSON: no badge (unmarked default)
    • INSTITUTION: bg-blue-50 text-blue-700 border border-blue-200 with building icon, text "Institution"
    • GROUP: bg-purple-50 text-purple-700 border border-purple-200 with people icon, text "Gruppe"
    • UNKNOWN: bg-amber-50 text-amber-700 border border-amber-200 with question-mark icon, text "Unbekannt"
    • All badges: text-xs font-medium px-2 py-0.5 rounded-full -- minimum 44px tap target when clickable
    • Pair each color with an icon so color is never the sole differentiator
  • Edit page adaptation for non-persons: When type is INSTITUTION or GROUP:

    • Hide the firstName field
    • Relabel lastName to "Name" or "Bezeichnung"
    • Show the type as an editable dropdown
    • This prevents confusion from seeing an empty firstName field next to a full institution name in lastName
  • Filter by type on the person list: If users need to review UNKNOWN entries, a filter chip or tab for person type would let them quickly find records that need correction. This could be a follow-up issue, but it's worth noting now.

  • Dark mode consideration: The badge colors above should have dark mode equivalents defined. bg-blue-50 becomes dark:bg-blue-900/30, etc. Define the token mapping before implementation.

## :art: Leonie Voss -- UI/UX Design Lead ### Questions & Observations - **Part 4 is underspecified**: "Person list and detail pages should display the type (icon or badge)" -- but which visual treatment? A text badge, an icon, a colored dot? The choice matters for accessibility and scannability. On the person list, users will scan dozens of entries. The type indicator must be visible at a glance without dominating the layout. - **Four types, four visual treatments**: PERSON, INSTITUTION, GROUP, UNKNOWN each need a distinct, accessible indicator. Color alone is not sufficient (WCAG 1.4.1). Each type needs at minimum: a label *and* either an icon or a distinct shape/pattern. - **Person list density impact**: Adding a badge to every row in the person list increases visual noise. Most persons are type PERSON -- do we show a badge for PERSON, or only for non-PERSON types? Showing a badge on every row when 90%+ are PERSON adds clutter without information. Consider: badge only for INSTITUTION, GROUP, and UNKNOWN. PERSON is the unmarked default. - **Senior user consideration (60+)**: Badges with abbreviated text (`INST`, `GRP`) may be unclear. Full words (`Institution`, `Gruppe`) are clearer but take more space. An icon + tooltip is compact but tooltips are inaccessible on touch devices. The safest pattern: small text badge with full word, using the brand-sand background and brand-navy text. - **Detail page placement**: On the person detail/edit page, where does the type appear? Next to the name? In a separate metadata row? If the type is editable (users correcting misclassifications), it needs a select/dropdown on the edit page. The edit page already has firstName/lastName fields -- for INSTITUTION/GROUP where firstName is null, the edit page should hide/disable the firstName field and show a single "Name" field instead. - **The UNKNOWN type is a call to action**: An UNKNOWN badge should visually signal "this needs review" -- perhaps with a yellow/amber background or an attention icon. It's different from PERSON/INSTITUTION/GROUP which are settled classifications. ### Suggestions - **Badge component spec**: - PERSON: no badge (unmarked default) - INSTITUTION: `bg-blue-50 text-blue-700 border border-blue-200` with building icon, text "Institution" - GROUP: `bg-purple-50 text-purple-700 border border-purple-200` with people icon, text "Gruppe" - UNKNOWN: `bg-amber-50 text-amber-700 border border-amber-200` with question-mark icon, text "Unbekannt" - All badges: `text-xs font-medium px-2 py-0.5 rounded-full` -- minimum 44px tap target when clickable - Pair each color with an icon so color is never the sole differentiator - **Edit page adaptation for non-persons**: When type is INSTITUTION or GROUP: - Hide the firstName field - Relabel lastName to "Name" or "Bezeichnung" - Show the type as an editable dropdown - This prevents confusion from seeing an empty firstName field next to a full institution name in lastName - **Filter by type on the person list**: If users need to review UNKNOWN entries, a filter chip or tab for person type would let them quickly find records that need correction. This could be a follow-up issue, but it's worth noting now. - **Dark mode consideration**: The badge colors above should have dark mode equivalents defined. `bg-blue-50` becomes `dark:bg-blue-900/30`, etc. Define the token mapping before implementation.
Author
Owner

🐳 Tobias Wendt -- DevOps & Platform Engineer

Questions & Observations

  • Flyway migration ordering: The issue mentions V{n}__add_person_type.sql but doesn't specify the version number. With multiple features in flight (#211, #212 referencing nullable firstName), migration ordering matters. If both land around the same time, we need to ensure no conflicts. What's the current highest migration version? The migration should be numbered to land after any in-progress migrations from other branches.

  • Migration runtime on existing data: How many Person records exist currently? The ALTER TABLE ... ADD COLUMN ... DEFAULT 'PERSON' is fast on PostgreSQL (metadata-only change for a non-null column with a default since PG 11). But if the migration includes UPDATE statements for backfilling known non-persons (matching keywords against existing lastName values), that's a table scan. For a family archive this is likely trivial (hundreds, not millions of rows), but worth confirming.

  • No infrastructure changes required: This feature adds an enum column and application logic. No new services, no new containers, no new ports, no new volumes. The Docker Compose setup, CI pipeline, and deployment process are unaffected. This is a clean, contained change from an infrastructure perspective.

  • CI impact: The new PersonTypeClassifier (or equivalent) will have unit tests. These are fast and won't affect CI pipeline duration. The integration test for import flow with classification might add a few seconds to the Testcontainers suite. No concern here.

  • OpenAPI spec regeneration: The PersonType enum being added to the Person entity means the OpenAPI spec changes. The CI pipeline should catch type mismatches if the frontend types aren't regenerated. Does the current CI pipeline include a step that verifies generated types are up to date? If not, this is a good time to add one -- a npm run generate:api && git diff --exit-code check would catch drift.

Suggestions

  • Coordinate migration numbering with #212: If #212 (nullable firstName) is in progress on another branch, agree on version numbers now to avoid Flyway conflicts at merge time. One approach: assign this issue the next available version and have #212 take the one after, or vice versa.

  • Add an OpenAPI drift check to CI: A lightweight CI step that regenerates the TypeScript API types and fails if they differ from what's committed:

    - name: Check OpenAPI types are up to date
      run: |
        cd frontend
        npm run generate:api
        git diff --exit-code src/lib/api/
    

    This prevents the "forgot to regenerate types" class of bugs permanently.

  • No deployment considerations: Since this is a schema migration + application code change, a standard deploy (pull new image, run migrations, restart) handles it. No blue-green, no feature flag needed. The migration is backwards-compatible -- the column has a default, so the old application code would work fine alongside the new schema.

## :whale: Tobias Wendt -- DevOps & Platform Engineer ### Questions & Observations - **Flyway migration ordering**: The issue mentions `V{n}__add_person_type.sql` but doesn't specify the version number. With multiple features in flight (#211, #212 referencing nullable firstName), migration ordering matters. If both land around the same time, we need to ensure no conflicts. What's the current highest migration version? The migration should be numbered to land after any in-progress migrations from other branches. - **Migration runtime on existing data**: How many Person records exist currently? The `ALTER TABLE ... ADD COLUMN ... DEFAULT 'PERSON'` is fast on PostgreSQL (metadata-only change for a non-null column with a default since PG 11). But if the migration includes UPDATE statements for backfilling known non-persons (matching keywords against existing lastName values), that's a table scan. For a family archive this is likely trivial (hundreds, not millions of rows), but worth confirming. - **No infrastructure changes required**: This feature adds an enum column and application logic. No new services, no new containers, no new ports, no new volumes. The Docker Compose setup, CI pipeline, and deployment process are unaffected. This is a clean, contained change from an infrastructure perspective. - **CI impact**: The new `PersonTypeClassifier` (or equivalent) will have unit tests. These are fast and won't affect CI pipeline duration. The integration test for import flow with classification might add a few seconds to the Testcontainers suite. No concern here. - **OpenAPI spec regeneration**: The `PersonType` enum being added to the Person entity means the OpenAPI spec changes. The CI pipeline should catch type mismatches if the frontend types aren't regenerated. Does the current CI pipeline include a step that verifies generated types are up to date? If not, this is a good time to add one -- a `npm run generate:api && git diff --exit-code` check would catch drift. ### Suggestions - **Coordinate migration numbering with #212**: If #212 (nullable firstName) is in progress on another branch, agree on version numbers now to avoid Flyway conflicts at merge time. One approach: assign this issue the next available version and have #212 take the one after, or vice versa. - **Add an OpenAPI drift check to CI**: A lightweight CI step that regenerates the TypeScript API types and fails if they differ from what's committed: ```yaml - name: Check OpenAPI types are up to date run: | cd frontend npm run generate:api git diff --exit-code src/lib/api/ ``` This prevents the "forgot to regenerate types" class of bugs permanently. - **No deployment considerations**: Since this is a schema migration + application code change, a standard deploy (pull new image, run migrations, restart) handles it. No blue-green, no feature flag needed. The migration is backwards-compatible -- the column has a default, so the old application code would work fine alongside the new schema.
Author
Owner

🏗️ Markus Keller — Application Architect (Discussion Summary)

Interactive discussion with Marcel covering 5 open items. All resolved.

Resolved Items

  • Classification logic location — New PersonTypeClassifier utility class (like PersonNameParser). Pure static method PersonType classify(String rawName), no dependencies, trivially testable. Called by PersonService.findOrCreateByAlias() before deciding whether to invoke split(). Keeps classification reusable from any entry point — import, UI, API. Separate from PersonNameParser because classification is a different concern from name parsing.

  • SKIP entries and data loss — Decision: store raw Von/An value in the Document's notes field when classification returns SKIP. The Document entity already has a notes TEXT column displayed on the detail page. No schema change needed, no fake Person records, no silent data loss. Example: Briefumschlag aus Java -> Document notes: "[Import] Von: Briefumschlag aus Java (kein Personeneintrag)".

  • False positive risk in keyword matching — Decision: position-aware matching. Start-of-string for most keywords (Firma, Comite, Mitarbeiter, Geschwister, Eltern, Familie, etc.), end-of-string for GmbH/Co. This means Dr. Firma Mueller is correctly classified as PERSON (does not start with "Firma"), while Firma Auschrath is correctly classified as INSTITUTION. Eliminates the false positive class Sara identified.

  • Migration backfillNot needed. Still in dev mode — clean DB before re-import. No UPDATE statements in the migration.

  • Von-column multi-person entriesSeparate issue created: #214. Von entries with und/u separators (Gerd u Brigitte de Gruyter, Hans u Alma Cram, etc.) bypass parseReceivers(). Fix is to route Von entries through parseReceivers() after classification. Classifier must run first so Westermann u Co (INSTITUTION) is not split on u. Depends on this issue (#211) and #213.

Additional Decision: SKIP enum value

Added SKIP to the PersonType enum (in #213 scope). Never persisted to the database — PersonService checks for it and returns early. Makes the classifier's return type exhaustive and self-documenting, cleaner than a null convention.

Dependency Chain

#190 (merged) -> #213 (preparatory refactor, now includes SKIP in enum) -> #211 (classifier + heuristics) -> #214 (Von multi-person splitting)

## 🏗️ Markus Keller — Application Architect (Discussion Summary) Interactive discussion with Marcel covering 5 open items. All resolved. ### Resolved Items - **Classification logic location** — New `PersonTypeClassifier` utility class (like `PersonNameParser`). Pure static method `PersonType classify(String rawName)`, no dependencies, trivially testable. Called by `PersonService.findOrCreateByAlias()` before deciding whether to invoke `split()`. Keeps classification reusable from any entry point — import, UI, API. Separate from `PersonNameParser` because classification is a different concern from name parsing. - **SKIP entries and data loss** — Decision: **store raw Von/An value in the Document's `notes` field** when classification returns `SKIP`. The Document entity already has a `notes` TEXT column displayed on the detail page. No schema change needed, no fake Person records, no silent data loss. Example: `Briefumschlag aus Java` -> Document notes: `"[Import] Von: Briefumschlag aus Java (kein Personeneintrag)"`. - **False positive risk in keyword matching** — Decision: **position-aware matching**. Start-of-string for most keywords (`Firma`, `Comite`, `Mitarbeiter`, `Geschwister`, `Eltern`, `Familie`, etc.), end-of-string for `GmbH`/`Co`. This means `Dr. Firma Mueller` is correctly classified as PERSON (does not start with "Firma"), while `Firma Auschrath` is correctly classified as INSTITUTION. Eliminates the false positive class Sara identified. - **Migration backfill** — **Not needed**. Still in dev mode — clean DB before re-import. No UPDATE statements in the migration. - **Von-column multi-person entries** — **Separate issue created: #214**. Von entries with `und`/`u` separators (`Gerd u Brigitte de Gruyter`, `Hans u Alma Cram`, etc.) bypass `parseReceivers()`. Fix is to route Von entries through `parseReceivers()` after classification. Classifier must run first so `Westermann u Co` (INSTITUTION) is not split on `u`. Depends on this issue (#211) and #213. ### Additional Decision: SKIP enum value Added `SKIP` to the `PersonType` enum (in #213 scope). Never persisted to the database — `PersonService` checks for it and returns early. Makes the classifier's return type exhaustive and self-documenting, cleaner than a null convention. ### Dependency Chain #190 (merged) -> #213 (preparatory refactor, now includes `SKIP` in enum) -> #211 (classifier + heuristics) -> #214 (Von multi-person splitting)
Author
Owner

🏗️ Markus Keller — Application Architect (Discussion: "Architekt" reclassification)

Follow-up discussion with Marcel on the classification of Architekt Korschelt u Renker.

Resolved Items

  • "Architekt" is an INSTITUTION keywordArchitekt Korschelt u Renker is not a person with a professional title — "Korschelt u Renker" is the name of the architect business. Add Architekt to the INSTITUTION keyword list with start-of-string matching (same rule as Firma). The classifier catches it as INSTITUTION before any splitting (#214) or title stripping (#212) runs.

  • No ambiguity handling neededArchitekt is always an institution indicator in this dataset. Any future edge case (e.g., a person genuinely titled "Architekt") will be corrected manually via the UI. No disambiguation logic required.

  • #214 updated — Moved Architekt Korschelt u Renker from #214's split examples table to the "must NOT be split" table, classified as INSTITUTION by #211.

  • No compound "u"-partnership rule — Considered detecting a broader pattern (profession + name u name → INSTITUTION) but decided against it. Only one occurrence in the dataset. Start-of-string match on Architekt is sufficient. KISS.

Updated INSTITUTION keyword list

Keyword Match rule
Firma start-of-string
GmbH end-of-string
Co end-of-string (word boundary)
Architekt start-of-string ← new
## 🏗️ Markus Keller — Application Architect (Discussion: "Architekt" reclassification) Follow-up discussion with Marcel on the classification of `Architekt Korschelt u Renker`. ### Resolved Items - **"Architekt" is an INSTITUTION keyword** — `Architekt Korschelt u Renker` is not a person with a professional title — "Korschelt u Renker" is the name of the architect business. Add `Architekt` to the INSTITUTION keyword list with start-of-string matching (same rule as `Firma`). The classifier catches it as INSTITUTION before any splitting (#214) or title stripping (#212) runs. - **No ambiguity handling needed** — `Architekt` is always an institution indicator in this dataset. Any future edge case (e.g., a person genuinely titled "Architekt") will be corrected manually via the UI. No disambiguation logic required. - **#214 updated** — Moved `Architekt Korschelt u Renker` from #214's split examples table to the "must NOT be split" table, classified as INSTITUTION by #211. - **No compound "u"-partnership rule** — Considered detecting a broader pattern (profession + name `u` name → INSTITUTION) but decided against it. Only one occurrence in the dataset. Start-of-string match on `Architekt` is sufficient. KISS. ### Updated INSTITUTION keyword list | Keyword | Match rule | |---|---| | `Firma` | start-of-string | | `GmbH` | end-of-string | | `Co` | end-of-string (word boundary) | | `Architekt` | start-of-string ← **new** |
Author
Owner

Implementation Complete

Parts 1-4 implemented on branch feat/issues-209-213-person-parser-enhancements. Parts 1-2 (enum, migration) were already done in #213.

Commits

Commit Description
68f0c4c PersonTypeClassifier with 25 parameterized test cases
a3da573 Integrate classifier into findOrCreateByAlias()
6ee1ef7 PersonTypeBadge component on person list and detail pages

What changed

  • PersonTypeClassifier — pure static classifier with position-aware keyword matching: SKIP (3 keywords), INSTITUTION (4 keywords), GROUP (10 keywords)
  • PersonService.findOrCreateByAlias() — calls classifier before processing. SKIP returns null. INSTITUTION/GROUP skip split() and store full name in lastName with firstName=null
  • PersonTypeBadge.svelte — colored badge for non-PERSON types (blue=Institution, purple=Group, amber=Unknown) shown on person list and detail pages
  • False positive guard: Dr. Firma Mueller correctly classified as PERSON (start-of-string matching)

Not in scope (deferred)

  • SKIP entry handling in MassImportService (storing in document notes) — separate follow-up
  • Edit page type dropdown — follow-up
  • Filter by type on person list — follow-up

Test results

  • Backend: 735 tests passing (25 classifier + 3 service tests)
## Implementation Complete Parts 1-4 implemented on branch `feat/issues-209-213-person-parser-enhancements`. Parts 1-2 (enum, migration) were already done in #213. ### Commits | Commit | Description | |--------|-------------| | `68f0c4c` | PersonTypeClassifier with 25 parameterized test cases | | `a3da573` | Integrate classifier into findOrCreateByAlias() | | `6ee1ef7` | PersonTypeBadge component on person list and detail pages | ### What changed - **PersonTypeClassifier** — pure static classifier with position-aware keyword matching: SKIP (3 keywords), INSTITUTION (4 keywords), GROUP (10 keywords) - **PersonService.findOrCreateByAlias()** — calls classifier before processing. SKIP returns null. INSTITUTION/GROUP skip split() and store full name in lastName with firstName=null - **PersonTypeBadge.svelte** — colored badge for non-PERSON types (blue=Institution, purple=Group, amber=Unknown) shown on person list and detail pages - False positive guard: `Dr. Firma Mueller` correctly classified as PERSON (start-of-string matching) ### Not in scope (deferred) - SKIP entry handling in MassImportService (storing in document notes) — separate follow-up - Edit page type dropdown — follow-up - Filter by type on person list — follow-up ### Test results - Backend: 735 tests passing (25 classifier + 3 service tests)
Sign in to join this conversation.
No Label feature person
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#211