feat(model): add PersonType enum to distinguish persons from institutions/groups #211
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The mass import creates bogus Person records for entries that are not persons. Currently only "Familie" is filtered out. The ODS file contains at least 19 non-person entries that create garbage Person records:
Institutions/companies:
Reichsfechtschule,Steuerfinanzamt,Firma Auschrath,Arthur Collignon GmbHMitarbeiter Verlag,Mitarbeiter Druckerei TrebbinClara Cram,Mitarbeiter Kunstverlag MuGroups/committees:
Comite der Abschiedsfeier,Comite zur Errichtung eines Heine-DenkmalsGeschwister de Gruyter,Gesellschafter des VerlagesGarde du CorpsPossessive group references:
Ella de Gruyters Eltern,Eugenie de Gruyters Kinder,Hilde de Gruyters Schwiegereltern,Eltern MullerNot a person at all:
Briefumschlag aus Java,Kondolenzbriefe zum Tod von Walter de Gruyter,Hochzeitsgedicht fur Paul u Luise de GruyterSolution
Part 1 - Add PersonType enum
Add a
typefield toPersonentity, defaulting toPERSON.Part 2 - Migration
Add
person_typecolumn topersonstable with default'PERSON'.Part 3 - Import heuristics
During mass import, classify entries based on keyword heuristics:
INSTITUTION keywords:
Firma,GmbH,Verlag,Druckerei,Kunstverlag,Steuerfinanzamt,ReichsfechtschuleGROUP keywords:
Comite,Comite,Geschwister,Gesellschafter,Garde,Eltern,Kinder,Schwiegereltern,MitarbeiterSkip entirely (not even a record):
Briefumschlag,Kondolenzbriefe,Hochzeitsgedicht- these are document descriptions, not sender/receiver namesUNKNOWN: anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.)
Part 4 - UI indication
Person list and detail pages should display the type (icon or badge) so users can review and correct misclassified entries.
Files
PersonType.java(new)Person.javatypefield with default PERSONV{n}__add_person_type.sqlMassImportService.javaorPersonNameParser.javaPersonNameParserTest.javaFound in
ODS import file analysis for #190.
Complete Input/Output Table
Every non-person entry from the ODS, classified by expected PersonType.
SKIP - Not a record at all (document descriptions, not sender/receiver)
Briefumschlag aus JavaKondolenzbriefe zum Tod von Walter de GruyterHochzeitsgedicht fur Paul u Luise de Gruyteruseparator - must not be parsed as persons)INSTITUTION
Arthur Collignon GmbHGmbHFirma AuschrathFirmaReichsfechtschuleSteuerfinanzamtWestermann u Cou Co/CoGROUP
Comite der AbschiedsfeierComiteComite zur Errichtung eines Heine-DenkmalsComiteGarde du CorpsGardeGeschwister de GruyterGeschwisterGesellschafter des VerlagesGesellschafterElla de Gruyters ElternElternEugenie de Gruyters KinderKinderHilde de Gruyters SchwiegerelternSchwiegerelternEltern MullerElternFamilie CramFamilieparseReceiversbut not in Von)Familie HasenvleverFamilieMitarbeiter Druckerei TrebbinClara CramMitarbeiterMitarbeiter Kunstverlag MuMitarbeiterMitarbeiter VerlagMitarbeiterKeyword heuristics summary
Briefumschlag,Kondolenzbriefe,HochzeitsgedichtFirma,GmbH,Co(as word boundary)Familie,Comite,Comite,Geschwister,Gesellschafter,Garde,Eltern,Kinder,Schwiegereltern,MitarbeiterEntries without keyword heuristics
ReichsfechtschuleSteuerfinanzamtThese two may need to remain UNKNOWN and be corrected by users in the UI, unless we add them to a static known-institution list.
PERSON entries that look ambiguous but ARE persons
Architekt Korschelt u Renkeruseparator in Von column is a separate issueJappe Bakker u Frausplit()notparseReceivers(), souisn't handled. "u Frau" becomes part of the nameOscar Knobloch etcMilly und Mutter Clarasplit(),undisn't handledGerd u Brigitte de Gruytersplit(),uisn't handledNote: Several Von entries contain
und/useparators but go tosplit()(notparseReceivers()), so multi-person splitting doesn't apply. This is a separate concern - should Von entries also go throughparseReceivers()first?Design Decision: firstName/lastName for non-persons
Decision: Full name goes into
lastName,firstNamestaysnull. Non-persons don't have first/last name structure - forcing them into it produces garbage.This is consistent with #212 where
firstNamebecomes nullable.INSTITUTION examples
Arthur Collignon GmbHArthur Collignon GmbHFirma AuschrathFirma AuschrathReichsfechtschuleReichsfechtschuleSteuerfinanzamtSteuerfinanzamtWestermann u CoWestermann u CoGROUP examples
Comite der AbschiedsfeierComite der AbschiedsfeierComite zur Errichtung eines Heine-DenkmalsComite zur Errichtung eines Heine-DenkmalsGarde du CorpsGarde du CorpsGeschwister de GruyterGeschwister de GruyterGesellschafter des VerlagesGesellschafter des VerlagesElla de Gruyters ElternElla de Gruyters ElternEugenie de Gruyters KinderEugenie de Gruyters KinderHilde de Gruyters SchwiegerelternHilde de Gruyters SchwiegerelternEltern MullerEltern MullerFamilie CramFamilie CramFamilie HasenvleverFamilie HasenvleverMitarbeiter VerlagMitarbeiter VerlagMitarbeiter Druckerei TrebbinClara CramMitarbeiter Druckerei TrebbinClara CramMitarbeiter Kunstverlag MuMitarbeiter Kunstverlag MuImplementation note
When the heuristic classifies an entry as INSTITUTION or GROUP, skip
split()entirely - don't attempt firstName/lastName extraction. Store the raw string (trimmed) aslastNameand setfirstName = null.💻 Felix Brandt -- Senior Fullstack Developer
Questions & Observations
The issue mentions classification logic in
MassImportService.javaor a newPersonNameParser.java. Where exactly does the heuristic live? IfPersonNameParseralready exists and handlessplit()/parseReceivers(), the classification should happen before those methods are called -- otherwise non-person entries will still be split into bogus firstName/lastName pairs before the type is assigned. The comment 2 confirms this ("skipsplit()entirely"), but the issue body's file table lists both locations as options. Pick one.The SKIP entries (
Briefumschlag,Kondolenzbriefe,Hochzeitsgedicht) -- does "skip entirely" mean we silently drop them, or do we log a warning? Silent drops during import make debugging harder. Alog.warn("Skipping non-person entry: {}", rawName)costs nothing and saves hours.The keyword list mixes German-specific terms. What happens when someone imports a French or English document set? The heuristic is hardcoded to German vocabulary. Is that an accepted limitation for now, or should the keyword lists be externalized (e.g., in a config file or database table)?
ReichsfechtschuleandSteuerfinanzamthave no keyword match. Comment 1 suggests they "remain UNKNOWN." But the issue body says UNKNOWN is for "anything that doesn't match a heuristic but looks suspicious (very long strings, no space, etc.)." These two are real institutions, not suspicious strings. Should there be a static known-institution list, or do we accept that some entries will be UNKNOWN and corrected manually?The
PersonTypeenum has four values but noFAMILYtype.Familie CramandFamilie Hasenvleverare classified as GROUP. Is "Familie" semantically the same as "Comite" or "Geschwister"? If users later want to filter "show me all families," they can't distinguish them from committees. Worth considering whether GROUP is sufficient or whether FAMILY should be a separate type.Suggestions
Single responsibility for classification: Create a
PersonTypeClassifier(or method onPersonNameParser) that takes a raw string and returns aPersonType. This is a pure function -- trivially testable, no dependencies. TheMassImportServicecalls it before deciding whether to invokesplit().Test-first the keyword table: The comment 1 table is an excellent test fixture. Each row becomes a parameterized test case:
This gives 22+ test cases directly from the spec.
Guard the enum default: The migration sets default
'PERSON'. Make sure the Java enum hasPERSONas the JPA default too (@Column(columnDefinition = "varchar(20) default 'PERSON'")or@Builder.Default private PersonType type = PersonType.PERSON). Both layers must agree.Frontend type badge: The issue says "icon or badge" but does not specify the visual treatment. Before implementing, define which component owns this (PersonCard? PersonListItem?) and what the badge looks like for each type. A
PersonTypeBadge.sveltecomponent receiving atypeprop would be the clean split.🏗️ Markus Keller -- Application Architect
Questions & Observations
Domain boundary: The classification logic determines what a Person is. This is core domain logic, not import infrastructure. If it lives in
MassImportService, it becomes coupled to the import flow. But users should also be able to change a Person's type from the UI (correct a misclassification). That means the type classification and the type persistence must be accessible fromPersonService, not buried in import code.Enum storage strategy: The issue says
PersonTypeenum with a varchar column defaulting to'PERSON'. This is correct --@Enumerated(EnumType.STRING)with a varchar column. Never use ordinal storage for enums that may grow. But consider: should the column have a CHECK constraint limiting values to the known enum members? Flyway migration should include:This catches invalid values at the DB layer, not just the app layer.
SKIP semantics need clarification: The issue says entries like
Briefumschlag aus Javashould be skipped entirely -- "not even a record." But these raw strings came from the ODS as sender/receiver. If we skip them, the Document record loses its sender/receiver reference. Is that acceptable? Or should we store them as a note/comment on the Document instead? The data loss concern is real -- an archivist might want to know that the original spreadsheet said "Briefumschlag aus Java" even if it's not a person.Cross-domain impact on DocumentService: Documents reference Person records as sender/receiver. If we now have Person records of type INSTITUTION and GROUP, does
DocumentServiceneed to handle them differently? For example, should the document search allow filtering by sender type? The issue is scoped to Person, but the ripple into Document display/search should be acknowledged.The
useparator issue in Von column: Comment 1 raises an important adjacent concern -- Von entries containingu/undseparators go throughsplit()notparseReceivers(), so multi-person splitting doesn't work. This is a separate issue but the classification logic must not be confused by it.Westermann u Comust be classified as INSTITUTION before any splitting attempt, or theuwill be interpreted as a person separator.Suggestions
Classification as a PersonService concern: Add a
classifyPersonType(String rawName)method toPersonService(or a dedicatedPersonTypeClassifierthatPersonServicedelegates to).MassImportServicecallsPersonServiceto create persons -- the classification happens inside that creation flow, not before it.Migration should backfill: Existing Person records created by previous imports include the garbage entries listed in the issue. The migration should not just add the column with default PERSON -- it should also run UPDATE statements to reclassify known non-persons:
Otherwise the existing garbage data stays classified as PERSON until someone manually fixes each one.
OpenAPI impact: Adding
PersonTypeto the Person entity means the OpenAPI spec changes. The frontend TypeScript types need regeneration (npm run generate:api). The issue's file table should includefrontend/src/lib/api/or note the regeneration step.Consider a
displayNamecomputed field: For non-persons,lastNameholds the full name whilefirstNameis null. Every frontend display location currently showsfirstName + lastName. Rather than fixing every display site, consider agetDisplayName()method on Person that returns the correct representation based on type. This centralizes the display logic.🧪 Sara Holt -- QA Engineer & Test Strategist
Questions & Observations
The keyword heuristic is the core risk: False positives (real persons classified as INSTITUTION/GROUP) and false negatives (institutions slipping through as PERSON) are both data quality bugs. The issue provides 22 specific test cases in comment 1 -- but these are only the known cases from the current ODS. What about future imports with different data?
Edge cases not covered in the tables:
Dr. Firma Mueller-- a real person whose last name happens to be "Firma"? The keywordFirmawould misclassify them as INSTITUTION.Elternas a last name (it exists in German)? The keyword match would misclassify.firma auschrathvsFirma Auschrath-- does the heuristic do case-insensitive matching?The SKIP category has no database record: How do we test that skipped entries truly produce no Person record? This needs an integration test that imports a row with
Briefumschlag aus Javaas sender and verifies: (a) no Person with that name exists, (b) the Document record's sender field is null or handled gracefully.Migration backfill testability: If the migration includes UPDATE statements to reclassify existing records (as the architect might suggest), those UPDATEs need test coverage. A Testcontainers integration test should: load V(n-1) schema, insert known garbage Person records, run V(n) migration, verify person_type values are correct.
Frontend display of type badge: The issue says "icon or badge" on person list and detail pages. This needs visual regression coverage at the E2E layer -- a Playwright test that navigates to a person of type INSTITUTION and verifies the badge is visible.
Suggestions
Parameterized unit test suite: The 22 rows from comment 1 are a ready-made
@ParameterizedTestdata source. Add boundary cases:Integration test for import flow: A
@SpringBootTestthat imports a minimal ODS containing one PERSON, one INSTITUTION, one GROUP, and one SKIP entry, then verifies:personTypeRegression test for existing
Familiefilter: The issue mentionsFamilieis currently filtered out. Verify the new heuristic subsumes that existing filter without breaking it. If the old filter is removed, a test should confirmFamilie Cramnow creates a GROUP-typed Person instead of being silently dropped.Test the enum at the DB layer: If a CHECK constraint is added, write an integration test that attempts to INSERT a Person with
person_type = 'INVALID'and verifies the constraint violation.🛡️ Nora "NullX" Steiner -- Application Security Engineer
Questions & Observations
Mass assignment on PersonType: When the
typefield is added to the Person entity, it becomes part of the JPA model. If Person objects are accepted as request bodies (or via a DTO that maps to Person), can a user settypeto any value via the API? For example, could someone POST{"type": "PERSON"}to reclassify an INSTITUTION back to PERSON, bypassing the heuristic? Thetypefield should be explicitly controlled -- not blindly accepted from user input unless the user has appropriate permissions.Keyword matching and injection: The heuristic matches keywords in raw ODS cell values. These values are user-supplied data (whoever created the spreadsheet). If the keyword matching uses regex, are the patterns safe from ReDoS? Simple
String.contains()orstartsWith()checks are fine, but if someone decides to usePattern.compile()with user-derived patterns later, that's a risk. Keep the matching simple.SKIP entries and data integrity: If entries are silently skipped during import, there's no audit trail. An attacker (or a data entry error) that names a real person "Briefumschlag aus Java" would have that person silently dropped from the import. This is a minor concern for a family archive, but logging skipped entries (at minimum) provides an audit trail. Consider storing skipped entries in an import log table rather than silently discarding them.
Enum values in the database: The
person_typecolumn stores string values. If the application ever reads a value that doesn't match the Java enum (e.g., due to a manual DB edit or a migration error), JPA will throw anIllegalArgumentException. This is a crash, not a graceful error. Consider whether a fallback to UNKNOWN is appropriate for unrecognized values.Permission model: Who can change a Person's type in the UI? The issue mentions users can "review and correct misclassified entries." Does this require
WRITE_ALL?ADMIN? If any authenticated user can reclassify a Person, that's probably fine for a family archive. But if the type field influences business logic (e.g., filtering, search), unauthorized type changes could affect data visibility for other users.Suggestions
Whitelist
typein the DTO: IfPersonUpdateDTOor similar exists, explicitly includetypeas an allowed field with validation (@NotNull, restricted to valid enum values). Do not rely on JPA's enum parsing alone for input validation -- a malformed string should return 400, not 500.Log all SKIP decisions: During import, log every skipped entry at WARN level with the raw value and the reason:
This is cheap insurance against silent data loss.
CHECK constraint in the migration: As noted by the architect, a DB-level CHECK constraint prevents invalid enum values from being inserted by any path (application, migration script, manual DB access). This is defense in depth.
No security blockers: This feature doesn't introduce new attack surface beyond the standard CRUD operations. The main concern is data integrity (silent drops, misclassification) rather than traditional security vulnerabilities.
🎨 Leonie Voss -- UI/UX Design Lead
Questions & Observations
Part 4 is underspecified: "Person list and detail pages should display the type (icon or badge)" -- but which visual treatment? A text badge, an icon, a colored dot? The choice matters for accessibility and scannability. On the person list, users will scan dozens of entries. The type indicator must be visible at a glance without dominating the layout.
Four types, four visual treatments: PERSON, INSTITUTION, GROUP, UNKNOWN each need a distinct, accessible indicator. Color alone is not sufficient (WCAG 1.4.1). Each type needs at minimum: a label and either an icon or a distinct shape/pattern.
Person list density impact: Adding a badge to every row in the person list increases visual noise. Most persons are type PERSON -- do we show a badge for PERSON, or only for non-PERSON types? Showing a badge on every row when 90%+ are PERSON adds clutter without information. Consider: badge only for INSTITUTION, GROUP, and UNKNOWN. PERSON is the unmarked default.
Senior user consideration (60+): Badges with abbreviated text (
INST,GRP) may be unclear. Full words (Institution,Gruppe) are clearer but take more space. An icon + tooltip is compact but tooltips are inaccessible on touch devices. The safest pattern: small text badge with full word, using the brand-sand background and brand-navy text.Detail page placement: On the person detail/edit page, where does the type appear? Next to the name? In a separate metadata row? If the type is editable (users correcting misclassifications), it needs a select/dropdown on the edit page. The edit page already has firstName/lastName fields -- for INSTITUTION/GROUP where firstName is null, the edit page should hide/disable the firstName field and show a single "Name" field instead.
The UNKNOWN type is a call to action: An UNKNOWN badge should visually signal "this needs review" -- perhaps with a yellow/amber background or an attention icon. It's different from PERSON/INSTITUTION/GROUP which are settled classifications.
Suggestions
Badge component spec:
bg-blue-50 text-blue-700 border border-blue-200with building icon, text "Institution"bg-purple-50 text-purple-700 border border-purple-200with people icon, text "Gruppe"bg-amber-50 text-amber-700 border border-amber-200with question-mark icon, text "Unbekannt"text-xs font-medium px-2 py-0.5 rounded-full-- minimum 44px tap target when clickableEdit page adaptation for non-persons: When type is INSTITUTION or GROUP:
Filter by type on the person list: If users need to review UNKNOWN entries, a filter chip or tab for person type would let them quickly find records that need correction. This could be a follow-up issue, but it's worth noting now.
Dark mode consideration: The badge colors above should have dark mode equivalents defined.
bg-blue-50becomesdark:bg-blue-900/30, etc. Define the token mapping before implementation.🐳 Tobias Wendt -- DevOps & Platform Engineer
Questions & Observations
Flyway migration ordering: The issue mentions
V{n}__add_person_type.sqlbut doesn't specify the version number. With multiple features in flight (#211, #212 referencing nullable firstName), migration ordering matters. If both land around the same time, we need to ensure no conflicts. What's the current highest migration version? The migration should be numbered to land after any in-progress migrations from other branches.Migration runtime on existing data: How many Person records exist currently? The
ALTER TABLE ... ADD COLUMN ... DEFAULT 'PERSON'is fast on PostgreSQL (metadata-only change for a non-null column with a default since PG 11). But if the migration includes UPDATE statements for backfilling known non-persons (matching keywords against existing lastName values), that's a table scan. For a family archive this is likely trivial (hundreds, not millions of rows), but worth confirming.No infrastructure changes required: This feature adds an enum column and application logic. No new services, no new containers, no new ports, no new volumes. The Docker Compose setup, CI pipeline, and deployment process are unaffected. This is a clean, contained change from an infrastructure perspective.
CI impact: The new
PersonTypeClassifier(or equivalent) will have unit tests. These are fast and won't affect CI pipeline duration. The integration test for import flow with classification might add a few seconds to the Testcontainers suite. No concern here.OpenAPI spec regeneration: The
PersonTypeenum being added to the Person entity means the OpenAPI spec changes. The CI pipeline should catch type mismatches if the frontend types aren't regenerated. Does the current CI pipeline include a step that verifies generated types are up to date? If not, this is a good time to add one -- anpm run generate:api && git diff --exit-codecheck would catch drift.Suggestions
Coordinate migration numbering with #212: If #212 (nullable firstName) is in progress on another branch, agree on version numbers now to avoid Flyway conflicts at merge time. One approach: assign this issue the next available version and have #212 take the one after, or vice versa.
Add an OpenAPI drift check to CI: A lightweight CI step that regenerates the TypeScript API types and fails if they differ from what's committed:
This prevents the "forgot to regenerate types" class of bugs permanently.
No deployment considerations: Since this is a schema migration + application code change, a standard deploy (pull new image, run migrations, restart) handles it. No blue-green, no feature flag needed. The migration is backwards-compatible -- the column has a default, so the old application code would work fine alongside the new schema.
🏗️ Markus Keller — Application Architect (Discussion Summary)
Interactive discussion with Marcel covering 5 open items. All resolved.
Resolved Items
Classification logic location — New
PersonTypeClassifierutility class (likePersonNameParser). Pure static methodPersonType classify(String rawName), no dependencies, trivially testable. Called byPersonService.findOrCreateByAlias()before deciding whether to invokesplit(). Keeps classification reusable from any entry point — import, UI, API. Separate fromPersonNameParserbecause classification is a different concern from name parsing.SKIP entries and data loss — Decision: store raw Von/An value in the Document's
notesfield when classification returnsSKIP. The Document entity already has anotesTEXT column displayed on the detail page. No schema change needed, no fake Person records, no silent data loss. Example:Briefumschlag aus Java-> Document notes:"[Import] Von: Briefumschlag aus Java (kein Personeneintrag)".False positive risk in keyword matching — Decision: position-aware matching. Start-of-string for most keywords (
Firma,Comite,Mitarbeiter,Geschwister,Eltern,Familie, etc.), end-of-string forGmbH/Co. This meansDr. Firma Muelleris correctly classified as PERSON (does not start with "Firma"), whileFirma Auschrathis correctly classified as INSTITUTION. Eliminates the false positive class Sara identified.Migration backfill — Not needed. Still in dev mode — clean DB before re-import. No UPDATE statements in the migration.
Von-column multi-person entries — Separate issue created: #214. Von entries with
und/useparators (Gerd u Brigitte de Gruyter,Hans u Alma Cram, etc.) bypassparseReceivers(). Fix is to route Von entries throughparseReceivers()after classification. Classifier must run first soWestermann u Co(INSTITUTION) is not split onu. Depends on this issue (#211) and #213.Additional Decision: SKIP enum value
Added
SKIPto thePersonTypeenum (in #213 scope). Never persisted to the database —PersonServicechecks for it and returns early. Makes the classifier's return type exhaustive and self-documenting, cleaner than a null convention.Dependency Chain
#190 (merged) -> #213 (preparatory refactor, now includes
SKIPin enum) -> #211 (classifier + heuristics) -> #214 (Von multi-person splitting)🏗️ Markus Keller — Application Architect (Discussion: "Architekt" reclassification)
Follow-up discussion with Marcel on the classification of
Architekt Korschelt u Renker.Resolved Items
"Architekt" is an INSTITUTION keyword —
Architekt Korschelt u Renkeris not a person with a professional title — "Korschelt u Renker" is the name of the architect business. AddArchitektto the INSTITUTION keyword list with start-of-string matching (same rule asFirma). The classifier catches it as INSTITUTION before any splitting (#214) or title stripping (#212) runs.No ambiguity handling needed —
Architektis always an institution indicator in this dataset. Any future edge case (e.g., a person genuinely titled "Architekt") will be corrected manually via the UI. No disambiguation logic required.#214 updated — Moved
Architekt Korschelt u Renkerfrom #214's split examples table to the "must NOT be split" table, classified as INSTITUTION by #211.No compound "u"-partnership rule — Considered detecting a broader pattern (profession + name
uname → INSTITUTION) but decided against it. Only one occurrence in the dataset. Start-of-string match onArchitektis sufficient. KISS.Updated INSTITUTION keyword list
FirmaGmbHCoArchitektImplementation Complete
Parts 1-4 implemented on branch
feat/issues-209-213-person-parser-enhancements. Parts 1-2 (enum, migration) were already done in #213.Commits
68f0c4ca3da5736ee1ef7What changed
Dr. Firma Muellercorrectly classified as PERSON (start-of-string matching)Not in scope (deferred)
Test results