# feat(parser): support multi-person splitting for Von-column entries #214
## Problem

Von-column entries go directly to `PersonService.findOrCreateByAlias()` → `split()`, bypassing `parseReceivers()`. This means `und`/`u` multi-person separators in Von entries are not handled — they produce a single Person with the separator embedded in the name.

### Examples
Entries that should be split:

- Gerd u Brigitte de Gruyter
- Milly und Mutter Clara
- Jappe Bakker u Frau
- Gisela u Eberhard
- Hans u Alma Cram
- Herta u Felix Durr
- Lili u Arnold Hardt
- Richard u Kathe Friedrichsen
- M u M Mahling

### Entries that must NOT be split
- Westermann u Co
- Architekt Korschelt u Renker
- Hochzeitsgedicht fur Paul u Luise de Gruyter

## Solution
Route Von-column entries through `parseReceivers()` (or a shared splitting step) so `und`/`u` separators are handled. The `PersonTypeClassifier` (#211) must run FIRST — only entries classified as `PERSON` or `UNKNOWN` should be split.

### Processing order for Von entries

raw → classify → if `PERSON`/`UNKNOWN` → `parseReceivers()` → split

This matches what An-column entries already do.
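As a sketch of the intended behavior — not the project's actual code; `VonSplitter`, `looksLikePerson`, and the keyword list are invented for illustration, and the real classifier gate belongs to #211 — the gate-then-split step could look like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: class and method names (VonSplitter,
// looksLikePerson) are assumptions, not the project's actual API.
public class VonSplitter {

    // Split only on standalone " u " / " und " tokens, never inside a word.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+(?:u|und)\\s+");

    // Toy stand-in for the PersonTypeClassifier gate (#211); the real
    // classifier owns this decision, this keyword list is made up.
    static boolean looksLikePerson(String raw) {
        return !raw.matches(".*\\b(Co|Architekt|Hochzeitsgedicht)\\b.*");
    }

    // Split a Von entry into person names, distributing a shared last name
    // ("Hans u Alma Cram" -> "Hans Cram", "Alma Cram").
    static List<String> split(String raw) {
        if (!looksLikePerson(raw)) {
            return List.of(raw); // classifier gate: institutions stay intact
        }
        String[] parts = SEPARATOR.split(raw.trim());
        if (parts.length < 2) {
            return List.of(raw.trim()); // nothing to split
        }
        // Everything after the first token of the LAST part is treated as a
        // shared last name, so "de Gruyter" survives as a unit.
        String[] lastTokens = parts[parts.length - 1].trim().split("\\s+");
        String sharedLastName = lastTokens.length > 1
                ? String.join(" ", Arrays.copyOfRange(lastTokens, 1, lastTokens.length))
                : "";
        List<String> persons = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i].trim();
            // A lone first name inherits the shared last name; multi-token
            // parts (like "Jappe Bakker") are kept as-is.
            if (i < parts.length - 1 && !p.contains(" ") && !sharedLastName.isEmpty()) {
                p = p + " " + sharedLastName;
            }
            persons.add(p);
        }
        return persons;
    }
}
```

Note that `Jappe Bakker u Frau` falls through this sketch as two raw parts; per the open questions in the comments, its expected output still needs to be pinned down before a real implementation.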
## Files

- `MassImportService.java` — route Von entries through `parseReceivers()` after classification
- `MassImportServiceTest.java` / `PersonNameParserTest.java` — tests

### Depends on

- #211 (`PersonTypeClassifier`)
- #213 (refactor)

### Found in
ODS import file analysis for #190, flagged during #211 discussion.
## 👨‍💻 Felix Brandt — Senior Fullstack Developer

### Questions & Observations
- **`parseReceivers()` reuse scope** — The issue says Von entries should be routed through `parseReceivers()`. Currently `parseReceivers()` is designed for the An-column. Does it already handle all the patterns listed here (e.g. `Gerd u Brigitte de Gruyter` → shared last name distribution), or does `parseReceivers()` itself need to be extended? The issue's examples imply last-name inheritance logic (`Hans u Alma Cram` → both get `Cram`) — is that already in `parseReceivers()` or is it new behavior?
- **`Jappe Bakker u Frau` edge case** — The expected result says "1 person: Jappe Bakker (+ spouse reference — may need special handling)". This feels under-specified. Does "may need special handling" mean this issue should handle it, or is it deferred? If deferred, what should the current parser produce — one person `Jappe Bakker` and silently drop `Frau`, or one person with the raw string preserved? I'd want a concrete expected output before writing the test.
- **`Milly und Mutter Clara` interaction with #212** — The expected result references "relationship prefix handled by #212". If #212 isn't merged yet, should the test for this case assert the intermediate behavior (without prefix stripping) or the final behavior (with it)? This affects whether we can write the test now or need to wait.
- **Von-column entry point** — The issue mentions `MassImportService.java` as the file to change. Is there a single method where Von entries enter the pipeline (like `processSender()` or similar), or are there multiple call sites? This determines whether the change is a one-liner routing change or requires touching multiple paths.

### Suggestions
- **Test-first approach** — The examples table is excellent test input. I'd write one parameterized test per row in `PersonNameParserTest` (or `MassImportServiceTest`) before touching production code. The "must NOT be split" table is equally important — those are regression tests ensuring the classifier gate works.
- **Clarify the `Frau` case** — Either spec the expected output precisely or explicitly mark it as out-of-scope for this issue. A test with a vague expected result is worse than no test — it'll pass for the wrong reason.
- **Consider a shared `splitMultiPerson()` method** — If both An and Von columns need the same `und`/`u` splitting logic routed through the same classifier gate, extracting a named method (`splitMultiPersonEntry()`) that both paths call would make the intent clear and avoid duplicating the classifier → parseReceivers → split chain.

## 🏗️ Markus Keller — Application Architect
### Questions & Observations
- **Pipeline ordering as architecture** — The processing order diagram (raw → classify → if PERSON/UNKNOWN → parseReceivers → split) is effectively defining a pipeline architecture. With #211 (classifier), #212 (relationship prefixes), #213 (refactor), and now #214 all touching the same flow, the implicit question is: is there a single, explicit pipeline abstraction that enforces this ordering, or is each issue adding another `if` branch into `MassImportService`? If it's the latter, four issues from now this method will be unreadable.
- **Von vs An symmetry** — The issue states "this matches what An-column entries already do." If both columns now go through the same processing pipeline, is there still a reason to have separate code paths for Von and An at all? Or should both converge on a single `processPersonEntry(raw, column)` method where the only difference is what role (sender vs receiver) the resulting Person gets assigned to on the Document?
- **Classifier as a gate vs. filter** — The pipeline shows the classifier deciding whether to proceed to `parseReceivers()`. This makes the classifier a control-flow gate in the import pipeline. Is this responsibility documented in #211's contract, or is #214 adding an implicit dependency on classifier behavior that #211 doesn't know about? If the classifier's return type or semantics change, this pipeline breaks silently.
- **"Store in Document notes" for SKIP** — Where does this behavior live? Is it already implemented in #211, or is this issue also responsible for implementing the SKIP → notes routing? If it's split across issues, the intermediate state (after #214 but before "notes" routing) needs to be defined.
### Suggestions
- **Make the pipeline explicit** — Whether it's a simple method chain or a lightweight `ImportPipeline` with steps, the ordering guarantees should be in code, not just in issue descriptions. This doesn't need to be a framework — even a well-named method like `processPersonColumn(raw)` that calls classify → route → parse → split in order would suffice.
- **Document the classifier contract** — #214 depends on `PersonTypeClassifier.classify()` returning specific values (`PERSON`, `UNKNOWN`, `SKIP`, `INSTITUTION`, `GROUP`). That contract should be explicit in #211 so both issues can be implemented and tested independently.

## 🧪 Sara Holt — QA Engineer & Test Strategist
### Questions & Observations
- **Test data coverage** — The examples table has 10 "should split" cases and 2 "must NOT split" cases. That's a good start, but the "must not split" category feels thin. Are there other patterns where `u` or `und` appears legitimately inside a name or institution that we need to protect? For instance, German place names like "Neustadt und Umgebung" or compound names — are those possible in the Von column?
- **Acceptance criteria missing** — The issue describes expected results per example but doesn't state explicit acceptance criteria. I'd want them written down before implementation.
- **Integration test boundary** — The issue lists unit test files (`PersonNameParserTest`, `MassImportServiceTest`). Is there an integration test that exercises the full import pipeline (ODS file → MassImportService → database) with a Von entry containing `u`? A unit test on `parseReceivers()` proves the parser works; an integration test proves the routing in `MassImportService` actually calls it for Von entries.
- **`M u M Mahling` dedup question** — The expected result says "2 persons: M Mahling, M Mahling (or 1 if deduped)". Which is it? `findOrCreateByAlias()` presumably deduplicates, so the test should assert the actual behavior — likely 1 person. If the answer is "depends on existing data", the test needs to cover both scenarios.

### Suggestions
- **Parameterized tests** — Both the positive and negative tables are perfect for `@ParameterizedTest` with `@MethodSource` or `@CsvSource`. One test method, N rows, clear failure messages per case.
- **Regression suite for An-column** — Add a few existing An-column multi-person cases to the same test class to ensure the routing change doesn't break what already works.
- **Resolve ambiguous expected outputs before implementation** — `Jappe Bakker u Frau` and `M u M Mahling` both have uncertain expected results. Pin them down now — ambiguous specs produce ambiguous tests.

## 🔒 Nora "NullX" Steiner — Application Security Engineer
### Questions & Observations
- **Input trust boundary** — The Von-column data comes from ODS/Excel import files uploaded by users. The issue focuses on parsing logic, but I want to confirm: is the raw Von string sanitized before it hits `parseReceivers()` and `split()`? If a malicious import file contains a Von entry like `<script>alert(1)</script>` or a string with SQL-significant characters, does the pipeline handle that safely? This isn't a new attack surface (it exists for the An-column too), but routing Von through the same path means confirming the same protections apply.
- **Person creation from untrusted input** — `findOrCreateByAlias()` creates Person records from import data. With Von entries now being split, a single Von field can create multiple Person records. Is there a limit on how many persons a single Von entry can produce? A crafted input like `A u B u C u D u E u ...` with hundreds of `u` separators could create a large number of Person records in one import row. This is a low-severity DoS vector but worth considering.
- **Classifier bypass** — The pipeline depends on #211's classifier running FIRST to filter out non-person entries. If someone finds a way to bypass or confuse the classifier (e.g., an institution name that doesn't match the classifier's patterns), the splitting logic would run on it. The security impact is low (worst case: an institution gets split into multiple nonsense Person records), but it's worth noting that the classifier is now a security-relevant gate, not just a data quality filter.
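One way to bound how many Person records a single field can produce is a hard cap on the split result. A sketch of that idea — the limit of 10 and all names here are illustrative assumptions, not existing code:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a sanity limit on split count. The constant, the limit of 10,
// and the method names are assumptions, not the project's actual code.
public class SplitLimit {

    static final int MAX_PERSONS_PER_FIELD = 10;

    // Split on standalone "u"/"und", but refuse pathological inputs like
    // "A u B u C u ..." with hundreds of separators.
    static List<String> splitWithLimit(String raw) {
        String[] parts = raw.trim().split("\\s+(?:u|und)\\s+");
        if (parts.length > MAX_PERSONS_PER_FIELD) {
            // Fail closed: keep the raw value as one entry (and flag the row
            // for manual review) instead of mass-creating Person records.
            return List.of(raw.trim());
        }
        return Arrays.asList(parts);
    }
}
```

Failing closed (one raw entry plus a review flag) is preferable to truncating at the cap, because silent truncation would drop names from legitimate but unusual rows.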
### Suggestions
- **No new security concerns introduced** — This change is primarily a routing change within an existing trusted pipeline. As long as the existing input sanitization and JPA parameterized queries remain in place (which they do — `findOrCreateByAlias` uses Spring Data JPA), the security posture doesn't change. The main thing to verify is that the Von → `parseReceivers` path doesn't introduce any string concatenation into queries that the direct Von → `split` path didn't have.
- **Consider a sanity limit on split count** — Cap `parseReceivers()` output at a reasonable maximum (e.g., 10 persons per field) to prevent resource exhaustion from malformed import data.

## 🎨 Leonie Voss — UI/UX Design Lead
### Questions & Observations
- **No direct UI impact** — This issue is purely backend parser logic within the ODS/Excel import pipeline. There are no frontend components, user-facing screens, or interaction patterns affected. The change happens silently during batch import — users see the resulting Person records but don't interact with the splitting logic itself.
- **Indirect UX improvement** — From a user perspective, this is a data quality fix. Currently, users importing documents with Von entries like "Hans u Alma Cram" get a single garbage Person record with an embedded separator. After this change, they'll see two correctly formed persons. This reduces the need for manual cleanup in the person management UI — a meaningful UX win even though no UI code changes.
- **Person list display** — If an import previously created one person "Gerd u Brigitte de Gruyter" and after this fix creates two persons "Gerd de Gruyter" and "Brigitte de Gruyter", is there a migration or cleanup step for existing bad data? If not, users will see both the old malformed record and the new clean ones. That could be confusing in search results.
### Suggestions
## ⚙️ Tobias Wendt — DevOps & Platform Engineer

### Questions & Observations
- **No infrastructure impact** — This is a pure backend parsing change within `MassImportService`. No new services, no config changes, no new environment variables, no migration files, no Docker changes. The existing CI pipeline (build → test → deploy) handles this without modification.
- **Dependency chain risk** — The issue depends on #211 and #213. From a CI/merge perspective: are these going to land on `main` in order, or could they be developed in parallel on feature branches? If parallel, the test suite for #214 needs to either mock the classifier or include it as a test dependency. If sequential, the branch for #214 should be rebased onto #213's merge. Either way, merge conflicts in `MassImportService.java` are likely if all three issues touch the same method.
- **Import performance** — Von entries now go through an additional processing step (classifier + `parseReceivers`) compared to the current direct path. For a large ODS import (thousands of rows), is the added overhead measurable? Probably not — string parsing is fast — but if the classifier in #211 does anything expensive (regex compilation per call, database lookups), it could add up. Worth a quick check after implementation.
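The regex-compilation point can be illustrated with a small sketch — the class and method names are invented; only the `und`/`u` separator pattern is taken from this issue:

```java
import java.util.regex.Pattern;

// Illustrative only: names are assumptions. Shows the cheap precompiled
// pattern versus per-call compilation in a hot import loop.
public class SeparatorPatterns {

    // Compiled once per class load — the cheap option for thousands of rows.
    private static final Pattern SEPARATOR = Pattern.compile("\\s+(?:u|und)\\s+");

    static String[] splitPrecompiled(String raw) {
        return SEPARATOR.split(raw);
    }

    // String.split recompiles multi-character patterns on every call, which
    // adds up over an import of thousands of rows.
    static String[] splitPerCall(String raw) {
        return raw.split("\\s+(?:u|und)\\s+");
    }
}
```

Both produce identical results; the difference is only where the `Pattern` compilation cost is paid.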
### Suggestions
## 🏗️ Markus Keller — Application Architect

Discussion summary from architecture review with Marcel.

### Resolved
- `Document.sender` is `@ManyToOne`, but Von splitting via `parseReceivers()` can produce multiple persons. Agreed to change the model to a `@ManyToMany` `senders` collection (symmetric with receivers) in a separate issue before #214. This includes UI changes. #214 becomes a follow-up that naturally works once the model supports it.
- Von/An symmetry: agreed to converge on a single `processPersonColumn()` method in #214 rather than keeping separate code paths. The only difference is whether results land in `senders` or `receivers`.
- Pipeline abstraction: reviewed `MassImportService` after the #211/#213 merge. The method is linear and readable — no nested `if` branches. The `processPersonColumn()` extraction will make it simpler. No pipeline abstraction needed.

### New issues created

- Multiple senders (`@ManyToMany`) + UI updates. Prerequisite for #214.

### Overall read
The sequencing matters here: multiple senders first, then #214's Von splitting slots in cleanly without workarounds. The codebase is in good shape after the recent merges — no architectural concerns with the current import pipeline structure.
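The agreed model change — `Document.sender` (`@ManyToOne`) becoming a `senders` collection symmetric with `receivers` — could look roughly like this JPA mapping sketch. Table, join-column, and field names are assumptions, not the project's actual schema:

```java
@Entity
public class Document {

    @Id @GeneratedValue
    private Long id;

    // Before: @ManyToOne private Person sender;
    @ManyToMany
    @JoinTable(name = "document_senders",
               joinColumns = @JoinColumn(name = "document_id"),
               inverseJoinColumns = @JoinColumn(name = "person_id"))
    private Set<Person> senders = new HashSet<>();

    // Receiver side shown for symmetry (assumed to already exist).
    @ManyToMany
    @JoinTable(name = "document_receivers",
               joinColumns = @JoinColumn(name = "document_id"),
               inverseJoinColumns = @JoinColumn(name = "person_id"))
    private Set<Person> receivers = new HashSet<>();
}
```

A data migration moving each existing `sender` reference into the new join table would need to accompany this change.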