feat(parser): strip parenthesized annotations in split() and preserve content #210
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
parseReceivers()strips parenthesized content like(Gruber)as a shared last-name override. Butsplit()does not strip parenthesized content at all, so names with annotations break:Clara de Gruyter(*1871)Gertrud D.(Tuttu)Ernst Kurmany (?)Richard (Quast ?)The parenthesized content contains valuable metadata: birth years, nicknames, uncertainty markers. It should be stripped from the name but preserved.
Solution
Part 1 - Strip parenthesized content in
split()Before the
findKnownLastNamecheck, strip(...)content from the cleaned name. This should handle:(*1871)- birth/death year annotations(Tuttu)- nicknames(?)- uncertainty markers(Quast ?)- uncertain last namesPart 2 - Preserve extracted content
Return the extracted parenthesized content alongside the split name. The caller can decide how to use it:
birthYearfieldPersonNameAliaswith typeNICKNAMEFor now, at minimum extract and return the content so it's not silently lost. The consuming code can be enhanced incrementally.
Files
PersonNameParser.javasplit(), return extracted contentPersonNameParserTest.javaFound in
ODS import file analysis for #190.
Complete Input/Output Table
Every parenthesized entry from the ODS and how
split()should handle it.Entries that
split()must handleClara de Gruyter(*1871)Clarade Gruyter*1871Ernst Kurmany (?)ErnstKurmany?Gertrud D.(Tuttu)GertrudD.TuttuRichard (Quast ? )RichardQuast?Entry already handled by
parseReceivers()(no change needed)Hedi und Tutu (Gruber)parseReceivers()extractsGruberas shared last name ->["Hedi Gruber", "Tutu Gruber"]Processing order in
split()For each entry, the strip must happen BEFORE
findKnownLastNameand the last-space fallback:Clara de Gruyter(*1871)Clara de Gruyterde Gruyter->("Clara", "de Gruyter")Ernst Kurmany (?)Ernst Kurmany("Ernst", "Kurmany")Gertrud D.(Tuttu)Gertrud D.("Gertrud", "D.")Richard (Quast ? )Richard("Richard", "?")Note: #4 has a double space before the paren. After stripping
(Quast ? )and trimming, onlyRichardremains - a single token with no last name. The uncertain last nameQuastshould be preserved as metadata but cannot be used as the canonical lastName since it's explicitly marked uncertain.Edge cases to test
Clara de Gruyter(*1871)Clara de Gruyter("Clara", "de Gruyter")Ernst Kurmany (?)Ernst Kurmany("Ernst", "Kurmany")Gertrud D.(Tuttu)Gertrud D.("Gertrud", "D.")Name (annotation) extra(...)is stripped(OnlyParen)("?", "?")("?", "?")👨💻 Felix Brandt — Senior Fullstack Developer
Questions & Observations
The issue says
split()should return extracted parenthesized content "alongside the split name." What is the concrete return type? Currentlysplit()returns aString[](or similar two-element structure). Returning annotation metadata means either a new record/DTO or overloading the return value. ANameSplitResultrecord withfirstName,lastName, andString annotation(orList<String> annotations) would be the clean solution -- is that the intended direction?Entry #4 (
Richard (Quast ? )) is interesting: after stripping, onlyRichardremains -- a single token. The comment says "single token with no last name" producing("Richard", "?"). But?as a lastName is a sentinel value, not a real name. Does the currentsplit()already have a convention for "no last name found"? If it returnsnullor empty string, switching to?is a behavior change that callers may not expect.The regex for stripping needs to handle: no space before paren (
Gruyter(*1871)), space before paren (Kurmany (?)), content with spaces inside ((Quast ? )), and multiple consecutive parens (unlikely but worth a guard). A single\s*\([^)]*\)\s*$anchored at end-of-string should cover all four cases listed.The issue scopes this to
split()only, leavingparseReceivers()untouched. ButparseReceivers()callssplit()internally for each receiver after extracting the shared last-name override. Will the new stripping insplit()accidentally interfere with theparseReceivers()path where(Gruber)has already been removed? Need to confirm thatparseReceivers()strips its own parens before delegating tosplit(), so there is no double-processing risk.Suggestions
Introduce a
NameSplitResultrecord:This keeps
split()as a query (returns data, no side effects) and avoids encoding metadata into the name strings themselves. The annotation field can benullwhen no parenthesized content exists.TDD sequence for the four cases: write one test per row in the input/output table from the comment, watch each fail, then generalize the regex. Start with #2 (
Ernst Kurmany (?)) as the simplest case, then #1 (no space before paren), then #3 (nickname), then #4 (uncertain last name producing single token). Each test forces one refinement.For the
(OnlyParen)degenerate edge case: the test should assert thatsplit()returns("?", "?")with annotation"OnlyParen"-- confirming the annotation is still captured even when the name collapses entirely.🏗️ Markus Keller — Application Architect
Questions & Observations
Return type change is a module boundary concern.
split()is called byPersonService(viafindOrCreateByNameor similar). Changing it from a simpleString[]to a richer result type is the right call, but every caller must be updated. How many call sites exist? IfparseReceivers()andMassImportServiceboth callsplit(), this is a cross-cutting change that should be mapped before implementation.Annotation semantics are underspecified. The issue lists three annotation types: birth year (
*1871), nickname (Tuttu), and uncertainty (?). These have fundamentally different semantics -- a birth year is structured data, a nickname is aPersonNameAlias, and uncertainty is a quality flag. Returning a rawString annotationdefers the parsing problem. Is the intent to parse annotation types in this issue, or truly just extract the raw string and leave interpretation for later? The answer affects whether we need anAnnotationTypeenum now or not.Schema implications for "preserve extracted content." Part 2 says "the caller can decide how to use it" and mentions
birthYearfield andPersonNameAliaswith typeNICKNAME. Neither of these schema additions are in scope for this issue. That is fine -- but thePersonNameAliasentity already exists (from #181). Does the currentPersonNameAliasmodel have atypefield, or onlyalias? If there is no type discriminator, nicknames extracted here cannot be distinguished from other alias types without a schema migration.Entry #4 loses the uncertain last name entirely.
Richard (Quast ?)becomesfirstName=Richard, lastName=null/empty, withQuastonly in the annotation string. If this name is used tofindOrCreatea Person, it creates a person with no last name. Later, if someone importsRichard Quast(without uncertainty), the system will not match them. Is that acceptable, or should uncertain last names still populatelastNamewith a flag?Suggestions
Scope this issue strictly to extraction (Part 1 + raw string preservation). Do not add
AnnotationTypeenum,birthYearfield, or auto-creation ofPersonNameAliasentries. Those belong in follow-up issues that reference this one.Document the
NameSplitResult(or equivalent) as an internal API contract inPersonNameParser. It should be package-private or at most used within thepersondomain -- no other service should depend on annotation parsing internals.For entry #4, consider whether
Quastshould become the lastName with annotation?(meaning "uncertain"), rather than dropping it entirely. The current spec sayslastNameshould be empty, but that creates a data loss scenario where the uncertain name vanishes from the searchable fields.🧪 Sara Holt — QA Engineer & Test Strategist
Questions & Observations
Test layer assignment. This is pure parsing logic with no database or HTTP dependency. All tests belong at the unit test layer (JUnit 5, no Spring context). The existing
PersonNameParserTest.javais the right home. No integration or E2E tests needed for this change.Regression risk on existing
split()behavior. The issue adds parenthesis stripping as a new step beforefindKnownLastName. Every existing test case forsplit()should still pass unchanged -- names without parentheses must not be affected. Before writing new tests, run the existing suite and confirm it is green. After adding the stripping logic, run the full suite again. Any regression means the stripping regex is too greedy.The processing order table in the comment is a test plan. Each row maps directly to a test case:
Clara de Gruyter(*1871))Ernst Kurmany (?))Gertrud D.(Tuttu))Richard (Quast ? ))Edge cases not covered in the issue that need tests:
Hans (von) Meier) -- should this strip or not? The issue says "trailing(...)is stripped" but the regex needs to enforce end-of-string anchoringName (a)(b)) -- unlikely in real data, but the regex behavior should be definedName ()-- what does annotation contain?Name ((inner))-- degenerate but should not crashTest naming convention. Follow the project pattern:
split_shouldExtractBirthYearAnnotation_whenParenFollowsLastName,split_shouldPreserveLastName_whenNoParenthesesPresent.Suggestions
Write a parameterized test (
@ParameterizedTestwith@CsvSourceor@MethodSource) for the four main cases from the comment table. This makes the input/output mapping explicit and easy to extend when new annotation patterns appear in future ODS imports.Add an explicit regression test block: take 5-6 existing
split()inputs that have no parentheses and assert they produce identical results after this change. This is cheap insurance against regex over-matching.For the
(OnlyParen)degenerate case: assert the method does not throw. The expected output("?", "?")with annotation"OnlyParen"should be documented in the test name so the behavior is explicit, not an accident.🔒 Nora "NullX" Steiner — Application Security Engineer
Questions & Observations
ReDoS risk in the parenthesis-stripping regex. If the regex is applied to user-supplied or import-file-supplied strings, a catastrophic backtracking pattern could cause CPU exhaustion. A simple pattern like
\s*\([^)]*\)\s*$is safe (no nested quantifiers, no alternation inside repetition). But if someone writes\(.*\)with greedy.*and the input contains multiple(without closing), backtracking can spike. Confirm the chosen regex uses[^)]*(character class negation) rather than.*inside the group.Input length validation. The
split()method processes strings from ODS imports. Is there an upper bound on input string length? An extremely long string (e.g., a malformed cell with 100KB of text) would be processed by the regex engine. A simple guard (if (input.length() > MAX_NAME_LENGTH) truncate or reject) before regex processing is good defense-in-depth.No injection concern here. This is a parser that produces data objects, not a sink that executes queries or renders HTML. The extracted annotation string will eventually be stored or displayed -- the security boundary is at the point of use (JPA parameterized queries for storage, Svelte auto-escaping for display), not here. No action needed in this issue, but worth noting for whoever implements Part 2.
Log injection. If extracted annotations are logged (e.g.,
log.debug("Extracted annotation: {}", annotation)), ensure SLF4J parameterized logging is used (which it is by convention in this project). Raw string concatenation in log statements with user-derived content could inject newlines into structured logs. This is a low-severity concern but worth a glance during implementation.Suggestions
Use
[^)]*in the regex, not.*. This eliminates ReDoS risk entirely for this pattern.No security-blocking concerns for this issue. The change is contained within a pure parser with no I/O side effects. The security boundary matters when the extracted annotations are persisted or displayed -- that is a concern for follow-up issues, not this one.
🎨 Leonie Voss — UI/UX Design Lead
Questions & Observations
No direct UI impact in this issue, but the extracted annotations will eventually surface in the UI. When nicknames become
PersonNameAliasentries (mentioned in Part 2), they will appear in the person detail/edit page alias section. When birth years populate a future field, they will need a display location. These downstream UI changes are out of scope here, but the data decisions made now constrain the UI options later.Display of uncertainty markers. Entry #2 and #4 produce names with
?annotations indicating uncertainty. If these ever surface in the UI (e.g., a badge or tooltip on the person card saying "uncertain attribution"), the visual treatment matters: color alone cannot convey uncertainty (WCAG). An icon + text label ("uncertain") would be needed. This is a future concern, not blocking, but worth flagging as a design consideration when Part 2 is scoped.Person names with abbreviations. Entry #3 produces
lastName = "D."-- a single-letter abbreviation with a dot. When this displays in the person list or document metadata, it may look like a rendering error to users ("Is the last name truncated?"). Consider whether the UI should eventually show abbreviated last names differently, or whether this is just accepted as-is from the source data.Nicknames in search. If
(Tuttu)becomes aPersonNameAliasof typeNICKNAME, users searching for "Tuttu" should find Gertrud D. The search behavior is a UX concern -- the alias search already works for other alias types, so this should flow naturally, but it is worth confirming during testing.Suggestions
No UI work needed for this issue. The parser change is backend-only.
When Part 2 is scoped, request a design review for how uncertainty markers and extracted birth years will display. These are new data types that need intentional visual treatment, not ad-hoc placement.
🔧 Tobias Wendt — DevOps & Platform Engineer
Questions & Observations
No infrastructure impact. This change is contained within
PersonNameParser.javaand its test file. No new dependencies, no schema migrations, no configuration changes, no Docker or CI modifications needed.CI pipeline consideration. The existing unit test suite will run this automatically. Since the change modifies parsing logic used by
MassImportService(which processes ODS files), it may be worth confirming that the integration tests for ODS import still pass. These tests run against Testcontainers Postgres in CI -- no additional CI configuration needed, but the test run time could increase slightly if new parameterized tests add many cases. Unlikely to be measurable.No Flyway migration needed. The issue explicitly says the extracted annotations are returned but not yet persisted. When Part 2 adds
birthYearto the Person entity or atypecolumn toPersonNameAlias, that will require a Flyway migration (V{n}__add_person_birth_year.sqlor similar). Flag this for the follow-up issue.ODS import performance. The regex stripping adds one operation per name string during import. For the typical import sizes in this project (hundreds to low thousands of rows), this is negligible. No performance concern.
Suggestions
No DevOps action items for this issue. Clean backend-only change with no operational footprint.
For the follow-up issues (Part 2): if
birthYearor annotation type columns are added, ensure the Flyway migration is backward-compatible (nullable columns, no NOT NULL without defaults) so rolling deployments do not break.🏗️ Markus Keller — Application Architect (Discussion Summary)
Interactive discussion with Marcel covering 3 open items. All resolved.
Resolved Items
Entry #4 Richard (Quast ?) - uncertain last name handling — Decision: use best-effort data. When parenthesized content contains a name-like string plus
?(e.g.(Quast ?)), extract the name part back into the cleaned result and store only?as the annotation. Result:Richard (Quast ?)-> firstName="Richard", lastName="Quast", annotation="?". This preserves matchability - a "probably Quast" is more useful than a?placeholder. For pure uncertainty markers like(?), the annotation is?and the preceding name stays intact.Scope: raw string only, no typed annotations — This issue strictly extracts the raw parenthesized content into the
annotationfield ofSplitName(from #213). NoAnnotationTypeenum, nobirthYearpopulation, noPersonNameAliascreation. Parser extracts, service interprets. Clean separation - interpretation logic belongs in follow-up issues.Scope expanded:
parseReceivers()paren handling must change — CurrentlyparseReceivers()strips(...)in step 2 unconditionally (before checking for multi-person separators). This means An-column entries likeGertrud D.(Tuttu)have their annotation parens eaten byparseReceivers()beforesplit()ever sees them. Fix: move paren extraction inparseReceivers()to only fire when a multi-separator (und/u) is present. Single-person inputs leave parens intact forsplit()to handle as annotations. Without this fix, thesplit()annotation extraction is useless for An-column entries.Updated Input/Output Table for Entry #4
Richard (Quast ? )Richard QuastRichardQuast?Ernst Kurmany (?)Ernst KurmanyErnstKurmany?Clara de Gruyter(*1871)Clara de GruyterClarade Gruyter*1871Gertrud D.(Tuttu)Gertrud D.GertrudD.TuttuDependency Note
This issue now depends on #213 (preparatory refactor) which provides the
SplitName.annotationfield and thestripAnnotation()pipeline slot. Execution order: #190 (merged) -> #213 -> #210.Implementation Complete
Both parts implemented on branch
feat/issues-209-213-person-parser-enhancements.Commits
e696e50e49ae5dWhat changed
(...)content as annotation. Handles birth years(*1871), nicknames(Tuttu), uncertainty markers(?), and uncertain names(Quast ?)where the name part is extracted back into the cleaned resultsplit()annotation extraction; multi-person inputs still use(Gruber)as shared last-name override[^)]*regex to prevent ReDoSSplitName.annotation— no typed interpretation in this issueTest results