feat(parser): strip parenthesized annotations in split() and preserve content #210

New Issue

marcel · 2026-04-07T18:27:57+02:00

marcel commented

2026-04-07 18:27:57 +02:00

Problem

parseReceivers() strips parenthesized content like (Gruber) as a shared last-name override. But split() does not strip parenthesized content at all, so names with annotations break:

Input	Current result	Expected result
`Clara de Gruyter(*1871)`	firstName="Clara de", lastName="Gruyter(*1871)"	firstName="Clara", lastName="de Gruyter", birth year=1871 preserved
`Gertrud D.(Tuttu)`	lastName="D.(Tuttu)"	firstName="Gertrud", lastName="D.", nickname "Tuttu" preserved
`Ernst Kurmany (?)`	firstName="Ernst Kurmany", lastName="(?)"	firstName="Ernst", lastName="Kurmany", uncertain flag preserved
`Richard (Quast ?)`	firstName="Richard", lastName="(Quast ?)" via fallback	firstName="Richard", lastName="Quast", uncertain flag preserved

The parenthesized content contains valuable metadata: birth years, nicknames, uncertainty markers. It should be stripped from the name but preserved.

Solution

Part 1 - Strip parenthesized content in `split()`

Before the findKnownLastName check, strip (...) content from the cleaned name. This should handle:

(*1871) - birth/death year annotations
(Tuttu) - nicknames
(?) - uncertainty markers
(Quast ?) - uncertain last names

Part 2 - Preserve extracted content

Return the extracted parenthesized content alongside the split name. The caller can decide how to use it:

Birth years could populate a future birthYear field
Nicknames could become a PersonNameAlias with type NICKNAME
Uncertainty markers could be logged/flagged

For now, at minimum extract and return the content so it's not silently lost. The consuming code can be enhanced incrementally.

Files

File	Change
`PersonNameParser.java`	Add parenthesis stripping in `split()`, return extracted content
`PersonNameParserTest.java`	New tests for parenthesized annotation stripping

Found in

ODS import file analysis for #190.

## Problem `parseReceivers()` strips parenthesized content like `(Gruber)` as a shared last-name override. But `split()` does not strip parenthesized content at all, so names with annotations break: | Input | Current result | Expected result | |---|---|---| | `Clara de Gruyter(*1871)` | firstName="Clara de", lastName="Gruyter(*1871)" | firstName="Clara", lastName="de Gruyter", birth year=1871 preserved | | `Gertrud D.(Tuttu)` | lastName="D.(Tuttu)" | firstName="Gertrud", lastName="D.", nickname "Tuttu" preserved | | `Ernst Kurmany (?)` | firstName="Ernst Kurmany", lastName="(?)" | firstName="Ernst", lastName="Kurmany", uncertain flag preserved | | `Richard (Quast ?)` | firstName="Richard", lastName="(Quast ?)" via fallback | firstName="Richard", lastName="Quast", uncertain flag preserved | The parenthesized content contains valuable metadata: birth years, nicknames, uncertainty markers. It should be stripped from the name but preserved. ## Solution ### Part 1 - Strip parenthesized content in `split()` Before the `findKnownLastName` check, strip `(...)` content from the cleaned name. This should handle: - `(*1871)` - birth/death year annotations - `(Tuttu)` - nicknames - `(?)` - uncertainty markers - `(Quast ?)` - uncertain last names ### Part 2 - Preserve extracted content Return the extracted parenthesized content alongside the split name. The caller can decide how to use it: - Birth years could populate a future `birthYear` field - Nicknames could become a `PersonNameAlias` with type `NICKNAME` - Uncertainty markers could be logged/flagged For now, at minimum extract and return the content so it's not silently lost. The consuming code can be enhanced incrementally. ## Files | File | Change | |---|---| | `PersonNameParser.java` | Add parenthesis stripping in `split()`, return extracted content | | `PersonNameParserTest.java` | New tests for parenthesized annotation stripping | ## Found in ODS import file analysis for #190.

marcel added the feature person labels 2026-04-07 18:28:41 +02:00

marcel commented

2026-04-07 18:32:18 +02:00

Complete Input/Output Table

Every parenthesized entry from the ODS and how split() should handle it.

Entries that `split()` must handle

#	Raw input	Column	Expected firstName	Expected lastName	Extracted annotation	Annotation type
1	`Clara de Gruyter(*1871)`	Von	`Clara`	`de Gruyter`	`*1871`	Birth year
2	`Ernst Kurmany (?)`	Von	`Ernst`	`Kurmany`	`?`	Uncertainty
3	`Gertrud D.(Tuttu)`	An	`Gertrud`	`D.`	`Tuttu`	Nickname
4	`Richard (Quast ? )`	An	`Richard`	`Quast`	`?`	Uncertainty (uncertain last name)

Entry already handled by `parseReceivers()` (no change needed)

#	Raw input	Column	Current handling	Notes
5	`Hedi und Tutu (Gruber)`	An	`parseReceivers()` extracts `Gruber` as shared last name -> `["Hedi Gruber", "Tutu Gruber"]`	This is a shared last-name override, not an annotation. Already works correctly.

Processing order in `split()`

For each entry, the strip must happen BEFORE findKnownLastName and the last-space fallback:

#	Input	After paren strip	After dot-normalization	After known-last-name / fallback
1	`Clara de Gruyter(*1871)`	`Clara de Gruyter`	(has spaces, skip)	known: `de Gruyter` -> `("Clara", "de Gruyter")`
2	`Ernst Kurmany (?)`	`Ernst Kurmany`	(has spaces, skip)	fallback: `("Ernst", "Kurmany")`
3	`Gertrud D.(Tuttu)`	`Gertrud D.`	(has spaces, skip)	fallback: `("Gertrud", "D.")`
4	`Richard (Quast ? )`	`Richard`	(no dots, skip)	single token: `("Richard", "?")`

Note: #4 has a double space before the paren. After stripping (Quast ? ) and trimming, only Richard remains - a single token with no last name. The uncertain last name Quast should be preserved as metadata but cannot be used as the canonical lastName since it's explicitly marked uncertain.

Edge cases to test

Input	After paren strip	Expected split	Notes
`Clara de Gruyter(*1871)`	`Clara de Gruyter`	`("Clara", "de Gruyter")`	No space before paren
`Ernst Kurmany (?)`	`Ernst Kurmany`	`("Ernst", "Kurmany")`	Space before paren
`Gertrud D.(Tuttu)`	`Gertrud D.`	`("Gertrud", "D.")`	No space, nickname
`Name (annotation) extra`	Should not occur in data	-	If it does, only trailing `(...)` is stripped
`(OnlyParen)`	empty -> `("?", "?")`	`("?", "?")`	Degenerate case

## Complete Input/Output Table Every parenthesized entry from the ODS and how `split()` should handle it. ### Entries that `split()` must handle | # | Raw input | Column | Expected firstName | Expected lastName | Extracted annotation | Annotation type | |---|---|---|---|---|---|---| | 1 | `Clara de Gruyter(*1871)` | Von | `Clara` | `de Gruyter` | `*1871` | Birth year | | 2 | `Ernst Kurmany (?)` | Von | `Ernst` | `Kurmany` | `?` | Uncertainty | | 3 | `Gertrud D.(Tuttu)` | An | `Gertrud` | `D.` | `Tuttu` | Nickname | | 4 | `Richard (Quast ? )` | An | `Richard` | `Quast` | `?` | Uncertainty (uncertain last name) | ### Entry already handled by `parseReceivers()` (no change needed) | # | Raw input | Column | Current handling | Notes | |---|---|---|---|---| | 5 | `Hedi und Tutu (Gruber)` | An | `parseReceivers()` extracts `Gruber` as shared last name -> `["Hedi Gruber", "Tutu Gruber"]` | This is a shared last-name override, not an annotation. Already works correctly. | ### Processing order in `split()` For each entry, the strip must happen BEFORE `findKnownLastName` and the last-space fallback: | # | Input | After paren strip | After dot-normalization | After known-last-name / fallback | |---|---|---|---|---| | 1 | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | (has spaces, skip) | known: `de Gruyter` -> `("Clara", "de Gruyter")` | | 2 | `Ernst Kurmany (?)` | `Ernst Kurmany` | (has spaces, skip) | fallback: `("Ernst", "Kurmany")` | | 3 | `Gertrud D.(Tuttu)` | `Gertrud D.` | (has spaces, skip) | fallback: `("Gertrud", "D.")` | | 4 | `Richard (Quast ? )` | `Richard` | (no dots, skip) | single token: `("Richard", "?")` | Note: #4 has a double space before the paren. After stripping `(Quast ? )` and trimming, only `Richard` remains - a single token with no last name. The uncertain last name `Quast` should be preserved as metadata but cannot be used as the canonical lastName since it's explicitly marked uncertain. ### Edge cases to test | Input | After paren strip | Expected split | Notes | |---|---|---|---| | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | `("Clara", "de Gruyter")` | No space before paren | | `Ernst Kurmany (?)` | `Ernst Kurmany` | `("Ernst", "Kurmany")` | Space before paren | | `Gertrud D.(Tuttu)` | `Gertrud D.` | `("Gertrud", "D.")` | No space, nickname | | `Name (annotation) extra` | Should not occur in data | - | If it does, only trailing `(...)` is stripped | | `(OnlyParen)` | empty -> `("?", "?")` | `("?", "?")` | Degenerate case |

marcel referenced this issue

2026-04-07 18:41:23 +02:00

feat(parser): improve GEB_PATTERN and store maiden name as alias #209

marcel commented

2026-04-07 18:41:30 +02:00

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

The issue says split() should return extracted parenthesized content "alongside the split name." What is the concrete return type? Currently split() returns a String[] (or similar two-element structure). Returning annotation metadata means either a new record/DTO or overloading the return value. A NameSplitResult record with firstName, lastName, and String annotation (or List<String> annotations) would be the clean solution -- is that the intended direction?
Entry #4 (Richard (Quast ? )) is interesting: after stripping, only Richard remains -- a single token. The comment says "single token with no last name" producing ("Richard", "?"). But ? as a lastName is a sentinel value, not a real name. Does the current split() already have a convention for "no last name found"? If it returns null or empty string, switching to ? is a behavior change that callers may not expect.
The regex for stripping needs to handle: no space before paren (Gruyter(*1871)), space before paren (Kurmany (?)), content with spaces inside ((Quast ? )), and multiple consecutive parens (unlikely but worth a guard). A single \s*$[^)]*$\s*$ anchored at end-of-string should cover all four cases listed.
The issue scopes this to split() only, leaving parseReceivers() untouched. But parseReceivers() calls split() internally for each receiver after extracting the shared last-name override. Will the new stripping in split() accidentally interfere with the parseReceivers() path where (Gruber) has already been removed? Need to confirm that parseReceivers() strips its own parens before delegating to split(), so there is no double-processing risk.

Suggestions

Introduce a NameSplitResult record:
```
public record NameSplitResult(String firstName, String lastName, String annotation) {}
```
This keeps split() as a query (returns data, no side effects) and avoids encoding metadata into the name strings themselves. The annotation field can be null when no parenthesized content exists.
TDD sequence for the four cases: write one test per row in the input/output table from the comment, watch each fail, then generalize the regex. Start with #2 (Ernst Kurmany (?)) as the simplest case, then #1 (no space before paren), then #3 (nickname), then #4 (uncertain last name producing single token). Each test forces one refinement.
For the (OnlyParen) degenerate edge case: the test should assert that split() returns ("?", "?") with annotation "OnlyParen" -- confirming the annotation is still captured even when the name collapses entirely.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Questions & Observations - The issue says `split()` should return extracted parenthesized content "alongside the split name." What is the concrete return type? Currently `split()` returns a `String[]` (or similar two-element structure). Returning annotation metadata means either a new record/DTO or overloading the return value. A `NameSplitResult` record with `firstName`, `lastName`, and `String annotation` (or `List<String> annotations`) would be the clean solution -- is that the intended direction? - Entry #4 (`Richard (Quast ? )`) is interesting: after stripping, only `Richard` remains -- a single token. The comment says "single token with no last name" producing `("Richard", "?")`. But `?` as a lastName is a sentinel value, not a real name. Does the current `split()` already have a convention for "no last name found"? If it returns `null` or empty string, switching to `?` is a behavior change that callers may not expect. - The regex for stripping needs to handle: no space before paren (`Gruyter(*1871)`), space before paren (`Kurmany (?)`), content with spaces inside (`(Quast ? )`), and multiple consecutive parens (unlikely but worth a guard). A single `\s*$[^)]*$\s*$` anchored at end-of-string should cover all four cases listed. - The issue scopes this to `split()` only, leaving `parseReceivers()` untouched. But `parseReceivers()` calls `split()` internally for each receiver after extracting the shared last-name override. Will the new stripping in `split()` accidentally interfere with the `parseReceivers()` path where `(Gruber)` has already been removed? Need to confirm that `parseReceivers()` strips its own parens before delegating to `split()`, so there is no double-processing risk. ### Suggestions - Introduce a `NameSplitResult` record: ```java public record NameSplitResult(String firstName, String lastName, String annotation) {} ``` This keeps `split()` as a query (returns data, no side effects) and avoids encoding metadata into the name strings themselves. The annotation field can be `null` when no parenthesized content exists. - TDD sequence for the four cases: write one test per row in the input/output table from the comment, watch each fail, then generalize the regex. Start with #2 (`Ernst Kurmany (?)`) as the simplest case, then #1 (no space before paren), then #3 (nickname), then #4 (uncertain last name producing single token). Each test forces one refinement. - For the `(OnlyParen)` degenerate edge case: the test should assert that `split()` returns `("?", "?")` with annotation `"OnlyParen"` -- confirming the annotation is still captured even when the name collapses entirely.

marcel commented

2026-04-07 18:41:50 +02:00

🏗️ Markus Keller — Application Architect

Questions & Observations

Return type change is a module boundary concern. split() is called by PersonService (via findOrCreateByName or similar). Changing it from a simple String[] to a richer result type is the right call, but every caller must be updated. How many call sites exist? If parseReceivers() and MassImportService both call split(), this is a cross-cutting change that should be mapped before implementation.
Annotation semantics are underspecified. The issue lists three annotation types: birth year (*1871), nickname (Tuttu), and uncertainty (?). These have fundamentally different semantics -- a birth year is structured data, a nickname is a PersonNameAlias, and uncertainty is a quality flag. Returning a raw String annotation defers the parsing problem. Is the intent to parse annotation types in this issue, or truly just extract the raw string and leave interpretation for later? The answer affects whether we need an AnnotationType enum now or not.
Schema implications for "preserve extracted content." Part 2 says "the caller can decide how to use it" and mentions birthYear field and PersonNameAlias with type NICKNAME. Neither of these schema additions are in scope for this issue. That is fine -- but the PersonNameAlias entity already exists (from #181). Does the current PersonNameAlias model have a type field, or only alias? If there is no type discriminator, nicknames extracted here cannot be distinguished from other alias types without a schema migration.
Entry #4 loses the uncertain last name entirely. Richard (Quast ?) becomes firstName=Richard, lastName=null/empty, with Quast only in the annotation string. If this name is used to findOrCreate a Person, it creates a person with no last name. Later, if someone imports Richard Quast (without uncertainty), the system will not match them. Is that acceptable, or should uncertain last names still populate lastName with a flag?

Suggestions

Scope this issue strictly to extraction (Part 1 + raw string preservation). Do not add AnnotationType enum, birthYear field, or auto-creation of PersonNameAlias entries. Those belong in follow-up issues that reference this one.
Document the NameSplitResult (or equivalent) as an internal API contract in PersonNameParser. It should be package-private or at most used within the person domain -- no other service should depend on annotation parsing internals.
For entry #4, consider whether Quast should become the lastName with annotation ? (meaning "uncertain"), rather than dropping it entirely. The current spec says lastName should be empty, but that creates a data loss scenario where the uncertain name vanishes from the searchable fields.

## 🏗️ Markus Keller — Application Architect ### Questions & Observations - **Return type change is a module boundary concern.** `split()` is called by `PersonService` (via `findOrCreateByName` or similar). Changing it from a simple `String[]` to a richer result type is the right call, but every caller must be updated. How many call sites exist? If `parseReceivers()` and `MassImportService` both call `split()`, this is a cross-cutting change that should be mapped before implementation. - **Annotation semantics are underspecified.** The issue lists three annotation types: birth year (`*1871`), nickname (`Tuttu`), and uncertainty (`?`). These have fundamentally different semantics -- a birth year is structured data, a nickname is a `PersonNameAlias`, and uncertainty is a quality flag. Returning a raw `String annotation` defers the parsing problem. Is the intent to parse annotation types in this issue, or truly just extract the raw string and leave interpretation for later? The answer affects whether we need an `AnnotationType` enum now or not. - **Schema implications for "preserve extracted content."** Part 2 says "the caller can decide how to use it" and mentions `birthYear` field and `PersonNameAlias` with type `NICKNAME`. Neither of these schema additions are in scope for this issue. That is fine -- but the `PersonNameAlias` entity already exists (from #181). Does the current `PersonNameAlias` model have a `type` field, or only `alias`? If there is no type discriminator, nicknames extracted here cannot be distinguished from other alias types without a schema migration. - **Entry #4 loses the uncertain last name entirely.** `Richard (Quast ?)` becomes `firstName=Richard, lastName=null/empty`, with `Quast` only in the annotation string. If this name is used to `findOrCreate` a Person, it creates a person with no last name. Later, if someone imports `Richard Quast` (without uncertainty), the system will not match them. Is that acceptable, or should uncertain last names still populate `lastName` with a flag? ### Suggestions - Scope this issue strictly to extraction (Part 1 + raw string preservation). Do not add `AnnotationType` enum, `birthYear` field, or auto-creation of `PersonNameAlias` entries. Those belong in follow-up issues that reference this one. - Document the `NameSplitResult` (or equivalent) as an internal API contract in `PersonNameParser`. It should be package-private or at most used within the `person` domain -- no other service should depend on annotation parsing internals. - For entry #4, consider whether `Quast` should become the lastName with annotation `?` (meaning "uncertain"), rather than dropping it entirely. The current spec says `lastName` should be empty, but that creates a data loss scenario where the uncertain name vanishes from the searchable fields.

marcel commented

2026-04-07 18:42:09 +02:00

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

Test layer assignment. This is pure parsing logic with no database or HTTP dependency. All tests belong at the unit test layer (JUnit 5, no Spring context). The existing PersonNameParserTest.java is the right home. No integration or E2E tests needed for this change.
Regression risk on existing split() behavior. The issue adds parenthesis stripping as a new step before findKnownLastName. Every existing test case for split() should still pass unchanged -- names without parentheses must not be affected. Before writing new tests, run the existing suite and confirm it is green. After adding the stripping logic, run the full suite again. Any regression means the stripping regex is too greedy.
The processing order table in the comment is a test plan. Each row maps directly to a test case:
1. Input with no space before paren + known last name (Clara de Gruyter(*1871))
2. Input with space before paren + fallback split (Ernst Kurmany (?))
3. Input with no space + dot in name (Gertrud D.(Tuttu))
4. Input with double space + content-bearing paren producing single token (Richard (Quast ? ))
Edge cases not covered in the issue that need tests:
- Name with no parentheses at all (existing behavior, regression guard)
- Name with parentheses in the middle, not at the end (Hans (von) Meier) -- should this strip or not? The issue says "trailing (...) is stripped" but the regex needs to enforce end-of-string anchoring
- Multiple trailing parenthesized groups (Name (a)(b)) -- unlikely in real data, but the regex behavior should be defined
- Empty parentheses Name () -- what does annotation contain?
- Nested parentheses Name ((inner)) -- degenerate but should not crash
Test naming convention. Follow the project pattern: split_shouldExtractBirthYearAnnotation_whenParenFollowsLastName, split_shouldPreserveLastName_whenNoParenthesesPresent.

Suggestions

Write a parameterized test (@ParameterizedTest with @CsvSource or @MethodSource) for the four main cases from the comment table. This makes the input/output mapping explicit and easy to extend when new annotation patterns appear in future ODS imports.
Add an explicit regression test block: take 5-6 existing split() inputs that have no parentheses and assert they produce identical results after this change. This is cheap insurance against regex over-matching.
For the (OnlyParen) degenerate case: assert the method does not throw. The expected output ("?", "?") with annotation "OnlyParen" should be documented in the test name so the behavior is explicit, not an accident.

## 🧪 Sara Holt — QA Engineer & Test Strategist ### Questions & Observations - **Test layer assignment.** This is pure parsing logic with no database or HTTP dependency. All tests belong at the unit test layer (JUnit 5, no Spring context). The existing `PersonNameParserTest.java` is the right home. No integration or E2E tests needed for this change. - **Regression risk on existing `split()` behavior.** The issue adds parenthesis stripping as a new step before `findKnownLastName`. Every existing test case for `split()` should still pass unchanged -- names without parentheses must not be affected. Before writing new tests, run the existing suite and confirm it is green. After adding the stripping logic, run the full suite again. Any regression means the stripping regex is too greedy. - **The processing order table in the comment is a test plan.** Each row maps directly to a test case: 1. Input with no space before paren + known last name (`Clara de Gruyter(*1871)`) 2. Input with space before paren + fallback split (`Ernst Kurmany (?)`) 3. Input with no space + dot in name (`Gertrud D.(Tuttu)`) 4. Input with double space + content-bearing paren producing single token (`Richard (Quast ? )`) - **Edge cases not covered in the issue that need tests:** - Name with no parentheses at all (existing behavior, regression guard) - Name with parentheses in the middle, not at the end (`Hans (von) Meier`) -- should this strip or not? The issue says "trailing `(...)` is stripped" but the regex needs to enforce end-of-string anchoring - Multiple trailing parenthesized groups (`Name (a)(b)`) -- unlikely in real data, but the regex behavior should be defined - Empty parentheses `Name ()` -- what does annotation contain? - Nested parentheses `Name ((inner))` -- degenerate but should not crash - **Test naming convention.** Follow the project pattern: `split_shouldExtractBirthYearAnnotation_whenParenFollowsLastName`, `split_shouldPreserveLastName_whenNoParenthesesPresent`. ### Suggestions - Write a parameterized test (`@ParameterizedTest` with `@CsvSource` or `@MethodSource`) for the four main cases from the comment table. This makes the input/output mapping explicit and easy to extend when new annotation patterns appear in future ODS imports. - Add an explicit regression test block: take 5-6 existing `split()` inputs that have no parentheses and assert they produce identical results after this change. This is cheap insurance against regex over-matching. - For the `(OnlyParen)` degenerate case: assert the method does not throw. The expected output `("?", "?")` with annotation `"OnlyParen"` should be documented in the test name so the behavior is explicit, not an accident.

marcel commented

2026-04-07 18:42:24 +02:00

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

ReDoS risk in the parenthesis-stripping regex. If the regex is applied to user-supplied or import-file-supplied strings, a catastrophic backtracking pattern could cause CPU exhaustion. A simple pattern like \s*$[^)]*$\s*$ is safe (no nested quantifiers, no alternation inside repetition). But if someone writes $.*$ with greedy .* and the input contains multiple ( without closing ), backtracking can spike. Confirm the chosen regex uses [^)]* (character class negation) rather than .* inside the group.
Input length validation. The split() method processes strings from ODS imports. Is there an upper bound on input string length? An extremely long string (e.g., a malformed cell with 100KB of text) would be processed by the regex engine. A simple guard (if (input.length() > MAX_NAME_LENGTH) truncate or reject) before regex processing is good defense-in-depth.
No injection concern here. This is a parser that produces data objects, not a sink that executes queries or renders HTML. The extracted annotation string will eventually be stored or displayed -- the security boundary is at the point of use (JPA parameterized queries for storage, Svelte auto-escaping for display), not here. No action needed in this issue, but worth noting for whoever implements Part 2.
Log injection. If extracted annotations are logged (e.g., log.debug("Extracted annotation: {}", annotation)), ensure SLF4J parameterized logging is used (which it is by convention in this project). Raw string concatenation in log statements with user-derived content could inject newlines into structured logs. This is a low-severity concern but worth a glance during implementation.

Suggestions

Use [^)]* in the regex, not .*. This eliminates ReDoS risk entirely for this pattern.
No security-blocking concerns for this issue. The change is contained within a pure parser with no I/O side effects. The security boundary matters when the extracted annotations are persisted or displayed -- that is a concern for follow-up issues, not this one.

## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Questions & Observations - **ReDoS risk in the parenthesis-stripping regex.** If the regex is applied to user-supplied or import-file-supplied strings, a catastrophic backtracking pattern could cause CPU exhaustion. A simple pattern like `\s*$[^)]*$\s*$` is safe (no nested quantifiers, no alternation inside repetition). But if someone writes `$.*$` with greedy `.*` and the input contains multiple `(` without closing `)`, backtracking can spike. Confirm the chosen regex uses `[^)]*` (character class negation) rather than `.*` inside the group. - **Input length validation.** The `split()` method processes strings from ODS imports. Is there an upper bound on input string length? An extremely long string (e.g., a malformed cell with 100KB of text) would be processed by the regex engine. A simple guard (`if (input.length() > MAX_NAME_LENGTH) truncate or reject`) before regex processing is good defense-in-depth. - **No injection concern here.** This is a parser that produces data objects, not a sink that executes queries or renders HTML. The extracted annotation string will eventually be stored or displayed -- the security boundary is at the point of use (JPA parameterized queries for storage, Svelte auto-escaping for display), not here. No action needed in this issue, but worth noting for whoever implements Part 2. - **Log injection.** If extracted annotations are logged (e.g., `log.debug("Extracted annotation: {}", annotation)`), ensure SLF4J parameterized logging is used (which it is by convention in this project). Raw string concatenation in log statements with user-derived content could inject newlines into structured logs. This is a low-severity concern but worth a glance during implementation. ### Suggestions - Use `[^)]*` in the regex, not `.*`. This eliminates ReDoS risk entirely for this pattern. - No security-blocking concerns for this issue. The change is contained within a pure parser with no I/O side effects. The security boundary matters when the extracted annotations are persisted or displayed -- that is a concern for follow-up issues, not this one.

marcel commented

2026-04-07 18:42:37 +02:00

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

No direct UI impact in this issue, but the extracted annotations will eventually surface in the UI. When nicknames become PersonNameAlias entries (mentioned in Part 2), they will appear in the person detail/edit page alias section. When birth years populate a future field, they will need a display location. These downstream UI changes are out of scope here, but the data decisions made now constrain the UI options later.
Display of uncertainty markers. Entry #2 and #4 produce names with ? annotations indicating uncertainty. If these ever surface in the UI (e.g., a badge or tooltip on the person card saying "uncertain attribution"), the visual treatment matters: color alone cannot convey uncertainty (WCAG). An icon + text label ("uncertain") would be needed. This is a future concern, not blocking, but worth flagging as a design consideration when Part 2 is scoped.
Person names with abbreviations. Entry #3 produces lastName = "D." -- a single-letter abbreviation with a dot. When this displays in the person list or document metadata, it may look like a rendering error to users ("Is the last name truncated?"). Consider whether the UI should eventually show abbreviated last names differently, or whether this is just accepted as-is from the source data.
Nicknames in search. If (Tuttu) becomes a PersonNameAlias of type NICKNAME, users searching for "Tuttu" should find Gertrud D. The search behavior is a UX concern -- the alias search already works for other alias types, so this should flow naturally, but it is worth confirming during testing.

Suggestions

No UI work needed for this issue. The parser change is backend-only.
When Part 2 is scoped, request a design review for how uncertainty markers and extracted birth years will display. These are new data types that need intentional visual treatment, not ad-hoc placement.

## 🎨 Leonie Voss — UI/UX Design Lead ### Questions & Observations - **No direct UI impact in this issue**, but the extracted annotations will eventually surface in the UI. When nicknames become `PersonNameAlias` entries (mentioned in Part 2), they will appear in the person detail/edit page alias section. When birth years populate a future field, they will need a display location. These downstream UI changes are out of scope here, but the data decisions made now constrain the UI options later. - **Display of uncertainty markers.** Entry #2 and #4 produce names with `?` annotations indicating uncertainty. If these ever surface in the UI (e.g., a badge or tooltip on the person card saying "uncertain attribution"), the visual treatment matters: color alone cannot convey uncertainty (WCAG). An icon + text label ("uncertain") would be needed. This is a future concern, not blocking, but worth flagging as a design consideration when Part 2 is scoped. - **Person names with abbreviations.** Entry #3 produces `lastName = "D."` -- a single-letter abbreviation with a dot. When this displays in the person list or document metadata, it may look like a rendering error to users ("Is the last name truncated?"). Consider whether the UI should eventually show abbreviated last names differently, or whether this is just accepted as-is from the source data. - **Nicknames in search.** If `(Tuttu)` becomes a `PersonNameAlias` of type `NICKNAME`, users searching for "Tuttu" should find Gertrud D. The search behavior is a UX concern -- the alias search already works for other alias types, so this should flow naturally, but it is worth confirming during testing. ### Suggestions - No UI work needed for this issue. The parser change is backend-only. - When Part 2 is scoped, request a design review for how uncertainty markers and extracted birth years will display. These are new data types that need intentional visual treatment, not ad-hoc placement.

marcel commented

2026-04-07 18:42:46 +02:00

🔧 Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

No infrastructure impact. This change is contained within PersonNameParser.java and its test file. No new dependencies, no schema migrations, no configuration changes, no Docker or CI modifications needed.
CI pipeline consideration. The existing unit test suite will run this automatically. Since the change modifies parsing logic used by MassImportService (which processes ODS files), it may be worth confirming that the integration tests for ODS import still pass. These tests run against Testcontainers Postgres in CI -- no additional CI configuration needed, but the test run time could increase slightly if new parameterized tests add many cases. Unlikely to be measurable.
No Flyway migration needed. The issue explicitly says the extracted annotations are returned but not yet persisted. When Part 2 adds birthYear to the Person entity or a type column to PersonNameAlias, that will require a Flyway migration (V{n}__add_person_birth_year.sql or similar). Flag this for the follow-up issue.
ODS import performance. The regex stripping adds one operation per name string during import. For the typical import sizes in this project (hundreds to low thousands of rows), this is negligible. No performance concern.

Suggestions

No DevOps action items for this issue. Clean backend-only change with no operational footprint.
For the follow-up issues (Part 2): if birthYear or annotation type columns are added, ensure the Flyway migration is backward-compatible (nullable columns, no NOT NULL without defaults) so rolling deployments do not break.

## 🔧 Tobias Wendt — DevOps & Platform Engineer ### Questions & Observations - **No infrastructure impact.** This change is contained within `PersonNameParser.java` and its test file. No new dependencies, no schema migrations, no configuration changes, no Docker or CI modifications needed. - **CI pipeline consideration.** The existing unit test suite will run this automatically. Since the change modifies parsing logic used by `MassImportService` (which processes ODS files), it may be worth confirming that the integration tests for ODS import still pass. These tests run against Testcontainers Postgres in CI -- no additional CI configuration needed, but the test run time could increase slightly if new parameterized tests add many cases. Unlikely to be measurable. - **No Flyway migration needed.** The issue explicitly says the extracted annotations are returned but not yet persisted. When Part 2 adds `birthYear` to the Person entity or a `type` column to `PersonNameAlias`, that will require a Flyway migration (`V{n}__add_person_birth_year.sql` or similar). Flag this for the follow-up issue. - **ODS import performance.** The regex stripping adds one operation per name string during import. For the typical import sizes in this project (hundreds to low thousands of rows), this is negligible. No performance concern. ### Suggestions - No DevOps action items for this issue. Clean backend-only change with no operational footprint. - For the follow-up issues (Part 2): if `birthYear` or annotation type columns are added, ensure the Flyway migration is backward-compatible (nullable columns, no NOT NULL without defaults) so rolling deployments do not break.

marcel referenced this issue

2026-04-07 18:43:16 +02:00

feat(model): add title/salutation field to Person and make firstName optional #212

marcel referenced this issue

2026-04-07 18:50:13 +02:00

refactor: preparatory infrastructure for PersonNameParser enhancements (#209-#212) #213

marcel commented

2026-04-07 19:00:03 +02:00

🏗️ Markus Keller — Application Architect (Discussion Summary)

Interactive discussion with Marcel covering 3 open items. All resolved.

Resolved Items

Entry #4 Richard (Quast ?) - uncertain last name handling — Decision: use best-effort data. When parenthesized content contains a name-like string plus ? (e.g. (Quast ?)), extract the name part back into the cleaned result and store only ? as the annotation. Result: Richard (Quast ?) -> firstName="Richard", lastName="Quast", annotation="?". This preserves matchability - a "probably Quast" is more useful than a ? placeholder. For pure uncertainty markers like (?), the annotation is ? and the preceding name stays intact.
Scope: raw string only, no typed annotations — This issue strictly extracts the raw parenthesized content into the annotation field of SplitName (from #213). No AnnotationType enum, no birthYear population, no PersonNameAlias creation. Parser extracts, service interprets. Clean separation - interpretation logic belongs in follow-up issues.
Scope expanded: parseReceivers() paren handling must change — Currently parseReceivers() strips (...) in step 2 unconditionally (before checking for multi-person separators). This means An-column entries like Gertrud D.(Tuttu) have their annotation parens eaten by parseReceivers() before split() ever sees them. Fix: move paren extraction in parseReceivers() to only fire when a multi-separator (und/u) is present. Single-person inputs leave parens intact for split() to handle as annotations. Without this fix, the split() annotation extraction is useless for An-column entries.

Updated Input/Output Table for Entry #4

Input	Cleaned name	firstName	lastName	Annotation	Notes
`Richard (Quast ? )`	`Richard Quast`	`Richard`	`Quast`	`?`	Name-like content extracted back into name; only uncertainty marker in annotation
`Ernst Kurmany (?)`	`Ernst Kurmany`	`Ernst`	`Kurmany`	`?`	Pure uncertainty on existing name - no name content in parens
`Clara de Gruyter(*1871)`	`Clara de Gruyter`	`Clara`	`de Gruyter`	`*1871`	Birth year annotation, raw string
`Gertrud D.(Tuttu)`	`Gertrud D.`	`Gertrud`	`D.`	`Tuttu`	Nickname annotation, raw string

Dependency Note

This issue now depends on #213 (preparatory refactor) which provides the SplitName.annotation field and the stripAnnotation() pipeline slot. Execution order: #190 (merged) -> #213 -> #210.

## 🏗️ Markus Keller — Application Architect (Discussion Summary) Interactive discussion with Marcel covering 3 open items. All resolved. ### Resolved Items - **Entry #4 Richard (Quast ?) - uncertain last name handling** — Decision: **use best-effort data**. When parenthesized content contains a name-like string plus `?` (e.g. `(Quast ?)`), extract the name part back into the cleaned result and store only `?` as the annotation. Result: `Richard (Quast ?)` -> firstName="Richard", lastName="Quast", annotation="?". This preserves matchability - a "probably Quast" is more useful than a `?` placeholder. For pure uncertainty markers like `(?)`, the annotation is `?` and the preceding name stays intact. - **Scope: raw string only, no typed annotations** — This issue strictly extracts the raw parenthesized content into the `annotation` field of `SplitName` (from #213). No `AnnotationType` enum, no `birthYear` population, no `PersonNameAlias` creation. Parser extracts, service interprets. Clean separation - interpretation logic belongs in follow-up issues. - **Scope expanded: `parseReceivers()` paren handling must change** — Currently `parseReceivers()` strips `(...)` in step 2 unconditionally (before checking for multi-person separators). This means An-column entries like `Gertrud D.(Tuttu)` have their annotation parens eaten by `parseReceivers()` before `split()` ever sees them. **Fix**: move paren extraction in `parseReceivers()` to only fire when a multi-separator (`und`/`u`) is present. Single-person inputs leave parens intact for `split()` to handle as annotations. Without this fix, the `split()` annotation extraction is useless for An-column entries. ### Updated Input/Output Table for Entry #4 | Input | Cleaned name | firstName | lastName | Annotation | Notes | |---|---|---|---|---|---| | `Richard (Quast ? )` | `Richard Quast` | `Richard` | `Quast` | `?` | Name-like content extracted back into name; only uncertainty marker in annotation | | `Ernst Kurmany (?)` | `Ernst Kurmany` | `Ernst` | `Kurmany` | `?` | Pure uncertainty on existing name - no name content in parens | | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | `Clara` | `de Gruyter` | `*1871` | Birth year annotation, raw string | | `Gertrud D.(Tuttu)` | `Gertrud D.` | `Gertrud` | `D.` | `Tuttu` | Nickname annotation, raw string | ### Dependency Note This issue now depends on #213 (preparatory refactor) which provides the `SplitName.annotation` field and the `stripAnnotation()` pipeline slot. Execution order: #190 (merged) -> #213 -> #210.

marcel commented

2026-04-08 13:01:04 +02:00

Implementation Complete

Both parts implemented on branch feat/issues-209-213-person-parser-enhancements.

Commits

Commit	Description
`e696e50`	Implement stripAnnotation() for parenthesized content
`e49ae5d`	Fix parseReceivers() to preserve annotation parens for single-person inputs

What changed

stripAnnotation() extracts trailing (...) content as annotation. Handles birth years (*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain names (Quast ?) where the name part is extracted back into the cleaned result
parseReceivers() moved paren extraction after the multi-separator check — single-person inputs keep parens for split() annotation extraction; multi-person inputs still use (Gruber) as shared last-name override
Uses [^)]* regex to prevent ReDoS
Raw annotation string preserved in SplitName.annotation — no typed interpretation in this issue

Test results

Backend: 710 tests passing (8 new annotation tests)

## Implementation Complete Both parts implemented on branch `feat/issues-209-213-person-parser-enhancements`. ### Commits | Commit | Description | |--------|-------------| | `e696e50` | Implement stripAnnotation() for parenthesized content | | `e49ae5d` | Fix parseReceivers() to preserve annotation parens for single-person inputs | ### What changed - **stripAnnotation()** extracts trailing `(...)` content as annotation. Handles birth years `(*1871)`, nicknames `(Tuttu)`, uncertainty markers `(?)`, and uncertain names `(Quast ?)` where the name part is extracted back into the cleaned result - **parseReceivers()** moved paren extraction after the multi-separator check — single-person inputs keep parens for `split()` annotation extraction; multi-person inputs still use `(Gruber)` as shared last-name override - Uses `[^)]*` regex to prevent ReDoS - Raw annotation string preserved in `SplitName.annotation` — no typed interpretation in this issue ### Test results - Backend: 710 tests passing (8 new annotation tests)

marcel referenced this issue

2026-04-08 13:16:37 +02:00

feat: PersonNameParser enhancements and Person model refactor (#209-#213) #215

marcel closed this issue

2026-04-08 20:55:39 +02:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marcel/familienarchiv#210

feat(parser): strip parenthesized annotations in split() and preserve content #210

Problem

Solution

Part 1 - Strip parenthesized content in split()

Part 2 - Preserve extracted content

Files

Found in

Complete Input/Output Table

Entries that split() must handle

Entry already handled by parseReceivers() (no change needed)

Processing order in split()

Edge cases to test

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

Suggestions

🏗️ Markus Keller — Application Architect

Questions & Observations

Suggestions

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

Suggestions

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

Suggestions

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

Suggestions

🔧 Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

Suggestions

🏗️ Markus Keller — Application Architect (Discussion Summary)

Resolved Items

Updated Input/Output Table for Entry #4

Dependency Note

Implementation Complete

Commits

What changed

Test results

Part 1 - Strip parenthesized content in `split()`

Entries that `split()` must handle

Entry already handled by `parseReceivers()` (no change needed)

Processing order in `split()`