feat(parser): strip parenthesized annotations in split() and preserve content #210

Closed
opened 2026-04-07 18:27:57 +02:00 by marcel · 9 comments
Owner

Problem

parseReceivers() strips parenthesized content like (Gruber) as a shared last-name override. But split() does not strip parenthesized content at all, so names with annotations break:

Input Current result Expected result
Clara de Gruyter(*1871) firstName="Clara de", lastName="Gruyter(*1871)" firstName="Clara", lastName="de Gruyter", birth year=1871 preserved
Gertrud D.(Tuttu) lastName="D.(Tuttu)" firstName="Gertrud", lastName="D.", nickname "Tuttu" preserved
Ernst Kurmany (?) firstName="Ernst Kurmany", lastName="(?)" firstName="Ernst", lastName="Kurmany", uncertain flag preserved
Richard (Quast ?) firstName="Richard", lastName="(Quast ?)" via fallback firstName="Richard", lastName="Quast", uncertain flag preserved

The parenthesized content contains valuable metadata: birth years, nicknames, uncertainty markers. It should be stripped from the name but preserved.

Solution

Part 1 - Strip parenthesized content in split()

Before the findKnownLastName check, strip (...) content from the cleaned name. This should handle:

  • (*1871) - birth/death year annotations
  • (Tuttu) - nicknames
  • (?) - uncertainty markers
  • (Quast ?) - uncertain last names

Part 2 - Preserve extracted content

Return the extracted parenthesized content alongside the split name. The caller can decide how to use it:

  • Birth years could populate a future birthYear field
  • Nicknames could become a PersonNameAlias with type NICKNAME
  • Uncertainty markers could be logged/flagged

For now, at minimum extract and return the content so it's not silently lost. The consuming code can be enhanced incrementally.

Files

File Change
PersonNameParser.java Add parenthesis stripping in split(), return extracted content
PersonNameParserTest.java New tests for parenthesized annotation stripping

Found in

ODS import file analysis for #190.

## Problem `parseReceivers()` strips parenthesized content like `(Gruber)` as a shared last-name override. But `split()` does not strip parenthesized content at all, so names with annotations break: | Input | Current result | Expected result | |---|---|---| | `Clara de Gruyter(*1871)` | firstName="Clara de", lastName="Gruyter(*1871)" | firstName="Clara", lastName="de Gruyter", birth year=1871 preserved | | `Gertrud D.(Tuttu)` | lastName="D.(Tuttu)" | firstName="Gertrud", lastName="D.", nickname "Tuttu" preserved | | `Ernst Kurmany (?)` | firstName="Ernst Kurmany", lastName="(?)" | firstName="Ernst", lastName="Kurmany", uncertain flag preserved | | `Richard (Quast ?)` | firstName="Richard", lastName="(Quast ?)" via fallback | firstName="Richard", lastName="Quast", uncertain flag preserved | The parenthesized content contains valuable metadata: birth years, nicknames, uncertainty markers. It should be stripped from the name but preserved. ## Solution ### Part 1 - Strip parenthesized content in `split()` Before the `findKnownLastName` check, strip `(...)` content from the cleaned name. This should handle: - `(*1871)` - birth/death year annotations - `(Tuttu)` - nicknames - `(?)` - uncertainty markers - `(Quast ?)` - uncertain last names ### Part 2 - Preserve extracted content Return the extracted parenthesized content alongside the split name. The caller can decide how to use it: - Birth years could populate a future `birthYear` field - Nicknames could become a `PersonNameAlias` with type `NICKNAME` - Uncertainty markers could be logged/flagged For now, at minimum extract and return the content so it's not silently lost. The consuming code can be enhanced incrementally. ## Files | File | Change | |---|---| | `PersonNameParser.java` | Add parenthesis stripping in `split()`, return extracted content | | `PersonNameParserTest.java` | New tests for parenthesized annotation stripping | ## Found in ODS import file analysis for #190.
marcel added the featureperson labels 2026-04-07 18:28:41 +02:00
Author
Owner

Complete Input/Output Table

Every parenthesized entry from the ODS and how split() should handle it.

Entries that split() must handle

# Raw input Column Expected firstName Expected lastName Extracted annotation Annotation type
1 Clara de Gruyter(*1871) Von Clara de Gruyter *1871 Birth year
2 Ernst Kurmany (?) Von Ernst Kurmany ? Uncertainty
3 Gertrud D.(Tuttu) An Gertrud D. Tuttu Nickname
4 Richard (Quast ? ) An Richard Quast ? Uncertainty (uncertain last name)

Entry already handled by parseReceivers() (no change needed)

# Raw input Column Current handling Notes
5 Hedi und Tutu (Gruber) An parseReceivers() extracts Gruber as shared last name -> ["Hedi Gruber", "Tutu Gruber"] This is a shared last-name override, not an annotation. Already works correctly.

Processing order in split()

For each entry, the strip must happen BEFORE findKnownLastName and the last-space fallback:

# Input After paren strip After dot-normalization After known-last-name / fallback
1 Clara de Gruyter(*1871) Clara de Gruyter (has spaces, skip) known: de Gruyter -> ("Clara", "de Gruyter")
2 Ernst Kurmany (?) Ernst Kurmany (has spaces, skip) fallback: ("Ernst", "Kurmany")
3 Gertrud D.(Tuttu) Gertrud D. (has spaces, skip) fallback: ("Gertrud", "D.")
4 Richard (Quast ? ) Richard (no dots, skip) single token: ("Richard", "?")

Note: #4 has a double space before the paren. After stripping (Quast ? ) and trimming, only Richard remains - a single token with no last name. The uncertain last name Quast should be preserved as metadata but cannot be used as the canonical lastName since it's explicitly marked uncertain.

Edge cases to test

Input After paren strip Expected split Notes
Clara de Gruyter(*1871) Clara de Gruyter ("Clara", "de Gruyter") No space before paren
Ernst Kurmany (?) Ernst Kurmany ("Ernst", "Kurmany") Space before paren
Gertrud D.(Tuttu) Gertrud D. ("Gertrud", "D.") No space, nickname
Name (annotation) extra Should not occur in data - If it does, only trailing (...) is stripped
(OnlyParen) empty -> ("?", "?") ("?", "?") Degenerate case
## Complete Input/Output Table Every parenthesized entry from the ODS and how `split()` should handle it. ### Entries that `split()` must handle | # | Raw input | Column | Expected firstName | Expected lastName | Extracted annotation | Annotation type | |---|---|---|---|---|---|---| | 1 | `Clara de Gruyter(*1871)` | Von | `Clara` | `de Gruyter` | `*1871` | Birth year | | 2 | `Ernst Kurmany (?)` | Von | `Ernst` | `Kurmany` | `?` | Uncertainty | | 3 | `Gertrud D.(Tuttu)` | An | `Gertrud` | `D.` | `Tuttu` | Nickname | | 4 | `Richard (Quast ? )` | An | `Richard` | `Quast` | `?` | Uncertainty (uncertain last name) | ### Entry already handled by `parseReceivers()` (no change needed) | # | Raw input | Column | Current handling | Notes | |---|---|---|---|---| | 5 | `Hedi und Tutu (Gruber)` | An | `parseReceivers()` extracts `Gruber` as shared last name -> `["Hedi Gruber", "Tutu Gruber"]` | This is a shared last-name override, not an annotation. Already works correctly. | ### Processing order in `split()` For each entry, the strip must happen BEFORE `findKnownLastName` and the last-space fallback: | # | Input | After paren strip | After dot-normalization | After known-last-name / fallback | |---|---|---|---|---| | 1 | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | (has spaces, skip) | known: `de Gruyter` -> `("Clara", "de Gruyter")` | | 2 | `Ernst Kurmany (?)` | `Ernst Kurmany` | (has spaces, skip) | fallback: `("Ernst", "Kurmany")` | | 3 | `Gertrud D.(Tuttu)` | `Gertrud D.` | (has spaces, skip) | fallback: `("Gertrud", "D.")` | | 4 | `Richard (Quast ? )` | `Richard` | (no dots, skip) | single token: `("Richard", "?")` | Note: #4 has a double space before the paren. After stripping `(Quast ? )` and trimming, only `Richard` remains - a single token with no last name. The uncertain last name `Quast` should be preserved as metadata but cannot be used as the canonical lastName since it's explicitly marked uncertain. ### Edge cases to test | Input | After paren strip | Expected split | Notes | |---|---|---|---| | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | `("Clara", "de Gruyter")` | No space before paren | | `Ernst Kurmany (?)` | `Ernst Kurmany` | `("Ernst", "Kurmany")` | Space before paren | | `Gertrud D.(Tuttu)` | `Gertrud D.` | `("Gertrud", "D.")` | No space, nickname | | `Name (annotation) extra` | Should not occur in data | - | If it does, only trailing `(...)` is stripped | | `(OnlyParen)` | empty -> `("?", "?")` | `("?", "?")` | Degenerate case |
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

  • The issue says split() should return extracted parenthesized content "alongside the split name." What is the concrete return type? Currently split() returns a String[] (or similar two-element structure). Returning annotation metadata means either a new record/DTO or overloading the return value. A NameSplitResult record with firstName, lastName, and String annotation (or List<String> annotations) would be the clean solution -- is that the intended direction?

  • Entry #4 (Richard (Quast ? )) is interesting: after stripping, only Richard remains -- a single token. The comment says "single token with no last name" producing ("Richard", "?"). But ? as a lastName is a sentinel value, not a real name. Does the current split() already have a convention for "no last name found"? If it returns null or empty string, switching to ? is a behavior change that callers may not expect.

  • The regex for stripping needs to handle: no space before paren (Gruyter(*1871)), space before paren (Kurmany (?)), content with spaces inside ((Quast ? )), and multiple consecutive parens (unlikely but worth a guard). A single \s*\([^)]*\)\s*$ anchored at end-of-string should cover all four cases listed.

  • The issue scopes this to split() only, leaving parseReceivers() untouched. But parseReceivers() calls split() internally for each receiver after extracting the shared last-name override. Will the new stripping in split() accidentally interfere with the parseReceivers() path where (Gruber) has already been removed? Need to confirm that parseReceivers() strips its own parens before delegating to split(), so there is no double-processing risk.

Suggestions

  • Introduce a NameSplitResult record:

    public record NameSplitResult(String firstName, String lastName, String annotation) {}
    

    This keeps split() as a query (returns data, no side effects) and avoids encoding metadata into the name strings themselves. The annotation field can be null when no parenthesized content exists.

  • TDD sequence for the four cases: write one test per row in the input/output table from the comment, watch each fail, then generalize the regex. Start with #2 (Ernst Kurmany (?)) as the simplest case, then #1 (no space before paren), then #3 (nickname), then #4 (uncertain last name producing single token). Each test forces one refinement.

  • For the (OnlyParen) degenerate edge case: the test should assert that split() returns ("?", "?") with annotation "OnlyParen" -- confirming the annotation is still captured even when the name collapses entirely.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Questions & Observations - The issue says `split()` should return extracted parenthesized content "alongside the split name." What is the concrete return type? Currently `split()` returns a `String[]` (or similar two-element structure). Returning annotation metadata means either a new record/DTO or overloading the return value. A `NameSplitResult` record with `firstName`, `lastName`, and `String annotation` (or `List<String> annotations`) would be the clean solution -- is that the intended direction? - Entry #4 (`Richard (Quast ? )`) is interesting: after stripping, only `Richard` remains -- a single token. The comment says "single token with no last name" producing `("Richard", "?")`. But `?` as a lastName is a sentinel value, not a real name. Does the current `split()` already have a convention for "no last name found"? If it returns `null` or empty string, switching to `?` is a behavior change that callers may not expect. - The regex for stripping needs to handle: no space before paren (`Gruyter(*1871)`), space before paren (`Kurmany (?)`), content with spaces inside (`(Quast ? )`), and multiple consecutive parens (unlikely but worth a guard). A single `\s*\([^)]*\)\s*$` anchored at end-of-string should cover all four cases listed. - The issue scopes this to `split()` only, leaving `parseReceivers()` untouched. But `parseReceivers()` calls `split()` internally for each receiver after extracting the shared last-name override. Will the new stripping in `split()` accidentally interfere with the `parseReceivers()` path where `(Gruber)` has already been removed? Need to confirm that `parseReceivers()` strips its own parens before delegating to `split()`, so there is no double-processing risk. ### Suggestions - Introduce a `NameSplitResult` record: ```java public record NameSplitResult(String firstName, String lastName, String annotation) {} ``` This keeps `split()` as a query (returns data, no side effects) and avoids encoding metadata into the name strings themselves. The annotation field can be `null` when no parenthesized content exists. - TDD sequence for the four cases: write one test per row in the input/output table from the comment, watch each fail, then generalize the regex. Start with #2 (`Ernst Kurmany (?)`) as the simplest case, then #1 (no space before paren), then #3 (nickname), then #4 (uncertain last name producing single token). Each test forces one refinement. - For the `(OnlyParen)` degenerate edge case: the test should assert that `split()` returns `("?", "?")` with annotation `"OnlyParen"` -- confirming the annotation is still captured even when the name collapses entirely.
Author
Owner

🏗️ Markus Keller — Application Architect

Questions & Observations

  • Return type change is a module boundary concern. split() is called by PersonService (via findOrCreateByName or similar). Changing it from a simple String[] to a richer result type is the right call, but every caller must be updated. How many call sites exist? If parseReceivers() and MassImportService both call split(), this is a cross-cutting change that should be mapped before implementation.

  • Annotation semantics are underspecified. The issue lists three annotation types: birth year (*1871), nickname (Tuttu), and uncertainty (?). These have fundamentally different semantics -- a birth year is structured data, a nickname is a PersonNameAlias, and uncertainty is a quality flag. Returning a raw String annotation defers the parsing problem. Is the intent to parse annotation types in this issue, or truly just extract the raw string and leave interpretation for later? The answer affects whether we need an AnnotationType enum now or not.

  • Schema implications for "preserve extracted content." Part 2 says "the caller can decide how to use it" and mentions birthYear field and PersonNameAlias with type NICKNAME. Neither of these schema additions are in scope for this issue. That is fine -- but the PersonNameAlias entity already exists (from #181). Does the current PersonNameAlias model have a type field, or only alias? If there is no type discriminator, nicknames extracted here cannot be distinguished from other alias types without a schema migration.

  • Entry #4 loses the uncertain last name entirely. Richard (Quast ?) becomes firstName=Richard, lastName=null/empty, with Quast only in the annotation string. If this name is used to findOrCreate a Person, it creates a person with no last name. Later, if someone imports Richard Quast (without uncertainty), the system will not match them. Is that acceptable, or should uncertain last names still populate lastName with a flag?

Suggestions

  • Scope this issue strictly to extraction (Part 1 + raw string preservation). Do not add AnnotationType enum, birthYear field, or auto-creation of PersonNameAlias entries. Those belong in follow-up issues that reference this one.

  • Document the NameSplitResult (or equivalent) as an internal API contract in PersonNameParser. It should be package-private or at most used within the person domain -- no other service should depend on annotation parsing internals.

  • For entry #4, consider whether Quast should become the lastName with annotation ? (meaning "uncertain"), rather than dropping it entirely. The current spec says lastName should be empty, but that creates a data loss scenario where the uncertain name vanishes from the searchable fields.

## 🏗️ Markus Keller — Application Architect ### Questions & Observations - **Return type change is a module boundary concern.** `split()` is called by `PersonService` (via `findOrCreateByName` or similar). Changing it from a simple `String[]` to a richer result type is the right call, but every caller must be updated. How many call sites exist? If `parseReceivers()` and `MassImportService` both call `split()`, this is a cross-cutting change that should be mapped before implementation. - **Annotation semantics are underspecified.** The issue lists three annotation types: birth year (`*1871`), nickname (`Tuttu`), and uncertainty (`?`). These have fundamentally different semantics -- a birth year is structured data, a nickname is a `PersonNameAlias`, and uncertainty is a quality flag. Returning a raw `String annotation` defers the parsing problem. Is the intent to parse annotation types in this issue, or truly just extract the raw string and leave interpretation for later? The answer affects whether we need an `AnnotationType` enum now or not. - **Schema implications for "preserve extracted content."** Part 2 says "the caller can decide how to use it" and mentions `birthYear` field and `PersonNameAlias` with type `NICKNAME`. Neither of these schema additions are in scope for this issue. That is fine -- but the `PersonNameAlias` entity already exists (from #181). Does the current `PersonNameAlias` model have a `type` field, or only `alias`? If there is no type discriminator, nicknames extracted here cannot be distinguished from other alias types without a schema migration. - **Entry #4 loses the uncertain last name entirely.** `Richard (Quast ?)` becomes `firstName=Richard, lastName=null/empty`, with `Quast` only in the annotation string. If this name is used to `findOrCreate` a Person, it creates a person with no last name. Later, if someone imports `Richard Quast` (without uncertainty), the system will not match them. Is that acceptable, or should uncertain last names still populate `lastName` with a flag? ### Suggestions - Scope this issue strictly to extraction (Part 1 + raw string preservation). Do not add `AnnotationType` enum, `birthYear` field, or auto-creation of `PersonNameAlias` entries. Those belong in follow-up issues that reference this one. - Document the `NameSplitResult` (or equivalent) as an internal API contract in `PersonNameParser`. It should be package-private or at most used within the `person` domain -- no other service should depend on annotation parsing internals. - For entry #4, consider whether `Quast` should become the lastName with annotation `?` (meaning "uncertain"), rather than dropping it entirely. The current spec says `lastName` should be empty, but that creates a data loss scenario where the uncertain name vanishes from the searchable fields.
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Questions & Observations

  • Test layer assignment. This is pure parsing logic with no database or HTTP dependency. All tests belong at the unit test layer (JUnit 5, no Spring context). The existing PersonNameParserTest.java is the right home. No integration or E2E tests needed for this change.

  • Regression risk on existing split() behavior. The issue adds parenthesis stripping as a new step before findKnownLastName. Every existing test case for split() should still pass unchanged -- names without parentheses must not be affected. Before writing new tests, run the existing suite and confirm it is green. After adding the stripping logic, run the full suite again. Any regression means the stripping regex is too greedy.

  • The processing order table in the comment is a test plan. Each row maps directly to a test case:

    1. Input with no space before paren + known last name (Clara de Gruyter(*1871))
    2. Input with space before paren + fallback split (Ernst Kurmany (?))
    3. Input with no space + dot in name (Gertrud D.(Tuttu))
    4. Input with double space + content-bearing paren producing single token (Richard (Quast ? ))
  • Edge cases not covered in the issue that need tests:

    • Name with no parentheses at all (existing behavior, regression guard)
    • Name with parentheses in the middle, not at the end (Hans (von) Meier) -- should this strip or not? The issue says "trailing (...) is stripped" but the regex needs to enforce end-of-string anchoring
    • Multiple trailing parenthesized groups (Name (a)(b)) -- unlikely in real data, but the regex behavior should be defined
    • Empty parentheses Name () -- what does annotation contain?
    • Nested parentheses Name ((inner)) -- degenerate but should not crash
  • Test naming convention. Follow the project pattern: split_shouldExtractBirthYearAnnotation_whenParenFollowsLastName, split_shouldPreserveLastName_whenNoParenthesesPresent.

Suggestions

  • Write a parameterized test (@ParameterizedTest with @CsvSource or @MethodSource) for the four main cases from the comment table. This makes the input/output mapping explicit and easy to extend when new annotation patterns appear in future ODS imports.

  • Add an explicit regression test block: take 5-6 existing split() inputs that have no parentheses and assert they produce identical results after this change. This is cheap insurance against regex over-matching.

  • For the (OnlyParen) degenerate case: assert the method does not throw. The expected output ("?", "?") with annotation "OnlyParen" should be documented in the test name so the behavior is explicit, not an accident.

## 🧪 Sara Holt — QA Engineer & Test Strategist ### Questions & Observations - **Test layer assignment.** This is pure parsing logic with no database or HTTP dependency. All tests belong at the unit test layer (JUnit 5, no Spring context). The existing `PersonNameParserTest.java` is the right home. No integration or E2E tests needed for this change. - **Regression risk on existing `split()` behavior.** The issue adds parenthesis stripping as a new step before `findKnownLastName`. Every existing test case for `split()` should still pass unchanged -- names without parentheses must not be affected. Before writing new tests, run the existing suite and confirm it is green. After adding the stripping logic, run the full suite again. Any regression means the stripping regex is too greedy. - **The processing order table in the comment is a test plan.** Each row maps directly to a test case: 1. Input with no space before paren + known last name (`Clara de Gruyter(*1871)`) 2. Input with space before paren + fallback split (`Ernst Kurmany (?)`) 3. Input with no space + dot in name (`Gertrud D.(Tuttu)`) 4. Input with double space + content-bearing paren producing single token (`Richard (Quast ? )`) - **Edge cases not covered in the issue that need tests:** - Name with no parentheses at all (existing behavior, regression guard) - Name with parentheses in the middle, not at the end (`Hans (von) Meier`) -- should this strip or not? The issue says "trailing `(...)` is stripped" but the regex needs to enforce end-of-string anchoring - Multiple trailing parenthesized groups (`Name (a)(b)`) -- unlikely in real data, but the regex behavior should be defined - Empty parentheses `Name ()` -- what does annotation contain? - Nested parentheses `Name ((inner))` -- degenerate but should not crash - **Test naming convention.** Follow the project pattern: `split_shouldExtractBirthYearAnnotation_whenParenFollowsLastName`, `split_shouldPreserveLastName_whenNoParenthesesPresent`. ### Suggestions - Write a parameterized test (`@ParameterizedTest` with `@CsvSource` or `@MethodSource`) for the four main cases from the comment table. This makes the input/output mapping explicit and easy to extend when new annotation patterns appear in future ODS imports. - Add an explicit regression test block: take 5-6 existing `split()` inputs that have no parentheses and assert they produce identical results after this change. This is cheap insurance against regex over-matching. - For the `(OnlyParen)` degenerate case: assert the method does not throw. The expected output `("?", "?")` with annotation `"OnlyParen"` should be documented in the test name so the behavior is explicit, not an accident.
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

  • ReDoS risk in the parenthesis-stripping regex. If the regex is applied to user-supplied or import-file-supplied strings, a catastrophic backtracking pattern could cause CPU exhaustion. A simple pattern like \s*\([^)]*\)\s*$ is safe (no nested quantifiers, no alternation inside repetition). But if someone writes \(.*\) with greedy .* and the input contains multiple ( without closing ), backtracking can spike. Confirm the chosen regex uses [^)]* (character class negation) rather than .* inside the group.

  • Input length validation. The split() method processes strings from ODS imports. Is there an upper bound on input string length? An extremely long string (e.g., a malformed cell with 100KB of text) would be processed by the regex engine. A simple guard (if (input.length() > MAX_NAME_LENGTH) truncate or reject) before regex processing is good defense-in-depth.

  • No injection concern here. This is a parser that produces data objects, not a sink that executes queries or renders HTML. The extracted annotation string will eventually be stored or displayed -- the security boundary is at the point of use (JPA parameterized queries for storage, Svelte auto-escaping for display), not here. No action needed in this issue, but worth noting for whoever implements Part 2.

  • Log injection. If extracted annotations are logged (e.g., log.debug("Extracted annotation: {}", annotation)), ensure SLF4J parameterized logging is used (which it is by convention in this project). Raw string concatenation in log statements with user-derived content could inject newlines into structured logs. This is a low-severity concern but worth a glance during implementation.

Suggestions

  • Use [^)]* in the regex, not .*. This eliminates ReDoS risk entirely for this pattern.

  • No security-blocking concerns for this issue. The change is contained within a pure parser with no I/O side effects. The security boundary matters when the extracted annotations are persisted or displayed -- that is a concern for follow-up issues, not this one.

## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Questions & Observations - **ReDoS risk in the parenthesis-stripping regex.** If the regex is applied to user-supplied or import-file-supplied strings, a catastrophic backtracking pattern could cause CPU exhaustion. A simple pattern like `\s*\([^)]*\)\s*$` is safe (no nested quantifiers, no alternation inside repetition). But if someone writes `\(.*\)` with greedy `.*` and the input contains multiple `(` without closing `)`, backtracking can spike. Confirm the chosen regex uses `[^)]*` (character class negation) rather than `.*` inside the group. - **Input length validation.** The `split()` method processes strings from ODS imports. Is there an upper bound on input string length? An extremely long string (e.g., a malformed cell with 100KB of text) would be processed by the regex engine. A simple guard (`if (input.length() > MAX_NAME_LENGTH) truncate or reject`) before regex processing is good defense-in-depth. - **No injection concern here.** This is a parser that produces data objects, not a sink that executes queries or renders HTML. The extracted annotation string will eventually be stored or displayed -- the security boundary is at the point of use (JPA parameterized queries for storage, Svelte auto-escaping for display), not here. No action needed in this issue, but worth noting for whoever implements Part 2. - **Log injection.** If extracted annotations are logged (e.g., `log.debug("Extracted annotation: {}", annotation)`), ensure SLF4J parameterized logging is used (which it is by convention in this project). Raw string concatenation in log statements with user-derived content could inject newlines into structured logs. This is a low-severity concern but worth a glance during implementation. ### Suggestions - Use `[^)]*` in the regex, not `.*`. This eliminates ReDoS risk entirely for this pattern. - No security-blocking concerns for this issue. The change is contained within a pure parser with no I/O side effects. The security boundary matters when the extracted annotations are persisted or displayed -- that is a concern for follow-up issues, not this one.
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

  • No direct UI impact in this issue, but the extracted annotations will eventually surface in the UI. When nicknames become PersonNameAlias entries (mentioned in Part 2), they will appear in the person detail/edit page alias section. When birth years populate a future field, they will need a display location. These downstream UI changes are out of scope here, but the data decisions made now constrain the UI options later.

  • Display of uncertainty markers. Entry #2 and #4 produce names with ? annotations indicating uncertainty. If these ever surface in the UI (e.g., a badge or tooltip on the person card saying "uncertain attribution"), the visual treatment matters: color alone cannot convey uncertainty (WCAG). An icon + text label ("uncertain") would be needed. This is a future concern, not blocking, but worth flagging as a design consideration when Part 2 is scoped.

  • Person names with abbreviations. Entry #3 produces lastName = "D." -- a single-letter abbreviation with a dot. When this displays in the person list or document metadata, it may look like a rendering error to users ("Is the last name truncated?"). Consider whether the UI should eventually show abbreviated last names differently, or whether this is just accepted as-is from the source data.

  • Nicknames in search. If (Tuttu) becomes a PersonNameAlias of type NICKNAME, users searching for "Tuttu" should find Gertrud D. The search behavior is a UX concern -- the alias search already works for other alias types, so this should flow naturally, but it is worth confirming during testing.

Suggestions

  • No UI work needed for this issue. The parser change is backend-only.

  • When Part 2 is scoped, request a design review for how uncertainty markers and extracted birth years will display. These are new data types that need intentional visual treatment, not ad-hoc placement.

## 🎨 Leonie Voss — UI/UX Design Lead ### Questions & Observations - **No direct UI impact in this issue**, but the extracted annotations will eventually surface in the UI. When nicknames become `PersonNameAlias` entries (mentioned in Part 2), they will appear in the person detail/edit page alias section. When birth years populate a future field, they will need a display location. These downstream UI changes are out of scope here, but the data decisions made now constrain the UI options later. - **Display of uncertainty markers.** Entry #2 and #4 produce names with `?` annotations indicating uncertainty. If these ever surface in the UI (e.g., a badge or tooltip on the person card saying "uncertain attribution"), the visual treatment matters: color alone cannot convey uncertainty (WCAG). An icon + text label ("uncertain") would be needed. This is a future concern, not blocking, but worth flagging as a design consideration when Part 2 is scoped. - **Person names with abbreviations.** Entry #3 produces `lastName = "D."` -- a single-letter abbreviation with a dot. When this displays in the person list or document metadata, it may look like a rendering error to users ("Is the last name truncated?"). Consider whether the UI should eventually show abbreviated last names differently, or whether this is just accepted as-is from the source data. - **Nicknames in search.** If `(Tuttu)` becomes a `PersonNameAlias` of type `NICKNAME`, users searching for "Tuttu" should find Gertrud D. The search behavior is a UX concern -- the alias search already works for other alias types, so this should flow naturally, but it is worth confirming during testing. ### Suggestions - No UI work needed for this issue. The parser change is backend-only. - When Part 2 is scoped, request a design review for how uncertainty markers and extracted birth years will display. These are new data types that need intentional visual treatment, not ad-hoc placement.
Author
Owner

🔧 Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

  • No infrastructure impact. This change is contained within PersonNameParser.java and its test file. No new dependencies, no schema migrations, no configuration changes, no Docker or CI modifications needed.

  • CI pipeline consideration. The existing unit test suite will run this automatically. Since the change modifies parsing logic used by MassImportService (which processes ODS files), it may be worth confirming that the integration tests for ODS import still pass. These tests run against Testcontainers Postgres in CI -- no additional CI configuration needed, but the test run time could increase slightly if new parameterized tests add many cases. Unlikely to be measurable.

  • No Flyway migration needed. The issue explicitly says the extracted annotations are returned but not yet persisted. When Part 2 adds birthYear to the Person entity or a type column to PersonNameAlias, that will require a Flyway migration (V{n}__add_person_birth_year.sql or similar). Flag this for the follow-up issue.

  • ODS import performance. The regex stripping adds one operation per name string during import. For the typical import sizes in this project (hundreds to low thousands of rows), this is negligible. No performance concern.

Suggestions

  • No DevOps action items for this issue. Clean backend-only change with no operational footprint.

  • For the follow-up issues (Part 2): if birthYear or annotation type columns are added, ensure the Flyway migration is backward-compatible (nullable columns, no NOT NULL without defaults) so rolling deployments do not break.

## 🔧 Tobias Wendt — DevOps & Platform Engineer ### Questions & Observations - **No infrastructure impact.** This change is contained within `PersonNameParser.java` and its test file. No new dependencies, no schema migrations, no configuration changes, no Docker or CI modifications needed. - **CI pipeline consideration.** The existing unit test suite will run this automatically. Since the change modifies parsing logic used by `MassImportService` (which processes ODS files), it may be worth confirming that the integration tests for ODS import still pass. These tests run against Testcontainers Postgres in CI -- no additional CI configuration needed, but the test run time could increase slightly if new parameterized tests add many cases. Unlikely to be measurable. - **No Flyway migration needed.** The issue explicitly says the extracted annotations are returned but not yet persisted. When Part 2 adds `birthYear` to the Person entity or a `type` column to `PersonNameAlias`, that will require a Flyway migration (`V{n}__add_person_birth_year.sql` or similar). Flag this for the follow-up issue. - **ODS import performance.** The regex stripping adds one operation per name string during import. For the typical import sizes in this project (hundreds to low thousands of rows), this is negligible. No performance concern. ### Suggestions - No DevOps action items for this issue. Clean backend-only change with no operational footprint. - For the follow-up issues (Part 2): if `birthYear` or annotation type columns are added, ensure the Flyway migration is backward-compatible (nullable columns, no NOT NULL without defaults) so rolling deployments do not break.
Author
Owner

🏗️ Markus Keller — Application Architect (Discussion Summary)

Interactive discussion with Marcel covering 3 open items. All resolved.

Resolved Items

  • Entry #4 Richard (Quast ?) - uncertain last name handling — Decision: use best-effort data. When parenthesized content contains a name-like string plus ? (e.g. (Quast ?)), extract the name part back into the cleaned result and store only ? as the annotation. Result: Richard (Quast ?) -> firstName="Richard", lastName="Quast", annotation="?". This preserves matchability - a "probably Quast" is more useful than a ? placeholder. For pure uncertainty markers like (?), the annotation is ? and the preceding name stays intact.

  • Scope: raw string only, no typed annotations — This issue strictly extracts the raw parenthesized content into the annotation field of SplitName (from #213). No AnnotationType enum, no birthYear population, no PersonNameAlias creation. Parser extracts, service interprets. Clean separation - interpretation logic belongs in follow-up issues.

  • Scope expanded: parseReceivers() paren handling must change — Currently parseReceivers() strips (...) in step 2 unconditionally (before checking for multi-person separators). This means An-column entries like Gertrud D.(Tuttu) have their annotation parens eaten by parseReceivers() before split() ever sees them. Fix: move paren extraction in parseReceivers() to only fire when a multi-separator (und/u) is present. Single-person inputs leave parens intact for split() to handle as annotations. Without this fix, the split() annotation extraction is useless for An-column entries.

Updated Input/Output Table for Entry #4

Input Cleaned name firstName lastName Annotation Notes
Richard (Quast ? ) Richard Quast Richard Quast ? Name-like content extracted back into name; only uncertainty marker in annotation
Ernst Kurmany (?) Ernst Kurmany Ernst Kurmany ? Pure uncertainty on existing name - no name content in parens
Clara de Gruyter(*1871) Clara de Gruyter Clara de Gruyter *1871 Birth year annotation, raw string
Gertrud D.(Tuttu) Gertrud D. Gertrud D. Tuttu Nickname annotation, raw string

Dependency Note

This issue now depends on #213 (preparatory refactor) which provides the SplitName.annotation field and the stripAnnotation() pipeline slot. Execution order: #190 (merged) -> #213 -> #210.

## 🏗️ Markus Keller — Application Architect (Discussion Summary) Interactive discussion with Marcel covering 3 open items. All resolved. ### Resolved Items - **Entry #4 Richard (Quast ?) - uncertain last name handling** — Decision: **use best-effort data**. When parenthesized content contains a name-like string plus `?` (e.g. `(Quast ?)`), extract the name part back into the cleaned result and store only `?` as the annotation. Result: `Richard (Quast ?)` -> firstName="Richard", lastName="Quast", annotation="?". This preserves matchability - a "probably Quast" is more useful than a `?` placeholder. For pure uncertainty markers like `(?)`, the annotation is `?` and the preceding name stays intact. - **Scope: raw string only, no typed annotations** — This issue strictly extracts the raw parenthesized content into the `annotation` field of `SplitName` (from #213). No `AnnotationType` enum, no `birthYear` population, no `PersonNameAlias` creation. Parser extracts, service interprets. Clean separation - interpretation logic belongs in follow-up issues. - **Scope expanded: `parseReceivers()` paren handling must change** — Currently `parseReceivers()` strips `(...)` in step 2 unconditionally (before checking for multi-person separators). This means An-column entries like `Gertrud D.(Tuttu)` have their annotation parens eaten by `parseReceivers()` before `split()` ever sees them. **Fix**: move paren extraction in `parseReceivers()` to only fire when a multi-separator (`und`/`u`) is present. Single-person inputs leave parens intact for `split()` to handle as annotations. Without this fix, the `split()` annotation extraction is useless for An-column entries. ### Updated Input/Output Table for Entry #4 | Input | Cleaned name | firstName | lastName | Annotation | Notes | |---|---|---|---|---|---| | `Richard (Quast ? )` | `Richard Quast` | `Richard` | `Quast` | `?` | Name-like content extracted back into name; only uncertainty marker in annotation | | `Ernst Kurmany (?)` | `Ernst Kurmany` | `Ernst` | `Kurmany` | `?` | Pure uncertainty on existing name - no name content in parens | | `Clara de Gruyter(*1871)` | `Clara de Gruyter` | `Clara` | `de Gruyter` | `*1871` | Birth year annotation, raw string | | `Gertrud D.(Tuttu)` | `Gertrud D.` | `Gertrud` | `D.` | `Tuttu` | Nickname annotation, raw string | ### Dependency Note This issue now depends on #213 (preparatory refactor) which provides the `SplitName.annotation` field and the `stripAnnotation()` pipeline slot. Execution order: #190 (merged) -> #213 -> #210.
Author
Owner

Implementation Complete

Both parts implemented on branch feat/issues-209-213-person-parser-enhancements.

Commits

Commit Description
e696e50 Implement stripAnnotation() for parenthesized content
e49ae5d Fix parseReceivers() to preserve annotation parens for single-person inputs

What changed

  • stripAnnotation() extracts trailing (...) content as annotation. Handles birth years (*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain names (Quast ?) where the name part is extracted back into the cleaned result
  • parseReceivers() moved paren extraction after the multi-separator check — single-person inputs keep parens for split() annotation extraction; multi-person inputs still use (Gruber) as shared last-name override
  • Uses [^)]* regex to prevent ReDoS
  • Raw annotation string preserved in SplitName.annotation — no typed interpretation in this issue

Test results

  • Backend: 710 tests passing (8 new annotation tests)
## Implementation Complete Both parts implemented on branch `feat/issues-209-213-person-parser-enhancements`. ### Commits | Commit | Description | |--------|-------------| | `e696e50` | Implement stripAnnotation() for parenthesized content | | `e49ae5d` | Fix parseReceivers() to preserve annotation parens for single-person inputs | ### What changed - **stripAnnotation()** extracts trailing `(...)` content as annotation. Handles birth years `(*1871)`, nicknames `(Tuttu)`, uncertainty markers `(?)`, and uncertain names `(Quast ?)` where the name part is extracted back into the cleaned result - **parseReceivers()** moved paren extraction after the multi-separator check — single-person inputs keep parens for `split()` annotation extraction; multi-person inputs still use `(Gruber)` as shared last-name override - Uses `[^)]*` regex to prevent ReDoS - Raw annotation string preserved in `SplitName.annotation` — no typed interpretation in this issue ### Test results - Backend: 710 tests passing (8 new annotation tests)
Sign in to join this conversation.
No Label feature person
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#210