feat(parser): support // separator and dot-compressed names #208

Merged
marcel merged 3 commits from feat/issue-190-slash-separator-dot-compressed into main 2026-04-08 18:46:53 +02:00
Owner

Summary

Closes #190. Two enhancements to PersonNameParser:

  • // pre-split in parseReceivers() — splits on // before existing logic, so each segment recurses through the full pipeline independently. Handles mixed cases like Herbert und Clara Cram//Eugenie de Gruyter naturally.
  • Dot-normalization in split() — inserts spaces after dots when the name has no spaces but contains dots (E.Rockstroh -> E. Rockstroh), so the existing last-space fallback handles the rest.

Changes

File Change
PersonNameParser.java SLASH_SEPARATOR constant + pre-split step; dot-normalization step
PersonNameParserTest.java 12 new tests (5 slash, 5 dot-compressed, 2 regression/interaction)

Test plan

  • All 35 PersonNameParserTest tests green (was 23)
  • Backend builds cleanly (mvnw clean package -DskipTests)
  • Run full backend test suite to confirm no regressions
## Summary Closes #190. Two enhancements to `PersonNameParser`: - **`//` pre-split in `parseReceivers()`** — splits on `//` before existing logic, so each segment recurses through the full pipeline independently. Handles mixed cases like `Herbert und Clara Cram//Eugenie de Gruyter` naturally. - **Dot-normalization in `split()`** — inserts spaces after dots when the name has no spaces but contains dots (`E.Rockstroh` -> `E. Rockstroh`), so the existing last-space fallback handles the rest. ## Changes | File | Change | |---|---| | `PersonNameParser.java` | `SLASH_SEPARATOR` constant + pre-split step; dot-normalization step | | `PersonNameParserTest.java` | 12 new tests (5 slash, 5 dot-compressed, 2 regression/interaction) | ## Test plan - [x] All 35 `PersonNameParserTest` tests green (was 23) - [x] Backend builds cleanly (`mvnw clean package -DskipTests`) - [ ] Run full backend test suite to confirm no regressions
marcel added 3 commits 2026-04-07 17:43:30 +02:00
Pre-splits input on "//" before existing logic so each segment is
processed independently through the full pipeline (und/u splitting,
last-name distribution, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inserts spaces after dots when the cleaned name has no spaces but
contains dots, so the existing last-space fallback handles names
like "E.Rockstroh" and "Dr.Fr.Zarncke" correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test(parser): add regression and cross-feature interaction tests
Some checks failed
CI / Unit & Component Tests (pull_request) Failing after 3s
CI / Backend Unit Tests (pull_request) Failing after 2s
CI / Unit & Component Tests (push) Failing after 4s
CI / Backend Unit Tests (push) Failing after 3s
d6e74972eb
Regression test confirms already-spaced dot names are not double-spaced.
Interaction test confirms // separator works with dot-compressed names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Verdict: Approved

TDD Evidence

  • Red/green/refactor cycle followed correctly across 3 commits:
    1. Slash separator tests (RED) + implementation (GREEN) in one commit
    2. Dot-compressed tests (RED) + implementation (GREEN) in one commit
    3. Regression/interaction tests in a third commit
  • 12 new tests total, all with descriptive sentence-style names (slashSeparator_segmentContainsUnd, split_dotCompressed_titleFirstNameLastName)
  • Tests fail for exactly one reason each — clean assertion failures, not NPEs

Clean Code Check

  • Naming: SLASH_SEPARATOR is clear and follows the existing pattern (GEB_PATTERN, PAREN_LAST_NAME, MULTI_SEPARATOR). Good.
  • Function size: Neither parseReceivers() nor split() grew beyond reasonable length. The pre-split adds 7 lines, the dot-normalization adds 3 lines. No splitting needed.
  • Guard clauses: The slashParts.length > 1 check is a clean early return. The !cleaned.contains(" ") && cleaned.contains(".") guard is tight and only fires for genuinely compressed names.
  • No dead code: Nothing commented out, no unused variables.
  • Comments: The two inline comments ("Pre-split on //" and "Normalize dot-compressed names") explain what and why concisely. Acceptable for non-obvious normalization logic.

Specific Observations

  • SLASH_SEPARATOR.split(raw, -1) — the -1 limit is necessary to handle trailing // correctly (Java's Pattern.split drops trailing empty strings by default). Good catch during implementation.
  • The recursive approach in Part 1 reuses the full parseReceivers pipeline per segment — no logic duplication. This is the right call.
  • The cleaned.replace(".", ". ").trim() in Part 2 is KISS at its best — 3 lines that normalize input so the existing last-space fallback handles everything.
  • parseReceivers_dotCompressedName_passthrough confirms layer separation — parseReceivers doesn't try to split on dots, leaving that to split() downstream.

No Issues Found

Clean, minimal implementation. Both changes stay within PersonNameParser with no cross-domain impact.

## 👨‍💻 Felix Brandt — Senior Fullstack Developer **Verdict: ✅ Approved** ### TDD Evidence - Red/green/refactor cycle followed correctly across 3 commits: 1. Slash separator tests (RED) + implementation (GREEN) in one commit 2. Dot-compressed tests (RED) + implementation (GREEN) in one commit 3. Regression/interaction tests in a third commit - 12 new tests total, all with descriptive sentence-style names (`slashSeparator_segmentContainsUnd`, `split_dotCompressed_titleFirstNameLastName`) - Tests fail for exactly one reason each — clean assertion failures, not NPEs ### Clean Code Check - **Naming**: `SLASH_SEPARATOR` is clear and follows the existing pattern (`GEB_PATTERN`, `PAREN_LAST_NAME`, `MULTI_SEPARATOR`). Good. - **Function size**: Neither `parseReceivers()` nor `split()` grew beyond reasonable length. The pre-split adds 7 lines, the dot-normalization adds 3 lines. No splitting needed. - **Guard clauses**: The `slashParts.length > 1` check is a clean early return. The `!cleaned.contains(" ") && cleaned.contains(".")` guard is tight and only fires for genuinely compressed names. - **No dead code**: Nothing commented out, no unused variables. - **Comments**: The two inline comments ("Pre-split on //" and "Normalize dot-compressed names") explain *what* and *why* concisely. Acceptable for non-obvious normalization logic. ### Specific Observations - `SLASH_SEPARATOR.split(raw, -1)` — the `-1` limit is necessary to handle trailing `//` correctly (Java's `Pattern.split` drops trailing empty strings by default). Good catch during implementation. - The recursive approach in Part 1 reuses the full `parseReceivers` pipeline per segment — no logic duplication. This is the right call. - The `cleaned.replace(".", ". ").trim()` in Part 2 is KISS at its best — 3 lines that normalize input so the existing last-space fallback handles everything. - `parseReceivers_dotCompressedName_passthrough` confirms layer separation — `parseReceivers` doesn't try to split on dots, leaving that to `split()` downstream. ### No Issues Found Clean, minimal implementation. Both changes stay within `PersonNameParser` with no cross-domain impact.
Author
Owner

🏗️ Markus Keller — Application Architect

Verdict: Approved

Architecture Check

  • Both changes are confined to PersonNameParser.java, a pure static utility class in the service layer. No new classes, no schema changes, no API surface changes. This is the correct scope for a parser enhancement.
  • No cross-domain impact: parseReceivers() is called from MassImportService, split() from PersonService.findOrCreateByAlias(). Neither change is visible to callers — input/output contracts are unchanged.
  • No layer violations: The parser stays in the service layer where it belongs. No repository access, no controller logic, no frontend changes needed.

Design Observations

  • Part 1 (// pre-split): The recursive approach is architecturally sound — each //-segment re-enters parseReceivers(), so all existing normalization (geb. stripping, und/u splitting, last-name distribution, parenthesized overrides) applies to each segment automatically. No logic duplication.
  • Part 2 (dot-normalization): Normalizing input to match what the existing split() already handles, rather than adding a parallel code path. Method branching complexity stays flat.
  • The static final Pattern SLASH_SEPARATOR is compiled once — no runtime regex compilation overhead.

Complexity Assessment

PersonNameParser now has 4 normalization steps in parseReceivers() (null check, // pre-split, geb. stripping, und/u splitting) and 3 in split() (null check, geb. stripping, dot-normalization). This is manageable. If more steps are added in the future, consider extracting each into a named method and composing them as a pipeline — but that refactor is not needed yet (KISS).

## 🏗️ Markus Keller — Application Architect **Verdict: ✅ Approved** ### Architecture Check - Both changes are confined to `PersonNameParser.java`, a pure static utility class in the service layer. No new classes, no schema changes, no API surface changes. This is the correct scope for a parser enhancement. - **No cross-domain impact**: `parseReceivers()` is called from `MassImportService`, `split()` from `PersonService.findOrCreateByAlias()`. Neither change is visible to callers — input/output contracts are unchanged. - **No layer violations**: The parser stays in the service layer where it belongs. No repository access, no controller logic, no frontend changes needed. ### Design Observations - **Part 1 (// pre-split)**: The recursive approach is architecturally sound — each `//`-segment re-enters `parseReceivers()`, so all existing normalization (geb. stripping, und/u splitting, last-name distribution, parenthesized overrides) applies to each segment automatically. No logic duplication. - **Part 2 (dot-normalization)**: Normalizing input to match what the existing `split()` already handles, rather than adding a parallel code path. Method branching complexity stays flat. - The `static final Pattern SLASH_SEPARATOR` is compiled once — no runtime regex compilation overhead. ### Complexity Assessment `PersonNameParser` now has 4 normalization steps in `parseReceivers()` (null check, `//` pre-split, geb. stripping, und/u splitting) and 3 in `split()` (null check, geb. stripping, dot-normalization). This is manageable. If more steps are added in the future, consider extracting each into a named method and composing them as a pipeline — but that refactor is not needed yet (KISS).
Author
Owner

🧪 Sara Holt — QA Engineer

Verdict: Approved

Test Coverage Analysis

  • 12 new tests added, bringing PersonNameParserTest from 23 to 35 tests. All at the unit layer — correct placement for pure parsing logic.
  • Tests are well-structured with clear separation:
    • 5 tests for // splitting in parseReceivers()
    • 5 tests for dot-normalization in split()
    • 1 regression test confirming already-spaced names aren't double-spaced
    • 1 cross-feature interaction test (E.Rockstroh//Dr.Fr.Zarncke)

Test Quality

  • Descriptive names: Every test reads as a sentence — slashSeparator_segmentContainsUnd, split_dotCompressed_titleFirstNameLastName. Good.
  • One assertion per logical behavior: Each test asserts one specific parsing outcome.
  • Edge cases covered:
    • Trailing // (blank filter works)
    • Spaces around // (trimming works)
    • Mixed // with und splitting (recursion works)
    • Already-spaced dot names (guard prevents double-spacing)
    • Layer separation (parseReceivers doesn't split on dots)

Suggested Additional Coverage (non-blocking)

  • Leading //: //Herbert Cram — the blank filter should handle it, but an explicit test documents the contract.
  • Multiple consecutive //: Herbert Cram////Eugenie — do empty segments get filtered cleanly?
  • E..Rockstroh (consecutive dots) — after replace(".", ". ") becomes E. . Rockstroh. Does the fallback handle the extra whitespace? Likely yes due to trim() and last-space logic, but worth a test.

These are nice-to-haves, not blockers. The current coverage is solid for the specified requirements.

## 🧪 Sara Holt — QA Engineer **Verdict: ✅ Approved** ### Test Coverage Analysis - **12 new tests** added, bringing `PersonNameParserTest` from 23 to 35 tests. All at the unit layer — correct placement for pure parsing logic. - Tests are well-structured with clear separation: - 5 tests for `//` splitting in `parseReceivers()` - 5 tests for dot-normalization in `split()` - 1 regression test confirming already-spaced names aren't double-spaced - 1 cross-feature interaction test (`E.Rockstroh//Dr.Fr.Zarncke`) ### Test Quality - **Descriptive names**: Every test reads as a sentence — `slashSeparator_segmentContainsUnd`, `split_dotCompressed_titleFirstNameLastName`. Good. - **One assertion per logical behavior**: Each test asserts one specific parsing outcome. - **Edge cases covered**: - Trailing `//` (blank filter works) - Spaces around `//` (trimming works) - Mixed `//` with `und` splitting (recursion works) - Already-spaced dot names (guard prevents double-spacing) - Layer separation (`parseReceivers` doesn't split on dots) ### Suggested Additional Coverage (non-blocking) - **Leading `//`**: `//Herbert Cram` — the blank filter should handle it, but an explicit test documents the contract. - **Multiple consecutive `//`**: `Herbert Cram////Eugenie` — do empty segments get filtered cleanly? - **`E..Rockstroh`** (consecutive dots) — after `replace(".", ". ")` becomes `E. . Rockstroh`. Does the fallback handle the extra whitespace? Likely yes due to `trim()` and last-space logic, but worth a test. These are nice-to-haves, not blockers. The current coverage is solid for the specified requirements.
Author
Owner

🔒 Nora "NullX" Steiner — Security Engineer

Verdict: Approved

Attack Surface Analysis

  • Both changes are pure parsing logic in a backend utility class that processes spreadsheet data from authenticated admin users (mass import). This is not processing untrusted external input in a web request context.
  • No new endpoints, no new user input vectors, no changes to authentication or authorization.

Specific Checks

  • ReDoS risk: SLASH_SEPARATOR = Pattern.compile("//") is a literal pattern with no quantifiers — zero ReDoS risk. The existing MULTI_SEPARATOR pattern (\\s+(?:und|u)\\s+) is also safe — bounded alternation with no nested quantifiers.
  • Recursion depth: The recursive call parseReceivers(segment) is bounded by the number of //-segments in a single spreadsheet cell, which is naturally small. No stack overflow risk from realistic input. A pathological input with thousands of // separators would need to come from a malicious spreadsheet uploaded by an authenticated admin — extremely low risk.
  • String replacement: cleaned.replace(".", ". ") is a literal string replacement, not regex — no injection vector.
  • No new dependencies: Only java.util.Arrays was added as an import.

No Security Concerns

Both changes are confined to a trusted-input parser within the service layer. No new attack vectors introduced.

## 🔒 Nora "NullX" Steiner — Security Engineer **Verdict: ✅ Approved** ### Attack Surface Analysis - Both changes are pure parsing logic in a backend utility class that processes spreadsheet data from authenticated admin users (mass import). This is **not** processing untrusted external input in a web request context. - No new endpoints, no new user input vectors, no changes to authentication or authorization. ### Specific Checks - **ReDoS risk**: `SLASH_SEPARATOR = Pattern.compile("//")` is a literal pattern with no quantifiers — zero ReDoS risk. The existing `MULTI_SEPARATOR` pattern (`\\s+(?:und|u)\\s+`) is also safe — bounded alternation with no nested quantifiers. - **Recursion depth**: The recursive call `parseReceivers(segment)` is bounded by the number of `//`-segments in a single spreadsheet cell, which is naturally small. No stack overflow risk from realistic input. A pathological input with thousands of `//` separators would need to come from a malicious spreadsheet uploaded by an authenticated admin — extremely low risk. - **String replacement**: `cleaned.replace(".", ". ")` is a literal string replacement, not regex — no injection vector. - **No new dependencies**: Only `java.util.Arrays` was added as an import. ### No Security Concerns Both changes are confined to a trusted-input parser within the service layer. No new attack vectors introduced.
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Verdict: Approved

Assessment

This is a backend-only parsing change with no direct UI impact. No frontend components, routes, styles, or user-facing text are modified.

Indirect UX benefit: Once these names are correctly parsed, person lists and document metadata will show clean names (E. Rockstroh, Dr. Fr. Zarncke) instead of garbage entries with lastName = "?". This improves data quality throughout the UI without any frontend code changes.

The // separator and dot-compressed patterns are conventions within the import spreadsheet, not something end users interact with through the web interface.

No design concerns.

## 🎨 Leonie Voss — UI/UX Design Lead **Verdict: ✅ Approved** ### Assessment This is a backend-only parsing change with no direct UI impact. No frontend components, routes, styles, or user-facing text are modified. **Indirect UX benefit**: Once these names are correctly parsed, person lists and document metadata will show clean names (`E. Rockstroh`, `Dr. Fr. Zarncke`) instead of garbage entries with `lastName = "?"`. This improves data quality throughout the UI without any frontend code changes. The `//` separator and dot-compressed patterns are conventions within the import spreadsheet, not something end users interact with through the web interface. No design concerns.
Author
Owner

🛠️ Tobias Wendt — DevOps Engineer

Verdict: Approved

Infrastructure Impact

  • Pure Java code change — no new dependencies, no config changes, no Docker, CI, or deployment modifications needed.
  • No new Maven dependencies added (java.util.Arrays is JDK standard library).
  • The verification command (./mvnw test -Dtest=PersonNameParserTest) runs fast unit tests only — no Testcontainers or external services required. Zero CI time impact.
  • Both parts touch the same file and test file across 3 atomic commits — clean for review and bisect.

No concerns from my angle.

## 🛠️ Tobias Wendt — DevOps Engineer **Verdict: ✅ Approved** ### Infrastructure Impact - Pure Java code change — no new dependencies, no config changes, no Docker, CI, or deployment modifications needed. - No new Maven dependencies added (`java.util.Arrays` is JDK standard library). - The verification command (`./mvnw test -Dtest=PersonNameParserTest`) runs fast unit tests only — no Testcontainers or external services required. Zero CI time impact. - Both parts touch the same file and test file across 3 atomic commits — clean for review and bisect. No concerns from my angle.
marcel merged commit d6e74972eb into main 2026-04-08 18:46:53 +02:00
marcel deleted branch feat/issue-190-slash-separator-dot-compressed 2026-04-08 18:46:58 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#208