feat(massimport): handle dot-compressed names and titles in PersonNameParser.split() #184

Closed
opened 2026-04-06 14:22:45 +02:00 by marcel · 6 comments
Owner

Problem

The mass import spreadsheet contains abbreviated names with no spaces — dots act as separators between initials, titles, and last names. The current split() fallback only handles space-separated names, so these all produce lastName = "?":

| Raw input | Current result | Expected result |
|---|---|---|
| `E.Rockstroh` | `("E.Rockstroh", "?")` | `("E.", "Rockstroh")` |
| `E.M.` | `("E.M.", "?")` | `("E.", "M.")` |
| `Dr.Fr.Zarncke` | `("Dr.Fr.Zarncke", "?")` | `("Dr. Fr.", "Zarncke")` |
| `Dr.Zarnke` | `("Dr.Zarnke", "?")` | `("Dr.", "Zarnke")` |

Root Cause

split() (line 107, PersonNameParser.java) checks known last names first, then falls back to the last space. Names with no spaces bypass both paths and fall through to SplitName(cleaned, "?").

Solution

After the geb. stripping step in split(), add a dot-normalization step that applies only when the cleaned name has no spaces but contains dots:

```java
// Normalize dot-compressed names: "Dr.Fr.Zarncke" → "Dr. Fr. Zarncke"
if (!cleaned.contains(" ") && cleaned.contains(".")) {
    cleaned = cleaned.replace(".", ". ").trim();
}
```

Then the existing known-last-name check and last-space fallback handle the rest:

| Input | After normalization | Known last name? | Last-space fallback |
|---|---|---|---|
| `E.Rockstroh` | `E. Rockstroh` | no | `("E.", "Rockstroh")` ✓ |
| `E.M.` | `E. M.` | no | `("E.", "M.")` ✓ |
| `Dr.Fr.Zarncke` | `Dr. Fr. Zarncke` | no | `("Dr. Fr.", "Zarncke")` ✓ |
| `Dr.Zarnke` | `Dr. Zarnke` | no | `("Dr.", "Zarnke")` ✓ |

No changes needed to parseReceivers() — it already passes dot-compressed tokens through as single elements; split() is called downstream in PersonService.findOrCreateByAlias().
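As a rough check of the table above, the normalization step and the last-space fallback can be sketched in isolation. This is a simplified stand-in, not the real `split()`: the class name, the `SplitName` component names, and the fallback details are assumptions, and the known-last-name lookup and `geb.` stripping are left out.

```java
class DotNormalizationSketch {

    // Stand-in for the parser's result type; the component names are assumed.
    record SplitName(String firstName, String lastName) {}

    // Proposed step: only names that contain dots and no spaces are rewritten.
    static String normalizeDots(String cleaned) {
        if (!cleaned.contains(" ") && cleaned.contains(".")) {
            return cleaned.replace(".", ". ").trim();
        }
        return cleaned;
    }

    // Simplified split(): normalization followed by the last-space fallback only.
    static SplitName split(String raw) {
        String cleaned = normalizeDots(raw.trim());
        int lastSpace = cleaned.lastIndexOf(' ');
        if (lastSpace < 0) {
            return new SplitName(cleaned, "?"); // no space: unknown last name
        }
        return new SplitName(cleaned.substring(0, lastSpace).trim(),
                             cleaned.substring(lastSpace + 1));
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"E.Rockstroh", "E.M.", "Dr.Fr.Zarncke", "Dr.Zarnke"}) {
            System.out.println(raw + " -> " + split(raw));
        }
    }
}
```

Under these assumptions all four inputs produce the expected pairs from the table, which suggests the existing fallback needs no change once the input is normalized.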

Files

| File | Change |
|---|---|
| `backend/src/main/java/.../service/PersonNameParser.java` | Add dot-normalization step in `split()` (3 lines) |
| `backend/src/test/java/.../service/PersonNameParserTest.java` | 4 new `split_*` tests + 1 `parseReceivers` passthrough test (TDD, red first) |

No schema, API, or i18n changes needed.

New Tests

```java
split_dotCompressed_initialAndLastName()       // E.Rockstroh → ("E.", "Rockstroh")
split_dotCompressed_twoInitials()              // E.M.        → ("E.", "M.")
split_dotCompressed_titleFirstNameLastName()   // Dr.Fr.Zarncke → ("Dr. Fr.", "Zarncke")
split_dotCompressed_titleAndLastName()         // Dr.Zarnke   → ("Dr.", "Zarnke")
parseReceivers_dotCompressedName_passthrough() // Dr.Fr.Zarncke → ["Dr.Fr.Zarncke"]
```

Verification

```bash
cd backend && ./mvnw test -Dtest=PersonNameParserTest
```
marcel added the feature label 2026-04-06 14:22:50 +02:00
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

  • The 3-line fix (cleaned.replace(".", ". ").trim()) is beautifully minimal — it normalizes the input so the existing last-space fallback does all the work. No new branching logic, no new code paths. KISS at its best.
  • The guard condition !cleaned.contains(" ") && cleaned.contains(".") is tight — it only fires when there are no spaces but there are dots. This avoids interfering with already-spaced names that happen to contain dots (like Dr. Zarncke).
  • The 5 proposed test names are well-structured. The parseReceivers_dotCompressedName_passthrough test is important — it confirms that parseReceivers treats Dr.Fr.Zarncke as a single token and doesn't try to split it at the und/u level.
  • One edge case I'd want to think through: what about a name whose only dot is the trailing one, like M.? After replace(".", ". ") it becomes "M. " (with a trailing space), which trim() reduces back to "M.". The existing fallback would then see no space and fall through to SplitName("M.", "?"). Is that the expected behavior, or should M. be handled differently?

Suggestions

  • Consider adding a test for a single-initial name like M. to document whether the current behavior (falls through to ? last name) is intentional or should be handled as a special case.
  • The placement "after the geb. stripping step" is important — if geb. stripping runs first, a name like geb.Rockstroh would become Rockstroh before the dot-normalization step, which is correct. Confirm this ordering in the implementation.
  • The replace(".", ". ") approach is a String method, not regex — that's the right choice here. Simple, readable, no regex overhead.
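The single-initial case raised above can be traced step by step. This is a minimal sketch: `SingleInitialTrace` and `trace()` are illustrative names, not part of `PersonNameParser`, and only the guard and replacement from the issue are reproduced.

```java
class SingleInitialTrace {

    // Applies the proposed guard and returns {after replace, after trim}.
    static String[] trace(String cleaned) {
        if (!cleaned.contains(" ") && cleaned.contains(".")) {
            String spaced = cleaned.replace(".", ". ");
            return new String[] {spaced, spaced.trim()};
        }
        return new String[] {cleaned, cleaned};
    }

    public static void main(String[] args) {
        String[] steps = trace("M.");
        // Brackets make the trailing space visible.
        System.out.println("[" + steps[0] + "] -> [" + steps[1] + "]"); // [M. ] -> [M.]
        System.out.println("space left for fallback: " + steps[1].contains(" ")); // false
    }
}
```

Because no space survives the trim, a last-space fallback has nothing to split on, so `("M.", "?")` would be the result unless the parser treats this case specially.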
Author
Owner

🏗️ Markus Keller — Application Architect

Questions & Observations

  • This is another well-scoped parser enhancement — 3 lines of normalization inside an existing method, no new classes, no schema changes, no API surface changes. The right level of intervention.
  • The approach of normalizing input to match what the existing logic already handles is architecturally sound. Rather than adding a parallel code path for dot-compressed names, the normalization makes them look like regular spaced names. This keeps the method's branching complexity flat.
  • No cross-domain impact. split() is a utility method within PersonNameParser, called downstream from PersonService.findOrCreateByAlias(). The change is invisible to callers.

Suggestions

  • The issue mentions this fix applies specifically in split(), not in parseReceivers(). This is an important distinction — parseReceivers() treats Dr.Fr.Zarncke as a single token (confirmed by the passthrough test). The normalization only happens when split() is asked to decompose a name into first/last. This layering is correct — just flagging it for the implementer to be precise about placement.
  • One architectural observation: PersonNameParser is accumulating a fair amount of normalization logic (geb. stripping, known last names, now dot-normalization). If more normalization steps are added in the future, consider whether a small pipeline of normalization steps (each a method) would be clearer than a growing list of if-checks in split(). Not needed now — just a watch point.
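If the watch point above ever becomes relevant, the pipeline idea could look roughly like this. Everything here is hypothetical: the class and method names are invented for illustration, and `stripGeb` in particular is only a placeholder, since the real `geb.` stripping rule is not shown in this issue.

```java
import java.util.List;
import java.util.function.UnaryOperator;

class NormalizationPipelineSketch {

    // Placeholder rule (assumption): drop everything from " geb." onward.
    static String stripGeb(String name) {
        int idx = name.indexOf(" geb.");
        return idx >= 0 ? name.substring(0, idx).trim() : name;
    }

    // The dot-normalization step proposed in this issue.
    static String spaceOutDots(String name) {
        if (!name.contains(" ") && name.contains(".")) {
            return name.replace(".", ". ").trim();
        }
        return name;
    }

    // Order matters: geb. stripping runs before dot normalization.
    static final List<UnaryOperator<String>> STEPS = List.of(
            String::trim,
            NormalizationPipelineSketch::stripGeb,
            NormalizationPipelineSketch::spaceOutDots);

    static String normalize(String raw) {
        String current = raw;
        for (UnaryOperator<String> step : STEPS) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Dr.Fr.Zarncke ")); // Dr. Fr. Zarncke
    }
}
```

Each step stays individually testable, and the ordering is visible in one place instead of being implied by the sequence of if-checks in `split()`.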
Author
Owner

🧪 Sara Holt — QA Engineer

Questions & Observations

  • The 5 proposed tests cover the four split() scenarios from the issue's table plus a parseReceivers passthrough test. Good layered coverage — unit tests for the low-level split() method and a higher-level test confirming the integration with parseReceivers.
  • Edge cases I'd want tested beyond the proposed set:
    • Name with trailing dot only: Rockstroh. After replace(".", ". ") this becomes "Rockstroh. " (with a trailing space), which trim() reduces back to "Rockstroh.". How does the split fallback handle this?
    • Name with dots AND spaces: Dr. Fr. Zarncke (already properly spaced) — the guard !cleaned.contains(" ") should prevent normalization. A regression test confirming this would be valuable.
    • Single character with dot: M. — as Felix noted, this might produce ("M.", "?"). Document the behavior with a test.
    • Multiple consecutive dots: E..Rockstroh — after replace(".", ". ") becomes E. . Rockstroh. Does the last-space fallback handle the extra spaces?

Suggestions

  • I'd prioritize the regression test for already-spaced names (Dr. Fr. Zarncke) — this confirms the guard clause works and that the normalization doesn't double-space names that are already correct.
  • The parseReceivers_dotCompressedName_passthrough test is valuable for confirming layer separation — it proves that parseReceivers doesn't try to split on dots, only split() does. Keep this test.
  • Running the full PersonNameParserTest class (not just the new tests) after implementation is essential — the normalization step could theoretically affect existing test cases if the guard condition has an edge case.
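The edge cases listed above can be explored with a simplified model before writing the real tests. This sketch assumes the fallback splits at the last space and returns `"?"` when there is none; the class name and the `SplitName` component names are illustrative, and the actual `split()` may behave differently (for example through the known-last-name check).

```java
class EdgeCaseSketch {

    // Stand-in result type; component names are assumed.
    record SplitName(String firstName, String lastName) {}

    // Simplified model: proposed normalization, then a last-space fallback.
    static SplitName split(String raw) {
        String cleaned = raw.trim();
        if (!cleaned.contains(" ") && cleaned.contains(".")) {
            cleaned = cleaned.replace(".", ". ").trim();
        }
        int lastSpace = cleaned.lastIndexOf(' ');
        if (lastSpace < 0) {
            return new SplitName(cleaned, "?");
        }
        return new SplitName(cleaned.substring(0, lastSpace).trim(),
                             cleaned.substring(lastSpace + 1));
    }

    public static void main(String[] args) {
        String[] inputs = {"Rockstroh.", "Dr. Fr. Zarncke", "M.", "E..Rockstroh"};
        for (String raw : inputs) {
            System.out.println(raw + " -> " + split(raw));
        }
    }
}
```

In this model `Rockstroh.` and `M.` keep `"?"` as last name, the already-spaced name is untouched by the guard, and `E..Rockstroh` yields a first name of `E. .`, which is probably worth pinning down with a test either way.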
Author
Owner

🔒 Nora "NullX" Steiner — Security Engineer

Questions & Observations

  • Same as #182: this is a pure parsing change in a backend utility class processing trusted spreadsheet data from authenticated admin users. No new attack surface.
  • The String.replace(".", ". ") call is a literal string replacement, not regex — no risk of ReDoS or regex injection.
  • No new endpoints, no user input vectors, no auth changes.

Suggestions

  • No security concerns from my angle. The change is a 3-line string normalization step within a trusted-input parser. No new attack vectors introduced.
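To illustrate the literal-versus-regex point, the following sketch contrasts `String.replace` with `String.replaceAll`. The class and method names are illustrative only; the `java.lang.String` and `java.util.regex.Pattern` calls are standard.

```java
import java.util.regex.Pattern;

class LiteralReplaceSketch {

    // Literal replacement, as proposed in the issue: "." is just a character.
    static String literal(String s) {
        return s.replace(".", ". ").trim();
    }

    // Regex variant, shown only for contrast: an unescaped "." matches every character.
    static String regexUnescaped(String s) {
        return s.replaceAll(".", ". ").trim();
    }

    // A regex version would need the dot quoted to behave like the literal one.
    static String regexQuoted(String s) {
        return s.replaceAll(Pattern.quote("."), ". ").trim();
    }

    public static void main(String[] args) {
        System.out.println(literal("Dr.Zarnke"));     // Dr. Zarnke
        System.out.println(regexUnescaped("E.M."));   // . . . .
        System.out.println(regexQuoted("Dr.Zarnke")); // Dr. Zarnke
    }
}
```

The literal form avoids the escaping pitfall entirely, which is one more reason to keep `replace` rather than switching to a regex later.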
Author
Owner

🎨 Leonie Voss — UI/UX Design Lead

Questions & Observations

  • Backend-only parsing change with no UI impact. The dot-compressed names are a data quality issue in the import spreadsheet, not something users see in the web interface.
  • Once these names are correctly parsed, they'll display properly in person lists and document metadata — so this indirectly improves the user experience by showing clean names instead of garbage entries with lastName = "?".

Suggestions

  • No design concerns. The improvement is invisible to users except through better data quality in the person records — which is a good thing.
Author
Owner

🛠️ Tobias Wendt — DevOps Engineer

Questions & Observations

  • Pure Java code change, no infrastructure impact. No new dependencies, no config changes, no Docker or CI modifications needed.
  • Fast unit tests only: no Testcontainers or external services required, so the added CI time should be negligible.

Suggestions

  • No concerns from my angle. This is a self-contained parser fix that doesn't touch infrastructure, dependencies, or deployment configuration.
Reference: marcel/familienarchiv#184