- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON)
to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar.
The PolygonConverter is retained as a standalone utility but no longer
used on the entity. Hibernate 6 natively handles List<List<Double>>
serialization to JSONB.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI
Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback
Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health
Refs #226, #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add "amt" and "schule" suffixes to INSTITUTION_END in PersonTypeClassifier
so German government offices and schools are auto-classified on import.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add title to PersonUpdateDTO with @Size(max=50) constraint.
PersonService.createPerson and updatePerson now handle the title
field with blank-to-null normalization.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Testcontainers test verifying: SKIP returns null with no DB record,
INSTITUTION/GROUP store full name in lastName with null firstName
and correct personType, PERSON splits name normally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove Architekt from WORD_PREFIXES (classifier handles it)
- Use Objects.equals for null-safe firstName/lastName comparison
- Remove unused trimmed variable in PersonTypeClassifier
- Fix containsWord to loop through all occurrences (finds
"Eltern" in "Nachbareltern Eltern")
- Extract DisplayNameFormatter utility shared by Person and
PersonSummaryDTO to eliminate display logic duplication
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add @Nullable annotation to findOrCreateByAlias() return type.
Filter null results (from SKIP classification) in MassImportService
receiver list to prevent null elements in the receivers collection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-pass title stripping with loop for stacked titles:
- Dot-prefixes (Dr., Prof.) matched without trailing space
- Word-prefixes (Tante, Frau, Schwester, etc.) matched at
word boundary
- Stacked titles like "Prof. Dr. Muller" handled correctly
- Single token after title strip goes to lastName (not firstName)
Add 5 "von" last names to KNOWN_LAST_NAMES for correct splitting
of entries like "Freifrau von Massenbach".
15 new test cases + updated 3 existing tests for title behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classify raw name before processing. SKIP returns null (no Person
created). INSTITUTION/GROUP skip split() and store full name in
lastName with firstName=null and appropriate personType.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move paren extraction in parseReceivers() after the multi-separator
check so single-person entries like "Clara de Gruyter(*1871)" keep
their parens intact for split()'s annotation extraction. Multi-person
entries like "Hedi und Tutu (Gruber)" still use parens as shared
last-name override.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract trailing (...) content as annotation. Handles birth years
(*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain
names (Quast ?) where the name part is extracted back into the
cleaned result. Uses [^)]* regex to prevent ReDoS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When split() returns a non-null maidenName, PersonService now
creates a PersonNameAlias with type MAIDEN_NAME. The maiden name
is stored as lastName on the alias (no firstName).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verify comma-prefix, no-dot, and multi-word maiden name variants
are correctly stripped in parseReceivers().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Widen pattern from `\s+geb\.\s+\S+` to `,?\s*geb\.?\s+(.+)$` to
handle: optional comma, optional dot, multi-word maiden names.
stripMaidenName() now captures the maiden name instead of discarding
it. Handles all 5 input variants from the ODS data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add displayName default method to PersonSummaryDTO
- Update native SQL queries to include title, person_type columns
- Add getInitials() utility to personFormat.ts
- Update abbreviateName/abbreviateCompact for nullable firstName
- Replace firstName+lastName concatenation with displayName in all
person-displaying components and server load files
- Regenerate API types with displayName on Person and PersonSummaryDTO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add title VARCHAR(50) column
- Add person_type VARCHAR(20) NOT NULL DEFAULT 'PERSON' with CHECK
constraint (PERSON, INSTITUTION, GROUP, UNKNOWN — SKIP excluded)
- Drop NOT NULL on first_name for non-person entities
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add title (nullable VARCHAR) and personType (enum, default PERSON)
- Make firstName nullable for non-person entities
- Add @Transient getDisplayName() as single source of truth for
name display, exposed via @Schema(READ_ONLY, REQUIRED)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PersonType has 5 values: PERSON, INSTITUTION, GROUP, UNKNOWN, SKIP.
SKIP is intentionally excluded from the DB CHECK constraint (added
in migration) as defense-in-depth. MAIDEN_NAME added to
PersonNameAliasType for #209.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract stripMaidenName, normalizeDotCompressed, stripAnnotation,
stripTitle, and splitByKnownLastNameOrFallback as individually
testable pipeline steps. Each extraction method is a pass-through
until its feature issue fills in the logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add title, maidenName, and annotation fields (all nullable) to
SplitName. All existing call sites pass null for new fields. Test
assertions updated to document the null-by-default contract.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Regression test confirms already-spaced dot names are not double-spaced.
Interaction test confirms // separator works with dot-compressed names.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inserts spaces after dots when the cleaned name has no spaces but
contains dots, so the existing last-space fallback handles names
like "E.Rockstroh" and "Dr.Fr.Zarncke" correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-splits input on "//" before existing logic so each segment is
processed independently through the full pipeline (und/u splitting,
last-name distribution, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Jackson tried to serialize the lazy Person proxy when returning
alias list, causing a "no session" error. The back-reference is
only needed for JPA navigation, not for API responses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds @NotBlank @Size(max=255) on lastName, @NotNull on type,
@Valid on controller parameter. Blank/null input now returns
400 instead of reaching the DB constraint. 2 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sender alias LEFT JOIN and receiver alias EXISTS subquery to
DocumentSpecifications.hasText(). Uses entity-graph navigation via
Person.nameAliases (@OneToMany) to avoid a separate DB roundtrip
while respecting domain boundaries. 2 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds LEFT JOIN to person_name_aliases in both searchByName (JPQL)
and searchWithDocumentCount (native SQL). Uses DISTINCT/GROUP BY
to prevent duplicate results. 4 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET returns aliases (no permission required), POST requires
WRITE_ALL, DELETE requires WRITE_ALL. 5 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getAliases (sorted by sort_order), addAlias (auto-incrementing
sort_order), removeAlias (with IDOR protection verifying alias
belongs to the given person). All TDD with 7 new unit tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces the alias domain model: entity with @ManyToOne to Person,
@OneToMany on Person for JPA graph navigation, repository with
sort_order queries, input DTO, and ALIAS_NOT_FOUND error code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates the alias table for historical name changes (marriage,
widowhood, etc.) and adds GIN trigram indexes on both the new
alias table and the existing persons table for substring search.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TODO comment explaining why SENDER/RECEIVER sort is in-memory
(JPA INNER JOIN drops null-sender docs) and note that pagination
will require a DB COUNT query in DocumentSearchResult.of().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A sender with lastName=null produced sort key "null Bob" which sorted
before names starting with lowercase letters (n < s, t, u, v...).
Now returns "" for null lastName, which the comparator places at end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously any value other than ASC/DESC silently defaulted to
DESC with no feedback. Now returns 400 Bad Request.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SENDER and RECEIVER are handled by in-memory sort before resolveSort
is called, making those switch cases unreachable. Removed and added
a comment making the invariant explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentSort is a query parameter enum, not a JPA entity.
Placing it in model/ violated the layer boundary — model/ should
contain only domain entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DocumentSort enum validated by Spring MVC (400 for unknown values)
- SENDER sort uses Spring Data Sort on sender.lastName/firstName
- RECEIVER sort uses in-memory sort by first receiver alphabetically
- UPLOAD_DATE sort uses createdAt; default sort is DATE DESC
- tagQ param wired to hasTagPartial specification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- hasText now JOINs sender (LEFT JOIN) and uses EXISTS subqueries for
receivers and tags to avoid duplicate rows
- hasTagPartial added for live debounced tag text filter (ILIKE partial match)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>