Replace the single extractBlocks() call with streamBlocks(), which
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
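
A minimal sketch of the progress string described above. OcrProgress and its method names are hypothetical, and the log does not say whether the last field counts blocks per page or cumulatively; the sketch only formats and parses the four colon-separated fields.

```java
/** Hypothetical helper for the ANALYZING_PAGE:current:total:blocks progress string. */
final class OcrProgress {
    static final String PREFIX = "ANALYZING_PAGE";

    private OcrProgress() {}

    /** Builds a message such as ANALYZING_PAGE:3:12:47. */
    static String format(int currentPage, int totalPages, int blocks) {
        return PREFIX + ":" + currentPage + ":" + totalPages + ":" + blocks;
    }

    /** Parses a message back into {current, total, blocks}; rejects anything else. */
    static int[] parse(String message) {
        String[] parts = message.split(":");
        if (parts.length != 4 || !PREFIX.equals(parts[0])) {
            throw new IllegalArgumentException("Not a page progress message: " + message);
        }
        return new int[] {
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2]),
            Integer.parseInt(parts[3])
        };
    }
}
```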
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
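
A rough sketch of the line-by-line NDJSON handling mentioned above, kept to the standard library. The commit parses each line with an ObjectMapper; here the per-line parsing is left to a caller-supplied handler, and NdjsonLines is a hypothetical name. Skipping blank lines is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical helper: hands each NDJSON line to a handler as soon as it is read. */
final class NdjsonLines {
    private NdjsonLines() {}

    /** Returns the number of non-blank lines passed to the handler. */
    static int forEachLine(Reader body, Consumer<String> handler) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(body)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isBlank()) {
                    continue; // tolerate blank separator lines (assumption)
                }
                handler.accept(line); // one JSON document per line
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

In the real client the Reader would presumably wrap the streaming HTTP response body, so each page can be handled before the response has finished.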
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
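
One way the default method described above could look. The event fields, the OcrClient interface name, and the shape of the extractBlocks() result (one list of block texts per page) are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical event types; the real records may carry different fields. */
sealed interface OcrEvent {
    record Start(int totalPages) implements OcrEvent {}
    record Page(int pageNumber, List<String> blocks) implements OcrEvent {}
    record Done(int totalBlocks) implements OcrEvent {}
}

/** Hypothetical client interface with a batch call and a streaming default. */
interface OcrClient {
    /** Batch call: one list of block texts per page (assumed shape). */
    List<List<String>> extractBlocks(String documentUrl);

    /** Default streaming: replays the batch result as Start, Page..., Done. */
    default void streamBlocks(String documentUrl, Consumer<OcrEvent> consumer) {
        List<List<String>> pages = extractBlocks(documentUrl);
        consumer.accept(new OcrEvent.Start(pages.size()));
        int totalBlocks = 0;
        for (int i = 0; i < pages.size(); i++) {
            consumer.accept(new OcrEvent.Page(i + 1, pages.get(i)));
            totalBlocks += pages.get(i).size();
        }
        consumer.accept(new OcrEvent.Done(totalBlocks));
    }
}
```

An implementation that only knows the batch endpoint implements extractBlocks() and inherits streamBlocks() unchanged.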
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
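
A short sketch of why the sealed interface helps the consumer. The record fields and the OcrEventDescriber class are hypothetical; the point is that the switch needs no default branch and stops compiling if a new event subtype is added without a matching case. Requires Java 21.

```java
import java.util.List;

/** Hypothetical event hierarchy; permitted subtypes are the nested records. */
sealed interface OcrStreamEvent {
    record Start(int totalPages) implements OcrStreamEvent {}
    record Page(int pageNumber, List<String> blocks) implements OcrStreamEvent {}
    record Done(int totalBlocks) implements OcrStreamEvent {}
}

final class OcrEventDescriber {
    private OcrEventDescriber() {}

    /** Exhaustive switch over the sealed hierarchy, with no default branch. */
    static String describe(OcrStreamEvent event) {
        return switch (event) {
            case OcrStreamEvent.Start start -> "start: " + start.totalPages() + " pages";
            case OcrStreamEvent.Page page ->
                    "page " + page.pageNumber() + ": " + page.blocks().size() + " blocks";
            case OcrStreamEvent.Done done -> "done: " + done.totalBlocks() + " blocks";
        };
    }
}
```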
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).
The frontend checks OCR status when entering transcription mode;
if a job is running, it resumes polling and shows the spinner.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OcrService → OcrAsyncRunner dependency was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.
Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies
Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.
816 backend tests pass, 687 frontend tests pass.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.
- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add "amt" and "schule" suffixes to INSTITUTION_END in PersonTypeClassifier
so German government offices and schools are auto-classified on import.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Testcontainers test verifying: SKIP returns null with no DB record,
INSTITUTION/GROUP store full name in lastName with null firstName
and correct personType, PERSON splits name normally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove Architekt from WORD_PREFIXES (classifier handles it)
- Use Objects.equals for null-safe firstName/lastName comparison
- Remove unused trimmed variable in PersonTypeClassifier
- Fix containsWord to loop through all occurrences (finds
"Eltern" in "Nachbareltern Eltern")
- Extract DisplayNameFormatter utility shared by Person and
PersonSummaryDTO to eliminate display logic duplication
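
A sketch of the containsWord fix from the list above: keep scanning after a hit that sits inside a longer word. WordMatcher is a hypothetical name, and both the case-insensitive comparison and the letter-based word boundary are assumptions.

```java
import java.util.Locale;

final class WordMatcher {
    private WordMatcher() {}

    /** True if word occurs in text as a whole word (case-insensitive, by assumption). */
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        int from = 0;
        while (true) {
            int start = haystack.indexOf(needle, from);
            if (start < 0) {
                return false; // no further occurrences
            }
            int end = start + needle.length();
            boolean leftOk = start == 0 || !Character.isLetter(haystack.charAt(start - 1));
            boolean rightOk = end == haystack.length() || !Character.isLetter(haystack.charAt(end));
            if (leftOk && rightOk) {
                return true;
            }
            from = start + 1; // hit was inside a longer word: try the next occurrence
        }
    }
}
```

With "Nachbareltern Eltern" the first hit is rejected because it is preceded by a letter, and the second hit is accepted.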
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-pass title stripping with loop for stacked titles:
- Dot-prefixes (Dr., Prof.) matched without trailing space
- Word-prefixes (Tante, Frau, Schwester, etc.) matched at
word boundary
- Stacked titles like "Prof. Dr. Muller" handled correctly
- Single token after title strip goes to lastName (not firstName)
Add 5 "von" last names to KNOWN_LAST_NAMES for correct splitting
of entries like "Freifrau von Massenbach".
15 new test cases + updated 3 existing tests for title behavior.
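
A minimal sketch of the stacked-title loop, under assumptions: the prefix lists are small illustrative subsets, TitleStripper and Result are hypothetical names, and the lastName/firstName assignment that follows in the real pipeline is omitted.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleStripper {
    // Illustrative subsets only; the real lists are longer.
    private static final List<String> DOT_PREFIXES = List.of("Dr.", "Prof.");
    private static final List<String> WORD_PREFIXES = List.of("Tante", "Frau", "Schwester");

    /** title is null when nothing was stripped. */
    record Result(String title, String rest) {}

    private TitleStripper() {}

    static Result strip(String name) {
        List<String> titles = new ArrayList<>();
        String rest = name.trim();
        boolean stripped = true;
        while (stripped) { // repeat until a full pass removes nothing
            stripped = false;
            for (String prefix : DOT_PREFIXES) {
                // Dot-prefixes match without a trailing space ("Dr.Zarncke").
                if (rest.startsWith(prefix) && rest.length() > prefix.length()) {
                    titles.add(prefix);
                    rest = rest.substring(prefix.length()).trim();
                    stripped = true;
                }
            }
            for (String prefix : WORD_PREFIXES) {
                // Word-prefixes need a following space, so "Frauke" is left alone.
                if (rest.startsWith(prefix + " ")) {
                    titles.add(prefix);
                    rest = rest.substring(prefix.length()).trim();
                    stripped = true;
                }
            }
        }
        return new Result(titles.isEmpty() ? null : String.join(" ", titles), rest);
    }
}
```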
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classify raw name before processing. SKIP returns null (no Person
created). INSTITUTION/GROUP skip split() and store full name in
lastName with firstName=null and appropriate personType.
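
A sketch of the branching described above. The classification result is passed in rather than computed, since the classifier rules are not spelled out here; PersonDraft and PersonImport are hypothetical names, and the PERSON branch uses a plain last-space split as a stand-in for split().

```java
enum PersonType { PERSON, INSTITUTION, GROUP, UNKNOWN, SKIP }

/** Hypothetical carrier for the fields that would be written to the Person row. */
record PersonDraft(String firstName, String lastName, PersonType type) {}

final class PersonImport {
    private PersonImport() {}

    /** Returns null for SKIP, so no Person is created. */
    static PersonDraft toDraft(String rawName, PersonType type) {
        String name = rawName.trim();
        switch (type) {
            case SKIP -> {
                return null;
            }
            case INSTITUTION, GROUP -> {
                // Full name goes to lastName; firstName stays null.
                return new PersonDraft(null, name, type);
            }
            default -> {
                // Stand-in for split(): everything before the last space is the first name.
                int lastSpace = name.lastIndexOf(' ');
                if (lastSpace < 0) {
                    return new PersonDraft(null, name, type);
                }
                return new PersonDraft(
                        name.substring(0, lastSpace).trim(), name.substring(lastSpace + 1), type);
            }
        }
    }
}
```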
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move paren extraction in parseReceivers() after the multi-separator
check so single-person entries like "Clara de Gruyter(*1871)" keep
their parens intact for split()'s annotation extraction. Multi-person
entries like "Hedi und Tutu (Gruber)" still use parens as shared
last-name override.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract trailing (...) content as annotation. Handles birth years
(*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain
names (Quast ?) where the name part is extracted back into the
cleaned result. Uses [^)]* regex to prevent ReDoS.
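
A simplified sketch of the trailing-parenthesis extraction. AnnotationStripper and Result are hypothetical names, and the special handling of uncertain names such as (Quast ?) is left out. The negated class [^)]* is the one the commit names; it cannot run past a closing paren.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnnotationStripper {
    // Trailing "(...)" with optional whitespace before the end of input.
    private static final Pattern TRAILING_PARENS = Pattern.compile("\\(([^)]*)\\)\\s*$");

    /** annotation is null when the input has no trailing parentheses. */
    record Result(String name, String annotation) {}

    private AnnotationStripper() {}

    static Result strip(String input) {
        Matcher m = TRAILING_PARENS.matcher(input);
        if (!m.find()) {
            return new Result(input.trim(), null);
        }
        String name = input.substring(0, m.start()).trim();
        return new Result(name, m.group(1).trim());
    }
}
```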
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When split() returns a non-null maidenName, PersonService now
creates a PersonNameAlias with type MAIDEN_NAME. The maiden name
is stored as lastName on the alias (no firstName).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verify comma-prefix, no-dot, and multi-word maiden name variants
are correctly stripped in parseReceivers().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Widen pattern from `\s+geb\.\s+\S+` to `,?\s*geb\.?\s+(.+)$` to
handle: optional comma, optional dot, multi-word maiden names.
stripMaidenName() now captures the maiden name instead of discarding
it. Handles all 5 input variants from the ODS data.
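
A sketch of the widened pattern in use. The regex is the one quoted above; MaidenNameStripper, Result, and the sample inputs are made up for illustration and are not the ODS variants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MaidenNameStripper {
    // Optional comma, optional dot after "geb", then the rest of the input as maiden name.
    private static final Pattern MAIDEN = Pattern.compile(",?\\s*geb\\.?\\s+(.+)$");

    /** maidenName is null when no "geb" marker is found. */
    record Result(String name, String maidenName) {}

    private MaidenNameStripper() {}

    static Result strip(String input) {
        Matcher m = MAIDEN.matcher(input);
        if (!m.find()) {
            return new Result(input.trim(), null);
        }
        // Everything before the match is the name; group 1 is captured, not discarded.
        return new Result(input.substring(0, m.start()).trim(), m.group(1).trim());
    }
}
```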
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add displayName default method to PersonSummaryDTO
- Update native SQL queries to include title, person_type columns
- Add getInitials() utility to personFormat.ts
- Update abbreviateName/abbreviateCompact for nullable firstName
- Replace firstName+lastName concatenation with displayName in all
person-displaying components and server load files
- Regenerate API types with displayName on Person and PersonSummaryDTO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add title (nullable VARCHAR) and personType (enum, default PERSON)
- Make firstName nullable for non-person entities
- Add @Transient getDisplayName() as single source of truth for
name display, exposed via @Schema(READ_ONLY, REQUIRED)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PersonType has 5 values: PERSON, INSTITUTION, GROUP, UNKNOWN, SKIP.
SKIP is intentionally excluded from the DB CHECK constraint (added
in migration) as defense-in-depth. MAIDEN_NAME added to
PersonNameAliasType for #209.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract stripMaidenName, normalizeDotCompressed, stripAnnotation,
stripTitle, and splitByKnownLastNameOrFallback as individually
testable pipeline steps. Each extraction method is a pass-through
until its feature issue fills in the logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add title, maidenName, and annotation fields (all nullable) to
SplitName. All existing call sites pass null for new fields. Test
assertions updated to document the null-by-default contract.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Regression test confirms already-spaced dot names are not double-spaced.
Interaction test confirms // separator works with dot-compressed names.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inserts spaces after dots when the cleaned name has no spaces but
contains dots, so the existing last-space fallback handles names
like "E.Rockstroh" and "Dr.Fr.Zarncke" correctly.
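
A minimal sketch of the normalization rule, assuming a single regex replacement is acceptable; DotNames is a hypothetical class name.

```java
final class DotNames {
    private DotNames() {}

    /** Adds a space after each inner dot, but only for names that have dots and no spaces. */
    static String normalizeDotCompressed(String name) {
        if (name.contains(" ") || !name.contains(".")) {
            return name; // already spaced, or nothing to expand
        }
        // A dot followed by another character gets a space; a trailing dot is left as-is.
        return name.replaceAll("\\.(?=\\S)", ". ");
    }
}
```

Because names that already contain a space are returned unchanged, an already-spaced name is never double-spaced.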
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-splits input on "//" before existing logic so each segment is
processed independently through the full pipeline (und/u splitting,
last-name distribution, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds @NotBlank @Size(max=255) on lastName, @NotNull on type,
@Valid on controller parameter. Blank/null input now returns
400 instead of reaching the DB constraint. 2 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sender alias LEFT JOIN and receiver alias EXISTS subquery to
DocumentSpecifications.hasText(). Uses entity-graph navigation via
Person.nameAliases (@OneToMany) to avoid a separate DB roundtrip
while respecting domain boundaries. 2 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds LEFT JOIN to person_name_aliases in both searchByName (JPQL)
and searchWithDocumentCount (native SQL). Uses DISTINCT/GROUP BY
to prevent duplicate results. 4 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET returns aliases (no permission required), POST requires
WRITE_ALL, DELETE requires WRITE_ALL. 5 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getAliases (sorted by sort_order), addAlias (auto-incrementing
sort_order), removeAlias (with IDOR protection verifying alias
belongs to the given person). All TDD with 7 new unit tests.
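
An in-memory sketch of the three operations, standing in for the real repository-backed service. NameAlias and AliasStore are hypothetical, starting sort_order at 0 is an assumption, and the exception type is chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical in-memory stand-in for an alias row. */
record NameAlias(long id, long personId, String lastName, int sortOrder) {}

final class AliasStore {
    private final List<NameAlias> aliases = new ArrayList<>();
    private long nextId = 1;

    /** Aliases of one person, ordered by sort_order. */
    List<NameAlias> getAliases(long personId) {
        return aliases.stream()
                .filter(a -> a.personId() == personId)
                .sorted(Comparator.comparingInt(NameAlias::sortOrder))
                .toList();
    }

    /** New alias gets a sort_order one past the person's current maximum. */
    NameAlias addAlias(long personId, String lastName) {
        int next = aliases.stream()
                .filter(a -> a.personId() == personId)
                .mapToInt(NameAlias::sortOrder)
                .max()
                .orElse(-1) + 1;
        NameAlias alias = new NameAlias(nextId++, personId, lastName, next);
        aliases.add(alias);
        return alias;
    }

    /** Matches on both ids, so an alias of another person is never deleted. */
    void removeAlias(long personId, long aliasId) {
        boolean removed = aliases.removeIf(a -> a.id() == aliasId && a.personId() == personId);
        if (!removed) {
            throw new NoSuchElementException(
                    "Alias " + aliasId + " not found for person " + personId);
        }
    }
}
```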
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A sender with lastName=null produced sort key "null Bob" which sorted
before names starting with lowercase letters (n < s, t, u, v...).
Now returns "" for null lastName, which the comparator places at end.
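
A sketch of the fix, assuming the sort key is "lastName firstName" as the "null Bob" example suggests. SenderSort is a hypothetical name, and the case-insensitive ordering is an assumption; the real comparator may differ.

```java
import java.util.Comparator;

final class SenderSort {
    private SenderSort() {}

    /** Returns "" instead of a key starting with the literal text "null". */
    static String sortKey(String lastName, String firstName) {
        if (lastName == null) {
            return "";
        }
        return firstName == null ? lastName : lastName + " " + firstName;
    }

    /** Non-empty keys first (alphabetical, case-insensitive), empty keys at the end. */
    static final Comparator<String> EMPTY_LAST =
            Comparator.comparing(String::isEmpty)
                    .thenComparing(String.CASE_INSENSITIVE_ORDER);
}
```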
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously any value other than ASC/DESC silently defaulted to
DESC with no feedback. Now returns 400 Bad Request.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentSort is a query parameter enum, not a JPA entity.
Placing it in model/ violated the layer boundary — model/ should
contain only domain entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DocumentSort enum validated by Spring MVC (400 for unknown values)
- SENDER sort uses Spring Data Sort on sender.lastName/firstName
- RECEIVER sort uses in-memory sort by first receiver alphabetically
- UPLOAD_DATE sort uses createdAt; default sort is DATE DESC
- tagQ param wired to hasTagPartial specification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- hasText now JOINs sender (LEFT JOIN) and uses EXISTS subqueries for
receivers and tags to avoid duplicate rows
- hasTagPartial added for live debounced tag text filter (ILIKE partial match)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RED/GREEN for CommentService:
- getCommentsForBlock(blockId): returns root comments filtered by blockId
- postBlockComment(documentId, blockId, content, mentions, author): creates
comment with block_id set
RED/GREEN for CommentController:
- GET /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST .../comments/{commentId}/replies (reuses existing replyToComment)
4 new tests: 2 service unit tests + 2 controller integration tests
All 25 CommentServiceTest + 24 CommentControllerTest green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TranscriptionService.reorderBlocks() now returns void (command).
Controller calls listBlocks() separately after reorder (query).
Updated test to match new void signature.
Fixes @Felix: "reorderBlocks violates command-query separation"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Blockers (14):
- B1: fix senderName/receiverName to use $derived instead of $state + sync $effect
- B2: migrate all korrespondenz components from messages-extra shim to paraglide m.*
- B3: i18n CorrespondenzEmptyState (heading, subtext, search placeholder)
- B4: add response.ok checks to admin layout server load
- B5: add response.ok checks to korrespondenz page server load
- B6: add page.server.spec.ts with 5 test suites for korrespondenz load function
- B7: add axe-core accessibility checks to all e2e korrespondenz tests
- B8: add Testcontainers JPQL tests for findSinglePersonCorrespondence (DISTINCT + sender)
- B9: hide auth reset-token endpoint from OpenAPI spec; remove from generated api.ts
- B11: replace amber hardcoded hex colors in SinglePersonHintBar with brand tokens
- B12: replace clipboard emoji with Heroicons SVG in SinglePersonHintBar
- B13: create DateInput component (German dd.mm.yyyy); use it in CorrespondenzFilterControls
- B14: add Paraglide compile step to CI workflow before lint/test
Suggestions (11):
- S1: make CorrespondentSuggestionsDropdown a pure display component; lift fetch to PersonBar
- S2: fix leftover messages-extra import in ConversationTimeline; use brand tokens for status dots
- S3: add intent comment to EntityNav openFlyout behavior
- S4: rename canManageGroups → canManagePermissions throughout admin
- S6: remove domFlush helper from DateInput spec; use expect.poll instead
- S7: replace test.skip with throw new Error in bilateral e2e tests
- S8: add inverse aria-disabled test for filter strip
- S9: remove sm:min-h-0 from sort button to preserve 44px touch target
- S10: add title attributes to tablet trigger buttons in EntityNav
- S11: delete messages-extra.ts shim entirely
Also: fix admin pages revealing blank strip at bottom (-mb-6 on admin layout)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A query of only spaces previously fell through to findAllWithDocumentCount,
exposing the full person list. Whitespace-only queries now return empty.
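
A small sketch of the guard, using an in-memory list in place of the repository. PersonSearch is hypothetical, and treating a null or empty query as "list everyone" is an assumption about the surrounding behavior.

```java
import java.util.List;
import java.util.Locale;

final class PersonSearch {
    private final List<String> names;

    PersonSearch(List<String> names) {
        this.names = List.copyOf(names);
    }

    List<String> search(String query) {
        if (query == null || query.isEmpty()) {
            return names; // no query at all: full list (assumed)
        }
        if (query.isBlank()) {
            return List.of(); // whitespace-only: never fall through to the full list
        }
        String needle = query.trim().toLowerCase(Locale.ROOT);
        return names.stream()
                .filter(n -> n.toLowerCase(Locale.ROOT).contains(needle))
                .toList();
    }
}
```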
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>