Added @JsonIgnoreProperties(ignoreUnknown = true) to OcrBlockResult so
new fields from the Python OCR service don't crash the Java parser,
while keeping FAIL_ON_UNKNOWN_PROPERTIES strict globally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrAsyncRunner was bypassing TranscriptionService — building blocks
directly and calling blockRepository.save(), skipping sanitizeText()
and saveVersion(). Also replaced N individual deleteBlock() calls with
a single bulk deleteAllBlocksByDocument() for OCR re-runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrService.startOcr() was setting scriptType on a detached entity,
silently losing the mutation. Added DocumentService.updateScriptType()
with @Transactional to persist the change properly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OcrController was injecting OcrJobRepository and OcrJobDocumentRepository
directly, violating the Controller → Service → Repository layering rule.
Moved getJob() and getDocumentOcrStatus() logic into OcrService.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the single extractBlocks() call with streamBlocks() that
processes pages incrementally. Each page's blocks are persisted
immediately via createSingleBlock(). Progress updates use the
ANALYZING_PAGE:current:total:blocks format. Per-page errors are
logged at WARN level without failing the entire job. The batch path
(processDocument) remains on the old extractBlocks() path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add streamBlocks() that POSTs to /ocr/stream and parses the NDJSON
response line by line with a dedicated ObjectMapper. Falls back to
the old /ocr endpoint via the default method when /ocr/stream returns
404. Uses a separate HttpClient with 5-minute request timeout for
streaming.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default method synthesizes Start/Page/Done events from
extractBlocks() results, providing backward compatibility for
implementations that don't support streaming natively.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defines the event types for NDJSON streaming OCR. Uses Java 21 sealed
interface with record subtypes for exhaustive pattern matching in the
consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-document OCR now creates an OcrJobDocument row so
GET /api/documents/{id}/ocr-status can find running jobs.
OcrAsyncRunner updates the job document status (RUNNING → DONE/FAILED).
Frontend checks OCR status when entering transcription mode —
if a job is running, resumes polling and shows the spinner.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.
Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies
Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.
816 backend tests pass, 687 frontend tests pass.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.
- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add "amt" and "schule" suffixes to INSTITUTION_END in PersonTypeClassifier
so German government offices and schools are auto-classified on import.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Testcontainers test verifying: SKIP returns null with no DB record,
INSTITUTION/GROUP store full name in lastName with null firstName
and correct personType, PERSON splits name normally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove Architekt from WORD_PREFIXES (classifier handles it)
- Use Objects.equals for null-safe firstName/lastName comparison
- Remove unused trimmed variable in PersonTypeClassifier
- Fix containsWord to loop through all occurrences (finds
"Eltern" in "Nachbareltern Eltern")
- Extract DisplayNameFormatter utility shared by Person and
PersonSummaryDTO to eliminate display logic duplication
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-pass title stripping with loop for stacked titles:
- Dot-prefixes (Dr., Prof.) matched without trailing space
- Word-prefixes (Tante, Frau, Schwester, etc.) matched at
word boundary
- Stacked titles like "Prof. Dr. Muller" handled correctly
- Single token after title strip goes to lastName (not firstName)
Add 5 "von" last names to KNOWN_LAST_NAMES for correct splitting
of entries like "Freifrau von Massenbach".
15 new test cases + updated 3 existing tests for title behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classify raw name before processing. SKIP returns null (no Person
created). INSTITUTION/GROUP skip split() and store full name in
lastName with firstName=null and appropriate personType.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move paren extraction in parseReceivers() after the multi-separator
check so single-person entries like "Clara de Gruyter(*1871)" keep
their parens intact for split()'s annotation extraction. Multi-person
entries like "Hedi und Tutu (Gruber)" still use parens as shared
last-name override.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract trailing (...) content as annotation. Handles birth years
(*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain
names (Quast ?) where the name part is extracted back into the
cleaned result. Uses [^)]* regex to prevent ReDoS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When split() returns a non-null maidenName, PersonService now
creates a PersonNameAlias with type MAIDEN_NAME. The maiden name
is stored as lastName on the alias (no firstName).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verify comma-prefix, no-dot, and multi-word maiden name variants
are correctly stripped in parseReceivers().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Widen pattern from `\s+geb\.\s+\S+` to `,?\s*geb\.?\s+(.+)$` to
handle: optional comma, optional dot, multi-word maiden names.
stripMaidenName() now captures the maiden name instead of discarding
it. Handles all 5 input variants from the ODS data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add displayName default method to PersonSummaryDTO
- Update native SQL queries to include title, person_type columns
- Add getInitials() utility to personFormat.ts
- Update abbreviateName/abbreviateCompact for nullable firstName
- Replace firstName+lastName concatenation with displayName in all
person-displaying components and server load files
- Regenerate API types with displayName on Person and PersonSummaryDTO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add title (nullable VARCHAR) and personType (enum, default PERSON)
- Make firstName nullable for non-person entities
- Add @Transient getDisplayName() as single source of truth for
name display, exposed via @Schema(READ_ONLY, REQUIRED)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PersonType has 5 values: PERSON, INSTITUTION, GROUP, UNKNOWN, SKIP.
SKIP is intentionally excluded from the DB CHECK constraint (added
in migration) as defense-in-depth. MAIDEN_NAME added to
PersonNameAliasType for #209.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract stripMaidenName, normalizeDotCompressed, stripAnnotation,
stripTitle, and splitByKnownLastNameOrFallback as individually
testable pipeline steps. Each extraction method is a pass-through
until its feature issue fills in the logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add title, maidenName, and annotation fields (all nullable) to
SplitName. All existing call sites pass null for new fields. Test
assertions updated to document the null-by-default contract.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Regression test confirms already-spaced dot names are not double-spaced.
Interaction test confirms // separator works with dot-compressed names.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inserts spaces after dots when the cleaned name has no spaces but
contains dots, so the existing last-space fallback handles names
like "E.Rockstroh" and "Dr.Fr.Zarncke" correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-splits input on "//" before existing logic so each segment is
processed independently through the full pipeline (und/u splitting,
last-name distribution, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds @NotBlank @Size(max=255) on lastName, @NotNull on type,
@Valid on controller parameter. Blank/null input now returns
400 instead of reaching the DB constraint. 2 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sender alias LEFT JOIN and receiver alias EXISTS subquery to
DocumentSpecifications.hasText(). Uses entity-graph navigation via
Person.nameAliases (@OneToMany) to avoid a separate DB roundtrip
while respecting domain boundaries. 2 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds LEFT JOIN to person_name_aliases in both searchByName (JPQL)
and searchWithDocumentCount (native SQL). Uses DISTINCT/GROUP BY
to prevent duplicate results. 4 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET returns aliases (no permission required), POST requires
WRITE_ALL, DELETE requires WRITE_ALL. 5 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getAliases (sorted by sort_order), addAlias (auto-incrementing
sort_order), removeAlias (with IDOR protection verifying alias
belongs to the given person). All TDD with 7 new unit tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A sender with lastName=null produced sort key "null Bob" which sorted
before names starting with lowercase letters (n < s, t, u, v...).
Now returns "" for null lastName, which the comparator places at end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously any value other than ASC/DESC silently defaulted to
DESC with no feedback. Now returns 400 Bad Request.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentSort is a query parameter enum, not a JPA entity.
Placing it in model/ violated the layer boundary — model/ should
contain only domain entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DocumentSort enum validated by Spring MVC (400 for unknown values)
- SENDER sort uses Spring Data Sort on sender.lastName/firstName
- RECEIVER sort uses in-memory sort by first receiver alphabetically
- UPLOAD_DATE sort uses createdAt; default sort is DATE DESC
- tagQ param wired to hasTagPartial specification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- hasText now JOINs sender (LEFT JOIN) and uses EXISTS subqueries for
receivers and tags to avoid duplicate rows
- hasTagPartial added for live debounced tag text filter (ILIKE partial match)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RED/GREEN for CommentService:
- getCommentsForBlock(blockId): returns root comments filtered by blockId
- postBlockComment(documentId, blockId, content, mentions, author): creates
comment with block_id set
RED/GREEN for CommentController:
- GET /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST .../comments/{commentId}/replies (reuses existing replyToComment)
4 new tests: 2 service unit tests + 2 controller integration tests
All 25 CommentServiceTest + 24 CommentControllerTest green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>