Default RestClient timeout was 10 seconds — OCR on CPU takes minutes.
Set connect timeout to 10s, read timeout to 10 minutes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OcrService → OcrAsyncRunner was circular. Fixed by moving all OCR
processing logic (processDocument, clearExistingBlocks, createBlocks)
into OcrAsyncRunner. OcrService is now a thin entry point that
validates, creates the job, and dispatches to OcrAsyncRunner.
Architecture:
- OcrService: validates document, checks health, creates OcrJob, delegates
- OcrAsyncRunner: @Async processDocument + runSingleDocument + runBatch
- OcrBatchService: creates job + job documents, delegates to OcrAsyncRunner
- No circular dependencies
Single-document OCR is now async (returns jobId immediately).
Frontend polls GET /api/ocr/jobs/{jobId} every 3s until DONE/FAILED.
816 backend tests pass, 687 frontend tests pass.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OCR service was getting 403 Forbidden because it tried to
download PDFs from MinIO using plain internal URLs without
authentication. MinIO buckets are private.
- Add S3Presigner bean to MinioConfig
- FileService.generatePresignedUrl(): generates 15-min presigned URLs
- OcrService uses presigned URLs instead of plain internal URLs
- Remove unused s3InternalUrl / bucketName @Value fields from OcrService
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BlockSource enum: MANUAL, OCR
- V26 migration adds source + reviewed columns to transcription_blocks
- OcrService sets source=OCR when creating blocks
- TranscriptionService.reviewBlock() toggles the reviewed flag
- PUT /api/documents/{id}/transcription-blocks/{blockId}/review endpoint
- 5 new tests: reviewBlock toggle/untoggle/notfound, controller,
OcrService source=OCR verification
The reviewed flag enables the Kraken fine-tuning pipeline: only blocks
marked as reviewed by a human are exported as training data.
Refs #226
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace @Convert(PolygonConverter) with Hibernate native @JdbcTypeCode(SqlTypes.JSON)
to fix JDBC type mismatch — PostgreSQL requires jsonb type, not varchar.
The PolygonConverter is retained as a standalone utility but no longer
used on the entity. Hibernate 6 natively handles List<List<Double>>
serialization to JSONB.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python microservice (ocr-service/):
- FastAPI app with /ocr and /health endpoints
- Surya engine: transformer-based OCR for typewritten/modern handwriting
- Kraken engine: historical HTR for Kurrent/Suetterlin with
pure-Python polygon-to-quad approximation (gift wrapping + rotating calipers)
- Eager model loading at startup via lifespan context manager
- PDF download via httpx, page rendering via pypdfium2 at 300 DPI
Java RestClientOcrClient:
- Implements OcrClient + OcrHealthClient interfaces
- Calls Python service via Spring RestClient
- Health check with graceful fallback
Docker Compose:
- New ocr-service container (mem_limit 6g, no host ports)
- Health check with start_period 60s for model loading
- ocr_models volume for Kraken model files
- Backend depends on ocr-service health
Refs #226, #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OCR creates many adjacent text line annotations that would fail the
existing overlap check. createOcrAnnotation() accepts an optional
polygon and bypasses overlap detection entirely.
Refs #227
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add "amt" and "schule" suffixes to INSTITUTION_END in PersonTypeClassifier
so German government offices and schools are auto-classified on import.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add title to PersonUpdateDTO with @Size(max=50) constraint.
PersonService.createPerson and updatePerson now handle the title
field with blank-to-null normalization.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove Architekt from WORD_PREFIXES (classifier handles it)
- Use Objects.equals for null-safe firstName/lastName comparison
- Remove unused trimmed variable in PersonTypeClassifier
- Fix containsWord to loop through all occurrences (finds
"Eltern" in "Nachbareltern Eltern")
- Extract DisplayNameFormatter utility shared by Person and
PersonSummaryDTO to eliminate display logic duplication
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add @Nullable annotation to findOrCreateByAlias() return type.
Filter null results (from SKIP classification) in MassImportService
receiver list to prevent null elements in the receivers collection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-pass title stripping with loop for stacked titles:
- Dot-prefixes (Dr., Prof.) matched without trailing space
- Word-prefixes (Tante, Frau, Schwester, etc.) matched at
word boundary
- Stacked titles like "Prof. Dr. Muller" handled correctly
- Single token after title strip goes to lastName (not firstName)
Add 5 "von" last names to KNOWN_LAST_NAMES for correct splitting
of entries like "Freifrau von Massenbach".
15 new test cases + updated 3 existing tests for title behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Classify raw name before processing. SKIP returns null (no Person
created). INSTITUTION/GROUP skip split() and store full name in
lastName with firstName=null and appropriate personType.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move paren extraction in parseReceivers() after the multi-separator
check so single-person entries like "Clara de Gruyter(*1871)" keep
their parens intact for split()'s annotation extraction. Multi-person
entries like "Hedi und Tutu (Gruber)" still use parens as shared
last-name override.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract trailing (...) content as annotation. Handles birth years
(*1871), nicknames (Tuttu), uncertainty markers (?), and uncertain
names (Quast ?) where the name part is extracted back into the
cleaned result. Uses [^)]* regex to prevent ReDoS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When split() returns a non-null maidenName, PersonService now
creates a PersonNameAlias with type MAIDEN_NAME. The maiden name
is stored as lastName on the alias (no firstName).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Widen pattern from `\s+geb\.\s+\S+` to `,?\s*geb\.?\s+(.+)$` to
handle: optional comma, optional dot, multi-word maiden names.
stripMaidenName() now captures the maiden name instead of discarding
it. Handles all 5 input variants from the ODS data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add displayName default method to PersonSummaryDTO
- Update native SQL queries to include title, person_type columns
- Add getInitials() utility to personFormat.ts
- Update abbreviateName/abbreviateCompact for nullable firstName
- Replace firstName+lastName concatenation with displayName in all
person-displaying components and server load files
- Regenerate API types with displayName on Person and PersonSummaryDTO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add title (nullable VARCHAR) and personType (enum, default PERSON)
- Make firstName nullable for non-person entities
- Add @Transient getDisplayName() as single source of truth for
name display, exposed via @Schema(READ_ONLY, REQUIRED)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PersonType has 5 values: PERSON, INSTITUTION, GROUP, UNKNOWN, SKIP.
SKIP is intentionally excluded from the DB CHECK constraint (added
in migration) as defense-in-depth. MAIDEN_NAME added to
PersonNameAliasType for #209.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract stripMaidenName, normalizeDotCompressed, stripAnnotation,
stripTitle, and splitByKnownLastNameOrFallback as individually
testable pipeline steps. Each extraction method is a pass-through
until its feature issue fills in the logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add title, maidenName, and annotation fields (all nullable) to
SplitName. All existing call sites pass null for new fields. Test
assertions updated to document the null-by-default contract.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inserts spaces after dots when the cleaned name has no spaces but
contains dots, so the existing last-space fallback handles names
like "E.Rockstroh" and "Dr.Fr.Zarncke" correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pre-splits input on "//" before existing logic so each segment is
processed independently through the full pipeline (und/u splitting,
last-name distribution, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Jackson tried to serialize the lazy Person proxy when returning
alias list, causing a "no session" error. The back-reference is
only needed for JPA navigation, not for API responses.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds @NotBlank @Size(max=255) on lastName, @NotNull on type,
@Valid on controller parameter. Blank/null input now returns
400 instead of reaching the DB constraint. 2 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sender alias LEFT JOIN and receiver alias EXISTS subquery to
DocumentSpecifications.hasText(). Uses entity-graph navigation via
Person.nameAliases (@OneToMany) to avoid a separate DB roundtrip
while respecting domain boundaries. 2 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds LEFT JOIN to person_name_aliases in both searchByName (JPQL)
and searchWithDocumentCount (native SQL). Uses DISTINCT/GROUP BY
to prevent duplicate results. 4 new integration tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET returns aliases (no permission required), POST requires
WRITE_ALL, DELETE requires WRITE_ALL. 5 new controller tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
getAliases (sorted by sort_order), addAlias (auto-incrementing
sort_order), removeAlias (with IDOR protection verifying alias
belongs to the given person). All TDD with 7 new unit tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces the alias domain model: entity with @ManyToOne to Person,
@OneToMany on Person for JPA graph navigation, repository with
sort_order queries, input DTO, and ALIAS_NOT_FOUND error code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TODO comment explaining why SENDER/RECEIVER sort is in-memory
(JPA INNER JOIN drops null-sender docs) and note that pagination
will require a DB COUNT query in DocumentSearchResult.of().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A sender with lastName=null produced sort key "null Bob" which sorted
before names starting with lowercase letters (n < s, t, u, v...).
Now returns "" for null lastName, which the comparator places at end.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously any value other than ASC/DESC silently defaulted to
DESC with no feedback. Now returns 400 Bad Request.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SENDER and RECEIVER are handled by in-memory sort before resolveSort
is called, making those switch cases unreachable. Removed and added
a comment making the invariant explicit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DocumentSort is a query parameter enum, not a JPA entity.
Placing it in model/ violated the layer boundary — model/ should
contain only domain entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DocumentSort enum validated by Spring MVC (400 for unknown values)
- SENDER sort uses Spring Data Sort on sender.lastName/firstName
- RECEIVER sort uses in-memory sort by first receiver alphabetically
- UPLOAD_DATE sort uses createdAt; default sort is DATE DESC
- tagQ param wired to hasTagPartial specification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- hasText now JOINs sender (LEFT JOIN) and uses EXISTS subqueries for
receivers and tags to avoid duplicate rows
- hasTagPartial added for live debounced tag text filter (ILIKE partial match)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RED/GREEN for CommentService:
- getCommentsForBlock(blockId): returns root comments filtered by blockId
- postBlockComment(documentId, blockId, content, mentions, author): creates
comment with block_id set
RED/GREEN for CommentController:
- GET /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST /api/documents/{docId}/transcription-blocks/{blockId}/comments
- POST .../comments/{commentId}/replies (reuses existing replyToComment)
4 new tests: 2 service unit tests + 2 controller integration tests
All 25 CommentServiceTest + 24 CommentControllerTest green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TranscriptionService.reorderBlocks() now returns void (command).
Controller calls listBlocks() separately after reorder (query).
Updated test to match new void signature.
Fixes @Felix: "reorderBlocks violates command-query separation"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes from PR #178 review:
Migration fixes:
- V18/V19: fix FK references from app_users to users (correct table name)
- V18: change annotation_id FK from ON DELETE CASCADE to ON DELETE RESTRICT
(block is aggregate root, cascade flows from block, not annotation)
Backend fixes:
- TranscriptionService.deleteBlock(): remove userId param, delete block first
then annotation directly via repository (no ownership check — block owns annotation)
- TranscriptionService.sanitizeText(): remove flawed regex HTML stripping,
textarea content is plain text by design — just enforce max length
- TranscriptionBlockController.requireUserId(): throw DomainException.unauthorized()
instead of silently returning null on auth failure
- CreateTranscriptionBlockDTO: add @Min/@Positive validation on coordinates
- Add @Slf4j logging to TranscriptionService for create/delete operations
Frontend fixes:
- Delete DocumentBottomPanel.svelte entirely (issue #175 requirement)
- Remove redundant mode exclusivity $effect (handled at toggle call sites)
- Remove dead handleCommentClick + onCommentClick prop (comments are future work)
- Remove quote hint UI (depends on comment feature)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TranscriptionBlock entity with @Version optimistic locking
TranscriptionBlockVersion for edit history
TranscriptionService facade: CRUD, reorder, version history
TranscriptionBlockController: REST endpoints under /api/documents/{docId}/transcription-blocks
DTOs: Create, Update, Reorder
ErrorCode: TRANSCRIPTION_BLOCK_NOT_FOUND, TRANSCRIPTION_BLOCK_CONFLICT
DocumentComment: add block_id field for block-level comment threads
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>