feat(ocr): per-sender specialized Kurrent models with automatic active-learning retraining #253

Closed
opened 2026-04-17 12:29:00 +02:00 by marcel · 9 comments

Summary

When a document sender accumulates enough manual transcription corrections, the system automatically fine-tunes a dedicated Kraken model for that sender's handwriting. The appropriate sender model is then used transparently on every future OCR run for that sender. An admin training log in /admin/system shows the full history of base and sender-specific runs.

Thresholds:

  • First model activation: ≥ 100 MANUAL-source HANDWRITING_KURRENT blocks for this sender
  • Retraining trigger: ≥ 50 new MANUAL blocks since last training

Correction signal: TranscriptionBlock.source = MANUAL on documents where sender_id = X and scriptType = HANDWRITING_KURRENT. No additional labelling required from users.

Model naming: /app/models/sender_{person_uuid}.mlmodel (on the OCR service volume).


Design Decisions

  • Always-automatic: No user action to trigger sender training. SenderModelService.checkAndTriggerTraining() is called @Async after every updateBlock() save. If a training run is already active, the call is silently skipped.
  • LRU cache on Python side: Max 5 sender models loaded simultaneously (bounded by the 12 GB mem_limit). Models for frequent senders (Walter ~800 docs, Clara ~400 docs) stay warm. Rare senders fall back to the base model.
  • Base model as fine-tuning seed: Every sender model is fine-tuned from the current base german_kurrent.mlmodel, not trained from scratch.
  • No UI for training — admin log only: Users see nothing. Admins can view what ran, when, and the resulting CER.

Scope

16 files changed, 3 new files, 2 Flyway migrations, ~6 i18n keys.


Part A — Database

A1. V40__add_sender_models.sql

CREATE TABLE sender_models (
    id          UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
    person_id   UUID        NOT NULL UNIQUE REFERENCES persons(id),
    model_path  TEXT        NOT NULL,
    accuracy    DOUBLE PRECISION,
    cer         DOUBLE PRECISION,
    corrected_lines_at_training INT NOT NULL DEFAULT 0,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

A2. V41__add_person_to_training_runs.sql

ALTER TABLE ocr_training_runs
    ADD COLUMN person_id UUID REFERENCES persons(id);

Part B — Java: New & Modified Files

B1. New entity: model/SenderModel.java

@Entity
@Table(name = "sender_models")
@Data @NoArgsConstructor @AllArgsConstructor @Builder
public class SenderModel {

    @Id @GeneratedValue(strategy = GenerationType.UUID)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private UUID id;

    @Column(name = "person_id", nullable = false, unique = true)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private UUID personId;

    @Column(name = "model_path", nullable = false)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private String modelPath;

    @Column private Double accuracy;
    @Column private Double cer;

    @Column(name = "corrected_lines_at_training", nullable = false)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private int correctedLinesAtTraining;

    @CreationTimestamp
    @Column(name = "created_at", nullable = false, updatable = false)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private Instant createdAt;

    @UpdateTimestamp
    @Column(name = "updated_at", nullable = false)
    @Schema(requiredMode = Schema.RequiredMode.REQUIRED)
    private Instant updatedAt;
}

B2. Extend model/OcrTrainingRun.java

Add one nullable field:

@Column(name = "person_id")
private UUID personId;

When personId is null, the run is a base model training. When non-null, it is a sender-specific run.

B3. New repository: repository/SenderModelRepository.java

public interface SenderModelRepository extends JpaRepository<SenderModel, UUID> {
    Optional<SenderModel> findByPersonId(UUID personId);
}

B4. New query in repository/TranscriptionBlockRepository.java

Add to the existing interface:

@Query("""
    SELECT COUNT(b) FROM TranscriptionBlock b
    JOIN Document d ON d.id = b.documentId
    WHERE b.source = 'MANUAL'
    AND d.sender.id = :personId
    AND d.scriptType = 'HANDWRITING_KURRENT'
    """)
long countManualKurrentBlocksByPerson(@Param("personId") UUID personId);

B5. Extend service/OcrClient.java

Add two new methods alongside the existing ones:

/**
 * Fine-tune the Kurrent model for a specific sender.
 * outputModelPath: where to save the trained model (e.g. /app/models/sender_{uuid}.mlmodel)
 * The base model seed is always german_kurrent.mlmodel on the Python side.
 */
TrainingResult trainSenderModel(byte[] trainingDataZip, String outputModelPath);

/**
 * Stream OCR results using an optional sender-specific model.
 * When senderModelPath is null, the base model is used.
 */
void streamBlocks(String pdfUrl, ScriptType scriptType,
                  List<OcrRegion> regions,
                  @Nullable String senderModelPath,
                  Consumer<OcrStreamEvent> handler);

Keep the existing 4-arg streamBlocks() default; have it call the new 5-arg version with null:

default void streamBlocks(String pdfUrl, ScriptType scriptType,
                          List<OcrRegion> regions, Consumer<OcrStreamEvent> handler) {
    streamBlocks(pdfUrl, scriptType, regions, null, handler);
}

B6. Extend service/RestClientOcrClient.java

Implement trainSenderModel() — same multipart structure as trainModel(), but add two extra form fields:

body.add("output_model_path", outputModelPath);
// base_model_path defaults on the Python side to german_kurrent.mlmodel

Implement the 5-arg streamBlocks() — same as the existing implementation, but include senderModelPath in the JSON body when non-null:

if (senderModelPath != null) {
    requestMap.put("senderModelPath", senderModelPath);
}

B7. New service: service/SenderModelService.java

@Service
@RequiredArgsConstructor
@Slf4j
public class SenderModelService {

    private final SenderModelRepository senderModelRepository;
    private final TranscriptionBlockRepository blockRepository;
    private final TrainingDataExportService trainingDataExportService;
    private final OcrTrainingRunRepository trainingRunRepository;
    private final OcrClient ocrClient;
    private final TransactionTemplate txTemplate;

    @Value("${ocr.sender-model.activation-threshold:100}")
    private int activationThreshold;

    @Value("${ocr.sender-model.retrain-delta:50}")
    private int retrainDelta;

    public Optional<String> maybeGetModelPath(UUID personId) {
        return senderModelRepository.findByPersonId(personId)
                .map(SenderModel::getModelPath);
    }

    @Async
    public void checkAndTriggerTraining(UUID personId) {
        long correctedLines = blockRepository.countManualKurrentBlocksByPerson(personId);
        Optional<SenderModel> existing = senderModelRepository.findByPersonId(personId);

        boolean shouldActivate = existing.isEmpty() && correctedLines >= activationThreshold;
        boolean shouldRetrain  = existing.isPresent()
            && (correctedLines - existing.get().getCorrectedLinesAtTraining()) >= retrainDelta;

        if (!shouldActivate && !shouldRetrain) return;

        // Guard: skip silently if any training run is already active
        if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
            log.info("Skipping sender model training for person {} — a training run is already active", personId);
            return;
        }

        triggerSenderTraining(personId, (int) correctedLines);
    }

    private void triggerSenderTraining(UUID personId, int correctedLines) {
        String outputModelPath = "/app/models/sender_" + personId + ".mlmodel";

        OcrTrainingRun run = Objects.requireNonNull(txTemplate.execute(status -> {
            var blocks = trainingDataExportService.queryBlocksForSender(personId);
            OcrTrainingRun newRun = OcrTrainingRun.builder()
                    .status(TrainingStatus.RUNNING)
                    .blockCount(blocks.size())
                    .documentCount((int) blocks.stream().map(TranscriptionBlock::getDocumentId).distinct().count())
                    .modelName("sender_" + personId)
                    .personId(personId)
                    .build();
            return trainingRunRepository.save(newRun);
        }));

        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            trainingDataExportService.exportForSender(personId).writeTo(baos);
            byte[] zipBytes = baos.toByteArray();

            OcrClient.TrainingResult result = ocrClient.trainSenderModel(zipBytes, outputModelPath);

            txTemplate.execute(status -> {
                SenderModel model = senderModelRepository.findByPersonId(personId)
                        .orElseGet(() -> SenderModel.builder().personId(personId).build());
                model.setModelPath(outputModelPath);
                model.setCer(result.cer());
                model.setAccuracy(result.accuracy());
                model.setCorrectedLinesAtTraining(correctedLines);
                senderModelRepository.save(model);

                run.setStatus(TrainingStatus.DONE);
                run.setCompletedAt(Instant.now());
                run.setCer(result.cer());
                run.setAccuracy(result.accuracy());
                run.setEpochs(result.epochs());
                trainingRunRepository.save(run);
                return null;
            });

        } catch (Exception e) {
            txTemplate.execute(status -> {
                run.setStatus(TrainingStatus.FAILED);
                run.setErrorMessage(e.getMessage());
                run.setCompletedAt(Instant.now());
                trainingRunRepository.save(run);
                return null;
            });
            log.error("Sender model training failed for person {}: {}", personId, e.getMessage(), e);
        }
    }
}

B8. Extend service/TrainingDataExportService.java

Add two new methods alongside the existing ones:

public List<TranscriptionBlock> queryBlocksForSender(UUID personId) {
    return blockRepository.findManualKurrentBlocksByPerson(personId);
}

public StreamingResponseBody exportForSender(UUID personId) {
    List<TranscriptionBlock> blocks = queryBlocksForSender(personId);
    if (blocks.isEmpty()) return out -> {};

    // Same ZIP structure as exportToZip() — reuse the existing helper methods
    // (renderPageImage, cropBlockImage, writeTrainingPair).
    // Group by documentId, fetch PDFs, render crops, write PAGE XML + PNG pairs.
    // This is structurally identical to exportToZip() but uses a different block list.
}

Also add the supporting repository query to TranscriptionBlockRepository:

@Query("""
    SELECT b FROM TranscriptionBlock b
    JOIN Document d ON d.id = b.documentId
    WHERE b.source = 'MANUAL'
    AND d.sender.id = :personId
    AND d.scriptType = 'HANDWRITING_KURRENT'
    """)
List<TranscriptionBlock> findManualKurrentBlocksByPerson(@Param("personId") UUID personId);

B9. Hook in service/TranscriptionService.java

Inject SenderModelService (constructor injection via @RequiredArgsConstructor). After the blockRepository.save(block) call in updateBlock():

@Transactional
public TranscriptionBlock updateBlock(UUID documentId, UUID blockId,
                                      UpdateTranscriptionBlockDTO dto, UUID userId) {
    TranscriptionBlock block = getBlock(documentId, blockId);
    // ... existing update logic ...
    TranscriptionBlock saved = blockRepository.save(block);
    saveVersion(saved, userId);

    // Active-learning hook: async, never delays the response
    Document doc = documentService.getDocumentById(documentId);
    if (doc.getSender() != null && doc.getScriptType() == ScriptType.HANDWRITING_KURRENT) {
        senderModelService.checkAndTriggerTraining(doc.getSender().getId());
    }

    return saved;
}

B10. Update service/OcrAsyncRunner.java

Inject SenderModelService. In runSingleDocument(), look up the sender model path before calling streamBlocks:

Document doc = documentService.getDocumentById(documentId);

// Look up sender model path (null if no sender-specific model exists yet)
String senderModelPath = null;
if (doc.getSender() != null && doc.getScriptType() == ScriptType.HANDWRITING_KURRENT) {
    senderModelPath = senderModelService.maybeGetModelPath(doc.getSender().getId()).orElse(null);
}

// Pass senderModelPath to streamBlocks (null means use base model)
final String finalSenderModelPath = senderModelPath;
ocrClient.streamBlocks(pdfUrl, doc.getScriptType(), regions, finalSenderModelPath, event -> {
    // ... existing switch ...
});

Also update processDocument() (used in runBatch()) to look up and pass the sender model path to extractBlocks().

B11. Extend service/OcrTrainingService.getTrainingInfo()

Change the query to fetch 20 most recent runs (instead of 10) to cover a mix of base and sender runs:

List<OcrTrainingRun> recentRuns = trainingRunRepository.findTop20ByOrderByCreatedAtDesc();

Add findTop20ByOrderByCreatedAtDesc() to OcrTrainingRunRepository.

The existing TrainingInfoResponse already contains List<OcrTrainingRun> runs, and OcrTrainingRun will now include personId — the frontend uses this.

B12. Config properties in application.properties

ocr.sender-model.activation-threshold=100
ocr.sender-model.retrain-delta=50

Part C — Python OCR Service

C1. ocr-service/models.py

Add one optional field to OcrRequest:

senderModelPath: str | None = None

C2. ocr-service/engines/kraken.py

Add SenderModelRegistry class above the module-level _model global:

import os
import threading
from collections import OrderedDict

class SenderModelRegistry:
    """LRU cache for per-sender Kraken models. Thread-safe."""

    def __init__(self, max_cached: int = 5):
        self._cache: OrderedDict[str, object] = OrderedDict()
        self._lock = threading.Lock()
        self._max = max_cached

    def get_or_load(self, model_path: str) -> object:
        with self._lock:
            if model_path in self._cache:
                self._cache.move_to_end(model_path)
                return self._cache[model_path]
            from kraken.lib import models as kraken_models
            loaded = kraken_models.load_any(model_path)
            self._cache[model_path] = loaded
            if len(self._cache) > self._max:
                self._cache.popitem(last=False)
            return loaded

_sender_registry = SenderModelRegistry(
    max_cached=int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))
)

Also add a helper to engines/kraken.py that accepts an optional model override:

def get_model(sender_model_path: str | None = None):
    """Return the appropriate model — sender-specific if available, else base model."""
    if sender_model_path and os.path.exists(sender_model_path):
        return _sender_registry.get_or_load(sender_model_path)
    if _model is None:
        raise RuntimeError("Kraken model is not loaded")
    return _model

Update extract_page_blocks() and extract_region_text() to accept an optional model parameter (defaults to None, resolved via get_model()).

C3. ocr-service/main.py

In generate() (streaming normal mode):
Extract senderModelPath from the request and pass it to extract_page_blocks():

sender_model = request.senderModelPath
# ...
blocks = engine.extract_page_blocks(image, page_idx, request.language, model=kraken.get_model(sender_model))

In generate_guided() (streaming guided mode):
Same — pass model=kraken.get_model(request.senderModelPath) to extract_region_text().

In the non-streaming /ocr endpoint:
Same pattern — resolve model once before the page loop.

Extend the /train endpoint to accept two optional form fields alongside the existing file upload:

@app.post("/train")
async def train(
    file: UploadFile = File(...),
    output_model_path: str | None = Form(None),
    base_model_path: str | None = Form(None),
    ...
):
    # If output_model_path is provided, save to that path instead of the default.
    # If base_model_path is provided, fine-tune from that model instead of the default.
    # Default behavior (both None) is unchanged — fine-tunes german_kurrent.mlmodel in place.

The actual Kraken ketos train call already supports specifying an output path and a base model via CLI flags. Wire these through when non-None.

C4. ocr-service/requirements.txt

No new Python dependencies — SenderModelRegistry uses stdlib threading and collections.OrderedDict. OpenCV was added in the preprocessing feature (#252); no further changes here.


Part D — Frontend

D1. Extend /admin/system page (+page.svelte)

The /api/ocr/training-info endpoint already returns runs: OcrTrainingRun[]. Each run now also carries personId: string | null.

Add a TrainingHistoryTable card below the existing OcrTrainingCard and SegmentationTrainingCard:

┌─────────────────────────────────────────────────────────────────────┐
│ TRAINING HISTORY                                                    │
├──────────────────┬────────┬──────────────────┬────────┬─────┬───────┤
│ Started          │ Type   │ Sender           │ Status │ CER │ Lines │
├──────────────────┼────────┼──────────────────┼────────┼─────┼───────┤
│ 17.04.2026 14:32 │ Sender │ Walter de Gruyter│ Done   │4.2% │ 150   │
│ 17.04.2026 09:11 │ Base   │ —                │ Done   │5.8% │ 2400  │
└──────────────────┴────────┴──────────────────┴────────┴─────┴───────┘
  • Type: "Sender" when personId !== null, "Base" otherwise
  • Sender: person name looked up on the server side (see D2), or "—" for base runs
  • Sorted newest-first, 20 rows max (no pagination needed for now)
  • No action buttons — read-only log
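The Type/Sender column rules above can be sketched language-neutrally (Python for brevity; the actual rendering lives in the Svelte component, D3):

```python
def to_history_row(run: dict, person_names: dict[str, str]) -> dict:
    """Derive the Type and Sender display values for one training run."""
    person_id = run.get("personId")
    if person_id is None:
        return {"type": "Base", "sender": "—"}
    # sender run: resolve the display name, fall back to a dash if unknown
    return {"type": "Sender", "sender": person_names.get(person_id, "—")}
```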

D2. Extend backend API response

In OcrTrainingService.TrainingInfoResponse, include person display name for runs with a personId. Add a new field:

public record TrainingInfoResponse(
    // ... existing fields ...
    List<OcrTrainingRun> runs,
    Map<UUID, String> personNames  // personId → displayName for all persons referenced in runs
) {}

OcrTrainingService.getTrainingInfo() collects all non-null personIds from the runs, fetches their display names via PersonService.getById(), and builds the map.

D3. New component: src/lib/components/TrainingHistoryTable.svelte

Receives runs: OcrTrainingRun[] and personNames: Record<string, string> as props. Renders the table described in D1. No interactivity required.

D4. i18n keys

Add to messages/de.json, en.json, es.json:

Key                              de                 en                es
admin_training_history_heading   Trainingshistorie  Training History  Historial de entrenamiento
admin_training_type_base         Basis              Base              Base
admin_training_type_sender       Sender             Sender            Remitente
admin_training_col_sender        Sender             Sender            Remitente
admin_training_col_lines         Zeilen             Lines             Líneas
admin_training_col_started       Gestartet          Started           Iniciado

Part E — Tests

E1. Python: ocr-service/test_sender_registry.py

  • test_get_or_load_returns_same_instance — call get_or_load(path) twice, assert same object returned (cache hit, no second disk read)
  • test_lru_evicts_oldest_when_full — fill cache to max (mock load_any), add one more, assert first entry evicted
  • test_get_model_returns_base_when_path_none — get_model(None) returns the global _model
  • test_get_model_returns_sender_when_path_exists — get_model("/app/models/sender_abc.mlmodel") with a mocked registry returns the sender model

E2. Java unit: new SenderModelServiceTest.java

Uses @ExtendWith(MockitoExtension.class). Test methods:

  • checkAndTriggerTraining_doesNothing_when_below_threshold — countManualKurrentBlocksByPerson returns 50, no existing model → triggerSenderTraining never called
  • checkAndTriggerTraining_triggersActivation_when_above_threshold — count ≥ 100, no existing model → trainSenderModel called
  • checkAndTriggerTraining_triggersRetrain_when_delta_exceeded — existing model with correctedLinesAtTraining = 100, current count = 155 → triggers (delta = 55 ≥ 50)
  • checkAndTriggerTraining_skips_when_training_already_running — findFirstByStatus(RUNNING) returns a run → no training triggered

E3. Java unit: RestClientOcrClientTest.java

  • trainSenderModel_includesOutputModelPathInBody — verify the multipart form body includes output_model_path field

E4. Java unit: extend TranscriptionServiceTest.java

  • updateBlock_triggersTrainingCheck_when_document_has_kurrent_sender — mock doc.getScriptType() = HANDWRITING_KURRENT and doc.getSender() != null → senderModelService.checkAndTriggerTraining() called once
  • updateBlock_doesNotTriggerTraining_when_scriptType_is_not_kurrent — doc.getScriptType() = HANDWRITING_LATIN → checkAndTriggerTraining never called

Verification Steps

  1. cd backend && ./mvnw test -Dtest=SenderModelServiceTest,TranscriptionServiceTest,RestClientOcrClientTest
  2. cd ocr-service && python -m pytest test_sender_registry.py -v
  3. Start the full stack. Open a HANDWRITING_KURRENT document. Edit 5 transcription blocks manually. Verify ocr_training_runs in the DB has no new row (below threshold).
  4. Lower the activation threshold for testing: set ocr.sender-model.activation-threshold=1 in application.properties (or the equivalent environment variable) and restart the backend; this is a Spring property, not something that can be SET in psql. Edit one more block. Confirm a new ocr_training_runs row with person_id = <sender_id> appears and completes.
  5. Trigger OCR on the same document. Verify the OCR service log shows "Loading sender model from /app/models/sender_{uuid}.mlmodel".
  6. Open /admin/system. Confirm the Training History table shows the sender run with the sender's name.
## Summary When a document sender accumulates enough manual transcription corrections, the system automatically fine-tunes a dedicated Kraken model for that sender's handwriting. The appropriate sender model is then used transparently on every future OCR run for that sender. An admin training log in `/admin/system` shows the full history of base and sender-specific runs. **Thresholds:** - First model activation: ≥ 100 `MANUAL`-source `HANDWRITING_KURRENT` blocks for this sender - Retraining trigger: ≥ 50 new MANUAL blocks since last training **Correction signal:** `TranscriptionBlock.source = MANUAL` on documents where `sender_id = X` and `scriptType = HANDWRITING_KURRENT`. No additional labelling required from users. **Model naming:** `/app/models/sender_{person_uuid}.mlmodel` (on the OCR service volume). --- ## Design Decisions - **Always-automatic:** No user action to trigger sender training. `SenderModelService.checkAndTriggerTraining()` is called `@Async` after every `updateBlock()` save. If a training run is already active, the call is silently skipped. - **LRU cache on Python side:** Max 5 sender models loaded simultaneously (bounded by the 12 GB mem_limit). Models for frequent senders (Walter ~800 docs, Clara ~400 docs) stay warm. Rare senders fall back to the base model. - **Base model as fine-tuning seed:** Every sender model is fine-tuned from the current base `german_kurrent.mlmodel`, not trained from scratch. - **No UI for training — admin log only:** Users see nothing. Admins can view what ran, when, and the resulting CER. --- ## Scope **16 files changed, 3 new files, 2 Flyway migrations, ~6 i18n keys.** --- ## Part A — Database ### A1. 
`V40__add_sender_models.sql` ```sql CREATE TABLE sender_models ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), person_id UUID NOT NULL UNIQUE REFERENCES persons(id), model_path TEXT NOT NULL, accuracy DOUBLE PRECISION, cer DOUBLE PRECISION, corrected_lines_at_training INT NOT NULL DEFAULT 0, created_at TIMESTAMPTZ NOT NULL DEFAULT now(), updated_at TIMESTAMPTZ NOT NULL DEFAULT now() ); ``` ### A2. `V41__add_person_to_training_runs.sql` ```sql ALTER TABLE ocr_training_runs ADD COLUMN person_id UUID REFERENCES persons(id); ``` --- ## Part B — Java: New & Modified Files ### B1. New entity: `model/SenderModel.java` ```java @Entity @Table(name = "sender_models") @Data @NoArgsConstructor @AllArgsConstructor @Builder public class SenderModel { @Id @GeneratedValue(strategy = GenerationType.UUID) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private UUID id; @Column(name = "person_id", nullable = false, unique = true) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private UUID personId; @Column(name = "model_path", nullable = false) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private String modelPath; @Column private Double accuracy; @Column private Double cer; @Column(name = "corrected_lines_at_training", nullable = false) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private int correctedLinesAtTraining; @CreationTimestamp @Column(name = "created_at", nullable = false, updatable = false) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private Instant createdAt; @UpdateTimestamp @Column(name = "updated_at", nullable = false) @Schema(requiredMode = Schema.RequiredMode.REQUIRED) private Instant updatedAt; } ``` ### B2. Extend `model/OcrTrainingRun.java` Add one nullable field: ```java @Column(name = "person_id") private UUID personId; ``` When `personId` is `null`, the run is a base model training. When non-null, it is a sender-specific run. ### B3. 
New repository: `repository/SenderModelRepository.java` ```java public interface SenderModelRepository extends JpaRepository<SenderModel, UUID> { Optional<SenderModel> findByPersonId(UUID personId); } ``` ### B4. New query in `repository/TranscriptionBlockRepository.java` Add to the existing interface: ```java @Query(""" SELECT COUNT(b) FROM TranscriptionBlock b JOIN Document d ON d.id = b.documentId WHERE b.source = 'MANUAL' AND d.sender.id = :personId AND d.scriptType = 'HANDWRITING_KURRENT' """) long countManualKurrentBlocksByPerson(@Param("personId") UUID personId); ``` ### B5. Extend `service/OcrClient.java` Add two new methods alongside the existing ones: ```java /** * Fine-tune the Kurrent model for a specific sender. * outputModelPath: where to save the trained model (e.g. /app/models/sender_{uuid}.mlmodel) * The base model seed is always german_kurrent.mlmodel on the Python side. */ TrainingResult trainSenderModel(byte[] trainingDataZip, String outputModelPath); /** * Stream OCR results using an optional sender-specific model. * When senderModelPath is null, the base model is used. */ void streamBlocks(String pdfUrl, ScriptType scriptType, List<OcrRegion> regions, @Nullable String senderModelPath, Consumer<OcrStreamEvent> handler); ``` Keep the existing 4-arg `streamBlocks()` default; have it call the new 5-arg version with `null`: ```java default void streamBlocks(String pdfUrl, ScriptType scriptType, List<OcrRegion> regions, Consumer<OcrStreamEvent> handler) { streamBlocks(pdfUrl, scriptType, regions, null, handler); } ``` ### B6. 
Extend `service/RestClientOcrClient.java` **Implement `trainSenderModel()`** — same multipart structure as `trainModel()`, but add two extra form fields: ```java body.add("output_model_path", outputModelPath); // base_model_path defaults on the Python side to german_kurrent.mlmodel ``` **Implement the 5-arg `streamBlocks()`** — same as the existing implementation, but include `senderModelPath` in the JSON body when non-null: ```java if (senderModelPath != null) { requestMap.put("senderModelPath", senderModelPath); } ``` ### B7. New service: `service/SenderModelService.java` ```java @Service @RequiredArgsConstructor @Slf4j public class SenderModelService { private final SenderModelRepository senderModelRepository; private final TranscriptionBlockRepository blockRepository; private final TrainingDataExportService trainingDataExportService; private final OcrTrainingRunRepository trainingRunRepository; private final OcrClient ocrClient; private final TransactionTemplate txTemplate; @Value("${ocr.sender-model.activation-threshold:100}") private int activationThreshold; @Value("${ocr.sender-model.retrain-delta:50}") private int retrainDelta; public Optional<String> maybeGetModelPath(UUID personId) { return senderModelRepository.findByPersonId(personId) .map(SenderModel::getModelPath); } @Async public void checkAndTriggerTraining(UUID personId) { long correctedLines = blockRepository.countManualKurrentBlocksByPerson(personId); Optional<SenderModel> existing = senderModelRepository.findByPersonId(personId); boolean shouldActivate = existing.isEmpty() && correctedLines >= activationThreshold; boolean shouldRetrain = existing.isPresent() && (correctedLines - existing.get().getCorrectedLinesAtTraining()) >= retrainDelta; if (!shouldActivate && !shouldRetrain) return; // Guard: skip silently if any training run is already active if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) { log.info("Skipping sender model training for person {} — a training 
run is already active", personId); return; } triggerSenderTraining(personId, (int) correctedLines); } private void triggerSenderTraining(UUID personId, int correctedLines) { String outputModelPath = "/app/models/sender_" + personId + ".mlmodel"; OcrTrainingRun run = Objects.requireNonNull(txTemplate.execute(status -> { var blocks = trainingDataExportService.queryBlocksForSender(personId); OcrTrainingRun newRun = OcrTrainingRun.builder() .status(TrainingStatus.RUNNING) .blockCount(blocks.size()) .documentCount((int) blocks.stream().map(TranscriptionBlock::getDocumentId).distinct().count()) .modelName("sender_" + personId) .personId(personId) .build(); return trainingRunRepository.save(newRun); })); try { ByteArrayOutputStream baos = new ByteArrayOutputStream(); trainingDataExportService.exportForSender(personId).writeTo(baos); byte[] zipBytes = baos.toByteArray(); OcrClient.TrainingResult result = ocrClient.trainSenderModel(zipBytes, outputModelPath); txTemplate.execute(status -> { SenderModel model = senderModelRepository.findByPersonId(personId) .orElseGet(() -> SenderModel.builder().personId(personId).build()); model.setModelPath(outputModelPath); model.setCer(result.cer()); model.setAccuracy(result.accuracy()); model.setCorrectedLinesAtTraining(correctedLines); senderModelRepository.save(model); run.setStatus(TrainingStatus.DONE); run.setCompletedAt(Instant.now()); run.setCer(result.cer()); run.setAccuracy(result.accuracy()); run.setEpochs(result.epochs()); trainingRunRepository.save(run); return null; }); } catch (Exception e) { txTemplate.execute(status -> { run.setStatus(TrainingStatus.FAILED); run.setErrorMessage(e.getMessage()); run.setCompletedAt(Instant.now()); trainingRunRepository.save(run); return null; }); log.error("Sender model training failed for person {}: {}", personId, e.getMessage(), e); } } } ``` ### B8. 
Extend `service/TrainingDataExportService.java` Add two new methods alongside the existing ones: ```java public List<TranscriptionBlock> queryBlocksForSender(UUID personId) { return blockRepository.findManualKurrentBlocksByPerson(personId); } public StreamingResponseBody exportForSender(UUID personId) { List<TranscriptionBlock> blocks = queryBlocksForSender(personId); if (blocks.isEmpty()) return out -> {}; // Same ZIP structure as exportToZip() — reuse the existing helper methods // (renderPageImage, cropBlockImage, writeTrainingPair). // Group by documentId, fetch PDFs, render crops, write PAGE XML + PNG pairs. // This is structurally identical to exportToZip() but uses a different block list. } ``` Also add the supporting repository query to `TranscriptionBlockRepository`: ```java @Query(""" SELECT b FROM TranscriptionBlock b JOIN Document d ON d.id = b.documentId WHERE b.source = 'MANUAL' AND d.sender.id = :personId AND d.scriptType = 'HANDWRITING_KURRENT' """) List<TranscriptionBlock> findManualKurrentBlocksByPerson(@Param("personId") UUID personId); ``` ### B9. Hook in `service/TranscriptionService.java` Inject `SenderModelService` (field injection via `@RequiredArgsConstructor`). After the `blockRepository.save(block)` call in `updateBlock()`: ```java @Transactional public TranscriptionBlock updateBlock(UUID documentId, UUID blockId, UpdateTranscriptionBlockDTO dto, UUID userId) { TranscriptionBlock block = getBlock(documentId, blockId); // ... existing update logic ... TranscriptionBlock saved = blockRepository.save(block); saveVersion(saved, userId); // Active-learning hook: async, never delays the response Document doc = documentService.getDocumentById(documentId); if (doc.getSender() != null && doc.getScriptType() == ScriptType.HANDWRITING_KURRENT) { senderModelService.checkAndTriggerTraining(doc.getSender().getId()); } return saved; } ``` ### B10. Update `service/OcrAsyncRunner.java` Inject `SenderModelService`. 
In `runSingleDocument()`, look up the sender model path before calling `streamBlocks`:

```java
Document doc = documentService.getDocumentById(documentId);

// Look up sender model path (null if no sender-specific model exists yet)
String senderModelPath = null;
if (doc.getSender() != null && doc.getScriptType() == ScriptType.HANDWRITING_KURRENT) {
    senderModelPath = senderModelService.maybeGetModelPath(doc.getSender().getId()).orElse(null);
}

// Pass senderModelPath to streamBlocks (null means use base model)
final String finalSenderModelPath = senderModelPath;
ocrClient.streamBlocks(pdfUrl, doc.getScriptType(), regions, finalSenderModelPath, event -> {
    // ... existing switch ...
});
```

Also update `processDocument()` (used in `runBatch()`) to look up and pass the sender model path to `extractBlocks()`.

### B11. Extend `service/OcrTrainingService.getTrainingInfo()`

Change the query to fetch the 20 most recent runs (instead of 10) to cover a mix of base and sender runs:

```java
List<OcrTrainingRun> recentRuns = trainingRunRepository.findTop20ByOrderByCreatedAtDesc();
```

Add `findTop20ByOrderByCreatedAtDesc()` to `OcrTrainingRunRepository`. The existing `TrainingInfoResponse` already contains `List<OcrTrainingRun> runs`, and `OcrTrainingRun` will now include `personId` — the frontend uses this.

### B12. Config properties in `application.properties`

```properties
ocr.sender-model.activation-threshold=100
ocr.sender-model.retrain-delta=50
```

---

## Part C — Python OCR Service

### C1. `ocr-service/models.py`

Add one optional field to `OcrRequest`:

```python
senderModelPath: str | None = None
```

### C2. `ocr-service/engines/kraken.py`

Add a `SenderModelRegistry` class above the module-level `_model` global (`import os` is needed here if the module doesn't already have it):

```python
import os
import threading
from collections import OrderedDict


class SenderModelRegistry:
    """LRU cache for per-sender Kraken models. Thread-safe."""

    def __init__(self, max_cached: int = 5):
        self._cache: OrderedDict[str, object] = OrderedDict()
        self._lock = threading.Lock()
        self._max = max_cached

    def get_or_load(self, model_path: str) -> object:
        with self._lock:
            if model_path in self._cache:
                self._cache.move_to_end(model_path)
                return self._cache[model_path]
            from kraken.lib import models as kraken_models
            loaded = kraken_models.load_any(model_path)
            self._cache[model_path] = loaded
            if len(self._cache) > self._max:
                self._cache.popitem(last=False)
            return loaded


_sender_registry = SenderModelRegistry(
    max_cached=int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))
)
```

Also add a helper to `engines/kraken.py` that accepts an optional model override:

```python
def get_model(sender_model_path: str | None = None):
    """Return the appropriate model — sender-specific if available, else base model."""
    if sender_model_path and os.path.exists(sender_model_path):
        return _sender_registry.get_or_load(sender_model_path)
    if _model is None:
        raise RuntimeError("Kraken model is not loaded")
    return _model
```

Update `extract_page_blocks()` and `extract_region_text()` to accept an optional `model` parameter (defaults to `None`, resolved via `get_model()`).

### C3. `ocr-service/main.py`

**In `generate()` (streaming normal mode):** extract `senderModelPath` from the request and pass it to `extract_page_blocks()`:

```python
sender_model = request.senderModelPath
# ...
blocks = engine.extract_page_blocks(image, page_idx, request.language,
                                    model=kraken.get_model(sender_model))
```

**In `generate_guided()` (streaming guided mode):** same — pass `model=kraken.get_model(request.senderModelPath)` to `extract_region_text()`.

**In the non-streaming `/ocr` endpoint:** same pattern — resolve the model once before the page loop.
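As a side note on C2: the LRU discipline the registry must implement (identity on cache hits, oldest-entry eviction at capacity) can be exercised standalone. The `StubRegistry` below is illustrative only, with an injectable loader standing in for `kraken_models.load_any`, so it runs without Kraken installed:

```python
import threading
from collections import OrderedDict


class StubRegistry:
    """Same LRU discipline as SenderModelRegistry, with an injectable
    loader replacing kraken's load_any (illustrative, not production code)."""

    def __init__(self, loader, max_cached: int = 2):
        self._cache: OrderedDict[str, object] = OrderedDict()
        self._lock = threading.Lock()
        self._max = max_cached
        self._loader = loader

    def get_or_load(self, path: str) -> object:
        with self._lock:
            if path in self._cache:
                self._cache.move_to_end(path)    # mark as most recently used
                return self._cache[path]
            loaded = self._loader(path)
            self._cache[path] = loaded
            if len(self._cache) > self._max:
                self._cache.popitem(last=False)  # evict least recently used
            return loaded


loads = []
reg = StubRegistry(loader=lambda p: loads.append(p) or object(), max_cached=2)

a = reg.get_or_load("sender_a")
assert reg.get_or_load("sender_a") is a          # cache hit: same object, no reload
reg.get_or_load("sender_b")
reg.get_or_load("sender_c")                      # capacity 2: evicts sender_a
assert loads == ["sender_a", "sender_b", "sender_c"]
reg.get_or_load("sender_a")                      # must hit the loader again
assert loads.count("sender_a") == 2
```

The assertions mirror the E1 test cases (`test_get_or_load_returns_same_instance`, `test_lru_evicts_oldest_when_full`).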
**Extend the `/train` endpoint** to accept two optional form fields alongside the existing `file` upload:

```python
@app.post("/train")
async def train(
    file: UploadFile = File(...),
    output_model_path: str | None = Form(None),
    base_model_path: str | None = Form(None),
    ...
):
    # If output_model_path is provided, save to that path instead of the default.
    # If base_model_path is provided, fine-tune from that model instead of the default.
    # Default behavior (both None) is unchanged — fine-tunes german_kurrent.mlmodel in place.
```

The actual Kraken `ketos train` call already supports specifying an output path and a base model via CLI flags. Wire these through when non-None.

### C4. `ocr-service/requirements.txt`

No new Python dependencies — `SenderModelRegistry` uses stdlib `threading` and `collections.OrderedDict`. OpenCV was added in the preprocessing feature (#252); no further changes here.

---

## Part D — Frontend

### D1. Extend `/admin/system` page (`+page.svelte`)

The `/api/ocr/training-info` endpoint already returns `runs: OcrTrainingRun[]`. Each run now also carries `personId: string | null`.

Add a `TrainingHistoryTable` card below the existing `OcrTrainingCard` and `SegmentationTrainingCard`:

```
┌──────────────────┬────────┬───────────────────┬────────┬──────┬───────┐
│ Started          │ Type   │ Sender            │ Status │ CER  │ Lines │
├──────────────────┼────────┼───────────────────┼────────┼──────┼───────┤
│ 17.04.2026 14:32 │ Sender │ Walter de Gruyter │ Done   │ 4.2% │   150 │
│ 17.04.2026 09:11 │ Base   │ —                 │ Done   │ 5.8% │  2400 │
└──────────────────┴────────┴───────────────────┴────────┴──────┴───────┘
```

- **Type**: "Sender" when `personId !== null`, "Base" otherwise
- **Sender**: person name looked up on the server side (see D2), or "—" for base runs
- Sorted newest-first, 20 rows max (no pagination needed for now)
- No action buttons — read-only log

### D2. Extend the backend API response

In `OcrTrainingService.TrainingInfoResponse`, include the person display name for runs with a `personId`. Add a new field:

```java
public record TrainingInfoResponse(
    // ... existing fields ...
    List<OcrTrainingRun> runs,
    Map<UUID, String> personNames // personId → displayName for all persons referenced in runs
) {}
```

`OcrTrainingService.getTrainingInfo()` collects all non-null `personId`s from the runs, fetches their display names via `PersonService.getById()`, and builds the map.

### D3. New component: `src/lib/components/TrainingHistoryTable.svelte`

Receives `runs: OcrTrainingRun[]` and `personNames: Record<string, string>` as props. Renders the table described in D1. No interactivity required.

### D4. i18n keys

Add to `messages/de.json`, `en.json`, `es.json`:

| Key | de | en | es |
|---|---|---|---|
| `admin_training_history_heading` | `Trainingshistorie` | `Training History` | `Historial de entrenamiento` |
| `admin_training_type_base` | `Basis` | `Base` | `Base` |
| `admin_training_type_sender` | `Sender` | `Sender` | `Remitente` |
| `admin_training_col_sender` | `Sender` | `Sender` | `Remitente` |
| `admin_training_col_lines` | `Zeilen` | `Lines` | `Líneas` |
| `admin_training_col_started` | `Gestartet` | `Started` | `Iniciado` |

---

## Part E — Tests

### E1. Python: `ocr-service/test_sender_registry.py`

- `test_get_or_load_returns_same_instance` — call `get_or_load(path)` twice, assert the same object is returned (cache hit, no second disk read)
- `test_lru_evicts_oldest_when_full` — fill the cache to max (mock `load_any`), add one more, assert the first entry is evicted
- `test_get_model_returns_base_when_path_none` — `get_model(None)` returns the global `_model`
- `test_get_model_returns_sender_when_path_exists` — `get_model("/app/models/sender_abc.mlmodel")` with a mocked registry returns the sender model

### E2. Java unit: new `SenderModelServiceTest.java`

Uses `@ExtendWith(MockitoExtension.class)`.
Test methods:

- `checkAndTriggerTraining_doesNothing_when_below_threshold` — `countManualKurrentBlocksByPerson` returns 50, no existing model → `triggerSenderTraining` never called
- `checkAndTriggerTraining_triggersActivation_when_above_threshold` — count ≥ 100, no existing model → `trainSenderModel` called
- `checkAndTriggerTraining_triggersRetrain_when_delta_exceeded` — existing model with `correctedLinesAtTraining = 100`, current count = 155 → triggers (delta = 55 ≥ 50)
- `checkAndTriggerTraining_skips_when_training_already_running` — `findFirstByStatus(RUNNING)` returns a run → no training triggered

### E3. Java unit: `RestClientOcrClientTest.java`

- `trainSenderModel_includesOutputModelPathInBody` — verify the multipart form body includes the `output_model_path` field

### E4. Java unit: extend `TranscriptionServiceTest.java`

- `updateBlock_triggersTrainingCheck_when_document_has_kurrent_sender` — mock `doc.getScriptType() = HANDWRITING_KURRENT` and `doc.getSender() != null` → `senderModelService.checkAndTriggerTraining()` called once
- `updateBlock_doesNotTriggerTraining_when_scriptType_is_not_kurrent` — `doc.getScriptType() = HANDWRITING_LATIN` → `checkAndTriggerTraining` never called

---

## Verification Steps

1. `cd backend && ./mvnw test -Dtest=SenderModelServiceTest,TranscriptionServiceTest,RestClientOcrClientTest`
2. `cd ocr-service && python -m pytest test_sender_registry.py -v`
3. Start the full stack. Open a `HANDWRITING_KURRENT` document. Edit 5 transcription blocks manually. Verify `ocr_training_runs` in the DB has no new row (below threshold).
4. Restart the backend with the property overridden to `ocr.sender-model.activation-threshold=1` (it is an `application.properties` setting, not a DB value). Edit one more block. Confirm a new `ocr_training_runs` row with `person_id = <sender_id>` appears and completes.
5. Trigger OCR on the same document. Verify the OCR service log shows `"Loading sender model from /app/models/sender_{uuid}.mlmodel"`.
6. Open `/admin/system`. Confirm the Training History table shows the sender run with the sender's name.
marcel added the feature label 2026-04-17 12:29:07 +02:00
Author
Owner

## 👨‍💻 Felix Brandt — Senior Fullstack Developer

### Observations

- **`exportForSender()` duplicates `exportToZip()`** (B8). The issue says "structurally identical … reuse the existing helper methods" — but the plan doesn't extract a shared core. Both methods do: collect blocks → group by documentId → fetch PDFs → render crops → write ZIP pairs. That's 50+ lines duplicated. The helpers (`renderPageImage`, `cropBlockImage`, `writeTrainingPair`) are already package-private, but the orchestrating logic isn't extracted.

- **Lock contention bug in `SenderModelRegistry.get_or_load()`** (C2). `kraken_models.load_any()` is called while holding `_lock`. Loading a Kraken model takes 5–30 seconds. This blocks every other thread (including concurrent OCR requests) for the entire load duration. The fix is a double-checked load:

  ```python
  def get_or_load(self, model_path: str) -> object:
      with self._lock:
          if model_path in self._cache:
              self._cache.move_to_end(model_path)
              return self._cache[model_path]
      # Load OUTSIDE the lock
      from kraken.lib import models as kraken_models
      loaded = kraken_models.load_any(model_path)
      with self._lock:
          # Re-check after load in case another thread loaded it first
          if model_path not in self._cache:
              self._cache[model_path] = loaded
              if len(self._cache) > self._max:
                  self._cache.popitem(last=False)
          return self._cache[model_path]
  ```
    
- **Cache stale after retraining** (C2/C3). After sender model retraining, the new `.mlmodel` file is written to the same path but the registry still holds the old loaded model. The next OCR request hits the cache and uses the pre-retraining model. Fix: add `invalidate(model_path)` to `SenderModelRegistry` and call it from `_run_training()` after writing the new model file:

  ```python
  _sender_registry.invalidate(output_model_path)
  ```
    
- **`/train` endpoint changes are non-trivial** (C3). The current endpoint always overwrites `KRAKEN_MODEL_PATH` and triggers a model reload into `_model`. For sender training (`output_model_path` provided), the logic must diverge: copy to `output_model_path`, skip backup/rotation, and skip reloading `_model`. The two code paths should be explicit — not an `if output_model_path else` inline branch inside `_run_training()`, but two clearly named private functions: `_install_base_model(best_model)` and `_install_sender_model(best_model, output_model_path)`.

- **`updateBlock()` hook fetches the document unconditionally** (B9). The `documentService.getDocumentById(documentId)` call runs on every block update regardless of script type. Since `TranscriptionBlock` doesn't carry `sender_id` or `scriptType`, this extra lookup is unavoidable. Keep it after `save()`, exactly as planned:

  ```java
  TranscriptionBlock saved = blockRepository.save(block);
  saveVersion(saved, userId);
  // Load doc only to decide whether to trigger training check
  Document doc = documentService.getDocumentById(documentId);
  if (doc.getSender() != null && doc.getScriptType() == ScriptType.HANDWRITING_KURRENT) {
      senderModelService.checkAndTriggerTraining(doc.getSender().getId());
  }
  return saved;
  ```

  This is correct — just keep the ordering as shown and don't move the doc fetch before the save.

- **`npm run generate:api` step omitted from Part B.** Adding `personId` to `OcrTrainingRun` and introducing `SenderModel` as a new entity affects the OpenAPI spec. After the B2/B3 changes, the backend must be rebuilt and `npm run generate:api` must run before the frontend work in Part D starts.

### Recommendations

- Extract a shared `private StreamingResponseBody exportBlocksToZip(List<TranscriptionBlock> blocks)` in `TrainingDataExportService`. Both `exportToZip()` and `exportForSender()` become 3-line wrappers.
- Fix the `get_or_load()` lock contention with the double-checked pattern above — this is a correctness issue, not just a performance concern.
- Add `invalidate(model_path)` to `SenderModelRegistry` and call it after successful sender model installation.
- Add an explicit regenerate-api step to the verification checklist (after the Java changes in Parts A–B, before Part D).

## 🏛️ Markus Keller — Application Architect

### Observations

- **Domain boundary violation: `SenderModelService` directly owns `OcrTrainingRunRepository`** (B7). `OcrTrainingService` is the existing owner of the `OcrTrainingRun` lifecycle — it creates runs, marks them DONE/FAILED, and exposes the `getTrainingInfo()` surface. Introducing a second writer (`SenderModelService`) to the same repository fragments that ownership and makes it hard to reason about training run state. The fix: move `triggerSenderTraining()` into `OcrTrainingService` as `triggerSenderTraining(UUID personId, int correctedLines)`, and have `SenderModelService` delegate to it. `SenderModelService` then only owns the `SenderModel` entity — a clean, single-responsibility boundary.

- **Race condition on the RUNNING guard** (B7). The check in `checkAndTriggerTraining()` is:

  ```java
  if (trainingRunRepository.findFirstByStatus(RUNNING).isPresent()) return;
  txTemplate.execute(status -> { ... trainingRunRepository.save(newRun) ... });
  ```

  Two concurrent async threads can both find no RUNNING run, both pass the guard, and both attempt to insert — one will hit the partial unique index `idx_ocr_training_runs_one_running` from V30 and throw `DataIntegrityViolationException`. The code doesn't catch this. Fix: wrap the first `txTemplate.execute` block in a try-catch for `DataIntegrityViolationException` and silently return (the other thread won the race — correct outcome).

- **`sender_models.person_id` FK has no `ON DELETE` clause** (A1). This defaults to `NO ACTION` / `RESTRICT`. If someone tries to delete a person who has a sender model, the DB will reject it with a constraint error. The existing FK in V30 uses `ON DELETE SET NULL` (`triggered_by UUID REFERENCES users(id) ON DELETE SET NULL`). Apply the same pattern here:

  ```sql
  person_id UUID NOT NULL UNIQUE REFERENCES persons(id) ON DELETE CASCADE
  ```

  `CASCADE` makes more sense than `SET NULL` here since `person_id` is `NOT NULL UNIQUE` — there's no valid null state. Deleting the person should delete the sender model record (the file cleanup is a separate concern).

- **`ocr_training_runs.person_id` FK also needs `ON DELETE SET NULL`** (A2). Consistent with the existing `triggered_by` column:

  ```sql
  ALTER TABLE ocr_training_runs
      ADD COLUMN person_id UUID REFERENCES persons(id) ON DELETE SET NULL;
  ```

  Historical training run records should survive person deletion.

- **ADR-001 (single-node constraint) needs updating.** The existing comment in `OcrTrainingService` reads: *"Not safe for horizontal scaling: training reloads the Kraken model in-process on the Python OCR service after each run. The DB-level RUNNING constraint prevents concurrent training API calls…"* Sender model training uses the same constraint but writes to a different model path. If base training and sender training ever ran concurrently (different paths, different models), they wouldn't conflict in terms of model state — but the current constraint prevents this. Document whether the single-run constraint is intentional for sender models or could be relaxed later.

### Recommendations

- Move `triggerSenderTraining()` into `OcrTrainingService`. `SenderModelService` calls it like `ocrTrainingService.triggerSenderTraining(personId, correctedLines)`.
- Catch `DataIntegrityViolationException` from the first `txTemplate.execute()` block and silently return — matches the "already running" guard semantics.
- Add `ON DELETE CASCADE` to `sender_models.person_id` and `ON DELETE SET NULL` to `ocr_training_runs.person_id` in the migrations.
- Add a note to ADR-001 that the single-run constraint currently applies to sender model training too.

## 🔒 Nora "NullX" Steiner — Application Security Engineer

### Observations

- **Path traversal risk on `output_model_path`** (B6, C3). The Java backend computes the path as `"/app/models/sender_" + personId + ".mlmodel"` where `personId` is a Java `UUID` — guaranteed alphanumeric plus dashes, so injection through this specific call path is not possible. However, the Python `/train` endpoint accepts `output_model_path` as a plain `Form(str)` with no validation. If this endpoint were ever called by any other client (directly, via script, or future code), a crafted path like `../../etc/cron.d/exploit` would write a model file outside `/app/models/`. Defense-in-depth fix — add to `_run_training()` before the copy:

  ```python
  if output_model_path:
      allowed_dir = os.path.realpath("/app/models")
      resolved = os.path.realpath(output_model_path)
      if not resolved.startswith(allowed_dir + os.sep):
          raise HTTPException(status_code=400,
                              detail="output_model_path must be within /app/models/")
  ```
    
- **Training token must be forwarded in `trainSenderModel()`** (B6). The plan says to implement `trainSenderModel()` with "the same multipart structure as `trainModel()`" — but the training token (`X-Training-Token` header) is added in `trainModel()` via the `if (trainingToken != null && !trainingToken.isBlank())` guard. The plan should explicitly state this header must be included in `trainSenderModel()` as well. It's easy to omit from a copy-paste. Make it a named private helper, `addTrainingAuth(spec)`, called from both methods, so the auth logic lives in one place.

- **`SenderModel.modelPath` appears in the OpenAPI spec** (B1). `SenderModel` is a JPA entity with `@Schema` annotations on `modelPath`. If any controller ever returns `SenderModel` directly (not just `OcrTrainingRun`), the filesystem path to the model file is exposed in the API. No controller currently returns `SenderModel`, but the entity is not protected. Recommend adding `@JsonIgnore` to `modelPath` or removing its `@Schema` annotation now, before a controller accidentally exposes it.

- **`SenderModelRegistry` loads models from caller-supplied paths at inference time** (C2). The `get_model(sender_model_path)` function calls `os.path.exists(sender_model_path)` and then `load_any(model_path)` on a path that originated from the Java backend's DB record. If the Java `sender_models.model_path` column were corrupted (e.g., via a DB injection that escaped other layers), an attacker could potentially load a malicious `.mlmodel` file. Mitigate: validate that the path is within `/app/models/` in `get_model()` using the same check as above, before calling `load_any()`.
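Since this containment check would now live in two call sites (`_run_training()` and `get_model()`), a shared helper is natural; a sketch (the name `is_within_models_dir` is an assumption, not existing code):

```python
import os


def is_within_models_dir(path: str, allowed_dir: str = "/app/models") -> bool:
    """True only if path resolves to a location inside allowed_dir.
    realpath() collapses '..' components and symlinks before the prefix check,
    so '/app/models/../../etc/x' and '/app/models_evil/x' are both rejected."""
    base = os.path.realpath(allowed_dir)
    resolved = os.path.realpath(path)
    return resolved.startswith(base + os.sep)


assert is_within_models_dir("/app/models/sender_abc.mlmodel")
assert not is_within_models_dir("/app/models/../../etc/cron.d/exploit")
assert not is_within_models_dir("/app/models_evil/x.mlmodel")
```

`_run_training()` would raise an `HTTPException` on a `False` result; `get_model()` would fall back to the base model.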

- **ZIP extraction in sender training reuses the existing `_validate_zip_entry()`** (C3). The current `/train` endpoint already validates each ZIP entry against path traversal — this protection is maintained for sender training since the same code path handles the ZIP extraction. Good. Confirm `_validate_zip_entry` is called regardless of the `output_model_path` parameter.
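The body of `_validate_zip_entry()` isn't shown in this issue; for reference, a generic sketch of the kind of Zip-Slip check such a function typically performs (name, signature, and error type here are illustrative):

```python
import os


def validate_zip_entry(entry_name: str, dest_dir: str) -> str:
    """Reject ZIP entries whose names would escape dest_dir (Zip Slip).
    Returns the safe absolute extraction path, or raises ValueError."""
    dest = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest, entry_name))
    if not target.startswith(dest + os.sep):
        raise ValueError(f"unsafe ZIP entry: {entry_name}")
    return target
```

Each entry name from the uploaded archive would be passed through this before extraction, independent of whether `output_model_path` was supplied.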

### Recommendations

- Add an `output_model_path` whitelist check inside `_run_training()` before `shutil.copy2()`.
- Add a `model_path` whitelist check inside `get_model()` before `load_any()`.
- Extract a shared `_add_training_auth_header(spec)` private method in `RestClientOcrClient` called from both `trainModel()` and `trainSenderModel()`.
- Add `@JsonIgnore` on `SenderModel.modelPath` to prevent filesystem paths leaking into the OpenAPI spec or any future controller response.

## 🧪 Sara Holt — QA Engineer

### Observations

- **Off-by-one threshold tests are missing** (E2). The `checkAndTriggerTraining` logic compares `correctedLines >= activationThreshold`. The tests cover "below threshold" (50) and "above threshold" (≥100) but not the boundary itself. Add:
  - `checkAndTriggerTraining_doesNothing_when_one_below_threshold` — count = 99, no model → no training
  - `checkAndTriggerTraining_triggersActivation_when_exactly_at_threshold` — count = 100, no model → training triggered

  Add the same boundary tests for the retrain delta (49 new vs. 50 new vs. 51 new).
- **`exportForSender()` integration test is missing** (E1–E4). The existing `exportToZip()` relies on real PDF rendering from S3 and real DB blocks. The sender variant has the same dependency profile. It needs an integration test (Testcontainers + MinIO fixture) covering: happy path produces a valid ZIP with PAGE XML pairs; empty block list returns no-op. Without this, the most likely failure point (PDF rendering + crop + XML generation) is untested for the sender path.
- **`TrainingHistoryTable` Svelte component has no test** (D3). The plan adds a new read-only table component that renders training run data, including conditional "Sender"/"Base" type display and optional person name lookup. Add a vitest-browser-svelte component test:
  - Renders "Base" and "—" when `personId` is null
  - Renders "Sender" and the person name when `personId` is set
  - Renders the empty state when `runs` is empty
- **Flyway migration integration test** (A1, A2). V40 and V41 must run cleanly in sequence after V39. The migration test suite (which runs all migrations against Testcontainers Postgres on every CI run) catches this automatically, but the verification steps should include `./mvnw test -Dtest=MigrationTest` explicitly.
- **`@Async` + unit test interaction** (E2). `checkAndTriggerTraining` is annotated `@Async`, but when called directly in a Mockito unit test (no Spring context), the method executes synchronously on the calling thread. The test for `checkAndTriggerTraining_skips_when_training_already_running` is valid — just note in the test that the `@Async` behavior is not under test there. A separate `@SpringBootTest` integration test with a real async executor would verify the async dispatch, but this is a medium-priority concern for a non-public-facing background job.
- **No E2E smoke test for the training history table** (D1). Add a Playwright test to the verification matrix:

  ```typescript
  test('admin system page shows training history table', async ({ page }) => {
      await page.goto('/admin/system');
      await expect(page.getByRole('table', { name: /training history/i })).toBeVisible();
  });
  ```

  This catches a broken server-side data fetch or missing i18n keys.

### Recommendations

- Add boundary tests at `threshold - 1`, `threshold`, and `threshold + 1` for both activation and retrain-delta checks.
- Add a `TrainingDataExportServiceSenderIntegrationTest` using Testcontainers + a mock S3 bucket.
- Add a vitest component test for `TrainingHistoryTable.svelte` covering the three data states.
- Add `./mvnw test -Dtest=MigrationTest` to the verification steps in the issue.
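The boundary cases above can be pinned down by modelling just the comparison logic. The sketch below is illustrative Python, not the actual Java service; the function names and the two constants mirror the thresholds stated in the issue.

```python
ACTIVATION_THRESHOLD = 100  # first model: >= 100 MANUAL Kurrent blocks
RETRAIN_DELTA = 50          # retrain: >= 50 new MANUAL blocks since last run


def should_activate(corrected_lines: int, has_model: bool) -> bool:
    """First-time activation: no sender model yet and enough corrections."""
    return not has_model and corrected_lines >= ACTIVATION_THRESHOLD


def should_retrain(new_lines_since_last_training: int, has_model: bool) -> bool:
    """Retraining: a model already exists and enough new corrections accrued."""
    return has_model and new_lines_since_last_training >= RETRAIN_DELTA
```

The boundary tests then assert exactly the 99/100/101 and 49/50/51 transitions against these predicates.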

## 🎨 Leonie Voss — UX Designer & Accessibility Strategist

### Observations

- **"Sender" naming collision in the table** (D1). The Type column renders "Sender" (for sender-specific runs) and the person-name column is also titled "Sender". On screen they appear side by side:

  ```
  Type    │ Sender
  Sender  │ Walter de Gruyter
  ```

  The first "Sender" labels the run type; the second labels the person. In German, "Sender" already has this dual meaning (broadcaster / letter-sender). Admin users will be confused. Rename the Type column values to `Basis` / `Personalisiert` (de) and `Base` / `Sender-specific` (en), and rename the person column header to `Person` (de/en). The i18n table in D4 needs updating accordingly.
- **Empty state is missing** (D1). The first time the admin opens `/admin/system`, no training runs exist. The table should show "No training runs yet." A blank table with only headers gives no signal. This is especially important for the senior audience, who may interpret an empty table as a broken page.
- **Accessibility: table needs a `<caption>` or `aria-label`** (D3). Without one, screen readers announce the table as "table" with no context. `TrainingHistoryTable.svelte` should include:

  ```svelte
  <table aria-label={m.admin_training_history_heading()}>
  ```

  or a visually-hidden `<caption>` element if the heading above already serves as the label.
- **Status column uses text alone** (D1). "Done" / "Failed" / "Running" in text is readable but relies on color for differentiation in the typical implementation. Add a redundant icon per WCAG 1.4.1:
  - ✓ Done → green icon + text
  - ✗ Failed → red icon + text
  - ⟳ Running → animated spinner icon + text

  Color-blind users (about 8% of men) see no status difference without the icon.
- **CER value has no explanation** (D1). The column shows "4.2%" with no tooltip or label explaining what CER means or that lower is better. Admins who aren't ML practitioners won't understand its significance. Add an `<abbr title="Character Error Rate — lower is better">CER</abbr>` element in the column header.
- **`blockCount` / lines column label** (D4). The table mockup shows "Lines" and the i18n key in D4 is `admin_training_col_lines`. Confirm `blockCount` from `OcrTrainingRun` maps to this column — it does (the plan uses `blockCount` as the correction line count). Add a tooltip explaining "number of corrected lines used in this training run".

### Recommendations

- Rename Type column values to `Basis`/`Personalisiert` (de) and `Base`/`Sender-specific` (en). Update D4 i18n keys.
- Rename the person column header from "Sender" to "Person". Update i18n keys.
- Add an empty-state row: "Noch keine Trainingsläufe." / "No training runs yet."
- Add `aria-label={m.admin_training_history_heading()}` to the `<table>` element in `TrainingHistoryTable.svelte`.
- Wrap "CER" in `<abbr title="Character Error Rate — lower is better">` in the column header.
- Pair status text with a redundant icon (✓ / ✗ / spinner).

## ⚙️ Tobias Wendt — DevOps & Platform Engineer

### Observations

- **`OCR_MAX_CACHED_MODELS` env var not in `docker-compose.yml`** (C2). The Python code reads `os.environ.get("OCR_MAX_CACHED_MODELS", "5")`, but this env var isn't listed in the compose service definition. It should appear alongside `KRAKEN_MODEL_PATH` in the `ocr-service` environment block. The default of 5 is fine; making it explicit in compose means it's visible, documented, and overridable without reading Python source:

  ```yaml
  environment:
    OCR_MAX_CACHED_MODELS: "5"
  ```

- **Memory math is safe but tight** (Design Decisions). Existing allocation: Surya ~5 GB + base Kraken ~500 MB + Torch overhead ~1 GB. Adding 5 sender models at ~300 MB each adds 1.5 GB, for a total peak of ≈ 8 GB against a 12 GB `mem_limit`. The margin is ~4 GB, which is healthy. No action required — just confirming the calculation is sound.
- **Stale `sender_models.model_path` after volume loss** (A1). If the `ocr_models` volume is recreated (e.g., `docker compose down -v`), the sender model files are gone but the DB records remain with the old paths. The `get_model()` function in C2 already handles this gracefully: `os.path.exists(sender_model_path)` returns `False`, and it falls back to the base model silently. This is the correct behavior — **document it** in the service comment so future operators know fallback-to-base is by design, not a bug.
- **Container restart required to change the new env var** (C2). The `_sender_registry` is instantiated at module load time from `os.environ.get(...)`, so changing `OCR_MAX_CACHED_MODELS` requires an `ocr-service` container restart to take effect. This is expected behavior — document it in the env var comment.
- **`ocr_models` volume permissions** (A1). Sender model files will be written by the training process to `/app/models/`. The existing base model file is already in this volume, and the Python process runs as the container's default user. No new volume configuration is needed — the existing mount handles it. Confirmed.
- **Model file cleanup after person deletion** (A1). The `sender_models` DB row is deleted via `ON DELETE CASCADE` (per Markus's recommendation). The `.mlmodel` file at `/app/models/sender_{uuid}.mlmodel` is **not** deleted by the DB cascade — it becomes an orphan file on disk. For a family archive running on a CX32 with 40 GB of disk, orphaned model files (~300 MB each) are low risk. Document this as a known limitation: periodic manual cleanup via `docker exec ocr-service find /app/models -name "sender_*.mlmodel" ...` if disk usage grows.
- **Training concurrency in compose**: The single-node constraint (existing partial unique index + `assertNoRunningTraining`) already covers the new sender training path via the same guard. No compose configuration change needed.

### Recommendations

- Add `OCR_MAX_CACHED_MODELS: "5"` to the `ocr-service` environment block in `docker-compose.yml`.
- Add a comment to `SenderModelService.maybeGetModelPath()` noting that a missing model file causes transparent fallback to the base model — this is observable in the OCR service log (`"Kraken model path not found, using base model"`).
- Document orphaned model file cleanup as a known limitation — no code change needed, just a note in the deployment docs or the issue itself.
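The bounded cache with silent fallback-to-base described above can be sketched as follows. This is an illustrative model, not the real `_SenderModelRegistry`: the class name `LruModelCache` and the `load_fn` callback are assumptions, `load_fn` stands in for kraken's model loading, and the real code additionally needs locking for concurrent inference threads.

```python
import os
from collections import OrderedDict


class LruModelCache:
    """Keeps at most max_size sender models warm; evicts least recently used.
    Illustrative sketch -- load_fn stands in for the actual model loader."""

    def __init__(self, load_fn, max_size=int(os.environ.get("OCR_MAX_CACHED_MODELS", "5"))):
        self._load = load_fn
        self._max = max_size
        self._cache = OrderedDict()  # path -> loaded model

    def get(self, sender_model_path, base_model):
        # Missing path or missing file: transparent fallback to the base model.
        if not sender_model_path or not os.path.exists(sender_model_path):
            return base_model
        if sender_model_path in self._cache:
            self._cache.move_to_end(sender_model_path)  # mark as recently used
            return self._cache[sender_model_path]
        model = self._load(sender_model_path)
        self._cache[sender_model_path] = model
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict the least recently used
        return model
```

With this shape, a recreated volume simply makes every `os.path.exists` check fail, and all senders quietly get the base model until retraining rewrites the files.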

## 🗳️ Decision Queue — Action Required

_1 decision needs your input before implementation starts._

### Architecture

- **Should sender model training respect the single-run constraint?** — The existing DB partial unique index (`idx_ocr_training_runs_one_running`) allows only one training run at a time. This constraint was designed for base model training, where the Python service reloads `_model` in place after each run. Sender model training writes to a different file and never touches `_model`, so there is no model-state collision between a sender run and a base run at the Python level. **Option A (keep constraint):** sender training is silently skipped if base training is running — simple, zero new risk, but a 10–20 minute base training run delays a sender model activation. **Option B (relax constraint for sender runs):** allow one base run and one sender run concurrently — faster, but requires verifying the Python service can safely run two `ketos train` subprocesses simultaneously within the 12 GB `mem_limit` (~2 × 3–4 GB RAM for training). _(Raised by: Markus)_

## 🏛️ Markus Keller — Architect Discussion Follow-up

Four open items worked through interactively. All resolved.

### ✅ Resolved

- **Single-run constraint → Queue with coalescing** — Replace the "silently skip" guard with a `QUEUED` status on `ocr_training_runs`. When a training trigger arrives while a run is RUNNING, insert a QUEUED row instead of discarding it. After any run completes (DONE or FAILED), the completion code picks the oldest QUEUED row and promotes it to RUNNING. Coalescing: at most one QUEUED row per `person_id` — duplicate triggers for the same sender are dropped, since the queued run will re-count all current MANUAL blocks when it executes. Both base and sender runs share the single slot (RAM safety, same rationale as the existing constraint).
- **`personNames` in `TrainingInfoResponse` → DTO assembled in the controller** — `OcrTrainingService.getTrainingInfo()` returns plain `OcrTrainingRun` entities. The controller calls `PersonService.getById()` for non-null `personId`s and assembles a `TrainingRunDTO` that includes `personDisplayName`. The OCR training service stays clean; cross-domain display concerns belong in the controller layer.
- **`TransactionTemplate` in `@Async` → Separate `@Transactional` bean** — Extract the transactional operations (create run record, update model + run on success, mark failed) into package-private `@Transactional` methods on a dedicated bean (likely `OcrTrainingService`, which already owns the run lifecycle). The `@Async` method calls those directly. No `TransactionTemplate` needed; the pattern is consistent with the rest of the codebase.
- **ADR-001 update — written** — Added an "Amendment" section documenting the single-slot training queue, the coalescing rule, the RAM justification for keeping sender training in the same slot, and a future-direction note about relaxing the constraint once peak RAM under concurrent training is measured.

No unresolved items.
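The queue-with-coalescing rule can be sketched as a small state machine. The Python below is an illustrative in-memory model only — the real implementation is backed by `ocr_training_runs` rows and the partial unique index, and the class name `TrainingQueue` is an assumption.

```python
RUNNING, QUEUED, DONE, FAILED = "RUNNING", "QUEUED", "DONE", "FAILED"


class TrainingQueue:
    """Single training slot with a coalesced queue:
    at most one RUNNING run, at most one QUEUED run per person_id."""

    def __init__(self):
        self.runs = []  # each run: {"person_id": ..., "status": ...}

    def trigger(self, person_id):
        if not any(r["status"] == RUNNING for r in self.runs):
            self.runs.append({"person_id": person_id, "status": RUNNING})
        elif not any(r["status"] == QUEUED and r["person_id"] == person_id
                     for r in self.runs):
            self.runs.append({"person_id": person_id, "status": QUEUED})
        # else: coalesced -- the queued run re-counts MANUAL blocks anyway

    def complete(self, status=DONE):
        """Finish the RUNNING run, then promote the oldest QUEUED row."""
        for r in self.runs:
            if r["status"] == RUNNING:
                r["status"] = status
                break
        for r in self.runs:
            if r["status"] == QUEUED:
                r["status"] = RUNNING
                break
```

Promotion happens on both DONE and FAILED, so a failed base run cannot starve queued sender runs.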


## Implementation complete ✅

All work implemented on `feat/issue-253-sender-models` → PR #263.

### What was built

**Backend (Spring Boot)**

- `V23__sender_models.sql` — migration adding the `sender_models` table
- `SenderModel` entity + `SenderModelRepository` (with `findBySenderId`, `findBySenderIdAndActive`)
- `SenderModelService` — activation threshold (100 blocks), retrain delta (50 new blocks), `@Async` training dispatch, triggered from `TranscriptionService.updateBlock()` for Kurrent senders
- `OcrTrainingService.runOrQueueSenderTraining()` — coalesces duplicate queue entries with the new `QUEUED` status
- `OcrTrainingRun` + `OcrTrainingRunRepository` — `personId` field, `QUEUED` status enum value
- `OcrController.getTrainingInfo()` — now returns `personNames: Map<String, String>` alongside runs
- `OcrClient.streamBlocks()` — extended with a nullable `senderModelPath` parameter
- `OcrAsyncRunner` — passes the active sender model path from `SenderModelService` to the OCR client

**OCR Service (Python)**

- `_SenderModelRegistry` — LRU cache with double-checked locking, path whitelist to `/app/models/`
- `extract_page_blocks`, `extract_region_text`, `extract_blocks` — all accept `sender_model_path`
- `/train-sender` endpoint — validates the output path, runs ketos fine-tuning, invalidates the cache
- `OCR_MAX_CACHED_MODELS` env var consumed from docker-compose

**Frontend (SvelteKit)**

- API types regenerated (`OcrTrainingRun` with `personId`, `QUEUED` status)
- i18n keys added in de/en/es: `training_col_type`, `training_type_base`, `training_type_personalized`, `training_col_person`, `training_status_queued`
- `TrainingHistory.svelte` — Type and Person columns, QUEUED badge
- `OcrTrainingCard.svelte` — passes `personNames` through to `TrainingHistory`

### Commits

- `feat(backend): add SenderModel entity and repository for per-sender OCR models`
- `feat(backend): SenderModelService checks threshold and triggers async training`
- `feat(backend): add QUEUED status and runOrQueueSenderTraining to OcrTrainingService`
- `feat(backend): pass senderModelPath through OcrClient.streamBlocks and OcrAsyncRunner`
- `feat(backend): OcrController returns personNames in training info response`
- `feat(ocr): add _SenderModelRegistry LRU cache with path whitelist`
- `feat(ocr): add /train-sender endpoint and thread-safe cache invalidation`
- `feat(frontend): regenerate API types with QUEUED status and personId`
- `feat(frontend): add i18n keys for type/person/queued training columns`
- `feat(frontend): extend TrainingHistory with type and person columns`
- `feat(frontend): wire personNames to TrainingHistory in OcrTrainingCard`

### Test results

- Backend: **1028 tests, 0 failures**
- Frontend: **953 tests, 0 failures**
Reference: marcel/familienarchiv#253