Compare commits

7 Commits

Author SHA1 Message Date
Marcel
13955a5459 test(search): add sender name FTS coverage and combined filter test
Some checks failed
CI / Unit & Component Tests (push) Failing after 3s
CI / Backend Unit Tests (push) Failing after 2s
CI / Unit & Component Tests (pull_request) Failing after 1s
CI / Backend Unit Tests (pull_request) Failing after 0s
- should_find_document_by_sender_name — symmetric with existing receiver test
- fts_combined_with_status_filter_excludes_non_matching_status — verifies
  hasIds(rankedIds).and(hasStatus(...)) two-phase search works together

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 11:03:37 +02:00
Marcel
5affe21b79 refactor(search): replace O(n²) indexOf with HashMap for rank ordering
ids.indexOf() scans the full list for each document, giving O(n²) total.
Build a Map<UUID, Integer> once at O(n) and use getOrDefault at O(1) per
document. Behavior is identical; existing tests remain green.
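
The change described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `RankOrderSketch`, `sortByRank`, and the `idOf` accessor are hypothetical names, and sorting unranked ids last is an assumption (the commit only states that `getOrDefault` is used).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

// Illustrative sketch only: class and method names are hypothetical, not taken from the repository.
final class RankOrderSketch {

    // Returns a copy of items sorted to follow the order of rankedIds.
    static <T> List<T> sortByRank(List<UUID> rankedIds, List<T> items, Function<? super T, UUID> idOf) {
        // Build the id -> position map once: O(n).
        Map<UUID, Integer> position = new HashMap<>();
        for (int i = 0; i < rankedIds.size(); i++) {
            position.putIfAbsent(rankedIds.get(i), i); // first occurrence wins, as with indexOf
        }
        List<T> sorted = new ArrayList<>(items);
        // O(1) lookup per comparison. Unranked ids sort last here, which is an assumption;
        // the commit only says getOrDefault is used, not which default value.
        sorted.sort(Comparator.comparingInt(
                (T item) -> position.getOrDefault(idOf.apply(item), Integer.MAX_VALUE)));
        return sorted;
    }
}
```

Compared with calling `ids.indexOf(...)` inside the comparator, the map is built once up front, so each comparison no longer rescans the ranked list.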

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:59:05 +02:00
Marcel
e621bdd890 fix(search): respect DATE sort when text is present — do not override with relevance
When a user explicitly selects DATE sort with a text query active, the
previous code treated it identically to RELEVANCE, silently discarding
the user's sort choice. Remove DATE from the useRankOrder condition so
that explicit DATE sort always goes through the standard JPA sort path.
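
A minimal sketch of the decision this fix describes, under stated assumptions: the enum values mirror the sort names that appear elsewhere in this diff plus RELEVANCE, while `SortDecisionSketch` and the exact shape of the `useRankOrder` check are hypothetical and may differ from the real `DocumentService` code.

```java
// Illustrative sketch only: the real condition lives in DocumentService and may be shaped differently.
final class SortDecisionSketch {

    // Assumed enum values, based on the sort names visible in this diff plus RELEVANCE.
    enum DocumentSort { DATE, TITLE, SENDER, RECEIVER, UPLOAD_DATE, RELEVANCE }

    // Rank order applies only when a text query is present and the caller
    // did not explicitly ask for a non-relevance sort such as DATE.
    static boolean useRankOrder(String text, DocumentSort sort) {
        boolean hasText = text != null && !text.isBlank();
        return hasText && (sort == null || sort == DocumentSort.RELEVANCE);
    }
}
```

With this shape, a text query combined with an explicit DATE sort falls through to the regular sort path instead of being reordered by relevance.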

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:57:24 +02:00
Marcel
3421f3203c feat(fts): backfill search_vector for all existing documents (V35)
Some checks failed
CI / Unit & Component Tests (push) Failing after 2s
CI / Backend Unit Tests (push) Failing after 1s
CI / Unit & Component Tests (pull_request) Failing after 2s
CI / Backend Unit Tests (pull_request) Failing after 3s
Fires the BEFORE UPDATE trigger for every documents row, which recomputes
the tsvector from all currently-linked metadata, blocks, receivers, and tags.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 23:47:45 +02:00
Marcel
349f74d39a feat(fts): replace ILIKE hasText with FTS two-phase search and RELEVANCE sort
- DocumentSort: add RELEVANCE enum value
- DocumentSpecifications: remove hasText() ILIKE, add hasIds(List<UUID>)
  for FTS-pre-filtered ID sets
- DocumentService.searchDocuments(): FTS two-phase path — findRankedIdsByFts()
  returns ranked UUIDs, hasIds() narrows subsequent Specification query,
  in-memory re-sort preserves rank order; RELEVANCE is the default when
  text is present and no explicit non-relevance sort is requested
- DocumentSpecificationsTest: remove hasText() tests (Specification removed)
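
The two-phase flow listed above can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: `TwoPhaseSearchSketch`, the `Doc` class, and the string status values are hypothetical, the ranked id list stands in for the result of `findRankedIdsByFts()`, and the predicate stands in for the remaining Specification filters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.Predicate;

// Illustrative in-memory sketch; the real code runs phase two as a JPA Specification query.
final class TwoPhaseSearchSketch {

    // Hypothetical stand-in for the document entity.
    static final class Doc {
        final UUID id;
        final String status;

        Doc(UUID id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    // rankedIds: phase-one output, best match first (stands in for the FTS query result).
    // otherFilters: stands in for the remaining filters, for example a status filter.
    static List<Doc> search(List<UUID> rankedIds, List<Doc> candidates, Predicate<Doc> otherFilters) {
        // Phase two, step 1: keep only documents whose id was returned by phase one
        // and that also pass the other filters.
        Set<UUID> wanted = new HashSet<>(rankedIds);
        Map<UUID, Doc> matched = new HashMap<>();
        for (Doc doc : candidates) {
            if (wanted.contains(doc.id) && otherFilters.test(doc)) {
                matched.put(doc.id, doc);
            }
        }
        // Phase two, step 2: walk the ranked ids to restore relevance order.
        List<Doc> ordered = new ArrayList<>();
        for (UUID id : rankedIds) {
            Doc doc = matched.remove(id); // remove() also guards against duplicate ids
            if (doc != null) {
                ordered.add(doc);
            }
        }
        return ordered;
    }
}
```

Because the second phase only narrows the ranked id set, additional filters can drop documents without disturbing the relative order of the ones that remain.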

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 23:46:24 +02:00
Marcel
efeb913d4c feat(fts): add search_vector column, GIN index, DB triggers, and FTS repository method (V34)
- V34 migration: adds search_vector tsvector column with GIN index
- BEFORE INSERT/UPDATE trigger on documents rebuilds vector from title (A),
  summary + transcription_blocks.text (B), sender/receiver names (C),
  tag names + location (D) using german FTS config
- AFTER triggers on transcription_blocks, document_receivers, document_tags
  touch the parent document row to re-fire the BEFORE UPDATE trigger
- DocumentRepository.findRankedIdsByFts() native query using websearch_to_tsquery
- DocumentFtsTest: 12 integration tests covering stemming, trigger sync,
  ranking, stop words, malformed input, receiver and tag search

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 23:38:12 +02:00
Marcel
fc27043d40 chore: add Claude personas, skills, memory, and project docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:22:39 +02:00
27 changed files with 129 additions and 824 deletions

@@ -21,10 +21,9 @@ PORT_FRONTEND=5173
 PORT_MAILPIT_UI=8100
 PORT_MAILPIT_SMTP=1025
-# OCR Training — secret token required to call /train and /segtrain on the OCR service.
-# Also set in the backend so it can pass the token through. Must not be empty in production.
-# Generate with: python3 -c "import secrets; print(secrets.token_hex(32))"
-OCR_TRAINING_TOKEN=change-me-in-production
+# OCR Training — set a secret token to protect the /train and /segtrain endpoints on the
+# Python OCR microservice. Leave empty to disable token authentication (development only).
+# OCR_TRAINING_TOKEN=change-me-in-production
 # Production SMTP — uncomment and fill in to send real emails instead of catching them
 # APP_BASE_URL=https://your-domain.example.com

@@ -1,4 +0,0 @@
target/
.git/
*.md
api_tests/

@@ -1,18 +1,9 @@
-FROM eclipse-temurin:21.0.10_7-jdk-noble AS builder
+FROM eclipse-temurin:21-jdk
 WORKDIR /app
-# Copy wrapper and POM first — dependency layer is cached separately from source
-COPY .mvn .mvn
-COPY mvnw pom.xml ./
-RUN --mount=type=cache,target=/root/.m2 ./mvnw dependency:go-offline -q
-COPY src ./src
-# -Dmaven.test.skip=true skips test compilation entirely (not just execution)
-RUN --mount=type=cache,target=/root/.m2 ./mvnw clean package -Dmaven.test.skip=true -q
-FROM eclipse-temurin:21.0.10_7-jre-noble
-WORKDIR /app
-# Spring Boot repackages to *.jar; pre-repackage artifact uses .jar.original, not .jar
-COPY --from=builder /app/target/*.jar app.jar
 EXPOSE 8080
-CMD ["java", "-jar", "app.jar"]
+# Source code and mvnw are mounted via docker-compose volume at runtime.
+# Maven dependencies are cached in a named volume (~/.m2).
+CMD ["./mvnw", "spring-boot:run"]

@@ -12,5 +12,5 @@ public interface OcrTrainingRunRepository extends JpaRepository<OcrTrainingRun,
 Optional<OcrTrainingRun> findFirstByStatus(TrainingStatus status);
-List<OcrTrainingRun> findTop10ByOrderByCreatedAtDesc();
+List<OcrTrainingRun> findTop5ByOrderByCreatedAtDesc();
 }

@@ -45,13 +45,6 @@ public class OcrTrainingService {
 List<OcrTrainingRun> runs
 ) {}
-private void assertNoRunningTraining() {
-if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
-throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING,
-"A training run is already in progress");
-}
-}
 // Not safe for horizontal scaling: training reloads the Kraken model in-process on the
 // Python OCR service after each run. The DB-level RUNNING constraint (V30 partial unique
 // index) prevents concurrent training API calls, but cannot prevent two OCR service replicas
@@ -60,7 +53,10 @@ public class OcrTrainingService {
 // Short transaction: guard check + create RUNNING row, then commit immediately.
 // The DB connection is released before the OCR HTTP call, which can take several minutes.
 OcrTrainingRun run = Objects.requireNonNull(txTemplate.execute(status -> {
-assertNoRunningTraining();
+if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
+throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING,
+"A training run is already in progress");
+}
 var eligibleBlocks = trainingDataExportService.queryEligibleBlocks();
 if (eligibleBlocks.size() < 5) {
@@ -124,7 +120,10 @@ public class OcrTrainingService {
 public OcrTrainingRun triggerSegTraining(UUID triggeredBy) {
 // Same pattern as triggerTraining: narrow transactions around DB writes only.
 OcrTrainingRun run = Objects.requireNonNull(txTemplate.execute(status -> {
-assertNoRunningTraining();
+if (trainingRunRepository.findFirstByStatus(TrainingStatus.RUNNING).isPresent()) {
+throw DomainException.conflict(ErrorCode.TRAINING_ALREADY_RUNNING,
+"A training run is already in progress");
+}
 var segBlocks = segmentationTrainingExportService.querySegmentationBlocks();
 if (segBlocks.size() < 5) {
@@ -163,12 +162,11 @@ public class OcrTrainingService {
 return Objects.requireNonNull(txTemplate.execute(status -> {
 run.setStatus(TrainingStatus.DONE);
 run.setCompletedAt(Instant.now());
-run.setCer(result.cer());
 run.setLoss(result.loss());
 run.setAccuracy(result.accuracy());
 run.setEpochs(result.epochs());
 OcrTrainingRun updated = trainingRunRepository.save(run);
-log.info("[trainingRun={}] Segmentation training completed — cer={} epochs={}", runId, result.cer(), result.epochs());
+log.info("[trainingRun={}] Segmentation training completed — epochs={}", runId, result.epochs());
 return updated;
 }));
 } catch (Exception e) {
@@ -195,7 +193,7 @@ public class OcrTrainingService {
 int totalOcrBlocks = (int) blockRepository.count();
 int availableSegBlocks = segmentationTrainingExportService.querySegmentationBlocks().size();
-List<OcrTrainingRun> recentRuns = trainingRunRepository.findTop10ByOrderByCreatedAtDesc();
+List<OcrTrainingRun> recentRuns = trainingRunRepository.findTop5ByOrderByCreatedAtDesc();
 OcrTrainingRun lastRun = recentRuns.isEmpty() ? null : recentRuns.get(0);
 return new TrainingInfoResponse(

@@ -53,7 +53,7 @@ class OcrTrainingServiceTest {
 service = new OcrTrainingService(runRepository, exportService, segExportService, ocrClient, healthClient, blockRepository, txTemplate);
 when(blockRepository.count()).thenReturn(0L);
-when(runRepository.findTop10ByOrderByCreatedAtDesc()).thenReturn(List.of());
+when(runRepository.findTop5ByOrderByCreatedAtDesc()).thenReturn(List.of());
 when(segExportService.querySegmentationBlocks()).thenReturn(List.of());
 }
@@ -146,90 +146,6 @@ class OcrTrainingServiceTest {
run.getStatus() == TrainingStatus.FAILED && run.getErrorMessage() != null));
}
// ─── triggerSegTraining ───────────────────────────────────────────────────
@Test
void triggerSegTraining_throws409_whenRunningRunExists() {
when(runRepository.findFirstByStatus(TrainingStatus.RUNNING))
.thenReturn(Optional.of(OcrTrainingRun.builder()
.id(UUID.randomUUID()).status(TrainingStatus.RUNNING)
.blockCount(5).documentCount(2).modelName("blla").build()));
assertThatThrownBy(() -> service.triggerSegTraining(null))
.isInstanceOf(DomainException.class)
.extracting("status")
.satisfies(s -> assertThat(s.toString()).contains("409"));
}
@Test
void triggerSegTraining_throws422_whenFewerThan5Segments() {
when(runRepository.findFirstByStatus(TrainingStatus.RUNNING)).thenReturn(Optional.empty());
when(segExportService.querySegmentationBlocks()).thenReturn(List.of(
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(UUID.randomUUID()).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(UUID.randomUUID()).build()
));
assertThatThrownBy(() -> service.triggerSegTraining(null))
.isInstanceOf(DomainException.class);
}
@Test
void triggerSegTraining_createsRunWithBlla_andMarksDoneWithCer() throws Exception {
when(runRepository.findFirstByStatus(TrainingStatus.RUNNING)).thenReturn(Optional.empty());
UUID docA = UUID.randomUUID();
UUID docB = UUID.randomUUID();
List<TranscriptionBlock> segs = List.of(
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docB).build()
);
when(segExportService.querySegmentationBlocks()).thenReturn(segs);
when(segExportService.exportToZip()).thenReturn(out -> {});
when(ocrClient.segtrainModel(any())).thenReturn(new OcrClient.TrainingResult(null, 0.92, 0.08, 5));
OcrTrainingRun saved = OcrTrainingRun.builder()
.id(UUID.randomUUID()).status(TrainingStatus.RUNNING)
.blockCount(5).documentCount(2).modelName("blla").build();
when(runRepository.save(any())).thenReturn(saved);
service.triggerSegTraining(null);
verify(runRepository, atLeastOnce()).save(argThat(run ->
run.getStatus() == TrainingStatus.DONE
&& "blla".equals(run.getModelName())
&& run.getCer() != null));
}
@Test
void triggerSegTraining_marksRunFailed_whenOcrClientThrows() throws Exception {
when(runRepository.findFirstByStatus(TrainingStatus.RUNNING)).thenReturn(Optional.empty());
UUID docA = UUID.randomUUID();
List<TranscriptionBlock> segs = List.of(
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build(),
TranscriptionBlock.builder().id(UUID.randomUUID()).documentId(docA).build()
);
when(segExportService.querySegmentationBlocks()).thenReturn(segs);
when(segExportService.exportToZip()).thenReturn(out -> {});
when(ocrClient.segtrainModel(any())).thenThrow(new RuntimeException("seg timeout"));
OcrTrainingRun saved = OcrTrainingRun.builder()
.id(UUID.randomUUID()).status(TrainingStatus.RUNNING)
.blockCount(5).documentCount(1).modelName("blla").build();
when(runRepository.save(any())).thenReturn(saved);
service.triggerSegTraining(null);
verify(runRepository, atLeastOnce()).save(argThat(run ->
run.getStatus() == TrainingStatus.FAILED && run.getErrorMessage() != null));
}
// ─── Orphan recovery ──────────────────────────────────────────────────────
@Test

@@ -83,11 +83,11 @@ services:
 restart: unless-stopped
 expose:
 - "8000"
-mem_limit: 12g
-memswap_limit: 12g
+mem_limit: 8g
+memswap_limit: 8g
 volumes:
 - ocr_models:/app/models
-- ocr_cache:/root/.cache # Hugging Face / ketos model download cache — prevents re-downloads on container recreate
+- ocr_cache:/root/.cache
 environment:
 KRAKEN_MODEL_PATH: /app/models/german_kurrent.mlmodel
 TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"
@@ -102,7 +102,7 @@ services:
 interval: 10s
 timeout: 5s
 retries: 12
-start_period: 120s
+start_period: 60s
 # --- Backend: Spring Boot ---
 backend:
@@ -112,7 +112,9 @@ services:
 container_name: archive-backend
 restart: unless-stopped
 volumes:
+- ./backend:/app
 - ./import:/import
+- maven_cache:/root/.m2
 depends_on:
 db:
 condition: service_healthy
@@ -143,7 +145,6 @@ services:
 SPRING_MAIL_PROPERTIES_MAIL_SMTP_AUTH: ${MAIL_SMTP_AUTH:-false}
 SPRING_MAIL_PROPERTIES_MAIL_SMTP_STARTTLS_ENABLE: ${MAIL_STARTTLS_ENABLE:-false}
 APP_OCR_BASE_URL: http://ocr-service:8000
-APP_OCR_TRAINING_TOKEN: "${OCR_TRAINING_TOKEN:-}"
 ports:
 - "${PORT_BACKEND}:8080"
 networks:
@@ -153,7 +154,7 @@ services:
 interval: 15s
 timeout: 5s
 retries: 10
-start_period: 30s # JAR starts in ~15s; was 60s when compilation happened at startup
+start_period: 60s
 # --- Frontend: SvelteKit (Dev Server) ---
 frontend:
@@ -189,5 +190,6 @@ networks:
 volumes:
 frontend_node_modules:
+maven_cache:
 ocr_models:
 ocr_cache:

@@ -79,8 +79,6 @@
"docs_list_from": "Von",
"docs_list_to": "An",
"docs_list_unknown": "Unbekannt",
"docs_group_undated": "Undatiert",
"docs_group_unknown": "Unbekannt",
"doc_section_who_when": "Wer & Wann",
"doc_section_description": "Beschreibung",
"doc_section_file": "Datei",
@@ -560,7 +558,6 @@
"training_history_col_cer": "Fehlerrate",
"training_status_done": "Fertig",
"training_status_failed": "Fehler",
"training_error_detail_label": "Fehlerdetails",
"training_status_running": "Läuft…",
"training_seg_heading": "Segmentierung trainieren",
"training_seg_description": "Starte ein neues Training mit annotierten Segmentierungsbereichen, um die Texterkennung zu verbessern.",

@@ -79,8 +79,6 @@
"docs_list_from": "From",
"docs_list_to": "To",
"docs_list_unknown": "Unknown",
"docs_group_undated": "Undated",
"docs_group_unknown": "Unknown",
"doc_section_who_when": "Who & When",
"doc_section_description": "Description",
"doc_section_file": "File",
@@ -560,7 +558,6 @@
"training_history_col_cer": "Error Rate",
"training_status_done": "Done",
"training_status_failed": "Failed",
"training_error_detail_label": "Error details",
"training_status_running": "Running…",
"training_seg_heading": "Train segmentation",
"training_seg_description": "Start a new training run using annotated segmentation regions to improve text detection.",

@@ -79,8 +79,6 @@
"docs_list_from": "De",
"docs_list_to": "Para",
"docs_list_unknown": "Desconocido",
"docs_group_undated": "Sin fecha",
"docs_group_unknown": "Desconocido",
"doc_section_who_when": "Quién & Cuándo",
"doc_section_description": "Descripción",
"doc_section_file": "Archivo",
@@ -560,7 +558,6 @@
"training_history_col_cer": "Tasa de error",
"training_status_done": "Listo",
"training_status_failed": "Error",
"training_error_detail_label": "Detalles del error",
"training_status_running": "Ejecutando…",
"training_seg_heading": "Entrenar segmentación",
"training_seg_description": "Inicia un nuevo entrenamiento con regiones de segmentación anotadas para mejorar la detección de texto.",

@@ -1,15 +0,0 @@
<script lang="ts">
let { label }: { label: string } = $props();
</script>
<div
data-testid="group-divider"
role="separator"
aria-label={label}
class="relative flex items-center py-2 text-center"
>
<div class="flex-grow border-t border-line"></div>
<span class="mx-4 font-sans text-sm font-bold tracking-widest text-ink/60 uppercase">{label}</span
>
<div class="flex-grow border-t border-line"></div>
</div>

@@ -1,23 +0,0 @@
import { describe, expect, it, afterEach } from 'vitest';
import { cleanup, render } from 'vitest-browser-svelte';
import { page } from 'vitest/browser';
import GroupDivider from './GroupDivider.svelte';
afterEach(() => cleanup());
describe('GroupDivider', () => {
it('renders the label text', async () => {
render(GroupDivider, { label: '1938' });
await expect.element(page.getByText('1938')).toBeInTheDocument();
});
it('has data-testid="group-divider" on the root element', async () => {
render(GroupDivider, { label: 'Test' });
await expect.element(page.getByTestId('group-divider')).toBeInTheDocument();
});
it('renders a person name label', async () => {
render(GroupDivider, { label: 'Anna Müller' });
await expect.element(page.getByText('Anna Müller')).toBeInTheDocument();
});
});

@@ -20,12 +20,6 @@ interface Props {
 let { runs }: Props = $props();
-const COLLAPSED_COUNT = 3;
-let expanded = $state(false);
-const visibleRuns = $derived(expanded ? runs : runs.slice(0, COLLAPSED_COUNT));
-const hasMore = $derived(runs.length > COLLAPSED_COUNT);
 const dateFormatter = new Intl.DateTimeFormat('de-DE', {
 day: 'numeric',
 month: 'short',
@@ -52,7 +46,7 @@ function formatCer(cer: number | undefined | null): string {
 <th class="hidden pb-2 text-right md:table-cell">{m.training_history_col_cer()}</th>
 </tr>
 </thead>
-<tbody id="training-history-rows">
+<tbody>
 {#if runs.length === 0}
 <tr>
 <td colspan="5" class="py-4 text-center text-sm text-ink-2">
@@ -60,7 +54,7 @@ function formatCer(cer: number | undefined | null): string {
 </td>
 </tr>
 {:else}
-{#each visibleRuns as run (run.id)}
+{#each runs as run (run.id)}
 <tr class="border-b border-line/50 last:border-0">
 <td class="py-2 text-ink-2">{formatDate(run.createdAt)}</td>
 <td class="py-2">
@@ -85,6 +79,7 @@ function formatCer(cer: number | undefined | null): string {
 {:else if run.status === 'FAILED'}
 <span
 class="inline-flex items-center gap-1 rounded-sm bg-red-100 px-1.5 py-0.5 text-xs font-medium text-red-700"
+title={run.errorMessage}
 >
 <svg
 aria-hidden="true"
@@ -100,21 +95,13 @@ function formatCer(cer: number | undefined | null): string {
 </svg>
 {m.training_status_failed()}
 </span>
-{#if run.errorMessage}
-<details class="mt-0.5">
-<summary class="cursor-pointer text-xs text-red-700 underline">
-{m.training_error_detail_label()}
-</summary>
-<p class="mt-1 text-xs text-red-600">{run.errorMessage}</p>
-</details>
-{/if}
 {:else}
 <span
 class="inline-flex items-center gap-1 rounded-sm bg-yellow-100 px-1.5 py-0.5 text-xs font-medium text-yellow-700"
 >
 <span
 aria-hidden="true"
-class="h-1.5 w-1.5 rounded-full bg-yellow-500 motion-safe:animate-pulse"
+class="h-1.5 w-1.5 animate-pulse rounded-full bg-yellow-500"
 ></span>
 {m.training_status_running()}
 </span>
@@ -130,17 +117,3 @@ function formatCer(cer: number | undefined | null): string {
 {/if}
 </tbody>
 </table>
-{#if hasMore}
-<div class="mt-2 text-center">
-<button
-type="button"
-aria-expanded={expanded}
-aria-controls="training-history-rows"
-class="text-xs font-medium text-ink-3 transition-colors hover:text-ink"
-onclick={() => (expanded = !expanded)}
->
-{expanded ? m.comp_expandable_show_less() : m.comp_expandable_show_more()}
-</button>
-</div>
-{/if}

@@ -1,52 +0,0 @@
import { afterEach, describe, expect, it } from 'vitest';
import { cleanup, render } from 'vitest-browser-svelte';
import { page } from 'vitest/browser';
import TrainingHistory from './TrainingHistory.svelte';
afterEach(cleanup);
function makeRun(i: number) {
return {
id: `run-${i}`,
status: 'DONE' as const,
blockCount: 10,
documentCount: 2,
modelName: 'german_kurrent',
createdAt: `2026-01-0${i + 1}T12:00:00Z`,
completedAt: `2026-01-0${i + 1}T12:05:00Z`
};
}
const fiveRuns = Array.from({ length: 5 }, (_, i) => makeRun(i));
const twoRuns = Array.from({ length: 2 }, (_, i) => makeRun(i));
describe('TrainingHistory — expand/collapse', () => {
it('shows only 3 runs initially when more than 3 exist', async () => {
render(TrainingHistory, { runs: fiveRuns });
const rows = page.getByRole('row');
// 1 header row + 3 data rows = 4 total
await expect.element(rows.nth(3)).toBeInTheDocument();
await expect.element(rows.nth(4)).not.toBeInTheDocument();
await expect.element(page.getByRole('button', { name: /Mehr anzeigen/i })).toBeInTheDocument();
});
it('shows all runs after clicking the expand button', async () => {
render(TrainingHistory, { runs: fiveRuns });
await page.getByRole('button', { name: /Mehr anzeigen/i }).click();
const rows = page.getByRole('row');
// 1 header row + 5 data rows = 6 total
await expect.element(rows.nth(5)).toBeInTheDocument();
});
it('hides the toggle button when 3 or fewer runs exist', async () => {
render(TrainingHistory, { runs: twoRuns });
await expect
.element(page.getByRole('button', { name: /Mehr anzeigen/i }))
.not.toBeInTheDocument();
});
});

@@ -1,165 +0,0 @@
import { describe, expect, it } from 'vitest';
import { groupDocuments } from './groupDocuments';
type Doc = {
id: string;
documentDate?: string | null;
sender?: { displayName: string } | null;
receivers?: { displayName: string }[];
};
const doc = (overrides: Partial<Doc> & { id: string }): Doc => ({
documentDate: null,
sender: null,
receivers: [],
...overrides
});
// ─── DATE sort ───────────────────────────────────────────────────────────────
describe('groupDocuments — DATE sort', () => {
it('produces one group per distinct year', () => {
const docs = [
doc({ id: 'a', documentDate: '1923-04-12' }),
doc({ id: 'b', documentDate: '1938-01-01' }),
doc({ id: 'c', documentDate: '1965-08-03' })
];
const groups = groupDocuments(docs, 'DATE', 'Undatiert');
expect(groups.map((g) => g.label)).toEqual(['1923', '1938', '1965']);
expect(groups.every((g) => g.documents.length === 1)).toBe(true);
});
it('puts multiple docs from the same year into one group', () => {
const docs = [
doc({ id: 'a', documentDate: '1938-03-01' }),
doc({ id: 'b', documentDate: '1938-11-15' })
];
const groups = groupDocuments(docs, 'DATE', 'Undatiert');
expect(groups).toHaveLength(1);
expect(groups[0].label).toBe('1938');
expect(groups[0].documents).toHaveLength(2);
});
it('places undated docs in the fallback group at the bottom', () => {
const docs = [
doc({ id: 'a', documentDate: '1938-01-01' }),
doc({ id: 'b', documentDate: null }),
doc({ id: 'c', documentDate: null })
];
const groups = groupDocuments(docs, 'DATE', 'Undatiert');
expect(groups).toHaveLength(2);
expect(groups[0].label).toBe('1938');
expect(groups[1].label).toBe('Undatiert');
expect(groups[1].documents.map((d) => d.id)).toEqual(['b', 'c']);
});
it('returns one group with fallback label when all docs are undated', () => {
const docs = [doc({ id: 'a' }), doc({ id: 'b' })];
const groups = groupDocuments(docs, 'DATE', 'Undatiert');
expect(groups).toHaveLength(1);
expect(groups[0].label).toBe('Undatiert');
});
it('returns one group when all docs are from the same year', () => {
const docs = [
doc({ id: 'a', documentDate: '1938-01-01' }),
doc({ id: 'b', documentDate: '1938-06-15' })
];
const groups = groupDocuments(docs, 'DATE', 'Undatiert');
expect(groups).toHaveLength(1);
});
});
// ─── SENDER sort ─────────────────────────────────────────────────────────────
describe('groupDocuments — SENDER sort', () => {
it('produces one group per distinct sender', () => {
const docs = [
doc({ id: 'a', sender: { displayName: 'Anna Müller' } }),
doc({ id: 'b', sender: { displayName: 'Karl Bauer' } }),
doc({ id: 'c', sender: { displayName: 'Anna Müller' } })
];
const groups = groupDocuments(docs, 'SENDER', 'Unbekannt');
expect(groups.map((g) => g.label)).toEqual(['Anna Müller', 'Karl Bauer']);
expect(groups[0].documents).toHaveLength(2);
expect(groups[1].documents).toHaveLength(1);
});
it('places docs with no sender in the fallback group at the bottom', () => {
const docs = [
doc({ id: 'a', sender: { displayName: 'Anna Müller' } }),
doc({ id: 'b', sender: null })
];
const groups = groupDocuments(docs, 'SENDER', 'Unbekannt');
expect(groups).toHaveLength(2);
expect(groups[0].label).toBe('Anna Müller');
expect(groups[1].label).toBe('Unbekannt');
expect(groups[1].documents[0].id).toBe('b');
});
});
// ─── RECEIVER sort ───────────────────────────────────────────────────────────
describe('groupDocuments — RECEIVER sort', () => {
it('a doc with two receivers appears in both receiver groups', () => {
const docs = [
doc({
id: 'a',
receivers: [{ displayName: 'Anna' }, { displayName: 'Karl' }]
})
];
const groups = groupDocuments(docs, 'RECEIVER', 'Unbekannt');
expect(groups.map((g) => g.label)).toEqual(['Anna', 'Karl']);
expect(groups[0].documents[0].id).toBe('a');
expect(groups[1].documents[0].id).toBe('a');
});
it('places docs with no receivers in the fallback group at the bottom', () => {
const docs = [
doc({ id: 'a', receivers: [{ displayName: 'Anna' }] }),
doc({ id: 'b', receivers: [] })
];
const groups = groupDocuments(docs, 'RECEIVER', 'Unbekannt');
expect(groups).toHaveLength(2);
expect(groups[0].label).toBe('Anna');
expect(groups[1].label).toBe('Unbekannt');
expect(groups[1].documents[0].id).toBe('b');
});
it('composite keys are unique: groupLabel + doc.id identifies each item', () => {
const docs = [
doc({ id: 'a', receivers: [{ displayName: 'Anna' }, { displayName: 'Karl' }] }),
doc({ id: 'b', receivers: [{ displayName: 'Anna' }] })
];
const groups = groupDocuments(docs, 'RECEIVER', 'Unbekannt');
const keys = groups.flatMap((g) => g.documents.map((d) => `${g.label}-${d.id}`));
const uniqueKeys = new Set(keys);
expect(uniqueKeys.size).toBe(keys.length);
});
});
// ─── Non-groupable sorts ──────────────────────────────────────────────────────
describe('groupDocuments — non-groupable sorts', () => {
it('TITLE sort returns one group containing all documents', () => {
const docs = [doc({ id: 'a' }), doc({ id: 'b' })];
const groups = groupDocuments(docs, 'TITLE', 'Undatiert');
expect(groups).toHaveLength(1);
expect(groups[0].documents).toHaveLength(2);
});
it('UPLOAD_DATE sort returns one group containing all documents', () => {
const docs = [doc({ id: 'a' }), doc({ id: 'b' })];
const groups = groupDocuments(docs, 'UPLOAD_DATE', 'Undatiert');
expect(groups).toHaveLength(1);
expect(groups[0].documents).toHaveLength(2);
});
});
// ─── Edge cases ──────────────────────────────────────────────────────────────
describe('groupDocuments — edge cases', () => {
it('returns empty array for an empty document list', () => {
expect(groupDocuments([], 'DATE', 'Undatiert')).toEqual([]);
});
});

@@ -1,56 +0,0 @@
export type GroupableDoc = {
id: string;
documentDate?: string | null;
sender?: { displayName: string } | null;
receivers?: { displayName: string }[];
};
export type DocumentGroup<T extends GroupableDoc> = {
label: string;
documents: T[];
};
const GROUPABLE_SORTS = ['DATE', 'SENDER', 'RECEIVER'] as const;
type GroupableSort = (typeof GROUPABLE_SORTS)[number];
export function groupDocuments<T extends GroupableDoc>(
docs: T[],
sort: string,
fallbackLabel: string
): DocumentGroup<T>[] {
if (docs.length === 0) return [];
if (!GROUPABLE_SORTS.includes(sort as GroupableSort)) {
return [{ label: '', documents: [...docs] }];
}
const groupMap = new Map<string, T[]>();
const fallbackDocs: T[] = [];
for (const doc of docs) {
const keys = extractGroupKeys(doc, sort as GroupableSort);
if (keys.length === 0) {
fallbackDocs.push(doc);
} else {
for (const key of keys) {
if (!groupMap.has(key)) groupMap.set(key, []);
groupMap.get(key)!.push(doc);
}
}
}
const groups = [...groupMap.entries()].map(([label, documents]) => ({ label, documents }));
if (fallbackDocs.length > 0) groups.push({ label: fallbackLabel, documents: fallbackDocs });
return groups;
}
function extractGroupKeys<T extends GroupableDoc>(doc: T, sort: GroupableSort): string[] {
if (sort === 'DATE') {
const year = doc.documentDate
? String(new Date(doc.documentDate + 'T12:00:00').getFullYear())
: null;
return year ? [year] : [];
}
if (sort === 'SENDER') return doc.sender ? [doc.sender.displayName] : [];
if (sort === 'RECEIVER') return (doc.receivers ?? []).map((r) => r.displayName);
return [];
}

@@ -13,18 +13,8 @@ export async function load({ url, fetch }) {
 const senderId = url.searchParams.get('senderId') || '';
 const receiverId = url.searchParams.get('receiverId') || '';
 const tags = url.searchParams.getAll('tag');
-const VALID_SORTS = ['DATE', 'TITLE', 'SENDER', 'RECEIVER', 'UPLOAD_DATE'] as const;
-type ValidSort = (typeof VALID_SORTS)[number];
-const rawSort = url.searchParams.get('sort') ?? 'DATE';
-const sort: ValidSort = (VALID_SORTS as readonly string[]).includes(rawSort)
-? (rawSort as ValidSort)
-: 'DATE';
-const VALID_DIRS = ['asc', 'desc'] as const;
-type ValidDir = (typeof VALID_DIRS)[number];
-const rawDir = url.searchParams.get('dir') ?? 'desc';
-const dir: ValidDir = (VALID_DIRS as readonly string[]).includes(rawDir)
-? (rawDir as ValidDir)
-: 'desc';
+const sort = url.searchParams.get('sort') || 'DATE';
+const dir = url.searchParams.get('dir') || 'desc';
 const tagQ = url.searchParams.get('tagQ') || '';
 const isDashboard = !q && !from && !to && !senderId && !receiverId && !tags.length && !tagQ;
@@ -45,7 +35,7 @@ export async function load({ url, fetch }) {
 receiverId: receiverId || undefined,
 tag: tags.length ? tags : undefined,
 tagQ: tagQ || undefined,
-sort,
+sort: sort as 'DATE' | 'TITLE' | 'SENDER' | 'RECEIVER' | 'UPLOAD_DATE',
 dir: dir || undefined
 }
 }

@@ -139,7 +139,6 @@ const showRightColumn = $derived(data.canWrite || (data.incompleteDocs?.length ?
 error={data.error}
 total={data.total ?? 0}
 q={q}
-sort={sort}
 />
 {/if}
 </main>

@@ -30,9 +30,7 @@ let {
 sort?: string;
 } = $props();
-const fallbackLabel = $derived(
-(sort ?? 'DATE') === 'DATE' ? m.docs_group_undated() : m.docs_group_unknown()
-);
+const fallbackLabel = $derived(sort === 'DATE' ? m.docs_group_undated() : m.docs_group_unknown());
 const groupedDocuments = $derived.by(() =>
 groupDocuments(documents, sort ?? 'DATE', fallbackLabel)
 );

@@ -1,12 +1,10 @@
-import { describe, expect, it, vi, afterEach } from 'vitest';
-import { cleanup, render } from 'vitest-browser-svelte';
+import { describe, expect, it, vi } from 'vitest';
+import { render } from 'vitest-browser-svelte';
 import { page } from 'vitest/browser';
 import DocumentList from './DocumentList.svelte';
 vi.mock('$app/navigation', () => ({ goto: vi.fn() }));
-afterEach(() => cleanup());
 const baseProps = {
 documents: [],
 canWrite: false,
@@ -15,14 +13,7 @@ const baseProps = {
 q: ''
 };
-type DocOverrides = {
-id?: string;
-documentDate?: string | null;
-sender?: { firstName?: string | null; lastName: string; displayName: string } | null;
-receivers?: { firstName?: string | null; lastName: string; displayName: string }[];
-};
-const makeDoc = (overrides: DocOverrides = {}) => ({
+const makeDoc = () => ({
 id: '1',
 title: 'Testbrief',
 originalFilename: 'testbrief.pdf',
@@ -30,9 +21,8 @@ const makeDoc = (overrides: DocOverrides = {}) => ({
 documentDate: '2024-03-15',
 location: null,
 sender: null,
-receivers: [] as { firstName?: string | null; lastName: string; displayName: string }[],
-tags: [],
-...overrides
+receivers: [],
+tags: []
 });
 describe('DocumentList result count', () => {
@@ -59,59 +49,3 @@ describe('DocumentList empty state with search term', () => {
await expect.element(page.getByText(/"Urlaub"/)).toBeInTheDocument();
});
});
// ─── Group headers ────────────────────────────────────────────────────────────
describe('DocumentList group headers', () => {
it('renders group-divider elements when DATE sort spans multiple years', async () => {
const documents = [
makeDoc({ id: '1', documentDate: '1923-04-12' }),
makeDoc({ id: '2', documentDate: '1965-08-03' })
];
render(DocumentList, { ...baseProps, documents, total: 2, sort: 'DATE' });
await expect.element(page.getByTestId('group-divider').first()).toBeInTheDocument();
});
it('does not render group-divider when DATE sort has only one distinct year', async () => {
const documents = [
makeDoc({ id: '1', documentDate: '1938-01-01' }),
makeDoc({ id: '2', documentDate: '1938-06-15' })
];
render(DocumentList, { ...baseProps, documents, total: 2, sort: 'DATE' });
await expect.element(page.getByTestId('group-divider')).not.toBeInTheDocument();
});
it('does not render group-divider for TITLE sort', async () => {
const documents = [
makeDoc({ id: '1', documentDate: '1923-04-12' }),
makeDoc({ id: '2', documentDate: '1965-08-03' })
];
render(DocumentList, { ...baseProps, documents, total: 2, sort: 'TITLE' });
await expect.element(page.getByTestId('group-divider')).not.toBeInTheDocument();
});
it('shows Undatiert fallback label when sort is undefined and doc has no date', async () => {
const documents = [
makeDoc({ id: '1', documentDate: '1938-01-01' }),
makeDoc({ id: '2', documentDate: null })
];
render(DocumentList, { ...baseProps, documents, total: 2 }); // sort omitted — defaults to DATE grouping
await expect.element(page.getByText(/UNDATIERT/i)).toBeInTheDocument();
});
it('a doc with two receivers appears in both receiver groups', async () => {
const documents = [
makeDoc({
id: '1',
receivers: [
{ firstName: null, lastName: 'Müller', displayName: 'Anna Müller' },
{ firstName: null, lastName: 'Bauer', displayName: 'Karl Bauer' }
]
})
];
render(DocumentList, { ...baseProps, documents, total: 1, sort: 'RECEIVER' });
const links = page.getByRole('link', { name: /Testbrief/ });
await expect.element(links.first()).toBeInTheDocument();
await expect.element(links.nth(1)).toBeInTheDocument();
});
});

@@ -1,8 +1,6 @@
<script lang="ts">
import { m } from '$lib/paraglide/messages.js';
import { formatDate } from '$lib/utils/date';
import GroupDivider from '$lib/components/GroupDivider.svelte';
import { groupDocuments } from '$lib/utils/groupDocuments';
let {
documents,
@@ -31,15 +29,22 @@ let {
const documentYears = $derived(
documents
.map((doc) =>
doc.documentDate ? new Date(doc.documentDate + 'T12:00:00').getFullYear() : null
)
.map((doc) => (doc.documentDate ? new Date(doc.documentDate).getFullYear() : null))
.filter((y): y is number => y !== null)
);
const yearFrom = $derived(documentYears.length > 0 ? Math.min(...documentYears) : null);
const yearTo = $derived(documentYears.length > 0 ? Math.max(...documentYears) : null);
const documentGroups = $derived.by(() => groupDocuments(documents, 'DATE', ''));
const enrichedDocuments = $derived(
documents.map((doc, i) => {
const year = doc.documentDate ? new Date(doc.documentDate).getFullYear() : null;
const prevYear =
i > 0 && documents[i - 1].documentDate
? new Date(documents[i - 1].documentDate!).getFullYear()
: null;
return { doc, year, showYearDivider: year !== null && year !== prevYear };
})
);
</script>
<!-- Summary bar -->
@@ -77,83 +82,87 @@ const documentGroups = $derived.by(() => groupDocuments(documents, 'DATE', ''));
<div class="p-6 md:p-8">
<div class="relative z-10 flex flex-col gap-4">
{#each documentGroups as group (group.label)}
{#if group.label}
<GroupDivider label={group.label} />
{#each enrichedDocuments as { doc, year, showYearDivider } (doc.id)}
{#if showYearDivider}
<div data-testid="year-divider" class="relative flex items-center py-2 text-center">
<div class="flex-grow border-t border-line"></div>
<span class="mx-4 font-sans text-xs font-bold tracking-widest text-ink/40 uppercase"
>{year}</span
>
<div class="flex-grow border-t border-line"></div>
</div>
{/if}
{#each group.documents as doc (doc.id)}
{@const isRight = doc.sender?.id === senderId}
{@const isRight = doc.sender?.id === senderId}
<!-- Message Row -->
<div class="flex w-full {isRight ? 'justify-end' : 'justify-start'}">
<!-- Bubble Group -->
<div
class="flex max-w-[90%] gap-3 md:max-w-[70%] {isRight
<!-- Message Row -->
<div class="flex w-full {isRight ? 'justify-end' : 'justify-start'}">
<!-- Bubble Group -->
<div
class="flex max-w-[90%] gap-3 md:max-w-[70%] {isRight
? 'flex-row-reverse'
: 'flex-row'}"
>
<!-- AVATAR -->
<div class="mt-auto mb-1 hidden flex-shrink-0 sm:block">
<div
class="flex h-8 w-8 items-center justify-center rounded-full border font-serif text-xs shadow-sm
>
<!-- AVATAR -->
<div class="mt-auto mb-1 hidden flex-shrink-0 sm:block">
<div
class="flex h-8 w-8 items-center justify-center rounded-full border font-serif text-xs shadow-sm
{isRight
? 'border-primary bg-primary text-primary-fg'
: 'border-line bg-surface text-ink'}"
>
{#if doc.sender}
{doc.sender.firstName ? doc.sender.firstName[0] : doc.sender.lastName[0]}{doc.sender.lastName[0]}
{:else}
?
{/if}
</div>
>
{#if doc.sender}
{doc.sender.firstName ? doc.sender.firstName[0] : doc.sender.lastName[0]}{doc.sender.lastName[0]}
{:else}
?
{/if}
</div>
</div>
<!-- BUBBLE CARD -->
<a
href="/documents/{doc.id}"
class="group block transform rounded border p-4 shadow-sm transition-all duration-200 hover:-translate-y-0.5 hover:shadow-md
<!-- BUBBLE CARD -->
<a
href="/documents/{doc.id}"
class="group block transform rounded border p-4 shadow-sm transition-all duration-200 hover:-translate-y-0.5 hover:shadow-md
{isRight
? 'rounded-br-none border-primary bg-primary text-primary-fg'
: 'rounded-bl-none border-line bg-muted/50 text-ink'}"
>
<!-- Header -->
<div class="mb-2 flex items-start justify-between gap-4">
<h3
class="font-serif text-sm leading-snug font-medium {isRight
>
<!-- Header -->
<div class="mb-2 flex items-start justify-between gap-4">
<h3
class="font-serif text-sm leading-snug font-medium {isRight
? 'text-primary-fg'
: 'text-ink'}"
>
{doc.title || doc.originalFilename}
</h3>
>
{doc.title || doc.originalFilename}
</h3>
<!-- Status Dot -->
<span
class="mt-1.5 h-1.5 w-1.5 flex-shrink-0 rounded-full
<!-- Status Dot -->
<span
class="mt-1.5 h-1.5 w-1.5 flex-shrink-0 rounded-full
{doc.status === 'UPLOADED' ? 'bg-accent' : 'bg-yellow-400'}"
title={doc.status}
>
</span>
</div>
title={doc.status}
>
</span>
</div>
<!-- Metadata -->
<div
class="flex flex-wrap gap-3 font-sans text-[10px] tracking-wider uppercase opacity-80 {isRight
<!-- Metadata -->
<div
class="flex flex-wrap gap-3 font-sans text-[10px] tracking-wider uppercase opacity-80 {isRight
? 'text-primary-fg/70'
: 'text-ink-2'}"
>
>
<span class="flex items-center">
{doc.documentDate ? formatDate(doc.documentDate) : '—'}
</span>
{#if doc.location}
<span class="flex items-center">
{doc.documentDate ? formatDate(doc.documentDate) : '—'}
{doc.location}
</span>
{#if doc.location}
<span class="flex items-center">
{doc.location}
</span>
{/if}
</div>
</a>
</div>
{/if}
</div>
</a>
</div>
{/each}
</div>
{/each}
</div>
</div>
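The year-divider logic in `enrichedDocuments` can be read in isolation as a pure function. The sketch below is an illustration, not the component's code: `DocLike`, `yearOf`, and `withYearDividers` are hypothetical names. It also reads the year from the string itself, because `new Date('1938-01-01')` parses a date-only ISO string as UTC while `getFullYear()` reports local time, so in timezones west of UTC the result can be 1937. The `+ 'T12:00:00'` suffix seen in this hunk sidesteps that, since a date-time string without an offset is parsed as local time.

```typescript
// Hypothetical minimal shape; the real document type carries more fields.
interface DocLike {
  id: string;
  documentDate: string | null; // ISO date such as "1923-04-12"
}

interface Row<T extends DocLike> {
  doc: T;
  year: number | null;
  showYearDivider: boolean;
}

// Take the year from the string so the result does not depend on the
// timezone of the runtime.
function yearOf(isoDate: string | null): number | null {
  if (!isoDate) return null;
  const match = /^(\d{4})-\d{2}-\d{2}/.exec(isoDate);
  return match ? Number(match[1]) : null;
}

// A divider is shown when a document has a year and that year differs from
// the previous document's year (or there is no previous dated document).
function withYearDividers<T extends DocLike>(documents: T[]): Row<T>[] {
  return documents.map((doc, i) => {
    const year = yearOf(doc.documentDate);
    const prevYear = i > 0 ? yearOf(documents[i - 1].documentDate) : null;
    return { doc, year, showYearDivider: year !== null && year !== prevYear };
  });
}
```

As in the component, the comparison only looks at the immediately preceding document, so the list is assumed to arrive sorted by date; an undated document in the middle of a year causes the divider for that year to appear again on the next dated row.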

@@ -116,7 +116,7 @@ describe('Conversations page summary', () => {
describe('Conversations page year dividers', () => {
it('renders a year divider for the first document', async () => {
render(Page, { data: withDocs });
await expect.element(page.getByTestId('group-divider').first()).toHaveTextContent('1923');
await expect.element(page.getByTestId('year-divider').first()).toHaveTextContent('1923');
});
it('renders a divider for each new year in the document list', async () => {
@@ -128,8 +128,8 @@ describe('Conversations page year dividers', () => {
]
};
render(Page, { data });
await expect.element(page.getByTestId('group-divider').first()).toHaveTextContent('1923');
await expect.element(page.getByTestId('group-divider').nth(1)).toHaveTextContent('1965');
await expect.element(page.getByTestId('year-divider').first()).toHaveTextContent('1923');
await expect.element(page.getByTestId('year-divider').nth(1)).toHaveTextContent('1965');
});
it('does not render a second divider for documents from the same year', async () => {
@@ -142,8 +142,8 @@ describe('Conversations page year dividers', () => {
};
render(Page, { data });
// Only one divider for 1923; 1965 divider should not appear
await expect.element(page.getByTestId('group-divider').first()).toHaveTextContent('1923');
await expect.element(page.getByTestId('group-divider').nth(1)).not.toBeInTheDocument();
await expect.element(page.getByTestId('year-divider').first()).toHaveTextContent('1923');
await expect.element(page.getByTestId('year-divider').nth(1)).not.toBeInTheDocument();
});
});

@@ -1,4 +1,4 @@
FROM python:3.11.9-slim
FROM python:3.11-slim
WORKDIR /app
@@ -21,8 +21,6 @@ RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN chmod +x /app/entrypoint.sh
EXPOSE 8000
CMD ["/app/entrypoint.sh"]
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

@@ -1,80 +0,0 @@
"""Validates the blla segmentation base model and downloads it if needed.
Run at container startup before uvicorn. ketos 7 requires the model in
CoreML protobuf or safetensors format — legacy PyTorch ZIP archives
(torch.save output from kraken <4) are not loadable and will be replaced.
Exits non-zero on failure so Docker marks the container unhealthy rather
than silently starting with a broken model.
"""
import glob
import logging
import os
import shutil
import subprocess
import sys
logging.basicConfig(
level=logging.INFO,
format="%(levelname)s:ensure_blla_model:%(message)s",
)
log = logging.getLogger(__name__)
BLLA_MODEL_PATH = os.environ.get("BLLA_MODEL_PATH", "/app/models/blla.mlmodel")
# DOI for "General segmentation model for print and handwriting" — ketos 7 compatible.
BLLA_MODEL_DOI = "10.5281/zenodo.14602569"
HTRMOPO_DIR = os.path.expanduser("~/.local/share/htrmopo")
def _model_is_loadable(path: str) -> bool:
try:
from kraken.lib import vgsl
vgsl.TorchVGSLModel.load_model(path)
return True
except (RuntimeError, OSError, ValueError) as e:
log.warning("Model at %s failed to load: %s", path, e)
return False
except Exception:
log.debug("Unexpected error loading model at %s", path, exc_info=True)
return False
def _download_blla() -> str:
log.info("Downloading blla model (DOI %s) ...", BLLA_MODEL_DOI)
result = subprocess.run(
["kraken", "get", BLLA_MODEL_DOI],
capture_output=True,
text=True,
)
if result.returncode != 0:
log.error("kraken get failed: %s", result.stderr)
sys.exit(1)
candidates = sorted(glob.glob(os.path.join(HTRMOPO_DIR, "*/blla.mlmodel")))
if not candidates:
log.error("Downloaded blla.mlmodel not found under %s", HTRMOPO_DIR)
sys.exit(1)
return candidates[-1]
def main() -> None:
if os.path.exists(BLLA_MODEL_PATH):
if _model_is_loadable(BLLA_MODEL_PATH):
log.info("blla model OK: %s", BLLA_MODEL_PATH)
return
log.warning(
"blla model at %s is in an incompatible format — replacing", BLLA_MODEL_PATH
)
os.rename(BLLA_MODEL_PATH, BLLA_MODEL_PATH + ".incompatible")
os.makedirs(os.path.dirname(BLLA_MODEL_PATH), exist_ok=True)
downloaded = _download_blla()
shutil.copy2(downloaded, BLLA_MODEL_PATH)
log.info("Installed blla model at %s", BLLA_MODEL_PATH)
if __name__ == "__main__":
main()

@@ -1,9 +0,0 @@
#!/bin/bash
set -euo pipefail
# Validate the blla segmentation base model and download it if missing or
# incompatible. ketos 7 dropped support for legacy PyTorch ZIP archives —
# this ensures the volume always holds a loadable CoreML protobuf model.
python3 /app/ensure_blla_model.py
exec uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

@@ -472,35 +472,16 @@ async def segtrain_model(
"-q", "fixed",
"-N", "10",
]
# Train at 800px height. The default blla model uses 1800px, which peaks at
# ~7+ GB on CPU and kills the host (ketos ignores -s when -i is present, so
# we cannot override the height of an existing model).
# Strategy: only use the base model if it is already at 800px (i.e. was
# produced by a previous fine-tuning run here). Otherwise train from scratch —
# the first run bootstraps a 800px model; all subsequent runs fine-tune it.
seg_spec = (
"[1,800,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 "
"Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]"
)
use_base_model = False
if os.path.exists(blla_model_path):
try:
from kraken.lib import vgsl as _vgsl
_m = _vgsl.TorchVGSLModel.load_model(blla_model_path)
use_base_model = _m.input[2] == 800 # input is (batch, channels, H, W)
if not use_base_model:
log.info(
"Base model height is %dpx — skipping -i to avoid OOM; "
"will train from scratch at 800px",
_m.input[2],
)
except Exception as exc:
log.warning("Could not inspect base model height, training from scratch: %s", exc)
if use_base_model:
cmd += ["-i", blla_model_path, "--resize", "union", "-s", seg_spec]
cmd += ["-i", blla_model_path, "--resize", "both"]
else:
cmd += ["-s", seg_spec]
# No pretrained model — train from scratch with reduced height (800px)
# to keep peak RAM under ~200 MB on CPU (default 1800px uses ~500 MB+)
cmd += [
"-s",
"[1,800,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 "
"Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]",
]
cmd += xml_files
log.info("Running: %s", " ".join(cmd[:5]) + " ...")
@@ -512,8 +493,7 @@ async def segtrain_model(
raise RuntimeError(f"ketos segtrain failed (exit {proc.returncode}): {proc.stderr[-500:]}")
accuracy, epochs = _parse_best_checkpoint(checkpoint_dir)
cer = round(1.0 - accuracy, 4) if accuracy is not None else None
log.info("Segmentation training complete — epochs=%s accuracy=%s cer=%s", epochs, accuracy, cer)
log.info("Segmentation training complete — epochs=%s accuracy=%s", epochs, accuracy)
best_model = _find_best_model(checkpoint_dir)
if best_model is None:
@@ -528,7 +508,7 @@ async def segtrain_model(
shutil.copy2(best_model, blla_model_path)
log.info("Replaced blla model at %s", blla_model_path)
return {"loss": None, "accuracy": accuracy, "cer": cer, "epochs": epochs}
return {"loss": None, "accuracy": accuracy, "cer": None, "epochs": epochs}
result = await asyncio.to_thread(_run_segtrain)
return result

@@ -1,69 +0,0 @@
"""Unit tests for ensure_blla_model.main()."""
from unittest.mock import MagicMock, call, patch
import ensure_blla_model
# ─── Model already loadable ───────────────────────────────────────────────────
def test_main_returns_early_when_model_is_loadable():
"""When the model exists and loads cleanly, no download or rename occurs."""
with (
patch("os.path.exists", return_value=True),
patch.object(ensure_blla_model, "_model_is_loadable", return_value=True),
patch.object(ensure_blla_model, "_download_blla") as mock_download,
patch("os.rename") as mock_rename,
):
ensure_blla_model.main()
mock_download.assert_not_called()
mock_rename.assert_not_called()
# ─── Model exists but is incompatible ─────────────────────────────────────────
def test_main_replaces_incompatible_model():
"""An incompatible model is renamed and replaced with a fresh download."""
fake_path = "/app/models/blla.mlmodel"
downloaded_path = "/tmp/downloaded.mlmodel"
with (
patch.object(ensure_blla_model, "BLLA_MODEL_PATH", fake_path),
patch("os.path.exists", return_value=True),
patch.object(ensure_blla_model, "_model_is_loadable", return_value=False),
patch.object(ensure_blla_model, "_download_blla", return_value=downloaded_path),
patch("os.rename") as mock_rename,
patch("shutil.copy2") as mock_copy,
patch("os.makedirs"),
):
ensure_blla_model.main()
mock_rename.assert_called_once_with(fake_path, fake_path + ".incompatible")
mock_copy.assert_called_once_with(downloaded_path, fake_path)
# ─── Model missing ────────────────────────────────────────────────────────────
def test_main_downloads_when_model_missing():
"""When the model file doesn't exist at all, it is downloaded without rename."""
fake_path = "/app/models/blla.mlmodel"
downloaded_path = "/tmp/downloaded.mlmodel"
with (
patch.object(ensure_blla_model, "BLLA_MODEL_PATH", fake_path),
patch("os.path.exists", return_value=False),
patch.object(ensure_blla_model, "_model_is_loadable") as mock_loadable,
patch.object(ensure_blla_model, "_download_blla", return_value=downloaded_path),
patch("os.rename") as mock_rename,
patch("shutil.copy2") as mock_copy,
patch("os.makedirs"),
):
ensure_blla_model.main()
mock_loadable.assert_not_called()
mock_rename.assert_not_called()
mock_copy.assert_called_once_with(downloaded_path, fake_path)