docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7

Processes all 7 CLAUDE.md files according to the 3-bucket classification.
Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md,
domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md
New `scripts/README.md` with full script documentation (preserving the
⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md`
reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md
New `.devcontainer/README.md` with all configuration, usage, and limitations.
`.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md
New `docs/README.md` covering the folder structure, ADR guide, infrastructure
docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md
Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6).
Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md
- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md
- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client,
  Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md
- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-05-05 23:33:41 +02:00
parent a5f4b0df31
commit e2c86626f7
11 changed files with 452 additions and 732 deletions


@@ -1,96 +1,3 @@
# Dev Container — Familienarchiv
# Dev Container
## Overview
VS Code Dev Container configuration for a pre-configured development environment. Includes Java 21, Maven, and Node.js 24 — everything needed to work on both backend and frontend.
## Configuration
File: `.devcontainer/devcontainer.json`
### Included Features
| Feature | Version | Purpose |
|---|---|---|
| Java | 21 | Spring Boot backend |
| Maven | bundled with Java feature | Build tool |
| Node.js | 24 | SvelteKit frontend |
### VS Code Extensions (Auto-installed)
| Extension | Purpose |
|---|---|
| `vscjava.vscode-java-pack` | Java language support, debugging, testing |
| `vmware.vscode-spring-boot` | Spring Boot tooling |
| `gabrielbb.vscode-lombok` | Lombok annotation support |
| `humao.rest-client` | HTTP request files (for `backend/api_tests/`) |
### Ports
- `8080` forwarded to host — access backend at `http://localhost:8080`
### User
Runs as `vscode` user (not root) for security.
## How to Use
### Prerequisites
- VS Code with the **Dev Containers** extension installed
- Docker running locally
### Open in Dev Container
1. Open the project in VS Code
2. Press `F1` → type "Dev Containers: Reopen in Container"
3. VS Code will:
   - Build the container using the root `docker-compose.yml`
   - Install Java 21, Maven, and Node 24
   - Install the listed extensions
   - Mount the workspace folder
### Working Inside the Container
Once inside the container, you have access to both stacks:
```bash
# Backend
cd backend
./mvnw spring-boot:run
# Frontend (in a new terminal)
cd frontend
npm install
npm run dev
```
The container reuses the `docker-compose.yml` services, so PostgreSQL and MinIO are available automatically.
### Forwarding Frontend Port
The devcontainer config only forwards port 8080 by default. To access the frontend dev server (port 5173 or 3000), either:
1. Add `5173` to `forwardPorts` in `devcontainer.json`, or
2. Use the VS Code "Ports" panel to forward it dynamically
## Limitations
- The devcontainer attaches to the `backend` service from `docker-compose.yml`, so it inherits those environment variables
- OCR service and other containers should be started separately via `docker-compose up -d`
- GPU passthrough for OCR training is not configured
## Customization
To add more tools or extensions, edit `.devcontainer/devcontainer.json`:
```json
{
  "features": {
    "ghcr.io/devcontainers/features/python:1": {
      "version": "3.11"
    }
  },
  "forwardPorts": [8080, 5173, 3000]
}
```
→ See [.devcontainer/README.md](./README.md) for configuration, usage, and known limitations.

94
.devcontainer/README.md Normal file

@@ -0,0 +1,94 @@
# Dev Container — Familienarchiv
VS Code Dev Container configuration for a pre-configured development environment. Includes Java 21, Maven, and Node.js 24 — everything needed to work on both backend and frontend.
## Configuration
File: `.devcontainer/devcontainer.json`
### Included Features
| Feature | Version | Purpose |
| ------- | ------------------------- | ------------------- |
| Java | 21 | Spring Boot backend |
| Maven | bundled with Java feature | Build tool |
| Node.js | 24 | SvelteKit frontend |
### VS Code Extensions (Auto-installed)
| Extension | Purpose |
| --------------------------- | --------------------------------------------- |
| `vscjava.vscode-java-pack` | Java language support, debugging, testing |
| `vmware.vscode-spring-boot` | Spring Boot tooling |
| `gabrielbb.vscode-lombok` | Lombok annotation support |
| `humao.rest-client` | HTTP request files (for `backend/api_tests/`) |
### Ports
- `8080` forwarded to host — access backend at `http://localhost:8080`
### User
Runs as `vscode` user (not root) for security.
## How to Use
### Prerequisites
- VS Code with the **Dev Containers** extension installed
- Docker running locally
### Open in Dev Container
1. Open the project in VS Code
2. Press `F1` → type "Dev Containers: Reopen in Container"
3. VS Code will:
   - Build the container using the root `docker-compose.yml`
   - Install Java 21, Maven, and Node 24
   - Install the listed extensions
   - Mount the workspace folder
### Working Inside the Container
Once inside the container, you have access to both stacks:
```bash
# Backend
cd backend
./mvnw spring-boot:run
# Frontend (in a new terminal)
cd frontend
npm install
npm run dev
```
The container reuses the `docker-compose.yml` services, so PostgreSQL and MinIO are available automatically.
### Forwarding Frontend Port
The devcontainer config only forwards port 8080 by default. To access the frontend dev server (port 5173 or 3000), either:
1. Add `5173` to `forwardPorts` in `devcontainer.json`, or
2. Use the VS Code "Ports" panel to forward it dynamically
## Limitations
- The devcontainer attaches to the `backend` service from `docker-compose.yml`, so it inherits those environment variables
- OCR service and other containers should be started separately via `docker-compose up -d`
- GPU passthrough for OCR training is not configured
## Customization
To add more tools or extensions, edit `.devcontainer/devcontainer.json`:
```json
{
  "features": {
    "ghcr.io/devcontainers/features/python:1": {
      "version": "3.11"
    }
  },
  "forwardPorts": [8080, 5173, 3000]
}
```

195
CLAUDE.md

@@ -2,6 +2,8 @@
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
> For a human-readable project overview, see [README.md](./README.md).
## Project Overview
**Familienarchiv** is a family document archival system — a full-stack web app for digitizing, organizing, and searching family documents. Key features: file uploads (stored in MinIO/S3), metadata management, Excel/ODS batch import, full-text search, conversation threads between family members, and role-based access control.
@@ -16,6 +18,8 @@ See [CODESTYLE.md](./CODESTYLE.md) for coding standards: Clean Code, DRY/KISS tr
## Stack
→ See [README.md §Tech Stack](./README.md#tech-stack)
- **Backend**: Spring Boot 4.0 (Java 21, Maven, Jetty, JPA/Hibernate, Flyway, Spring Security, Spring Session JDBC)
- **Frontend**: SvelteKit 2 with Svelte 5, TypeScript, Tailwind CSS 4, Paraglide.js (i18n: de/en/es)
- **Database**: PostgreSQL 16
@@ -25,12 +29,13 @@ See [CODESTYLE.md](./CODESTYLE.md) for coding standards: Clean Code, DRY/KISS tr
## Common Commands
### Running the Full Stack
```bash
# From repo root — starts PostgreSQL, MinIO, and Spring Boot backend
docker-compose up -d
```
### Backend (Spring Boot)
```bash
cd backend
@@ -42,6 +47,7 @@ cd backend
```
### Frontend (SvelteKit)
```bash
cd frontend
@@ -64,7 +70,7 @@ npm run generate:api # Regenerate TypeScript API types from OpenAPI spec
### Package Structure
Package-by-domain: each domain owns its controller, service, repository, entities, and DTOs.
<!-- TODO: rewrite post-REFACTOR-1 — see Epic 4 -->
```
backend/src/main/java/org/raddatz/familienarchiv/
@@ -88,27 +94,21 @@ backend/src/main/java/org/raddatz/familienarchiv/
└── user/ User domain — AppUser, UserGroup, UserService, auth controllers
```
### Layering Rules (strictly enforced)
### Layering Rules
```
Controller → Service → Repository → DB
```
→ See [docs/ARCHITECTURE.md §Layering rule](./docs/ARCHITECTURE.md#layering-rule)
- **Controllers** never inject or call repositories directly.
- **Services** never reach into another domain's repository. Call the other domain's service instead.
- ✅ `DocumentService` → `PersonService.getById()` → `PersonRepository`
- ❌ `DocumentService` → `PersonRepository` directly
- This keeps domain boundaries clear and business logic testable in isolation.
**LLM reminder:** controllers never call repositories directly; services never reach into another domain's repository — always call the other domain's service instead.
### Domain Model
| Entity | Table | Key relationships |
|---|---|---|
| `Document` | `documents` | ManyToOne `sender` (Person), ManyToMany `receivers` (Person), ManyToMany `tags` (Tag) |
| `Person` | `persons` | Referenced by documents as sender/receiver |
| `Tag` | `tag` | ManyToMany with documents via `document_tags` |
| `AppUser` | `app_users` | ManyToMany `groups` (UserGroup) |
| `UserGroup` | `user_groups` | Has a `Set<String> permissions` |
| Entity | Table | Key relationships |
| ----------- | ------------- | ------------------------------------------------------------------------------------- |
| `Document` | `documents` | ManyToOne `sender` (Person), ManyToMany `receivers` (Person), ManyToMany `tags` (Tag) |
| `Person` | `persons` | Referenced by documents as sender/receiver |
| `Tag` | `tag` | ManyToMany with documents via `document_tags` |
| `AppUser` | `app_users` | ManyToMany `groups` (UserGroup) |
| `UserGroup` | `user_groups` | Has a `Set<String> permissions` |
**`DocumentStatus` lifecycle:** `PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED`
@@ -118,6 +118,7 @@ Controller → Service → Repository → DB
### Entity Code Style
All entities use these Lombok annotations:
```java
@Entity
@Table(name = "table_name")
@@ -146,65 +147,29 @@ Services are annotated with `@Service`, `@RequiredArgsConstructor`, and optional
- Read methods are not annotated (default non-transactional is fine).
- Each service owns its domain's repository. Cross-domain data access goes through the other domain's service.
**Existing services:**
| Service | Responsibility |
|---|---|
| `DocumentService` | Document CRUD, search, tag cascade delete |
| `PersonService` | Person CRUD, find-or-create by alias |
| `TagService` | Tag find/create/update/delete |
| `UserService` | User and group CRUD |
| `FileService` | S3/MinIO upload and download |
| `MassImportService` | Async ODS/Excel import; delegates to PersonService and TagService |
### DTOs
Input DTOs live in `dto/`. Response types are the model entities themselves (no response DTOs).
Input DTOs live flat in the domain package. Response types are the model entities themselves (no response DTOs).
- `DocumentUpdateDTO` — used for both create and update (all fields optional)
- `CreateUserRequest` — user creation
- `GroupDTO` — group create/update
- `@Schema(requiredMode = REQUIRED)` on every field the backend always populates — drives TypeScript generation.
### Error Handling
Use `DomainException` for all domain errors. Never throw raw exceptions from service methods.
→ See [CONTRIBUTING.md §Error handling](./CONTRIBUTING.md#error-handling)
```java
// Static factories match common HTTP status codes:
DomainException.notFound(ErrorCode.DOCUMENT_NOT_FOUND, "Document not found: " + id)
DomainException.forbidden("Access denied")
DomainException.conflict(ErrorCode.IMPORT_ALREADY_RUNNING, "Already running")
DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED, "Upload failed: " + e.getMessage())
```
`ErrorCode` is an enum in `exception/ErrorCode.java`. When adding a new error case, add the value there **and** mirror it in the frontend's `src/lib/errors.ts` + add a Paraglide translation key.
For simple validation in controllers (not domain logic), `ResponseStatusException` is acceptable:
```java
throw new ResponseStatusException(HttpStatus.BAD_REQUEST, "firstName is required");
```
**LLM reminder:** use `DomainException.notFound/forbidden/conflict/internal()` from service methods — never throw raw exceptions. When adding a new `ErrorCode`: (1) add to `ErrorCode.java`, (2) mirror in `frontend/src/lib/shared/errors.ts`, (3) add i18n keys in `messages/{de,en,es}.json`.
### Security / Permissions
Use `@RequirePermission` on controller methods (or the whole controller class):
→ See [docs/ARCHITECTURE.md §Permission system](./docs/ARCHITECTURE.md#permission-system)
```java
@RequirePermission(Permission.WRITE_ALL)
public Document updateDocument(...) { ... }
```
Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`
`PermissionAspect` (AOP) checks the current user's `UserGroup.permissions` at runtime.
**LLM reminder:** `@RequirePermission(Permission.WRITE_ALL)` is **required** on every `POST`, `PUT`, `PATCH`, `DELETE` endpoint — not optional. Do not mix with Spring Security's `@PreAuthorize`. Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`, `ANNOTATE_ALL`, `BLOG_WRITE`.
### OpenAPI / API Types
SpringDoc generates the spec at `/v3/api-docs` (only accessible when running with `--spring.profiles.active=dev`).
→ See [CONTRIBUTING.md §Walkthrough B — Add a new endpoint](./CONTRIBUTING.md#4-walkthrough-b--add-a-new-endpoint)
When changing any model field or endpoint:
1. Rebuild the backend JAR with `-DskipTests`
2. Start it with `--spring.profiles.active=dev`
3. Run `npm run generate:api` in `frontend/`
**LLM reminder:** always run `npm run generate:api` in `frontend/` after any backend model or endpoint change — this is the most common cause of TypeScript type errors.
---
@@ -233,79 +198,52 @@ frontend/src/routes/
### API Client Pattern
All server-side API calls use the typed client from `$lib/api.server.ts`:
→ See [CONTRIBUTING.md §Frontend API client](./CONTRIBUTING.md#frontend-api-client)
```typescript
const api = createApiClient(fetch);
const result = await api.GET('/api/persons/{id}', { params: { path: { id } } });
// Always check via response.ok, NOT result.error
if (!result.response.ok) {
  const code = (result.error as unknown as { code?: string })?.code;
  throw error(result.response.status, getErrorMessage(code));
}
return { person: result.data! };
```
Key rules:
- Use `!result.response.ok` for error checking (not `if (result.error)` — this breaks when the spec has no error responses defined)
- Cast errors as `result.error as unknown as { code?: string }` to extract the backend error code
- Use `result.data!` (non-null assertion) after an ok check — TypeScript knows it's present
For multipart/form-data endpoints (file uploads), bypass the typed client and use raw `fetch`:
```typescript
const res = await fetch(`${baseUrl}/api/documents`, { method: 'POST', body: formData });
```
**LLM reminder:** check `!result.response.ok` (not `result.error` — breaks when spec has no error responses defined); cast errors as `result.error as unknown as { code?: string }`; use `result.data!` after an ok check.
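A minimal sketch of the check this reminder describes, assuming the client returns an object with `data`, `error`, and `response` fields. `ApiResult`, `ApiFailure`, and `unwrap` are hypothetical names used only for illustration; they are not part of the project's `$lib/api.server.ts`.

```typescript
// Hypothetical result shape, modelled on the fields the reminder mentions.
interface ApiResult<T> {
  data?: T;
  error?: unknown;
  response: { ok: boolean; status: number };
}

// Hypothetical error type carrying the HTTP status and the backend error code.
class ApiFailure extends Error {
  readonly status: number;
  readonly code: string | undefined;

  constructor(status: number, code: string | undefined) {
    super(`API call failed with status ${status}`);
    this.status = status;
    this.code = code;
  }
}

// Returns the payload on success and throws ApiFailure otherwise.
function unwrap<T>(result: ApiResult<T>): T {
  // Decide on the HTTP status, not on result.error: the error field may be
  // untyped or absent when the spec defines no error responses.
  if (!result.response.ok) {
    const code = (result.error as unknown as { code?: string })?.code;
    throw new ApiFailure(result.response.status, code);
  }
  // Safe after the ok check: a successful response carries data.
  return result.data!;
}
```

In a SvelteKit `load` function the thrown value would more likely be SvelteKit's own `error(status, message)`; the custom class here only keeps the sketch free of framework imports.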
### Form Actions Pattern
```typescript
// +page.server.ts
export const actions = {
  default: async ({ request, fetch }) => {
    const formData = await request.formData();
    const name = formData.get('name') as string; // cast needed — FormData returns FormDataEntryValue
    // ...
    return fail(400, { error: 'message' }); // on error
    throw redirect(303, '/target'); // on success
  }
  default: async ({ request, fetch }) => {
    const formData = await request.formData();
    const name = formData.get("name") as string;
    // ...
    return fail(400, { error: "message" }); // on error
    throw redirect(303, "/target"); // on success
  },
};
```
### Date Handling
- **Forms**: German format `dd.mm.yyyy` with auto-dot insertion via `handleDateInput()`. A hidden `<input type="hidden" name="documentDate" value={dateIso}>` sends ISO format to the backend.
- **Display**: Always use `Intl.DateTimeFormat` with `T12:00:00` suffix to prevent UTC timezone off-by-one:
```typescript
new Intl.DateTimeFormat('de-DE', { day: 'numeric', month: 'long', year: 'numeric' })
.format(new Date(doc.documentDate + 'T12:00:00'))
```
→ See [CONTRIBUTING.md §Date handling](./CONTRIBUTING.md#date-handling)
**LLM reminder:** always append `T12:00:00` when constructing `new Date()` from an ISO date string — prevents UTC timezone off-by-one errors.
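A minimal sketch of why the suffix matters, assuming dates arrive as `yyyy-mm-dd` strings. `formatDocumentDate` is a hypothetical helper name, not a function from the codebase.

```typescript
// Hypothetical helper: formats an ISO date string (yyyy-mm-dd) for display.
function formatDocumentDate(isoDate: string, locale: string = "de-DE"): string {
  // A bare "yyyy-mm-dd" string is parsed as UTC midnight, which falls on the
  // previous calendar day in time zones west of UTC. A date-time string
  // without an offset is parsed as local time, so local noon keeps the
  // calendar day stable when formatting in the local time zone.
  const localNoon = new Date(isoDate + "T12:00:00");
  return new Intl.DateTimeFormat(locale, {
    day: "numeric",
    month: "long",
    year: "numeric",
  }).format(localNoon);
}
```

Noon is chosen rather than midnight so that the local calendar day cannot shift even if an hour is later added or removed, for example by a daylight-saving adjustment.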
### UI Component Library
Custom components in `src/lib/components/`:
| Component | Props | Description |
|---|---|---|
| `PersonTypeahead` | `name`, `label`, `value`, `initialName`, `on:change` | Single-person selector with typeahead dropdown |
| `PersonMultiSelect` | `selectedPersons` (bind) | Chip-based multi-person selector |
| `TagInput` | `tags` (bind), `allowCreation?`, `on:change` | Tag chip input with typeahead |
→ See per-domain READMEs: [`frontend/src/lib/person/README.md`](./frontend/src/lib/person/README.md), [`frontend/src/lib/tag/README.md`](./frontend/src/lib/tag/README.md), [`frontend/src/lib/document/README.md`](./frontend/src/lib/document/README.md), [`frontend/src/lib/shared/README.md`](./frontend/src/lib/shared/README.md)
### Styling Conventions (Tailwind CSS 4)
Brand color utilities (defined in `layout.css`):
| Class | Value | Usage |
|---|---|---|
| `brand-navy` | `#002850` | Primary text, buttons, headers |
| Class | Value | Usage |
| ------------ | --------- | -------------------------------- |
| `brand-navy` | `#002850` | Primary text, buttons, headers |
| `brand-mint` | `#A6DAD8` | Accents, hover underlines, icons |
| `brand-sand` | `#E4E2D7` | Page background, card borders |
| `brand-sand` | `#E4E2D7` | Page background, card borders |
Typography:
- `font-serif` (Merriweather) — body text, document titles, names
- `font-sans` (Montserrat) — labels, metadata, UI chrome
Card pattern for content sections:
```svelte
<div class="bg-white shadow-sm border border-brand-sand rounded-sm p-6">
<h2 class="text-xs font-bold uppercase tracking-widest text-gray-400 mb-5">Section Title</h2>
@@ -313,48 +251,19 @@ Card pattern for content sections:
</div>
```
Save bar pattern — use **sticky full-bleed** for long forms (edit document), **card-style with `mt-4`** for short forms (new person):
```svelte
<!-- Long forms: sticky, full-bleed -->
<div class="sticky bottom-0 z-10 -mx-4 px-6 py-4 bg-white border-t border-brand-sand shadow-[0_-2px_8px_rgba(0,0,0,0.06)] flex items-center justify-between">
<!-- Short forms: card, top margin -->
<div class="mt-4 flex items-center justify-between rounded-sm border border-brand-sand bg-white px-6 py-4 shadow-sm">
```
Back button pattern — use the shared `<BackButton>` component from `$lib/components/BackButton.svelte`:
```svelte
<script lang="ts">
import BackButton from '$lib/components/BackButton.svelte';
</script>
<BackButton />
```
The component calls `history.back()` so the user returns to wherever they came from. Label is always "Zurück" (no contextual suffix — destination is unknown). Touch target ≥ 44px and focus ring are built in. Do not use a static `<a href>` for back navigation.
Subtle action link (e.g. "new document/person"):
```svelte
<a href="/documents/new" class="inline-flex items-center gap-1 text-sm font-medium text-brand-navy/60 hover:text-brand-navy transition-colors">
<svg class="w-4 h-4" ...><!-- plus icon --></svg>
Neues Dokument
</a>
```
Back button pattern — use the shared `<BackButton>` component from `$lib/shared/primitives/BackButton.svelte`. Do not use a static `<a href>` for back navigation.
### Error Handling (Frontend)
`src/lib/errors.ts` mirrors the backend `ErrorCode` enum and maps codes to Paraglide translation keys. When adding a new `ErrorCode` on the backend:
1. Add it to `ErrorCode.java`
2. Add it to the `ErrorCode` type in `errors.ts`
3. Add a `case` in `getErrorMessage()`
4. Add the translation key in `messages/de.json`, `en.json`, `es.json`
→ See [CONTRIBUTING.md §Error handling](./CONTRIBUTING.md#error-handling)
**LLM reminder:** when adding a new `ErrorCode`: (1) add to `ErrorCode.java`, (2) add to `ErrorCode` type in `frontend/src/lib/shared/errors.ts`, (3) add a `case` in `getErrorMessage()`, (4) add i18n keys in `messages/{de,en,es}.json`.
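Steps (2) and (3) of this reminder could look roughly like the sketch below. It is an assumption about the shape of `errors.ts`, not a copy of it: the three codes are taken from the examples elsewhere in this file, while the `messages` object and its keys are invented placeholders standing in for the Paraglide lookup.

```typescript
// Hypothetical subset of the backend ErrorCode enum, mirrored as a union type.
type ErrorCode =
  | "DOCUMENT_NOT_FOUND"
  | "IMPORT_ALREADY_RUNNING"
  | "FILE_UPLOAD_FAILED";

// Placeholder for the i18n message lookup; keys and texts are invented.
const messages = {
  error_document_not_found: "Document not found",
  error_import_already_running: "An import is already running",
  error_file_upload_failed: "File upload failed",
  error_unknown: "Unknown error",
};

// Maps a backend error code to a user-facing message.
// Adding a new ErrorCode means: extend the union above, then add a case here.
function getErrorMessage(code: string | undefined): string {
  switch (code as ErrorCode | undefined) {
    case "DOCUMENT_NOT_FOUND":
      return messages.error_document_not_found;
    case "IMPORT_ALREADY_RUNNING":
      return messages.error_import_already_running;
    case "FILE_UPLOAD_FAILED":
      return messages.error_file_upload_failed;
    default:
      // Unknown or missing codes fall back to a generic message.
      return messages.error_unknown;
  }
}
```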
---
## Infrastructure
The `docker-compose.yml` at the repo root orchestrates everything. A MinIO MC helper container runs at startup to create the `archive-documents` bucket. The backend container depends on both `db` and `minio` being healthy.
Database migrations live in `backend/src/main/resources/db/migration/` (Flyway, SQL files named `V{n}__{description}.sql`).
→ See [docs/DEPLOYMENT.md](./docs/DEPLOYMENT.md)
## API Testing
@@ -362,4 +271,4 @@ HTTP test files are in `backend/api_tests/` for use with the VS Code REST Client
## Dev Container
A `.devcontainer/` config is available (Java 21 + Node 24, ports 8080 and 3000 forwarded). Use VS Code's "Reopen in Container" for a pre-configured environment.
→ See [.devcontainer/README.md](./.devcontainer/README.md)


@@ -11,7 +11,7 @@ Spring Boot 4.0 monolith serving the Familienarchiv REST API. Handles document m
- **Server**: Jetty (not Tomcat — excluded in pom.xml)
- **Data**: PostgreSQL 16, JPA/Hibernate, Spring Data JPA
- **Migrations**: Flyway (SQL files in `src/main/resources/db/migration/`)
- **Security**: Spring Security, Spring Session JDBC, JWT tokens
- **Security**: Spring Security, Spring Session JDBC
- **File Storage**: MinIO via AWS SDK v2 (S3-compatible)
- **Spreadsheet Import**: Apache POI 5.5.0 (Excel/ODS)
- **API Docs**: SpringDoc OpenAPI 3.x (`/v3/api-docs` — dev profile only)
@@ -19,7 +19,7 @@ Spring Boot 4.0 monolith serving the Familienarchiv REST API. Handles document m
## Package Structure
Package-by-domain: each domain owns its controller, service, repository, entities, and DTOs.
<!-- TODO: rewrite post-REFACTOR-1 — see Epic 4 -->
```
src/main/java/org/raddatz/familienarchiv/
@@ -43,31 +43,28 @@ src/main/java/org/raddatz/familienarchiv/
└── user/ # User domain — AppUser, UserGroup, UserService, auth controllers
```
## Layering Rules (Strict)
For per-domain ownership and public surface, see each domain's `README.md`.
```
Controller → Service → Repository → DB
```
## Layering Rules
- **Controllers never call repositories directly.**
- **Services never reach into another domain's repository.** Call the other domain's service instead.
- ✅ `DocumentService` → `PersonService.getById()` → `PersonRepository`
- ❌ `DocumentService` → `PersonRepository` directly
→ See [docs/ARCHITECTURE.md §Layering rule](../docs/ARCHITECTURE.md#layering-rule)
**LLM reminder:** controllers never call repositories directly; services never reach into another domain's repository — always call the other domain's service.
## Key Entities
| Entity | Table | Key Relationships |
|---|---|---|
| `Document` | `documents` | ManyToOne sender (Person), ManyToMany receivers (Person), ManyToMany tags (Tag) |
| `Person` | `persons` | Referenced by documents as sender/receiver; name aliases table |
| `Tag` | `tag` | ManyToMany with documents via `document_tags`; self-referencing parent for tree |
| `AppUser` | `app_users` | ManyToMany groups (UserGroup) |
| `UserGroup` | `user_groups` | Has a `Set<String> permissions` |
| `TranscriptionBlock` | `transcription_blocks` | Per-document, per-page text blocks with polygons |
| `DocumentAnnotation` | `document_annotations` | Free-form annotations on document pages |
| `Comment` | `document_comments` | Threaded comments with mentions |
| `Notification` | `notifications` | User notification feed |
| `OcrJob` / `OcrJobDocument` | `ocr_jobs`, `ocr_job_documents` | Batch OCR job tracking |
| Entity | Table | Key Relationships |
| --------------------------- | ------------------------------- | ------------------------------------------------------------------------------- |
| `Document` | `documents` | ManyToOne sender (Person), ManyToMany receivers (Person), ManyToMany tags (Tag) |
| `Person` | `persons` | Referenced by documents as sender/receiver; name aliases table |
| `Tag` | `tag` | ManyToMany with documents via `document_tags`; self-referencing parent for tree |
| `AppUser` | `app_users` | ManyToMany groups (UserGroup) |
| `UserGroup` | `user_groups` | Has a `Set<String> permissions` |
| `TranscriptionBlock` | `transcription_blocks` | Per-document, per-page text blocks with polygons |
| `DocumentAnnotation` | `document_annotations` | Free-form annotations on document pages |
| `Comment` | `document_comments` | Threaded comments with mentions |
| `Notification` | `notifications` | User notification feed |
| `OcrJob` / `OcrJobDocument` | `ocr_jobs`, `ocr_job_documents` | Batch OCR job tracking |
**`DocumentStatus` lifecycle:** `PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED`
@@ -104,32 +101,15 @@ public class MyEntity {
## Error Handling
Use `DomainException` for all domain errors:
→ See [CONTRIBUTING.md §Error handling](../CONTRIBUTING.md#error-handling)
```java
DomainException.notFound(ErrorCode.DOCUMENT_NOT_FOUND, "...")
DomainException.forbidden("...")
DomainException.conflict(ErrorCode.IMPORT_ALREADY_RUNNING, "...")
DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED, "...")
```
When adding a new `ErrorCode`:
1. Add to `ErrorCode.java`
2. Mirror in frontend `src/lib/errors.ts`
3. Add Paraglide translation key in `messages/{de,en,es}.json`
**LLM reminder:** use `DomainException.notFound/forbidden/conflict/internal()` — never throw raw exceptions from service methods. When adding a new `ErrorCode`: add to `ErrorCode.java`, mirror in `frontend/src/lib/shared/errors.ts`, add i18n keys in `messages/{de,en,es}.json`.
## Security / Permissions
Use `@RequirePermission` on controller methods or classes:
→ See [docs/ARCHITECTURE.md §Permission system](../docs/ARCHITECTURE.md#permission-system)
```java
@RequirePermission(Permission.WRITE_ALL)
public Document updateDocument(...) { ... }
```
Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`
`PermissionAspect` checks the current user's `UserGroup.permissions` at runtime.
**LLM reminder:** `@RequirePermission(Permission.WRITE_ALL)` is **required** on every `POST`, `PUT`, `PATCH`, `DELETE` endpoint — not optional. Do not mix with Spring Security's `@PreAuthorize`. Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`, `ANNOTATE_ALL`, `BLOG_WRITE`.
## OCR Integration
@@ -141,49 +121,35 @@ The backend orchestrates OCR by calling the Python `ocr-service` microservice vi
- `OcrBatchService` — handles batch/job workflows
- `OcrAsyncRunner` — async execution of OCR jobs
For ocr-service internals, see [`ocr-service/README.md`](../ocr-service/README.md).
## API Testing
HTTP test files in `backend/api_tests/` for the VS Code REST Client extension.
## How to Run
### Local Development
```bash
cd backend
# Run with dev profile (requires PostgreSQL + MinIO running via docker-compose)
./mvnw spring-boot:run
# Build JAR (with tests)
./mvnw clean package
# Build JAR skipping tests
./mvnw spring-boot:run # Run with dev profile (requires PostgreSQL + MinIO)
./mvnw clean package # Build JAR (with tests)
./mvnw clean package -DskipTests
# Run all tests
./mvnw test
# Run a single test class
./mvnw test -Dtest=ClassName
# Run with coverage (JaCoCo)
./mvnw clean verify
./mvnw test # Run all tests
./mvnw test -Dtest=ClassName # Run a single test class
./mvnw clean verify # Run with JaCoCo coverage report
```
### OpenAPI TypeScript Generation
**OpenAPI / TypeScript type generation:**
1. Build and start backend with `--spring.profiles.active=dev`
2. In `frontend/`, run: `npm run generate:api`
1. Start backend with `--spring.profiles.active=dev`
2. In `frontend/`: `npm run generate:api`
### Profiles
- **dev** (default): Enables OpenAPI, dev configs, e2e seeds
- **prod**: Production profile — no dev endpoints
**LLM reminder:** always regenerate types after any model or endpoint change — the most common cause of "where did my TypeScript type go?"
## Testing
- Unit tests: Mockito + JUnit, pure in-memory
- Slice tests: `@WebMvcTest`, `@DataJpaTest` with Testcontainers PostgreSQL
- Integration tests: Full Spring context with Testcontainers
- Coverage gate: 88% branch coverage overall (JaCoCo)
- Coverage gate: 88% branch coverage (JaCoCo)


@@ -1,97 +1,5 @@
# Docs — Familienarchiv
# docs/
## Overview
→ See [docs/README.md](./README.md) for the folder structure and documentation guide.
Project documentation organized into four categories: architecture decision records (ADRs), system architecture diagrams, infrastructure runbooks, and detailed UI/UX specifications.
## Folder Structure
```
docs/
├── adr/ # Architecture Decision Records
├── architecture/ # C4 model diagrams and system architecture docs
├── infrastructure/ # Deployment, CI/CD, and ops guides
├── specs/ # UI/UX feature specifications (HTML)
├── app-analysis-*.md # Application analysis reports
├── mail.md # Mail system documentation
├── security-guide.md # Security policies and hardening guide
├── STYLEGUIDE.md # Coding and design style guide
├── TODO-backend.md # Backend backlog
└── TODO-frontend.md # Frontend backlog
```
## ADR (`adr/`)
Architecture Decision Records capture major technical decisions and their rationale.
| ADR | Title | Status |
|---|---|---|
| `001-ocr-python-microservice.md` | OCR as a separate Python container | Accepted |
| `002-polygon-jsonb-storage.md` | Polygon coordinates in JSONB columns | Accepted |
| `003-chronik-unified-activity-feed.md` | Unified activity feed (Chronik) | Accepted |
When making a significant architectural change (new service, data model change, technology swap), write a new ADR following the format:
- Status (Proposed / Accepted / Deprecated / Superseded)
- Context (forces at play)
- Decision (what we decided)
- Consequences (trade-offs)
- Alternatives Considered (table format)
## Architecture (`architecture/`)
Contains C4 model diagrams describing the system at different zoom levels:
- **Context diagram** — How Familienarchiv fits into the user and system ecosystem
- **Container diagram** — The high-level technology choices (Spring Boot, SvelteKit, PostgreSQL, MinIO, OCR service)
- **Component diagram** — Major structural components within the backend
Written in Markdown with embedded Mermaid or PlantUML diagrams (`c4-diagrams.md`).
## Infrastructure (`infrastructure/`)
Operational documentation for running Familienarchiv in production and CI.
| Document | Purpose |
|---|---|
| `ci-gitea.md` | Gitea CI/CD pipeline configuration |
| `production-compose.md` | Production Docker Compose setup |
| `s3-migration.md` | Migrating documents between S3 buckets |
| `self-hosted-catalogue.md` | Self-hosted software catalogue |
## Specs (`specs/`)
High-fidelity UI/UX specifications written as standalone HTML files. These are design documents that describe exact layout, interactions, and responsive behavior before implementation.
Each spec typically includes:
- Visual mockups with CSS-in-HTML styling
- Interaction flows and state transitions
- Responsive breakpoint behavior
- Accessibility requirements
Examples of active spec areas:
- Document detail page (`document-topbar-*.html`, `documents-page-spec.html`)
- Admin interfaces (`admin-redesign-*.html`, `admin-tag-overhaul.html`)
- Transcription workflows (`inline-transcription-*.html`, `annotation-transcription-*.html`)
- Dashboard and activity feeds (`dashboard-*.html`, `chronik-spec.html`)
- OCR admin (`ocr-admin-spec.html`)
## How to Use
1. **Before implementing a feature**, check `specs/` for an existing specification.
2. **When proposing a new architecture**, draft an ADR in `adr/` and discuss before coding.
3. **When deploying**, follow `infrastructure/production-compose.md`.
4. **Keep TODO files updated** — they serve as lightweight backlogs.
## Style Guide
`STYLEGUIDE.md` covers:
- Code formatting and linting rules
- Component naming conventions
- Color palette and typography
- Accessibility standards (WCAG 2.1 AA)
## Contributing
- ADRs should be sequential (`NNN-descriptive-name.md`).
- Specs should be self-contained HTML files viewable in a browser.
- Infrastructure docs should include copy-pasteable commands.
**LLM reminder:** ADRs are sequential — use the next number after the highest existing one in `docs/adr/`. When making a significant architectural change (new service, data model change, technology swap), write a new ADR before implementing.

docs/README.md Normal file

@@ -0,0 +1,86 @@
# docs/
Project documentation organised into four categories: architecture decision records (ADRs), system architecture diagrams, infrastructure runbooks, and detailed UI/UX specifications.
## Folder structure
```
docs/
├── adr/ # Architecture Decision Records
├── architecture/ # C4 model diagrams and system architecture docs
├── infrastructure/ # Deployment, CI/CD, and ops guides
├── specs/ # UI/UX feature specifications (HTML)
├── ARCHITECTURE.md # Human-readable architecture overview (DOC-2)
├── DEPLOYMENT.md # Day-1 checklist and operational reference (DOC-5)
├── GLOSSARY.md # Domain terminology (DOC-3)
├── security-guide.md # Security policies and hardening guide
└── STYLEGUIDE.md # Coding and design style guide
```
## ADR (`adr/`)
Architecture Decision Records capture major technical decisions and their rationale.
| ADR | Title | Status |
| -------------------------------------- | ------------------------------------ | -------- |
| `001-ocr-python-microservice.md` | OCR as a separate Python container | Accepted |
| `002-polygon-jsonb-storage.md` | Polygon coordinates in JSONB columns | Accepted |
| `003-chronik-unified-activity-feed.md` | Unified activity feed (Chronik) | Accepted |
When making a significant architectural change (new service, data model change, technology swap), write a new ADR with the following sections:
- **Status** (Proposed / Accepted / Deprecated / Superseded)
- **Context** (forces at play)
- **Decision** (what we decided)
- **Consequences** (trade-offs)
- **Alternatives Considered** (table format)
ADRs are sequential (`NNN-descriptive-name.md`). Do not reuse numbers.
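As a minimal sketch of the numbering rule, the helper below derives the next ADR filename from the names already present in `adr/`. `nextAdrFilename` and the `example-decision` slug are hypothetical and not part of the repository; the function assumes only the `NNN-descriptive-name.md` convention.

```typescript
// Hypothetical helper for illustration; not part of the repository.
// Returns the filename for the next ADR, given the filenames already in docs/adr/.
function nextAdrFilename(existing: string[], slug: string): string {
  // Collect the numeric prefixes of all files that follow NNN-name.md.
  const numbers = existing
    .map((name) => /^(\d{3})-.+\.md$/.exec(name))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]));

  // Use the highest existing number plus one, so gaps are never refilled
  // and no number is ever reused.
  const next = (numbers.length > 0 ? Math.max(...numbers) : 0) + 1;
  const padded = next < 10 ? `00${next}` : next < 100 ? `0${next}` : String(next);
  return `${padded}-${slug}.md`;
}
```

With the three ADRs listed in the table, `nextAdrFilename(files, 'example-decision')` yields `004-example-decision.md`. Taking the maximum rather than the file count keeps the sequence correct even if an ADR file is ever removed.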
## Architecture (`architecture/`)
Contains C4 model diagrams describing the system at different zoom levels:
- **Context diagram** — How Familienarchiv fits into the user and system ecosystem
- **Container diagram** — The high-level technology choices (Spring Boot, SvelteKit, PostgreSQL, MinIO, OCR service)
- **Component diagram** — Major structural components within the backend
Written in Markdown with embedded Mermaid diagrams (`c4-diagrams.md`). Gitea renders these automatically.
For the human-readable architecture narrative, see [`docs/ARCHITECTURE.md`](ARCHITECTURE.md).
## Infrastructure (`infrastructure/`)
Operational documentation for running Familienarchiv in production and CI.
| Document | Purpose |
| -------------------------- | ---------------------------------------------------- |
| `ci-gitea.md` | Gitea CI/CD pipeline configuration |
| `production-compose.md` | Production Docker Compose setup and VPS provisioning |
| `s3-migration.md` | Migrating documents between S3 buckets |
| `self-hosted-catalogue.md` | Self-hosted software catalogue |
For the day-1 deployment checklist, see [`docs/DEPLOYMENT.md`](DEPLOYMENT.md).
## Specs (`specs/`)
High-fidelity UI/UX specifications written as standalone HTML files. These are design documents describing exact layout, interactions, and responsive behavior before implementation.
Each spec typically includes:
- Visual mockups with CSS-in-HTML styling
- Interaction flows and state transitions
- Responsive breakpoint behavior
- Accessibility requirements
Before implementing a feature, check `specs/` for an existing specification.
## Style Guide
[`docs/STYLEGUIDE.md`](STYLEGUIDE.md) covers:
- Code formatting and linting rules
- Component naming conventions
- Color palette and typography
- Accessibility standards (WCAG 2.1 AA)

familienarchiv-408 Submodule

Submodule familienarchiv-408 added at 6ecff120e6


@@ -71,29 +71,13 @@ src/
└── ... # Other SvelteKit config files
```
For per-domain component inventories, see the domain READMEs in `src/lib/<domain>/README.md`.
## API Client Pattern
All server-side API calls use the typed client from `$lib/api.server.ts`:
→ See [CONTRIBUTING.md §Frontend API client](../CONTRIBUTING.md#frontend-api-client)
```typescript
const api = createApiClient(fetch);
const result = await api.GET('/api/persons/{id}', { params: { path: { id } } });
// Always check via response.ok, NOT result.error
if (!result.response.ok) {
const code = (result.error as unknown as { code?: string })?.code;
throw error(result.response.status, getErrorMessage(code));
}
return { person: result.data! };
```
Key rules:
- Use `!result.response.ok` for error checking (not `if (result.error)` — breaks when spec has no error responses defined)
- Cast errors as `result.error as unknown as { code?: string }` to extract backend error code
- Use `result.data!` after an ok check
For multipart/form-data (file uploads), bypass the typed client and use raw `fetch`.
**LLM reminder:** check `!result.response.ok` (not `result.error` — breaks when spec has no error responses); cast errors as `result.error as unknown as { code?: string }`; use `result.data!` after an ok check. For multipart/form-data (file uploads), bypass the typed client and use raw `fetch`.
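The reminder above can be sketched without the real client. The `ApiResult` shape below is an assumption that mirrors only the three fields the reminder mentions (`data`, `error`, `response`); `Outcome`, `toOutcome`, and the `PERSON_NOT_FOUND` code used in the usage note are hypothetical names for illustration, not part of the project's API.

```typescript
// Assumed minimal shape of a typed-client result; the real client type is richer.
interface ApiResult<T> {
  data?: T;
  error?: unknown;
  response: { ok: boolean; status: number };
}

// Hypothetical result type for this sketch.
type Outcome<T> =
  | { ok: true; data: T }
  | { ok: false; status: number; code?: string };

function toOutcome<T>(result: ApiResult<T>): Outcome<T> {
  // Decide on response.ok: result.error can be undefined on a failed request
  // when the OpenAPI spec defines no error responses for the endpoint.
  if (!result.response.ok) {
    const code = (result.error as unknown as { code?: string } | undefined)?.code;
    return { ok: false, status: result.response.status, code };
  }
  // Safe only because the ok check above has passed.
  return { ok: true, data: result.data! };
}
```

A failed response without an error body, such as `{ response: { ok: false, status: 500 } }`, still maps to `{ ok: false, status: 500 }` here, whereas a check on `result.error` alone would treat it as a success. A failure carrying `{ code: 'PERSON_NOT_FOUND' }` surfaces that code for message lookup.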
## Form Actions Pattern
@@ -102,7 +86,7 @@ For multipart/form-data (file uploads), bypass the typed client and use raw `fet
export const actions = {
default: async ({ request, fetch }) => {
const formData = await request.formData();
const name = formData.get('name') as string;
const name = formData.get('name') as string; // cast needed — FormData returns FormDataEntryValue
// ...
return fail(400, { error: 'message' }); // on error
throw redirect(303, '/target'); // on success
@@ -112,13 +96,9 @@ export const actions = {
## Date Handling
- **Forms**: German format `dd.mm.yyyy` with auto-dot insertion via `handleDateInput()`. A hidden `<input type="hidden" name="documentDate" value={dateIso}>` sends ISO to the backend.
- **Display**: Always use `Intl.DateTimeFormat` with `T12:00:00` suffix to prevent UTC off-by-one:
```typescript
new Intl.DateTimeFormat('de-DE', { day: 'numeric', month: 'long', year: 'numeric' }).format(
new Date(doc.documentDate + 'T12:00:00')
);
```
→ See [CONTRIBUTING.md §Date handling](../CONTRIBUTING.md#date-handling)
**LLM reminder:** always append `T12:00:00` when constructing `new Date()` from an ISO date string — prevents UTC timezone off-by-one errors. Forms use German `dd.mm.yyyy` format via `handleDateInput()` with a hidden ISO input.
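Why the suffix matters: JavaScript parses a date-only ISO string such as `2024-03-01` as UTC midnight, so in time zones west of UTC the local calendar day is the previous one. A date-time string without an offset is parsed as local time, so appending `T12:00:00` pins the value to local noon. The helpers below are a hedged sketch; `germanDateToIso`, `parseDocumentDate`, and `formatDocumentDate` are hypothetical names, and the project's own `handleDateInput()` is not reproduced here.

```typescript
// Hypothetical helpers for illustration; not the project's implementation.

// Convert a German form value (dd.mm.yyyy) into the ISO date (yyyy-mm-dd)
// that the hidden input sends to the backend. Format check only: the day and
// month ranges are not validated.
function germanDateToIso(value: string): string | null {
  const match = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(value.trim());
  if (match === null) return null;
  return `${match[3]}-${match[2]}-${match[1]}`;
}

// Parse an ISO date as local noon rather than UTC midnight, so the local
// calendar day always matches the stored date.
function parseDocumentDate(isoDate: string): Date {
  return new Date(isoDate + 'T12:00:00');
}

// Format for display in German long form.
function formatDocumentDate(isoDate: string): string {
  return new Intl.DateTimeFormat('de-DE', {
    day: 'numeric',
    month: 'long',
    year: 'numeric'
  }).format(parseDocumentDate(isoDate));
}
```

Regardless of the machine's time zone, `parseDocumentDate('2024-03-01').getDate()` is `1`, which is the property the display code relies on.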
## Styling Conventions (Tailwind CSS 4)
@@ -146,15 +126,9 @@ Card pattern for content sections:
## Key UI Components
| Component | Location | Props | Description |
| -------------------- | ------------------------------ | --------------------------------------- | ------------------------------------------ |
| `PersonTypeahead` | `$lib/person/` | `name`, `label`, `value`, `initialName` | Single-person selector with typeahead |
| `PersonMultiSelect` | `$lib/person/` | `selectedPersons` (bind) | Chip-based multi-person selector |
| `TagInput` | `$lib/tag/` | `tags` (bind), `allowCreation?` | Tag chip input with typeahead |
| `PdfViewer` | `$lib/document/` | `url`, `annotations` | PDF rendering with annotation overlay |
| `TranscriptionBlock` | `$lib/document/transcription/` | `block`, `mode` | Read/edit transcription block |
| `DocumentTopBar` | `$lib/document/` | `document` | Responsive document metadata header |
| `BackButton` | `$lib/shared/primitives/` | — | Calls `history.back()`; 44 px touch target |
→ See per-domain READMEs: [`src/lib/person/README.md`](src/lib/person/README.md), [`src/lib/tag/README.md`](src/lib/tag/README.md), [`src/lib/document/README.md`](src/lib/document/README.md), [`src/lib/shared/README.md`](src/lib/shared/README.md)
**LLM reminder:** `BackButton` is at `$lib/shared/primitives/BackButton.svelte` — use it for all back navigation; never a static `<a href>`. API client is at `$lib/shared/api.server`.
## How to Run


@@ -1,154 +1,7 @@
# OCR Service — Familienarchiv
# OCR Service
## Overview
→ See [ocr-service/README.md](./README.md) for tech stack, architecture, endpoints, environment variables, local development, testing, and training.
Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.
**LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.
## Tech Stack
- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
- **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
- **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`
## Architecture
The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.
### Interface Contract
**Request:**
```json
{
"pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
"scriptType": "HANDWRITING_KURRENT",
"language": "de"
}
```
**Response:** Array of `OcrBlock` objects:
```json
[
{
"pageNumber": 0,
"x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
"polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
"text": "Sehr geehrter Herr ..."
}
]
```
Coordinates are normalized (0-1) relative to page dimensions.
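As a sketch of what normalized coordinates mean for a consumer, the function below scales one block to pixel units for a rendered page. The `OcrBlock` fields are taken from the response example above; `PixelRect`, `toPixelRect`, and the page dimensions are assumptions made for illustration and do not describe how the backend or frontend actually consumes the blocks.

```typescript
// Field names follow the response example; the real model may carry more fields.
interface OcrBlock {
  pageNumber: number;
  x: number;
  y: number;
  width: number;
  height: number;
  polygon: number[][];
  text: string;
}

// Hypothetical pixel-space rectangle for this sketch.
interface PixelRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Scale normalized (0-1) coordinates to the pixel size of a rendered page.
// Horizontal values scale with the page width, vertical values with the height.
function toPixelRect(block: OcrBlock, pageWidthPx: number, pageHeightPx: number): PixelRect {
  return {
    left: Math.round(block.x * pageWidthPx),
    top: Math.round(block.y * pageHeightPx),
    width: Math.round(block.width * pageWidthPx),
    height: Math.round(block.height * pageHeightPx)
  };
}
```

Because the stored values are relative to the page, the same block can be drawn at any zoom level by passing the current rendered page size.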
### File Structure
```
ocr-service/
├── main.py # FastAPI app, endpoints, request handling
├── models.py # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│ ├── __init__.py
│ ├── kraken.py # Kraken engine wrapper (Kurrent models)
│ └── surya.py # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py # Confidence scoring and thresholding
├── spell_check.py # Post-OCR spell correction
├── ensure_blla_model.py # Model download / verification helper
├── dictionaries/ # Historical word lists for spell checking
├── requirements.txt # Python dependencies
├── Dockerfile # Production container image
└── entrypoint.sh # Container startup script
```
### Key Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
## How to Run
### Local Development (Python venv)
```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate
# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu
# Install remaining dependencies
pip install -r requirements.txt
# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000
# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```
### Docker (via docker-compose)
The OCR service is included in the root `docker-compose.yml`:
```bash
docker-compose up -d ocr-service
```
The container:
- Exposes port 8000 internally (not mapped to host by default)
- Mounts `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB
### Model Downloads
Use the helper script to download Kraken models:
```bash
./scripts/download-kraken-models.sh
```
Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally.
## Testing
Only a subset of tests can run without the full ML stack:
```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker
# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```
Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv.
## Training
The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
`ALLOWED_PDF_HOSTS` must never be set to `*`: a wildcard disables the host allowlist and exposes the service to server-side request forgery (SSRF). The default (`minio,localhost,127.0.0.1`) is correct for dev.
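To illustrate what a host allowlist protects against, here is a minimal sketch of such a check. It is written in TypeScript for illustration only; the Python service's actual implementation is not shown in this document, and `isAllowedPdfUrl` is a hypothetical name. The sketch deliberately has no wildcard support: every host must be listed explicitly.

```typescript
// Hypothetical allowlist check for illustration; not the OCR service's code.
function isAllowedPdfUrl(pdfUrl: string, allowedHostsCsv: string): boolean {
  const allowed = allowedHostsCsv
    .split(',')
    .map((host) => host.trim().toLowerCase())
    .filter((host) => host.length > 0);

  let parsed: URL;
  try {
    // Use a real URL parser: string matching is fooled by inputs such as
    // http://minio@evil.example/, where "minio" is only the userinfo part.
    parsed = new URL(pdfUrl);
  } catch {
    return false; // not a valid absolute URL
  }

  // Only fetch over HTTP(S); this rejects file:, ftp:, and similar schemes.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return false;
  }

  // Exact hostname match against the allowlist, ignoring the port.
  return allowed.indexOf(parsed.hostname.toLowerCase()) !== -1;
}
```

With the default list, a presigned MinIO URL such as `http://minio:9000/archive-documents/abc.pdf` passes, while a cloud metadata address like `http://169.254.169.254/` or a `file:` URL is rejected before any request is made.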


@@ -1,144 +1,5 @@
# Scripts — Familienarchiv
# scripts/
## Overview
→ See [scripts/README.md](./README.md) for the full list of scripts, their purpose, and usage.
Utility scripts for development, data management, model downloads, and database operations. These are standalone shell and Python scripts used outside the normal application runtime.
## Scripts
### `reset-db.sh`
**Purpose**: Hard-reset the development database, wiping all documents, persons, tags, and related data.
**Usage:**
```bash
./scripts/reset-db.sh
# Type 'yes' to confirm
```
**What it truncates:**
- `transcription_block_versions`
- `transcription_blocks`
- `comment_mentions`
- `document_comments`
- `document_annotations`
- `document_versions`
- `notifications`
- `documents`
- `person_name_aliases`
- `persons`
- `tag`
> ⚠️ **Destructive operation** — only for development!
---
### `rebuild-frontend.sh`
**Purpose**: Force a clean rebuild of the frontend Docker container.
**Usage:**
```bash
./scripts/rebuild-frontend.sh
```
---
### `download-kraken-models.sh`
**Purpose**: Download Kraken HTR models for German Kurrent and Sütterlin scripts.
**Usage:**
```bash
./scripts/download-kraken-models.sh
```
Downloads models into `./ocr-service/models/` or the `ocr_models` Docker volume. Models are ~100-500 MB each.
---
### `download-paperless.sh`
**Purpose**: Download exported documents from a Paperless-ngx instance.
**Usage:**
```bash
./scripts/download-paperless.sh
```
Requires environment variables or config for the Paperless API endpoint and token.
---
### `flatten-paperless.sh`
**Purpose**: Flatten nested Paperless export directories into a single import-ready structure.
**Usage:**
```bash
./scripts/flatten-paperless.sh
```
---
### `generate_data.py`
**Purpose**: Generate synthetic test data for development.
**Usage:**
```bash
python scripts/generate_data.py
```
Generates fake documents, persons, and tags suitable for load testing or UI development.
---
### `prepare_historical_dict.py`
**Purpose**: Build a historical German word dictionary for the OCR spell-checker.
**Usage:**
```bash
python scripts/prepare_historical_dict.py
```
Processes raw word lists into the format expected by `ocr-service/spell_check.py`.
---
### `schema.sql`
**Purpose**: Complete database schema dump for reference.
**Note**: Flyway migrations in `backend/src/main/resources/db/migration/` are the source of truth for schema evolution. `schema.sql` is a snapshot for quick reference only.
---
### `large-data.sql`
**Purpose**: Pre-seeded dataset with a large number of documents for performance testing.
**Usage:**
```bash
# Import into PostgreSQL
docker exec -i archive-db psql -U archive_user -d family_archive_db < scripts/large-data.sql
```
## How to Use
Most scripts should be run from the **repository root**:
```bash
# Database reset
./scripts/reset-db.sh
# Model download
./scripts/download-kraken-models.sh
# Data generation
cd scripts && python generate_data.py
```
Ensure scripts are executable:
```bash
chmod +x scripts/*.sh
```
## Adding New Scripts
1. Place the script in `scripts/`
2. Add a header comment describing purpose and usage
3. Make it executable (`chmod +x`)
4. Document it in this `CLAUDE.md`
**LLM reminder:** when adding a new script, document it in `scripts/README.md` (not here).

scripts/README.md Normal file

@@ -0,0 +1,161 @@
# scripts/
Utility scripts for development, data management, model downloads, and database operations. These are standalone shell and Python scripts used outside the normal application runtime.
## Scripts
### `reset-db.sh`
**Purpose**: Hard-reset the development database, wiping all documents, persons, tags, and related data.
**Usage:**
```bash
./scripts/reset-db.sh
# Type 'yes' to confirm
```
**What it truncates:**
- `transcription_block_versions`
- `transcription_blocks`
- `comment_mentions`
- `document_comments`
- `document_annotations`
- `document_versions`
- `notifications`
- `documents`
- `person_name_aliases`
- `persons`
- `tag`
> ⚠️ **Destructive operation — only for development!** This wipes ALL data. Not reversible without a backup.
---
### `rebuild-frontend.sh`
**Purpose**: Force a clean rebuild of the frontend Docker container.
**Usage:**
```bash
./scripts/rebuild-frontend.sh
```
---
### `download-kraken-models.sh`
**Purpose**: Download Kraken HTR models for German Kurrent and Sütterlin scripts.
**Usage:**
```bash
./scripts/download-kraken-models.sh
```
Downloads models into `./ocr-service/models/` or the `ocr_models` Docker volume. Models are ~100-500 MB each.
---
### `download-paperless.sh`
**Purpose**: Download exported documents from a Paperless-ngx instance.
**Usage:**
```bash
./scripts/download-paperless.sh
```
Requires environment variables or config for the Paperless API endpoint and token.
---
### `flatten-paperless.sh`
**Purpose**: Flatten nested Paperless export directories into a single import-ready structure.
**Usage:**
```bash
./scripts/flatten-paperless.sh
```
---
### `generate_data.py`
**Purpose**: Generate synthetic test data for development.
**Usage:**
```bash
python scripts/generate_data.py
```
Generates fake documents, persons, and tags suitable for load testing or UI development.
---
### `prepare_historical_dict.py`
**Purpose**: Build a historical German word dictionary for the OCR spell-checker.
**Usage:**
```bash
python scripts/prepare_historical_dict.py
```
Processes raw word lists into the format expected by `ocr-service/spell_check.py`.
---
### `schema.sql`
**Purpose**: Complete database schema dump for reference.
**Note**: Flyway migrations in `backend/src/main/resources/db/migration/` are the source of truth for schema evolution. `schema.sql` is a snapshot for quick reference only.
---
### `large-data.sql`
**Purpose**: Pre-seeded dataset with a large number of documents for performance testing.
**Usage:**
```bash
# Import into PostgreSQL
docker exec -i archive-db psql -U archive_user -d family_archive_db < scripts/large-data.sql
```
## How to Use
Most scripts should be run from the **repository root**:
```bash
# Database reset
./scripts/reset-db.sh
# Model download
./scripts/download-kraken-models.sh
# Data generation
python scripts/generate_data.py
```
Ensure scripts are executable:
```bash
chmod +x scripts/*.sh
```
## Adding New Scripts
1. Place the script in `scripts/`
2. Add a header comment describing purpose and usage
3. Make it executable (`chmod +x`)
4. Document it in this `README.md`