diff --git a/.devcontainer/CLAUDE.md b/.devcontainer/CLAUDE.md index abd0a5ce..2daaa2d0 100644 --- a/.devcontainer/CLAUDE.md +++ b/.devcontainer/CLAUDE.md @@ -1,96 +1,3 @@ -# Dev Container — Familienarchiv +# Dev Container -## Overview - -VS Code Dev Container configuration for a pre-configured development environment. Includes Java 21, Maven, and Node.js 24 — everything needed to work on both backend and frontend. - -## Configuration - -File: `.devcontainer/devcontainer.json` - -### Included Features - -| Feature | Version | Purpose | -|---|---|---| -| Java | 21 | Spring Boot backend | -| Maven | bundled with Java feature | Build tool | -| Node.js | 24 | SvelteKit frontend | - -### VS Code Extensions (Auto-installed) - -| Extension | Purpose | -|---|---| -| `vscjava.vscode-java-pack` | Java language support, debugging, testing | -| `vmware.vscode-spring-boot` | Spring Boot tooling | -| `gabrielbb.vscode-lombok` | Lombok annotation support | -| `humao.rest-client` | HTTP request files (for `backend/api_tests/`) | - -### Ports - -- `8080` forwarded to host — access backend at `http://localhost:8080` - -### User - -Runs as `vscode` user (not root) for security. - -## How to Use - -### Prerequisites - -- VS Code with the **Dev Containers** extension installed -- Docker running locally - -### Open in Dev Container - -1. Open the project in VS Code -2. Press `F1` → type "Dev Containers: Reopen in Container" -3. VS Code will: - - Build the container using the root `docker-compose.yml` - - Install Java 21, Maven, and Node 24 - - Install the listed extensions - - Mount the workspace folder - -### Working Inside the Container - -Once inside the container, you have access to both stacks: - -```bash -# Backend -cd backend -./mvnw spring-boot:run - -# Frontend (in a new terminal) -cd frontend -npm install -npm run dev -``` - -The container reuses the `docker-compose.yml` services, so PostgreSQL and MinIO are available automatically. 
- -### Forwarding Frontend Port - -The devcontainer config only forwards port 8080 by default. To access the frontend dev server (port 5173 or 3000), either: - -1. Add `5173` to `forwardPorts` in `devcontainer.json`, or -2. Use the VS Code "Ports" panel to forward it dynamically - -## Limitations - -- The devcontainer attaches to the `backend` service from `docker-compose.yml`, so it inherits those environment variables -- OCR service and other containers should be started separately via `docker-compose up -d` -- GPU passthrough for OCR training is not configured - -## Customization - -To add more tools or extensions, edit `.devcontainer/devcontainer.json`: - -```json -{ - "features": { - "ghcr.io/devcontainers/features/python:1": { - "version": "3.11" - } - }, - "forwardPorts": [8080, 5173, 3000] -} -``` +→ See [.devcontainer/README.md](./README.md) for configuration, usage, and known limitations. diff --git a/.devcontainer/README.md b/.devcontainer/README.md new file mode 100644 index 00000000..0dc649ff --- /dev/null +++ b/.devcontainer/README.md @@ -0,0 +1,94 @@ +# Dev Container — Familienarchiv + +VS Code Dev Container configuration for a pre-configured development environment. Includes Java 21, Maven, and Node.js 24 — everything needed to work on both backend and frontend. 
+ +## Configuration + +File: `.devcontainer/devcontainer.json` + +### Included Features + +| Feature | Version | Purpose | +| ------- | ------------------------- | ------------------- | +| Java | 21 | Spring Boot backend | +| Maven | bundled with Java feature | Build tool | +| Node.js | 24 | SvelteKit frontend | + +### VS Code Extensions (Auto-installed) + +| Extension | Purpose | +| --------------------------- | --------------------------------------------- | +| `vscjava.vscode-java-pack` | Java language support, debugging, testing | +| `vmware.vscode-spring-boot` | Spring Boot tooling | +| `gabrielbb.vscode-lombok` | Lombok annotation support | +| `humao.rest-client` | HTTP request files (for `backend/api_tests/`) | + +### Ports + +- `8080` forwarded to host — access backend at `http://localhost:8080` + +### User + +Runs as `vscode` user (not root) for security. + +## How to Use + +### Prerequisites + +- VS Code with the **Dev Containers** extension installed +- Docker running locally + +### Open in Dev Container + +1. Open the project in VS Code +2. Press `F1` → type "Dev Containers: Reopen in Container" +3. VS Code will: + - Build the container using the root `docker-compose.yml` + - Install Java 21, Maven, and Node 24 + - Install the listed extensions + - Mount the workspace folder + +### Working Inside the Container + +Once inside the container, you have access to both stacks: + +```bash +# Backend +cd backend +./mvnw spring-boot:run + +# Frontend (in a new terminal) +cd frontend +npm install +npm run dev +``` + +The container reuses the `docker-compose.yml` services, so PostgreSQL and MinIO are available automatically. + +### Forwarding Frontend Port + +The devcontainer config only forwards port 8080 by default. To access the frontend dev server (port 5173 or 3000), either: + +1. Add `5173` to `forwardPorts` in `devcontainer.json`, or +2. 
Use the VS Code "Ports" panel to forward it dynamically + +## Limitations + +- The devcontainer attaches to the `backend` service from `docker-compose.yml`, so it inherits those environment variables +- OCR service and other containers should be started separately via `docker-compose up -d` +- GPU passthrough for OCR training is not configured + +## Customization + +To add more tools or extensions, edit `.devcontainer/devcontainer.json`: + +```json +{ + "features": { + "ghcr.io/devcontainers/features/python:1": { + "version": "3.11" + } + }, + "forwardPorts": [8080, 5173, 3000] +} +``` diff --git a/CLAUDE.md b/CLAUDE.md index c6556eae..eca50a6d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -2,6 +2,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. +> For a human-readable project overview, see [README.md](./README.md). + ## Project Overview **Familienarchiv** is a family document archival system — a full-stack web app for digitizing, organizing, and searching family documents. Key features: file uploads (stored in MinIO/S3), metadata management, Excel/ODS batch import, full-text search, conversation threads between family members, and role-based access control. 
@@ -16,6 +18,8 @@ See [CODESTYLE.md](./CODESTYLE.md) for coding standards: Clean Code, DRY/KISS tr ## Stack +→ See [README.md §Tech Stack](./README.md#tech-stack) + - **Backend**: Spring Boot 4.0 (Java 21, Maven, Jetty, JPA/Hibernate, Flyway, Spring Security, Spring Session JDBC) - **Frontend**: SvelteKit 2 with Svelte 5, TypeScript, Tailwind CSS 4, Paraglide.js (i18n: de/en/es) - **Database**: PostgreSQL 16 @@ -25,12 +29,13 @@ See [CODESTYLE.md](./CODESTYLE.md) for coding standards: Clean Code, DRY/KISS tr ## Common Commands ### Running the Full Stack + ```bash -# From repo root — starts PostgreSQL, MinIO, and Spring Boot backend docker-compose up -d ``` ### Backend (Spring Boot) + ```bash cd backend @@ -42,6 +47,7 @@ cd backend ``` ### Frontend (SvelteKit) + ```bash cd frontend @@ -64,7 +70,7 @@ npm run generate:api # Regenerate TypeScript API types from OpenAPI spec ### Package Structure -Package-by-domain: each domain owns its controller, service, repository, entities, and DTOs. + ``` backend/src/main/java/org/raddatz/familienarchiv/ @@ -88,27 +94,21 @@ backend/src/main/java/org/raddatz/familienarchiv/ └── user/ User domain — AppUser, UserGroup, UserService, auth controllers ``` -### Layering Rules (strictly enforced) +### Layering Rules -``` -Controller → Service → Repository → DB -``` +→ See [docs/ARCHITECTURE.md §Layering rule](./docs/ARCHITECTURE.md#layering-rule) -- **Controllers** never inject or call repositories directly. -- **Services** never reach into another domain's repository. Call the other domain's service instead. - - ✅ `DocumentService` → `PersonService.getById()` → `PersonRepository` - - ❌ `DocumentService` → `PersonRepository` directly -- This keeps domain boundaries clear and business logic testable in isolation. +**LLM reminder:** controllers never call repositories directly; services never reach into another domain's repository — always call the other domain's service instead. 
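A minimal, framework-free sketch of the layering reminder above may help make the rule concrete. It assumes plain Java classes standing in for the Spring beans; `Person`, `describeSender`, and the in-memory map are hypothetical illustrations, and only the names `DocumentService`, `PersonService.getById()`, and `PersonRepository` come from the documentation itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the JPA entity.
record Person(long id, String name) {}

// Hypothetical in-memory stand-in for the Spring Data repository.
class PersonRepository {
    private final Map<Long, Person> rows = new HashMap<>();

    void save(Person person) { rows.put(person.id(), person); }

    Optional<Person> findById(long id) { return Optional.ofNullable(rows.get(id)); }
}

// The person domain owns its repository; nothing outside this domain touches it.
class PersonService {
    private final PersonRepository personRepository;

    PersonService(PersonRepository personRepository) { this.personRepository = personRepository; }

    Person getById(long id) {
        // The real service would throw a DomainException here; a JDK exception keeps the sketch standalone.
        return personRepository.findById(id)
                .orElseThrow(() -> new IllegalArgumentException("Person not found: " + id));
    }
}

// The document domain depends on PersonService, never on PersonRepository.
class DocumentService {
    private final PersonService personService;

    DocumentService(PersonService personService) { this.personService = personService; }

    String describeSender(long senderId) {
        return "Sender: " + personService.getById(senderId).name();
    }
}
```

Because `DocumentService` only sees `PersonService`, a test can swap in a fake person service without touching persistence, which is the isolation benefit the rule is aiming for.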
### Domain Model -| Entity | Table | Key relationships | -|---|---|---| -| `Document` | `documents` | ManyToOne `sender` (Person), ManyToMany `receivers` (Person), ManyToMany `tags` (Tag) | -| `Person` | `persons` | Referenced by documents as sender/receiver | -| `Tag` | `tag` | ManyToMany with documents via `document_tags` | -| `AppUser` | `app_users` | ManyToMany `groups` (UserGroup) | -| `UserGroup` | `user_groups` | Has a `Set permissions` | +| Entity | Table | Key relationships | +| ----------- | ------------- | ------------------------------------------------------------------------------------- | +| `Document` | `documents` | ManyToOne `sender` (Person), ManyToMany `receivers` (Person), ManyToMany `tags` (Tag) | +| `Person` | `persons` | Referenced by documents as sender/receiver | +| `Tag` | `tag` | ManyToMany with documents via `document_tags` | +| `AppUser` | `app_users` | ManyToMany `groups` (UserGroup) | +| `UserGroup` | `user_groups` | Has a `Set permissions` | **`DocumentStatus` lifecycle:** `PLACEHOLDER → UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED` @@ -118,6 +118,7 @@ Controller → Service → Repository → DB ### Entity Code Style All entities use these Lombok annotations: + ```java @Entity @Table(name = "table_name") @@ -146,65 +147,29 @@ Services are annotated with `@Service`, `@RequiredArgsConstructor`, and optional - Read methods are not annotated (default non-transactional is fine). - Each service owns its domain's repository. Cross-domain data access goes through the other domain's service. 
-**Existing services:** - -| Service | Responsibility | -|---|---| -| `DocumentService` | Document CRUD, search, tag cascade delete | -| `PersonService` | Person CRUD, find-or-create by alias | -| `TagService` | Tag find/create/update/delete | -| `UserService` | User and group CRUD | -| `FileService` | S3/MinIO upload and download | -| `MassImportService` | Async ODS/Excel import; delegates to PersonService and TagService | - ### DTOs -Input DTOs live in `dto/`. Response types are the model entities themselves (no response DTOs). +Input DTOs live flat in the domain package. Response types are the model entities themselves (no response DTOs). -- `DocumentUpdateDTO` — used for both create and update (all fields optional) -- `CreateUserRequest` — user creation -- `GroupDTO` — group create/update +- `@Schema(requiredMode = REQUIRED)` on every field the backend always populates — drives TypeScript generation. ### Error Handling -Use `DomainException` for all domain errors. Never throw raw exceptions from service methods. +→ See [CONTRIBUTING.md §Error handling](./CONTRIBUTING.md#error-handling) -```java -// Static factories match common HTTP status codes: -DomainException.notFound(ErrorCode.DOCUMENT_NOT_FOUND, "Document not found: " + id) -DomainException.forbidden("Access denied") -DomainException.conflict(ErrorCode.IMPORT_ALREADY_RUNNING, "Already running") -DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED, "Upload failed: " + e.getMessage()) -``` - -`ErrorCode` is an enum in `exception/ErrorCode.java`. When adding a new error case, add the value there **and** mirror it in the frontend's `src/lib/errors.ts` + add a Paraglide translation key. 
- -For simple validation in controllers (not domain logic), `ResponseStatusException` is acceptable: -```java -throw new ResponseStatusException(HttpStatus.BAD_REQUEST, "firstName is required"); -``` +**LLM reminder:** use `DomainException.notFound/forbidden/conflict/internal()` from service methods — never throw raw exceptions. When adding a new `ErrorCode`: (1) add to `ErrorCode.java`, (2) mirror in `frontend/src/lib/shared/errors.ts`, (3) add i18n keys in `messages/{de,en,es}.json`. ### Security / Permissions -Use `@RequirePermission` on controller methods (or the whole controller class): +→ See [docs/ARCHITECTURE.md §Permission system](./docs/ARCHITECTURE.md#permission-system) -```java -@RequirePermission(Permission.WRITE_ALL) -public Document updateDocument(...) { ... } -``` - -Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION` - -`PermissionAspect` (AOP) checks the current user's `UserGroup.permissions` at runtime. +**LLM reminder:** `@RequirePermission(Permission.WRITE_ALL)` is **required** on every `POST`, `PUT`, `PATCH`, `DELETE` endpoint — not optional. Do not mix with Spring Security's `@PreAuthorize`. Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`, `ANNOTATE_ALL`, `BLOG_WRITE`. ### OpenAPI / API Types -SpringDoc generates the spec at `/v3/api-docs` (only accessible when running with `--spring.profiles.active=dev`). +→ See [CONTRIBUTING.md §Walkthrough B — Add a new endpoint](./CONTRIBUTING.md#4-walkthrough-b--add-a-new-endpoint) -When changing any model field or endpoint: -1. Rebuild the backend JAR with `-DskipTests` -2. Start it with `--spring.profiles.active=dev` -3. Run `npm run generate:api` in `frontend/` +**LLM reminder:** always run `npm run generate:api` in `frontend/` after any backend model or endpoint change — this is the most common cause of TypeScript type errors. 
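Returning to the `DomainException` reminder under Error Handling, the static-factory style could look roughly like the sketch below. This is not the project's actual class: the fields, the accessor names, and the `DocumentLookup` helper are assumptions made for illustration, and only the factory names and the `ErrorCode` values shown are taken from the documentation.

```java
import java.util.Map;

// Subset of error codes mentioned in the docs; the real enum has more values.
enum ErrorCode { DOCUMENT_NOT_FOUND, IMPORT_ALREADY_RUNNING, FILE_UPLOAD_FAILED }

// Hypothetical minimal shape: one unchecked exception type carrying an HTTP status and an error code.
class DomainException extends RuntimeException {
    private final int status;
    private final ErrorCode code;

    private DomainException(int status, ErrorCode code, String message) {
        super(message);
        this.status = status;
        this.code = code;
    }

    // Static factories mirror common HTTP status codes.
    static DomainException notFound(ErrorCode code, String message) {
        return new DomainException(404, code, message);
    }

    static DomainException conflict(ErrorCode code, String message) {
        return new DomainException(409, code, message);
    }

    int status() { return status; }

    ErrorCode code() { return code; }
}

// Hypothetical service-style helper: signals a missing document with a DomainException, not a raw exception.
class DocumentLookup {
    static String requireTitle(Map<String, String> titlesById, String id) {
        String title = titlesById.get(id);
        if (title == null) {
            throw DomainException.notFound(ErrorCode.DOCUMENT_NOT_FOUND, "Document not found: " + id);
        }
        return title;
    }
}
```

Carrying the `ErrorCode` on the exception is what lets the frontend map a failure to a translated message, which is why a new code has to be mirrored in `errors.ts` and the message files.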
--- @@ -233,79 +198,52 @@ frontend/src/routes/ ### API Client Pattern -All server-side API calls use the typed client from `$lib/api.server.ts`: +→ See [CONTRIBUTING.md §Frontend API client](./CONTRIBUTING.md#frontend-api-client) -```typescript -const api = createApiClient(fetch); -const result = await api.GET('/api/persons/{id}', { params: { path: { id } } }); - -// Always check via response.ok, NOT result.error -if (!result.response.ok) { - const code = (result.error as unknown as { code?: string })?.code; - throw error(result.response.status, getErrorMessage(code)); -} -return { person: result.data! }; -``` - -Key rules: -- Use `!result.response.ok` for error checking (not `if (result.error)` — this breaks when the spec has no error responses defined) -- Cast errors as `result.error as unknown as { code?: string }` to extract the backend error code -- Use `result.data!` (non-null assertion) after an ok check — TypeScript knows it's present - -For multipart/form-data endpoints (file uploads), bypass the typed client and use raw `fetch`: -```typescript -const res = await fetch(`${baseUrl}/api/documents`, { method: 'POST', body: formData }); -``` +**LLM reminder:** check `!result.response.ok` (not `result.error` — breaks when spec has no error responses defined); cast errors as `result.error as unknown as { code?: string }`; use `result.data!` after an ok check. ### Form Actions Pattern ```typescript // +page.server.ts export const actions = { - default: async ({ request, fetch }) => { - const formData = await request.formData(); - const name = formData.get('name') as string; // cast needed — FormData returns FormDataEntryValue - // ... - return fail(400, { error: 'message' }); // on error - throw redirect(303, '/target'); // on success - } + default: async ({ request, fetch }) => { + const formData = await request.formData(); + const name = formData.get("name") as string; + // ... 
+ return fail(400, { error: "message" }); // on error + throw redirect(303, "/target"); // on success + }, }; ``` ### Date Handling -- **Forms**: German format `dd.mm.yyyy` with auto-dot insertion via `handleDateInput()`. A hidden `` sends ISO format to the backend. -- **Display**: Always use `Intl.DateTimeFormat` with `T12:00:00` suffix to prevent UTC timezone off-by-one: - ```typescript - new Intl.DateTimeFormat('de-DE', { day: 'numeric', month: 'long', year: 'numeric' }) - .format(new Date(doc.documentDate + 'T12:00:00')) - ``` +→ See [CONTRIBUTING.md §Date handling](./CONTRIBUTING.md#date-handling) + +**LLM reminder:** always append `T12:00:00` when constructing `new Date()` from an ISO date string — prevents UTC timezone off-by-one errors. ### UI Component Library -Custom components in `src/lib/components/`: - -| Component | Props | Description | -|---|---|---| -| `PersonTypeahead` | `name`, `label`, `value`, `initialName`, `on:change` | Single-person selector with typeahead dropdown | -| `PersonMultiSelect` | `selectedPersons` (bind) | Chip-based multi-person selector | -| `TagInput` | `tags` (bind), `allowCreation?`, `on:change` | Tag chip input with typeahead | +→ See per-domain READMEs: [`frontend/src/lib/person/README.md`](./frontend/src/lib/person/README.md), [`frontend/src/lib/tag/README.md`](./frontend/src/lib/tag/README.md), [`frontend/src/lib/document/README.md`](./frontend/src/lib/document/README.md), [`frontend/src/lib/shared/README.md`](./frontend/src/lib/shared/README.md) ### Styling Conventions (Tailwind CSS 4) Brand color utilities (defined in `layout.css`): -| Class | Value | Usage | -|---|---|---| -| `brand-navy` | `#002850` | Primary text, buttons, headers | +| Class | Value | Usage | +| ------------ | --------- | -------------------------------- | +| `brand-navy` | `#002850` | Primary text, buttons, headers | | `brand-mint` | `#A6DAD8` | Accents, hover underlines, icons | -| `brand-sand` | `#E4E2D7` | Page background, card borders | +| 
`brand-sand` | `#E4E2D7` | Page background, card borders | Typography: + - `font-serif` (Merriweather) — body text, document titles, names - `font-sans` (Montserrat) — labels, metadata, UI chrome Card pattern for content sections: + ```svelte

Section Title

@@ -313,48 +251,19 @@ Card pattern for content sections:
``` -Save bar pattern — use **sticky full-bleed** for long forms (edit document), **card-style with `mt-4`** for short forms (new person): -```svelte - -
- - -
-``` - -Back button pattern — use the shared `` component from `$lib/components/BackButton.svelte`: -```svelte - - - -``` -The component calls `history.back()` so the user returns to wherever they came from. Label is always "Zurück" (no contextual suffix — destination is unknown). Touch target ≥ 44px and focus ring are built in. Do not use a static `` for back navigation. - -Subtle action link (e.g. "new document/person"): -```svelte - - - Neues Dokument - -``` +Back button pattern — use the shared `` component from `$lib/shared/primitives/BackButton.svelte`. Do not use a static `` for back navigation. ### Error Handling (Frontend) -`src/lib/errors.ts` mirrors the backend `ErrorCode` enum and maps codes to Paraglide translation keys. When adding a new `ErrorCode` on the backend: -1. Add it to `ErrorCode.java` -2. Add it to the `ErrorCode` type in `errors.ts` -3. Add a `case` in `getErrorMessage()` -4. Add the translation key in `messages/de.json`, `en.json`, `es.json` +→ See [CONTRIBUTING.md §Error handling](./CONTRIBUTING.md#error-handling) + +**LLM reminder:** when adding a new `ErrorCode`: (1) add to `ErrorCode.java`, (2) add to `ErrorCode` type in `frontend/src/lib/shared/errors.ts`, (3) add a `case` in `getErrorMessage()`, (4) add i18n keys in `messages/{de,en,es}.json`. --- ## Infrastructure -The `docker-compose.yml` at the repo root orchestrates everything. A MinIO MC helper container runs at startup to create the `archive-documents` bucket. The backend container depends on both `db` and `minio` being healthy. - -Database migrations live in `backend/src/main/resources/db/migration/` (Flyway, SQL files named `V{n}__{description}.sql`). +→ See [docs/DEPLOYMENT.md](./docs/DEPLOYMENT.md) ## API Testing @@ -362,4 +271,4 @@ HTTP test files are in `backend/api_tests/` for use with the VS Code REST Client ## Dev Container -A `.devcontainer/` config is available (Java 21 + Node 24, ports 8080 and 3000 forwarded). 
Use VS Code's "Reopen in Container" for a pre-configured environment. +→ See [.devcontainer/README.md](./.devcontainer/README.md) diff --git a/backend/CLAUDE.md b/backend/CLAUDE.md index 41d3c372..69d2f154 100644 --- a/backend/CLAUDE.md +++ b/backend/CLAUDE.md @@ -11,7 +11,7 @@ Spring Boot 4.0 monolith serving the Familienarchiv REST API. Handles document m - **Server**: Jetty (not Tomcat — excluded in pom.xml) - **Data**: PostgreSQL 16, JPA/Hibernate, Spring Data JPA - **Migrations**: Flyway (SQL files in `src/main/resources/db/migration/`) -- **Security**: Spring Security, Spring Session JDBC, JWT tokens +- **Security**: Spring Security, Spring Session JDBC - **File Storage**: MinIO via AWS SDK v2 (S3-compatible) - **Spreadsheet Import**: Apache POI 5.5.0 (Excel/ODS) - **API Docs**: SpringDoc OpenAPI 3.x (`/v3/api-docs` — dev profile only) @@ -19,7 +19,7 @@ Spring Boot 4.0 monolith serving the Familienarchiv REST API. Handles document m ## Package Structure -Package-by-domain: each domain owns its controller, service, repository, entities, and DTOs. + ``` src/main/java/org/raddatz/familienarchiv/ @@ -43,31 +43,28 @@ src/main/java/org/raddatz/familienarchiv/ └── user/ # User domain — AppUser, UserGroup, UserService, auth controllers ``` -## Layering Rules (Strict) +For per-domain ownership and public surface, see each domain's `README.md`. -``` -Controller → Service → Repository → DB -``` +## Layering Rules -- **Controllers never call repositories directly.** -- **Services never reach into another domain's repository.** Call the other domain's service instead. - - ✅ `DocumentService` → `PersonService.getById()` → `PersonRepository` - - ❌ `DocumentService` → `PersonRepository` directly +→ See [docs/ARCHITECTURE.md §Layering rule](../docs/ARCHITECTURE.md#layering-rule) + +**LLM reminder:** controllers never call repositories directly; services never reach into another domain's repository — always call the other domain's service. 
## Key Entities -| Entity | Table | Key Relationships | -|---|---|---| -| `Document` | `documents` | ManyToOne sender (Person), ManyToMany receivers (Person), ManyToMany tags (Tag) | -| `Person` | `persons` | Referenced by documents as sender/receiver; name aliases table | -| `Tag` | `tag` | ManyToMany with documents via `document_tags`; self-referencing parent for tree | -| `AppUser` | `app_users` | ManyToMany groups (UserGroup) | -| `UserGroup` | `user_groups` | Has a `Set permissions` | -| `TranscriptionBlock` | `transcription_blocks` | Per-document, per-page text blocks with polygons | -| `DocumentAnnotation` | `document_annotations` | Free-form annotations on document pages | -| `Comment` | `document_comments` | Threaded comments with mentions | -| `Notification` | `notifications` | User notification feed | -| `OcrJob` / `OcrJobDocument` | `ocr_jobs`, `ocr_job_documents` | Batch OCR job tracking | +| Entity | Table | Key Relationships | +| --------------------------- | ------------------------------- | ------------------------------------------------------------------------------- | +| `Document` | `documents` | ManyToOne sender (Person), ManyToMany receivers (Person), ManyToMany tags (Tag) | +| `Person` | `persons` | Referenced by documents as sender/receiver; name aliases table | +| `Tag` | `tag` | ManyToMany with documents via `document_tags`; self-referencing parent for tree | +| `AppUser` | `app_users` | ManyToMany groups (UserGroup) | +| `UserGroup` | `user_groups` | Has a `Set permissions` | +| `TranscriptionBlock` | `transcription_blocks` | Per-document, per-page text blocks with polygons | +| `DocumentAnnotation` | `document_annotations` | Free-form annotations on document pages | +| `Comment` | `document_comments` | Threaded comments with mentions | +| `Notification` | `notifications` | User notification feed | +| `OcrJob` / `OcrJobDocument` | `ocr_jobs`, `ocr_job_documents` | Batch OCR job tracking | **`DocumentStatus` lifecycle:** `PLACEHOLDER → 
UPLOADED → TRANSCRIBED → REVIEWED → ARCHIVED` @@ -104,32 +101,15 @@ public class MyEntity { ## Error Handling -Use `DomainException` for all domain errors: +→ See [CONTRIBUTING.md §Error handling](../CONTRIBUTING.md#error-handling) -```java -DomainException.notFound(ErrorCode.DOCUMENT_NOT_FOUND, "...") -DomainException.forbidden("...") -DomainException.conflict(ErrorCode.IMPORT_ALREADY_RUNNING, "...") -DomainException.internal(ErrorCode.FILE_UPLOAD_FAILED, "...") -``` - -When adding a new `ErrorCode`: -1. Add to `ErrorCode.java` -2. Mirror in frontend `src/lib/errors.ts` -3. Add Paraglide translation key in `messages/{de,en,es}.json` +**LLM reminder:** use `DomainException.notFound/forbidden/conflict/internal()` — never throw raw exceptions from service methods. For simple controller validation (not domain logic), `ResponseStatusException` is acceptable: `throw new ResponseStatusException(HttpStatus.BAD_REQUEST, "…")`. When adding a new `ErrorCode`: add to `ErrorCode.java`, mirror in `frontend/src/lib/shared/errors.ts`, add i18n keys in `messages/{de,en,es}.json`. ## Security / Permissions -Use `@RequirePermission` on controller methods or classes: +→ See [docs/ARCHITECTURE.md §Permission system](../docs/ARCHITECTURE.md#permission-system) -```java -@RequirePermission(Permission.WRITE_ALL) -public Document updateDocument(...) { ... } -``` - -Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION` - -`PermissionAspect` checks the current user's `UserGroup.permissions` at runtime. +**LLM reminder:** `@RequirePermission(Permission.WRITE_ALL)` is **required** on every `POST`, `PUT`, `PATCH`, `DELETE` endpoint — not optional. Do not mix with Spring Security's `@PreAuthorize`. Available permissions: `READ_ALL`, `WRITE_ALL`, `ADMIN`, `ADMIN_USER`, `ADMIN_TAG`, `ADMIN_PERMISSION`, `ANNOTATE_ALL`, `BLOG_WRITE`. 
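To illustrate the `@RequirePermission` reminder above, here is a small reflective sketch of how such an annotation could be checked. It is an assumption-laden stand-in, not the project's `PermissionAspect`: the annotation definition (a single `Permission value()`), the `PermissionCheck` helper, and the controller methods are hypothetical, and the sketch does not model whether `ADMIN` implies other permissions.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Set;

// Permission names as listed in the docs.
enum Permission { READ_ALL, WRITE_ALL, ADMIN, ADMIN_USER, ADMIN_TAG, ADMIN_PERMISSION, ANNOTATE_ALL, BLOG_WRITE }

// Hypothetical definition; the real annotation may accept several permissions.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.METHOD, ElementType.TYPE})
@interface RequirePermission {
    Permission value();
}

// Hypothetical controller: the mutating method carries the annotation, the read method does not.
class DocumentController {
    @RequirePermission(Permission.WRITE_ALL)
    public String updateDocument(String id) { return "updated " + id; }

    public String getDocument(String id) { return "document " + id; }
}

// Reflective stand-in for the AOP aspect: method-level annotation first, then class-level.
class PermissionCheck {
    static boolean isAllowed(Class<?> controller, String methodName, Set<Permission> granted) {
        for (Method method : controller.getDeclaredMethods()) {
            if (!method.getName().equals(methodName)) {
                continue;
            }
            RequirePermission required = method.getAnnotation(RequirePermission.class);
            if (required == null) {
                required = controller.getAnnotation(RequirePermission.class);
            }
            // No annotation at either level means no permission is demanded.
            return required == null || granted.contains(required.value());
        }
        return false; // unknown method name: deny
    }
}
```

The sketch also shows why the reminder matters: a mutating endpoint that lacks the annotation would pass the check for any caller, so forgetting it fails open rather than closed.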
## OCR Integration @@ -141,49 +121,35 @@ The backend orchestrates OCR by calling the Python `ocr-service` microservice vi - `OcrBatchService` — handles batch/job workflows - `OcrAsyncRunner` — async execution of OCR jobs +For ocr-service internals, see [`ocr-service/README.md`](../ocr-service/README.md). + ## API Testing HTTP test files in `backend/api_tests/` for the VS Code REST Client extension. ## How to Run -### Local Development - ```bash cd backend -# Run with dev profile (requires PostgreSQL + MinIO running via docker-compose) -./mvnw spring-boot:run - -# Build JAR (with tests) -./mvnw clean package - -# Build JAR skipping tests +./mvnw spring-boot:run # Run with dev profile (requires PostgreSQL + MinIO) +./mvnw clean package # Build JAR (with tests) ./mvnw clean package -DskipTests - -# Run all tests -./mvnw test - -# Run a single test class -./mvnw test -Dtest=ClassName - -# Run with coverage (JaCoCo) -./mvnw clean verify +./mvnw test # Run all tests +./mvnw test -Dtest=ClassName # Run a single test class +./mvnw clean verify # Run with JaCoCo coverage report ``` -### OpenAPI TypeScript Generation +**OpenAPI / TypeScript type generation:** -1. Build and start backend with `--spring.profiles.active=dev` -2. In `frontend/`, run: `npm run generate:api` +1. Start backend with `--spring.profiles.active=dev` +2. In `frontend/`: `npm run generate:api` -### Profiles - -- **dev** (default): Enables OpenAPI, dev configs, e2e seeds -- **prod**: Production profile — no dev endpoints +**LLM reminder:** always regenerate types after any model or endpoint change — the most common cause of "where did my TypeScript type go?" 
## Testing - Unit tests: Mockito + JUnit, pure in-memory - Slice tests: `@WebMvcTest`, `@DataJpaTest` with Testcontainers PostgreSQL - Integration tests: Full Spring context with Testcontainers -- Coverage gate: 88% branch coverage overall (JaCoCo) +- Coverage gate: 88% branch coverage (JaCoCo) diff --git a/docs/CLAUDE.md b/docs/CLAUDE.md index 96d035e2..f9afcb17 100644 --- a/docs/CLAUDE.md +++ b/docs/CLAUDE.md @@ -1,97 +1,5 @@ -# Docs — Familienarchiv +# docs/ -## Overview +→ See [docs/README.md](./README.md) for the folder structure and documentation guide. -Project documentation organized into four categories: architecture decision records (ADRs), system architecture diagrams, infrastructure runbooks, and detailed UI/UX specifications. - -## Folder Structure - -``` -docs/ -├── adr/ # Architecture Decision Records -├── architecture/ # C4 model diagrams and system architecture docs -├── infrastructure/ # Deployment, CI/CD, and ops guides -├── specs/ # UI/UX feature specifications (HTML) -├── app-analysis-*.md # Application analysis reports -├── mail.md # Mail system documentation -├── security-guide.md # Security policies and hardening guide -├── STYLEGUIDE.md # Coding and design style guide -├── TODO-backend.md # Backend backlog -└── TODO-frontend.md # Frontend backlog -``` - -## ADR (`adr/`) - -Architecture Decision Records capture major technical decisions and their rationale. 
- -| ADR | Title | Status | -|---|---|---| -| `001-ocr-python-microservice.md` | OCR as a separate Python container | Accepted | -| `002-polygon-jsonb-storage.md` | Polygon coordinates in JSONB columns | Accepted | -| `003-chronik-unified-activity-feed.md` | Unified activity feed (Chronik) | Accepted | - -When making a significant architectural change (new service, data model change, technology swap), write a new ADR following the format: -- Status (Proposed / Accepted / Deprecated / Superseded) -- Context (forces at play) -- Decision (what we decided) -- Consequences (trade-offs) -- Alternatives Considered (table format) - -## Architecture (`architecture/`) - -Contains C4 model diagrams describing the system at different zoom levels: - -- **Context diagram** — How Familienarchiv fits into the user and system ecosystem -- **Container diagram** — The high-level technology choices (Spring Boot, SvelteKit, PostgreSQL, MinIO, OCR service) -- **Component diagram** — Major structural components within the backend - -Written in Markdown with embedded Mermaid or PlantUML diagrams (`c4-diagrams.md`). - -## Infrastructure (`infrastructure/`) - -Operational documentation for running Familienarchiv in production and CI. - -| Document | Purpose | -|---|---| -| `ci-gitea.md` | Gitea CI/CD pipeline configuration | -| `production-compose.md` | Production Docker Compose setup | -| `s3-migration.md` | Migrating documents between S3 buckets | -| `self-hosted-catalogue.md` | Self-hosted software catalogue | - -## Specs (`specs/`) - -High-fidelity UI/UX specifications written as standalone HTML files. These are design documents that describe exact layout, interactions, and responsive behavior before implementation. 
- -Each spec typically includes: -- Visual mockups with CSS-in-HTML styling -- Interaction flows and state transitions -- Responsive breakpoint behavior -- Accessibility requirements - -Examples of active spec areas: -- Document detail page (`document-topbar-*.html`, `documents-page-spec.html`) -- Admin interfaces (`admin-redesign-*.html`, `admin-tag-overhaul.html`) -- Transcription workflows (`inline-transcription-*.html`, `annotation-transcription-*.html`) -- Dashboard and activity feeds (`dashboard-*.html`, `chronik-spec.html`) -- OCR admin (`ocr-admin-spec.html`) - -## How to Use - -1. **Before implementing a feature**, check `specs/` for an existing specification. -2. **When proposing a new architecture**, draft an ADR in `adr/` and discuss before coding. -3. **When deploying**, follow `infrastructure/production-compose.md`. -4. **Keep TODO files updated** — they serve as lightweight backlogs. - -## Style Guide - -`STYLEGUIDE.md` covers: -- Code formatting and linting rules -- Component naming conventions -- Color palette and typography -- Accessibility standards (WCAG 2.1 AA) - -## Contributing - -- ADRs should be sequential (`NNN-descriptive-name.md`). -- Specs should be self-contained HTML files viewable in a browser. -- Infrastructure docs should include copy-pasteable commands. +**LLM reminder:** ADRs are sequential — use the next number after the highest existing one in `docs/adr/`. When making a significant architectural change (new service, data model change, technology swap), write a new ADR before implementing. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..cf95abb7 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,85 @@ +# docs/ + +Project documentation organised into four categories: architecture decision records (ADRs), system architecture diagrams, infrastructure runbooks, and detailed UI/UX specifications. 
+ +## Folder structure + +``` +docs/ +├── adr/ # Architecture Decision Records +├── architecture/ # C4 model diagrams and system architecture docs +├── infrastructure/ # Deployment, CI/CD, and ops guides +├── specs/ # UI/UX feature specifications (HTML) +├── ARCHITECTURE.md # Human-readable architecture overview (DOC-2) +├── DEPLOYMENT.md # Day-1 checklist and operational reference (DOC-5) +├── GLOSSARY.md # Domain terminology (DOC-3) +├── security-guide.md # Security policies and hardening guide +└── STYLEGUIDE.md # Coding and design style guide +``` + +## ADR (`adr/`) + +Architecture Decision Records capture major technical decisions and their rationale. + +| ADR | Title | Status | +| -------------------------------------- | ------------------------------------ | -------- | +| `001-ocr-python-microservice.md` | OCR as a separate Python container | Accepted | +| `002-polygon-jsonb-storage.md` | Polygon coordinates in JSONB columns | Accepted | +| `003-chronik-unified-activity-feed.md` | Unified activity feed (Chronik) | Accepted | + +When making a significant architectural change (new service, data model change, technology swap), write a new ADR: + +- **Status** (Proposed / Accepted / Deprecated / Superseded) +- **Context** (forces at play) +- **Decision** (what we decided) +- **Consequences** (trade-offs) +- **Alternatives Considered** (table format) + +ADRs are sequential (`NNN-descriptive-name.md`). Do not reuse numbers. + +## Architecture (`architecture/`) + +Contains C4 model diagrams describing the system at different zoom levels: + +- **Context diagram** — How Familienarchiv fits into the user and system ecosystem +- **Container diagram** — The high-level technology choices (Spring Boot, SvelteKit, PostgreSQL, MinIO, OCR service) +- **Component diagram** — Major structural components within the backend + +Written in Markdown with embedded Mermaid diagrams (`c4-diagrams.md`). Gitea renders these automatically. 
+ +For the human-readable architecture narrative, see [`docs/ARCHITECTURE.md`](ARCHITECTURE.md). + +## Infrastructure (`infrastructure/`) + +Operational documentation for running Familienarchiv in production and CI. + +| Document | Purpose | +| -------------------------- | ---------------------------------------------------- | +| `ci-gitea.md` | Gitea CI/CD pipeline configuration | +| `production-compose.md` | Production Docker Compose setup and VPS provisioning | +| `s3-migration.md` | Migrating documents between S3 buckets | +| `self-hosted-catalogue.md` | Self-hosted software catalogue | + +For the day-1 deployment checklist, see [`docs/DEPLOYMENT.md`](DEPLOYMENT.md). + +## Specs (`specs/`) + +High-fidelity UI/UX specifications written as standalone HTML files. These are design documents describing exact layout, interactions, and responsive behavior before implementation. + +Each spec typically includes: + +- Visual mockups with CSS-in-HTML styling +- Interaction flows and state transitions +- Responsive breakpoint behavior +- Accessibility requirements + +Before implementing a feature, check `specs/` for an existing specification. + +## Style Guide + +[`docs/STYLEGUIDE.md`](STYLEGUIDE.md) covers: + +- Code formatting and linting rules +- Component naming conventions +- Color palette and typography +- Accessibility standards (WCAG 2.1 AA) diff --git a/frontend/CLAUDE.md b/frontend/CLAUDE.md index 47b2d1b9..0061686a 100644 --- a/frontend/CLAUDE.md +++ b/frontend/CLAUDE.md @@ -71,29 +71,13 @@ src/ └── ... # Other SvelteKit config files ``` +For per-domain component inventories, see the domain READMEs in `src/lib//README.md`. 
+ ## API Client Pattern -All server-side API calls use the typed client from `$lib/api.server.ts`: +→ See [CONTRIBUTING.md §Frontend API client](../CONTRIBUTING.md#frontend-api-client) -```typescript -const api = createApiClient(fetch); -const result = await api.GET('/api/persons/{id}', { params: { path: { id } } }); - -// Always check via response.ok, NOT result.error -if (!result.response.ok) { - const code = (result.error as unknown as { code?: string })?.code; - throw error(result.response.status, getErrorMessage(code)); -} -return { person: result.data! }; -``` - -Key rules: - -- Use `!result.response.ok` for error checking (not `if (result.error)` — breaks when spec has no error responses defined) -- Cast errors as `result.error as unknown as { code?: string }` to extract backend error code -- Use `result.data!` after an ok check - -For multipart/form-data (file uploads), bypass the typed client and use raw `fetch`. +**LLM reminder:** check `!result.response.ok` (not `result.error` — breaks when spec has no error responses); cast errors as `result.error as unknown as { code?: string }`; use `result.data!` after an ok check. For multipart/form-data (file uploads), bypass the typed client and use raw `fetch`. ## Form Actions Pattern @@ -102,7 +86,7 @@ For multipart/form-data (file uploads), bypass the typed client and use raw `fet export const actions = { default: async ({ request, fetch }) => { const formData = await request.formData(); - const name = formData.get('name') as string; + const name = formData.get('name') as string; // cast needed — FormData returns FormDataEntryValue // ... return fail(400, { error: 'message' }); // on error throw redirect(303, '/target'); // on success @@ -112,13 +96,9 @@ export const actions = { ## Date Handling -- **Forms**: German format `dd.mm.yyyy` with auto-dot insertion via `handleDateInput()`. A hidden `` sends ISO to the backend. 
-- **Display**: Always use `Intl.DateTimeFormat` with `T12:00:00` suffix to prevent UTC off-by-one: - ```typescript - new Intl.DateTimeFormat('de-DE', { day: 'numeric', month: 'long', year: 'numeric' }).format( - new Date(doc.documentDate + 'T12:00:00') - ); - ``` +→ See [CONTRIBUTING.md §Date handling](../CONTRIBUTING.md#date-handling) + +**LLM reminder:** always append `T12:00:00` when constructing `new Date()` from an ISO date string — prevents UTC timezone off-by-one errors. Forms use German `dd.mm.yyyy` format via `handleDateInput()` with a hidden ISO input. ## Styling Conventions (Tailwind CSS 4) @@ -146,15 +126,9 @@ Card pattern for content sections: ## Key UI Components -| Component | Location | Props | Description | -| -------------------- | ------------------------------ | --------------------------------------- | ------------------------------------------ | -| `PersonTypeahead` | `$lib/person/` | `name`, `label`, `value`, `initialName` | Single-person selector with typeahead | -| `PersonMultiSelect` | `$lib/person/` | `selectedPersons` (bind) | Chip-based multi-person selector | -| `TagInput` | `$lib/tag/` | `tags` (bind), `allowCreation?` | Tag chip input with typeahead | -| `PdfViewer` | `$lib/document/` | `url`, `annotations` | PDF rendering with annotation overlay | -| `TranscriptionBlock` | `$lib/document/transcription/` | `block`, `mode` | Read/edit transcription block | -| `DocumentTopBar` | `$lib/document/` | `document` | Responsive document metadata header | -| `BackButton` | `$lib/shared/primitives/` | — | Calls `history.back()`; 44 px touch target | +→ See per-domain READMEs: [`src/lib/person/README.md`](src/lib/person/README.md), [`src/lib/tag/README.md`](src/lib/tag/README.md), [`src/lib/document/README.md`](src/lib/document/README.md), [`src/lib/shared/README.md`](src/lib/shared/README.md) + +**LLM reminder:** `BackButton` is at `$lib/shared/primitives/BackButton.svelte` — use it for all back navigation; never a static ``. 
API client is at `$lib/shared/api.server`. ## How to Run diff --git a/ocr-service/CLAUDE.md b/ocr-service/CLAUDE.md index 72410f68..f628c60b 100644 --- a/ocr-service/CLAUDE.md +++ b/ocr-service/CLAUDE.md @@ -1,154 +1,7 @@ -# OCR Service — Familienarchiv +# OCR Service -## Overview +→ See [ocr-service/README.md](./README.md) for tech stack, architecture, endpoints, environment variables, local development, testing, and training. -Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java. +**LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only. -## Tech Stack - -- **Framework**: FastAPI 0.115.6 (Python 3.11) -- **OCR Engines**: - - **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting - - **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts -- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers -- **PDF Processing**: `pypdfium2` (rendering), `pillow` -- **Image Processing**: `opencv-python-headless`, `pyvips` -- **Spell Checking**: `pyspellchecker` -- **HTTP Client**: `httpx` - -## Architecture - -The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence. 
- -### Interface Contract - -**Request:** -```json -{ - "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...", - "scriptType": "HANDWRITING_KURRENT", - "language": "de" -} -``` - -**Response:** Array of `OcrBlock` objects: -```json -[ - { - "pageNumber": 0, - "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04, - "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]], - "text": "Sehr geehrter Herr ..." - } -] -``` - -Coordinates are normalized (0-1) relative to page dimensions. - -### File Structure - -``` -ocr-service/ -├── main.py # FastAPI app, endpoints, request handling -├── models.py # Pydantic models (OcrRequest, OcrBlock) -├── engines/ -│ ├── __init__.py -│ ├── kraken.py # Kraken engine wrapper (Kurrent models) -│ └── surya.py # Surya engine wrapper (typewritten/Latin) -├── preprocessing.py # Image preprocessing (CLAHE, deskew, denoise) -├── confidence.py # Confidence scoring and thresholding -├── spell_check.py # Post-OCR spell correction -├── ensure_blla_model.py # Model download / verification helper -├── dictionaries/ # Historical word lists for spell checking -├── requirements.txt # Python dependencies -├── Dockerfile # Production container image -└── entrypoint.sh # Container startup script -``` - -### Key Endpoints - -| Endpoint | Method | Description | -|---|---|---| -| `/health` | GET | Returns 200 only after models are loaded | -| `/ocr` | POST | Extract text blocks from a PDF URL | -| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events | -| `/training/submit` | POST | Submit training data for model fine-tuning | - -### Environment Variables - -| Variable | Default | Description | -|---|---|---| -| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file | -| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints | -| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts | -| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for 
Kurrent scripts | -| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size | -| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size | -| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit | -| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size | -| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) | -| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts | - -## How to Run - -### Local Development (Python venv) - -```bash -cd ocr-service -python -m venv .venv -source .venv/bin/activate - -# Install PyTorch CPU first (saves ~2 GB vs CUDA) -pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu - -# Install remaining dependencies -pip install -r requirements.txt - -# Run development server -fastapi dev main.py --host 0.0.0.0 --port 8000 - -# Or production mode -uvicorn main:app --host 0.0.0.0 --port 8000 -``` - -### Docker (via docker-compose) - -The OCR service is included in the root `docker-compose.yml`: - -```bash -docker-compose up -d ocr-service -``` - -The container: -- Exposes port 8000 internally (not mapped to host by default) -- Mounts `ocr_models` and `ocr_cache` volumes for persistence -- Has a 120-second startup grace period for model loading -- Memory limit: 12 GB - -### Model Downloads - -Use the helper script to download Kraken models: - -```bash -./scripts/download-kraken-models.sh -``` - -Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally. - -## Testing - -Only a subset of tests can run without the full ML stack: - -```bash -cd ocr-service -pip install pytest pytest-asyncio pyspellchecker - -# No ML required — pure logic tests -python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v -``` - -Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv. 
- -## Training - -The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported. +**LLM reminder:** `ALLOWED_PDF_HOSTS` must never be set to `*` — that opens SSRF. The default (`minio,localhost,127.0.0.1`) is correct for dev. diff --git a/scripts/CLAUDE.md b/scripts/CLAUDE.md index 6fcc1d0c..4ed2f0fc 100644 --- a/scripts/CLAUDE.md +++ b/scripts/CLAUDE.md @@ -1,144 +1,5 @@ -# Scripts — Familienarchiv +# scripts/ -## Overview +→ See [scripts/README.md](./README.md) for the full list of scripts, their purpose, and usage. -Utility scripts for development, data management, model downloads, and database operations. These are standalone shell and Python scripts used outside the normal application runtime. - -## Scripts - -### `reset-db.sh` -**Purpose**: Hard-reset the development database, wiping all documents, persons, tags, and related data. - -**Usage:** -```bash -./scripts/reset-db.sh -# Type 'yes' to confirm -``` - -**What it truncates:** -- `transcription_block_versions` -- `transcription_blocks` -- `comment_mentions` -- `document_comments` -- `document_annotations` -- `document_versions` -- `notifications` -- `documents` -- `person_name_aliases` -- `persons` -- `tag` - -> ⚠️ **Destructive operation** — only for development! - ---- - -### `rebuild-frontend.sh` -**Purpose**: Force a clean rebuild of the frontend Docker container. - -**Usage:** -```bash -./scripts/rebuild-frontend.sh -``` - ---- - -### `download-kraken-models.sh` -**Purpose**: Download Kraken HTR models for German Kurrent and Sütterlin scripts. - -**Usage:** -```bash -./scripts/download-kraken-models.sh -``` - -Downloads models into `./ocr-service/models/` or the `ocr_models` Docker volume. Models are ~100-500 MB each. 
- ---- - -### `download-paperless.sh` -**Purpose**: Download exported documents from a Paperless-ngx instance. - -**Usage:** -```bash -./scripts/download-paperless.sh -``` - -Requires environment variables or config for the Paperless API endpoint and token. - ---- - -### `flatten-paperless.sh` -**Purpose**: Flatten nested Paperless export directories into a single import-ready structure. - -**Usage:** -```bash -./scripts/flatten-paperless.sh -``` - ---- - -### `generate_data.py` -**Purpose**: Generate synthetic test data for development. - -**Usage:** -```bash -python scripts/generate_data.py -``` - -Generates fake documents, persons, and tags suitable for load testing or UI development. - ---- - -### `prepare_historical_dict.py` -**Purpose**: Build a historical German word dictionary for the OCR spell-checker. - -**Usage:** -```bash -python scripts/prepare_historical_dict.py -``` - -Processes raw word lists into the format expected by `ocr-service/spell_check.py`. - ---- - -### `schema.sql` -**Purpose**: Complete database schema dump for reference. - -**Note**: Flyway migrations in `backend/src/main/resources/db/migration/` are the source of truth for schema evolution. `schema.sql` is a snapshot for quick reference only. - ---- - -### `large-data.sql` -**Purpose**: Pre-seeded dataset with a large number of documents for performance testing. - -**Usage:** -```bash -# Import into PostgreSQL -docker exec -i archive-db psql -U archive_user -d family_archive_db < scripts/large-data.sql -``` - -## How to Use - -Most scripts should be run from the **repository root**: - -```bash -# Database reset -./scripts/reset-db.sh - -# Model download -./scripts/download-kraken-models.sh - -# Data generation -cd scripts && python generate_data.py -``` - -Ensure scripts are executable: -```bash -chmod +x scripts/*.sh -``` - -## Adding New Scripts - -1. Place the script in `scripts/` -2. Add a header comment describing purpose and usage -3. Make it executable (`chmod +x`) -4. 
Document it in this `CLAUDE.md` +**LLM reminder:** when adding a new script, document it in `scripts/README.md` (not here). diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 00000000..8c56ead3 --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,161 @@ +# scripts/ + +Utility scripts for development, data management, model downloads, and database operations. These are standalone shell and Python scripts used outside the normal application runtime. + +## Scripts + +### `reset-db.sh` + +**Purpose**: Hard-reset the development database, wiping all documents, persons, tags, and related data. + +**Usage:** + +```bash +./scripts/reset-db.sh +# Type 'yes' to confirm +``` + +**What it truncates:** + +- `transcription_block_versions` +- `transcription_blocks` +- `comment_mentions` +- `document_comments` +- `document_annotations` +- `document_versions` +- `notifications` +- `documents` +- `person_name_aliases` +- `persons` +- `tag` + +> ⚠️ **Destructive operation — only for development!** This wipes ALL data. Not reversible without a backup. + +--- + +### `rebuild-frontend.sh` + +**Purpose**: Force a clean rebuild of the frontend Docker container. + +**Usage:** + +```bash +./scripts/rebuild-frontend.sh +``` + +--- + +### `download-kraken-models.sh` + +**Purpose**: Download Kraken HTR models for German Kurrent and Sütterlin scripts. + +**Usage:** + +```bash +./scripts/download-kraken-models.sh +``` + +Downloads models into `./ocr-service/models/` or the `ocr_models` Docker volume. Models are ~100–500 MB each. + +--- + +### `download-paperless.sh` + +**Purpose**: Download exported documents from a Paperless-ngx instance. + +**Usage:** + +```bash +./scripts/download-paperless.sh +``` + +Requires environment variables or config for the Paperless API endpoint and token. + +--- + +### `flatten-paperless.sh` + +**Purpose**: Flatten nested Paperless export directories into a single import-ready structure. 
+ +**Usage:** + +```bash +./scripts/flatten-paperless.sh +``` + +--- + +### `generate_data.py` + +**Purpose**: Generate synthetic test data for development. + +**Usage:** + +```bash +python scripts/generate_data.py +``` + +Generates fake documents, persons, and tags suitable for load testing or UI development. + +--- + +### `prepare_historical_dict.py` + +**Purpose**: Build a historical German word dictionary for the OCR spell-checker. + +**Usage:** + +```bash +python scripts/prepare_historical_dict.py +``` + +Processes raw word lists into the format expected by `ocr-service/spell_check.py`. + +--- + +### `schema.sql` + +**Purpose**: Complete database schema dump for reference. + +**Note**: Flyway migrations in `backend/src/main/resources/db/migration/` are the source of truth for schema evolution. `schema.sql` is a snapshot for quick reference only. + +--- + +### `large-data.sql` + +**Purpose**: Pre-seeded dataset with a large number of documents for performance testing. + +**Usage:** + +```bash +# Import into PostgreSQL +docker exec -i archive-db psql -U archive_user -d family_archive_db < scripts/large-data.sql +``` + +## How to Use + +Most scripts should be run from the **repository root**: + +```bash +# Database reset +./scripts/reset-db.sh + +# Model download +./scripts/download-kraken-models.sh + +# Data generation +cd scripts && python generate_data.py +``` + +Ensure scripts are executable: + +```bash +chmod +x scripts/*.sh +``` + +## Adding New Scripts + +1. Place the script in `scripts/` +2. Add a header comment describing purpose and usage +3. Make it executable (`chmod +x`) +4. Document it in this `README.md`