docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7

Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6, so this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md

New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md

New `.devcontainer/README.md` with all configuration, usage, and limitations. `.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md

New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md

Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md

- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md

- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md

- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 23:33:41 +02:00
parent a5f4b0df31
commit e2c86626f7
11 changed files with 452 additions and 732 deletions
--- a/ocr-service/CLAUDE.md
+++ b/ocr-service/CLAUDE.md
@@ -1,154 +1,7 @@
-# OCR Service — Familienarchiv
+# OCR Service

-## Overview
+→ See [ocr-service/README.md](./README.md) for tech stack, architecture, endpoints, environment variables, local development, testing, and training.

-Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.
+**LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.

-## Tech Stack
-
- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
-  - **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
-  - **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`
-
-## Architecture
-
-The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.
-
-### Interface Contract
-
-**Request:**
-```json
-{
-  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
-  "scriptType": "HANDWRITING_KURRENT",
-  "language": "de"
-}
-```
-
-**Response:** Array of `OcrBlock` objects:
-```json
-[
-  {
-    "pageNumber": 0,
-    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
-    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
-    "text": "Sehr geehrter Herr ..."
-  }
-]
-```
-
-Coordinates are normalized (0-1) relative to page dimensions.
-
-### File Structure
-
-```
-ocr-service/
-├── main.py                  # FastAPI app, endpoints, request handling
-├── models.py                # Pydantic models (OcrRequest, OcrBlock)
-├── engines/
-│   ├── __init__.py
-│   ├── kraken.py            # Kraken engine wrapper (Kurrent models)
-│   └── surya.py             # Surya engine wrapper (typewritten/Latin)
-├── preprocessing.py         # Image preprocessing (CLAHE, deskew, denoise)
-├── confidence.py            # Confidence scoring and thresholding
-├── spell_check.py           # Post-OCR spell correction
-├── ensure_blla_model.py     # Model download / verification helper
-├── dictionaries/            # Historical word lists for spell checking
-├── requirements.txt         # Python dependencies
-├── Dockerfile               # Production container image
-└── entrypoint.sh            # Container startup script
-```
-
-### Key Endpoints
-
-| Endpoint | Method | Description |
-|---|---|---|
-| `/health` | GET | Returns 200 only after models are loaded |
-| `/ocr` | POST | Extract text blocks from a PDF URL |
-| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
-| `/training/submit` | POST | Submit training data for model fine-tuning |
-
-### Environment Variables
-
-| Variable | Default | Description |
-|---|---|---|
-| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
-| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
-| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
-| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
-| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
-| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
-| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
-| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
-| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
-| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
-
-## How to Run
-
-### Local Development (Python venv)
-
-```bash
-cd ocr-service
-python -m venv .venv
-source .venv/bin/activate
-
-# Install PyTorch CPU first (saves ~2 GB vs CUDA)
-pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu
-
-# Install remaining dependencies
-pip install -r requirements.txt
-
-# Run development server
-fastapi dev main.py --host 0.0.0.0 --port 8000
-
-# Or production mode
-uvicorn main:app --host 0.0.0.0 --port 8000
-```
-
-### Docker (via docker-compose)
-
-The OCR service is included in the root `docker-compose.yml`:
-
-```bash
-docker-compose up -d ocr-service
-```
-
-The container:
- Exposes port 8000 internally (not mapped to host by default)
- Mounts `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB
-
-### Model Downloads
-
-Use the helper script to download Kraken models:
-
-```bash
-./scripts/download-kraken-models.sh
-```
-
-Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally.
-
-## Testing
-
-Only a subset of tests can run without the full ML stack:
-
-```bash
-cd ocr-service
-pip install pytest pytest-asyncio pyspellchecker
-
-# No ML required — pure logic tests
-python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
-```
-
-Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv.
-
-## Training
-
-The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
+`ALLOWED_PDF_HOSTS` must never be set to `*` — that opens SSRF. The default (`minio,localhost,127.0.0.1`) is correct for dev.
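
Review note on the `ALLOWED_PDF_HOSTS` reminder kept in this file: the sketch below illustrates why a `*` value would defeat the protection. It assumes the variable is a comma-separated list of hostnames matched exactly against the host of the incoming `pdfUrl`. The names `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical and not taken from `main.py`; the real service may validate URLs differently.

```python
from urllib.parse import urlparse

# Default taken from the environment-variable table removed in this diff.
DEFAULT_ALLOWED_PDF_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str) -> frozenset[str]:
    """Split a comma-separated host list and refuse the '*' wildcard."""
    hosts = frozenset(h.strip().lower() for h in raw.split(",") if h.strip())
    if "*" in hosts:
        # A wildcard would let callers make the service fetch any URL (SSRF).
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return hosts


def is_allowed_pdf_url(url: str, allowed: frozenset[str]) -> bool:
    """Accept only http(s) URLs whose host is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname  # already lowercased by urlparse; None if missing
    return host is not None and host in allowed
```

With the default list, a presigned MinIO URL such as the one in the removed interface-contract example passes, while a request pointing at an arbitrary internal address or a `file://` path is rejected before any fetch happens.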