docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7

Processes all 7 CLAUDE.md files according to the 3-bucket classification.
Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md,
domain READMEs) are introduced by DOC-2/4/5/6 — this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md
New `scripts/README.md` with full script documentation (preserving the
⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md`
reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md
New `.devcontainer/README.md` with all configuration, usage, and limitations.
`.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md
New `docs/README.md` covering the folder structure, ADR guide, infrastructure
docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md
Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6).
Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md
- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md
- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client,
  Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md
- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Marcel
2026-05-05 23:33:41 +02:00
parent a5f4b0df31
commit e2c86626f7
11 changed files with 452 additions and 732 deletions


@@ -1,154 +1,7 @@
# OCR Service — Familienarchiv
# OCR Service
## Overview
→ See [ocr-service/README.md](./README.md) for tech stack, architecture, endpoints, environment variables, local development, testing, and training.
Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.
**LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.
## Tech Stack
- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
- **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
- **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`
## Architecture
The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.
### Interface Contract
**Request:**
```json
{
"pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
"scriptType": "HANDWRITING_KURRENT",
"language": "de"
}
```
**Response:** Array of `OcrBlock` objects:
```json
[
{
"pageNumber": 0,
"x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
"polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
"text": "Sehr geehrter Herr ..."
}
]
```
Coordinates are normalized (0-1) relative to page dimensions.
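As a rough illustration of the normalized coordinates, the sketch below scales the block from the response example into pixel space. The `to_pixels` helper and the 1000×2000 page size are assumptions made for illustration; they are not part of the service.

```python
from typing import TypedDict


class OcrBlock(TypedDict):
    """Shape of one response item, mirroring the JSON example above."""
    pageNumber: int
    x: float
    y: float
    width: float
    height: float
    polygon: list[list[float]]
    text: str


def to_pixels(block: OcrBlock, page_width: int, page_height: int) -> dict:
    """Scale normalized (0-1) geometry to pixel coordinates (hypothetical helper)."""
    return {
        "box": (
            round(block["x"] * page_width),
            round(block["y"] * page_height),
            round(block["width"] * page_width),
            round(block["height"] * page_height),
        ),
        "polygon": [
            (round(px * page_width), round(py * page_height))
            for px, py in block["polygon"]
        ],
    }


block: OcrBlock = {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ...",
}

# Assumed page size in pixels, chosen only to keep the arithmetic easy to follow.
pixels = to_pixels(block, page_width=1000, page_height=2000)
```

Because the values are relative to the page, the same block can be drawn on a render of any resolution by passing that render's width and height.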
### File Structure
```
ocr-service/
├── main.py # FastAPI app, endpoints, request handling
├── models.py # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│ ├── __init__.py
│ ├── kraken.py # Kraken engine wrapper (Kurrent models)
│ └── surya.py # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py # Confidence scoring and thresholding
├── spell_check.py # Post-OCR spell correction
├── ensure_blla_model.py # Model download / verification helper
├── dictionaries/ # Historical word lists for spell checking
├── requirements.txt # Python dependencies
├── Dockerfile # Production container image
└── entrypoint.sh # Container startup script
```
### Key Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
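A minimal sketch of how the two confidence thresholds from the table might be applied. It assumes the Kurrent threshold is selected when the request's `scriptType` is `HANDWRITING_KURRENT`; the helper names and the `TYPEWRITTEN` value are hypothetical, and the real logic in `confidence.py` may differ.

```python
import os
from typing import Mapping

# Defaults copied from the environment variable table.
DEFAULT_THRESHOLD = 0.3
DEFAULT_THRESHOLD_KURRENT = 0.5


def threshold_for(script_type: str, env: Mapping[str, str] = os.environ) -> float:
    """Pick the minimum confidence for a script type (hypothetical helper)."""
    if script_type == "HANDWRITING_KURRENT":
        return float(env.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", DEFAULT_THRESHOLD_KURRENT))
    return float(env.get("OCR_CONFIDENCE_THRESHOLD", DEFAULT_THRESHOLD))


def keep_confident(lines, script_type: str, env: Mapping[str, str] = os.environ):
    """Drop (text, confidence) pairs that fall below the threshold."""
    limit = threshold_for(script_type, env)
    return [(text, conf) for text, conf in lines if conf >= limit]


# Made-up recognition results, used only to show the filtering.
lines = [("Sehr geehrter Herr", 0.82), ("xq#", 0.41), ("~", 0.12)]

# "TYPEWRITTEN" is an assumed script type value; an empty env forces the defaults.
latin = keep_confident(lines, "TYPEWRITTEN", env={})
kurrent = keep_confident(lines, "HANDWRITING_KURRENT", env={})
```

With the defaults, the line scored 0.41 survives for Latin scripts but is dropped for Kurrent, which is the practical effect of the stricter 0.5 threshold.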
## How to Run
### Local Development (Python venv)
```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate
# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu
# Install remaining dependencies
pip install -r requirements.txt
# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000
# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```
### Docker (via docker-compose)
The OCR service is included in the root `docker-compose.yml`:
```bash
docker-compose up -d ocr-service
```
The container:
- Exposes port 8000 internally (not mapped to host by default)
- Mounts `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB
### Model Downloads
Use the helper script to download Kraken models:
```bash
./scripts/download-kraken-models.sh
```
Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally.
## Testing
Only a subset of tests can run without the full ML stack:
```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker
# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```
Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv.
## Training
The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
`ALLOWED_PDF_HOSTS` must never be set to `*` — that opens SSRF. The default (`minio,localhost,127.0.0.1`) is correct for dev.
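The reasoning behind that reminder can be sketched as a host allowlist check. This is an illustration under stated assumptions, not the service's actual code: `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical names, and rejecting `*` at startup is one possible safeguard rather than confirmed behavior.

```python
from urllib.parse import urlparse

# Default value taken from the environment variable table.
DEFAULT_ALLOWED = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str = DEFAULT_ALLOWED) -> frozenset[str]:
    """Split the comma-separated setting and refuse a wildcard entry."""
    hosts = frozenset(h.strip().lower() for h in raw.split(",") if h.strip())
    if "*" in hosts:
        # A wildcard would let callers make the service fetch any URL (SSRF).
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*'")
    return hosts


def is_allowed_pdf_url(url: str, allowed: frozenset[str]) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return (parsed.hostname or "").lower() in allowed


allowed = parse_allowed_hosts()
ok = is_allowed_pdf_url("http://minio:9000/archive-documents/abc.pdf", allowed)
# A cloud metadata address is a typical SSRF target and is not on the list.
blocked = is_allowed_pdf_url("http://169.254.169.254/latest/meta-data", allowed)
```

Comparing the parsed hostname rather than searching the raw URL string avoids false matches such as `http://evil.example/?minio`.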