docs(legibility): migrate CLAUDE.md rules into human docs — DOC-7
Processes all 7 CLAUDE.md files according to the 3-bucket classification. Migration targets (CONTRIBUTING.md, docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, domain READMEs) are introduced by DOC-2/4/5/6, so this PR must merge last.

### scripts/CLAUDE.md → scripts/README.md

New `scripts/README.md` with full script documentation (preserving the ⚠️ destructive-operation warning on reset-db.sh). `scripts/CLAUDE.md` reduced to a pointer + "document new scripts in README.md" reminder.

### .devcontainer/CLAUDE.md → .devcontainer/README.md

New `.devcontainer/README.md` with all configuration, usage, and limitations. `.devcontainer/CLAUDE.md` reduced to a single pointer line.

### docs/CLAUDE.md → docs/README.md

New `docs/README.md` covering the folder structure, ADR guide, infrastructure docs, and specs folder. `docs/CLAUDE.md` reduced to pointer + ADR reminder.

### ocr-service/CLAUDE.md

Reduced to pointer to `ocr-service/README.md` (content migrated in DOC-6). Kept LLM reminders: single-node constraint, ALLOWED_PDF_HOSTS SSRF risk.

### backend/CLAUDE.md

- Layering Rules → pointer to docs/ARCHITECTURE.md
- Error Handling → pointer to CONTRIBUTING.md + reminder
- Security/Permissions → pointer to docs/ARCHITECTURE.md + reminder
- Package Structure → tagged TODO post-REFACTOR-1
- Fixed errors.ts path to frontend/src/lib/shared/errors.ts
- Added ANNOTATE_ALL + BLOG_WRITE to permission list
- Key Entities, Entity Code Style, Services → kept (Bucket-2)

### root CLAUDE.md

- Stack, Infrastructure, Dev Container → pointers
- Layering Rules, Error Handling, Security, OpenAPI, API Client, Date Handling, UI Components, Frontend Error Handling → pointers + reminders
- Package Structure → tagged TODO post-REFACTOR-1
- Domain Model, Entity Code Style, Form Actions, Styling → kept (Bucket-2)

### frontend/CLAUDE.md

- API Client Pattern, Date Handling → pointers + reminders
- Key UI Components → pointer to domain READMEs
- Styling, Form Actions, How to Run, Vite Proxy, i18n → kept (Bucket-2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,154 +1,7 @@
# OCR Service — Familienarchiv
# OCR Service

## Overview
→ See [ocr-service/README.md](./README.md) for tech stack, architecture, endpoints, environment variables, local development, testing, and training.

Python FastAPI microservice that performs OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) on historical family documents. It exposes a simple HTTP API consumed by the Spring Boot backend. The service is stateless — all job tracking and business logic remain in Java.
**LLM reminder:** the OCR service is a **single-node container** — training reloads the model in-process, so multiple replicas cause model-state divergence (see ADR-001). All job tracking and business logic stay in Spring Boot; the Python service is stateless OCR only.

## Tech Stack

- **Framework**: FastAPI 0.115.6 (Python 3.11)
- **OCR Engines**:
  - **Surya** (`surya-ocr`) — Transformer-based, handles typewritten and modern Latin handwriting
  - **Kraken** (`kraken==7.0`) — Historical HTR model support, required for pre-1941 German Kurrent/Sütterlin scripts
- **ML**: PyTorch 2.7.1 (CPU-only), torchvision, transformers
- **PDF Processing**: `pypdfium2` (rendering), `pillow`
- **Image Processing**: `opencv-python-headless`, `pyvips`
- **Spell Checking**: `pyspellchecker`
- **HTTP Client**: `httpx`

## Architecture

The service is a single-node container (see ADR-001). OCR training reloads the model in-process after each run, so multiple replicas would cause training conflicts and model-state divergence.

### Interface Contract

**Request:**
```json
{
  "pdfUrl": "http://minio:9000/archive-documents/abc.pdf?presigned...",
  "scriptType": "HANDWRITING_KURRENT",
  "language": "de"
}
```

**Response:** Array of `OcrBlock` objects:
```json
[
  {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12,0.08],[0.88,0.09],[0.87,0.12],[0.13,0.11]],
    "text": "Sehr geehrter Herr ..."
  }
]
```

Coordinates are normalized (0-1) relative to page dimensions.
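
As a rough illustration of what normalized coordinates mean for a consumer, the sketch below scales the bounding box from the response example back to pixel space. The `OcrBlock` TypedDict only mirrors the JSON fields shown above; the `to_pixels` helper and the 1000×2000 page size are assumptions made for this example and are not part of the service.

```python
from typing import List, Tuple, TypedDict


class OcrBlock(TypedDict):
    """One response item, mirroring the fields of the JSON example."""
    pageNumber: int
    x: float
    y: float
    width: float
    height: float
    polygon: List[List[float]]
    text: str


def to_pixels(block: OcrBlock, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized bounding box to pixel units (hypothetical helper).

    Returns (x, y, width, height), each rounded to the nearest pixel.
    """
    return (
        round(block["x"] * page_width),
        round(block["y"] * page_height),
        round(block["width"] * page_width),
        round(block["height"] * page_height),
    )


# Values copied from the response example; the page size is an assumption.
example_block: OcrBlock = {
    "pageNumber": 0,
    "x": 0.12, "y": 0.08, "width": 0.76, "height": 0.04,
    "polygon": [[0.12, 0.08], [0.88, 0.09], [0.87, 0.12], [0.13, 0.11]],
    "text": "Sehr geehrter Herr ...",
}
pixel_box = to_pixels(example_block, page_width=1000, page_height=2000)
```

Because the values are fractions of the page, the same block can be drawn on a rendering of any resolution by passing that rendering's width and height.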

### File Structure

```
ocr-service/
├── main.py               # FastAPI app, endpoints, request handling
├── models.py             # Pydantic models (OcrRequest, OcrBlock)
├── engines/
│   ├── __init__.py
│   ├── kraken.py         # Kraken engine wrapper (Kurrent models)
│   └── surya.py          # Surya engine wrapper (typewritten/Latin)
├── preprocessing.py      # Image preprocessing (CLAHE, deskew, denoise)
├── confidence.py         # Confidence scoring and thresholding
├── spell_check.py        # Post-OCR spell correction
├── ensure_blla_model.py  # Model download / verification helper
├── dictionaries/         # Historical word lists for spell checking
├── requirements.txt      # Python dependencies
├── Dockerfile            # Production container image
└── entrypoint.sh         # Container startup script
```

### Key Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Returns 200 only after models are loaded |
| `/ocr` | POST | Extract text blocks from a PDF URL |
| `/ocr/stream` | POST | Streaming OCR with SSE-style progress events |
| `/training/submit` | POST | Submit training data for model fine-tuning |

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `KRAKEN_MODEL_PATH` | `/app/models/german_kurrent.mlmodel` | Path to Kraken model file |
| `TRAINING_TOKEN` | `""` | Bearer token required for training endpoints |
| `OCR_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for Latin scripts |
| `OCR_CONFIDENCE_THRESHOLD_KURRENT` | `0.5` | Minimum confidence for Kurrent scripts |
| `RECOGNITION_BATCH_SIZE` | `16` | Kraken recognition batch size |
| `DETECTOR_BATCH_SIZE` | `8` | Surya detector batch size |
| `OCR_CLAHE_CLIP_LIMIT` | `2.0` | CLAHE contrast enhancement limit |
| `OCR_CLAHE_TILE_SIZE` | `8` | CLAHE tile grid size |
| `OCR_MAX_CACHED_MODELS` | `2` | LRU model cache size (~500 MB each) |
| `ALLOWED_PDF_HOSTS` | `minio,localhost,127.0.0.1` | SSRF protection — allowed PDF URL hosts |
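
To make the two confidence thresholds concrete, here is a minimal sketch of how a threshold could be selected per script type. The variable names and defaults come from the table above. The `threshold_for` function, the rule that any script type containing `KURRENT` uses the Kurrent threshold, and the `TYPEWRITTEN` value in the usage lines are assumptions for illustration and may not match the service's actual code.

```python
import os
from typing import Mapping


def threshold_for(script_type: str, env: Mapping[str, str] = os.environ) -> float:
    """Pick the minimum confidence for a script type (hypothetical helper)."""
    # Assumption: script types mentioning KURRENT use the stricter Kurrent threshold.
    if "KURRENT" in script_type.upper():
        return float(env.get("OCR_CONFIDENCE_THRESHOLD_KURRENT", "0.5"))
    # Everything else falls back to the Latin-script threshold.
    return float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))


# Passing an explicit mapping keeps the example independent of the real environment.
kurrent_default = threshold_for("HANDWRITING_KURRENT", env={})
latin_default = threshold_for("TYPEWRITTEN", env={})
latin_override = threshold_for("TYPEWRITTEN", env={"OCR_CONFIDENCE_THRESHOLD": "0.45"})
```

Reading the values through a mapping parameter rather than `os.environ` directly keeps the lookup easy to exercise in the pure-logic tests that do not need the ML stack.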

## How to Run

### Local Development (Python venv)

```bash
cd ocr-service
python -m venv .venv
source .venv/bin/activate

# Install PyTorch CPU first (saves ~2 GB vs CUDA)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Run development server
fastapi dev main.py --host 0.0.0.0 --port 8000

# Or production mode
uvicorn main:app --host 0.0.0.0 --port 8000
```

### Docker (via docker-compose)

The OCR service is included in the root `docker-compose.yml`:

```bash
docker-compose up -d ocr-service
```

The container:
- Exposes port 8000 internally (not mapped to host by default)
- Mounts `ocr_models` and `ocr_cache` volumes for persistence
- Has a 120-second startup grace period for model loading
- Memory limit: 12 GB

### Model Downloads

Use the helper script to download Kraken models:

```bash
./scripts/download-kraken-models.sh
```

Models are stored in the `ocr_models` Docker volume or `./ocr-service/models/` locally.

## Testing

Only a subset of tests can run without the full ML stack:

```bash
cd ocr-service
pip install pytest pytest-asyncio pyspellchecker

# No ML required — pure logic tests
python -m pytest test_spell_check.py test_confidence.py test_sender_registry.py -v
```

Tests requiring PyTorch/Kraken/Surya (e.g., `test_engines.py`) must be run in the Docker container or a fully provisioned venv.

## Training

The service supports in-process model fine-tuning via Kraken's `ketos` training pipeline. Training endpoints require the `TRAINING_TOKEN` bearer token. After training completes, the model is reloaded in-process — this is why only a single replica is supported.
`ALLOWED_PDF_HOSTS` must never be set to `*` — that opens SSRF. The default (`minio,localhost,127.0.0.1`) is correct for dev.
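
The reminder about `ALLOWED_PDF_HOSTS` can be pictured as a host allowlist check applied before any PDF is fetched. The following is only a sketch of that idea using the standard library: `parse_allowed_hosts` and `is_allowed_pdf_url` are hypothetical names, and the decision to reject a `*` entry outright and to accept only `http`/`https` URLs is an assumption, not the service's documented behaviour.

```python
from typing import Set
from urllib.parse import urlsplit

# Default taken from the reminder above.
DEFAULT_ALLOWED_PDF_HOSTS = "minio,localhost,127.0.0.1"


def parse_allowed_hosts(raw: str = DEFAULT_ALLOWED_PDF_HOSTS) -> Set[str]:
    """Split a comma-separated host list and refuse a wildcard (hypothetical helper)."""
    hosts = {h.strip().lower() for h in raw.split(",") if h.strip()}
    if "*" in hosts:
        # Assumption: fail fast rather than silently allowing every host.
        raise ValueError("ALLOWED_PDF_HOSTS must not contain '*' (opens SSRF)")
    return hosts


def is_allowed_pdf_url(url: str, hosts: Set[str]) -> bool:
    """Return True only for http(s) URLs whose hostname is on the allowlist."""
    parts = urlsplit(url)
    hostname = (parts.hostname or "").lower()
    return parts.scheme in {"http", "https"} and hostname in hosts


hosts = parse_allowed_hosts()
ok = is_allowed_pdf_url("http://minio:9000/archive-documents/abc.pdf", hosts)
blocked = is_allowed_pdf_url("http://169.254.169.254/latest/meta-data", hosts)
```

Comparing the parsed hostname against an explicit set, instead of matching substrings of the URL, avoids bypasses such as `http://minio.evil.example/`.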