# scripts/

Utility scripts for development, data management, model downloads, and database operations. These are standalone shell and Python scripts, plus two SQL files, used outside the normal application runtime.

## Scripts

### `reset-db.sh`

**Purpose**: Hard-reset the development database, wiping all documents, persons, tags, and related data.

**Usage:**

```bash
./scripts/reset-db.sh
# Type 'yes' to confirm
```

**What it truncates:**

- `transcription_block_versions`
- `transcription_blocks`
- `comment_mentions`
- `document_comments`
- `document_annotations`
- `document_versions`
- `notifications`
- `documents`
- `person_name_aliases`
- `persons`
- `tag`

> ⚠️ **Destructive operation, for development only!** This wipes all data in the tables listed above and cannot be undone without a backup.
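
Because the reset cannot be undone without a backup, it may be worth taking a dump first. The following is a minimal sketch, not part of the repository: it assumes the development database is reachable through the same `archive-db` container, `archive_user` role, and `family_archive_db` database that the `large-data.sql` import command in this README uses. The `backup_db` function name and the output file name are hypothetical.

```bash
# Sketch: dump the dev database to a file before running reset-db.sh.
# Assumes container "archive-db", user "archive_user", and database
# "family_archive_db"; adjust these if your setup differs.
backup_db() {
  out="${1:-backup-$(date +%Y%m%d-%H%M%S).sql}"
  # pg_dump writes plain SQL to stdout; redirect it to a file on the host.
  if docker exec archive-db pg_dump -U archive_user -d family_archive_db > "$out"; then
    echo "Backup written to $out"
  else
    # Remove the empty or partial file so it is not mistaken for a backup.
    rm -f "$out"
    echo "Backup failed" >&2
    return 1
  fi
}

# Usage (from the repository root):
#   backup_db before-reset.sql && ./scripts/reset-db.sh
```

Chaining the two commands with `&&` means the reset only starts if the dump succeeded.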

---

### `rebuild-frontend.sh`

**Purpose**: Force a clean rebuild of the frontend Docker container.

**Usage:**

```bash
./scripts/rebuild-frontend.sh
```

---

### `download-kraken-models.sh`

**Purpose**: Download Kraken HTR models for German Kurrent and Sütterlin scripts.

**Usage:**

```bash
./scripts/download-kraken-models.sh
```

Downloads models into `./ocr-service/models/` or the `ocr_models` Docker volume. Models are ~100–500 MB each.

---

### `download-paperless.sh`

**Purpose**: Download exported documents from a Paperless-ngx instance.

**Usage:**

```bash
./scripts/download-paperless.sh
```

Requires environment variables or config for the Paperless API endpoint and token.
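
This README does not list the exact variable names the script reads. As a sketch of a pre-flight check, assuming the hypothetical names `PAPERLESS_URL` and `PAPERLESS_TOKEN` (check the script itself for the real ones), a small helper can fail early when a setting is missing:

```bash
# Sketch: report every missing variable, then fail if any was missing.
# PAPERLESS_URL and PAPERLESS_TOKEN are hypothetical names used for
# illustration only.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect variable lookup that also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env PAPERLESS_URL PAPERLESS_TOKEN && ./scripts/download-paperless.sh
```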

---

### `flatten-paperless.sh`

**Purpose**: Flatten nested Paperless export directories into a single import-ready structure.

**Usage:**

```bash
./scripts/flatten-paperless.sh
```

---

### `generate_data.py`

**Purpose**: Generate synthetic test data for development.

**Usage:**

```bash
python scripts/generate_data.py
```

Generates fake documents, persons, and tags suitable for load testing or UI development.

---

### `prepare_historical_dict.py`

**Purpose**: Build a historical German word dictionary for the OCR spell-checker.

**Usage:**

```bash
python scripts/prepare_historical_dict.py
```

Processes raw word lists into the format expected by `ocr-service/spell_check.py`.

---

### `schema.sql`

**Purpose**: Complete database schema dump for reference.

**Note**: Flyway migrations in `backend/src/main/resources/db/migration/` are the source of truth for schema evolution. `schema.sql` is a snapshot for quick reference only.

---

### `large-data.sql`

**Purpose**: Pre-seeded dataset with a large number of documents for performance testing.

**Usage:**

```bash
# Import into PostgreSQL
docker exec -i archive-db psql -U archive_user -d family_archive_db < scripts/large-data.sql
```
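
After the import, a quick row count can confirm that data actually landed. This is a sketch under the same assumptions as the import command above (container `archive-db`, user `archive_user`, database `family_archive_db`); it queries the `documents` table named in the `reset-db.sh` section, and the `count_documents` function name is hypothetical.

```bash
# Sketch: print the number of rows in the documents table.
count_documents() {
  # -t prints tuples only and -A disables alignment, so the output is
  # just the bare number.
  docker exec archive-db psql -U archive_user -d family_archive_db \
    -tA -c 'SELECT count(*) FROM documents;'
}

# Usage:
#   echo "documents: $(count_documents)"
```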

## How to Use

Most scripts should be run from the **repository root**:

```bash
# Database reset
./scripts/reset-db.sh

# Model download
./scripts/download-kraken-models.sh

# Data generation
python scripts/generate_data.py
```

Ensure scripts are executable:

```bash
chmod +x scripts/*.sh
```
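
To see which scripts still lack the executable bit before changing anything, a small loop is enough. This is a minimal sketch; the `list_non_executable` function name is hypothetical, and the directory defaults to `scripts` on the assumption that it is run from the repository root.

```bash
# Sketch: list *.sh files in a directory that are not executable.
list_non_executable() {
  dir="${1:-scripts}"
  for f in "$dir"/*.sh; do
    # Skip the unexpanded pattern when the directory has no *.sh files.
    [ -e "$f" ] || continue
    [ -x "$f" ] || echo "$f"
  done
}

# Usage:
#   list_non_executable scripts
```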

## Adding New Scripts

1. Place the script in `scripts/`
2. Add a header comment describing purpose and usage
3. Make it executable (`chmod +x`)
4. Document it in this `README.md`
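
The checklist above can be sketched as a starting skeleton. Everything in it is illustrative: `example-task.sh` is a hypothetical name, and the header layout is one possible convention rather than an established project standard.

```bash
#!/usr/bin/env bash
#
# example-task.sh (hypothetical name)
#
# Purpose: one-line description of what the script does.
# Usage:   ./scripts/example-task.sh [--help]

# Exit on errors and on use of unset variables.
set -eu

usage() {
  echo "Usage: ./scripts/example-task.sh [--help]"
}

main() {
  case "${1:-}" in
    -h|--help) usage ;;
    "")        echo "Running example task" ;;
    *)         usage >&2; return 2 ;;   # unknown argument
  esac
}

main "$@"
```

Keeping the purpose and usage in the header comment means the same text can be copied into this `README.md` when documenting the script.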