---
name: transcribe
description: Transcribe a document's PDF by visually analyzing each page, creating annotation-backed transcription blocks via the API with paragraph-level bounding boxes and OCR text.
---

# Transcribe — PDF-to-Transcription-Blocks Workflow

## Argument

The user provides:

1. A **document URL**, e.g. `http://localhost:5173/documents/{id}`. Extract the document UUID from the final path segment.
2. A **PDF file path**, e.g. `@import/C-1654.pdf`. This is the source file to read and transcribe.

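
The UUID extraction in item 1 can be sketched in shell as follows. `extract_doc_id` is a hypothetical helper, not part of the workflow's tooling, and it assumes the ID is a canonical 8-4-4-4-12 hexadecimal UUID in the final path segment. A query string, fragment, or trailing slash is stripped first, and the UUID in the example is made up.

```bash
# Hypothetical helper: print the document UUID found at the end of a document URL.
extract_doc_id() {
  local url id
  url="${1%%[?#]*}"   # drop any query string or fragment
  url="${url%/}"      # drop one trailing slash
  id="${url##*/}"     # keep the final path segment
  # Assumption: IDs use the canonical 8-4-4-4-12 hexadecimal UUID layout.
  if printf '%s\n' "$id" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
    printf '%s\n' "$id"
  else
    printf 'no document UUID in URL: %s\n' "$1" >&2
    return 1
  fi
}

# Example with a made-up UUID:
DOC_ID="$(extract_doc_id 'http://localhost:5173/documents/123e4567-e89b-12d3-a456-426614174000')"
```
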

---

## Phase 1 — Gather Context

1. **Read the PDF** using the Read tool to get the visual content of every page.
2. **Check the API.** The transcription blocks endpoint is:

   ```
   POST /api/documents/{documentId}/transcription-blocks
   ```

   with Basic Auth (`admin:admin123`) and a JSON body of this shape (angle-bracket values are placeholders, not literal JSON):

   ```json
   {
     "pageNumber": <1-based>,
     "x": <0-1 normalized>,
     "y": <0-1 normalized>,
     "width": <0-1 normalized>,
     "height": <0-1 normalized>,
     "text": "transcribed text",
     "label": "optional label or null"
   }
   ```

   The API examples in this workflow use `http://localhost:8080` as the base URL, not the port `5173` shown in the document URL.
3. **Check for existing blocks** with `GET /api/documents/{documentId}/transcription-blocks`. If blocks already exist, ask the user whether to delete them first or abort. Do not silently overwrite.

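
The existing-blocks check in step 3 might look like the sketch below. It assumes the `GET` endpoint returns a JSON array, so an empty document yields `[]`; the source does not describe the response shape. `fetch_blocks` and `blocks_exist` are hypothetical helper names.

```bash
# Fetch the current blocks for a document and print the raw response body.
# -f makes curl exit non-zero on HTTP errors instead of printing an error page.
# usage: fetch_blocks DOC_ID
fetch_blocks() {
  curl -sf -u admin:admin123 \
    "http://localhost:8080/api/documents/$1/transcription-blocks"
}

# Read a response body on stdin; succeed if it holds at least one block.
# Assumption: the endpoint returns a JSON array, so "no blocks" is `[]`.
blocks_exist() {
  local body
  body="$(tr -d '[:space:]')"
  [ -n "$body" ] && [ "$body" != "[]" ]
}

# Intended flow (requires the API to be running, so it is left commented out):
#   body="$(fetch_blocks "$DOC_ID")" || { echo "GET failed" >&2; exit 1; }
#   if printf '%s' "$body" | blocks_exist; then
#     echo "Blocks already exist: ask the user whether to delete them or abort."
#   fi
```

Keeping the fetch separate from the emptiness test means a failed request is reported as an error rather than mistaken for a document without blocks.
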
### Coordinate system

- All coordinates are **normalized 0-1 fractions** of the page width (`x`, `width`) and page height (`y`, `height`).
- `x` and `y` give the **top-left corner** of the annotation rectangle.
- Page numbers are **1-based** (the first page is `1`, the second is `2`).

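
If a block is first measured in pixels on a rendered page image, converting it to the normalized form is a pair of divisions per axis. The sketch below assumes pixel measurements taken from the page's top-left corner; `normalize_box` and its argument order are hypothetical, not part of the API.

```bash
# Hypothetical helper: convert a pixel rectangle to normalized 0-1 coordinates.
# usage: normalize_box PAGE_W PAGE_H LEFT TOP BOX_W BOX_H   (all in pixels)
# prints: x y width height, rounded to four decimals
normalize_box() {
  awk -v pw="$1" -v ph="$2" -v l="$3" -v t="$4" -v w="$5" -v h="$6" 'BEGIN {
    if (pw <= 0 || ph <= 0) exit 1   # refuse a degenerate page size
    printf "%.4f %.4f %.4f %.4f\n", l / pw, t / ph, w / pw, h / ph
  }'
}

# A 940x140 px box at (30, 40) on a 1000x2000 px page image:
normalize_box 1000 2000 30 40 940 140   # prints: 0.0300 0.0200 0.9400 0.0700
```
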

---

## Phase 2 — Visual Analysis & Segmentation

For each page of the PDF:

1. **Identify the script type**: typewritten, Kurrent/Sütterlin, Latin handwriting, mixed, printed, etc.
2. **Segment into logical blocks.** Each block is one visual paragraph or logical section:
   - Header / letterhead / date line
   - Salutation / greeting
   - Body paragraphs (split at natural paragraph breaks)
   - Closing / signature
   - Address fields (postcards)
   - Margin notes, annotations, stamps
   - Rotated text sections (note the rotation in the label)
3. **Estimate bounding boxes** for each block as normalized 0-1 coordinates. The rectangle should enclose all the text in that block, leaving only a small margin around it.
4. **Assign labels** to structural blocks:
   - `Briefkopf`: letterhead / header with date and location
   - `Anrede`: salutation line
   - `Gruss`: closing and signature
   - `Adresse`: address field (postcards)
   - `Fortsetzung (gedreht)`: rotated continuation text
   - `null`: regular body paragraphs (no label needed)

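
Because the boxes are estimated by eye, a quick sanity check can catch rectangles that spill off the page. The sketch below assumes a valid box has positive size and satisfies `x + width <= 1` and `y + height <= 1`. The source does not say whether the API enforces this, and `box_in_page` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if a normalized box lies entirely on the page.
# usage: box_in_page X Y WIDTH HEIGHT
box_in_page() {
  awk -v x="$1" -v y="$2" -v w="$3" -v h="$4" 'BEGIN {
    eps = 1e-9   # tolerate floating-point rounding at the page edge
    ok = (x >= 0 && y >= 0 && w > 0 && h > 0 && x + w <= 1 + eps && y + h <= 1 + eps)
    exit(ok ? 0 : 1)
  }'
}

if box_in_page 0.03 0.02 0.94 0.07; then
  echo "box fits on the page"
fi
```
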

---

## Phase 3 — Transcription

For each identified block, transcribe the text:

### Rules

- **Never guess.** If a word or passage is not clearly readable, use `[unleserlich]` as a placeholder.
- Preserve the original spelling and punctuation, and keep line breaks where they indicate structure (e.g. address lines, signature blocks). Do not "correct" old German spelling.
- For typewritten text with handwritten corrections or additions above or below the line, note them inline, e.g. `statt [unleserlich]`, or describe them in brackets: `[handschriftliche Ergänzung: ...]`.
- For Kurrent/Sütterlin script, be especially conservative. It is better to mark something `[unleserlich]` than to guess incorrectly. If an entire block is unreadable, use `[unleserlich - Kurrentschrift, kurze Beschreibung des Inhaltsbereichs]`, replacing the German placeholder ("short description of the content area") with an actual description.
- For rotated text, note the rotation in the label field.
- Use `\n` for line breaks within a block (e.g. multi-line addresses, signature blocks).

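
When the payload is assembled in shell, the `\n` rule means real line breaks in the transcription must be written as the two-character JSON escape, and quotes and backslashes need escaping too. The sketch below is one way to do that with `sed` and `awk`. `json_escape` and `block_payload` are hypothetical helpers, the sample text is a placeholder, and only backslash, double quote, tab, and newline are handled.

```bash
# Hypothetical helper: escape stdin for use inside a JSON string literal.
# Handles backslash, double quote, tab, and newline only.
json_escape() {
  local tab
  tab="$(printf '\t')"
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e "s/${tab}/\\\\t/g" |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Hypothetical helper: build one request body; the block text is read from stdin.
# usage: block_payload PAGE X Y WIDTH HEIGHT [LABEL]
block_payload() {
  local text label
  text="$(json_escape)"
  if [ -n "${6:-}" ]; then
    label="\"$(printf '%s' "$6" | json_escape)\""
  else
    label=null   # regular body paragraphs carry no label
  fi
  printf '{"pageNumber":%s,"x":%s,"y":%s,"width":%s,"height":%s,"text":"%s","label":%s}\n' \
    "$1" "$2" "$3" "$4" "$5" "$text" "$label"
}

# Placeholder text with one structural line break:
printf 'Zeile 1\nZeile 2 [unleserlich]' | block_payload 2 0.10 0.55 0.80 0.12
```

For arbitrary input, a JSON-aware tool such as `jq` is the sturdier choice, since it also covers the control characters this sketch ignores.
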
### Script-specific guidance

| Script | Expected confidence | Notes |
|--------|---------------------|-------|
| Typewritten (Schreibmaschine) | High: most words should be readable | Watch for corrections, strikethroughs, carbon copy artifacts |
| Latin handwriting | Medium: depends on the hand | Easier than Kurrent but still variable |
| Kurrent / Sütterlin | Low: expect heavy `[unleserlich]` usage | Angular strokes, long s (ſ), distinctive letter forms. Context helps: dates, place names, and salutations are easier to read |
| Mixed | Per section | Common on postcards: Latin address + Kurrent message |


---

## Phase 4 — Create Blocks via API

1. **Delete existing blocks** if the user approved it in Phase 1.
2. **Create blocks in reading order** using `curl` with Basic Auth:

   ```bash
   # DOC_ID is the document UUID taken from the document URL.
   curl -s -u admin:admin123 -X POST \
     "http://localhost:8080/api/documents/${DOC_ID}/transcription-blocks" \
     -H "Content-Type: application/json" \
     -d '{ "pageNumber": 1, "x": 0.03, "y": 0.02, "width": 0.94, "height": 0.07, "text": "...", "label": "Briefkopf" }'
   ```
3. Create blocks **page by page, top to bottom**. The API auto-assigns `sortOrder` incrementally, so the creation order becomes the stored order.
4. Verify that each response returns a valid block ID.

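
Step 4 can be automated by pulling the ID out of each response. The sketch below assumes a successful response is a JSON object with a string `id` field, which the source does not specify. `post_block` and `response_block_id` are hypothetical helpers, and a JSON-aware parser would be more robust than `sed`.

```bash
# Hypothetical helper: POST one JSON payload and print the raw response body.
# usage: post_block DOC_ID JSON_PAYLOAD
post_block() {
  curl -s -u admin:admin123 -X POST \
    "http://localhost:8080/api/documents/$1/transcription-blocks" \
    -H "Content-Type: application/json" \
    -d "$2"
}

# Read a response body on stdin and print the value of its "id" field.
# Assumption: the response looks like {"id":"<string>", ...}.
# Returns non-zero if no such field is present.
response_block_id() {
  local id
  id="$(sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)"
  [ -n "$id" ] && printf '%s\n' "$id"
}

# Intended flow (requires the API to be running, so it is left commented out):
#   response="$(post_block "$DOC_ID" "$payload")"
#   block_id="$(printf '%s' "$response" | response_block_id)" || {
#     echo "block creation failed: $response" >&2
#     exit 1
#   }
```
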

---

## Phase 5 — Summary

After all blocks are created, present a table:

| # | Page | Label | Readability | Content (truncated) |
|---|------|-------|-------------|---------------------|

Where readability is one of:

- **Klar** (clear): fully readable, no `[unleserlich]` markers
- **Teilweise** (partial): some `[unleserlich]` markers, majority readable
- **Schwer** (difficult): heavy `[unleserlich]` usage, only fragments readable
- **Unleserlich** (illegible): entire block could not be transcribed

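
The four levels can be approximated mechanically by counting markers. The sketch below is a heuristic, not part of the workflow. The cut-off between Teilweise and Schwer (markers making up at least half of the whitespace-separated tokens) is an assumption chosen for illustration, `readability` is a hypothetical helper, and the sample sentence is made up.

```bash
# Hypothetical helper: classify a block's transcription (read from stdin).
readability() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        total++
        if (index($i, "[unleserlich") > 0) marked++   # token opens a marker
      }
      text = text $0
    }
    END {
      gsub(/^[ \t]+|[ \t]+$/, "", text)
      # One marker spanning the whole block, e.g. "[unleserlich - Kurrentschrift, ...]".
      whole = (substr(text, 1, 12) == "[unleserlich" && index(text, "]") == length(text))
      if (marked == 0)             print "Klar"
      else if (whole)              print "Unleserlich"
      else if (marked * 2 < total) print "Teilweise"   # assumed cut-off: under half the tokens
      else                         print "Schwer"
    }'
}

# Made-up sample text:
printf 'Liebe [unleserlich] wir sind gut angekommen.' | readability   # prints: Teilweise
```
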
End with a note about the overall script type and any sections that would benefit from expert review.