docs(normalizer): README + seed overrides
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
38
tools/import-normalizer/README.md
Normal file
@@ -0,0 +1,38 @@
# Import Normalizer

Transforms the raw family-archive spreadsheets in `../../import/` into a clean canonical
dataset (`out/`) plus review reports (`review/`). See the spec:
`../../docs/import-migration/02-normalization-spec.md`.

## Setup
Requires **Python 3.12** (uses `enum.StrEnum`, which the standard library has provided since 3.11).
```bash
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
```

## Run
```bash
.venv/bin/python normalize.py
```
Outputs:
- `out/canonical-documents.xlsx`, `out/canonical-persons.xlsx`
- `review/*.csv` (residue to fix), `review/summary.txt` (grouped run stats, including the unknown-date rate)
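
To track progress between runs, it can help to count the residue rows left in each review file. The following is a minimal sketch, not part of `normalize.py`: `count_residue` is a hypothetical helper, and it assumes every `review/*.csv` starts with a single header row.

```python
import csv
from pathlib import Path


def count_residue(review_dir: str = "review") -> dict[str, int]:
    """Map each review CSV's filename to its number of data rows.

    Assumes the first row of every file is a header (an assumption of this
    sketch; check the actual files).
    """
    counts: dict[str, int] = {}
    for path in sorted(Path(review_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # Drop the header row; an empty file counts as zero residue.
        counts[path.name] = max(len(rows) - 1, 0)
    return counts
```

Calling `count_residue()` from the tool directory after each run and comparing the numbers gives a rough view of whether the overrides are shrinking the residue.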

## Iteration loop
1. **Run.** Read `review/summary.txt` for the health snapshot.
2. **Fix the residue** by editing the version-controlled override files under `overrides/`, then re-run. Repeat.

| Review file | What to do |
| --- | --- |
| `unparsed-dates.csv` | For each `raw` (sorted by frequency), fill `suggested_iso` + `suggested_precision`, then paste `raw,suggested_iso,suggested_precision` into `overrides/dates.csv` (header `raw,iso,precision`). |
| `unmatched-names.csv` | If `suggested_id` is right, copy `raw,suggested_id` into `overrides/names.csv`; else look up the correct id in `out/canonical-persons.xlsx` (the `person_id` column). |
| `ambiguous-receivers.csv` | A space-joined pair we refused to auto-split (e.g. `Ella Anita`). Decide and add a names override if it is really two people. |
| `index-file-mismatch.csv` | The `Datei` path disagrees with the index-derived filename; reconcile when the PDFs arrive. |
| `duplicate-index.csv`, `blank-index-rows.csv`, `skipped-x-suffix.csv` | Inspect; fix in the source spreadsheet if needed. |

**Valid `person_id` values** all come from the `person_id` column of `out/canonical-persons.xlsx`.
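
The two most common fixes in the table above can be scripted. The sketch below is one possible approach rather than part of the tool: `promote_dates` and `invalid_name_overrides` are hypothetical helpers. It assumes `unparsed-dates.csv` has columns named `raw`, `suggested_iso`, and `suggested_precision`, that `overrides/dates.csv` uses the header `raw,iso,precision` as stated in the table, and that `overrides/names.csv` has the header `raw,person_id` (this README does not confirm that header).

```python
import csv
from pathlib import Path


def promote_dates(review_csv: str, overrides_csv: str) -> int:
    """Append reviewed rows from unparsed-dates.csv to overrides/dates.csv.

    Only rows with both suggestion columns filled in are copied, and raw
    values that already have an override are skipped. Returns the number
    of rows appended.
    """
    target = Path(overrides_csv)
    known: set[str] = set()
    if target.exists():
        with target.open(newline="", encoding="utf-8") as handle:
            known = {row["raw"] for row in csv.DictReader(handle) if row.get("raw")}

    ready: list[list[str]] = []
    with open(review_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            raw = row.get("raw") or ""
            iso = (row.get("suggested_iso") or "").strip()
            precision = (row.get("suggested_precision") or "").strip()
            if raw and iso and precision and raw not in known:
                ready.append([raw, iso, precision])
                known.add(raw)  # also guards against repeats within the review file

    needs_header = not target.exists() or target.stat().st_size == 0
    with target.open("a", newline="", encoding="utf-8") as handle:
        # Unix line endings are a choice of this sketch, to keep diffs small.
        writer = csv.writer(handle, lineterminator="\n")
        if needs_header:
            writer.writerow(["raw", "iso", "precision"])
        writer.writerows(ready)
    return len(ready)


def invalid_name_overrides(names_csv: str, valid_ids: set[str]) -> list[tuple[str, str]]:
    """Return (raw, person_id) pairs whose id is not in valid_ids.

    Assumes a `raw,person_id` header, which is an assumption of this sketch.
    """
    with open(names_csv, newline="", encoding="utf-8") as handle:
        return [
            (row["raw"], row["person_id"])
            for row in csv.DictReader(handle)
            if row["person_id"] not in valid_ids
        ]
```

`valid_ids` would be the contents of the `person_id` column of `out/canonical-persons.xlsx`; reading that workbook needs a spreadsheet library, and which one `requirements.txt` provides is not stated here. The append in `promote_dates` also assumes the existing overrides file ends with a newline.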

## Tests
```bash
.venv/bin/python -m pytest tests/test_dates.py -v # run files individually (never the whole suite at once)
```
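
Because the files are meant to be run one at a time, a small driver can loop over them instead of invoking the whole suite. This is a sketch under assumptions: only `tests/test_dates.py` is confirmed above, so the `tests/test_*.py` pattern is a guess, and `per_file_commands` and `run_individually` are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def per_file_commands(test_dir: str = "tests", python: str = ".venv/bin/python") -> list[list[str]]:
    """Build one pytest command per test file, in sorted order."""
    return [
        [python, "-m", "pytest", path.as_posix(), "-v"]
        for path in sorted(Path(test_dir).glob("test_*.py"))
    ]


def run_individually(test_dir: str = "tests") -> int:
    """Run each file in its own pytest process.

    Stops at the first failing file and returns its exit code, else 0.
    """
    for command in per_file_commands(test_dir):
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode
    return 0
```

A script could end with `raise SystemExit(run_individually())` so that a failing file fails the whole run.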