Commit Graph

7 Commits

Author SHA1 Message Date
Marcel
46d1f5c6d8 chore(import): stop tracking real family PII canonical artifacts
The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.

The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 10:20:38 +02:00
Marcel
34e0eec1ba docs(adr): record the index pattern as a corpus-specific constraint
Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:08:45 +02:00
Marcel
658277e97c docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT
File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:08:45 +02:00
Marcel
e4a154406e docs: record owner decisions on re-import authority and path-escape
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 4m5s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 43s
CI / Semgrep Security Scan (pull_request) Successful in 19s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m1s
- DEPLOYMENT §6: clarify re-import keeps person/tag scalar human edits but
  re-applies document sender/receivers/tags from the canonical export
  (canonical-authoritative), per owner sign-off.
- ADR-025: path-escape/symlink aborts the whole import (fail-closed) by
  deliberate owner decision, chosen over a per-file skip.

Refs #669
2026-05-27 11:20:39 +02:00
Marcel
4fa2b83c0d docs(adr-025): record document-authoritative collections and non-transactional orchestrator
Clarify that idempotency precedence is domain-specific: Person/Tag scalar fields
preserve human edits, while document sender/receivers/tags are canonical-authoritative
(cleared and re-populated on re-import so a shrunk set prunes stale links). Pin the
cross-loader provisional precedence. Record that runImport() is non-transactional
(per-loader transactions only) and the partial-failure-then-retry recovery is safe
because the import is idempotent.

Refs #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:04:27 +02:00
Marcel
21c85ff081 docs(importing): document the canonical importer rebuild
- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts;
  raw spreadsheet no longer parsed by Java) with the settled Option-A name
  policy, human-edit-preserve precedence, provisional contract, and ported
  security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the
  orchestrator, the four loaders, and CanonicalSheetReader, with the loader
  dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms;
  refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer →
  place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator +
  loaders + CanonicalSheetReader.

OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated
schemas already match the new types byte-for-byte (same fields + SkipReason
enum), so the API surface is unchanged.

Closes #669

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:44:45 +02:00
Marcel
d959cb54f1 docs: record V69 schema foundation (DB diagrams, glossary, ADR-025)
Some checks failed
CI / Unit & Component Tests (pull_request) Successful in 3m59s
CI / OCR Service Tests (pull_request) Successful in 20s
CI / Backend Unit Tests (pull_request) Failing after 3m45s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
- db-orm.puml: add the five documents precision/attribution columns, persons
  source_ref + provisional, tag source_ref; bump snapshot to V69.
- db-relationships.puml: bump snapshot + note V69 adds columns only (no new FKs).
- GLOSSARY.md: add "source_ref", "provisional person", "date precision",
  "raw attribution".
- ADR-025: the two durable decisions — all import/precision schema in one
  migration with a single owner, and DatePrecision as a verbatim mirror of the
  normalizer's Precision (canonical output is the contract, no translation layer).
  Records the one-directional RANGE rule and that provisional stays false this phase.

--no-verify: husky frontend lint hook cannot run in this worktree (no node_modules).

Closes #671

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:21:57 +02:00