marcel/familienarchiv

feat(import): resolve PDFs directly by index, drop the file column (#686) #687

Merged

marcel merged 9 commits from feature/686-resolve-pdf-by-index into docs/import-migration

2026-05-27 22:08:47 +02:00

Author	SHA1	Message	Date
Marcel	ea38efc734	docs: drop remaining stale MassImportService/ExcelService references [CI: all checks successful] Replace the legacy raw-spreadsheet importer references left behind after #674 with the canonical import architecture (CanonicalImportOrchestrator + four loaders) and document #686 index-based PDF resolution. - l3-backend-3b: DocumentImporter now resolves PDF by index (importDir/<index>.pdf) with index validation + canonical-path containment + %PDF magic-byte check (no recursive walk / homoglyph file-path guards) - c4-diagrams.md: replace massImport/excelSvc components + their rels with an importOrch (CanonicalImportOrchestrator) component wired to doc/person/tag services; refresh adminCtrl and adminSystem descriptions - ARCHITECTURE.md: importing package row now describes the orchestrator + four loaders consuming canonical artifacts - TODO-backend.md: remove obsolete "MassImportService provides no status" item (service deleted; orchestrator already exposes import-status); update stale ExcelService test-coverage suggestion Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:30:40 +02:00
Marcel	7354c3d332	docs(adr): record the index pattern as a corpus-specific constraint [CI: all checks successful] Address PR #687 review concern (Elicit): add an ADR-025 Consequences entry noting INDEX_PATTERN accepts only the current corpus shape (<=4 Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited deliberately if the catalog scheme grows (5-letter prefix, digit-led id, non-Latin letter), since such rows would otherwise be skipped, not imported. Also records the ASCII-only \d intent. Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:20:21 +02:00
Marcel	b3f4fb2d8d	test(importing): pin regex reject-boundary + note untestable IO branch [CI: all checks successful] Address PR #687 review concerns on DocumentImporterTest: - Sara/Felix: add catalog-shape reject tests that pass every char pre-check but must fail INDEX_PATTERN: "J 0070" (space), "WXYZA-0001" (5 letters), "12-0001" (no letter prefix), "W-0001X" (uppercase X). Verified red against a weakened pattern, green against the real one, so the pattern branch (not the char guards) is now pinned. - Felix: restore the import java.io.OutputStream line (was over-deleted and patched with a fully-qualified name). - Sara: document why the resolvePdfByIndex getCanonicalPath IOException branch is intentionally left uncovered (no deterministic injection seam; the log.warn is the substantive fix). Adjust the two reflective resolvePdfByIndex calls for the new rowNumber parameter. Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:20:15 +02:00
Marcel	a1f96f60ac	feat(importing): log import-row breadcrumbs and distinguish skip outcomes [CI: all checks successful] Address PR #687 review concerns on DocumentImporter: - Tobias: thread a 1-based source row number into importRow so the "index rejected" skip log carries a breadcrumb (the row number, never the raw hostile index) for post-import triage. - Elicit: emit a distinct log when a valid index has no <index>.pdf on disk (normal PLACEHOLDER) so it is not conflated with a rejected index. - Nora: add a log.warn in resolvePdfByIndex's getCanonicalPath IOException branch so the quiet fail-safe skip surfaces in ops, distinct from the deliberate symlink-escape abort. - Felix: replace inline fully-qualified java.util.regex.Pattern with an import. - Nora: document that \d is intentionally ASCII-only (do not add UNICODE_CHARACTER_CLASS). Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:20:08 +02:00
Marcel	b1e83437ae	docs: drop stale MassImportService/ODS references from import deploy docs [CI: some checks failed; Unit & Component Tests failing, all other checks successful] The mass-import card no longer parses an ODS spreadsheet and MassImportService was deleted (#674); /import now holds the normalizer's canonical artifacts (canonical-*.xlsx + canonical-persons-tree.json) plus <index>.pdf files, read by the canonical importer. Fix the IMPORT_HOST_DIR descriptions in DEPLOYMENT.md and docker-compose.prod.yml accordingly. Refs #686	2026-05-27 21:14:38 +02:00
Marcel	74987062f4	docs(import): document index-based PDF resolution in ADR-025 and DEPLOYMENT [CI: all checks successful] File resolution is now by index (<index>.pdf), not the datei/file column. Update the ADR-025 security sub-decision and consequence (the recursive walk and file column are gone; a bad index skips its row with a loud SkipReason, a symlink-escape still aborts via the containment assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in the import dir). Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:03:57 +02:00
Marcel	508a5e555e	chore(normalizer): regenerate canonical-documents.xlsx without file column Regenerated from the source workbooks with the committed overrides; the export schema now has 16 columns (no file). canonical-persons.xlsx and canonical-tag-tree.xlsx were unchanged at the cell level (only openpyxl zip-byte churn) and were left untouched to keep the diff minimal. Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:03:00 +02:00
Marcel	4d55eee5c8	feat(importing): resolve import PDFs directly by index The corpus is uniform — every PDF is <index>.pdf flat in the import dir — so resolve a document's PDF with an O(1) importDir.resolve(index + ".pdf") lookup instead of a recursive directory walk over the file column. The index is validated against a strict catalog pattern (1–4 Latin letters incl. umlauts, hyphen(s), digits, optional x) plus the ported separator/dot/dotdot/null/slash-homoglyph/absolute-path guards, and the resolved canonical path is asserted to stay inside the import dir as defense-in-depth. The %PDF magic-byte check still gates upload; status UPLOADED/PLACEHOLDER and the index→originalFilename upsert key are unchanged. The file column and findFileRecursive walk are gone, and the security regression tests now assert a malicious or garbage index is rejected and a valid index resolves to exactly importDir/<index>.pdf within containment. Closes #686 Closes #676 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:02:03 +02:00
Marcel	09ba7e74e3	refactor(normalizer): drop file column now PDFs resolve by index The import corpus is uniform: every PDF is named <index>.pdf, so the file column (the spreadsheet's datei value) is redundant. Remove file from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS, plus the now-moot index_file_mismatch review flag/CSV/stat and the datei header mapping. date_end and the tree person_id are kept. Refs #686 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 20:54:37 +02:00
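
The resolution flow described in 4d55eee5c8 (validate the index, resolve importDir/<index>.pdf, assert canonical-path containment, check the %PDF magic bytes) can be sketched roughly as below. This is an illustrative sketch, not the code from this PR: the class name IndexPdfResolverSketch, the helpers isValidIndex and hasPdfMagic, and the exact regex are assumptions reconstructed from the commit messages. The real INDEX_PATTERN, the additional character pre-checks, the SkipReason reporting, and the rowNumber parameter are not reproduced here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Pattern;

public class IndexPdfResolverSketch {

    // ASSUMPTION: an approximation of the catalog shape named in the commits:
    // 1 to 4 Latin letters (including German umlauts and sharp s), one or more
    // hyphens, ASCII digits, optional lowercase x. The real pattern may differ.
    static final Pattern INDEX_PATTERN = Pattern.compile(
            "[A-Za-z\\u00C4\\u00D6\\u00DC\\u00E4\\u00F6\\u00FC\\u00DF]{1,4}-+[0-9]+x?");

    // Full-string match, so separators, dots, spaces and NUL can never pass.
    static boolean isValidIndex(String index) {
        return index != null && INDEX_PATTERN.matcher(index).matches();
    }

    // Returns the PDF for an index, or empty when the row should be skipped
    // (rejected index) or stay a placeholder (no file, or not a PDF).
    static Optional<Path> resolvePdfByIndex(Path importDir, String index) throws IOException {
        if (!isValidIndex(index)) {
            return Optional.empty();
        }
        // Direct lookup: no recursive walk over the import directory.
        Path candidate = importDir.resolve(index + ".pdf");
        if (!Files.isRegularFile(candidate)) {
            return Optional.empty();
        }
        // Defense in depth: the canonical file must stay inside the canonical dir.
        Path canonicalDir = importDir.toFile().getCanonicalFile().toPath();
        Path canonicalFile = candidate.toFile().getCanonicalFile().toPath();
        if (!canonicalFile.startsWith(canonicalDir)) {
            throw new SecurityException("resolved PDF escapes the import directory");
        }
        // ASSUMPTION: a file without the magic bytes is treated like a missing file.
        return hasPdfMagic(canonicalFile) ? Optional.of(canonicalFile) : Optional.empty();
    }

    // True when the file starts with the four bytes "%PDF".
    static boolean hasPdfMagic(Path file) throws IOException {
        byte[] expected = {'%', 'P', 'D', 'F'};
        try (InputStream in = Files.newInputStream(file)) {
            return Arrays.equals(in.readNBytes(expected.length), expected);
        }
    }
}
```

Under this assumed pattern, the four reject cases pinned in b3f4fb2d8d ("J 0070", "WXYZA-0001", "12-0001", "W-0001X") all fail isValidIndex, while an index such as "W-0001" resolves to exactly importDir/W-0001.pdf when that file exists and starts with %PDF.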