As the archive owner I want import PDFs resolved directly by index (e.g. W-0124.pdf), dropping the file column #686

New Issue

marcel · 2026-05-27T20:47:51+02:00

marcel commented

2026-05-27 20:47:51 +02:00

Context

The import corpus is uniform: 7,721 PDFs, all flat in the import dir, every one named <index>.pdf — Al-0001.pdf, W-0124.pdf, C-1464.pdf, Eu-0628.pdf, H-0163.pdf, … (catalog index + .pdf). Verified on disk.

The file column (added in #670 to carry the spreadsheet's datei value) is therefore redundant and harmful: it carries messy, inconsistent values (the ~550 index_file_mismatch flags were in the datei column, not the files), which is exactly why DocumentImporter needs a recursive directory walk plus a basename/homoglyph/path-traversal/canonical-containment guard to resolve it.

Resolving directly by index is simpler, O(1), and removes the whole CWE-22 surface.

Backend — `DocumentImporter`

Replace resolveFile(row.get("file")) + findFileRecursive(...) with a direct lookup: importDir.resolve(index + ".pdf").
Validate the index before use — it must match a strict catalog pattern (letter prefix + - + digits, e.g. ^[A-Za-z]{1,3}-\d+x?$; confirm against the real index set, including any x-suffix/edge cases the normalizer already recognises) and contain no path separators / . / ... Reject otherwise (skip the row with a clear SkipReason).
Keep a canonical-path containment assertion as defense-in-depth (resolved path must stay inside importDir), and keep the %PDF magic-byte check.
Drop the now-dead findFileRecursive walk and the file-column basename machinery.
status UPLOADED/PLACEHOLDER and the index→originalFilename upsert key are unchanged.
This closes #676 (the O(rows×tree) recursive walk is gone).
Port/adapt the security regression tests: instead of "reject a malicious file path", they now assert a malicious/garbage index is rejected and a valid index resolves to exactly importDir/<index>.pdf within containment. Do not lose coverage.

Normalizer

Remove file from CanonicalDocument (documents.py) + the to_canonical(...) pass-through, and from DOC_COLUMNS (writers.py).
The index_file_mismatch review flag (it consumed raw.file) is now moot — drop it (the importer no longer uses the datei value). Keep date_end and the tree person_id (those stay).
Regenerate the committed out/canonical-documents.xlsx so the schema no longer has file. (Run the normalizer against the source workbooks — copy import/*.xlsx into the worktree and pip install -r requirements.txt.)

Acceptance criteria

Scenario: A PDF resolves directly by index
  Given a canonical row with index "W-0124" and importDir/W-0124.pdf exists
  When the document is imported
  Then W-0124.pdf is uploaded to S3 and status is UPLOADED
  And no recursive directory walk is performed

Scenario: A missing file yields a placeholder
  Given a canonical row with index "X-9999" and no importDir/X-9999.pdf
  Then the document is created with status PLACEHOLDER

Scenario: A malicious index is rejected
  Given a canonical row whose index contains a path separator or ".."
  Then the row is skipped and nothing outside importDir is read

Scenario: The canonical export has no file column
  Then canonical-documents.xlsx columns do not include "file"

Docs

Update ADR-025 + DEPLOYMENT.md §6: file resolution is now by index (<index>.pdf), not the datei column; note the removed walk/guards. GLOSSARY if a term changes.

Out of scope

The date/tag override rerun (separate; the local branch is behind with uncommitted overrides). Briefwechsel.

Closes #676

## Context The import corpus is uniform: **7,721 PDFs, all flat in the import dir, every one named `<index>.pdf`** — `Al-0001.pdf`, `W-0124.pdf`, `C-1464.pdf`, `Eu-0628.pdf`, `H-0163.pdf`, … (catalog index + `.pdf`). Verified on disk. The `file` column (added in #670 to carry the spreadsheet's `datei` value) is therefore **redundant and harmful**: it carries messy, inconsistent values (the ~550 `index_file_mismatch` flags were in the `datei` *column*, not the files), which is exactly why `DocumentImporter` needs a recursive directory walk plus a basename/homoglyph/path-traversal/canonical-containment guard to resolve it. Resolving directly by `index` is simpler, O(1), and removes the whole CWE-22 surface. ## Backend — `DocumentImporter` - Replace `resolveFile(row.get("file"))` + `findFileRecursive(...)` with a **direct lookup: `importDir.resolve(index + ".pdf")`**. - **Validate the index** before use — it must match a strict catalog pattern (letter prefix + `-` + digits, e.g. `^[A-Za-z]{1,3}-\d+x?$`; confirm against the real index set, including any `x`-suffix/edge cases the normalizer already recognises) and contain no path separators / `.` / `..`. Reject otherwise (skip the row with a clear `SkipReason`). - Keep a **canonical-path containment assertion** as defense-in-depth (resolved path must stay inside `importDir`), and **keep the `%PDF` magic-byte check**. - Drop the now-dead `findFileRecursive` walk and the `file`-column basename machinery. - `status` UPLOADED/PLACEHOLDER and the `index`→`originalFilename` upsert key are unchanged. - **This closes #676** (the O(rows×tree) recursive walk is gone). - **Port/adapt the security regression tests:** instead of "reject a malicious `file` path", they now assert a **malicious/garbage index is rejected** and a valid index resolves to exactly `importDir/<index>.pdf` within containment. Do not lose coverage. ## Normalizer - Remove `file` from `CanonicalDocument` (`documents.py`) + the `to_canonical(...)` pass-through, and from `DOC_COLUMNS` (`writers.py`). - The `index_file_mismatch` review flag (it consumed `raw.file`) is now moot — **drop it** (the importer no longer uses the `datei` value). Keep `date_end` and the tree `person_id` (those stay). - **Regenerate the committed `out/canonical-documents.xlsx`** so the schema no longer has `file`. (Run the normalizer against the source workbooks — copy `import/*.xlsx` into the worktree and `pip install -r requirements.txt`.) ## Acceptance criteria ```gherkin Scenario: A PDF resolves directly by index Given a canonical row with index "W-0124" and importDir/W-0124.pdf exists When the document is imported Then W-0124.pdf is uploaded to S3 and status is UPLOADED And no recursive directory walk is performed Scenario: A missing file yields a placeholder Given a canonical row with index "X-9999" and no importDir/X-9999.pdf Then the document is created with status PLACEHOLDER Scenario: A malicious index is rejected Given a canonical row whose index contains a path separator or ".." Then the row is skipped and nothing outside importDir is read Scenario: The canonical export has no file column Then canonical-documents.xlsx columns do not include "file" ``` ## Docs - Update ADR-025 + `DEPLOYMENT.md §6`: file resolution is now **by index (`<index>.pdf`)**, not the `datei` column; note the removed walk/guards. GLOSSARY if a term changes. ## Out of scope - The date/tag **override** rerun (separate; the local branch is behind with uncommitted overrides). Briefwechsel. Closes #676

marcel added this to the Handling the Unknowns — honest uncertainty in dates & people milestone 2026-05-27 20:47:51 +02:00

marcel changed title from ~~As the archive owner I want import PDFs resolved directly by index (<index>.pdf), dropping the file column~~ to As the archive owner I want import PDFs resolved directly by index (e.g. W-0124.pdf), dropping the file column

2026-05-27 20:48:38 +02:00

marcel added the P2-medium feature labels 2026-05-27 20:48:40 +02:00

marcel referenced this issue from a commit

2026-05-27 21:06:01 +02:00

refactor(normalizer): drop file column now PDFs resolve by index

marcel referenced this issue from a commit

2026-05-27 21:06:01 +02:00

feat(importing): resolve import PDFs directly by index

marcel referenced this issue from a commit

2026-05-27 21:06:01 +02:00

chore(normalizer): regenerate canonical-documents.xlsx without file column

marcel referenced this issue from a commit