As the archive owner I want import PDFs resolved directly by index (e.g. W-0124.pdf), dropping the file column #686
Context
The import corpus is uniform: 7,721 PDFs, all flat in the import dir, every one named `<index>.pdf`: `Al-0001.pdf`, `W-0124.pdf`, `C-1464.pdf`, `Eu-0628.pdf`, `H-0163.pdf`, … (catalog index + `.pdf`). Verified on disk.
The `file` column (added in #670 to carry the spreadsheet's `datei` value) is therefore redundant and harmful: it carries messy, inconsistent values (the ~550 `index_file_mismatch` flags stemmed from the `datei` column, not from the files), which is exactly why `DocumentImporter` needs a recursive directory walk plus a basename/homoglyph/path-traversal/canonical-containment guard to resolve it.
Resolving directly by `index` is simpler, O(1), and removes the whole CWE-22 surface.
Backend: DocumentImporter
- Replace `resolveFile(row.get("file"))` + `findFileRecursive(...)` with a direct lookup: `importDir.resolve(index + ".pdf")`.
- Validate the index before the lookup: it must match the catalog pattern (letters + `-` + digits, e.g. `^[A-Za-z]{1,3}-\d+x?$`; confirm against the real index set, including any `x`-suffix/edge cases the normalizer already recognises) and contain no path separators or `.`/`..` components. Reject otherwise (skip the row with a clear `SkipReason`).
- Keep the canonical-containment check (the resolved path must stay inside `importDir`), and keep the `%PDF` magic-byte check.
- Remove the `findFileRecursive` walk and the `file`-column basename machinery.
- `status` UPLOADED/PLACEHOLDER and the `index` → `originalFilename` upsert key are unchanged.
- Rework the guard tests: instead of a "malicious `file` path", they now assert that a malicious/garbage index is rejected and that a valid index resolves to exactly `importDir/<index>.pdf` within containment. Do not lose coverage.
Normalizer
- Remove `file` from `CanonicalDocument` (documents.py) + the `to_canonical(...)` pass-through, and from `DOC_COLUMNS` (writers.py).
- The `index_file_mismatch` review flag (it consumed `raw.file`) is now moot: drop it (the importer no longer uses the `datei` value). Keep `date_end` and the tree `person_id` (those stay).
- Regenerate `out/canonical-documents.xlsx` so the schema no longer has `file`. (Run the normalizer against the source workbooks: copy `import/*.xlsx` into the worktree and `pip install -r requirements.txt`.)
Acceptance criteria
Docs
DEPLOYMENT.md §6: file resolution is now by index (`<index>.pdf`), not the `datei` column; note the removed walk/guards. Update GLOSSARY if a term changes.
Out of scope
Closes #676
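Sketch of the index-based resolution described under Backend, for discussion only. It is written in Python (the normalizer's language) purely for illustration; the importer's own language is not stated here. The names `resolve_pdf`, `INDEX_PATTERN` and `PDF_MAGIC` are hypothetical. The pattern is the example from this issue and still has to be confirmed against the real index set. How an invalid index maps to a `SkipReason`, and a missing file to PLACEHOLDER, is left to the importer.

```python
import re
from pathlib import Path
from typing import Optional

# Example pattern from this issue (unconfirmed against the real index set).
# re.ASCII keeps \d to 0-9, so look-alike Unicode digits are rejected;
# fullmatch() is used so a trailing newline cannot slip past a "$" anchor.
INDEX_PATTERN = re.compile(r"[A-Za-z]{1,3}-\d+x?", re.ASCII)
PDF_MAGIC = b"%PDF"


def resolve_pdf(import_dir: Path, index: str) -> Optional[Path]:
    """Resolve <import_dir>/<index>.pdf directly, without a directory walk.

    Raises ValueError for an index that is not a plain catalog index
    (the caller would skip the row). Returns None when no usable PDF
    exists for a valid index.
    """
    # 1. Validate the index: the pattern admits no separators and no dots.
    if not INDEX_PATTERN.fullmatch(index):
        raise ValueError(f"invalid index: {index!r}")

    # 2. Direct lookup by name.
    root = import_dir.resolve()
    candidate = (root / f"{index}.pdf").resolve()

    # 3. Canonical containment: after resolving symlinks the file must
    #    still sit directly inside the (flat) import directory.
    if candidate.parent != root:
        return None

    # 4. The file must exist and start with the %PDF magic bytes.
    if not candidate.is_file():
        return None
    with candidate.open("rb") as fh:
        if fh.read(len(PDF_MAGIC)) != PDF_MAGIC:
            return None
    return candidate
```

With the validation in front, the lookup never sees attacker-controlled path syntax, so the containment check is a second line of defence rather than the primary guard.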