Canonical importer can leave orphaned S3 objects when a document's file changes #675

Open
opened 2026-05-27 11:21:12 +02:00 by marcel · 0 comments
Owner

Context

Backlog item surfaced in the multi-persona review of PR #674 (Phase 3 modular importer, #669) — DevOps (Wendt).

The DocumentImporter uploads a document's file to S3 on the PLACEHOLDER → UPLOADED transition. There is no cleanup of the previous object if a document that already had a file is later re-imported with a different file (or its file column changes). The old S3 object becomes orphaned — it consumes storage and is never referenced again.

Bounded, not urgent: per ADR-025, once a document has a file the importer largely leaves it alone, so this only bites on the relatively rare "file changed for an existing index" path. No data-loss or correctness impact — purely storage hygiene.

Suggested approach

  • On re-upload for an existing document, delete the prior file_path object from S3 (within the same transaction boundary / after-commit hook), or
  • Add a periodic reconciliation/cleanup job that removes S3 objects with no referencing documents.file_path.
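
The first option, delete-on-replace from an after-commit hook, could be sketched as follows. This is a minimal illustration, not the importer's actual code: `FakeObjectStore`, `Transaction`, and `replace_file` are hypothetical names, the store is an in-memory stand-in for the S3 bucket, and the `documents` dict stands in for the `documents.file_path` column.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeObjectStore:
    """In-memory stand-in for the S3 bucket (hypothetical, for illustration)."""
    objects: Dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def delete(self, key: str) -> None:
        # Idempotent: deleting a key that is already gone is a no-op here.
        self.objects.pop(key, None)


@dataclass
class Transaction:
    """Hypothetical transaction that runs queued callbacks only after commit."""
    after_commit: List[Callable[[], None]] = field(default_factory=list)

    def commit(self) -> None:
        for hook in self.after_commit:
            hook()
        self.after_commit.clear()

    def rollback(self) -> None:
        # Discard pending deletes so the old object survives a failed import.
        self.after_commit.clear()


def replace_file(documents: Dict[int, str], store: FakeObjectStore,
                 tx: Transaction, doc_id: int, new_key: str, body: bytes) -> None:
    """Upload the new file and schedule deletion of the previous object."""
    old_key = documents.get(doc_id)
    store.put(new_key, body)
    documents[doc_id] = new_key  # stands in for UPDATE documents SET file_path = ...
    if old_key is not None and old_key != new_key:
        # Delete only after the new file_path is durably committed.
        tx.after_commit.append(lambda: store.delete(old_key))
```

Deferring the delete means a rolled-back import never removes an object that is still referenced. The trade-off is that on rollback the freshly uploaded object is the one left unreferenced.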

Prefer the inline delete-on-replace if it's cheap; otherwise a scheduled reconciliation.
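
For the fallback, the reconciliation job is essentially a set difference between the keys in the bucket and the values of `documents.file_path`. The sketch below assumes the caller has already listed the bucket and queried the database; `find_orphaned_keys` and the 24-hour grace period are illustrative assumptions, not part of the existing codebase. The grace period is there so that an object uploaded by an in-flight import, whose row is not yet committed, is not mistaken for an orphan.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set, Tuple


def find_orphaned_keys(
    bucket_objects: Iterable[Tuple[str, datetime]],
    referenced_paths: Iterable[Optional[str]],
    now: datetime,
    grace: timedelta = timedelta(hours=24),  # assumed value, tune as needed
) -> Set[str]:
    """Return keys that no document references and that are older than `grace`.

    bucket_objects: (key, last_modified) pairs from a bucket listing.
    referenced_paths: every documents.file_path value (None for placeholders).
    """
    referenced = {path for path in referenced_paths if path}
    cutoff = now - grace
    return {
        key
        for key, last_modified in bucket_objects
        if key not in referenced and last_modified <= cutoff
    }
```

A job built on this would then delete the returned keys, or just log them in a dry-run mode first.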

Out of scope

Not part of #669: the review of PR #674 explicitly classified this as a follow-up. No migration involved.

marcel added the P3-later and devops labels 2026-05-27 11:21:28 +02:00
Reference: marcel/familienarchiv#675