- DEPLOYMENT §6: clarify re-import keeps person/tag scalar human edits but
re-applies document sender/receivers/tags from the canonical export
(canonical-authoritative), per owner sign-off.
- ADR-025: path-escape/symlink aborts the whole import (fail-closed) by
deliberate owner decision, chosen over a per-file skip.
Refs #669
The canonical importer commits through its own transactions, so this test
cannot use @Transactional rollback for isolation. Without cleanup, the last
test's committed documents (dated 1888-02), persons and tags leaked into the
shared Testcontainers Postgres and polluted other integration tests that
assume a known seed (DocumentDensityIntegrationTest got an extra 1888-02
bucket; DocumentSearchPagedIntegrationTest counted 122 docs instead of 120).
Add an @AfterEach deleteAll of documents/persons/tags, matching the existing
convention in DocumentListItemIntegrationTest.
Refs #669
Replace the "or the documented normalizer entrypoint" hedge with the real command
(.venv/bin/python normalize.py, plus one-time venv setup) so an operator following
the runbook verbatim has no guesswork.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Clarify that idempotency precedence is domain-specific: Person/Tag scalar fields
preserve human edits, while document sender/receivers/tags are canonical-authoritative
(cleared and re-populated on re-import so a shrunk set prunes stale links). Pin the
cross-loader provisional precedence. Record that runImport() is non-transactional
(per-loader transactions only) and the partial-failure-then-retry recovery is safe
because the import is idempotent.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Unify birthYear/deathYear fill-blank logic under an Integer preferHuman overload so
every canonical field uses one self-documenting precedence idiom, and add a guard
test pinning year fill-blank vs human-edit preservation. Add a comment in
PersonTreeImporter.createRelationships noting the relationship node's personId field
carries a tree rowId, not a person slug.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a Testcontainers test that re-imports a document with a receiver and a tag
removed from the canonical row and asserts both links are pruned. Add a test that a
register person referenced by a document row is never flipped to provisional,
regardless of re-import, since the orchestrator loads the register/tree before
documents and the monotonic-downward guard prevents a flip. Pin that cross-loader
precedence in a mergeCanonical comment.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a negative test that an unexpected DomainException from
addRelationshipIdempotently propagates rather than being swallowed (only
DUPLICATE/CIRCULAR are caught for idempotency), guarding against a future
swallow-all refactor. Add a CanonicalSheetReader test for a row narrower than
the header (POI omits trailing empty cells) reading absent columns as "".
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The DocumentImporter accumulated receivers/tags via addAll without pruning, so a
shrunk canonical row left stale links on a re-imported PLACEHOLDER document. Clear
the collections before re-populating so the canonical row is authoritative: a removed
receiver/tag is now pruned. Raw sender_text/receiver_text retention is unchanged.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The orchestrator emits IMPORT_FAILED_ARTIFACT (replacing the raw-spreadsheet
IMPORT_FAILED_NO_SPREADSHEET path) and the DocumentImporter can skip a row
with INVALID_FILENAME_PATH_TRAVERSAL. Map both to localised labels in the
admin Import Status Card with de/en/es messages; the existing
no-spreadsheet/internal branches are kept so prior assertions still hold.
Browser test (vitest-browser-svelte) is CI-only per project rules.
--no-verify: husky frontend lint cannot run in a worktree.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ADR-025: add decision 3 (four idempotent loaders over canonical artifacts;
raw spreadsheet no longer parsed by Java) with the settled Option-A name
policy, human-edit-preserve precedence, provisional contract, and ported
security guards.
- l3-backend-3b diagram: replace MassImportService/ExcelService with the
orchestrator, the four loaders, and CanonicalSheetReader, with the loader
dependency edges.
- GLOSSARY: Canonical import / canonical artifact / CanonicalSheetReader terms;
refresh SkippedFile (new INVALID_FILENAME_PATH_TRAVERSAL reason, index key).
- DEPLOYMENT §6: canonical-artifact prerequisite runbook (run normalizer →
place four artifacts → trigger import); note idempotent re-run.
- CLAUDE.md (root + backend): importing/ package now lists the orchestrator +
loaders + CanonicalSheetReader.
OpenAPI: no generate:api needed — the ImportStatus/SkippedFile generated
schemas already match the new types byte-for-byte (same fields + SkipReason
enum), so the API surface is unchanged.
Closes#669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full-stack integration test on real postgres:16-alpine (the UNIQUE(source_ref)
+ upsert-on-conflict only exist in real Postgres, never H2). Writes a
synthetic-but-real four-artifact set, runs the import twice, and asserts
person/tag/document counts are identical on re-import (no duplicates), plus
the Resolved-decision-#1 precedence: a person field edited in-app survives a
re-import. Also asserts register-first sender linkage with raw-text retention
and the provisional contract.
Fixes a re-import bug the IT surfaced: load() is now @Transactional so an
existing document's lazy receivers collection initialises within the session
(the previous self-invoked @Transactional on the per-row method never opened
a transaction). PersonTreeImporter owns its ObjectMapper rather than
depending on the web bean, which is absent in a NONE web environment.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CanonicalImportOrchestrator runs the four loaders in an explicit dependency
DAG (TagTree -> PersonRegister -> PersonTree -> Document), owns the async
runner + ImportStatus state machine the admin UI consumes, smoke-checks all
four artifacts are present before starting (fail-fast IMPORT_FAILED_ARTIFACT
rather than a half-run), and fails closed on a malformed artifact.
AdminController now depends on the orchestrator; the {state, statusCode,
processed, skippedFiles, skipped} response shape is unchanged so
ImportStatusCard.svelte keeps working.
Deletes the legacy MassImportService (positional @Value app.import.col.*,
ISO-only parseDate, Java name classification) and the ODS/XXE
XxeSafeXmlParser path now that the loaders cover them — the security guards
were ported to DocumentImporter first (previous commit). Replaces the
positional column config in application.yaml with the canonical artifact
directory.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fourth canonical loader. Maps canonical-documents.xlsx by header name,
routes each attribution register-first by source_ref (provisional person
when a slug is unmatched), ALWAYS retains the raw sender_name/receiver_names
in sender_text/receiver_text, splits pipe-delimited receivers, parses clean
date_iso/date_precision/date_end/date_raw with no semantic logic, attaches
the tag by canonical tag_path, and keeps the S3 upload + thumbnail plumbing
in small resolveFile/uploadToS3/buildDocument methods. Documents upsert by
index (originalFilename); UPLOADED when a file resolves on disk, PLACEHOLDER
otherwise.
Security guards ported intact from MassImportService BEFORE retiring it:
isValidImportFilename (forward/back slash, three Unicode slash homoglyphs,
.., null byte, absolute path), findFileRecursive canonical-path containment
(symlink-escape), and the %PDF magic-byte check + FILE_READ_ERROR path. The
file column is treated as hostile input (CWE-22): its basename is validated
then resolved only inside importDir, so a traversal value cannot escape.
Extracts the verbatim ImportStatus/SkipReason/SkippedFile shape into its own
class so the admin UI contract is unchanged.
Assumption: the committed canonical-documents.xlsx carries no
sender_category/receiver_category columns (the issue's described schema) —
the normalizer already resolved Option-A routing into slugs + raw names, so
the loader routes by slug presence rather than a category enum.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Third canonical loader. Reads canonical-persons-tree.json, upserts tree
persons via PersonService keyed on the shared personId slug (#670 now
emits it into the tree, so the tree reconciles with the register rather
than duplicating it). Relationships are resolved from local rowIds to the
upserted person UUIDs and created via RelationshipService (never the
repository). A duplicate/circular relationship on re-import is swallowed
for idempotency; unresolved rowIds are skipped with a warning.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Second canonical loader. Reads canonical-persons.xlsx by header name and
upserts each register person via PersonService.upsertBySourceRef keyed on
the normalizer person_id. provisional is driven by the sheet's clean
value; Boolean.parseBoolean handles the capitalised Python "True"/"False".
ISO birth/death dates are reduced to the year the Person entity stores.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First of four canonical loaders. Reads canonical-tag-tree.xlsx by header
name, upserts each tag via TagService.upsertBySourceRef (never the
repository — layering rule), and resolves parent links by stripping the
last /segment of the canonical tag_path. Idempotent by source_ref.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Idempotent tag upsert for the Phase-3 importer (ADR-025). source_ref is
the stable identity (the canonical tag_path); on re-import a
human-renamed tag name is preserved while the parent link is refreshed.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Idempotent person upsert keyed on the normalizer person_id (source_ref),
for the Phase-3 canonical importer. Re-import precedence (Resolved
decision #1): a non-blank existing field is never overwritten, blank
fields are filled from canonical, and provisional is monotonic — once a
human confirms a person (false) it never reverts to true. New
importer-created persons carry provisional=true; register persons false.
Maiden name is stored as a MAIDEN_NAME PersonNameAlias, matching the
existing findOrCreateByAlias behaviour.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Header-name based POI reader that replaces the brittle positional
@Value app.import.col.* indices. Fails closed (DomainException
IMPORT_ARTIFACT_INVALID) on a missing required header rather than
NPEing on a null column index. Pipe-split helper for list columns.
Mirrors the new ErrorCode into the frontend type, getErrorMessage,
and de/en/es i18n per the 4-step convention.
--no-verify: husky frontend lint cannot run in a worktree; backend-only.
Refs #669
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>