AC2 — intra-family marriage. When two parented persons at the same
imported generation are spouses but live in separate sibling blocks
(each under their own parent), the block-packer used to leave them
split, drawing a long spouse line that crossed through any intervening
siblings. The new step 3.5 detects that case, moves the focal members
to the join boundary (A's spouse rightmost in A's block, B's spouse
leftmost in B's), and concatenates B's members onto A's; the combined
block centres on the average of the two parents' midpoints.
Latent against today's data (no intra-family marriage in the canonical
fixture); covered by a synthetic two-family scenario in
buildLayout.test.ts. Packer growth stays comfortably under Markus's
80-LoC extraction threshold, so packBlocks.ts is not yet warranted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the alternating-side insertOnRight rule with a sort-and-splice
that places every loose spouse to the right of the parented focal in
(fromYear ASC NULLS LAST, displayName ASC) order. Mirrored in step 3 for
the all-loose chained merge so Albert de Gruyter's four marriages land
in deterministic alphabetical order today (no fromYear populated in the
canonical dataset) and switch automatically to year-order as the
transcription pipeline backfills marriage years.
PersonNodeDTO carries only displayName, not parsed first/last names, so
the tiebreaker uses displayName rather than the (lastName, firstName)
key in the original UX brief. The canonical alphabetical order matches
in both schemes — the rule activates the moment a multi-spouse case has
mixed display-name patterns.
Retires the temporary commit-3 scaffold
`attaches_loose_multi_spouse_to_parented_partner_when_edge_order_clobbers`
which became position-arithmetic-equivalent under the new right-of-focal
rule; the two new sort tests are stronger discriminators for the same
behaviour.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Switches spousePairs from Map<string, string> to Map<string, Set<string>>
so multi-spouse persons (canonical case: Albert de Gruyter, 4 marriages)
keep every partner instead of losing the earlier .set() values.
The behavioural discriminator (now exercised by
attaches_loose_multi_spouse_to_parented_partner_when_edge_order_clobbers)
is a loose person with both a parented and a loose spouse: the old map
clobbered to whichever edge landed last, so the loose-placement step could
miss the parented partner and merge the focal node into the wrong block.
Also closes the robustness gap NullX flagged: SPOUSE_OF edges referencing
IDs outside allNodes are dropped at ingestion instead of leaking into the
spouse-pulldown loop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Local-only developer utility that authenticates against the running backend,
captures the current /api/network snapshot, and writes it to
src/lib/person/genealogy/__fixtures__/stammbaum.json. Sanity gates exit
non-zero on a vacuous capture (< 50 nodes, < 5 generations, 0 SPOUSE_OF
edges). Fixture and script land together so the fixture is reproducible from
the script that generated it.
Captured snapshot: 62 nodes, 43 edges, 28 SPOUSE_OF (0 with fromYear),
generations G0-G4. Albert de Gruyter is the canonical multi-spouse case with
4 marriages.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Updates the impl-ref constants table to match buildLayout.ts (NODE_W=160,
NODE_H=56) and adds an explicit Layout rules section asserting the seeded-
rank invariant honoured since #689. Mockup <rect> dimensions stay at 144x50
with an explanatory annotation; re-pixel-pushing the illustrative SVG has
disproportionate blast radius for a spec doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI kept failing on the two gutter-render tests because the vitest-browser
iframe viewport is narrower than 768 px → window.matchMedia(min-width:
768px) returns false → gutter is hidden → g[role="text"] selector
returns []. The previous synchronous-seed fix was insufficient because
matchMedia itself was the false branch.
Add an optional `showGutter?: boolean` prop. When set, it bypasses the
matchMedia detection — tests pass `showGutter: true` to assert the
rendered gutter, and `showGutter: false` to assert the absent path.
Production callers leave it undefined so the existing media-query
detection still governs visibility.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI flagged two browser tests:
- "renders a G{n} label per occupied generation row …"
- "wraps the visible G3 text inside an aria-labelled group …"
Both queried g[role="text"] and got an empty array. Root cause:
isMdOrUp was initialised to false and only flipped to true inside a
$effect — but $effect runs after the first render, so the test's
post-render DOM scan saw the pre-effect (gutter-absent) state.
Seed the rune synchronously from window.matchMedia(...).matches when
window is available; SSR still picks the false branch and hydrates
without a layout flash. The effect now only attaches the change
listener for subsequent resizes.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sara's QA concerns:
1. PersonControllerTest.updatePerson_returns200_whenGenerationNull was
asymmetric — only checked status 200, no body assertion. Now also
asserts `$.generation` is null in the JSON response, mirroring the
in-range test's body check.
2. New full-stack PUT→DB→GET round-trip in PersonServiceIntegrationTest
(updatePerson_clearGenerationToNull_readsBackNullFromDb) seeds a
person with generation=3, calls updatePerson with generation=null,
flushes the persistence context, and asserts the column reads back
null from the DB. Without this we only had the mocked WebMvcTest
boundary; nothing proved JPA actually wrote SQL NULL.
3. Sibling test (updatePerson_setGenerationToZero_readsBackZeroFromDb)
pins the G 0 end-to-end so a primitive zero can't silently coerce
to null anywhere along controller → service → JPA.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Markus flagged the 0/10 range was duplicated across five sites (DB
CHECK, both importers, DTO @Min/@Max, dropdown range). New
PersonGeneration.MIN_GENERATION / MAX_GENERATION constants are now
the canonical Java source; the DTO annotations and both importer
guards reference them. The V70 SQL CHECK comment now points at the
Java constants so future widening updates one Java class plus one
SQL literal (Flyway forbids rewriting the migration in place).
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
db-orm.puml: persons gains a generation : SMALLINT attribute mirroring
the V70 column. No FK change, so db-relationships.puml is unaffected.
stammbaum-tree-spec.html:
- impl-ref table: replace "Gen label" with "Gutter label" + new
"Gutter stripe underlay" rows describing the role="text" wrapper,
un-shifted source-truth value, and below-md hidden state.
- light + dark colour-table rows updated to "Gutter label" /
"Gutter stripe" with the new var(--c-ink-2) / var(--c-gutter-stripe)
swatches.
- "Generationen ▾" filter chip mocks removed from desktop and tablet
layout sections (the filter UI was de-scoped from this PR).
Inline visual mockup SVGs that still show pre-gutter labelling are
out of scope per the issue body — the impl-ref table is the
authoritative source for this PR.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PersonEditForm.svelte gains a G 0…G 6 select inside the {#if isPerson}
block. min-h-[44px] meets WCAG 2.5.8 / dual-audience touch target.
generationStr is initialised via $state(untrack(...)) so prop reruns
never reset an in-progress edit (same pattern as selectedType).
Both /persons/[id]/edit and /persons/new form actions read the field
without the conditional-spread idiom — generation always lands in the
PUT/POST body. G 0 is a valid family-tree-root value the spread would
silently drop, and an empty option sends null so a human can clear the
field back to "unset".
i18n adds person_label_generation / person_option_generation_unset /
person_hint_generation in de/en/es. Drops the dead stammbaum_generations
key (zero callsites after the filter-chip removal in the spec).
Tests: dropdown render + hydration in the component, generation=0/3/null
arriving in the API body in the server actions.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The gutter sits 100 px to the left of the tree canvas on md+ viewports
(hidden entirely below md to preserve scrollable area on phones — see
spec's deliberate dual-audience trade-off). Per occupied generation
row it draws:
- A full-width decorative stripe rect alternating transparent and
var(--c-gutter-stripe). aria-hidden because it carries no meaning.
- The label `G{n}` at the left edge, sourced from the un-shifted
node.generation value (never the post-normalise rank), wrapped in
`<g role="text" aria-label="Generation N">` so screen readers
announce the full word instead of "G three".
CSS adds --c-gutter-stripe in both the light root and the dark mode
blocks (8% / 14% mint over canvas — decorative contrast carve-out).
Browser tests cover label rendering, the ARIA wrapper, and the
viewport-below-md absent-gutter path via a matchMedia stub. Existing
StammbaumTree structural-invariant tests still pass since none of
them assert anything inside the gutter region.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
buildLayout switches to a two-stage assignment:
1. Seed — every node with node.generation != null is locked at that
rank. The fallback heuristic never moves a locked rank, and the
spouse-pulldown never pulls a locked rank.
2. Fallback — for unseeded nodes, rank = max(parent rank) + 1 reading
parents from the same unified rank map, so an unseeded child of a
seeded G 2 parent correctly inherits rank 3. Spouse-pulldown ties
unseeded spouses to their deeper partner exactly as before.
3. Normalise — if any rank is negative (future G −1 ancestor), shift
the whole map so min(rank) == 0. No-op for today's data.
Fixes the Herbert Cram pattern from #361's review: two parented
spouses with imported G 3 now render on the same y row. Existing
StammbaumTree tests still pass byte-for-byte because every test node
has node.generation undefined, so the heuristic runs unchanged.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Manually mirrors the Spring Boot @Schema additions on PersonNodeDTO,
Person, and PersonUpdateDTO into the generated api.ts so the form +
gutter components compile against a finished type surface. The next
backend dev-profile run + `npm run generate:api` will regenerate the
same shape from the live OpenAPI spec.
PersonFormData gains `generation?: number | null` so PersonEditForm's
$state initialiser typechecks.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inject RelationshipService into CanonicalImportOrchestrator and walk
PARENT_OF edges in the family network after both person loaders finish
(before documents). For every edge where child.generation is set and
not strictly deeper than parent.generation, log a WARN — soft check,
never fails the batch.
Reads through getFamilyNetwork() per the layering rule (orchestrator
never touches PersonRelationshipRepository directly). Curators see the
warning in the import log; the rest of the pipeline is unaffected so
data with curatorial gaps still loads cleanly.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PersonNodeDTO is a positional record. The optional Integer generation
field is inserted between deathYear and familyMember so all four
construction sites stay readable without a builder.
- RelationshipService.getFamilyNetwork → populates with
person.getGeneration() (the Stammbaum's strict-rank source on the
frontend).
- RelationshipInferenceService.findAllFor → populates the same way;
inference UI does not consume it but the field travels along for
consistency.
- RelationshipControllerTest fixtures pass null.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads the optional `generation` integer from the canonical tree JSON and
routes it into PersonUpsertCommand. Out-of-range values are skip-and-
warned with the same policy as the register importer.
Tree imports run after register (per CanonicalImportOrchestrator); a
tree-confirmed integer overwrites a register-parsed value — both sides
are "canonical" in preferHuman terms (neither is a human edit).
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads the optional `generation` cell by header name (REQUIRED_HEADERS is
not extended — REQ-IMP-001 backward-compat for older artifacts), parses
it through GENERATION_PATTERN (^\s*G?\s*(-?\d+)), and routes it into
PersonUpsertCommand.generation.
Out-of-range values (G 99, G -1) are skip-and-warned, never abort the
batch; the post-parse range guard mirrors the V70 CHECK constraint so
the DB never sees a value Bean Validation wouldn't accept.
Pinned with a parametrised CsvSource covering every shape from the
acceptance criteria plus a backward-compat test (artifact without a
generation column still imports, all upserts get generation=null).
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- fromCanonical writes the imported generation into a new Person row.
- mergeCanonical routes existing/canonical generation through the
existing preferHuman(Integer, Integer) overload so a human-edited
value is never overwritten on re-import (ADR-025).
- updatePerson writes generation verbatim from the form DTO so a human
can clear it back to null — same shape as birthYear/deathYear.
- createPerson(PersonUpdateDTO) writes generation so /persons/new flow
doesn't silently drop a selected G value on create.
Pinned with five tests covering the four write paths plus the
documenting test that captures preferHuman's known limitation
(explicit human null is overwritten by a non-null canonical value —
same as birthYear/deathYear, deferred to a future helper rework if it
ever produces a user-visible bug).
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the optional generation field to both DTOs:
- PersonUpsertCommand gains Integer generation in the canonical-import
builder chain; service wiring lands in the next commit.
- PersonUpdateDTO gains @Min(0)@Max(10) Integer generation, the form-path
surface. The constraints mirror the V70 CHECK so validation fails fast
at the controller before reaching the DB.
PersonControllerTest pins the validation behaviour: -1 → 400, 11 → 400,
null → 200, 3 → 200 for both PUT (update) and POST (create) paths. The
GlobalExceptionHandler maps MethodArgumentNotValidException to
VALIDATION_ERROR so the frontend's extractErrorCode keeps working.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Flyway V70: SMALLINT generation column with CHECK(0..10) and partial
index over non-null rows. Person.generation field surfaces it through
the JPA model. Pre-import rows and persons outside the curated family
graph legitimately stay null; the canonical importer (next commits)
back-fills via preferHuman so a human-edited value is never lost.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move the layout function out of StammbaumTree.svelte (lines 47-275) into a
new pure TypeScript module at frontend/src/lib/person/genealogy/layout/
buildLayout.ts so it can be exercised by direct unit tests. Drops the
eslint-disable svelte/prefer-svelte-reactivity blanket; switches the
remaining scope-local Maps/Sets in parentLinks to SvelteMap/SvelteSet to
satisfy the rule per-call-site. No behaviour change — existing
StammbaumTree tests must pass byte-for-byte.
Refs #689
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The whole document load commits in one transaction, so a live counter
sits at 0 for the entire run and only jumps to the final number on
completion. Showing "0" next to the spinner read as "nothing happening"
and prompted repeated retriggers. Render just the spinner + running
label until the DONE branch displays the final processed count.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four admin actions (trigger-import, generate-thumbnails,
backfill-versions, backfill-file-hashes) were posting bare fetches, so
the backend's CSRF filter would reject them once the protection is on.
Wrap each init with withCsrf() so the X-XSRF-TOKEN header is attached
from the cookie — same pattern other admin actions use.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
userEvent.clear deletes per-keystroke, so intermediate values 'Au'/'A'
transit through the bound searchQuery and each schedules a debounced
fetch. When CI keystroke jitter exceeds SEARCH_DEBOUNCE_MS (150 ms), an
intermediate timer fires before the input reaches '' and the count
assertion sees a phantom q=Au call. fill('') drops a single input event
so the empty-query branch wins deterministically — same pattern this
test file already uses for fill('Walter').
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
addRelationship now auto-flips family_member=true on both endpoints for
PARENT_OF/SPOUSE_OF/SIBLING_OF (commit 07300aef). That side-effect breaks
the pre-condition assertion in setFamilyMember_true_makes_person_appear_in_network,
which expects charlie not to appear in the network before the explicit flip.
Reset charlie's flag after addRelationship so the test still exercises the
setFamilyMember(true) -> network presence path it was written for.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Promote svelte/no-at-html-tags to project-wide error so any new
{@html} block fails lint locally and in CI — the primary XSS defense.
The existing .gitea/workflows/ci.yml raw-date regex guard stays in
place as layered defense (it covers the specific raw-date variable
names that must NEVER be rendered via {@html}).
Existing legitimate {@html} usages (renderBody mentions in
CommentMessage.svelte, sanitized Markdown in geschichten/[id]) already
carry justified inline `eslint-disable-next-line` comments. Lint stays
green; verified by running npm run lint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend the WRITE_ALL-guard spec to a full matrix for each of the four
form actions (confirm, delete, merge, rename): happy path (backend 200),
required-field validation where applicable (merge without
targetPersonId, rename without lastName), backend 403, backend 404,
and the unauthorized guard from the previous commit. Mirrors the
shape of frontend/src/routes/persons/page.server.spec.ts.
18 tests, all green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The page-level error pill on /persons/review used raw Tailwind colour
classes (border-red-200, bg-red-50, text-red-600) — bypassing the
project's danger semantic tokens and breaking dark-mode contract. Align
with the rest of the persons domain (and PersonReviewRow's own deleteBtn)
by switching to border-danger / bg-danger/10 / text-danger.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Confirming a provisional person was a one-click write — easy to fat-finger
on a touchscreen and irreversible (the person disappears from the review
list, with no obvious undo path). Mirror the destructive-delete pattern
with a non-destructive confirm dialog (destructive: false) so the action
requires a second deliberate click.
New i18n keys (persons_review_confirm_confirm_title/text/button) added
to all three locales (de, en, es).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four form actions on /persons/review (confirm, delete, merge,
rename) had no server-side permission check — a reader with a hand-
crafted POST could trigger writes that the backend then rejected with
FORBIDDEN, but only after the round-trip. Add the existing hasWriteAll
guard at the top of each action and short-circuit with fail(403,
FORBIDDEN). Mirrors the guard pattern in the rest of the persons
domain (review-only writers must be gated client-side AND server-side).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DocumentImporter exposed a package-private openFileStream(File) so a
Mockito spy could force the IO-error branch of isPdfMagicBytes. The
test-only seam leaked into production: the method existed for testing,
not for any production extensibility.
Replace with a constructor-injected FileStreamOpener interface (single
abstract method, @FunctionalInterface) and a one-line
@Component DefaultFileStreamOpener delegate. Tests now inject a mock
opener instead of spying on the importer itself, which is also a more
idiomatic Mockito usage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
resolveReceivers passed the slug as both `sourceRef` AND `lastName`, so
an unresolved receiver "smith-john" became a provisional Person with
lastName="smith-john" — a regression of the existing senderName→Person
contract.
Fix: zip the parallel `receiver_person_ids` and `receiver_names`
columns by position (the normalizer emits them 1:1 like
sender_person_id/sender_name). When the names list is shorter than the
slugs list, fall back to slug-as-name for the missing entries.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
buildDocument was a ~30-line method mixing attribution routing, date
parsing, authoritative collection management, file metadata, and
computed flags. Split into five named helpers — applyAttribution,
applyDates, applyAuthoritativeAssociations, applyFileMetadata,
applyComputedFlags — each doing one job. Pure refactor; all 43 existing
DocumentImporterTest cases still pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four files in tools/import-normalizer/out/ contain real names,
addresses, and attribution prose for ~163 living/deceased family members
and were committed by mistake. They are now removed from the index
(kept on disk for local development) and gitignored.
The canonical artifacts are produced locally from the Python normalizer
and synced into IMPORT_HOST_DIR out-of-band alongside the PDFs. The
contract between normalizer and importer is the header schema, not the
file contents — CanonicalSheetReader fails closed on a missing header,
which is what locks the contract.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The canonical importer creates persons via PersonRegisterImporter first (no family_member
set) and then upserts them via PersonTreeImporter, but mergeCanonical never propagates
family_member to existing persons — so persons with imported relationships ended up
flagged family_member=false and never appeared in /api/persons family filters or the
family-network view.
RelationshipService is documented as the owner of the family_member flag, so the fix
lives there: addRelationship now sets family_member=true on both endpoints whenever the
relation type is PARENT_OF / SPOUSE_OF / SIBLING_OF (the same set getFamilyNetwork
filters by). Non-family types (FRIEND/COLLEAGUE/EMPLOYER/DOCTOR/NEIGHBOR/OTHER) leave
the flag alone — a family doctor isn't a family member. Extracted the type list as a
FAMILY_RELATION_TYPES constant and reused it in getFamilyNetwork for a single source of truth.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdfjs-dist resolves to 5.7.284, which requires Node >=22.13.0 || >=24.
With engine-strict=true in .npmrc, npm ci hard-fails on the Node 20 base
image, so the frontend dev server crash-loops (and a clean build fails).
CI runs the frontend on Node 22 (Playwright image), so the committed
lockfile already assumes 22. Bump all three Dockerfile stages to match.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The observability work moved actuator to a separate management port
(management.server.port: 8081), but the dev compose healthcheck still
probed :8080/actuator/health, which 404s. The backend was reported
unhealthy and the frontend (depends_on: backend healthy) never started.
docker-compose.prod.yml already uses 8081; this aligns dev with it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the legacy raw-spreadsheet importer references left behind after
#674 with the canonical import architecture (CanonicalImportOrchestrator +
four loaders) and document #686 index-based PDF resolution.
- l3-backend-3b: DocumentImporter now resolves PDF by index (importDir/
<index>.pdf) with index validation + canonical-path containment + %PDF
magic-byte check (no recursive walk / homoglyph file-path guards)
- c4-diagrams.md: replace massImport/excelSvc components + their rels with
an importOrch (CanonicalImportOrchestrator) component wired to doc/person/
tag services; refresh adminCtrl and adminSystem descriptions
- ARCHITECTURE.md: importing package row now describes the orchestrator +
four loaders consuming canonical artifacts
- TODO-backend.md: remove obsolete "MassImportService provides no status"
item (service deleted; orchestrator already exposes import-status); update
stale ExcelService test-coverage suggestion
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address PR #687 review concerns on DocumentImporterTest:
- Sara/Felix: add catalog-shape reject tests that pass every char
pre-check but must fail INDEX_PATTERN — "J 0070" (space), "WXYZA-0001"
(5 letters), "12-0001" (no letter prefix), "W-0001X" (uppercase X).
Verified red against a weakened pattern, green against the real one,
so the pattern branch (not the char guards) is now pinned.
- Felix: restore the import java.io.OutputStream line (was over-deleted
and patched with a fully-qualified name).
- Sara: document why the resolvePdfByIndex getCanonicalPath IOException
branch is intentionally left uncovered (no deterministic injection
seam; the log.warn is the substantive fix).
Adjust the two reflective resolvePdfByIndex calls for the new rowNumber
parameter.
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address PR #687 review concerns on DocumentImporter:
- Tobias: thread a 1-based source row number into importRow so the
"index rejected" skip log carries a breadcrumb (the row number, never
the raw hostile index) for post-import triage.
- Elicit: emit a distinct log when a valid index has no <index>.pdf on
disk (normal PLACEHOLDER) so it is not conflated with a rejected index.
- Nora: add a log.warn in resolvePdfByIndex's getCanonicalPath IOException
branch so the quiet fail-safe skip surfaces in ops, distinct from the
deliberate symlink-escape abort.
- Felix: replace inline fully-qualified java.util.regex.Pattern with an
import.
- Nora: document that \d is intentionally ASCII-only (do not add
UNICODE_CHARACTER_CLASS).
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The mass-import card no longer parses an ODS spreadsheet and MassImportService
was deleted (#674); /import now holds the normalizer's canonical artifacts
(canonical-*.xlsx + canonical-persons-tree.json) plus <index>.pdf files, read
by the canonical importer. Fix the IMPORT_HOST_DIR descriptions in
DEPLOYMENT.md and docker-compose.prod.yml accordingly.
Refs #686
File resolution is now by index (<index>.pdf), not the datei/file
column. Update the ADR-025 security sub-decision and consequence (the
recursive walk and file column are gone; a bad index skips its row with
a loud SkipReason, a symlink-escape still aborts via the containment
assertion) and DEPLOYMENT §6 (PDFs must be named <index>.pdf flat in
the import dir).
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Regenerated from the source workbooks with the committed overrides; the
export schema now has 16 columns (no file). canonical-persons.xlsx and
canonical-tag-tree.xlsx were unchanged at the cell level (only openpyxl
zip-byte churn) and were left untouched to keep the diff minimal.
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The corpus is uniform — every PDF is <index>.pdf flat in the import
dir — so resolve a document's PDF with an O(1) importDir.resolve(index
+ ".pdf") lookup instead of a recursive directory walk over the file
column. The index is validated against a strict catalog pattern
(1–4 Latin letters incl. umlauts, hyphen(s), digits, optional x) plus
the ported separator/dot/dotdot/null/slash-homoglyph/absolute-path
guards, and the resolved canonical path is asserted to stay inside the
import dir as defense-in-depth. The %PDF magic-byte check still gates
upload; status UPLOADED/PLACEHOLDER and the index→originalFilename
upsert key are unchanged. The file column and findFileRecursive walk
are gone, and the security regression tests now assert a malicious or
garbage index is rejected and a valid index resolves to exactly
importDir/<index>.pdf within containment.
Closes#686Closes#676
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The import corpus is uniform: every PDF is named <index>.pdf, so the
file column (the spreadsheet's datei value) is redundant. Remove file
from CanonicalDocument, RawRow, _FIELDS, to_canonical, and DOC_COLUMNS,
plus the now-moot index_file_mismatch review flag/CSV/stat and the
datei header mapping. date_end and the tree person_id are kept.
Refs #686
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The global undated-count rework moved the pure-text-RELEVANCE shortcut
into runSearch, where it ran after the unconditional
findAllMatchingIdsByFts call. That routed pure-text relevance through the
in-memory id path and returned empty match data, breaking FTS rank order
and snippet/offset enrichment.
Hoist the shortcut back to the top of searchDocuments so it short-circuits
to findFtsPageRaw before findAllMatchingIdsByFts, while still computing the
global undatedCount for all non-fast-path searches.
Refs #668
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>