Compare commits

..

3 Commits

Author SHA1 Message Date
Marcel
7354c3d332 docs(adr): record the index pattern as a corpus-specific constraint
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m35s
CI / OCR Service Tests (pull_request) Successful in 21s
CI / Backend Unit Tests (pull_request) Successful in 3m58s
CI / fail2ban Regex (pull_request) Successful in 42s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m5s
Address PR #687 review concern (Elicit): add an ADR-025 Consequences
entry noting INDEX_PATTERN accepts only the current corpus shape (<=4
Latin-1 letters, hyphens, ASCII digits, optional x) and must be revisited
deliberately if the catalog scheme grows (5-letter prefix, digit-led id,
non-Latin letter), since such rows would otherwise be skipped, not
imported. Also records the ASCII-only \d intent.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:20:21 +02:00
Marcel
b3f4fb2d8d test(importing): pin regex reject-boundary + note untestable IO branch
Address PR #687 review concerns on DocumentImporterTest:
- Sara/Felix: add catalog-shape reject tests that pass every char
  pre-check but must fail INDEX_PATTERN — "J 0070" (space), "WXYZA-0001"
  (5 letters), "12-0001" (no letter prefix), "W-0001X" (uppercase X).
  Verified red against a weakened pattern, green against the real one,
  so the pattern branch (not the char guards) is now pinned.
- Felix: restore the import java.io.OutputStream line (was over-deleted
  and patched with a fully-qualified name).
- Sara: document why the resolvePdfByIndex getCanonicalPath IOException
  branch is intentionally left uncovered (no deterministic injection
  seam; the log.warn is the substantive fix).

Adjust the two reflective resolvePdfByIndex calls for the new rowNumber
parameter.

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:20:15 +02:00
Marcel
a1f96f60ac feat(importing): log import-row breadcrumbs and distinguish skip outcomes
Address PR #687 review concerns on DocumentImporter:
- Tobias: thread a 1-based source row number into importRow so the
  "index rejected" skip log carries a breadcrumb (the row number, never
  the raw hostile index) for post-import triage.
- Elicit: emit a distinct log when a valid index has no <index>.pdf on
  disk (normal PLACEHOLDER) so it is not conflated with a rejected index.
- Nora: add a log.warn in resolvePdfByIndex's getCanonicalPath IOException
  branch so the quiet fail-safe skip surfaces in ops, distinct from the
  deliberate symlink-escape abort.
- Felix: replace inline fully-qualified java.util.regex.Pattern with an
  import.
- Nora: document that \d is intentionally ASCII-only (do not add
  UNICODE_CHARACTER_CLASS).

Refs #686

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:20:08 +02:00
3 changed files with 89 additions and 11 deletions

View File

@@ -37,6 +37,7 @@ import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.regex.Pattern;
/**
* Loads {@code canonical-documents.xlsx} into the document domain. Java performs no
@@ -67,8 +68,11 @@ public class DocumentImporter {
// "Mü-0001"), one or more hyphens (the corpus has a few "C--0029" data-entry artefacts),
// digits, and an optional trailing "x" the normalizer recognises. Anchored, with no
// separator / dot / slash characters in the class, so "<index>.pdf" can never traverse.
private static final java.util.regex.Pattern INDEX_PATTERN =
java.util.regex.Pattern.compile("[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]{1,4}-+\\d+x?");
// NOTE: `\d` here is intentionally ASCII-only ([0-9]). Java's java.util.regex matches `\d`
// against [0-9] unless Pattern.UNICODE_CHARACTER_CLASS is set — do NOT add that flag, or
// Arabic-Indic / fullwidth digits would silently widen the accepted set.
private static final Pattern INDEX_PATTERN =
Pattern.compile("[A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF]{1,4}-+\\d+x?");
private final DocumentService documentService;
private final PersonService personService;
@@ -93,10 +97,14 @@ public class DocumentImporter {
List<CanonicalSheetReader.Row> rows = CanonicalSheetReader.readRows(artifact, REQUIRED_HEADERS);
int processed = 0;
List<ImportStatus.SkippedFile> skipped = new ArrayList<>();
// 1-based source row number for ops triage breadcrumbs (the spreadsheet header is row 1,
// so the first data row is row 2 — matches what an operator sees in the .xlsx).
int rowNumber = 1;
for (CanonicalSheetReader.Row row : rows) {
rowNumber++;
String index = row.get("index");
if (index.isBlank()) continue;
Optional<ImportStatus.SkipReason> skipReason = importRow(row, index);
Optional<ImportStatus.SkipReason> skipReason = importRow(row, index, rowNumber);
if (skipReason.isPresent()) {
skipped.add(new ImportStatus.SkippedFile(index, skipReason.get()));
} else {
@@ -107,13 +115,24 @@ public class DocumentImporter {
return new LoadResult(processed, skipped);
}
private Optional<ImportStatus.SkipReason> importRow(CanonicalSheetReader.Row row, String index) {
private Optional<ImportStatus.SkipReason> importRow(CanonicalSheetReader.Row row, String index, int rowNumber) {
if (!isValidImportIndex(index)) {
log.warn("Skipping import row: index rejected");
// Breadcrumb is the source row number, NOT the raw (possibly-hostile) index — an
// operator triaging the import can find the offending row in the .xlsx without us
// echoing attacker-controlled input into the log.
log.warn("Skipping import row {}: index rejected (fails catalog-shape validation)", rowNumber);
return Optional.of(ImportStatus.SkipReason.INVALID_FILENAME_PATH_TRAVERSAL);
}
Optional<File> resolved = resolvePdfByIndex(index);
if (resolved.isPresent()) {
Optional<File> resolved = resolvePdfByIndex(index, rowNumber);
if (resolved.isEmpty()) {
// Distinct from the "index rejected" skip above: the index is VALID but no
// <index>.pdf is on disk, so the row becomes a normal PLACEHOLDER (not skipped). The
// index is a validated catalog id (no hostile content), so it is safe to log here —
// this surfaces a corpus that drifts from the "<index>.pdf" assumption (e.g. a file
// that arrived under a different name) rather than dropping it silently.
log.info("Import row {}: index {} is valid but {}.pdf is absent — creating PLACEHOLDER",
rowNumber, index, index);
} else {
try {
if (!isPdfMagicBytes(resolved.get())) {
return Optional.of(ImportStatus.SkipReason.INVALID_PDF_SIGNATURE);
@@ -319,7 +338,7 @@ public class DocumentImporter {
// O(1) direct lookup: the PDF is exactly importDir/<index>.pdf. The caller has already
// validated the index shape; the canonical-path containment assertion below is
// defense-in-depth so even a symlinked <index>.pdf cannot read outside importDir.
private Optional<File> resolvePdfByIndex(String index) {
private Optional<File> resolvePdfByIndex(String index, int rowNumber) {
File baseDir = new File(importDir);
File candidate = baseDir.toPath().resolve(index + ".pdf").toFile();
try {
@@ -330,6 +349,11 @@ public class DocumentImporter {
}
return Optional.of(candidate);
} catch (IOException e) {
// Distinct from the deliberate symlink-escape abort above (which throws): canonical
// resolution itself failed (e.g. the OS rejected the path mid-resolution). We fail
// safe to a PLACEHOLDER, but never silently — log it so the asymmetry surfaces in ops.
log.warn("Canonical path resolution failed for import row {}: treating {}.pdf as absent",
rowNumber, index, e);
return Optional.empty();
}
}

View File

@@ -24,6 +24,7 @@ import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
@@ -126,6 +127,36 @@ class DocumentImporterTest {
assertThat(validIndex("W-0001.pdf")).isFalse();
}
// ─── catalog-shape rejects — pass the char pre-checks but must fail INDEX_PATTERN ────
// These pin the regex branch itself: each string contains no separator, dot, slash
// homoglyph, null byte, or absolute marker, so it sails past every char guard and is
// rejected *only* because INDEX_PATTERN.matches() returns false. A weaker pattern would
// let them through — these tests would then go red.
@Test
void isValidImportIndex_returnsFalse_whenSpaceInIndex() {
// The real-world reject: "J 0070" is a space-typo with no PDF on disk.
assertThat(validIndex("J 0070")).isFalse();
}
@Test
void isValidImportIndex_returnsFalse_whenFiveLetterPrefix() {
// The catalog prefix is at most 4 letters; 5 must not match.
assertThat(validIndex("WXYZA-0001")).isFalse();
}
@Test
void isValidImportIndex_returnsFalse_whenNoLetterPrefix() {
// A digit-led id (no letter prefix) is not a catalog shape.
assertThat(validIndex("12-0001")).isFalse();
}
@Test
void isValidImportIndex_returnsFalse_whenUppercaseXSuffix() {
// Only a lowercase trailing "x" is allowed; an uppercase "X" suffix must fail.
assertThat(validIndex("W-0001X")).isFalse();
}
@Test
void isValidImportIndex_returnsTrue_whenPlainCatalogIndex() {
assertThat(validIndex("W-0124")).isTrue();
@@ -222,7 +253,7 @@ class DocumentImporterTest {
ReflectionTestUtils.setField(importer, "importDir", importDirPath.toString());
org.assertj.core.api.Assertions.assertThatThrownBy(
() -> ReflectionTestUtils.invokeMethod(importer, "resolvePdfByIndex", "W-0001"))
() -> ReflectionTestUtils.invokeMethod(importer, "resolvePdfByIndex", "W-0001", 2))
.isInstanceOf(org.raddatz.familienarchiv.exception.DomainException.class);
}
@@ -232,12 +263,24 @@ class DocumentImporterTest {
Path expected = tempDir.resolve("Eu-0628.pdf");
Files.writeString(expected, "%PDF-1.4");
Optional<File> resolved = ReflectionTestUtils.invokeMethod(importer, "resolvePdfByIndex", "Eu-0628");
Optional<File> resolved = ReflectionTestUtils.invokeMethod(importer, "resolvePdfByIndex", "Eu-0628", 2);
assertThat(resolved).isPresent();
assertThat(resolved.get().getCanonicalFile()).isEqualTo(expected.toFile().getCanonicalFile());
}
// NOTE (Sara, PR #687): the IOException branch of resolvePdfByIndex — where
// File.getCanonicalPath() itself throws (an OS-level failure mid-resolution, not the
// symlink-escape DomainException) — is intentionally NOT covered by a test. Unlike
// isPdfMagicBytes, which has the package-private openFileStream(File) seam a Mockito spy can
// make throw, getCanonicalPath() is called on a File built internally with no injection seam,
// and there is no portable, deterministic way to make it throw on a temp file (it does not
// throw for missing/symlinked paths — those are handled by isFile()/the containment check).
// Adding a seam purely to test this would be production code in service of a non-defect; the
// substantive fix is the log.warn() now emitted in that branch so the quiet skip surfaces in
// ops. Left uncovered by deliberate decision, documented here so the branch is not assumed
// tested.
// ─── PDF magic-byte guard — ported — do not remove ──────────────────────────────
@Test
@@ -544,7 +587,7 @@ class DocumentImporterTest {
row.createCell(c).setCellValue(rows[r].getOrDefault(headers.get(c), ""));
}
}
try (java.io.OutputStream out = Files.newOutputStream(xlsx)) {
try (OutputStream out = Files.newOutputStream(xlsx)) {
wb.write(out);
}
}

View File

@@ -149,6 +149,17 @@ Settled sub-decisions:
the same state, so the operational recovery for a partial failure is simply to fix the
offending artifact and re-trigger the import — no manual cleanup of half-written data is
required. A future maintainer must not assume all-or-nothing semantics.
- **The index pattern is corpus-specific and must be revisited if the catalog scheme grows.**
`INDEX_PATTERN` accepts only the *current* corpus shape — at most four Latin-1 letters (incl.
umlauts) followed by one or more hyphens, ASCII digits, and an optional trailing `x`. This is a
conscious constraint, not a general filename validator: a future sub-collection catalogued with
a 5-letter prefix, a digit-led id, or a non-Latin-1 letter (e.g. `Č` or a Cyrillic id) would
fail `isValidImportIndex` and its rows would be **skipped** (`INVALID_FILENAME_PATH_TRAVERSAL`),
not imported. Likewise a real PDF that does not follow `<index>.pdf` produces a `PLACEHOLDER`
(the importer logs both cases distinctly — see #686). If the catalog scheme ever changes, the
pattern and its tests must be widened deliberately; do not loosen it casually, as it is the
allowlist that keeps the on-disk lookup safe. Note `\d` is intentionally ASCII-only — adding
`Pattern.UNICODE_CHARACTER_CLASS` would silently widen the accepted digit set.
- **A malicious/garbage index skips its row with a loud `SkipReason`, by design.** Since #686
the index is the only on-disk lookup key. An index that fails `isValidImportIndex`
(path separator, traversal token, slash homoglyph, null byte, absolute path, or a non-catalog