feat(resilience): wrap OCR client with Resilience4j retry + circuit-breaker + time-limiter #463
Context
backend/src/main/java/org/raddatz/familienarchiv/ocr/RestClientOcrClient.java configures connect/read timeouts (10s connect / 10min read) but has no retry, no circuit-breaker, no bulkhead. The fallback path is try/catch → DomainException.internal(OCR_SERVICE_UNAVAILABLE).

Operational consequences:
- mem_limit: 12g per docker-compose.yml means a single bad PDF can OOM-kill the whole service. The backend will hammer it during recovery and prevent it from coming back cleanly.

The dynamic part of the audit corroborates this:
- pg_stat_statements shows the backend is normally fast (sub-ms queries), so the OCR boundary is the most likely failure mode for cascading user-visible 5xx.

Approach
Add resilience4j-spring-boot3 (or the matching Boot 4 starter) and annotate the OCR client. Use three patterns: retry, circuit-breaker, time-limiter. Do not retry on 4xx (bad PDF) — only on 5xx and timeouts. The changes touch backend/pom.xml and RestClientOcrClient.java. For the streaming /ocr/stream call, pin @TimeLimiter instead of trusting the read timeout alone; the resilience configuration goes in application.yaml (a sketch follows below).
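A minimal application.yaml sketch of the intended configuration, assuming an instance name of "ocr" and the values discussed in this issue (3 retries with 2s/4s/8s exponential back-off, a count-based circuit-breaker, the proposed 30s open-state wait); window size and failure-rate threshold are illustrative placeholders, and the property keys should be verified against the Resilience4j version that survives the BOM check:

```yaml
# Sketch only — instance name "ocr", window size, and failure-rate threshold are assumptions.
resilience4j:
  retry:
    instances:
      ocr:
        max-attempts: 3                     # adjust to 4 if "3 retries" means 3 re-attempts after the first call
        wait-duration: 2s
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2   # 2s, 4s, 8s
  circuitbreaker:
    instances:
      ocr:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s    # see the open decision on 30s vs 60s+ below
  timelimiter:
    instances:
      ocr:
        timeout-duration: 5m                # only meaningful if the streaming call becomes async (see Felix's note below)
```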
Metrics

Resilience4j exposes Micrometer metrics out of the box. After #140 (Prometheus stack), add a Grafana panel for resilience4j_circuitbreaker_state{name="ocr"} and resilience4j_retry_calls{name="ocr",kind="successful_with_retry"}.

Critical files
- backend/pom.xml
- backend/src/main/java/org/raddatz/familienarchiv/ocr/RestClientOcrClient.java — annotations + fallback
- backend/src/main/resources/application.yaml — resilience4j config
- backend/src/main/java/org/raddatz/familienarchiv/exception/ErrorCode.java — new OCR_SERVICE_UNAVAILABLE code (if not already present)
- frontend/src/lib/shared/errors.ts — mirror the new ErrorCode
- frontend/messages/{de,en,es}.json — i18n key for the new error
- backend/src/test/java/org/raddatz/familienarchiv/ocr/RestClientOcrClientResilienceTest.java — new

Verification
- Stop the OCR container (docker compose stop ocr-service). Trigger an OCR run via the backend. Expect: 3 retries with 2s/4s/8s back-off, then a clean 503 with OCR_SERVICE_UNAVAILABLE, not a stack-trace 500.
- Freeze the OCR process so calls hang (docker exec archive-ocr kill -STOP 1). After 30s of slow calls the circuit opens; subsequent requests fail fast (no retry, no wait) until 30s pass.
- curl http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker_state returns the gauge.

Acceptance criteria
- Resilience4j retry and circuit-breaker metrics for the OCR client are exposed at /actuator/prometheus.

Effort
S — 1 day including resilience4j-vs-Boot 4 BOM compatibility check.
Risk if not addressed
OCR outages cascade to backend thread-pool exhaustion → 5xx for all user requests, not just OCR ones.
Tracked in audit doc as F-09 (High).
🏛️ Markus Keller — Application Architect
Observations
- The resilience4j-spring-boot3 artifact name in the issue is wrong for this project. The backend is Spring Boot 4.0 (confirmed in pom.xml line 8). Resilience4j 2.x publishes separate starters: resilience4j-spring-boot3 is the Boot 3.x artifact. For Boot 4 / Spring Framework 7, you must either use the plain resilience4j-spring-boot3 artifact at version 2.3.0+ (which has Boot 4 support starting from a later patch — verify BOM compatibility) or resilience4j-core + manual Spring AOP wiring. This BOM compatibility check is the real risk in the "S — 1 day" estimate; it could become an M if the Boot 4 starter is not yet published.
- @CircuitBreaker/@Retry annotations on RestClientOcrClient are the right application point. The OcrClient interface is already mockable, so the AOP weaving will work cleanly.
- The issue's snippet targets startOcr(...) in RestClientOcrClient — but looking at the actual code, the methods are extractBlocks, streamBlocks, trainModel, etc. There is no startOcr method on the client. The issue's Java snippet is illustrative but not copy-pasteable to the real class.
- The streaming path (streamBlocks → HttpClient.send(...) with a Duration.ofMinutes(5) timeout) is the hardest to wrap with @Retry because the method is long-running and stateful. Blindly retrying a partially-consumed NDJSON stream would corrupt the transcription block sequence. This is a significant nuance the issue glosses over.
- The training methods (trainModel, segtrainModel, trainSenderModel) should not get circuit-breaker or retry — training is user-initiated, long-running, non-idempotent, and failure there surfaces correctly today. Annotating them would break the training workflow.
- The isHealthy() pre-flight check in OcrService.startOcr() and OcrBatchService.startBatch() is a second form of "circuit-breaker" that already exists. With Resilience4j in place, the health-check pre-flight becomes redundant — the circuit will handle the unavailable state. Document this decision in the implementation: keep the pre-flight or remove it, but don't have both.
- DomainException.internal() maps to HTTP 500. The issue spec says the target is HTTP 503. A serviceUnavailable() factory does not exist in DomainException.java — it needs to be added, or new DomainException(ErrorCode.OCR_SERVICE_UNAVAILABLE, HttpStatus.SERVICE_UNAVAILABLE, ...) used directly. This is a one-liner but it is missing from the issue checklist.

Recommendations
- Add a serviceUnavailable() factory to DomainException first, before wiring Resilience4j — the fallback method needs it, and it fixes the existing 500→503 mismatch in OcrService and OcrBatchService regardless of whether this issue is done (a sketch follows below).
- Apply @CircuitBreaker + @Retry to extractBlocks and the initial connection phase of streamBlocks only. For streamBlocks, consider wrapping only the HTTP connection/handshake in Resilience4j and letting the streaming itself run to completion or timeout independently.
- Exclude trainModel, segtrainModel, and trainSenderModel from circuit-breaker and retry — they are already long-running, user-visible, and non-idempotent.
- If resilience4j-spring-boot3:2.3.0 does not play clean with Spring Boot 4.0's BOM, fall back to resilience4j-core + a manual @Bean CircuitBreakerRegistry — slightly more boilerplate, no starter magic, fully controllable.
- Remove the isHealthy() pre-flight in OcrService and OcrBatchService once Resilience4j's circuit-breaker is in place — the health check is now a synchronous call that adds latency and duplicates the circuit's job. Log a warning in the fallback method instead.
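A sketch of the missing factory, assuming DomainException exposes (or can gain) a constructor taking an ErrorCode, an HttpStatus, and a message; the real constructor in DomainException.java may differ, so adapt accordingly:

```java
// Sketch — factory to add to DomainException; the constructor shape is an assumption.
public static DomainException serviceUnavailable(ErrorCode code, String message) {
    return new DomainException(code, org.springframework.http.HttpStatus.SERVICE_UNAVAILABLE, message);
}
```

Used from the OCR fallback as DomainException.serviceUnavailable(ErrorCode.OCR_SERVICE_UNAVAILABLE, "…"), which also lets OcrService and OcrBatchService drop their current 500 mapping.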
Open Decisions

- Should streamBlocks retry from the beginning if the TCP connection drops mid-stream (re-submitting the entire PDF URL for re-processing)? This is safe for idempotent single-document OCR but could create double transcription blocks if clearExistingBlocks is not called before the retry. Needs explicit handling in the retry logic, not just an annotation.

👨💻 Felix Brandt — Fullstack Developer
Observations
- RestClientOcrClient currently throws new RuntimeException(...) from two places (streamBlocks serialization and NDJSON stream failure) and wraps IOException | InterruptedException in raw RuntimeException. These bypass DomainException and land in GlobalExceptionHandler.handleGeneric() as 500 with INTERNAL_ERROR. This pre-existing code smell should be fixed as part of this issue — the fallback method should catch these too.
- The isHealthy() path in RestClientOcrClient silently returns false on any exception — that is fine. But OcrService.startOcr() calls isHealthy() synchronously on the request thread before queuing the job. If the OCR service is slow-but-alive, this health check itself blocks for up to 10s (connect timeout). Resilience4j's circuit-breaker will short-circuit this path once it is open, but the current code has a latency trap even at healthy times.
- The @CircuitBreaker(fallbackMethod = "ocrUnavailable") pattern in the issue works cleanly when the annotated method is on a Spring-managed bean. RestClientOcrClient is annotated @Component, so AOP proxying works. However, parseNdjsonStream is package-private static — it cannot be wrapped by AOP and correctly shouldn't be. The streaming boundary (the HttpClient.send(...) call) is where the annotation belongs, not the parsing method.
- The issue proposes @TimeLimiter for the streaming path. With HttpClient.send(...) (blocking), @TimeLimiter requires an executor-backed future — it does not work transparently on synchronous blocking calls unless combined with CompletableFuture. The current streamBlocks is not async. Either switch to HttpClient.sendAsync(...) or rely solely on the existing request.timeout(Duration.ofMinutes(5)) for the stream path and use circuit-breaker only.
- record OcrBlockJson and record TrainingResultJson are fine — immutable, compact. No issues.
- The application.yaml snippet uses ignore-exceptions for HttpClientErrorException. In the current implementation, the RestClient throws HttpClientErrorException only for non-streaming calls (extractBlocks, trainModel). The streaming path uses raw HttpClient, which throws IOException on failure, not HttpClientErrorException. The ignore-list needs to account for both exception hierarchies.
- The RestClientOcrClientResilienceTest proposed in the issue needs to test the actual RestClientOcrClient behavior, not just Resilience4j's decorators. Use WireMock or MockWebServer to simulate 500s, slow responses, and 4xx responses. The existing RestClientOcrClientStreamTest uses a raw InputStream — extend that pattern for resilience scenarios.

Recommendations
- Convert the new RuntimeException(...) throws in streamBlocks to use DomainException before adding Resilience4j. The fallback method needs a clean exception hierarchy to catch from.
- Add a serviceUnavailable() factory to DomainException (HttpStatus.SERVICE_UNAVAILABLE) — the fallback needs it, and the existing OcrService/OcrBatchService callers currently return 500 instead of 503.
- Apply @CircuitBreaker and @Retry only to extractBlocks. For streamBlocks, wrap only the HttpClient.send(...) call — not the NDJSON parsing — and rely on the per-request timeout(Duration.ofMinutes(5)) instead of @TimeLimiter to avoid async plumbing. (A sketch of the annotated method and fallback follows below.)
- In OcrAsyncRunner.runBatch(), the per-document try/catch already isolates failures — a circuit-breaker open state will cause the catch to fire cleanly, resulting in OcrDocumentStatus.FAILED for that document. Verify this behavior in the resilience test.
- Add WireMock or MockWebServer stubs: (1) stub 500 → verify 3 retries → circuit opens, (2) stub 4xx → verify no retry, (3) stub slow response → verify circuit opens after the slow-call threshold.
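Roughly what the annotated non-streaming call and its fallback could look like under those recommendations: the extractBlocks signature, the placeholder delegate, and the serviceUnavailable() factory are assumptions to be aligned with the real RestClientOcrClient.

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

// Sketch — extractBlocks(String pdfUrl) is an assumed signature, not the real one.
@Retry(name = "ocr")
@CircuitBreaker(name = "ocr", fallbackMethod = "ocrUnavailable")
public OcrResult extractBlocks(String pdfUrl) {
    // ...existing RestClient call, unchanged...
    return callOcrService(pdfUrl); // hypothetical placeholder for the current implementation
}

// Fallback: same class, same parameter list plus a trailing Throwable.
// protected (not private) so Spring AOP proxying reliably resolves it.
protected OcrResult ocrUnavailable(String pdfUrl, Throwable t) {
    log.warn("OCR circuit open or retries exhausted", t);
    throw DomainException.serviceUnavailable(ErrorCode.OCR_SERVICE_UNAVAILABLE,
            "OCR service unavailable");
}
```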
🔧 Tobias Wendt — DevOps & Platform Engineer

Observations
- docker-compose.yml already has ocr-service with mem_limit: 12g and restart: unless-stopped. The depends_on: condition: service_started (not service_healthy) means the backend starts even if OCR is still loading models. Model loading takes 30–50s on cold start — this is exactly the window where a health-check pre-flight fails and retries would storm a not-yet-ready service.
- There is no healthcheck block on the ocr-service in docker-compose.yml. Adding one (GET /health, start_period: 60s) would align with the backend's OcrHealthClient.isHealthy() check and allow depends_on: condition: service_healthy — giving Resilience4j a clean starting state instead of catching cold-start failures as circuit triggers.
- The issue assumes resilience4j_circuitbreaker_state is available as a Micrometer metric at /actuator/prometheus. Looking at pom.xml, spring-boot-starter-actuator is present but there is no micrometer-registry-prometheus dependency. Without the Prometheus registry, metrics are exposed via JMX/health only — the Grafana panel mentioned in the issue will not have data to scrape. This dependency needs to be added to pom.xml.
- PORT_BACKEND is published externally (ports: - "${PORT_BACKEND}:8080"). The Actuator endpoint /actuator/prometheus is on the same port. Without a Caddy route to block /actuator/*, the metrics are publicly accessible. This is a pre-existing issue but worth noting — Resilience4j metrics will expose circuit state, retry counts, and slow-call rates to anyone who knows the port.
- memswap_limit: 12g on OCR matches mem_limit: 12g — effectively disabling swap for the OCR container. An OOM will kill the container immediately rather than swap. This is intentional for performance but means the restart: unless-stopped recovery is the only protection. Resilience4j's wait-duration needs to be long enough for the container to restart and reload models (~60s).

Recommendations
- Add a healthcheck to ocr-service in docker-compose.yml and change depends_on in the backend service to condition: service_healthy (a sketch follows below).
- Add micrometer-registry-prometheus to backend/pom.xml in this PR — the metrics are useless without it, and the issue's acceptance criteria explicitly require /actuator/prometheus output.
- Set wait-duration-in-open-state: 60s in the application.yaml config (not 30s as proposed) to give the OCR container enough time to restart and reload models. A 30s wait is likely shorter than the cold-start time, causing the half-open probe to hit a not-yet-ready service and immediately re-open the circuit.
- Do not expose /actuator/prometheus on the public-facing port. Either use a management port (management.server.port: 8081) and ensure that port is not published in docker-compose.yml, or add a Caddy block rule for /actuator/*.
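A sketch of the compose change, assuming the OCR container answers GET /health on port 8000 and ships curl; endpoint, port, and probe tool need to be checked against the actual image:

```yaml
# Sketch — probe command, port, and timings are assumptions; start_period covers model loading.
ocr-service:
  # ...existing image, mem_limit, restart settings stay as-is...
  healthcheck:
    test: ["CMD-SHELL", "curl -fsS http://localhost:8000/health || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 60s

backend:
  depends_on:
    ocr-service:
      condition: service_healthy
```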
Open Decisions

- The wait-duration-in-open-state value: 30s (issue proposal) vs 60s+ (model-reload reality). The right value depends on the observed OCR container restart+reload time. Instrument the cold-start time first, then set the wait to cold_start_time + 15s buffer.

📋 Elicit — Requirements Engineer
Observations
- The frontend/messages/{de,en,es}.json files are listed as critical but no key name or copy is given. An implementer needs to invent the wording, which introduces consistency risk.
- extractBlocks has a 10-minute read timeout — a slow (not failing) OCR service could tie up a thread for 10 minutes before the slow-call threshold triggers. The AC should explicitly distinguish "fast failure (5xx/timeout)" from "slow call" scenarios.
- OcrBatchService processes documents sequentially in a loop. If the circuit opens mid-batch, subsequent documents in the batch will fail fast (good), but the batch job will record all remaining documents as FAILED rather than PENDING/RETRYABLE. This behavioral requirement is not specified — should a mid-batch circuit-open cause the whole batch to abort (current behavior after this change) or pause-and-resume?
- The training methods trainModel, segtrainModel, trainSenderModel are not mentioned. The issue is silent on whether they get resilience treatment. Silence here creates implementation ambiguity.
- The verification step that relies on curl against /actuator/prometheus is not testable without the Prometheus registry dependency being added — but that dependency is not listed in the critical files.

Recommendations
- Specify the i18n key (error_ocr_service_unavailable) and draft the German string (Der OCR-Dienst ist vorübergehend nicht verfügbar. Bitte versuche es später erneut.) to avoid improvised wording.
- Add an explicit AC for mid-batch behavior: "If the circuit opens mid-batch, the remaining documents are marked FAILED with error message OCR_SERVICE_UNAVAILABLE." — or alternatively "the batch pauses and surfaces a retriable error." Pick one and write it down.
- List the dependency explicitly: backend/pom.xml must include micrometer-registry-prometheus for the curl verification step to be testable.
- Re-estimate the effort: keep S only if resilience4j-spring-boot3 is Boot 4 compatible; escalate to M (2-3 days) if manual AOP wiring is required.
- Add an explicit AC: "Training routes (/train, /segtrain, /train-sender) are excluded from circuit-breaker and retry — their long-running, non-idempotent nature makes retry unsafe."

🔐 Nora "NullX" Steiner — Security Engineer
Observations
- The ignore-exceptions list in the issue includes HttpClientErrorException (Spring's 4xx exception class). This is correct security posture — 4xx errors (e.g., malformed PDF payload) should not trigger retries that could amplify a bad-input attack. However, the streaming path (streamBlocks) uses raw java.net.http.HttpClient, not Spring's RestClient, so it does not throw HttpClientErrorException for 4xx — it only throws IOException. The ignore-list only applies to extractBlocks and the training methods. This is correct behavior but should be documented explicitly so it is not "fixed" by a future developer who thinks the ignore-list is incomplete.
- RestClientOcrClient constructs the OCR request body from pdfUrl (a MinIO presigned URL) via string concatenation into a Map, then serialized with Jackson. The presigned URL is generated by FileService from a stored file path — it is not user-controlled. No injection risk here.
- The @Slf4j logger in RestClientOcrClient uses SLF4J parameterized logging (log.warn("OCR circuit open or retries exhausted", t)) in the proposed fallback — this is correct. The existing log.info("OCR service does not support /ocr/stream (404), falling back to /ocr") is also correct. No log injection vector.
- /actuator/prometheus will expose resilience4j_circuitbreaker_state{name="ocr"}. If this endpoint is reachable by unauthorized users (currently it shares port 8080 with the public API), an attacker can observe circuit state to time requests during half-open windows. This is an information disclosure, not critical, but worth noting. The fix is already recommended by Tobias — management port separation.
- The proposed fallback signature is private OcrResult ocrUnavailable(/*params*/, Throwable t). Resilience4j's fallback method must be in the same class and must have a matching signature (same params + Throwable appended). Since the class is @Component and AOP-proxied, the fallback must also be a bean method — private visibility works with some AOP configurations but can silently fail with Spring AOP's proxy-based weaving. Use protected or package-private to be safe.
- streamBlocks: the pdfUrl passed to the OCR service is a MinIO presigned URL generated internally — not user-supplied directly. But it traverses the OCR Python service, which fetches the URL. The Python service already has ALLOWED_PDF_HOSTS validation. This is pre-existing and correctly handled. No new surface introduced by this issue.

Recommendations
- Use protected (not private) for the fallback method: Resilience4j with Spring AOP's proxy-based weaving can silently skip private fallback methods depending on whether the proxy is JDK-based or CGLIB-based. protected is the safe default.
- Add a comment in application.yaml next to ignore-exceptions explaining that HttpClientErrorException only applies to the non-streaming RestClient paths — it prevents a future developer from "fixing" the exception list with streaming exception types that would accidentally enable retry on malformed inputs in the streaming path. (A sketch of the commented config follows below.)
- Treat the /actuator/prometheus exposure as a blocker (or at minimum a P1-high issue to track) until the management port is separated. Resilience4j circuit state is operational information — it should not be publicly readable.
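The commented ignore-exceptions block could look like this (retry instance name "ocr" assumed; the exception class is Spring's fully qualified RestClient 4xx exception):

```yaml
resilience4j:
  retry:
    instances:
      ocr:
        # HttpClientErrorException (4xx) is only thrown by the RestClient-based calls
        # (extractBlocks, training). The streaming path uses java.net.http.HttpClient and
        # fails with IOException; do NOT add streaming exception types here, or retries
        # would silently be re-enabled for malformed input on the stream path.
        ignore-exceptions:
          - org.springframework.web.client.HttpClientErrorException
```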
🧪 Sara Holt — QA Engineer

Observations
- The issue proposes RestClientOcrClientResilienceTest.java as a new test class. The existing test pattern in RestClientOcrClientStreamTest tests the static parseNdjsonStream method directly with a raw InputStream — clean, fast, no network. The resilience tests will need to exercise the actual HTTP layer, requiring WireMock or MockWebServer to simulate failures. Neither is currently in the project's pom.xml — a dependency needs to be added.
- Use the COUNT_BASED sliding window (already proposed in the issue) and configure very small window sizes in the test configuration (sliding-window-size: 3, permitted-number-of-calls-in-half-open-state: 1) rather than relying on the production config values.
- The half-open transition test has to (1) open the circuit, (2) wait for wait-duration-in-open-state to elapse, and (3) make a probe call. With a 30s wait, this test will take 30 seconds minimum — a problem for a test suite with a <10s unit-test target. Either use a test-specific application config with wait-duration-in-open-state: 500ms or use the CircuitBreakerRegistry programmatic API to transition state directly (sketched below).
- The existing async-runner tests mock the client interface (OcrClient). No test currently exercises what happens when OcrAsyncRunner.runSingleDocument catches a DomainException with OCR_SERVICE_UNAVAILABLE — the catch block sets OcrJobStatus.FAILED and persists, but there is no test asserting this. Add a test in OcrAsyncRunnerTest covering circuit-open → fallback → job status FAILED.
- OcrBatchServiceTest should be extended with a test: "when the circuit is open, startBatch() throws DomainException with OCR_SERVICE_UNAVAILABLE and does not create any job or job-document rows."
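The programmatic state transition mentioned above is available on Resilience4j's CircuitBreaker API; a sketch of how a test could use it instead of waiting out the open-state duration (registry injection and the instance name "ocr" are assumptions about the final wiring):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.beans.factory.annotation.Autowired;

// Sketch — force circuit states directly so the half-open test stays fast.
class CircuitStateTestSupport {

    @Autowired
    CircuitBreakerRegistry circuitBreakerRegistry;

    void openCircuitForTest() {
        CircuitBreaker ocr = circuitBreakerRegistry.circuitBreaker("ocr");
        ocr.transitionToOpenState();      // simulate "OCR is down, circuit tripped"
    }

    void probeHalfOpenForTest() {
        circuitBreakerRegistry.circuitBreaker("ocr").transitionToHalfOpenState();
    }
}
```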
Recommendations

- Add wiremock-spring-boot or com.squareup.okhttp3:mockwebserver to pom.xml for HTTP-layer resilience tests. WireMock integrates cleanly with Spring Boot test slices.
- Add an application-test.yaml (or use @TestPropertySource) that overrides the resilience config to sliding-window-size: 3, failure-rate-threshold: 100 (100% failure opens the circuit), wait-duration-in-open-state: 100ms — this makes state-machine tests fast and deterministic (a sketch follows below).
- Test cases for RestClientOcrClientResilienceTest:
  - extractBlocks_retriesOnIOException_thenSucceeds — 2 failures then success → no circuit open
  - extractBlocks_opensCircuitAfterThreeConsecutive5xx — 3 failures → circuit opens → 4th call fails fast (no HTTP hit)
  - extractBlocks_doesNotRetryOn4xx — single 4xx → fallback fires, no retry
  - extractBlocks_circuitHalfOpenProbePasses_closesCircuit — open → wait → probe succeeds → circuit closes
- Add OcrAsyncRunnerTest.runSingleDocument_setsJobStatusFailed_whenCircuitOpenThrowsDomainException — verifies the catch (Exception e) branch correctly persists OcrJobStatus.FAILED.
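A sketch of the test-only override file, using the values above (src/test/resources/application-test.yaml is an assumed location; wire it in with @ActiveProfiles("test") or @TestPropertySource):

```yaml
# Test-only overrides — tiny window and near-zero open-state wait keep the
# circuit-breaker state-machine tests fast and deterministic.
resilience4j:
  circuitbreaker:
    instances:
      ocr:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 3
        failure-rate-threshold: 100
        wait-duration-in-open-state: 100ms
        permitted-number-of-calls-in-half-open-state: 1
```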
🎨 Leonie Voss — UI/UX Design Lead

Observations
- The OCR_SERVICE_UNAVAILABLE error code already exists in frontend/src/lib/shared/errors.ts (case 'OCR_SERVICE_UNAVAILABLE' is present). So the frontend error-mapping path is in place.
- The issue lists frontend/messages/{de,en,es}.json as critical files but does not provide the copy. This matters for the senior transcriber audience (60+) who will encounter this during active transcription work — the message needs to be clear, calm, and actionable, not a technical status code.
- With the existing mapping (getErrorMessage() → i18n key), the user will see whatever string is in messages/de.json for error_ocr_service_unavailable. Currently that key likely maps to a generic fallback. The circuit-breaker makes this error path more frequent and more predictable — it deserves purpose-written copy.
- When the circuit opens mid-batch, the remaining documents are marked FAILED. The admin user will see a mix of DONE and FAILED statuses with no explanation of why the failures occurred. A batch-level error message in the UI would help — but that is a separate concern from this issue.

Recommendations
- Proposed copy for error_ocr_service_unavailable (de): Der OCR-Dienst ist gerade nicht verfügbar. Bitte warte einen Moment und versuche es erneut.
- error_ocr_service_unavailable (en): The OCR service is temporarily unavailable. Please wait a moment and try again.
- error_ocr_service_unavailable (es): El servicio OCR no está disponible en este momento. Por favor, espera un momento e inténtalo de nuevo.
- Check how the error is announced (aria-live="polite") — seniors may need time to read it before it auto-dismisses. This is a pre-existing concern for all OCR error states, but the increased frequency of this specific error (the circuit opens regularly under load) makes it worth checking now.

Decision Queue — Cross-Persona Open Items
The following decisions were raised by multiple personas and need explicit human judgment before implementation begins.
Theme 1: Streaming retry semantics (Markus + Felix)
Question: Should streamBlocks be retried from the beginning if the TCP connection drops mid-stream?
Context: OcrAsyncRunner.runSingleDocument calls ocrClient.streamBlocks(...) and accumulates transcription blocks incrementally. If the stream fails mid-way and Resilience4j retries the call, streamBlocks re-submits the full PDF URL and re-processes from page 1. Without calling clearExistingBlocks before the retry, the result is duplicate transcription blocks in the database.

Options:
Option A: Add @Retry to streamBlocks and call clearExistingBlocks at the start of each retry attempt (safe, but requires a retry callback/decorator pattern, not just an annotation).
Option B: Do not add @Retry to streamBlocks at all — rely solely on @CircuitBreaker for the streaming path, and let the existing catch (Exception e) in OcrAsyncRunner handle failure by marking the job FAILED.
Option C: Add @Retry only to the initial HTTP connection attempt in streamBlocks, not the full streaming body — requires refactoring streamBlocks to separate connection from consumption.

Recommendation from team: Option B is the safest default and avoids plumbing complexity. Option A requires careful implementation (a sketch of its mechanics follows below). Option C is the cleanest but requires the most refactoring.
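If Option A is chosen anyway, the retry-callback pattern would look roughly like this: a programmatic Retry whose event publisher clears the already-persisted blocks before each re-attempt. The streamBlocks/clearExistingBlocks signatures and the service field names are hypothetical and only illustrate the shape of the plumbing.

```java
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

// Sketch of Option A — clear persisted blocks before every re-attempt so a full
// re-run of the NDJSON stream cannot leave duplicate transcription blocks behind.
void streamWithClearingRetry(String pdfUrl, long documentId) {
    RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofSeconds(2))
            .build();
    Retry retry = Retry.of("ocr-stream", config);

    // Fired on every re-attempt (not on the first call).
    retry.getEventPublisher().onRetry(event ->
            transcriptionBlockService.clearExistingBlocks(documentId)); // hypothetical hook

    Retry.decorateRunnable(retry, () -> ocrClient.streamBlocks(pdfUrl)) // assumed signature
         .run();
}
```

This is exactly the extra plumbing the team recommendation wants to avoid with Option B.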
Theme 2: wait-duration-in-open-state value (Markus + Tobias)

Question: Should wait-duration-in-open-state be 30s (as proposed in the issue) or 60s+ (as recommended by Tobias based on OCR cold-start time)?

Context: The OCR service's Docker container needs ~30–50s to load models on cold start. If the circuit opens due to an OOM kill, the 30s wait may expire before the service is ready, causing the half-open probe to fail and immediately re-open the circuit — making recovery slower than the current behavior.
Options:
Option A: Keep 30s as proposed in the issue.
Option B: Set 60s as a safe default that covers container restart and model reload.
Option C: Measure the actual cold-start time (via docker logs archive-ocr on restart) and set the wait to cold_start + 15s.

Recommendation from team: Option C is correct long-term. Option B is safe today. Do not use Option A.
Theme 3: Training routes — opt-in or opt-out of resilience? (Markus + Elicit)
Question: Should trainModel, segtrainModel, and trainSenderModel be wrapped by @CircuitBreaker and @Retry?

Context: Training is user-initiated, takes minutes, is non-idempotent (retrying could re-run a training job), and already surfaces failures directly to the admin user. The circuit-breaker on extractBlocks would not protect training routes unless explicitly annotated.

Options:
Option A: Leave the training methods as they are: no circuit-breaker, no retry.
Option B: Apply @CircuitBreaker only (no retry) to the training methods — fast-fail when OCR is down, but no retry storm.

Recommendation from team: Option A. Training failures during OCR outage are acceptable and self-evident. Wrapping adds complexity with no clear benefit.
Theme 4: Fate of the isHealthy() pre-flight check (Markus + Felix)

Question: Once Resilience4j is in place, should the synchronous OcrHealthClient.isHealthy() pre-flight in OcrService.startOcr() and OcrBatchService.startBatch() be removed?

Context: The pre-flight check adds ~10s latency (connect timeout) when the OCR service is down. With a circuit-breaker in place, the circuit will fail fast after opening. Keeping both results in redundant latency and a confusing dual path.
Options:
Option A: Remove the isHealthy() pre-flight and rely on the circuit-breaker alone.
Option B: Keep the pre-flight alongside the circuit-breaker.

Recommendation from team: Option A, after Resilience4j is confirmed working. Until then, keep the pre-flight as a safety net.