feat(resilience): wrap OCR client with Resilience4j retry + circuit-breaker + time-limiter #463
Context
backend/src/main/java/org/raddatz/familienarchiv/ocr/RestClientOcrClient.java configures connect/read timeouts (10s connect / 10min read) but has no retry, no circuit-breaker, no bulkhead. The fallback path is try/catch → DomainException.internal(OCR_SERVICE_UNAVAILABLE).

Operational consequences:
- mem_limit: 12g per docker-compose.yml means a single bad PDF can OOM-kill the whole service. The backend will hammer it during recovery and prevent it from coming back cleanly.

The dynamic part of the audit corroborates this:
- pg_stat_statements shows the backend is normally fast (sub-ms queries), so the OCR boundary is the most likely failure mode for cascading user-visible 5xx.

Approach
Add resilience4j-spring-boot3 (or the matching Boot 4 starter) and annotate the OCR client. Use three patterns: retry, circuit-breaker, time-limiter. Do not retry on 4xx (bad PDF) — only on 5xx and timeouts. The changes touch backend/pom.xml and RestClientOcrClient.java. For the streaming /ocr/stream call, pin @TimeLimiter instead of trusting the read timeout alone; the resilience configuration goes in application.yaml (a sketch follows below).
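A minimal application.yaml sketch of the intended configuration, assuming an instance name of "ocr" and the values discussed in this issue (3 retries with 2s/4s/8s exponential back-off, a count-based circuit-breaker, the proposed 30s open-state wait); window size and failure-rate threshold are illustrative placeholders, and the property keys should be verified against the Resilience4j version that survives the BOM check:

```yaml
# Sketch only — instance name "ocr", window size, and failure-rate threshold are assumptions.
resilience4j:
  retry:
    instances:
      ocr:
        max-attempts: 3                     # adjust to 4 if "3 retries" means 3 re-attempts after the first call
        wait-duration: 2s
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2   # 2s, 4s, 8s
  circuitbreaker:
    instances:
      ocr:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s    # see the open decision on 30s vs 60s+ below
  timelimiter:
    instances:
      ocr:
        timeout-duration: 5m                # only meaningful if the streaming call becomes async (see Felix's note below)
```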
Metrics

Resilience4j exposes Micrometer metrics out of the box. After #140 (Prometheus stack), add a Grafana panel for resilience4j_circuitbreaker_state{name="ocr"} and resilience4j_retry_calls{name="ocr",kind="successful_with_retry"}.

Critical files
- backend/pom.xml
- backend/src/main/java/org/raddatz/familienarchiv/ocr/RestClientOcrClient.java — annotations + fallback
- backend/src/main/resources/application.yaml — resilience4j config
- backend/src/main/java/org/raddatz/familienarchiv/exception/ErrorCode.java — new OCR_SERVICE_UNAVAILABLE code (if not already present)
- frontend/src/lib/shared/errors.ts — mirror the new ErrorCode
- frontend/messages/{de,en,es}.json — i18n key for the new error
- backend/src/test/java/org/raddatz/familienarchiv/ocr/RestClientOcrClientResilienceTest.java — new

Verification
- Stop the OCR container (docker compose stop ocr-service). Trigger an OCR run via the backend. Expect: 3 retries with 2s/4s/8s back-off, then a clean 503 with OCR_SERVICE_UNAVAILABLE, not a stack-trace 500.
- Freeze the OCR process so calls hang (docker exec archive-ocr kill -STOP 1). After 30s of slow calls the circuit opens; subsequent requests fail fast (no retry, no wait) until 30s pass.
- curl http://localhost:8080/actuator/prometheus | grep resilience4j_circuitbreaker_state returns the gauge.

Acceptance criteria
- Resilience4j retry and circuit-breaker metrics for the OCR client are exposed at /actuator/prometheus.

Effort
S — 1 day including resilience4j-vs-Boot 4 BOM compatibility check.
Risk if not addressed
OCR outages cascade to backend thread-pool exhaustion → 5xx for all user requests, not just OCR ones.
Tracked in audit doc as F-09 (High).
🏛️ Markus Keller — Application Architect
Observations
- The resilience4j-spring-boot3 artifact name in the issue is wrong for this project. The backend is Spring Boot 4.0 (confirmed in pom.xml line 8). Resilience4j 2.x publishes separate starters: resilience4j-spring-boot3 is the Boot 3.x artifact. For Boot 4 / Spring Framework 7, you must either use the plain resilience4j-spring-boot3 artifact at version 2.3.0+ (which has Boot 4 support starting from a later patch — verify BOM compatibility) or resilience4j-core + manual Spring AOP wiring. This BOM compatibility check is the real risk in the "S — 1 day" estimate; it could become an M if the Boot 4 starter is not yet published.
- @CircuitBreaker/@Retry annotations on RestClientOcrClient are the right application point. The OcrClient interface is already mockable, so the AOP weaving will work cleanly.
- The issue's snippet targets startOcr(...) in RestClientOcrClient — but looking at the actual code, the methods are extractBlocks, streamBlocks, trainModel, etc. There is no startOcr method on the client. The issue's Java snippet is illustrative but not copy-pasteable to the real class.
- The streaming path (streamBlocks → HttpClient.send(...) with a Duration.ofMinutes(5) timeout) is the hardest to wrap with @Retry because the method is long-running and stateful. Blindly retrying a partially-consumed NDJSON stream would corrupt the transcription block sequence. This is a significant nuance the issue glosses over.
- The training methods (trainModel, segtrainModel, trainSenderModel) should not get circuit-breaker or retry — training is user-initiated, long-running, non-idempotent, and failure there surfaces correctly today. Annotating them would break the training workflow.
- The isHealthy() pre-flight check in OcrService.startOcr() and OcrBatchService.startBatch() is a second form of "circuit-breaker" that already exists. With Resilience4j in place, the health-check pre-flight becomes redundant — the circuit will handle the unavailable state. Document this decision in the implementation: keep the pre-flight or remove it, but don't have both.
- DomainException.internal() maps to HTTP 500. The issue spec says the target is HTTP 503. A serviceUnavailable() factory does not exist in DomainException.java — it needs to be added, or new DomainException(ErrorCode.OCR_SERVICE_UNAVAILABLE, HttpStatus.SERVICE_UNAVAILABLE, ...) used directly. This is a one-liner but it is missing from the issue checklist.

Recommendations
- Add a serviceUnavailable() factory to DomainException first, before wiring Resilience4j — the fallback method needs it, and it fixes the existing 500→503 mismatch in OcrService and OcrBatchService regardless of whether this issue is done (a sketch follows below).
- Apply @CircuitBreaker + @Retry to extractBlocks and the initial connection phase of streamBlocks only. For streamBlocks, consider wrapping only the HTTP connection/handshake in Resilience4j and letting the streaming itself run to completion or timeout independently.
- Exclude trainModel, segtrainModel, and trainSenderModel from circuit-breaker and retry — they are already long-running, user-visible, and non-idempotent.
- If resilience4j-spring-boot3:2.3.0 does not play clean with Spring Boot 4.0's BOM, fall back to resilience4j-core + a manual @Bean CircuitBreakerRegistry — slightly more boilerplate, no starter magic, fully controllable.
- Remove the isHealthy() pre-flight in OcrService and OcrBatchService once Resilience4j's circuit-breaker is in place — the health check is now a synchronous call that adds latency and duplicates the circuit's job. Log a warning in the fallback method instead.
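A sketch of the missing factory, assuming DomainException exposes (or can gain) a constructor taking an ErrorCode, an HttpStatus, and a message; the real constructor in DomainException.java may differ, so adapt accordingly:

```java
// Sketch — factory to add to DomainException; the constructor shape is an assumption.
public static DomainException serviceUnavailable(ErrorCode code, String message) {
    return new DomainException(code, org.springframework.http.HttpStatus.SERVICE_UNAVAILABLE, message);
}
```

Used from the OCR fallback as DomainException.serviceUnavailable(ErrorCode.OCR_SERVICE_UNAVAILABLE, "…"), which also lets OcrService and OcrBatchService drop their current 500 mapping.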
Open Decisions

- Should streamBlocks retry from the beginning if the TCP connection drops mid-stream (re-submitting the entire PDF URL for re-processing)? This is safe for idempotent single-document OCR but could create double transcription blocks if clearExistingBlocks is not called before the retry. Needs explicit handling in the retry logic, not just an annotation.

👨💻 Felix Brandt — Fullstack Developer
Observations
- RestClientOcrClient currently throws new RuntimeException(...) from two places (streamBlocks serialization and NDJSON stream failure) and wraps IOException | InterruptedException in raw RuntimeException. These bypass DomainException and land in GlobalExceptionHandler.handleGeneric() as 500 with INTERNAL_ERROR. This pre-existing code smell should be fixed as part of this issue — the fallback method should catch these too.
- The isHealthy() path in RestClientOcrClient silently returns false on any exception — that is fine. But OcrService.startOcr() calls isHealthy() synchronously on the request thread before queuing the job. If the OCR service is slow-but-alive, this health check itself blocks for up to 10s (connect timeout). Resilience4j's circuit-breaker will short-circuit this path once it is open, but the current code has a latency trap even at healthy times.
- The @CircuitBreaker(fallbackMethod = "ocrUnavailable") pattern in the issue works cleanly when the annotated method is on a Spring-managed bean. RestClientOcrClient is annotated @Component, so AOP proxying works. However, parseNdjsonStream is package-private static — it cannot be wrapped by AOP and correctly shouldn't be. The streaming boundary (the HttpClient.send(...) call) is where the annotation belongs, not the parsing method.
- The issue proposes @TimeLimiter for the streaming path. With HttpClient.send(...) (blocking), @TimeLimiter requires an executor-backed future — it does not work transparently on synchronous blocking calls unless combined with CompletableFuture. The current streamBlocks is not async. Either switch to HttpClient.sendAsync(...) or rely solely on the existing request.timeout(Duration.ofMinutes(5)) for the stream path and use circuit-breaker only.
- record OcrBlockJson and record TrainingResultJson are fine — immutable, compact. No issues.
- The application.yaml snippet uses ignore-exceptions for HttpClientErrorException. In the current implementation, the RestClient throws HttpClientErrorException only for non-streaming calls (extractBlocks, trainModel). The streaming path uses raw HttpClient, which throws IOException on failure, not HttpClientErrorException. The ignore-list needs to account for both exception hierarchies.
- The RestClientOcrClientResilienceTest proposed in the issue needs to test the actual RestClientOcrClient behavior, not just Resilience4j's decorators. Use WireMock or MockWebServer to simulate 500s, slow responses, and 4xx responses. The existing RestClientOcrClientStreamTest uses a raw InputStream — extend that pattern for resilience scenarios.

Recommendations
- Convert the new RuntimeException(...) throws in streamBlocks to use DomainException before adding Resilience4j. The fallback method needs a clean exception hierarchy to catch from.
- Add a serviceUnavailable() factory to DomainException (HttpStatus.SERVICE_UNAVAILABLE) — the fallback needs it, and the existing OcrService/OcrBatchService callers currently return 500 instead of 503.
- Apply @CircuitBreaker and @Retry only to extractBlocks. For streamBlocks, wrap only the HttpClient.send(...) call — not the NDJSON parsing — and rely on the per-request timeout(Duration.ofMinutes(5)) instead of @TimeLimiter to avoid async plumbing. (A sketch of the annotated method and fallback follows below.)
- In OcrAsyncRunner.runBatch(), the per-document try/catch already isolates failures — a circuit-breaker open state will cause the catch to fire cleanly, resulting in OcrDocumentStatus.FAILED for that document. Verify this behavior in the resilience test.
- Add WireMock or MockWebServer stubs: (1) stub 500 → verify 3 retries → circuit opens, (2) stub 4xx → verify no retry, (3) stub slow response → verify circuit opens after the slow-call threshold.
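Roughly what the annotated non-streaming call and its fallback could look like under those recommendations: the extractBlocks signature, the placeholder delegate, and the serviceUnavailable() factory are assumptions to be aligned with the real RestClientOcrClient.

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

// Sketch — extractBlocks(String pdfUrl) is an assumed signature, not the real one.
@Retry(name = "ocr")
@CircuitBreaker(name = "ocr", fallbackMethod = "ocrUnavailable")
public OcrResult extractBlocks(String pdfUrl) {
    // ...existing RestClient call, unchanged...
    return callOcrService(pdfUrl); // hypothetical placeholder for the current implementation
}

// Fallback: same class, same parameter list plus a trailing Throwable.
// protected (not private) so Spring AOP proxying reliably resolves it.
protected OcrResult ocrUnavailable(String pdfUrl, Throwable t) {
    log.warn("OCR circuit open or retries exhausted", t);
    throw DomainException.serviceUnavailable(ErrorCode.OCR_SERVICE_UNAVAILABLE,
            "OCR service unavailable");
}
```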
🔧 Tobias Wendt — DevOps & Platform Engineer

Observations
- docker-compose.yml already has ocr-service with mem_limit: 12g and restart: unless-stopped. The depends_on: condition: service_started (not service_healthy) means the backend starts even if OCR is still loading models. Model loading takes 30–50s on cold start — this is exactly the window where a health-check pre-flight fails and retries would storm a not-yet-ready service.
- There is no healthcheck block on the ocr-service in docker-compose.yml. Adding one (GET /health, start_period: 60s) would align with the backend's OcrHealthClient.isHealthy() check and allow depends_on: condition: service_healthy — giving Resilience4j a clean starting state instead of catching cold-start failures as circuit triggers.
- The issue assumes resilience4j_circuitbreaker_state is available as a Micrometer metric at /actuator/prometheus. Looking at pom.xml, spring-boot-starter-actuator is present but there is no micrometer-registry-prometheus dependency. Without the Prometheus registry, metrics are exposed via JMX/health only — the Grafana panel mentioned in the issue will not have data to scrape. This dependency needs to be added to pom.xml.
- PORT_BACKEND is published externally (ports: - "${PORT_BACKEND}:8080"). The Actuator endpoint /actuator/prometheus is on the same port. Without a Caddy route to block /actuator/*, the metrics are publicly accessible. This is a pre-existing issue but worth noting — Resilience4j metrics will expose circuit state, retry counts, and slow-call rates to anyone who knows the port.
- memswap_limit: 12g on OCR matches mem_limit: 12g — effectively disabling swap for the OCR container. An OOM will kill the container immediately rather than swap. This is intentional for performance but means the restart: unless-stopped recovery is the only protection. Resilience4j's wait-duration needs to be long enough for the container to restart and reload models (~60s).

Recommendations
- Add a healthcheck to ocr-service in docker-compose.yml and change depends_on in the backend service to condition: service_healthy (a sketch follows below).
- Add micrometer-registry-prometheus to backend/pom.xml in this PR — the metrics are useless without it, and the issue's acceptance criteria explicitly require /actuator/prometheus output.
- Set wait-duration-in-open-state: 60s in the application.yaml config (not 30s as proposed) to give the OCR container enough time to restart and reload models. A 30s wait is likely shorter than the cold-start time, causing the half-open probe to hit a not-yet-ready service and immediately re-open the circuit.
- Do not expose /actuator/prometheus on the public-facing port. Either use a management port (management.server.port: 8081) and ensure that port is not published in docker-compose.yml, or add a Caddy block rule for /actuator/*.
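A sketch of the compose change, assuming the OCR container answers GET /health on port 8000 and ships curl; endpoint, port, and probe tool need to be checked against the actual image:

```yaml
# Sketch — probe command, port, and timings are assumptions; start_period covers model loading.
ocr-service:
  # ...existing image, mem_limit, restart settings stay as-is...
  healthcheck:
    test: ["CMD-SHELL", "curl -fsS http://localhost:8000/health || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 60s

backend:
  depends_on:
    ocr-service:
      condition: service_healthy
```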
Open Decisions

- The wait-duration-in-open-state value: 30s (issue proposal) vs 60s+ (model-reload reality). The right value depends on the observed OCR container restart+reload time. Instrument the cold-start time first, then set the wait to cold_start_time + 15s buffer.

📋 Elicit — Requirements Engineer
Observations
- The frontend/messages/{de,en,es}.json files are listed as critical but no key name or copy is given. An implementer needs to invent the wording, which introduces consistency risk.
- extractBlocks has a 10-minute read timeout — a slow (not failing) OCR service could tie up a thread for 10 minutes before the slow-call threshold triggers. The AC should explicitly distinguish "fast failure (5xx/timeout)" from "slow call" scenarios.
- OcrBatchService processes documents sequentially in a loop. If the circuit opens mid-batch, subsequent documents in the batch will fail fast (good), but the batch job will record all remaining documents as FAILED rather than PENDING/RETRYABLE. This behavioral requirement is not specified — should a mid-batch circuit-open cause the whole batch to abort (current behavior after this change) or pause-and-resume?
- The training methods trainModel, segtrainModel, trainSenderModel are not mentioned. The issue is silent on whether they get resilience treatment. Silence here creates implementation ambiguity.
- The verification step that relies on curl against /actuator/prometheus is not testable without the Prometheus registry dependency being added — but that dependency is not listed in the critical files.

Recommendations
- Specify the i18n key (error_ocr_service_unavailable) and draft the German string (Der OCR-Dienst ist vorübergehend nicht verfügbar. Bitte versuche es später erneut.) to avoid improvised wording.
- Add an explicit AC for mid-batch behavior: "If the circuit opens mid-batch, the remaining documents are marked FAILED with error message OCR_SERVICE_UNAVAILABLE." — or alternatively "the batch pauses and surfaces a retriable error." Pick one and write it down.
- List the dependency explicitly: backend/pom.xml must include micrometer-registry-prometheus for the curl verification step to be testable.
- Re-estimate the effort: keep S only if resilience4j-spring-boot3 is Boot 4 compatible; escalate to M (2-3 days) if manual AOP wiring is required.
- Add an explicit AC: "Training routes (/train, /segtrain, /train-sender) are excluded from circuit-breaker and retry — their long-running, non-idempotent nature makes retry unsafe."

🔐 Nora "NullX" Steiner — Security Engineer
Observations
- The ignore-exceptions list in the issue includes HttpClientErrorException (Spring's 4xx exception class). This is correct security posture — 4xx errors (e.g., malformed PDF payload) should not trigger retries that could amplify a bad-input attack. However, the streaming path (streamBlocks) uses raw java.net.http.HttpClient, not Spring's RestClient, so it does not throw HttpClientErrorException for 4xx — it only throws IOException. The ignore-list only applies to extractBlocks and the training methods. This is correct behavior but should be documented explicitly so it is not "fixed" by a future developer who thinks the ignore-list is incomplete.
- RestClientOcrClient constructs the OCR request body from pdfUrl (a MinIO presigned URL) via string concatenation into a Map, then serialized with Jackson. The presigned URL is generated by FileService from a stored file path — it is not user-controlled. No injection risk here.
- The @Slf4j logger in RestClientOcrClient uses SLF4J parameterized logging (log.warn("OCR circuit open or retries exhausted", t)) in the proposed fallback — this is correct. The existing log.info("OCR service does not support /ocr/stream (404), falling back to /ocr") is also correct. No log injection vector.
- /actuator/prometheus will expose resilience4j_circuitbreaker_state{name="ocr"}. If this endpoint is reachable by unauthorized users (currently it shares port 8080 with the public API), an attacker can observe circuit state to time requests during half-open windows. This is an information disclosure, not critical, but worth noting. The fix is already recommended by Tobias — management port separation.
- The proposed fallback signature is private OcrResult ocrUnavailable(/*params*/, Throwable t). Resilience4j's fallback method must be in the same class and must have a matching signature (same params + Throwable appended). Since the class is @Component and AOP-proxied, the fallback must also be a bean method — private visibility works with some AOP configurations but can silently fail with Spring AOP's proxy-based weaving. Use protected or package-private to be safe.
- streamBlocks: the pdfUrl passed to the OCR service is a MinIO presigned URL generated internally — not user-supplied directly. But it traverses the OCR Python service, which fetches the URL. The Python service already has ALLOWED_PDF_HOSTS validation. This is pre-existing and correctly handled. No new surface introduced by this issue.

Recommendations
- Use protected (not private) for the fallback method: Resilience4j with Spring AOP's proxy-based weaving can silently skip private fallback methods depending on whether the proxy is JDK-based or CGLIB-based. protected is the safe default.
- Add a comment in application.yaml next to ignore-exceptions explaining that HttpClientErrorException only applies to the non-streaming RestClient paths — it prevents a future developer from "fixing" the exception list with streaming exception types that would accidentally enable retry on malformed inputs in the streaming path. (A sketch of the commented config follows below.)
- Treat the /actuator/prometheus exposure as a blocker (or at minimum a P1-high issue to track) until the management port is separated. Resilience4j circuit state is operational information — it should not be publicly readable.
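The commented ignore-exceptions block could look like this (retry instance name "ocr" assumed; the exception class is Spring's fully qualified RestClient 4xx exception):

```yaml
resilience4j:
  retry:
    instances:
      ocr:
        # HttpClientErrorException (4xx) is only thrown by the RestClient-based calls
        # (extractBlocks, training). The streaming path uses java.net.http.HttpClient and
        # fails with IOException; do NOT add streaming exception types here, or retries
        # would silently be re-enabled for malformed input on the stream path.
        ignore-exceptions:
          - org.springframework.web.client.HttpClientErrorException
```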
🧪 Sara Holt — QA Engineer

Observations
- The issue proposes RestClientOcrClientResilienceTest.java as a new test class. The existing test pattern in RestClientOcrClientStreamTest tests the static parseNdjsonStream method directly with a raw InputStream — clean, fast, no network. The resilience tests will need to exercise the actual HTTP layer, requiring WireMock or MockWebServer to simulate failures. Neither is currently in the project's pom.xml — a dependency needs to be added.
- Use the COUNT_BASED sliding window (already proposed in the issue) and configure very small window sizes in the test configuration (sliding-window-size: 3, permitted-number-of-calls-in-half-open-state: 1) rather than relying on the production config values.
- The half-open transition test has to (1) open the circuit, (2) wait for wait-duration-in-open-state to elapse, and (3) make a probe call. With a 30s wait, this test will take 30 seconds minimum — a problem for a test suite with a <10s unit-test target. Either use a test-specific application config with wait-duration-in-open-state: 500ms or use the CircuitBreakerRegistry programmatic API to transition state directly (sketched below).
- The existing async-runner tests mock the client interface (OcrClient). No test currently exercises what happens when OcrAsyncRunner.runSingleDocument catches a DomainException with OCR_SERVICE_UNAVAILABLE — the catch block sets OcrJobStatus.FAILED and persists, but there is no test asserting this. Add a test in OcrAsyncRunnerTest covering circuit-open → fallback → job status FAILED.
- OcrBatchServiceTest should be extended with a test: "when the circuit is open, startBatch() throws DomainException with OCR_SERVICE_UNAVAILABLE and does not create any job or job-document rows."
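The programmatic state transition mentioned above is available on Resilience4j's CircuitBreaker API; a sketch of how a test could use it instead of waiting out the open-state duration (registry injection and the instance name "ocr" are assumptions about the final wiring):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.beans.factory.annotation.Autowired;

// Sketch — force circuit states directly so the half-open test stays fast.
class CircuitStateTestSupport {

    @Autowired
    CircuitBreakerRegistry circuitBreakerRegistry;

    void openCircuitForTest() {
        CircuitBreaker ocr = circuitBreakerRegistry.circuitBreaker("ocr");
        ocr.transitionToOpenState();      // simulate "OCR is down, circuit tripped"
    }

    void probeHalfOpenForTest() {
        circuitBreakerRegistry.circuitBreaker("ocr").transitionToHalfOpenState();
    }
}
```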
Recommendations

- Add wiremock-spring-boot or com.squareup.okhttp3:mockwebserver to pom.xml for HTTP-layer resilience tests. WireMock integrates cleanly with Spring Boot test slices.
- Add an application-test.yaml (or use @TestPropertySource) that overrides the resilience config to sliding-window-size: 3, failure-rate-threshold: 100 (100% failure opens the circuit), wait-duration-in-open-state: 100ms — this makes state-machine tests fast and deterministic (a sketch follows below).
- Test cases for RestClientOcrClientResilienceTest:
  - extractBlocks_retriesOnIOException_thenSucceeds — 2 failures then success → no circuit open
  - extractBlocks_opensCircuitAfterThreeConsecutive5xx — 3 failures → circuit opens → 4th call fails fast (no HTTP hit)
  - extractBlocks_doesNotRetryOn4xx — single 4xx → fallback fires, no retry
  - extractBlocks_circuitHalfOpenProbePasses_closesCircuit — open → wait → probe succeeds → circuit closes
- Add OcrAsyncRunnerTest.runSingleDocument_setsJobStatusFailed_whenCircuitOpenThrowsDomainException — verifies the catch (Exception e) branch correctly persists OcrJobStatus.FAILED.
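A sketch of the test-only override file, using the values above (src/test/resources/application-test.yaml is an assumed location; wire it in with @ActiveProfiles("test") or @TestPropertySource):

```yaml
# Test-only overrides — tiny window and near-zero open-state wait keep the
# circuit-breaker state-machine tests fast and deterministic.
resilience4j:
  circuitbreaker:
    instances:
      ocr:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 3
        failure-rate-threshold: 100
        wait-duration-in-open-state: 100ms
        permitted-number-of-calls-in-half-open-state: 1
```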
🎨 Leonie Voss — UI/UX Design Lead

Observations
- The OCR_SERVICE_UNAVAILABLE error code already exists in frontend/src/lib/shared/errors.ts (case 'OCR_SERVICE_UNAVAILABLE' is present). So the frontend error-mapping path is in place.
- The issue lists frontend/messages/{de,en,es}.json as critical files but does not provide the copy. This matters for the senior transcriber audience (60+) who will encounter this during active transcription work — the message needs to be clear, calm, and actionable, not a technical status code.
- With the existing mapping (getErrorMessage() → i18n key), the user will see whatever string is in messages/de.json for error_ocr_service_unavailable. Currently that key likely maps to a generic fallback. The circuit-breaker makes this error path more frequent and more predictable — it deserves purpose-written copy.
- When the circuit opens mid-batch, the remaining documents are marked FAILED. The admin user will see a mix of DONE and FAILED statuses with no explanation of why the failures occurred. A batch-level error message in the UI would help — but that is a separate concern from this issue.

Recommendations
- Proposed copy for error_ocr_service_unavailable (de): Der OCR-Dienst ist gerade nicht verfügbar. Bitte warte einen Moment und versuche es erneut.
- error_ocr_service_unavailable (en): The OCR service is temporarily unavailable. Please wait a moment and try again.
- error_ocr_service_unavailable (es): El servicio OCR no está disponible en este momento. Por favor, espera un momento e inténtalo de nuevo.
- Check how the error is announced (aria-live="polite") — seniors may need time to read it before it auto-dismisses. This is a pre-existing concern for all OCR error states, but the increased frequency of this specific error (the circuit opens regularly under load) makes it worth checking now.

Decision Queue — Cross-Persona Open Items
The following decisions were raised by multiple personas and need explicit human judgment before implementation begins.
Theme 1: Streaming retry semantics (Markus + Felix)
Question: Should streamBlocks be retried from the beginning if the TCP connection drops mid-stream?
Context: OcrAsyncRunner.runSingleDocument calls ocrClient.streamBlocks(...) and accumulates transcription blocks incrementally. If the stream fails mid-way and Resilience4j retries the call, streamBlocks re-submits the full PDF URL and re-processes from page 1. Without calling clearExistingBlocks before the retry, the result is duplicate transcription blocks in the database.

Options:
Option A: Add @Retry to streamBlocks and call clearExistingBlocks at the start of each retry attempt (safe, but requires a retry callback/decorator pattern, not just an annotation).
Option B: Do not add @Retry to streamBlocks at all — rely solely on @CircuitBreaker for the streaming path, and let the existing catch (Exception e) in OcrAsyncRunner handle failure by marking the job FAILED.
Option C: Add @Retry only to the initial HTTP connection attempt in streamBlocks, not the full streaming body — requires refactoring streamBlocks to separate connection from consumption.

Recommendation from team: Option B is the safest default and avoids plumbing complexity. Option A requires careful implementation (a sketch of its mechanics follows below). Option C is the cleanest but requires the most refactoring.
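If Option A is chosen anyway, the retry-callback pattern would look roughly like this: a programmatic Retry whose event publisher clears the already-persisted blocks before each re-attempt. The streamBlocks/clearExistingBlocks signatures and the service field names are hypothetical and only illustrate the shape of the plumbing.

```java
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

// Sketch of Option A — clear persisted blocks before every re-attempt so a full
// re-run of the NDJSON stream cannot leave duplicate transcription blocks behind.
void streamWithClearingRetry(String pdfUrl, long documentId) {
    RetryConfig config = RetryConfig.custom()
            .maxAttempts(3)
            .waitDuration(Duration.ofSeconds(2))
            .build();
    Retry retry = Retry.of("ocr-stream", config);

    // Fired on every re-attempt (not on the first call).
    retry.getEventPublisher().onRetry(event ->
            transcriptionBlockService.clearExistingBlocks(documentId)); // hypothetical hook

    Retry.decorateRunnable(retry, () -> ocrClient.streamBlocks(pdfUrl)) // assumed signature
         .run();
}
```

This is exactly the extra plumbing the team recommendation wants to avoid with Option B.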
Theme 2: wait-duration-in-open-state value (Markus + Tobias)

Question: Should wait-duration-in-open-state be 30s (as proposed in the issue) or 60s+ (as recommended by Tobias based on OCR cold-start time)?

Context: The OCR service's Docker container needs ~30–50s to load models on cold start. If the circuit opens due to an OOM kill, the 30s wait may expire before the service is ready, causing the half-open probe to fail and immediately re-open the circuit — making recovery slower than the current behavior.
Options:
Option A: Keep 30s as proposed in the issue.
Option B: Set 60s as a safe default that covers container restart and model reload.
Option C: Measure the actual cold-start time (via docker logs archive-ocr on restart) and set the wait to cold_start + 15s.

Recommendation from team: Option C is correct long-term. Option B is safe today. Do not use Option A.
Theme 3: Training routes — opt-in or opt-out of resilience? (Markus + Elicit)
Question: Should trainModel, segtrainModel, and trainSenderModel be wrapped by @CircuitBreaker and @Retry?

Context: Training is user-initiated, takes minutes, is non-idempotent (retrying could re-run a training job), and already surfaces failures directly to the admin user. The circuit-breaker on extractBlocks would not protect training routes unless explicitly annotated.

Options:
Option A: Leave the training methods as they are: no circuit-breaker, no retry.
Option B: Apply @CircuitBreaker only (no retry) to the training methods — fast-fail when OCR is down, but no retry storm.

Recommendation from team: Option A. Training failures during OCR outage are acceptable and self-evident. Wrapping adds complexity with no clear benefit.
Theme 4: Fate of the isHealthy() pre-flight check (Markus + Felix)

Question: Once Resilience4j is in place, should the synchronous OcrHealthClient.isHealthy() pre-flight in OcrService.startOcr() and OcrBatchService.startBatch() be removed?

Context: The pre-flight check adds ~10s latency (connect timeout) when the OCR service is down. With a circuit-breaker in place, the circuit will fail fast after opening. Keeping both results in redundant latency and a confusing dual path.
Options:
Option A: Remove the isHealthy() pre-flight and rely on the circuit-breaker alone.
Option B: Keep the pre-flight alongside the circuit-breaker.

Recommendation from team: Option A, after Resilience4j is confirmed working. Until then, keep the pre-flight as a safety net.