fix(security): add rate limiting to login and password-reset endpoints #111

Open
opened 2026-03-27 17:32:59 +01:00 by marcel · 7 comments
Owner

Problem

The /api/users/me (used for login verification) and /api/auth/forgot-password endpoints have no rate limiting. An attacker can:

  • Brute-force credentials via repeated login attempts
  • Enumerate valid email addresses via the password-reset flow (timing/response differences)
  • Flood the password-reset endpoint to spam a user's inbox

Fix

Add Resilience4j to the backend and configure a RateLimiter on the affected endpoints:

Dependency (pom.xml):

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
</dependency>

application.yml:

resilience4j.ratelimiter:
  instances:
    auth:
      limitForPeriod: 5
      limitRefreshPeriod: 15m
      timeoutDuration: 0

Controller:

@RateLimiter(name = "auth", fallbackMethod = "rateLimitFallback")
@PostMapping("/forgot-password")
public ResponseEntity<?> forgotPassword(...) { ... }

Alternatively, a simpler in-memory approach using a ConcurrentHashMap<String, AtomicInteger> keyed by IP is acceptable for a single-instance deployment.

Acceptance Criteria

  • Login: max 10 attempts per IP per 15 minutes; returns 429 on excess
  • Password reset: max 3 requests per IP per 15 minutes; returns 429 on excess
  • Legitimate users are not blocked by normal usage patterns
## Problem The `/api/users/me` (used for login verification) and `/api/auth/forgot-password` endpoints have no rate limiting. An attacker can: - Brute-force credentials via repeated login attempts - Enumerate valid email addresses via the password-reset flow (timing/response differences) - Flood the password-reset endpoint to spam a user's inbox ## Fix Add Resilience4j to the backend and configure a `RateLimiter` on the affected endpoints: **Dependency (pom.xml):** ```xml <dependency> <groupId>io.github.resilience4j</groupId> <artifactId>resilience4j-spring-boot3</artifactId> </dependency> ``` **application.yml:** ```yaml resilience4j.ratelimiter: instances: auth: limitForPeriod: 5 limitRefreshPeriod: 15m timeoutDuration: 0 ``` **Controller:** ```java @RateLimiter(name = "auth", fallbackMethod = "rateLimitFallback") @PostMapping("/forgot-password") public ResponseEntity<?> forgotPassword(...) { ... } ``` Alternatively, a simpler in-memory approach using a `ConcurrentHashMap<String, AtomicInteger>` keyed by IP is acceptable for a single-instance deployment. ## Acceptance Criteria - Login: max 10 attempts per IP per 15 minutes; returns 429 on excess - Password reset: max 3 requests per IP per 15 minutes; returns 429 on excess - Legitimate users are not blocked by normal usage patterns
marcel added the security label 2026-03-31 20:49:52 +02:00
Author
Owner

👨‍💻 Felix Brandt — Senior Fullstack Developer

Questions & Observations

  • The issue offers two implementation paths: Resilience4j @RateLimiter and a manual ConcurrentHashMap. I'd commit to Resilience4j — it's externally configurable, handles edge cases (burst tolerance, timeout), and the @RateLimiter annotation keeps the controller clean. The ConcurrentHashMap alternative has no TTL cleanup and will grow unboundedly until the process restarts.
  • The fallbackMethod for @RateLimiter must match the exact signature of the annotated method plus a RequestNotPermitted parameter. A mismatched signature silently falls through to the default exception handler — write the test to catch this.
  • Is /api/users/me really the login endpoint? The issue description says it's "used for login verification" — but login is typically a POST /api/auth/login or Spring Security's /login. Worth verifying which endpoint the brute-force attack surface actually is before configuring the rate limiter.

Suggestions

  • TDD order: write a @WebMvcTest that fires 11 POST /forgot-password requests in a loop and asserts the 11th returns 429. Make it fail first (without rate limiting), then implement.
  • Name the test: shouldReturn429WhenPasswordResetRateLimitExceeded.
  • The limitRefreshPeriod: 15m makes integration tests slow if you wait for real time. Use a test profile with a short limitRefreshPeriod: 1s so tests don't need Thread.sleep().
## 👨‍💻 Felix Brandt — Senior Fullstack Developer ### Questions & Observations - The issue offers two implementation paths: Resilience4j `@RateLimiter` and a manual `ConcurrentHashMap`. I'd commit to Resilience4j — it's externally configurable, handles edge cases (burst tolerance, timeout), and the `@RateLimiter` annotation keeps the controller clean. The `ConcurrentHashMap` alternative has no TTL cleanup and will grow unboundedly until the process restarts. - The `fallbackMethod` for `@RateLimiter` must match the exact signature of the annotated method plus a `RequestNotPermitted` parameter. A mismatched signature silently falls through to the default exception handler — write the test to catch this. - Is `/api/users/me` really the login endpoint? The issue description says it's "used for login verification" — but login is typically a `POST /api/auth/login` or Spring Security's `/login`. Worth verifying which endpoint the brute-force attack surface actually is before configuring the rate limiter. ### Suggestions - TDD order: write a `@WebMvcTest` that fires 11 `POST /forgot-password` requests in a loop and asserts the 11th returns 429. Make it fail first (without rate limiting), then implement. - Name the test: `shouldReturn429WhenPasswordResetRateLimitExceeded`. - The `limitRefreshPeriod: 15m` makes integration tests slow if you wait for real time. Use a test profile with a short `limitRefreshPeriod: 1s` so tests don't need `Thread.sleep()`.
Author
Owner

🔒 Nora "NullX" Steiner — Application Security Engineer

Questions & Observations

  • IP extraction is the tricky part. This app sits behind Caddy in production. The real client IP arrives in X-Forwarded-For. Spring Boot must be configured to trust proxy headers, otherwise request.getRemoteAddr() returns the Caddy container IP — and the rate limiter blocks Caddy, not the attacker:
    server:
      forward-headers-strategy: framework
    
  • Shared NAT risk. If the whole family is behind a single home router, all family members share one public IP. A rate limit of 10 attempts / 15 minutes is tight enough to lock out legitimate users if one person has a typo sprint. Consider a more generous limit (e.g. 20 / 15 min) or account-level lockout after N failures rather than pure IP-based limiting.
  • Username enumeration via timing. The issue mentions the password-reset endpoint can reveal valid email addresses via response differences. A rate limiter alone doesn't fix this — the forgot-password handler must return an identical response (same status, same body, same timing) whether the email exists or not. Is this already the case?
  • Resilience4j is in-memory. Fine for a single-instance VPS. If the app ever scales horizontally, the rate limiter state won't be shared between instances. Document this assumption now so it's not a surprise later.

Suggestions

  • Add a Retry-After header to the 429 response so clients (and users) know when to try again:
    response.getHeaders().set("Retry-After", "900"); // 15 minutes in seconds
    
  • The fallback method should log at WARN level with the requester's IP — this is your only visibility into active brute-force attempts before you have monitoring in place.
  • CWE-307: Improper Restriction of Excessive Authentication Attempts.
## 🔒 Nora "NullX" Steiner — Application Security Engineer ### Questions & Observations - **IP extraction is the tricky part.** This app sits behind Caddy in production. The real client IP arrives in `X-Forwarded-For`. Spring Boot must be configured to trust proxy headers, otherwise `request.getRemoteAddr()` returns the Caddy container IP — and the rate limiter blocks Caddy, not the attacker: ```yaml server: forward-headers-strategy: framework ``` - **Shared NAT risk.** If the whole family is behind a single home router, all family members share one public IP. A rate limit of 10 attempts / 15 minutes is tight enough to lock out legitimate users if one person has a typo sprint. Consider a more generous limit (e.g. 20 / 15 min) or account-level lockout after N failures rather than pure IP-based limiting. - **Username enumeration via timing.** The issue mentions the password-reset endpoint can reveal valid email addresses via response differences. A rate limiter alone doesn't fix this — the forgot-password handler must return an identical response (same status, same body, same timing) whether the email exists or not. Is this already the case? - **Resilience4j is in-memory.** Fine for a single-instance VPS. If the app ever scales horizontally, the rate limiter state won't be shared between instances. Document this assumption now so it's not a surprise later. ### Suggestions - Add a `Retry-After` header to the 429 response so clients (and users) know when to try again: ```java response.getHeaders().set("Retry-After", "900"); // 15 minutes in seconds ``` - The fallback method should log at WARN level with the requester's IP — this is your only visibility into active brute-force attempts before you have monitoring in place. - CWE-307: Improper Restriction of Excessive Authentication Attempts.
Author
Owner

🧪 Sara Holt — QA Engineer & Test Strategist

Test Strategy

Rate limiting is notoriously tricky to test because the real timer makes tests slow. The key is a configurable test profile.

Integration test — @SpringBootTest with test profile:

// application-test.yml: limitForPeriod: 3, limitRefreshPeriod: 1s
@Test
void shouldReturn429AfterRateLimitExceeded() throws Exception {
    for (int i = 0; i < 3; i++) {
        mockMvc.perform(post("/api/auth/forgot-password").content("{\"email\":\"test@example.com\"}"))
               .andExpect(status().isOk());
    }
    mockMvc.perform(post("/api/auth/forgot-password").content("{\"email\":\"test@example.com\"}"))
           .andExpect(status().isTooManyRequests())
           .andExpect(header().exists("Retry-After"));
}

@Test
void shouldResetRateLimitAfterRefreshPeriod() throws Exception {
    // exhaust the limit
    for (int i = 0; i < 3; i++) {
        mockMvc.perform(post("/api/auth/forgot-password")...);
    }
    Thread.sleep(1100); // wait for 1s refresh in test profile
    // should succeed again
    mockMvc.perform(post("/api/auth/forgot-password")...)
           .andExpect(status().isOk());
}

Edge Cases to Cover

  • Different IPs should have independent rate limit counters
  • Rate limit applies to /forgot-password and the login endpoint independently (separate counters)
  • Legitimate traffic just under the limit is never blocked

Observations

  • The Thread.sleep() is acceptable here because it's testing time-based behaviour — the test profile sets a 1s refresh to keep it fast. This is not a flaky test; it's a controlled time dependency.
## 🧪 Sara Holt — QA Engineer & Test Strategist ### Test Strategy Rate limiting is notoriously tricky to test because the real timer makes tests slow. The key is a configurable test profile. **Integration test — `@SpringBootTest` with test profile:** ```java // application-test.yml: limitForPeriod: 3, limitRefreshPeriod: 1s @Test void shouldReturn429AfterRateLimitExceeded() throws Exception { for (int i = 0; i < 3; i++) { mockMvc.perform(post("/api/auth/forgot-password").content("{\"email\":\"test@example.com\"}")) .andExpect(status().isOk()); } mockMvc.perform(post("/api/auth/forgot-password").content("{\"email\":\"test@example.com\"}")) .andExpect(status().isTooManyRequests()) .andExpect(header().exists("Retry-After")); } @Test void shouldResetRateLimitAfterRefreshPeriod() throws Exception { // exhaust the limit for (int i = 0; i < 3; i++) { mockMvc.perform(post("/api/auth/forgot-password")...); } Thread.sleep(1100); // wait for 1s refresh in test profile // should succeed again mockMvc.perform(post("/api/auth/forgot-password")...) .andExpect(status().isOk()); } ``` ### Edge Cases to Cover - Different IPs should have independent rate limit counters - Rate limit applies to `/forgot-password` and the login endpoint independently (separate counters) - Legitimate traffic just under the limit is never blocked ### Observations - The `Thread.sleep()` is acceptable here because it's testing time-based behaviour — the test profile sets a 1s refresh to keep it fast. This is not a flaky test; it's a controlled time dependency.
Author
Owner

🏗️ Markus Keller — Application Architect

Questions & Observations

  • Resilience4j over the ConcurrentHashMap alternative. The manual map has no TTL cleanup, no burst tolerance configuration, and no observability. Resilience4j is the right choice — it's externally configurable, integrates with Spring Boot's metrics (Actuator → Prometheus), and the AOP annotation keeps the controller clean. The "simpler" alternative would become a maintenance problem.
  • IP extraction. The rate limiter key must be the client IP, not the Spring request.getRemoteAddr() which returns Caddy's container IP in production. This requires server.forward-headers-strategy: framework in application.yml and trusting the X-Forwarded-For header from Caddy only (not from arbitrary clients). This is not a minor detail — without it, the rate limiter is non-functional in production.
  • Single-instance assumption is valid. For a single VPS deployment, in-memory rate limiting via Resilience4j is exactly right. No Redis needed. Document the assumption (# NOTE: in-memory rate limiter — not suitable for horizontal scaling) in the YAML config block.
  • Separate rate limiter instances per endpoint. The auth instance name in the issue's YAML applies to both login and forgot-password. Consider separate named instances (auth-login and auth-forgot-password) with different limits — 10/15min for login is different from 3/15min for password reset.

Suggestions

  • No architectural escalation needed. This fits cleanly into the existing controller layer via AOP. Ship it.
  • The fallback method returning 429 is fine — make sure it's a proper ResponseEntity<ErrorResponse> with the project's standard error body, not a raw string.
## 🏗️ Markus Keller — Application Architect ### Questions & Observations - **Resilience4j over the ConcurrentHashMap alternative.** The manual map has no TTL cleanup, no burst tolerance configuration, and no observability. Resilience4j is the right choice — it's externally configurable, integrates with Spring Boot's metrics (Actuator → Prometheus), and the AOP annotation keeps the controller clean. The "simpler" alternative would become a maintenance problem. - **IP extraction.** The rate limiter key must be the *client* IP, not the Spring `request.getRemoteAddr()` which returns Caddy's container IP in production. This requires `server.forward-headers-strategy: framework` in `application.yml` and trusting the `X-Forwarded-For` header from Caddy only (not from arbitrary clients). This is not a minor detail — without it, the rate limiter is non-functional in production. - **Single-instance assumption is valid.** For a single VPS deployment, in-memory rate limiting via Resilience4j is exactly right. No Redis needed. Document the assumption (`# NOTE: in-memory rate limiter — not suitable for horizontal scaling`) in the YAML config block. - **Separate rate limiter instances per endpoint.** The `auth` instance name in the issue's YAML applies to both login and forgot-password. Consider separate named instances (`auth-login` and `auth-forgot-password`) with different limits — 10/15min for login is different from 3/15min for password reset. ### Suggestions - No architectural escalation needed. This fits cleanly into the existing controller layer via AOP. Ship it. - The fallback method returning 429 is fine — make sure it's a proper `ResponseEntity<ErrorResponse>` with the project's standard error body, not a raw string.
Author
Owner

🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist

Questions & Observations

  • The 429 error must be user-readable, not just an HTTP status code. A family member who has mistyped their password several times should not see a raw "429 Too Many Requests" error. They need plain language: "Too many login attempts. Please wait 15 minutes before trying again."
  • The error must appear inline, not as a generic error page. The 429 response from the login form should be caught in the SvelteKit form action and displayed as a form-level error message — same pattern as invalid credentials, not a redirect to an error page.
  • Seniors specifically. A 67-year-old family member who has forgotten their password and hit the rate limit will be confused and assume the system is broken. The error message should be calming and actionable: tell them exactly what to do and when.

Suggestions

  • The Paraglide translation key for the 429 error should be something like error_rate_limit_login with a message that includes the wait time: "Zu viele Anmeldeversuche. Bitte warte 15 Minuten."
  • Consider adding a visible countdown or a "try again" time if the Retry-After header is available in the response — even just "try again after 15 minutes" is enough.
  • Test the error display at 320px to make sure it doesn't overflow or truncate.
## 🎨 Leonie Voss — UI/UX Designer & Accessibility Strategist ### Questions & Observations - **The 429 error must be user-readable, not just an HTTP status code.** A family member who has mistyped their password several times should not see a raw "429 Too Many Requests" error. They need plain language: *"Too many login attempts. Please wait 15 minutes before trying again."* - **The error must appear inline, not as a generic error page.** The 429 response from the login form should be caught in the SvelteKit form action and displayed as a form-level error message — same pattern as invalid credentials, not a redirect to an error page. - **Seniors specifically.** A 67-year-old family member who has forgotten their password and hit the rate limit will be confused and assume the system is broken. The error message should be calming and actionable: tell them exactly what to do and when. ### Suggestions - The Paraglide translation key for the 429 error should be something like `error_rate_limit_login` with a message that includes the wait time: *"Zu viele Anmeldeversuche. Bitte warte 15 Minuten."* - Consider adding a visible countdown or a "try again" time if the `Retry-After` header is available in the response — even just "try again after 15 minutes" is enough. - Test the error display at 320px to make sure it doesn't overflow or truncate.
Author
Owner

🚀 Tobias Wendt — DevOps & Platform Engineer

Questions & Observations

  • server.forward-headers-strategy: framework is a deployment requirement. Without it, HttpServletRequest.getRemoteAddr() returns the Caddy container's IP in production — every request looks like it comes from the same host, and the rate limiter will block all traffic after the first N requests. This config must land in application.yml (base) or application-prod.yml before the rate limiter goes live. It's not a code change but it's as critical as the feature itself.
  • Resilience4j metrics integrate with Actuator/Prometheus for free. Once the Prometheus scrape is configured (issue #140), rate limiter metrics (resilience4j_ratelimiter_available_permissions, resilience4j_ratelimiter_waiting_threads) are available without any extra instrumentation. Good operational visibility.
  • The application-test.yml override for a short limitRefreshPeriod (1s) during tests is a clean pattern — document it so the next developer understands why the test profile has different limits.

Suggestions

  • Add server.forward-headers-strategy: framework to application.yml as part of this PR — it's a dependency of the feature, not a separate concern.
  • In application-prod.yml, override the rate limit thresholds to match the agreed production values. The dev defaults in application.yml can be more generous to avoid friction during local development.
## 🚀 Tobias Wendt — DevOps & Platform Engineer ### Questions & Observations - **`server.forward-headers-strategy: framework` is a deployment requirement.** Without it, `HttpServletRequest.getRemoteAddr()` returns the Caddy container's IP in production — every request looks like it comes from the same host, and the rate limiter will block all traffic after the first N requests. This config must land in `application.yml` (base) or `application-prod.yml` before the rate limiter goes live. It's not a code change but it's as critical as the feature itself. - **Resilience4j metrics integrate with Actuator/Prometheus for free.** Once the Prometheus scrape is configured (issue #140), rate limiter metrics (`resilience4j_ratelimiter_available_permissions`, `resilience4j_ratelimiter_waiting_threads`) are available without any extra instrumentation. Good operational visibility. - **The `application-test.yml` override** for a short `limitRefreshPeriod` (1s) during tests is a clean pattern — document it so the next developer understands why the test profile has different limits. ### Suggestions - Add `server.forward-headers-strategy: framework` to `application.yml` as part of this PR — it's a dependency of the feature, not a separate concern. - In `application-prod.yml`, override the rate limit thresholds to match the agreed production values. The dev defaults in `application.yml` can be more generous to avoid friction during local development.
Author
Owner

Audit confirmation (2026-05-07)

Pre-prod audit confirms zero rate-limiting infrastructure in backend/src/main/java/:

$ grep -rE "RateLimiter|Bucket4j|@Bucket" backend/src/main/java
(no matches)

Suggested AC additions tied to audit findings

  • Per-IP bucket — 10 attempts / 60s on /api/auth/login, /api/auth/forgot-password, /api/auth/register.
  • Per-username bucket — separate counter so a botnet can't lock out a real user (lockout-as-DoS protection).
  • Audit hook — every failed login dispatches AuthenticationFailureBadCredentialsEventAuditService.logAfterCommit() with masked email + source IP. This closes audit finding F-14 (failed-login audit trail) in the same PR.
  • Response when limit hit: 429 + Retry-After header + RFC 9457-style ProblemDetail body.
  • Test: 11th login attempt within 60s returns 429 (not 401).
  • Test: a 401 from a bad password is recorded in audit_log with kind = LOGIN_FAILED.

Suggested impl

bucket4j-spring-boot-starter (Maven) or resilience4j-ratelimiter. The latter integrates with the same Resilience4j stack we'd use for OCR resilience (audit F-09), so prefer it if you want a single resilience library across the codebase.

Tracked in audit doc as F-05 (Critical) and F-14 (High). See docs/audits/2026-05-07-pre-prod-architectural-review.md.

## Audit confirmation (2026-05-07) Pre-prod audit confirms zero rate-limiting infrastructure in `backend/src/main/java/`: ``` $ grep -rE "RateLimiter|Bucket4j|@Bucket" backend/src/main/java (no matches) ``` ### Suggested AC additions tied to audit findings - [ ] **Per-IP bucket** — 10 attempts / 60s on `/api/auth/login`, `/api/auth/forgot-password`, `/api/auth/register`. - [ ] **Per-username bucket** — separate counter so a botnet can't lock out a real user (lockout-as-DoS protection). - [ ] **Audit hook** — every failed login dispatches `AuthenticationFailureBadCredentialsEvent` → `AuditService.logAfterCommit()` with masked email + source IP. This closes audit finding **F-14** (failed-login audit trail) in the same PR. - [ ] Response when limit hit: 429 + `Retry-After` header + `RFC 9457`-style ProblemDetail body. - [ ] Test: 11th login attempt within 60s returns 429 (not 401). - [ ] Test: a 401 from a bad password is recorded in `audit_log` with `kind = LOGIN_FAILED`. ### Suggested impl `bucket4j-spring-boot-starter` (Maven) or `resilience4j-ratelimiter`. The latter integrates with the same Resilience4j stack we'd use for OCR resilience (audit F-09), so prefer it if you want a single resilience library across the codebase. Tracked in audit doc as **F-05** (Critical) and **F-14** (High). See `docs/audits/2026-05-07-pre-prod-architectural-review.md`.
Sign in to join this conversation.
No Label security
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marcel/familienarchiv#111