feat(normalizer): parse Spanish month names + Month DD-YYYY hyphen form
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m31s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
All checks were successful
CI / Unit & Component Tests (pull_request) Successful in 3m31s
CI / OCR Service Tests (pull_request) Successful in 22s
CI / Backend Unit Tests (pull_request) Successful in 3m42s
CI / fail2ban Regex (pull_request) Successful in 45s
CI / Semgrep Security Scan (pull_request) Successful in 20s
CI / Compose Bucket Idempotency (pull_request) Successful in 1m2s
Add Spanish month names (Mexican-branch letters) to config.MONTHS and let
the month-first matcher accept a hyphen (not just a dot) before the year, so
"Mayo 18-1929"/"Junio 7-904" parse without manual overrides. Also bound
4-digit years to 1700-2100 so gross typos ("23-9003") stay in review instead
of producing a bogus year. Cuts unknown-date rate 9.2% -> 7.9%.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -48,7 +48,8 @@ def expand_year(token: str):
|
||||
return None
|
||||
n, v = len(token), int(token)
|
||||
if n == 4:
|
||||
return v
|
||||
# reject gross typos (e.g. "9003") so they go to review instead of a bogus year
|
||||
return v if 1700 <= v <= 2100 else None
|
||||
if n == 3:
|
||||
return 1000 + v
|
||||
if n == 2:
|
||||
@@ -157,8 +158,9 @@ def _match_monthname_a(s):
|
||||
return _build_day_month_year(int(m.group(1)), _lookup_month(m.group(2)), expand_year(m.group(3)))
|
||||
|
||||
|
||||
# dot after day is REQUIRED so this can't match "Mai 1895" (MONTH YYYY) as day=18
|
||||
_MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\.\s*(\d{2,4})")
|
||||
# A separator (dot OR hyphen/en-dash) after the day is REQUIRED so this can't match
|
||||
# "Mai 1895" (MONTH YYYY) as day=18; the hyphen form also covers Spanish "Mayo 18-1929".
|
||||
_MONTH_B_RE = re.compile(r"([A-Za-zÄÖÜäöü]+)\.?\s*(\d{1,2})\s*[.\-–]\s*(\d{2,4})")
|
||||
|
||||
|
||||
def _match_monthname_b(s):
|
||||
|
||||
Reference in New Issue
Block a user