fix(training): switch to PAGE XML format for kurrent recognition training

Kraken 7 removed support for the legacy `path` format (image + .gt.txt pairs) in VGSLRecognitionDataModule despite the CLI still advertising it. Switching to PAGE XML (-f page) format which is the supported standard.

- Java export now writes .xml alongside .png (PAGE XML with TextLine, Baseline at 75% height, and Unicode transcription)
- XML special characters in transcription text are escaped (& < >)
- Python trainer globs *.xml and passes -f page to ketos train
- Regenerated frontend API types to include cer/loss/accuracy/epochs on OcrTrainingRun (were missing, causing empty CER column in history)
- Updated and extended TrainingDataExportServiceTest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
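For reviewers unfamiliar with PAGE XML, the per-line file described above might look roughly like the output of the following Python sketch. It is only an illustration, not the Java export code from this commit: the function name `build_page_xml`, the single enclosing `TextRegion`, the full-image `Coords` polygon, the element ids, and the 2019-07-15 schema namespace are all assumptions. Only the `TextLine`, the `Baseline` at 75% of the image height, the `Unicode` transcription, and the escaping of `&`, `<`, `>` come from the commit message.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed PAGE schema version; the actual export may use a different one.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def build_page_xml(image_name: str, width: int, height: int, text: str) -> str:
    """Return a PAGE XML document describing one text line image (hypothetical helper)."""
    # Baseline placed at 75% of the image height, spanning the full width (assumed).
    baseline_y = int(height * 0.75)
    box = f"0,0 {width},0 {width},{height} 0,{height}"
    baseline = f"0,{baseline_y} {width},{baseline_y}"
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<PcGts xmlns="{PAGE_NS}">\n'
        f'  <Page imageFilename={quoteattr(image_name)} '
        f'imageWidth="{width}" imageHeight="{height}">\n'
        '    <TextRegion id="r1">\n'
        f'      <Coords points="{box}"/>\n'
        '      <TextLine id="l1">\n'
        f'        <Coords points="{box}"/>\n'
        f'        <Baseline points="{baseline}"/>\n'
        # escape() replaces &, < and > so the transcription stays well-formed XML.
        f'        <TextEquiv><Unicode>{escape(text)}</Unicode></TextEquiv>\n'
        '      </TextLine>\n'
        '    </TextRegion>\n'
        '  </Page>\n'
        '</PcGts>\n'
    )


if __name__ == "__main__":
    print(build_page_xml("line1.png", 200, 40, "Tom & <Jerry>"))
```

Writing one such file next to each `.png` is what lets the trainer glob `*.xml` and pass `-f page` to `ketos train`.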
2026-04-13 21:45:08 +02:00
parent 94b9c56527
commit 49c9022285
4 changed files with 140 additions and 13 deletions
--- a/ocr-service/main.py
+++ b/ocr-service/main.py
@@ -357,7 +357,7 @@ async def train_model(

            log.info("Extracted %d ZIP entries to %s", len(os.listdir(tmp_dir)), tmp_dir)

-            ground_truth = glob.glob(os.path.join(tmp_dir, "*.gt.txt"))
+            ground_truth = glob.glob(os.path.join(tmp_dir, "*.xml"))
            if not ground_truth:
                raise HTTPException(status_code=422, detail="No ground-truth files found in ZIP")

@@ -368,7 +368,7 @@ async def train_model(
            cmd = [
                "ketos", "--workers", "0", "--device", "cpu", "--threads", "2",
                "train",
-                "-f", "path",
+                "-f", "page",
                "-o", checkpoint_dir,
                "-q", "fixed",
                "-N", "10",