feat: add training resource limits (500MB size check + 30-min timeout)

- Import concurrent.futures for timeout support
- In _run_training_background: check df.memory_usage(deep=True).sum() after loading the labeled dataset; raise ValueError if > 500MB
- Wrap model.fit() in a ThreadPoolExecutor with a 1800s timeout; on TimeoutError update DB status to "failed" with message "Training timed out after 30 minutes" and return early
- Mark task 5.7 as done in openspec/changes/code-review-fix/tasks.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-18 11:31:33 +01:00
commit 3dc0014328
parent f94d16c6ab
2 changed files with 53 additions and 6 deletions
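
The two limits described in the commit message can be sketched roughly as follows. This is not the code from `services/ml/app/main.py`, which is not shown on this page. The helper names (`check_dataset_size`, `fit_with_timeout`), the `on_timeout` callback standing in for the DB status update, and the reading of "500MB" as 500 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
import concurrent.futures

# Assumption: "500MB" is read as binary megabytes; the real threshold may differ.
MAX_DATASET_BYTES = 500 * 1024 * 1024
TRAINING_TIMEOUT_SECONDS = 30 * 60  # 1800s, as stated in the commit message
TIMEOUT_MESSAGE = "Training timed out after 30 minutes"


def check_dataset_size(df, limit_bytes=MAX_DATASET_BYTES):
    """Raise ValueError if the DataFrame's in-memory size exceeds the limit.

    `df` is expected to be a pandas DataFrame; deep=True also counts the
    contents of object columns such as strings.
    """
    size = int(df.memory_usage(deep=True).sum())
    if size > limit_bytes:
        raise ValueError(
            f"Dataset is {size} bytes, which exceeds the limit of {limit_bytes} bytes"
        )
    return size


def fit_with_timeout(fit_fn, timeout_seconds, on_timeout, message=TIMEOUT_MESSAGE):
    """Run fit_fn in a worker thread and stop waiting after timeout_seconds.

    Returns True if fit_fn finished in time. On timeout, calls
    on_timeout(message) (hypothetical hook for marking the run "failed")
    and returns False. Exceptions raised by fit_fn propagate to the caller.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fit_fn)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except concurrent.futures.TimeoutError:
        on_timeout(message)
        return False
    finally:
        # wait=False so the caller is not blocked by a fit that is still running.
        executor.shutdown(wait=False)
```

One caveat with this approach: a timed-out `future.result()` only stops the caller from waiting. Python cannot forcibly stop the worker thread, so the `fit` call keeps running in the background until it returns on its own. A subprocess-based design would be needed to actually reclaim the resources.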
--- a/openspec/changes/code-review-fix/tasks.md
+++ b/openspec/changes/code-review-fix/tasks.md
@@ -47,7 +47,7 @@
 - [x] 5.4 `[sonnet]` Add date range validation (max 1 year) to `POST /predict/batch` in `services/ml/app/main.py`
 - [x] 5.5 `[sonnet]` Add candle time-sort validation/auto-sort to `POST /predict` in `services/ml/app/main.py`
 - [x] 5.6 `[sonnet]` Implement real health checks: `SELECT 1` for PostgreSQL, MLflow API ping in `services/ml/app/main.py:396-409`
-- [ ] 5.7 `[sonnet]` Add training resource limits: 500MB dataset size check, 30-minute timeout with status update on expiry in `services/ml/app/main.py:907-1030`
+- [x] 5.7 `[sonnet]` Add training resource limits: 500MB dataset size check, 30-minute timeout with status update on expiry in `services/ml/app/main.py:907-1030`
 - [ ] 5.8 `[haiku]` Add `run_id` format validation to `DELETE /training/runs/{run_id}` and `GET /training/runs/{run_id}` endpoints

 ## 6. Infrastructure & Docker