candle-annotator/openspec/specs/ml-training/spec.md at 925e7284e33f1af4d058fe210335c44a2f50a316

Marko Djordjevic 925e7284e3 Archive code-review-fix change and sync specs to main

- Synced 14 capability delta specs to main specs
- Created 6 new main specs: api-authentication, error-boundary, input-validation, security-headers, shared-types
- Updated 8 existing specs with security, validation, and performance requirements
- Archived change to openspec/changes/archive/2026-02-20-code-review-fix/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 08:54:59 +01:00

9 KiB

ADDED Requirements

Requirement: Temporal train/test splitting

The system SHALL split the labeled dataset into train, validation, and test sets using temporal ordering. Data SHALL be sorted by time; the earliest portion SHALL be used for training, the middle portion for validation, and the latest portion for testing. Split ratios are defined by stages.training.test_split and stages.training.validation_split. The system SHALL NOT shuffle financial time series data unless split_method is explicitly set to "random".

Scenario: Temporal split

WHEN test_split is 0.2, validation_split is 0.1, and the dataset has 1000 rows sorted by time
THEN the first 700 rows are training, next 100 are validation, last 200 are test
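
The temporal split scenario can be sketched as a slicing helper. This is a minimal illustration, not the service's implementation: the function name `temporal_split` and the choice to round the split sizes to whole rows are assumptions, and the input is assumed to be already sorted by time.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T], test_split: float, validation_split: float
) -> Tuple[Sequence[T], Sequence[T], Sequence[T]]:
    """Split time-sorted rows into (train, validation, test) without shuffling."""
    if not 0 <= test_split < 1 or not 0 <= validation_split < 1:
        raise ValueError("split ratios must be in [0, 1)")
    if test_split + validation_split >= 1:
        raise ValueError("test_split + validation_split must be below 1")

    n = len(rows)
    # Assumption: fractional row counts are rounded to the nearest whole row.
    n_test = round(n * test_split)
    n_val = round(n * validation_split)
    n_train = n - n_val - n_test

    # Earliest rows train, middle rows validate, latest rows test.
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, validation, test
```

With 1000 rows, test_split 0.2, and validation_split 0.1, this yields 700, 100, and 200 rows, matching the scenario.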

Scenario: Random split option

WHEN split_method is "random"
THEN the system uses standard random splitting (sklearn train_test_split) but logs a warning that this is not recommended for financial data

Requirement: Class weight balancing

The system SHALL apply class weighting to handle imbalanced pattern labels. When stages.training.class_weights is "balanced", the system SHALL compute inverse-frequency weights so rare pattern classes receive higher training weight.

Scenario: Balanced weights

WHEN class_weights is "balanced" and the dataset has 500 "O" labels and 50 "bull_flag" labels
THEN the model trains with class weights inversely proportional to class frequency
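
As a worked example of inverse-frequency weighting, the sketch below uses the formula n_samples / (n_classes * class_count). That formula is an assumption (it is the heuristic scikit-learn uses for "balanced"); the spec only requires that weights be inversely proportional to class frequency. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return one weight per class, inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    # Assumed formula: n_samples / (n_classes * count), so rare classes weigh more.
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

For 500 "O" labels and 50 "bull_flag" labels this gives 0.55 for "O" and 5.5 for "bull_flag", a 10x ratio that mirrors the 10x frequency gap. Note that RandomForestClassifier accepts class weights directly, whereas XGBClassifier would typically receive them as per-row sample weights at fit time.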

Requirement: Model training dispatch

The system SHALL train the model type specified in stages.training.model_type using the hyperparameters in stages.training.hyperparameters. Supported model types for v1: "random_forest" (scikit-learn RandomForestClassifier) and "xgboost" (XGBClassifier).

Scenario: Train XGBoost model

WHEN model_type is "xgboost" with hyperparameters n_estimators=500, max_depth=6, learning_rate=0.01
THEN the system trains an XGBClassifier with those parameters on the training set

Scenario: Train RandomForest model

WHEN model_type is "random_forest"
THEN the system trains a RandomForestClassifier with the configured hyperparameters

Scenario: Unsupported model type

WHEN model_type is a value not supported in v1 (e.g., "lstm", "transformer")
THEN the system SHALL fail with an error message listing the supported model types
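
One way to implement the dispatch and the unsupported-type error is a small factory registry. This is a sketch under assumptions: `MODEL_FACTORIES`, `build_model`, and the exact error wording are hypothetical, and the libraries are imported lazily so that only the selected model's package needs to be installed.

```python
from typing import Any, Callable, Dict


def _make_random_forest(hyperparameters: Dict[str, Any]) -> Any:
    # Imported lazily so an xgboost-only setup does not need scikit-learn here.
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**hyperparameters)


def _make_xgboost(hyperparameters: Dict[str, Any]) -> Any:
    # Imported lazily for the same reason as above.
    from xgboost import XGBClassifier

    return XGBClassifier(**hyperparameters)


# Hypothetical registry of the model types supported in v1.
MODEL_FACTORIES: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "random_forest": _make_random_forest,
    "xgboost": _make_xgboost,
}


def build_model(model_type: str, hyperparameters: Dict[str, Any]) -> Any:
    """Instantiate the configured model or fail listing the supported types."""
    try:
        factory = MODEL_FACTORIES[model_type]
    except KeyError:
        supported = ", ".join(sorted(MODEL_FACTORIES))
        raise ValueError(
            f"Unsupported model_type {model_type!r}. "
            f"Supported model types: {supported}"
        ) from None
    return factory(hyperparameters)
```

Calling `build_model("lstm", {})` raises a ValueError whose message names both "random_forest" and "xgboost".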

Requirement: MLflow experiment tracking

The system SHALL log all training runs to MLflow. Each run SHALL log: the full pipeline YAML config as an artifact, dataset version (DVC hash if available), total samples, number of classes, model type, window size, per-class sample counts, and all hyperparameters.

Scenario: Log training run

WHEN a training run starts
THEN the system creates an MLflow run under the experiment name from stages.training.mlflow.experiment_name and logs all parameters

Scenario: MLflow server unavailable

WHEN the MLflow tracking URI is unreachable
THEN the system SHALL fail with an error message indicating the MLflow server cannot be reached at the configured URI

Requirement: Training metrics logging

After training, the system SHALL evaluate the model on the test set and log metrics to MLflow: overall accuracy, macro F1, weighted F1, and per-class precision, recall, and F1 for each label.

Scenario: Log overall metrics

WHEN model evaluation completes
THEN the system logs accuracy, f1_macro, and f1_weighted to MLflow

Scenario: Log per-class metrics

WHEN model evaluation completes with labels "bull_flag", "bear_flag", and "O"
THEN the system logs precision_bull_flag, recall_bull_flag, f1_bull_flag (and same for each other label) to MLflow
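
The metric names in the two scenarios above can be illustrated with a dependency-free sketch that builds one flat dictionary of metrics. The helper name `classification_metrics` is hypothetical, and treating an undefined precision or recall (zero denominator) as 0.0 is an assumption; a real pipeline would more likely rely on scikit-learn's metric functions.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, float]:
    """Compute accuracy, macro/weighted F1, and per-class precision/recall/F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    metrics: Dict[str, float] = {}
    f1_sum = 0.0
    f1_weighted_sum = 0.0

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: undefined ratios are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        metrics[f"precision_{label}"] = precision
        metrics[f"recall_{label}"] = recall
        metrics[f"f1_{label}"] = f1
        f1_sum += f1
        f1_weighted_sum += f1 * (tp + fn)  # weight by class support

    metrics["accuracy"] = sum(1 for t, p in pairs if t == p) / len(pairs)
    metrics["f1_macro"] = f1_sum / len(labels)
    metrics["f1_weighted"] = f1_weighted_sum / len(pairs)
    return metrics
```

The resulting dictionary has the same shape that `mlflow.log_metrics` accepts, so it could be logged in a single call inside the active run.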

Requirement: Training artifact logging

When stages.training.mlflow.log_artifacts is true, the system SHALL log to MLflow: a confusion matrix plot (PNG), a feature importance plot (PNG, for tree-based models), and a classification report (text).

Scenario: Log confusion matrix

WHEN log_artifacts is true and training completes
THEN the system generates and logs a confusion matrix plot as "confusion_matrix.png" to MLflow

Scenario: Log feature importance

WHEN log_artifacts is true and the model has feature_importances_ attribute
THEN the system generates and logs a feature importance plot as "feature_importance.png" to MLflow

Requirement: Model registration

When stages.training.mlflow.register_model is true, the system SHALL register the trained model in the MLflow model registry under the name specified by stages.inference.mlflow_model_name.

Scenario: Register model

WHEN register_model is true and training completes
THEN the system registers the model in MLflow registry with the configured model name

Requirement: PostgreSQL training metadata storage

The system SHALL store training run metadata in the PostgreSQL database. Each training run record SHALL include: run_id (MLflow run ID), model_type, experiment_name, pipeline_config_hash, dataset_version, metrics summary (JSON), status, and timestamps (created_at, completed_at).

Scenario: Store training run record

WHEN a training run completes successfully
THEN the system inserts a record into the PostgreSQL training_runs table with the run metadata

Scenario: Query training history

WHEN the system queries training runs
THEN it returns records from PostgreSQL ordered by created_at descending

Scenario: Database name updated

WHEN the ML service connects to PostgreSQL
THEN it connects to the candle_annotator database (not ml_db)

Requirement: Direct annotation data access

The ML service SHALL read candle and annotation data directly from PostgreSQL instead of requiring CSV/JSON file exports. The ML service SHALL query the candles, annotations, span_annotations, and charts tables for training data.

Scenario: Query candle data for training

WHEN the ML training pipeline needs OHLC data for a chart
THEN it queries the candles table in PostgreSQL filtered by chart_id, ordered by time

Scenario: Query span annotations for labels

WHEN the ML training pipeline needs labeled spans for training
THEN it queries the span_annotations table in PostgreSQL filtered by chart_id and optionally by source
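
The two query scenarios above might translate into parameterized SQL along these lines. This is a sketch only: the helper names are hypothetical, `SELECT *` stands in for the real column list (which this spec does not define), and the `%s` placeholder style assumes a psycopg-like driver.

```python
from typing import List, Optional, Tuple


def candles_query(chart_id: object) -> Tuple[str, List[object]]:
    """Build the candle query for one chart, ordered by time."""
    sql = "SELECT * FROM candles WHERE chart_id = %s ORDER BY time"
    return sql, [chart_id]


def span_annotations_query(
    chart_id: object, source: Optional[str] = None
) -> Tuple[str, List[object]]:
    """Build the span annotation query, optionally filtered by source."""
    sql = "SELECT * FROM span_annotations WHERE chart_id = %s"
    params: List[object] = [chart_id]
    if source is not None:
        # Values are always passed as parameters, never interpolated into SQL.
        sql += " AND source = %s"
        params.append(source)
    return sql, params
```

Each helper returns a `(sql, params)` pair that can be passed to a cursor's `execute` method, keeping user-supplied values out of the SQL string.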

Scenario: No CSV/JSON export required

WHEN the ML training pipeline starts
THEN it does not require pre-exported CSV or JSON files — all data is read from PostgreSQL

Scenario: Shared database connection

WHEN the ML service reads candle/annotation data
THEN it uses the same PostgreSQL connection (same database, same credentials) as for training_runs

Requirement: Pipeline config logging

The system SHALL log the full pipeline YAML config as an MLflow artifact with each training run. This config SHALL be used during inference to replicate the exact preprocessing steps.

Scenario: Config artifact logged

WHEN a training run starts
THEN the full pipeline.yaml content is logged as "pipeline_config.yaml" artifact in the MLflow run
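
The training run record elsewhere in this spec carries a pipeline_config_hash alongside the logged config. The spec does not name a hash algorithm; the sketch below assumes SHA-256 over the UTF-8 bytes of the YAML text, and the function name is hypothetical.

```python
import hashlib


def pipeline_config_hash(config_text: str) -> str:
    """Return a hex digest identifying the exact pipeline config content."""
    # Assumption: SHA-256 of the raw YAML text; any edit changes the hash.
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()
```

The same text could then be logged under the artifact name "pipeline_config.yaml", for example with `mlflow.log_text(config_text, "pipeline_config.yaml")` inside the active run, so the hash in PostgreSQL and the artifact in MLflow refer to identical content.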

Requirement: Training resource limits

The POST /training/start endpoint SHALL enforce resource limits: the training dataset file size SHALL NOT exceed 500MB, and the training thread SHALL have a configurable timeout (default: 30 minutes). If the timeout is exceeded, the training run SHALL be marked as failed with reason "Training timed out".

Scenario: Dataset too large

WHEN the training dataset exceeds 500MB
THEN training fails immediately with { "detail": "Dataset too large. Maximum 500MB." }

Scenario: Training timeout

WHEN a training run exceeds the 30-minute timeout
THEN the training status is set to "failed" with reason "Training timed out"
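
Both limits can be sketched with the standard library. The names below are hypothetical, and reading "500MB" as 500 * 1024 * 1024 bytes is an assumption. Python cannot forcibly stop a thread, so this sketch only marks the run as failed once the timeout passes; the worker is a daemon thread and may keep running in the background.

```python
import threading
from typing import Callable, Dict

MAX_DATASET_BYTES = 500 * 1024 * 1024  # assumption: "500MB" read as 500 MiB
DEFAULT_TIMEOUT_SECONDS = 30 * 60


def check_dataset_size(size_bytes: int) -> None:
    """Reject datasets above the size limit before training starts."""
    if size_bytes > MAX_DATASET_BYTES:
        raise ValueError("Dataset too large. Maximum 500MB.")


def run_with_timeout(
    train_fn: Callable[[], None],
    timeout_seconds: float = DEFAULT_TIMEOUT_SECONDS,
) -> Dict[str, str]:
    """Run train_fn in a thread and report its status after at most the timeout."""
    lock = threading.Lock()
    status: Dict[str, str] = {"status": "running"}

    def worker() -> None:
        try:
            train_fn()
            outcome = {"status": "completed"}
        except Exception as exc:
            outcome = {"status": "failed", "reason": str(exc)}
        with lock:
            # A late finish must not overwrite a timeout that was already recorded.
            if status["status"] == "running":
                status.update(outcome)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)

    with lock:
        if status["status"] == "running":
            status.update({"status": "failed", "reason": "Training timed out"})
        return dict(status)
```

The lock ensures that a run which finishes just after the deadline still reports "failed", consistent with the timeout scenario.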

Requirement: run_id validation on training endpoints

The FastAPI training endpoints (DELETE /training/runs/{run_id}, GET /training/runs/{run_id}) SHALL validate that run_id matches /^[a-zA-Z0-9_-]+$/ before any database or file operation.

Scenario: Valid run_id

WHEN DELETE /training/runs/run-2024-01-15_v3 is called
THEN the request proceeds normally

Scenario: Invalid run_id

WHEN DELETE /training/runs/../../admin is called
THEN the endpoint returns HTTP 400 with { "detail": "Invalid run_id format" }
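
A minimal sketch of the validation rule follows; the helper name `is_valid_run_id` is hypothetical. It uses `fullmatch` rather than `^...$` because in Python `$` also matches just before a trailing newline, which would let a value such as "abc\n" through.

```python
import re

# Same character class as the spec's pattern; fullmatch anchors both ends.
RUN_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_run_id(run_id: str) -> bool:
    """Return True only if run_id consists solely of the allowed characters."""
    return RUN_ID_PATTERN.fullmatch(run_id) is not None
```

In a FastAPI handler, a False result would typically be turned into `HTTPException(status_code=400, detail="Invalid run_id format")` before any database or file access takes place.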

Requirement: Environment variable configuration (credentials)

The project SHALL use environment variables for runtime configuration. Credentials SHALL NOT be hardcoded in any committed file.

Scenario: .env file gitignored

WHEN .gitignore is inspected
THEN it includes .env (bare, not just .env*.local)

Scenario: .env not tracked by git

WHEN git ls-files .env is run
THEN .env is NOT tracked by git

Scenario: .env.example has placeholder credentials

WHEN .env.example is inspected
THEN it contains POSTGRES_PASSWORD=change_me_to_a_strong_password (not a real password)

Scenario: No credentials in Python source

WHEN services/ml/app/db.py is inspected
THEN there are no SQL comments containing usernames or passwords, and the code fails fast if DATABASE_URL env var is not set
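
The fail-fast behaviour for DATABASE_URL could look like the sketch below. The function name, the error wording, and the optional `env` parameter (added here only to make the check easy to test) are assumptions, not the contents of services/ml/app/db.py.

```python
import os
from typing import Mapping, Optional


def get_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DATABASE_URL or raise immediately if it is missing or blank."""
    source = os.environ if env is None else env
    url = source.get("DATABASE_URL", "").strip()
    if not url:
        # No hardcoded fallback: a missing variable is a configuration error.
        raise RuntimeError("DATABASE_URL environment variable is not set")
    return url
```

Calling this once at startup surfaces a missing variable as a clear error instead of a later connection failure, and avoids any default credentials in source.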