## ADDED Requirements

### Requirement: Temporal train/test splitting
The system SHALL split the labeled dataset into train, validation, and test sets using temporal ordering. The data SHALL be sorted by time; the first portion is the training set, the middle portion is the validation set, and the last portion is the test set. Split ratios are defined by `stages.training.test_split` and `stages.training.validation_split`. By default, the system SHALL NOT shuffle financial time series data.

#### Scenario: Temporal split
- **WHEN** test_split is 0.2, validation_split is 0.1, and the dataset has 1000 rows sorted by time
- **THEN** the first 700 rows are training, the next 100 are validation, and the last 200 are test

#### Scenario: Random split option
- **WHEN** split_method is "random"
- **THEN** the system uses standard random splitting (scikit-learn `train_test_split`) but logs a warning that this is not recommended for financial data

### Requirement: Class weight balancing
The system SHALL apply class weighting to handle imbalanced pattern labels. When `stages.training.class_weights` is "balanced", the system SHALL compute inverse-frequency weights so that rare pattern classes receive higher training weight.

#### Scenario: Balanced weights
- **WHEN** class_weights is "balanced" and the dataset has 500 "O" labels and 50 "bull_flag" labels
- **THEN** the model trains with class weights inversely proportional to class frequency, so "bull_flag" is weighted 10 times as heavily as "O"

### Requirement: Model training dispatch
The system SHALL train the model type specified in `stages.training.model_type` using the hyperparameters in `stages.training.hyperparameters`. Supported model types for v1: "random_forest" (scikit-learn RandomForestClassifier) and "xgboost" (XGBClassifier).
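
The dispatch requirement above can be illustrated with a small registry. This is a minimal sketch, not the specified implementation: the names `build_model` and `SUPPORTED_MODELS` are hypothetical, and the choice of `ValueError` for unsupported types is an assumption (the spec only requires an error message that lists the supported types).

```python
from typing import Any, Callable, Dict


def _build_random_forest(hyperparameters: Dict[str, Any]):
    # Imported lazily so the dispatcher itself does not require scikit-learn.
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**hyperparameters)


def _build_xgboost(hyperparameters: Dict[str, Any]):
    # Imported lazily so xgboost is only needed when it is actually selected.
    from xgboost import XGBClassifier

    return XGBClassifier(**hyperparameters)


# Hypothetical registry: maps each v1 model_type value to a builder function.
SUPPORTED_MODELS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "random_forest": _build_random_forest,
    "xgboost": _build_xgboost,
}


def build_model(model_type: str, hyperparameters: Dict[str, Any]):
    """Return an untrained estimator for model_type, or fail listing supported types."""
    try:
        builder = SUPPORTED_MODELS[model_type]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(
            f"Unsupported model_type {model_type!r}; supported types: {supported}"
        ) from None
    return builder(hyperparameters)
```

Under these assumptions, `build_model("xgboost", {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.01})` would return an unfitted `XGBClassifier`, while `build_model("lstm", {})` would raise an error naming "random_forest" and "xgboost".
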
#### Scenario: Train XGBoost model
- **WHEN** model_type is "xgboost" with hyperparameters n_estimators=500, max_depth=6, learning_rate=0.01
- **THEN** the system trains an XGBClassifier with those parameters on the training set

#### Scenario: Train RandomForest model
- **WHEN** model_type is "random_forest"
- **THEN** the system trains a RandomForestClassifier with the configured hyperparameters

#### Scenario: Unsupported model type
- **WHEN** model_type is a value not supported in v1 (e.g., "lstm", "transformer")
- **THEN** the system SHALL fail with an error message listing the supported model types

### Requirement: MLflow experiment tracking
The system SHALL log all training runs to MLflow. Each run SHALL log: the full pipeline YAML config as an artifact, dataset version (DVC hash if available), total samples, number of classes, model type, window size, per-class sample counts, and all hyperparameters.

#### Scenario: Log training run
- **WHEN** a training run starts
- **THEN** the system creates an MLflow run under the experiment name from `stages.training.mlflow.experiment_name` and logs all parameters

#### Scenario: MLflow server unavailable
- **WHEN** the MLflow tracking URI is unreachable
- **THEN** the system SHALL fail with an error message indicating the MLflow server cannot be reached at the configured URI

### Requirement: Training metrics logging
After training, the system SHALL evaluate the model on the test set and log metrics to MLflow: overall accuracy, macro F1, weighted F1, and per-class precision, recall, and F1 for each label.
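
To make the metric definitions concrete, the sketch below computes the metrics named in this requirement in plain Python. It is a dependency-free stand-in for illustration only; a real implementation would more likely rely on scikit-learn's metrics functions. The function name `compute_metrics` is hypothetical, and the key naming (`precision_<label>`, `recall_<label>`, `f1_<label>`, `f1_macro`, `f1_weighted`) follows the metric names used in this spec's scenarios.

```python
from collections import Counter
from typing import Dict, Sequence


def compute_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy, macro/weighted F1, and per-class precision/recall/F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)    # true samples per label
    predicted = Counter(y_pred)  # predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    metrics: Dict[str, float] = {"accuracy": sum(correct.values()) / len(y_true)}
    f1_by_label: Dict[str, float] = {}
    for label in labels:
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[f"precision_{label}"] = precision
        metrics[f"recall_{label}"] = recall
        metrics[f"f1_{label}"] = f1
        f1_by_label[label] = f1

    # Macro F1: unweighted mean over labels; weighted F1: mean weighted by support.
    metrics["f1_macro"] = sum(f1_by_label.values()) / len(labels)
    metrics["f1_weighted"] = (
        sum(f1_by_label[label] * support[label] for label in labels) / len(y_true)
    )
    return metrics
```

The resulting flat dictionary of floats is the shape that MLflow's `mlflow.log_metrics` accepts, so it could be passed to that call inside an active run.
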
#### Scenario: Log overall metrics
- **WHEN** model evaluation completes
- **THEN** the system logs accuracy, f1_macro, and f1_weighted to MLflow

#### Scenario: Log per-class metrics
- **WHEN** model evaluation completes with labels "bull_flag", "bear_flag", and "O"
- **THEN** the system logs precision_bull_flag, recall_bull_flag, f1_bull_flag (and the same for each other label) to MLflow

### Requirement: Training artifact logging
When `stages.training.mlflow.log_artifacts` is true, the system SHALL log to MLflow: a confusion matrix plot (PNG), a feature importance plot (PNG, for tree-based models), and a classification report (text).

#### Scenario: Log confusion matrix
- **WHEN** log_artifacts is true and training completes
- **THEN** the system generates and logs a confusion matrix plot as "confusion_matrix.png" to MLflow

#### Scenario: Log feature importance
- **WHEN** log_artifacts is true and the model has a `feature_importances_` attribute
- **THEN** the system generates and logs a feature importance plot as "feature_importance.png" to MLflow

### Requirement: Model registration
When `stages.training.mlflow.register_model` is true, the system SHALL register the trained model in the MLflow model registry under the name specified by `stages.inference.mlflow_model_name`.

#### Scenario: Register model
- **WHEN** register_model is true and training completes
- **THEN** the system registers the model in the MLflow registry with the configured model name

### Requirement: PostgreSQL training metadata storage
The system SHALL store training run metadata in the PostgreSQL database. Each training run record SHALL include: run_id (MLflow run ID), model_type, experiment_name, pipeline_config_hash, dataset_version, metrics summary (JSON), status, and timestamps (created_at, completed_at).
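
One way to picture the record described above is as a small data class plus a parameterized insert statement. This is a sketch under stated assumptions: the class `TrainingRunRecord`, the helper `build_insert`, the column name `metrics`, and the field types are all hypothetical, since the spec names the fields but not their types or the exact column names. The `%s` placeholders follow the parameter style used by the psycopg drivers.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Tuple


@dataclass
class TrainingRunRecord:
    """Hypothetical in-memory form of one training run row."""
    run_id: str                      # MLflow run ID
    model_type: str
    experiment_name: str
    pipeline_config_hash: str
    dataset_version: Optional[str]   # e.g. a DVC hash, if available
    metrics: Dict[str, float]        # stored as a JSON summary
    status: str
    created_at: datetime
    completed_at: Optional[datetime] = None


# Assumed column names, in insert order.
COLUMNS = (
    "run_id", "model_type", "experiment_name", "pipeline_config_hash",
    "dataset_version", "metrics", "status", "created_at", "completed_at",
)


def build_insert(record: TrainingRunRecord) -> Tuple[str, Tuple[Any, ...]]:
    """Return a parameterized INSERT statement and its parameter tuple."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO training_runs ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    params = (
        record.run_id,
        record.model_type,
        record.experiment_name,
        record.pipeline_config_hash,
        record.dataset_version,
        json.dumps(record.metrics, sort_keys=True),  # metrics summary as JSON text
        record.status,
        record.created_at,
        record.completed_at,
    )
    return sql, params


def utc_now() -> datetime:
    """Timezone-aware timestamp for created_at / completed_at."""
    return datetime.now(timezone.utc)
```

Keeping values out of the SQL string and in a separate parameter tuple lets the database driver handle quoting, which matters here because fields such as experiment_name come from user-editable configuration.
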
#### Scenario: Store training run record
- **WHEN** a training run completes successfully
- **THEN** the system inserts a record into the PostgreSQL `training_runs` table with the run metadata

#### Scenario: Query training history
- **WHEN** the system queries training runs
- **THEN** it returns records from PostgreSQL ordered by created_at descending

### Requirement: Pipeline config logging
The system SHALL log the full pipeline YAML config as an MLflow artifact with each training run. This config SHALL be used during inference to replicate the exact preprocessing steps.

#### Scenario: Config artifact logged
- **WHEN** a training run starts
- **THEN** the full pipeline.yaml content is logged as "pipeline_config.yaml" artifact in the MLflow run
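
As an illustration of the config logging requirement, the sketch below stages the config text under the artifact name from the scenario and derives a content hash from it. The helper name `stage_config_artifact` is hypothetical, and using SHA-256 for `pipeline_config_hash` is an assumption; the spec does not name a hash algorithm.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def stage_config_artifact(config_text: str, out_dir: Path) -> Tuple[Path, str]:
    """Write the config as pipeline_config.yaml and return (path, content hash)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifact_path = out_dir / "pipeline_config.yaml"
    artifact_path.write_text(config_text, encoding="utf-8")

    # Assumption: SHA-256 of the exact config text serves as pipeline_config_hash.
    config_hash = hashlib.sha256(config_text.encode("utf-8")).hexdigest()
    return artifact_path, config_hash
```

Inside an active MLflow run, the staged file could then be uploaded with `mlflow.log_artifact(str(artifact_path))`. Hashing the same text that is written to the artifact means the hash stored with the training run record identifies exactly the config that inference would later load.
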