docs: add ML pipeline quickstart guide for training first model

Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop
2026-02-15 19:08:09 +01:00
commit 228f70daf3
parent 21f184aa8d
1 changed file with 299 additions and 0 deletions
--- a/ML_QUICKSTART.md
+++ b/ML_QUICKSTART.md
@@ -0,0 +1,299 @@
 # ML Pipeline Quickstart Guide
 This guide walks you through training your first model from scratch.
 ## Prerequisites
 Make sure all services are running:
 ```bash
 docker-compose up -d
 ```
 This starts:
 - **candle-annotator** (Next.js app) - http://localhost:3000
 - **ml-service** (FastAPI) - http://localhost:8001
 - **mlflow** (MLflow UI) - http://localhost:5000
 - **postgres** (Database)
 ## Step 1: Create Training Data (Annotate Patterns)
 You need at least 20-30 annotated patterns for initial training.
 ### 1.1 Upload Candle Data
 1. Open http://localhost:3000
 2. Click "Choose CSV File" and upload your OHLCV data
 3. Verify the chart displays correctly
 ### 1.2 Annotate Patterns
 1. Click on "Manage Span Label Types" in the sidebar
 2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
 3. Return to main page
 4. Select a label type from the span tools
 5. Click and drag on the chart to create span annotations
 6. Annotate at least 20-30 patterns (more is better)
 **Tip:** Aim for diverse patterns across different market conditions.
 ## Step 2: Export Annotations
 ### 2.1 Export via API
 ```bash
 # Export span annotations in ML pipeline format
 curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
 ```
 ### 2.2 Verify Export
 ```bash
 # Check the exported file
 jq '.' services/ml/data/annotations/export.json
 ```
 You should see JSON with your annotations:
 ```json
 {
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
 }
 ```
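Beyond eyeballing the `jq` output, a short script can check every annotation at once. The following is a minimal sketch, assuming the export always has the shape shown above; the function names `validate_export` and `validate_export_file` are hypothetical and not part of the project.

```python
import json
from pathlib import Path

# Keys taken from the example export above; the real exporter may emit more.
REQUIRED_KEYS = {"start_time", "end_time", "label", "confidence", "source"}


def validate_export(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the export looks sane."""
    annotations = payload.get("annotations")
    if not isinstance(annotations, list):
        return ["top-level 'annotations' key is missing or not a list"]

    problems = []
    for i, ann in enumerate(annotations):
        if not isinstance(ann, dict):
            problems.append(f"annotation {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(ann)
        if missing:
            problems.append(f"annotation {i}: missing keys {sorted(missing)}")
            continue
        if ann["start_time"] > ann["end_time"]:
            problems.append(f"annotation {i}: start_time is after end_time")
        if not 0.0 <= ann["confidence"] <= 1.0:
            problems.append(f"annotation {i}: confidence outside [0, 1]")
    return problems


def validate_export_file(path: str) -> list[str]:
    """Load the exported JSON from disk and validate it."""
    return validate_export(json.loads(Path(path).read_text()))
```

Calling `validate_export_file("services/ml/data/annotations/export.json")` from the repository root should return an empty list for a clean export.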
 ## Step 3: Prepare Raw OHLCV Data
 Export your candle data into the ML pipeline's raw data directory:
 ```bash
 # Export candles from the database
 # (-T disables pseudo-TTY allocation so the redirected CSV stays free of terminal line endings)
 docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
 ```
 Or manually copy your CSV file:
 ```bash
 cp your_data.csv services/ml/data/raw/OHLCV.csv
 ```
 **Format required:**
 ```csv
 time,open,high,low,close,volume
 1700000000,1.0500,1.0520,1.0490,1.0510,1000
 1700000060,1.0510,1.0530,1.0505,1.0525,1200
 ```
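A malformed CSV tends to surface only as a confusing error deep in feature engineering, so it can help to check the file first. The sketch below is one possible check, assuming `time` is an integer Unix timestamp as in the sample rows; `check_ohlcv` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io

EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]


def check_ohlcv(csv_text: str) -> list[str]:
    """Return a list of problems found in OHLCV CSV text; empty means OK."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]

    problems = []
    previous_time = None
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 6 columns, got {len(row)}")
            continue
        try:
            time = int(row[0])
            open_, high, low, close, _volume = map(float, row[1:])
        except ValueError:
            problems.append(f"line {line_no}: non-numeric value")
            continue
        # The high and low must bracket both the open and the close.
        if high < max(open_, close) or low > min(open_, close):
            problems.append(f"line {line_no}: high/low do not bracket open/close")
        # Rows are expected in ascending time order without duplicates.
        if previous_time is not None and time <= previous_time:
            problems.append(f"line {line_no}: time is not strictly increasing")
        previous_time = time
    return problems
```

To check the real file, pass its contents: `check_ohlcv(open("services/ml/data/raw/OHLCV.csv").read())`.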
 ## Step 4: Run the ML Pipeline
 ### 4.1 Enter ML Service Container
 ```bash
 docker-compose exec ml-service bash
 ```
 ### 4.2 Run Full Pipeline
 ```bash
 # Run all stages: feature engineering → annotation ingestion → training
 python pipeline.py --config config/pipeline.yaml
 ```
 This will:
 1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
 2. **Annotation Ingestion** - Convert annotations to labeled dataset
 3. **Training** - Train RandomForest model and log to MLflow
 **Expected output:**
 ```
 [INFO] Starting pipeline...
 [INFO] Stage: feature_engineering
 [INFO] Computing TA-Lib indicators...
 [INFO] Computed 42 features
 [INFO] Stage: annotation_ingestion
 [INFO] Loaded 25 human annotations
 [INFO] Created 25 training samples
 [INFO] Stage: training
 [INFO] Training RandomForest with 200 estimators
 [INFO] Training complete. F1 macro: 0.78
 [INFO] Model saved to models/best_model.pkl
 ```
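The log above reports 25 annotations turning into 25 training samples, which suggests one sample per annotated span. How the real ingestion stage builds each sample is not documented here; the sketch below is only an illustration that assumes features are averaged over the candles inside each span. `build_samples` and the row layout are hypothetical.

```python
from statistics import mean


def build_samples(feature_rows, annotations):
    """Build one training sample per annotation.

    feature_rows: list of dicts with a 'time' key plus numeric feature columns.
    annotations:  list of dicts with 'start_time', 'end_time' and 'label'.
    Averaging over the span is an assumption made for illustration only.
    """
    samples = []
    for ann in annotations:
        # Collect the candles whose timestamp falls inside the annotated span.
        window = [
            row for row in feature_rows
            if ann["start_time"] <= row["time"] <= ann["end_time"]
        ]
        if not window:
            continue  # the annotation lies outside the available candle data
        feature_names = [name for name in window[0] if name != "time"]
        features = {
            name: mean(row[name] for row in window) for name in feature_names
        }
        samples.append({"features": features, "label": ann["label"]})
    return samples
```

An annotation that covers no candles is skipped in this sketch, which is one way the sample count could end up lower than the annotation count.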
 ### 4.3 Verify Model Created
 ```bash
 ls -lh models/best_model.pkl
 ```
 ## Step 5: Configure Inference Service
 ### 5.1 Check Model Path
 The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).
 ### 5.2 Restart Inference Service
 Exit the container and restart the ml-service:
 ```bash
 exit
 docker-compose restart ml-service
 ```
 ### 5.3 Verify Model Loaded
 ```bash
 # Check health
 curl http://localhost:8001/health
 # Get model info
 curl http://localhost:8001/model/info | jq '.'
 ```
 You should see:
 ```json
 {
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
 }
 ```
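To confirm that the served model knows every label you annotated, the response can be compared against your own label list. This is a minimal sketch that assumes the response shape shown above; `fetch_model_info` and `summarize_model_info` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_model_info(base_url: str = "http://localhost:8001") -> dict:
    """Fetch /model/info from the ml-service (requires the service to be up)."""
    with urlopen(f"{base_url}/model/info", timeout=5) as response:
        return json.load(response)


def summarize_model_info(payload: dict, expected_labels: set[str]) -> dict:
    """Compare the served model's labels with the labels you annotated."""
    info = payload.get("model_info", {})
    metrics = payload.get("metrics", {})
    served = set(info.get("label_names", []))
    return {
        "model_type": info.get("model_type"),
        "feature_count": info.get("feature_count"),
        "f1_macro": metrics.get("f1_macro"),
        # Labels you annotated that the model was not trained on.
        "missing_labels": sorted(expected_labels - served),
        # Labels the model serves that you did not expect.
        "unexpected_labels": sorted(served - expected_labels),
    }
```

For example, `summarize_model_info(fetch_model_info(), {"Doji", "Hammer"})` reports any label mismatch after a retrain.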
 ## Step 6: Get Predictions in UI
 1. Open http://localhost:3000
 2. Scroll down in the left sidebar to the **Predictions** section
 3. Toggle "Show" to enable predictions
 4. Click "Run on Visible" to predict patterns on visible candles
 5. Predictions appear as colored histogram overlays on the chart
 ### Prediction Controls
 - **Confidence Threshold** - Slider to filter low-confidence predictions
 - **Filter by Label** - Checkboxes to show/hide specific patterns
 - **Prediction Summary** - Shows agreement/disagreement with human annotations
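
The first two controls amount to a simple filter over the prediction list. The sketch below shows that logic under the assumption that each prediction is a dict with `label` and `confidence` keys; the UI's actual data shape and the name `filter_predictions` are not taken from the project.

```python
def filter_predictions(predictions, threshold=0.5, visible_labels=None):
    """Keep predictions at or above the threshold whose label is enabled.

    predictions:    list of dicts with 'label' and 'confidence' (assumed shape).
    threshold:      minimum confidence, mirroring the Confidence Threshold slider.
    visible_labels: set of enabled labels, or None to show every label.
    """
    kept = []
    for pred in predictions:
        if pred["confidence"] < threshold:
            continue  # filtered out by the confidence slider
        if visible_labels is not None and pred["label"] not in visible_labels:
            continue  # label checkbox is unchecked
        kept.append(pred)
    return kept
```

Raising the threshold trades recall for precision: fewer overlays appear, but the ones that remain are those the model is most sure about.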
 ## Troubleshooting
 ### "No module named 'talib'"
 The TA-Lib C library is not installed in the container. Rebuild the image:
 ```bash
 docker-compose build --no-cache ml-service
 docker-compose up -d ml-service
 ```
 ### "Model file not found"
 The pipeline did not produce a model file. Check the ml-service logs for the stage that failed:
 ```bash
 docker-compose logs ml-service
 ```
 Make sure you have enough annotations: 10-20 is the bare minimum, and Step 1 recommends at least 20-30.
 ### "Not enough data for training"
 You need more annotated spans. Go back to Step 1 and add more annotations.
 ### "MLflow connection refused"
 The MLflow service is not running. Start it, then restart ml-service:
 ```bash
 docker-compose up -d mlflow
 docker-compose restart ml-service
 ```
 ### Predictions are all wrong
 This usually means the training data is too small or not diverse enough. Add annotations for:
 - Different pattern types
 - Various market conditions (uptrend, downtrend, sideways)
 - Different timeframes
 Then retrain:
 ```bash
 docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
 docker-compose restart ml-service
 ```
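Before retraining, it can be worth checking whether some labels are badly underrepresented in the export. This is a small sketch using the export format from Step 2; `label_balance` and the `min_per_label` cutoff of 10 are illustrative assumptions, not pipeline settings.

```python
from collections import Counter


def label_balance(annotations, min_per_label=10):
    """Count annotations per label and list labels below the cutoff.

    annotations:   list of dicts with a 'label' key (export format from Step 2).
    min_per_label: hypothetical cutoff below which a label counts as sparse.
    Returns (counts, sparse_labels).
    """
    counts = Counter(ann["label"] for ann in annotations)
    sparse = sorted(label for label, n in counts.items() if n < min_per_label)
    return dict(counts), sparse
```

Labels returned in the second element are good candidates for the next annotation session.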
 ## View Training Experiments
 Open MLflow UI at http://localhost:5000 to:
 - View all training runs
 - Compare model metrics
 - Download artifacts (confusion matrix, feature importance)
 - Register models
 ## Next Steps
 ### Improve Model Performance
 1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
 2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
 3. **Try XGBoost** - Set `model_type: "xgboost"` in `config/pipeline.yaml`
 4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`
 ### Use Programmatic Labels
 To auto-generate labels with TA-Lib pattern detection, edit `config/pipeline.yaml`:
 ```yaml
 programmatic_labels:
  enabled: true  # Set to true
 ```
 This adds labels from TA-Lib CDL* functions alongside your human annotations.
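TA-Lib's pattern functions (for example `talib.CDLDOJI(open, high, low, close)`) return one integer per candle, where a nonzero value marks a detected pattern. The sketch below shows how such a signal could be turned into annotations in the export format from Step 2. It is an illustration only: the `"programmatic"` source value, the fixed confidence of 1.0, the span length, and the name `signal_to_annotations` are all assumptions, and the pipeline's own conversion may differ.

```python
def signal_to_annotations(times, signal, label, span_candles=1, candle_seconds=60):
    """Convert a CDL-style integer signal into span annotations.

    times:          candle timestamps (Unix seconds), one per candle.
    signal:         integers, nonzero where the pattern was detected.
    label:          label name to attach to every detection.
    span_candles:   how many candles the pattern covers (assumption).
    candle_seconds: candle duration in seconds (assumption, 1-minute data).
    """
    annotations = []
    for t, value in zip(times, signal):
        if value == 0:
            continue  # no pattern detected on this candle
        annotations.append({
            # The detection lands on the candle that completes the pattern.
            "start_time": t - (span_candles - 1) * candle_seconds,
            "end_time": t,
            "label": label,
            "confidence": 1.0,          # assumed value
            "source": "programmatic",   # assumed value; the guide only shows "human"
        })
    return annotations
```

Keeping a distinct `source` value makes it possible to weigh or filter machine-generated labels separately from human ones later.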
 ### Active Learning Loop
 1. Get predictions on new data
 2. Review disagreements (model missed patterns you saw, or vice versa)
 3. Correct predictions by adding new annotations
 4. Re-export annotations
 5. Retrain model
 6. Repeat
 The model improves iteratively through this feedback cycle.
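Step 2 of the loop, reviewing disagreements, can be approximated offline by matching spans on label and time overlap. The following sketch assumes predictions use the same `start_time`/`end_time`/`label` shape as annotations, which this guide does not confirm; `find_disagreements` is a hypothetical helper.

```python
def overlaps(a, b):
    """True when two spans share at least one point in time."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def find_disagreements(annotations, predictions):
    """Split disagreements into patterns the model missed and spurious ones.

    A span counts as matched when the other list contains a span with the
    same label that overlaps it in time.
    """
    def matched(item, others):
        return any(
            other["label"] == item["label"] and overlaps(item, other)
            for other in others
        )

    return {
        # Human annotations with no matching prediction.
        "missed": [a for a in annotations if not matched(a, predictions)],
        # Predictions with no matching human annotation.
        "spurious": [p for p in predictions if not matched(p, annotations)],
    }
```

Items under `missed` point to patterns worth annotating more of, while `spurious` items are candidates for review: either the model is wrong or the pattern was never annotated.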
 ## Quick Reference
 ```bash
 # Export annotations
 curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
 # Export candle data (-T disables pseudo-TTY allocation so the redirected CSV stays clean)
 docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
 # Train model
 docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
 # Restart inference service
 docker-compose restart ml-service
 # Check model loaded
 curl http://localhost:8001/model/info
 # View MLflow UI (macOS; use xdg-open on Linux)
 open http://localhost:5000
 ```