candle-annotator/ML_QUICKSTART.md
Marko Djordjevic 228f70daf3 docs: add ML pipeline quickstart guide for training first model
Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop
2026-02-15 19:08:09 +01:00


ML Pipeline Quickstart Guide

This guide walks you through training your first model from scratch.

Prerequisites

Make sure all services are running:

docker-compose up -d

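This starts the services used in this guide, including the annotation UI (candle-annotator, http://localhost:3000), the inference service (ml-service, http://localhost:8001), and MLflow (http://localhost:5000).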

Step 1: Create Training Data (Annotate Patterns)

You need at least 20-30 annotated patterns for initial training.

1.1 Upload Candle Data

  1. Open http://localhost:3000
  2. Click "Choose CSV File" and upload your OHLCV data
  3. Verify the chart displays correctly

1.2 Annotate Patterns

  1. Click on "Manage Span Label Types" in the sidebar
  2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
  3. Return to main page
  4. Select a label type from the span tools
  5. Click and drag on the chart to create span annotations
  6. Annotate at least 20-30 patterns (more is better)

Tip: Aim for diverse patterns across different market conditions.

Step 2: Export Annotations

2.1 Export via API

# Export span annotations in ML pipeline format
# (-f makes curl exit with an error on HTTP failures instead of saving an error page)
curl -fsS http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

2.2 Verify Export

# Pretty-print the exported file
jq '.' services/ml/data/annotations/export.json

You should see JSON with your annotations:

{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
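
Before moving on, it can help to count how many annotations each label received. The following is a minimal sketch, assuming the export has exactly the shape shown above; the helper names and the `MIN_ANNOTATIONS` threshold are illustrative and not part of the project.

```python
import json
from collections import Counter
from pathlib import Path

# Assumption: mirrors the lower bound of the 20-30 guideline from Step 1.
MIN_ANNOTATIONS = 20


def load_export(path="services/ml/data/annotations/export.json"):
    """Read the exported JSON file written in Step 2.1."""
    return json.loads(Path(path).read_text())


def summarize_export(export):
    """Count annotations per label, rejecting spans that end before they start."""
    counts = Counter()
    for ann in export.get("annotations", []):
        if ann["end_time"] < ann["start_time"]:
            raise ValueError(f"span ends before it starts: {ann}")
        counts[ann["label"]] += 1
    return counts


def has_enough(counts, minimum=MIN_ANNOTATIONS):
    """True when the total number of annotations reaches the guideline."""
    return sum(counts.values()) >= minimum
```

Calling `summarize_export(load_export()).most_common()` from the repository root lists labels from most to least annotated, which makes under-represented patterns easy to spot.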

Step 3: Prepare Raw OHLCV Data

Copy your candle data to the ML pipeline:

# Export candles from the database
# (-T disables pseudo-TTY allocation so the redirected CSV is not altered by terminal line endings)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

Or manually copy your CSV file:

cp your_data.csv services/ml/data/raw/OHLCV.csv

Required format (header row, then one candle per line, ordered by time):

time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
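
A quick sanity check on the CSV can catch format problems before the pipeline runs. This is a sketch only: it assumes `time` is an integer Unix timestamp as in the sample rows, and the function name `validate_ohlcv` is hypothetical rather than something the pipeline provides.

```python
import csv
import io

EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]


def validate_ohlcv(text):
    """Return a list of human-readable problems found in OHLCV CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]

    problems = []
    prev_time = None
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 6 columns, got {len(row)}")
            continue
        try:
            t = int(row[0])  # assumption: integer Unix timestamps
            o, h, l, c, _volume = (float(x) for x in row[1:])
        except ValueError:
            problems.append(f"line {line_no}: non-numeric value")
            continue
        if prev_time is not None and t <= prev_time:
            problems.append(f"line {line_no}: time is not strictly increasing")
        if h < max(o, c) or l > min(o, c):
            problems.append(f"line {line_no}: high/low inconsistent with open/close")
        prev_time = t
    return problems
```

To check the real file, pass its contents, for example `validate_ohlcv(open("services/ml/data/raw/OHLCV.csv").read())`; an empty list means none of these checks failed.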

Step 4: Run the ML Pipeline

4.1 Enter ML Service Container

docker-compose exec ml-service bash

4.2 Run Full Pipeline

# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml

This will:

  1. Feature Engineering - Compute TA-Lib indicators (RSI, MACD, etc.)
  2. Annotation Ingestion - Convert annotations to labeled dataset
  3. Training - Train RandomForest model and log to MLflow

Expected output:

[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
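
The log above reports one training sample per annotation. How the pipeline actually builds those samples is not shown in this guide; the sketch below illustrates one plausible approach, averaging each feature over the candles a span covers. The function name, the row layout, and the choice of the mean are all assumptions.

```python
from statistics import mean


def build_samples(feature_rows, annotations):
    """Build one (features, label) pair per annotation.

    feature_rows: list of dicts with a "time" key plus numeric feature columns.
    annotations:  list of dicts in the export format from Step 2.
    Spans that cover no candle are skipped.
    """
    samples = []
    for ann in annotations:
        in_span = [
            row for row in feature_rows
            if ann["start_time"] <= row["time"] <= ann["end_time"]
        ]
        if not in_span:
            continue
        feature_names = [name for name in in_span[0] if name != "time"]
        # Assumption: aggregate by mean; the real pipeline may use another scheme.
        features = {name: mean(row[name] for row in in_span) for name in feature_names}
        samples.append((features, ann["label"]))
    return samples
```

With this scheme, an annotation whose time range falls outside the candle data silently produces no sample, which is one way the number of training samples could end up lower than the number of annotations.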

4.3 Verify Model Created

ls -lh models/best_model.pkl

Step 5: Configure Inference Service

5.1 Check Model Path

The inference service looks for the model at models/best_model.pkl by default (configured in config/pipeline.yaml).

5.2 Restart Inference Service

Exit the container and restart the ml-service:

exit
docker-compose restart ml-service

5.3 Verify Model Loaded

# Check health
curl http://localhost:8001/health

# Get model info
curl http://localhost:8001/model/info | jq '.'

You should see:

{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
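
The same check can be scripted, for example to confirm after each retrain that the served model knows every label you annotated. This is a sketch that assumes the response shape shown above; `fetch_json` and `check_model_info` are hypothetical helpers, not part of ml-service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # ml-service address used in this guide


def fetch_json(path, timeout=5.0):
    """GET a JSON document from ml-service, e.g. fetch_json("/model/info")."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return json.load(resp)


def check_model_info(payload, expected_labels):
    """Return a list of problems found in a /model/info response."""
    problems = []
    info = payload.get("model_info") or {}
    if not info:
        problems.append("no model_info in response; the model may not be loaded")
    missing = set(expected_labels) - set(info.get("label_names", []))
    if missing:
        problems.append(f"labels missing from model: {sorted(missing)}")
    return problems
```

For example, `check_model_info(fetch_json("/model/info"), {"Doji", "Hammer"})` returns an empty list when both labels are present in the loaded model.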

Step 6: Get Predictions in UI

  1. Open http://localhost:3000
  2. Scroll down in the left sidebar to the Predictions section
  3. Toggle "Show" to enable predictions
  4. Click "Run on Visible" to predict patterns on visible candles
  5. Predictions appear as colored histogram overlays on the chart

Prediction Controls

  • Confidence Threshold - Slider to filter low-confidence predictions
  • Filter by Label - Checkboxes to show/hide specific patterns
  • Prediction Summary - Shows agreement/disagreement with human annotations
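
To make these controls concrete, here is a rough sketch of the filtering and agreement logic. It is not the UI's actual implementation: it assumes predictions carry the same fields as exported annotations, and it treats two spans as matching when their time ranges overlap.

```python
def filter_predictions(predictions, threshold, visible_labels=None):
    """Keep predictions at or above the threshold and, optionally, within a label set."""
    return [
        p for p in predictions
        if p["confidence"] >= threshold
        and (visible_labels is None or p["label"] in visible_labels)
    ]


def spans_overlap(a, b):
    """True when the two time ranges share at least one instant."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def agreement_summary(predictions, human_annotations):
    """Count agreements and disagreements between model and human spans."""
    summary = {"agree": 0, "disagree": 0, "model_only": 0, "human_only": 0}
    for p in predictions:
        overlapping = [h for h in human_annotations if spans_overlap(p, h)]
        if not overlapping:
            summary["model_only"] += 1   # model saw a pattern, human did not
        elif any(h["label"] == p["label"] for h in overlapping):
            summary["agree"] += 1
        else:
            summary["disagree"] += 1     # same region, different label
    for h in human_annotations:
        if not any(spans_overlap(p, h) for p in predictions):
            summary["human_only"] += 1   # human saw a pattern, model did not
    return summary
```

Note that raising the threshold changes the summary: a low-confidence prediction that is filtered out turns a disagreement into a human-only span.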

Troubleshooting

"No module named 'talib'"

The TA-Lib C library is not installed in the container. Rebuild the image and restart the service:

docker-compose build --no-cache ml-service
docker-compose up -d ml-service

"Model file not found"

The pipeline did not produce a model file. Check the service logs to see which stage failed:

docker-compose logs ml-service

Also make sure you have enough annotations; Step 1 recommends at least 20-30.

"Not enough data for training"

You need more annotated spans. Go back to Step 1 and add more annotations.

"MLflow connection refused"

The MLflow service is not running. Start it, then restart ml-service:

docker-compose up -d mlflow
docker-compose restart ml-service

Predictions are all wrong

The model needs more diverse training data. Add annotations for:

  • Different pattern types
  • Various market conditions (uptrend, downtrend, sideways)
  • Different timeframes

Then retrain:

docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service

View Training Experiments

Open MLflow UI at http://localhost:5000 to:

  • View all training runs
  • Compare model metrics
  • Download artifacts (confusion matrix, feature importance)
  • Register models

Next Steps

Improve Model Performance

  1. Add More Annotations - Annotate 100+ patterns for better accuracy
  2. Tune Hyperparameters - Edit config/pipeline.yaml and experiment
  3. Try XGBoost - Set model_type: "xgboost" in config/pipeline.yaml
  4. Add Custom Features - Write custom feature functions in features/custom_loader.py

Use Programmatic Labels

To auto-generate labels with TA-Lib pattern detection, edit config/pipeline.yaml:

programmatic_labels:
  enabled: true  # Set to true

This adds labels from TA-Lib CDL* functions alongside your human annotations.
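
For intuition, TA-Lib's CDL* functions return one integer per candle: zero when no pattern is detected, positive for a bullish occurrence, and negative for a bearish one. The sketch below turns such an array into records in the export format from Step 2. The label naming, the single-candle spans, and the "programmatic" source value are assumptions; the pipeline's real conversion is not shown in this guide.

```python
def cdl_to_annotations(times, cdl_values, pattern_name):
    """Convert a CDL* result array into annotation records, one per flagged candle.

    times:      candle timestamps, aligned with cdl_values
    cdl_values: integers as returned by a TA-Lib CDL* function
    """
    annotations = []
    for t, value in zip(times, cdl_values):
        if value == 0:
            continue  # no pattern on this candle
        direction = "Bullish" if value > 0 else "Bearish"
        annotations.append({
            "start_time": t,
            "end_time": t,  # assumption: label only the flagged candle
            "label": f"{direction} {pattern_name}",
            # Assumption: scale the signal strength into the 0-1 confidence range.
            "confidence": min(abs(value) / 100, 1.0),
            "source": "programmatic",  # assumption: distinguishes from "human"
        })
    return annotations
```

With TA-Lib installed, `cdl_values` would come from a call such as `talib.CDLENGULFING(open, high, low, close)` on NumPy arrays of the OHLC columns.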

Active Learning Loop

  1. Get predictions on new data
  2. Review disagreements (model missed patterns you saw, or vice versa)
  3. Correct predictions by adding new annotations
  4. Re-export annotations
  5. Retrain model
  6. Repeat

The model improves iteratively through this feedback cycle.
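
One common way to pick what to review in step 2 is uncertainty sampling: look first at the predictions the model is least sure about. The sketch below is an illustration under the same assumed prediction format as the export in Step 2, not a feature of the UI.

```python
def review_queue(predictions, human_annotations, limit=10):
    """Return predictions not yet covered by a human span, least confident first."""
    def already_annotated(p):
        return any(
            h["start_time"] <= p["end_time"] and p["start_time"] <= h["end_time"]
            for h in human_annotations
        )

    pending = [p for p in predictions if not already_annotated(p)]
    return sorted(pending, key=lambda p: p["confidence"])[:limit]
```

Annotating the items at the front of this queue tends to add the most information per label, since these are the cases where the model is closest to guessing.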

Quick Reference

# Export annotations
curl -fsS http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

# Export candle data (-T keeps a pseudo-TTY from altering the redirected output)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml

# Restart inference service
docker-compose restart ml-service

# Check model loaded
curl http://localhost:8001/model/info

# View MLflow UI (macOS; on Linux use xdg-open, or open the URL in a browser)
open http://localhost:5000