Marko Djordjevic 228f70daf3 docs: add ML pipeline quickstart guide for training first model

Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop

2026-02-15 19:08:09 +01:00


ML Pipeline Quickstart Guide

This guide walks you through training your first model from scratch.

Prerequisites

Make sure all services are running:

docker-compose up -d

This starts:

candle-annotator (Next.js app) - http://localhost:3000
ml-service (FastAPI) - http://localhost:8001
mlflow (MLflow UI) - http://localhost:5000
postgres (Database)
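Before moving on, it can help to confirm that the three HTTP services actually respond. The following is a minimal sketch using only the Python standard library. The URLs come from the list above, and /health is the endpoint used later in this guide; the function names are illustrative and not part of the project.

```python
import urllib.error
import urllib.request

# URLs from the service list above; /health is the ml-service endpoint used in Step 5.
SERVICES = {
    "candle-annotator": "http://localhost:3000",
    "ml-service": "http://localhost:8001/health",
    "mlflow": "http://localhost:5000",
}


def is_up(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 404).
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def check_services(services=SERVICES, probe=is_up):
    """Map each service name to True (reachable) or False."""
    return {name: probe(url) for name, url in services.items()}
```

Calling check_services() from a Python shell on the host returns a dict such as {"candle-annotator": True, ...}. Postgres is not covered because it does not speak HTTP.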

Step 1: Create Training Data (Annotate Patterns)

You need at least 20-30 annotated patterns for initial training.

1.1 Upload Candle Data

Open http://localhost:3000
Click "Choose CSV File" and upload your OHLCV data
Verify the chart displays correctly

1.2 Annotate Patterns

Click on "Manage Span Label Types" in the sidebar
Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
Return to main page
Select a label type from the span tools
Click and drag on the chart to create span annotations
Annotate at least 20-30 patterns (more is better)

Tip: Aim for diverse patterns across different market conditions.

Step 2: Export Annotations

2.1 Export via API

# Export span annotations in ML pipeline format
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

2.2 Verify Export

# Check the exported file
jq '.' services/ml/data/annotations/export.json

You should see JSON with your annotations:

{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
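Beyond eyeballing the JSON, a short script can catch structural problems before training. This is a sketch only: the required keys are inferred from the single sample above, and the real export schema may contain more fields or different rules.

```python
# Keys inferred from the sample export above (an assumption, not a documented schema).
REQUIRED_KEYS = {"start_time", "end_time", "label", "confidence", "source"}


def validate_export(payload):
    """Return a list of problems; an empty list means the export looks sane."""
    annotations = payload.get("annotations")
    if not isinstance(annotations, list) or not annotations:
        return ["no annotations found"]

    problems = []
    for i, ann in enumerate(annotations):
        missing = REQUIRED_KEYS - set(ann)
        if missing:
            problems.append(f"annotation {i}: missing {sorted(missing)}")
            continue
        if ann["start_time"] > ann["end_time"]:
            problems.append(f"annotation {i}: start_time is after end_time")
        if not 0.0 <= ann["confidence"] <= 1.0:
            problems.append(f"annotation {i}: confidence outside [0, 1]")
    return problems
```

To run it against the exported file, load it with json.load(open("services/ml/data/annotations/export.json")) and pass the result to validate_export.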

Step 3: Prepare Raw OHLCV Data

Copy your candle data to the ML pipeline:

# Export candles from the database
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

Or manually copy your CSV file:

cp your_data.csv services/ml/data/raw/OHLCV.csv

Format required:

time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
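A quick sanity check of the CSV can save a failed pipeline run. The sketch below assumes what the sample rows suggest: time is an integer Unix timestamp in seconds, rows are sorted by time, and high/low bracket open/close. These rules are assumptions, not documented requirements of the pipeline.

```python
import csv
import io

EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]


def check_ohlcv(text):
    """Return a list of problems found in OHLCV CSV text; empty means it looks fine."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]

    problems = []
    prev_time = None
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        try:
            t = int(row[0])
            o, h, l, c, _volume = (float(x) for x in row[1:6])
        except ValueError:
            problems.append(f"line {line_no}: could not parse row")
            continue
        if h < max(o, c) or l > min(o, c):
            problems.append(f"line {line_no}: high/low do not bracket open/close")
        if prev_time is not None and t <= prev_time:
            problems.append(f"line {line_no}: time is not strictly increasing")
        prev_time = t
    return problems
```

For a real file, pass the contents of services/ml/data/raw/OHLCV.csv, for example check_ohlcv(open(path).read()).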

Step 4: Run the ML Pipeline

4.1 Enter ML Service Container

docker-compose exec ml-service bash

4.2 Run Full Pipeline

# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml

This will:

Feature Engineering - Compute TA-Lib indicators (RSI, MACD, etc.)
Annotation Ingestion - Convert annotations to labeled dataset
Training - Train RandomForest model and log to MLflow
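To make the annotation ingestion stage more concrete, here is one plausible way span annotations could be turned into training samples. The pipeline's actual logic is not shown in this guide, so everything below is an assumption: the function name, the row format, and the choice to average feature values over each span are illustrative only.

```python
def spans_to_samples(feature_rows, annotations):
    """Build one (features, label) sample per annotated span.

    feature_rows: list of dicts with a 'time' key plus numeric feature columns.
    annotations:  list of dicts with 'start_time', 'end_time', and 'label'.
    """
    samples = []
    for ann in annotations:
        inside = [
            row for row in feature_rows
            if ann["start_time"] <= row["time"] <= ann["end_time"]
        ]
        if not inside:
            continue  # the span covers no candles, so there is nothing to learn from
        names = [key for key in inside[0] if key != "time"]
        # Hypothetical aggregation: the mean of each feature over the span.
        features = {
            name: sum(row[name] for row in inside) / len(inside) for name in names
        }
        samples.append((features, ann["label"]))
    return samples
```

Producing one sample per span matches the sample log in this guide, where 25 annotations yield 25 training samples, but the real aggregation may differ.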

Expected output:

[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
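The "F1 macro" figure in the log is the unweighted mean of the per-class F1 scores, so rare patterns count as much as common ones. The sketch below computes it from scratch to show what the number means. The pipeline itself most likely relies on a library implementation, which this guide does not confirm.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels ["Doji", "Doji", "Hammer", "Hammer"] and predictions ["Doji", "Hammer", "Hammer", "Hammer"], the per-class scores are 2/3 for Doji and 4/5 for Hammer, giving a macro F1 of about 0.73.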

4.3 Verify Model Created

ls -lh models/best_model.pkl

Step 5: Configure Inference Service

5.1 Check Model Path

The inference service looks for the model at models/best_model.pkl by default (configured in config/pipeline.yaml).

5.2 Restart Inference Service

Exit the container and restart the ml-service:

exit
docker-compose restart ml-service

5.3 Verify Model Loaded

# Check health
curl http://localhost:8001/health

# Get model info
curl http://localhost:8001/model/info | jq '.'

You should see:

{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
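The response can also be checked programmatically, for instance to confirm that every label you annotated made it into the model. This sketch assumes the response shape shown above. The min_f1 threshold is an arbitrary illustrative value, not a project setting.

```python
def check_model_info(payload, expected_labels, min_f1=0.5):
    """Return a list of problems found in a /model/info response."""
    info = payload.get("model_info", {})
    metrics = payload.get("metrics", {})
    problems = []

    missing = sorted(set(expected_labels) - set(info.get("label_names", [])))
    if missing:
        problems.append(f"labels missing from model: {missing}")

    f1 = metrics.get("f1_macro")
    if f1 is None:
        problems.append("f1_macro not reported")
    elif f1 < min_f1:  # min_f1 is a hypothetical quality bar
        problems.append(f"f1_macro {f1:.2f} is below {min_f1:.2f}")
    return problems
```

Pass in the parsed JSON from http://localhost:8001/model/info together with the label names you created in Step 1. An empty list means no problems were detected.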

Step 6: Get Predictions in UI

Open http://localhost:3000
Scroll down in the left sidebar to the Predictions section
Toggle "Show" to enable predictions
Click "Run on Visible" to predict patterns on visible candles
Predictions appear as colored histogram overlays on the chart

Prediction Controls

Confidence Threshold - Slider to filter low-confidence predictions
Filter by Label - Checkboxes to show/hide specific patterns
Prediction Summary - Shows agreement/disagreement with human annotations
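Conceptually, the threshold slider and the label checkboxes act as a simple filter over the prediction list. The sketch below illustrates that idea. The prediction dict shape (label and confidence keys) is an assumption, and the UI's real implementation is not shown in this guide.

```python
def filter_predictions(predictions, threshold=0.5, visible_labels=None):
    """Keep predictions at or above the threshold whose label is currently visible.

    visible_labels=None means every label is shown.
    """
    return [
        p for p in predictions
        if p["confidence"] >= threshold
        and (visible_labels is None or p["label"] in visible_labels)
    ]
```

Raising the threshold trades recall for precision: fewer overlays appear, but the ones that remain are those the model is most sure about.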

Troubleshooting

"No module named 'talib'"

The TA-Lib C library is not installed in the container. Rebuild the image and restart the service:

docker-compose build --no-cache ml-service
docker-compose up -d ml-service

"Model file not found"

The pipeline did not produce a model file. Check the service logs for errors:

docker-compose logs ml-service

Make sure you have enough annotations (10-20 at the very least; Step 1 recommends 20-30).

"Not enough data for training"

You need more annotated spans. Go back to Step 1 and add more annotations.
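The total count can hide a class imbalance: 30 annotations are of little use if 28 of them share one label. The following sketch counts annotations per label in the exported file. The min_per_label default is an illustrative value, not a limit enforced by the pipeline.

```python
from collections import Counter


def underrepresented_labels(annotations, min_per_label=5):
    """Return {label: count} for labels with fewer than min_per_label annotations."""
    counts = Counter(ann["label"] for ann in annotations)
    return {label: n for label, n in counts.items() if n < min_per_label}
```

Run it on the "annotations" list from export.json; any label it returns is a good candidate for more annotation work in Step 1.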

"MLflow connection refused"

The MLflow service is not running. Start it, then restart ml-service:

docker-compose up -d mlflow
docker-compose restart ml-service

Predictions are all wrong

The model needs more diverse training data. Add annotations for:

Different pattern types
Various market conditions (uptrend, downtrend, sideways)
Different timeframes

Then retrain:

docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service

View Training Experiments

Open MLflow UI at http://localhost:5000 to:

View all training runs
Compare model metrics
Download artifacts (confusion matrix, feature importance)
Register models

Next Steps

Improve Model Performance

Add More Annotations - Annotate 100+ patterns for better accuracy
Tune Hyperparameters - Edit config/pipeline.yaml and experiment
Try XGBoost - Set model_type: "xgboost" in config/pipeline.yaml
Add Custom Features - Write custom feature functions in features/custom_loader.py
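As an illustration of what a custom feature might look like, the sketch below computes three candle-shape ratios that are often useful for candlestick patterns. The interface expected by features/custom_loader.py is not documented here, so the function name, signature, and return format are assumptions you would need to adapt.

```python
def candle_shape_features(open_, high, low, close):
    """Describe a candle's shape as fractions of its high-low range."""
    rng = high - low
    if rng <= 0:
        # Flat or malformed candle: avoid dividing by zero.
        return {"body_ratio": 0.0, "upper_wick_ratio": 0.0, "lower_wick_ratio": 0.0}
    return {
        "body_ratio": abs(close - open_) / rng,
        "upper_wick_ratio": (high - max(open_, close)) / rng,
        "lower_wick_ratio": (min(open_, close) - low) / rng,
    }
```

A Doji, for example, shows up as a body_ratio close to 0, while a Hammer has a small body and a large lower_wick_ratio.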

Use Programmatic Labels

To auto-generate labels with TA-Lib pattern detection, edit config/pipeline.yaml:

programmatic_labels:
  enabled: true  # Set to true

This adds labels from TA-Lib CDL* functions alongside your human annotations.

Active Learning Loop

Get predictions on new data
Review disagreements (model missed patterns you saw, or vice versa)
Correct predictions by adding new annotations
Re-export annotations
Retrain model
Repeat

The model improves iteratively through this feedback cycle.
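The "review disagreements" step can be sketched as a span-overlap comparison between human annotations and model predictions. The dict shape (start_time, end_time, label) follows the export format from Step 2. Treating any overlap with a matching label as agreement is an assumption, and the UI's Prediction Summary may use a different rule.

```python
def overlaps(a, b):
    """True if two spans share at least one point in time."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def find_disagreements(human, predicted):
    """Split disagreements into patterns the model missed and predictions nobody annotated."""
    def matched(span, others):
        return any(overlaps(span, o) and o["label"] == span["label"] for o in others)

    missed = [h for h in human if not matched(h, predicted)]
    unconfirmed = [p for p in predicted if not matched(p, human)]
    return missed, unconfirmed
```

The missed list points at patterns the model has not learned yet. The unconfirmed list contains either false positives or patterns you overlooked, and both are worth reviewing before the next export.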

Quick Reference

# Export annotations
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

# Export candle data
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml

# Restart inference service
docker-compose restart ml-service

# Check model loaded
curl http://localhost:8001/model/info

# View MLflow UI (macOS; on Linux use xdg-open, or open the URL in a browser)
open http://localhost:5000
