Complete step-by-step guide covering: creating training data through annotations, exporting annotations and candle data, running the full ML pipeline, verifying model creation and loading, getting predictions in the UI, troubleshooting common issues, and iterating through the active learning loop.
ML Pipeline Quickstart Guide
This guide walks you through training your first model from scratch.
Prerequisites
Make sure all services are running:
docker-compose up -d
This starts:
- candle-annotator (Next.js app) - http://localhost:3000
- ml-service (FastAPI) - http://localhost:8001
- mlflow (MLflow UI) - http://localhost:5000
- postgres (Database)
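Before moving on, it can help to confirm that each HTTP service actually answers. The following is a minimal sketch, assuming the three URLs listed above respond to a plain GET with a non-error status. The `check_services` helper and its `fetch` parameter are illustrative names, not part of the project.

```python
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# URLs taken from the service list above; /health is the endpoint used in Step 5.
SERVICES: Dict[str, str] = {
    "candle-annotator": "http://localhost:3000",
    "ml-service": "http://localhost:8001/health",
    "mlflow": "http://localhost:5000",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_services(
    services: Dict[str, str] = SERVICES,
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each service name to True if it answered with a status below 400."""
    results = {}
    for name, url in services.items():
        try:
            results[name] = fetch(url) < 400
        except (URLError, OSError):
            # Connection refused, timeout, DNS failure, or an HTTP error status.
            results[name] = False
    return results
```

Calling `check_services()` with no arguments probes the three local URLs. Passing a custom `fetch` makes the helper usable without a running stack.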
Step 1: Create Training Data (Annotate Patterns)
You need at least 20-30 annotated patterns for initial training.
1.1 Upload Candle Data
- Open http://localhost:3000
- Click "Choose CSV File" and upload your OHLCV data
- Verify the chart displays correctly
1.2 Annotate Patterns
- Click on "Manage Span Label Types" in the sidebar
- Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
- Return to main page
- Select a label type from the span tools
- Click and drag on the chart to create span annotations
- Annotate at least 20-30 patterns (more is better)
Tip: Aim for diverse patterns across different market conditions.
Step 2: Export Annotations
2.1 Export via API
# Export span annotations in ML pipeline format
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
2.2 Verify Export
# Check the exported file
jq '.' services/ml/data/annotations/export.json
You should see JSON with your annotations:
{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
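To check the export beyond eyeballing it, you can count spans per label and catch malformed entries. This is a rough sketch that assumes the file has exactly the shape shown above (a top-level `annotations` list whose items carry `start_time`, `end_time`, and `label`). The function names are made up for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict, List

# Fields every annotation is assumed to carry, based on the sample above.
REQUIRED_FIELDS = ("start_time", "end_time", "label")


def summarize_export(export: Dict[str, Any]) -> Counter:
    """Validate annotations and return the number of spans per label.

    Raises ValueError on the first malformed annotation.
    """
    annotations: List[Dict[str, Any]] = export.get("annotations", [])
    counts: Counter = Counter()
    for index, annotation in enumerate(annotations):
        missing = [f for f in REQUIRED_FIELDS if f not in annotation]
        if missing:
            raise ValueError(f"annotation {index} is missing {missing}")
        if annotation["end_time"] < annotation["start_time"]:
            raise ValueError(f"annotation {index} ends before it starts")
        counts[annotation["label"]] += 1
    return counts


def load_export(path: str) -> Dict[str, Any]:
    """Read the exported JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `summarize_export(load_export("services/ml/data/annotations/export.json"))` returns a per-label count, which makes it easy to spot labels with only a handful of examples before training.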
Step 3: Prepare Raw OHLCV Data
Copy your candle data to the ML pipeline:
# Export candles from the database
docker-compose exec candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
Or manually copy your CSV file:
cp your_data.csv services/ml/data/raw/OHLCV.csv
Format required:
time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
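A quick sanity check on the CSV can save a failed pipeline run. The sketch below assumes `time` is an integer Unix timestamp in strictly increasing order, as in the sample rows. The high/low consistency check is a general OHLC sanity rule, not a documented requirement of this pipeline, and `validate_ohlcv` is an illustrative name.

```python
import csv
import io
from typing import List, TextIO

EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]


def validate_ohlcv(handle: TextIO) -> List[str]:
    """Return a list of human-readable problems; empty means the file looks fine."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]

    problems: List[str] = []
    previous_time = None
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_number}: expected 6 columns, got {len(row)}")
            continue
        try:
            time = int(row[0])
            open_, high, low, close, volume = (float(value) for value in row[1:])
        except ValueError:
            problems.append(f"line {line_number}: non-numeric value")
            continue
        if previous_time is not None and time <= previous_time:
            problems.append(f"line {line_number}: time is not strictly increasing")
        if high < max(open_, close) or low > min(open_, close):
            problems.append(f"line {line_number}: high/low do not bracket open/close")
        previous_time = time
    return problems


# The two sample rows from this guide pass every check.
sample = io.StringIO(
    "time,open,high,low,close,volume\n"
    "1700000000,1.0500,1.0520,1.0490,1.0510,1000\n"
    "1700000060,1.0510,1.0530,1.0505,1.0525,1200\n"
)
print(validate_ohlcv(sample))  # prints []
```

To check the real file, open it with `open("services/ml/data/raw/OHLCV.csv", newline="")` and pass the handle to `validate_ohlcv`.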
Step 4: Run the ML Pipeline
4.1 Enter ML Service Container
docker-compose exec ml-service bash
4.2 Run Full Pipeline
# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml
This runs three stages in order:
- Feature Engineering - Compute TA-Lib indicators (RSI, MACD, etc.)
- Annotation Ingestion - Convert annotations to labeled dataset
- Training - Train RandomForest model and log to MLflow
Expected output:
[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
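The annotation ingestion stage is described only as converting annotations to a labeled dataset, and the log above suggests one training sample per annotated span. One plausible reading, sketched here as an assumption rather than the pipeline's actual logic, is that each span is matched to the candles whose timestamps fall inside it (bounds inclusive). The function name and return shape are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Sequence, Tuple


def spans_to_samples(
    candle_times: Sequence[int],
    annotations: Sequence[Dict],
) -> List[Tuple[str, range]]:
    """Return one (label, candle_index_range) pair per annotation.

    `candle_times` must be sorted ascending. Spans that cover no candle
    are skipped, since they would yield an empty sample.
    """
    samples = []
    for annotation in annotations:
        # Index of the first candle at or after the span start.
        first = bisect_left(candle_times, annotation["start_time"])
        # Exclusive index just past the last candle at or before the span end.
        last = bisect_right(candle_times, annotation["end_time"])
        if first < last:
            samples.append((annotation["label"], range(first, last)))
    return samples
```

Each returned range indexes into the feature table produced by the previous stage. How those rows are then aggregated into a single sample is pipeline-specific and not shown here.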
4.3 Verify Model Created
ls -lh models/best_model.pkl
Step 5: Configure Inference Service
5.1 Check Model Path
The inference service looks for the model at models/best_model.pkl by default (configured in config/pipeline.yaml).
5.2 Restart Inference Service
Exit the container and restart the ml-service:
exit
docker-compose restart ml-service
5.3 Verify Model Loaded
# Check health
curl http://localhost:8001/health
# Get model info
curl http://localhost:8001/model/info | jq '.'
You should see:
{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
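If you want to check the response programmatically, something like the following could compare the loaded model against the labels you annotated. It assumes the response has the `model_info` shape shown above. `check_model_info` is an illustrative helper, not an existing function in the service.

```python
from typing import Any, Dict, List


def check_model_info(payload: Dict[str, Any], expected_labels: List[str]) -> List[str]:
    """Compare a /model/info response with the labels you annotated.

    Returns a list of problems; an empty list means nothing looked off.
    """
    info = payload.get("model_info")
    if not info:
        return ["response has no model_info"]

    problems = []
    missing = sorted(set(expected_labels) - set(info.get("label_names", [])))
    if missing:
        problems.append(f"labels missing from model: {missing}")
    if not info.get("feature_count"):
        problems.append("feature_count is zero or absent")
    return problems
```

The payload can come from `json.load(urllib.request.urlopen("http://localhost:8001/model/info"))`. A label that you annotated but that is missing from `label_names` usually means the model was trained on an older export.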
Step 6: Get Predictions in UI
- Open http://localhost:3000
- Scroll down in the left sidebar to the Predictions section
- Toggle "Show" to enable predictions
- Click "Run on Visible" to predict patterns on visible candles
- Predictions appear as colored histogram overlays on the chart
Prediction Controls
- Confidence Threshold - Slider to filter low-confidence predictions
- Filter by Label - Checkboxes to show/hide specific patterns
- Prediction Summary - Shows agreement/disagreement with human annotations
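Conceptually, the threshold slider and label checkboxes amount to a simple filter over the prediction list. The sketch below assumes each prediction is a dict with `label` and `confidence` keys, mirroring the annotation export format. The real UI data shape may differ, and `filter_predictions` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Optional, Set


def filter_predictions(
    predictions: Iterable[Dict],
    min_confidence: float = 0.5,
    visible_labels: Optional[Set[str]] = None,
) -> List[Dict]:
    """Keep predictions at or above the threshold whose label is enabled.

    `visible_labels=None` means every label is shown.
    """
    return [
        p for p in predictions
        if p["confidence"] >= min_confidence
        and (visible_labels is None or p["label"] in visible_labels)
    ]
```

Raising `min_confidence` trades recall for precision: fewer overlays appear, but the ones that remain are those the model is more confident about.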
Troubleshooting
"No module named 'talib'"
The TA-Lib C library is not installed in the container. Rebuild the image without cache and start the service again:
docker-compose build --no-cache ml-service
docker-compose up -d ml-service
"Model file not found"
The pipeline did not produce a model file. Check the service logs to see which stage failed:
docker-compose logs ml-service
Make sure you have enough annotations: 10-20 is the bare minimum, and Step 1 recommends at least 20-30.
"Not enough data for training"
You need more annotated spans. Go back to Step 1 and add more annotations.
"MLflow connection refused"
The MLflow service is not running. Start it, then restart ml-service so it can reconnect:
docker-compose up -d mlflow
docker-compose restart ml-service
Predictions are all wrong
The model needs more diverse training data. Add annotations for:
- Different pattern types
- Various market conditions (uptrend, downtrend, sideways)
- Different timeframes
Then retrain:
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service
View Training Experiments
Open MLflow UI at http://localhost:5000 to:
- View all training runs
- Compare model metrics
- Download artifacts (confusion matrix, feature importance)
- Register models
Next Steps
Improve Model Performance
- Add More Annotations - Annotate 100+ patterns for better accuracy
- Tune Hyperparameters - Edit config/pipeline.yaml and experiment
- Try XGBoost - Change model_type: "xgboost" in config
- Add Custom Features - Write custom feature functions in features/custom_loader.py
Use Programmatic Labels
To auto-generate labels with TA-Lib pattern detection, edit config/pipeline.yaml:
programmatic_labels:
  enabled: true  # Set to true
This adds labels from TA-Lib CDL* functions alongside your human annotations.
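To make this concrete: TA-Lib's CDL* functions return one integer per candle, nonzero where the pattern fired (by convention positive for bullish and negative for bearish signals). The sketch below shows how such an output could be turned into annotations in the export format from Step 2. It is only an illustration of the idea. The `cdl_hits_to_annotations` name, the fixed confidence, and the `"programmatic"` source value are assumptions, not the pipeline's actual behavior.

```python
from typing import Dict, List, Sequence


def cdl_hits_to_annotations(
    times: Sequence[int],
    hits: Sequence[int],
    label: str,
) -> List[Dict]:
    """Turn per-candle pattern scores into single-candle annotations.

    `hits` holds one integer per candle, nonzero where the pattern fired.
    """
    if len(times) != len(hits):
        raise ValueError("times and hits must have the same length")
    return [
        {
            "start_time": time,
            "end_time": time,
            "label": label,
            "confidence": 1.0,  # assumption: rule-based hits are treated as certain
            "source": "programmatic",  # hypothetical value; this guide only shows "human"
        }
        for time, score in zip(times, hits)
        if score != 0
    ]
```

With TA-Lib installed, `hits` could come from a call such as `talib.CDLDOJI(open, high, low, close)` on NumPy arrays of the candle columns.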
Active Learning Loop
- Get predictions on new data
- Review disagreements (model missed patterns you saw, or vice versa)
- Correct predictions by adding new annotations
- Re-export annotations
- Retrain model
- Repeat
The model improves iteratively through this feedback cycle.
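The "review disagreements" step can be approximated offline by comparing human spans with predicted spans. The following sketch assumes both lists use the `start_time`, `end_time`, and `label` fields from the export format and treats two spans as agreeing when they share a label and overlap in time. That matching rule and the function names are illustrative choices, not how the UI's prediction summary is necessarily computed.

```python
from typing import Dict, List, Sequence, Tuple


def overlaps(a: Dict, b: Dict) -> bool:
    """True if two spans share at least one timestamp (inclusive bounds)."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def find_disagreements(
    human: Sequence[Dict],
    predicted: Sequence[Dict],
) -> Tuple[List[Dict], List[Dict]]:
    """Return (missed, unconfirmed).

    missed: human spans with no overlapping prediction of the same label.
    unconfirmed: predictions with no overlapping human span of the same label.
    """
    def matched(span: Dict, others: Sequence[Dict]) -> bool:
        return any(
            o["label"] == span["label"] and overlaps(span, o) for o in others
        )

    missed = [h for h in human if not matched(h, predicted)]
    unconfirmed = [p for p in predicted if not matched(p, human)]
    return missed, unconfirmed
```

Spans in `missed` are patterns the model failed to find. Spans in `unconfirmed` are either false positives or patterns you have not annotated yet, so both lists are good candidates for the next annotation round.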
Quick Reference
# Export annotations
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
# Export candle data
docker-compose exec candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
# Restart inference service
docker-compose restart ml-service
# Check model loaded
curl http://localhost:8001/model/info
# View MLflow UI (macOS; on Linux use xdg-open instead of open)
open http://localhost:5000