docs: add ML pipeline quickstart guide for training first model

Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop
2026-02-15 19:08:09 +01:00 · 228f70daf3
commit 228f70daf3
parent 21f184aa8d
1 changed files with 299 additions and 0 deletions
--- a/ML_QUICKSTART.md
+++ b/ML_QUICKSTART.md
@@ -0,0 +1,299 @@
+# ML Pipeline Quickstart Guide
+
+This guide walks you through training your first model from scratch.
+
+## Prerequisites
+
+Make sure all services are running:
+
+```bash
+docker-compose up -d
+```
+
+This starts:
+- **candle-annotator** (Next.js app) - http://localhost:3000
+- **ml-service** (FastAPI) - http://localhost:8001
+- **mlflow** (MLflow UI) - http://localhost:5000
+- **postgres** (Database)
+
+## Step 1: Create Training Data (Annotate Patterns)
+
+You need at least 20-30 annotated patterns for initial training.
+
+### 1.1 Upload Candle Data
+
+1. Open http://localhost:3000
+2. Click "Choose CSV File" and upload your OHLCV data
+3. Verify the chart displays correctly
+
+### 1.2 Annotate Patterns
+
+1. Click on "Manage Span Label Types" in the sidebar
+2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
+3. Return to main page
+4. Select a label type from the span tools
+5. Click and drag on the chart to create span annotations
+6. Annotate at least 20-30 patterns (more is better)
+
+**Tip:** Aim for diverse patterns across different market conditions.
+
+## Step 2: Export Annotations
+
+### 2.1 Export via API
+
+```bash
+# Export span annotations in ML pipeline format
+curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
+```
+
+### 2.2 Verify Export
+
+```bash
+# Pretty-print the exported file (requires jq)
+jq '.' services/ml/data/annotations/export.json
+```
+
+You should see JSON with your annotations:
+```json
+{
+  "annotations": [
+    {
+      "start_time": 1700000000,
+      "end_time": 1700003600,
+      "label": "Bullish Engulfing",
+      "confidence": 1.0,
+      "source": "human"
+    }
+  ]
+}
+```
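+
+Before moving on, it can help to count annotations per label, since Step 1 recommends at least 20-30 in total. The sketch below assumes the export has exactly the shape shown above. The helper name `summarize_annotations` is illustrative and not part of the project.
+
+```python
+import json
+from collections import Counter
+
+
+def summarize_annotations(payload: dict) -> Counter:
+    """Count annotations per label, rejecting spans that end before they start."""
+    counts = Counter()
+    for ann in payload["annotations"]:
+        if ann["end_time"] < ann["start_time"]:
+            raise ValueError(f"span ends before it starts: {ann}")
+        counts[ann["label"]] += 1
+    return counts
+
+
+def load_and_summarize(path: str) -> Counter:
+    """Read an export file from disk and summarize it."""
+    with open(path, encoding="utf-8") as fh:
+        return summarize_annotations(json.load(fh))
+```
+
+Calling `load_and_summarize("services/ml/data/annotations/export.json")` returns a per-label count. A label with only a handful of spans is a likely candidate for more annotation.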
+
+## Step 3: Prepare Raw OHLCV Data
+
+Copy your candle data to the ML pipeline:
+
+```bash
+# Export candles from the database
+docker-compose exec candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
+```
+
+Or manually copy your CSV file:
+```bash
+cp your_data.csv services/ml/data/raw/OHLCV.csv
+```
+
+**Format required:**
+```csv
+time,open,high,low,close,volume
+1700000000,1.0500,1.0520,1.0490,1.0510,1000
+1700000060,1.0510,1.0530,1.0505,1.0525,1200
+```
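+
+A malformed CSV tends to fail later and less clearly, so a quick check up front can save a pipeline run. The sketch below assumes `time` is an integer Unix timestamp, as in the sample rows. The `validate_ohlcv` helper and its checks are illustrative, not the pipeline's own validation.
+
+```python
+import csv
+import io
+
+EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]
+
+
+def validate_ohlcv(text: str) -> list:
+    """Return a list of problems found in OHLCV CSV text (empty if none)."""
+    reader = csv.reader(io.StringIO(text))
+    header = next(reader, None)
+    if header != EXPECTED_HEADER:
+        return [f"unexpected header: {header}"]
+    problems = []
+    prev_time = None
+    for line_no, row in enumerate(reader, start=2):
+        if len(row) != len(EXPECTED_HEADER):
+            problems.append(f"line {line_no}: expected 6 columns, got {len(row)}")
+            continue
+        try:
+            t = int(row[0])
+            o, h, l, c, v = (float(x) for x in row[1:])
+        except ValueError:
+            problems.append(f"line {line_no}: non-numeric value")
+            continue
+        if prev_time is not None and t <= prev_time:
+            problems.append(f"line {line_no}: time is not strictly increasing")
+        if h < max(o, c) or l > min(o, c):
+            problems.append(f"line {line_no}: high/low inconsistent with open/close")
+        prev_time = t
+    return problems
+```
+
+Passing the contents of `services/ml/data/raw/OHLCV.csv` to `validate_ohlcv` should return an empty list for a well-formed file.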
+
+## Step 4: Run the ML Pipeline
+
+### 4.1 Enter ML Service Container
+
+```bash
+docker-compose exec ml-service bash
+```
+
+### 4.2 Run Full Pipeline
+
+```bash
+# Run all stages: feature engineering → annotation ingestion → training
+python pipeline.py --config config/pipeline.yaml
+```
+
+This will:
+1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
+2. **Annotation Ingestion** - Convert annotations to labeled dataset
+3. **Training** - Train RandomForest model and log to MLflow
+
+**Expected output:**
+```
+[INFO] Starting pipeline...
+[INFO] Stage: feature_engineering
+[INFO] Computing TA-Lib indicators...
+[INFO] Computed 42 features
+[INFO] Stage: annotation_ingestion
+[INFO] Loaded 25 human annotations
+[INFO] Created 25 training samples
+[INFO] Stage: training
+[INFO] Training RandomForest with 200 estimators
+[INFO] Training complete. F1 macro: 0.78
+[INFO] Model saved to models/best_model.pkl
+```
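+
+The log above shows 25 annotations producing 25 training samples, which suggests one sample per annotated span. How the real pipeline aggregates candles inside a span is not documented here. The sketch below is only one plausible reading: it averages each feature over the span. `build_samples` and the row layout are hypothetical.
+
+```python
+from statistics import mean
+
+
+def build_samples(feature_rows, annotations):
+    """Turn each annotated span into one (features, label) training sample.
+
+    feature_rows: list of dicts with a "time" key plus numeric feature columns.
+    annotations: list of dicts with "start_time", "end_time" and "label".
+    Averaging over the span is an assumption made for this sketch.
+    """
+    samples = []
+    for ann in annotations:
+        rows = [r for r in feature_rows
+                if ann["start_time"] <= r["time"] <= ann["end_time"]]
+        if not rows:
+            continue  # the span covers no candles, so it yields no sample
+        names = [k for k in rows[0] if k != "time"]
+        features = {name: mean(r[name] for r in rows) for name in names}
+        samples.append((features, ann["label"]))
+    return samples
+```
+
+Under this reading, a span that falls outside the exported candle range silently yields no sample, which is one way the sample count could end up lower than the annotation count.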
+
+### 4.3 Verify Model Created
+
+```bash
+ls -lh models/best_model.pkl
+```
+
+## Step 5: Configure Inference Service
+
+### 5.1 Check Model Path
+
+The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).
+
+### 5.2 Restart Inference Service
+
+Exit the container and restart the ml-service:
+
+```bash
+exit
+docker-compose restart ml-service
+```
+
+### 5.3 Verify Model Loaded
+
+```bash
+# Check health
+curl http://localhost:8001/health
+
+# Get model info
+curl http://localhost:8001/model/info | jq '.'
+```
+
+You should see:
+```json
+{
+  "model_info": {
+    "model_name": "candlestick_pattern_v1",
+    "model_version": "...",
+    "model_type": "RandomForest",
+    "trained_at": "...",
+    "feature_count": 42,
+    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
+  },
+  "metrics": {
+    "accuracy": 0.85,
+    "f1_macro": 0.78,
+    "per_class": {...}
+  }
+}
+```
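+
+The same check can be scripted, for example as a smoke test after each retrain. The sketch below assumes the response shape shown above. The function names and the idea of a minimum F1 are illustrative assumptions, not project conventions.
+
+```python
+import json
+from urllib.request import urlopen
+
+MODEL_INFO_URL = "http://localhost:8001/model/info"  # endpoint used in the step above
+
+
+def fetch_model_info(url=MODEL_INFO_URL):
+    """Fetch and decode the /model/info response."""
+    with urlopen(url, timeout=5) as resp:
+        return json.load(resp)
+
+
+def check_model_info(info, expected_labels, min_f1):
+    """Return a list of warnings about the loaded model (empty if none)."""
+    warnings = []
+    labels = set(info["model_info"]["label_names"])
+    missing = set(expected_labels) - labels
+    if missing:
+        warnings.append(f"model has no class for: {sorted(missing)}")
+    f1 = info["metrics"]["f1_macro"]
+    if f1 < min_f1:
+        warnings.append(f"f1_macro {f1:.2f} is below {min_f1:.2f}")
+    return warnings
+```
+
+With the services running, `check_model_info(fetch_model_info(), {"Doji", "Hammer"}, 0.5)` returns an empty list when every expected label is present and the score clears the chosen bar.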
+
+## Step 6: Get Predictions in UI
+
+1. Open http://localhost:3000
+2. Scroll down in the left sidebar to the **Predictions** section
+3. Toggle "Show" to enable predictions
+4. Click "Run on Visible" to predict patterns on visible candles
+5. Predictions appear as colored histogram overlays on the chart
+
+### Prediction Controls
+
+- **Confidence Threshold** - Slider to filter low-confidence predictions
+- **Filter by Label** - Checkboxes to show/hide specific patterns
+- **Prediction Summary** - Shows agreement/disagreement with human annotations
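+
+Conceptually, the threshold slider and the label checkboxes act as a filter over the prediction list. The sketch below illustrates that idea only. The prediction shape (`label`, `confidence`) and the `filter_predictions` name are assumptions, not the app's actual code.
+
+```python
+def filter_predictions(predictions, threshold, visible_labels=None):
+    """Keep predictions at or above the threshold whose label is visible.
+
+    predictions: list of dicts with "label" and "confidence" keys (assumed shape).
+    visible_labels: set of labels to show, or None to show every label.
+    """
+    return [
+        p for p in predictions
+        if p["confidence"] >= threshold
+        and (visible_labels is None or p["label"] in visible_labels)
+    ]
+```
+
+Raising the threshold trades recall for precision: fewer overlays appear, but the ones that remain are those the model is most sure about.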
+
+## Troubleshooting
+
+### "No module named 'talib'"
+
+The TA-Lib C library is not installed in the container. Rebuild the image:
+
+```bash
+docker-compose build --no-cache ml-service
+docker-compose up -d ml-service
+```
+
+### "Model file not found"
+
+The pipeline didn't create the model. Check the service logs for errors:
+
+```bash
+docker-compose logs ml-service
+```
+
+Make sure you have enough annotations (Step 1 recommends at least 20-30).
+
+### "Not enough data for training"
+
+You need more annotated spans. Go back to Step 1 and add more annotations.
+
+### "MLflow connection refused"
+
+The MLflow service is not running. Start it, then restart ml-service:
+
+```bash
+docker-compose up -d mlflow
+docker-compose restart ml-service
+```
+
+### Predictions are all wrong
+
+The model needs more diverse training data. Add annotations for:
+- Different pattern types
+- Various market conditions (uptrend, downtrend, sideways)
+- Different timeframes
+
+Then retrain:
+
+```bash
+docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
+docker-compose restart ml-service
+```
+
+## View Training Experiments
+
+Open MLflow UI at http://localhost:5000 to:
+- View all training runs
+- Compare model metrics
+- Download artifacts (confusion matrix, feature importance)
+- Register models
+
+## Next Steps
+
+### Improve Model Performance
+
+1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
+2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
+3. **Try XGBoost** - Set `model_type: "xgboost"` in the config
+4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`
+
+### Use Programmatic Labels
+
+Enable TA-Lib pattern detection to auto-generate labels by editing `config/pipeline.yaml`:
+```yaml
+programmatic_labels:
+  enabled: true  # Set to true
+```
+
+This adds labels from TA-Lib CDL* functions alongside your human annotations.
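+
+To give a feel for what a rule-based pattern detector does, here is a deliberately simplified bullish-engulfing rule in plain Python. It is not TA-Lib's implementation, and functions such as `talib.CDLENGULFING` apply their own criteria that may differ in detail. The `bullish_engulfing` name and candle layout are illustrative.
+
+```python
+def bullish_engulfing(candles):
+    """Return indices where a simplified bullish-engulfing rule fires.
+
+    candles: list of dicts with "open" and "close" keys, oldest first.
+    Rule (simplified): the previous candle closes down, the current candle
+    closes up, and the current body fully covers the previous body.
+    """
+    hits = []
+    for i in range(1, len(candles)):
+        prev, cur = candles[i - 1], candles[i]
+        prev_down = prev["close"] < prev["open"]
+        cur_up = cur["close"] > cur["open"]
+        covers = cur["open"] <= prev["close"] and cur["close"] >= prev["open"]
+        if prev_down and cur_up and covers:
+            hits.append(i)
+    return hits
+```
+
+Labels produced this way are only as good as the rule, so it is worth comparing them against your human annotations before trusting them as training data.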
+
+### Active Learning Loop
+
+1. Get predictions on new data
+2. Review disagreements (model missed patterns you saw, or vice versa)
+3. Correct predictions by adding new annotations
+4. Re-export annotations
+5. Retrain model
+6. Repeat
+
+The model improves iteratively through this feedback cycle.
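+
+Step 2 of the loop, reviewing disagreements, can be sketched as a span comparison. The code below treats a human span and a predicted span as agreeing when they overlap in time and share a label. That matching rule, the span shape and the function names are assumptions for illustration, not how the Prediction Summary is computed.
+
+```python
+def overlaps(a, b):
+    """True if two spans with start_time/end_time share at least one instant."""
+    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]
+
+
+def find_disagreements(human, predicted):
+    """Split spans into those only the human marked and those only the model marked.
+
+    Two spans agree when they overlap in time and carry the same label.
+    """
+    def matched(span, others):
+        return any(overlaps(span, o) and span["label"] == o["label"] for o in others)
+
+    missed_by_model = [h for h in human if not matched(h, predicted)]
+    extra_from_model = [p for p in predicted if not matched(p, human)]
+    return missed_by_model, extra_from_model
+```
+
+The first list points at patterns the model has not learned yet. The second list is worth reviewing by hand, since it mixes false positives with real patterns you may have overlooked.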
+
+## Quick Reference
+
+```bash
+# Export annotations
+curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
+
+# Export candle data
+docker-compose exec candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
+
+# Train model
+docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
+
+# Restart inference service
+docker-compose restart ml-service
+
+# Check model loaded
+curl http://localhost:8001/model/info
+
+# View MLflow UI (`open` is macOS-only; on Linux use xdg-open)
+open http://localhost:5000
+```