From 228f70daf361e73fc81f3cd6495d48f66d23e83d Mon Sep 17 00:00:00 2001
From: Marko Djordjevic
Date: Sun, 15 Feb 2026 19:08:09 +0100
Subject: [PATCH] docs: add ML pipeline quickstart guide for training first model

Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop
---
 ML_QUICKSTART.md | 299 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 ML_QUICKSTART.md

diff --git a/ML_QUICKSTART.md b/ML_QUICKSTART.md
new file mode 100644
index 0000000..e2a355a
--- /dev/null
+++ b/ML_QUICKSTART.md
@@ -0,0 +1,299 @@
+# ML Pipeline Quickstart Guide
+
+This guide walks you through training your first model from scratch.
+
+## Prerequisites
+
+Make sure all services are running:
+
+```bash
+docker-compose up -d
+```
+
+This starts:
+- **candle-annotator** (Next.js app) - http://localhost:3000
+- **ml-service** (FastAPI) - http://localhost:8001
+- **mlflow** (MLflow UI) - http://localhost:5000
+- **postgres** (Database)
+
+## Step 1: Create Training Data (Annotate Patterns)
+
+You need at least 20-30 annotated patterns for initial training.
+
+### 1.1 Upload Candle Data
+
+1. Open http://localhost:3000
+2. Click "Choose CSV File" and upload your OHLCV data
+3. Verify the chart displays correctly
+
+### 1.2 Annotate Patterns
+
+1. Click on "Manage Span Label Types" in the sidebar
+2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
+3. Return to main page
+4. Select a label type from the span tools
+5. Click and drag on the chart to create span annotations
+6. Annotate at least 20-30 patterns (more is better)
+
+**Tip:** Aim for diverse patterns across different market conditions.
+
+## Step 2: Export Annotations
+
+### 2.1 Export via API
+
+```bash
+# Export span annotations in ML pipeline format
+curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
+```
+
+### 2.2 Verify Export
+
+```bash
+# Check the exported file
+jq '.' services/ml/data/annotations/export.json
+```
+
+You should see JSON with your annotations:
+```json
+{
+  "annotations": [
+    {
+      "start_time": 1700000000,
+      "end_time": 1700003600,
+      "label": "Bullish Engulfing",
+      "confidence": 1.0,
+      "source": "human"
+    }
+  ]
+}
+```
+
+## Step 3: Prepare Raw OHLCV Data
+
+Copy your candle data to the ML pipeline:
+
+```bash
+# Export candles from the database
+# -T disables pseudo-TTY allocation so the redirected CSV contains no stray carriage returns
+docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
+```
+
+Or manually copy your CSV file:
+```bash
+cp your_data.csv services/ml/data/raw/OHLCV.csv
+```
+
+**Format required:**
+```csv
+time,open,high,low,close,volume
+1700000000,1.0500,1.0520,1.0490,1.0510,1000
+1700000060,1.0510,1.0530,1.0505,1.0525,1200
+```
+
+## Step 4: Run the ML Pipeline
+
+### 4.1 Enter ML Service Container
+
+```bash
+docker-compose exec ml-service bash
+```
+
+### 4.2 Run Full Pipeline
+
+```bash
+# Run all stages: feature engineering → annotation ingestion → training
+python pipeline.py --config config/pipeline.yaml
+```
+
+This will:
+1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
+2. **Annotation Ingestion** - Convert annotations to labeled dataset
+3. **Training** - Train RandomForest model and log to MLflow
+
+**Expected output:**
+```
+[INFO] Starting pipeline...
+[INFO] Stage: feature_engineering
+[INFO] Computing TA-Lib indicators...
+[INFO] Computed 42 features
+[INFO] Stage: annotation_ingestion
+[INFO] Loaded 25 human annotations
+[INFO] Created 25 training samples
+[INFO] Stage: training
+[INFO] Training RandomForest with 200 estimators
+[INFO] Training complete. F1 macro: 0.78
+[INFO] Model saved to models/best_model.pkl
+```
+
+### 4.3 Verify Model Created
+
+```bash
+ls -lh models/best_model.pkl
+```
+
+## Step 5: Configure Inference Service
+
+### 5.1 Check Model Path
+
+The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).
+
+### 5.2 Restart Inference Service
+
+Exit the container and restart the ml-service:
+
+```bash
+exit
+docker-compose restart ml-service
+```
+
+### 5.3 Verify Model Loaded
+
+```bash
+# Check health
+curl http://localhost:8001/health
+
+# Get model info
+curl http://localhost:8001/model/info | jq '.'
+```
+
+You should see:
+```json
+{
+  "model_info": {
+    "model_name": "candlestick_pattern_v1",
+    "model_version": "...",
+    "model_type": "RandomForest",
+    "trained_at": "...",
+    "feature_count": 42,
+    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
+  },
+  "metrics": {
+    "accuracy": 0.85,
+    "f1_macro": 0.78,
+    "per_class": {...}
+  }
+}
+```
+
+## Step 6: Get Predictions in UI
+
+1. Open http://localhost:3000
+2. Scroll down in the left sidebar to the **Predictions** section
+3. Toggle "Show" to enable predictions
+4. Click "Run on Visible" to predict patterns on visible candles
+5. Predictions appear as colored histogram overlays on the chart
+
+### Prediction Controls
+
+- **Confidence Threshold** - Slider to filter low-confidence predictions
+- **Filter by Label** - Checkboxes to show/hide specific patterns
+- **Prediction Summary** - Shows agreement/disagreement with human annotations
+
+## Troubleshooting
+
+### "No module named 'talib'"
+
+TA-Lib C library not installed in container.
+Rebuild:
+
+```bash
+docker-compose build --no-cache ml-service
+docker-compose up -d ml-service
+```
+
+### "Model file not found"
+
+The pipeline didn't create the model. Check the logs:
+
+```bash
+docker-compose logs ml-service
+```
+
+Make sure you have enough annotations (Step 1 recommends at least 20-30).
+
+### "Not enough data for training"
+
+You need more annotated spans. Go back to Step 1 and add more annotations.
+
+### "MLflow connection refused"
+
+The MLflow service is not running. Start it, then restart the ml-service:
+
+```bash
+docker-compose up -d mlflow
+docker-compose restart ml-service
+```
+
+### Predictions are all wrong
+
+The model needs more diverse training data. Add annotations for:
+- Different pattern types
+- Various market conditions (uptrend, downtrend, sideways)
+- Different timeframes
+
+Then retrain:
+
+```bash
+docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
+docker-compose restart ml-service
+```
+
+## View Training Experiments
+
+Open MLflow UI at http://localhost:5000 to:
+- View all training runs
+- Compare model metrics
+- Download artifacts (confusion matrix, feature importance)
+- Register models
+
+## Next Steps
+
+### Improve Model Performance
+
+1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
+2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
+3. **Try XGBoost** - Change `model_type: "xgboost"` in config
+4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`
+
+### Use Programmatic Labels
+
+To auto-generate labels with TA-Lib pattern detection, edit `config/pipeline.yaml`:
+```yaml
+programmatic_labels:
+  enabled: true  # Set to true
+```
+
+This adds labels from TA-Lib CDL* functions alongside your human annotations.
+
+### Active Learning Loop
+
+1. Get predictions on new data
+2. Review disagreements (model missed patterns you saw, or vice versa)
+3. Correct predictions by adding new annotations
+4. Re-export annotations
+5. Retrain model
+6. 
+   Repeat
+
+The model improves iteratively through this feedback cycle.
+
+## Quick Reference
+
+```bash
+# Export annotations
+curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
+
+# Export candle data (-T disables pseudo-TTY allocation so the redirected CSV stays clean)
+docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
+
+# Train model
+docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
+
+# Restart inference service
+docker-compose restart ml-service
+
+# Check model loaded
+curl http://localhost:8001/model/info
+
+# View MLflow UI (`open` is macOS-specific; on Linux use `xdg-open`)
+open http://localhost:5000
+```
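
A pre-flight check of the two input files from Steps 2 and 3 can catch format problems before the pipeline runs. The sketch below is not part of the repository: `check_annotations`, `check_ohlcv`, and `check_files` are hypothetical helpers, the required keys and column order are assumed from the sample JSON and CSV shown in the guide, and the minimum of 20 annotations is taken from the lower bound suggested in Step 1.

```python
import csv
import io
import json
from pathlib import Path

# Assumption: column order copied from the "Format required" example in Step 3.
EXPECTED_COLUMNS = ["time", "open", "high", "low", "close", "volume"]
# Assumption: keys copied from the sample export in Step 2.2.
REQUIRED_KEYS = {"start_time", "end_time", "label"}
# Assumption: lower bound of the "20-30 annotated patterns" suggested in Step 1.
MIN_ANNOTATIONS = 20


def check_annotations(export: dict, min_count: int = MIN_ANNOTATIONS) -> list[str]:
    """Return human-readable problems found in a parsed annotation export."""
    annotations = export.get("annotations")
    if not isinstance(annotations, list):
        return ["top-level 'annotations' list is missing"]
    problems = []
    if len(annotations) < min_count:
        problems.append(f"only {len(annotations)} annotations, expected at least {min_count}")
    for i, ann in enumerate(annotations):
        if not isinstance(ann, dict):
            problems.append(f"annotation {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - set(ann)
        if missing:
            problems.append(f"annotation {i}: missing {sorted(missing)}")
        elif ann["start_time"] > ann["end_time"]:
            problems.append(f"annotation {i}: start_time is after end_time")
    return problems


def check_ohlcv(csv_text: str) -> list[str]:
    """Return human-readable problems found in OHLCV CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    previous_time = None
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            t = float(row["time"])
            o, h, l, c = (float(row[k]) for k in ("open", "high", "low", "close"))
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: missing or non-numeric value")
            continue
        # The high must bound the candle from above, the low from below.
        if h < max(o, c, l) or l > min(o, c):
            problems.append(f"line {line_no}: high/low do not bound open/close")
        # The export query orders by time, so timestamps should strictly increase.
        if previous_time is not None and t <= previous_time:
            problems.append(f"line {line_no}: time is not strictly increasing")
        previous_time = t
    return problems


def check_files(annotations_path: str, ohlcv_path: str) -> list[str]:
    """Load both files from disk and return all problems found."""
    export = json.loads(Path(annotations_path).read_text(encoding="utf-8"))
    csv_text = Path(ohlcv_path).read_text(encoding="utf-8")
    return check_annotations(export) + check_ohlcv(csv_text)
```

For example, `check_files("services/ml/data/annotations/export.json", "services/ml/data/raw/OHLCV.csv")` returns an empty list when both files match the assumed formats, and a list of messages otherwise.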