# ML Pipeline Quickstart Guide

This guide walks you through training your first model from scratch.

## Prerequisites

Make sure all services are running:

```bash
docker-compose up -d
```

This starts:
- **candle-annotator** (Next.js app) - http://localhost:3000
- **ml-service** (FastAPI) - http://localhost:8001
- **mlflow** (MLflow UI) - http://localhost:5000
- **postgres** (Database)

## Step 1: Create Training Data (Annotate Patterns)

You need at least 20-30 annotated patterns for initial training.

### 1.1 Upload Candle Data

1. Open http://localhost:3000
2. Click "Choose CSV File" and upload your OHLCV data
3. Verify the chart displays correctly

### 1.2 Annotate Patterns

1. Click on "Manage Span Label Types" in the sidebar
2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
3. Return to main page
4. Select a label type from the span tools
5. Click and drag on the chart to create span annotations
6. Annotate at least 20-30 patterns (more is better)

**Tip:** Aim for diverse patterns across different market conditions.

## Step 2: Export Annotations

### 2.1 Export via API

```bash
# Export span annotations in ML pipeline format
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
```

### 2.2 Verify Export

```bash
# Check the exported file
cat services/ml/data/annotations/export.json | jq '.'
```

You should see JSON with your annotations:

```json
{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
```

## Step 3: Prepare Raw OHLCV Data

Copy your candle data to the ML pipeline:

```bash
# Export candles from the database
# -T disables pseudo-TTY allocation so the redirected CSV is written unaltered
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
```

Or manually copy your CSV file:

```bash
cp your_data.csv services/ml/data/raw/OHLCV.csv
```

**Format required:**

```csv
time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
```

## Step 4: Run the ML Pipeline

### 4.1 Enter ML Service Container

```bash
docker-compose exec ml-service bash
```

### 4.2 Run Full Pipeline

```bash
# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml
```

This will:
1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
2. **Annotation Ingestion** - Convert annotations to labeled dataset
3. **Training** - Train RandomForest model and log to MLflow

**Expected output:**

```
[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
```

### 4.3 Verify Model Created

```bash
ls -lh models/best_model.pkl
```

## Step 5: Configure Inference Service

### 5.1 Check Model Path

The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).
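
Before restarting the service, it can be worth confirming that the model file is where the service expects it. The sketch below is a hypothetical helper, not part of the pipeline: the name `check_model_file` is an assumption, and the default path simply mirrors the one named in this guide. It only checks that the path is a non-empty regular file; it does not try to load the model, since that would need the training environment's libraries.

```python
from pathlib import Path


def check_model_file(path="models/best_model.pkl"):
    """Return (ok, message) describing whether a model file looks usable.

    Hypothetical helper: the default path is the one named in this guide;
    adjust it if config/pipeline.yaml points somewhere else.
    """
    model_path = Path(path)
    if not model_path.is_file():
        return False, f"missing: {model_path}"
    size = model_path.stat().st_size
    if size == 0:
        return False, f"empty file: {model_path}"
    return True, f"found {model_path} ({size} bytes)"


if __name__ == "__main__":
    # Run from the directory that contains models/ (inside the ml-service container).
    ok, message = check_model_file()
    print(message)
```

If the check reports a missing or empty file, rerun Step 4 before restarting the service.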
### 5.2 Restart Inference Service

Exit the container and restart the ml-service:

```bash
exit
docker-compose restart ml-service
```

### 5.3 Verify Model Loaded

```bash
# Check health
curl http://localhost:8001/health

# Get model info
curl http://localhost:8001/model/info | jq '.'
```

You should see:

```json
{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
```

## Step 6: Get Predictions in UI

1. Open http://localhost:3000
2. Scroll down in the left sidebar to the **Predictions** section
3. Toggle "Show" to enable predictions
4. Click "Run on Visible" to predict patterns on visible candles
5. Predictions appear as colored histogram overlays on the chart

### Prediction Controls

- **Confidence Threshold** - Slider to filter low-confidence predictions
- **Filter by Label** - Checkboxes to show/hide specific patterns
- **Prediction Summary** - Shows agreement/disagreement with human annotations

## Troubleshooting

### "No module named 'talib'"

The TA-Lib C library is not installed in the container. Rebuild:

```bash
docker-compose build --no-cache ml-service
docker-compose up -d ml-service
```

### "Model file not found"

The pipeline didn't create the model. Check the service logs:

```bash
docker-compose logs ml-service
```

Make sure you have enough annotations (10-20 at the very minimum; Step 1 recommends 20-30).

### "Not enough data for training"

You need more annotated spans. Go back to Step 1 and add more annotations.

### "MLflow connection refused"

The MLflow service is not running. Start it, then restart the ml-service:

```bash
docker-compose up -d mlflow
docker-compose restart ml-service
```

### Predictions are all wrong

The model needs more diverse training data.
Add annotations for:
- Different pattern types
- Various market conditions (uptrend, downtrend, sideways)
- Different timeframes

Then retrain:

```bash
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service
```

## View Training Experiments

Open MLflow UI at http://localhost:5000 to:
- View all training runs
- Compare model metrics
- Download artifacts (confusion matrix, feature importance)
- Register models

## Next Steps

### Improve Model Performance

1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
3. **Try XGBoost** - Change `model_type: "xgboost"` in config
4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`

### Use Programmatic Labels

To auto-generate labels with TA-Lib pattern detection, edit `config/pipeline.yaml`:

```yaml
programmatic_labels:
  enabled: true  # Set to true
```

This adds labels from TA-Lib CDL* functions alongside your human annotations.

### Active Learning Loop

1. Get predictions on new data
2. Review disagreements (model missed patterns you saw, or vice versa)
3. Correct predictions by adding new annotations
4. Re-export annotations
5. Retrain model
6. Repeat

The model improves iteratively through this feedback cycle.

## Quick Reference

```bash
# Export annotations
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

# Export candle data (-T disables pseudo-TTY allocation so the redirected CSV is written unaltered)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml

# Restart inference service
docker-compose restart ml-service

# Check model loaded
curl http://localhost:8001/model/info

# View MLflow UI (macOS; on Linux use xdg-open)
open http://localhost:5000
```
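
The input formats shown in Steps 2 and 3 can also be checked from Python before training. The following is a minimal sketch, not part of the pipeline: the function names `summarize_annotations` and `check_ohlcv_header` are hypothetical, and the set of required annotation keys is an assumption based on the example export in Step 2. It counts annotations per label, which makes it easy to spot underrepresented patterns, and confirms that the CSV header matches the required column order.

```python
import csv
import io
import json
from collections import Counter

# Column order taken from the "Format required" example in Step 3.
REQUIRED_COLUMNS = ["time", "open", "high", "low", "close", "volume"]
# Assumed minimum keys per annotation, based on the Step 2 example export.
REQUIRED_KEYS = {"start_time", "end_time", "label"}


def summarize_annotations(export_text):
    """Count annotations per label; raise ValueError on malformed entries."""
    data = json.loads(export_text)
    counts = Counter()
    for i, ann in enumerate(data.get("annotations", [])):
        missing = REQUIRED_KEYS - ann.keys()
        if missing:
            raise ValueError(f"annotation {i} is missing {sorted(missing)}")
        if ann["start_time"] > ann["end_time"]:
            raise ValueError(f"annotation {i} ends before it starts")
        counts[ann["label"]] += 1
    return counts


def check_ohlcv_header(csv_text):
    """Return True if the first CSV row matches the required column order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [col.strip() for col in header] == REQUIRED_COLUMNS


if __name__ == "__main__":
    # Sample inputs copied from the examples in this guide.
    sample_export = json.dumps({
        "annotations": [
            {"start_time": 1700000000, "end_time": 1700003600,
             "label": "Bullish Engulfing", "confidence": 1.0, "source": "human"}
        ]
    })
    sample_csv = (
        "time,open,high,low,close,volume\n"
        "1700000000,1.0500,1.0520,1.0490,1.0510,1000\n"
    )
    print(dict(summarize_annotations(sample_export)))
    print(check_ohlcv_header(sample_csv))
```

To check real files, read `services/ml/data/annotations/export.json` and `services/ml/data/raw/OHLCV.csv` as text and pass their contents to the two functions.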