docs: add ML pipeline quickstart guide for training first model
Complete step-by-step guide covering:
- Creating training data through annotations
- Exporting annotations and candle data
- Running the full ML pipeline
- Verifying model creation and loading
- Getting predictions in the UI
- Troubleshooting common issues
- Iteration through active learning loop
parent 21f184aa8d
commit 228f70daf3
1 changed files with 299 additions and 0 deletions
299
ML_QUICKSTART.md
Normal file
@ -0,0 +1,299 @@
# ML Pipeline Quickstart Guide

This guide walks you through training your first model from scratch.

## Prerequisites

Make sure all services are running:

```bash
docker-compose up -d
```

This starts:
- **candle-annotator** (Next.js app) - http://localhost:3000
- **ml-service** (FastAPI) - http://localhost:8001
- **mlflow** (MLflow UI) - http://localhost:5000
- **postgres** (Database)

## Step 1: Create Training Data (Annotate Patterns)

You need at least 20-30 annotated patterns for initial training.

### 1.1 Upload Candle Data

1. Open http://localhost:3000
2. Click "Choose CSV File" and upload your OHLCV data
3. Verify the chart displays correctly

### 1.2 Annotate Patterns

1. Click "Manage Span Label Types" in the sidebar
2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
3. Return to the main page
4. Select a label type from the span tools
5. Click and drag on the chart to create span annotations
6. Annotate at least 20-30 patterns (more is better)

**Tip:** Aim for diverse patterns across different market conditions.

## Step 2: Export Annotations

### 2.1 Export via API

```bash
# Export span annotations in ML pipeline format
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
```

### 2.2 Verify Export

```bash
# Pretty-print the exported file
jq '.' services/ml/data/annotations/export.json
```

You should see JSON with your annotations:
```json
{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
```
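
Before moving on, it can help to sanity-check the export programmatically. The sketch below assumes the export has exactly the shape shown above (a top-level `annotations` list whose items carry `start_time`, `end_time`, and `label`). The helper name `summarize_annotations` is hypothetical and not part of the project.

```python
import json
from collections import Counter


def summarize_annotations(export: dict) -> Counter:
    """Count annotations per label and reject spans with non-increasing times.

    Assumes the export shape shown in this guide; adjust the keys if the
    real export differs.
    """
    counts = Counter()
    for i, ann in enumerate(export["annotations"]):
        if ann["start_time"] >= ann["end_time"]:
            raise ValueError(f"annotation {i}: start_time must be before end_time")
        counts[ann["label"]] += 1
    return counts


if __name__ == "__main__":
    # Inline sample mirroring the JSON above. For the real file, load
    # services/ml/data/annotations/export.json with json.load instead.
    sample = json.loads("""
    {"annotations": [
      {"start_time": 1700000000, "end_time": 1700003600, "label": "Bullish Engulfing"},
      {"start_time": 1700007200, "end_time": 1700010800, "label": "Doji"},
      {"start_time": 1700014400, "end_time": 1700018000, "label": "Doji"}
    ]}
    """)
    for label, n in summarize_annotations(sample).most_common():
        print(f"{label}: {n}")
```

A per-label count makes it easy to spot labels with only one or two examples before training.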

## Step 3: Prepare Raw OHLCV Data

Copy your candle data to the ML pipeline:

```bash
# Export candles from the database
# -T disables pseudo-TTY allocation so the redirected CSV contains no carriage returns
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
```

Or manually copy your CSV file:
```bash
cp your_data.csv services/ml/data/raw/OHLCV.csv
```

**Format required:**
```csv
time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
```
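
A quick format check can catch a bad CSV before the pipeline does. This is a minimal sketch, assuming `time` is an integer Unix timestamp as in the sample rows and that rows are expected in strictly increasing time order. The function name `check_ohlcv` is hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ["time", "open", "high", "low", "close", "volume"]


def check_ohlcv(text: str) -> int:
    """Validate OHLCV CSV text and return the number of candle rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    previous_time = None
    rows = 0
    for line_no, row in enumerate(reader, start=2):
        if not row:  # skip blank lines
            continue
        time = int(row[0])
        open_, high, low, close = (float(v) for v in row[1:5])
        if previous_time is not None and time <= previous_time:
            raise ValueError(f"line {line_no}: time is not strictly increasing")
        if high < max(open_, close) or low > min(open_, close):
            raise ValueError(f"line {line_no}: high/low do not bracket open/close")
        previous_time = time
        rows += 1
    return rows


if __name__ == "__main__":
    sample = (
        "time,open,high,low,close,volume\n"
        "1700000000,1.0500,1.0520,1.0490,1.0510,1000\n"
        "1700000060,1.0510,1.0530,1.0505,1.0525,1200\n"
    )
    print(check_ohlcv(sample), "rows OK")
```

To check the real file, pass the contents of `services/ml/data/raw/OHLCV.csv` to the function.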

## Step 4: Run the ML Pipeline

### 4.1 Enter ML Service Container

```bash
docker-compose exec ml-service bash
```

### 4.2 Run Full Pipeline

```bash
# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml
```

This will:
1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
2. **Annotation Ingestion** - Convert annotations to labeled dataset
3. **Training** - Train RandomForest model and log to MLflow

**Expected output:**
```
[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
```
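
The log above reports one training sample per annotation, which suggests that each span is reduced to a single labeled row. How `pipeline.py` actually does this is not shown in this guide. As an illustration only, the sketch below maps each span to the range of candle indices it covers, which is a natural first step for such a reduction. The function `spans_to_samples` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def spans_to_samples(candle_times, annotations):
    """Map each annotation span to the index range of candles it covers.

    candle_times must be sorted ascending. Returns one
    (label, first_index, last_index) tuple per span that covers at least
    one candle; spans covering no candles are skipped.
    """
    samples = []
    for ann in annotations:
        first = bisect_left(candle_times, ann["start_time"])
        last = bisect_right(candle_times, ann["end_time"]) - 1
        if first <= last:
            samples.append((ann["label"], first, last))
    return samples


if __name__ == "__main__":
    times = [1700000000 + 60 * i for i in range(5)]  # five 1-minute candles
    spans = [{"start_time": 1700000060, "end_time": 1700000180, "label": "Doji"}]
    print(spans_to_samples(times, spans))
```

Features for the covered candles could then be aggregated (for example, taken from the last candle of the span) to form one row per annotation. That aggregation choice is an assumption, not something the guide specifies.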

### 4.3 Verify Model Created

```bash
ls -lh models/best_model.pkl
```

## Step 5: Configure Inference Service

### 5.1 Check Model Path

The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).

### 5.2 Restart Inference Service

Exit the container and restart the ml-service:

```bash
exit
docker-compose restart ml-service
```

### 5.3 Verify Model Loaded

```bash
# Check health
curl http://localhost:8001/health

# Get model info
curl http://localhost:8001/model/info | jq '.'
```

You should see:
```json
{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
```
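
One useful check at this point is whether the loaded model knows every label you annotated. The sketch below assumes the `/model/info` response has the shape shown above. The helper names `fetch_model_info` and `missing_labels` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_model_info(url="http://localhost:8001/model/info"):
    """Fetch the model info document from the running ml-service."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)


def missing_labels(model_info, annotation_labels):
    """Return annotation labels the loaded model does not know, sorted."""
    known = set(model_info["model_info"]["label_names"])
    return sorted(set(annotation_labels) - known)


if __name__ == "__main__":
    # Offline example using the response shape from this guide.
    info = {"model_info": {"label_names": ["Bullish Engulfing", "Doji", "Hammer"]}}
    print(missing_labels(info, ["Doji", "Shooting Star"]))
```

With the services running, `missing_labels(fetch_model_info(), labels)` would compare against the live model. A non-empty result suggests the model was trained before those labels were exported.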

## Step 6: Get Predictions in UI

1. Open http://localhost:3000
2. Scroll down in the left sidebar to the **Predictions** section
3. Toggle "Show" to enable predictions
4. Click "Run on Visible" to predict patterns on visible candles
5. Predictions appear as colored histogram overlays on the chart

### Prediction Controls

- **Confidence Threshold** - Slider to filter low-confidence predictions
- **Filter by Label** - Checkboxes to show/hide specific patterns
- **Prediction Summary** - Shows agreement/disagreement with human annotations
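
The first two controls amount to a simple filter over the prediction list. The sketch below shows that logic under the assumption that each prediction carries a `label` and a `confidence` between 0 and 1. The prediction shape and the function name `filter_predictions` are hypothetical, not the UI's actual code.

```python
def filter_predictions(predictions, threshold, visible_labels=None):
    """Keep predictions at or above threshold whose label is visible.

    visible_labels=None means every label is shown.
    """
    kept = []
    for pred in predictions:
        if pred["confidence"] < threshold:
            continue
        if visible_labels is not None and pred["label"] not in visible_labels:
            continue
        kept.append(pred)
    return kept


if __name__ == "__main__":
    preds = [
        {"label": "Doji", "confidence": 0.9},
        {"label": "Hammer", "confidence": 0.4},
        {"label": "Hammer", "confidence": 0.8},
    ]
    print(filter_predictions(preds, threshold=0.5, visible_labels={"Hammer"}))
```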

## Troubleshooting

### "No module named 'talib'"

The TA-Lib C library is not installed in the container. Rebuild the image:

```bash
docker-compose build --no-cache ml-service
docker-compose up -d ml-service
```

### "Model file not found"

The pipeline did not create the model. Check the service logs:

```bash
docker-compose logs ml-service
```

Make sure you have enough annotations (10-20 at minimum; Step 1 recommends 20-30).

### "Not enough data for training"

You need more annotated spans. Go back to Step 1 and add more annotations.

### "MLflow connection refused"

The MLflow service is not running. Start it, then restart ml-service:

```bash
docker-compose up -d mlflow
docker-compose restart ml-service
```

### Predictions are all wrong

The model needs more diverse training data. Add annotations for:
- Different pattern types
- Various market conditions (uptrend, downtrend, sideways)
- Different timeframes

Then retrain:

```bash
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service
```

## View Training Experiments

Open the MLflow UI at http://localhost:5000 to:
- View all training runs
- Compare model metrics
- Download artifacts (confusion matrix, feature importance)
- Register models

## Next Steps

### Improve Model Performance

1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
3. **Try XGBoost** - Set `model_type: "xgboost"` in the config
4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`

### Use Programmatic Labels

Enable TA-Lib pattern detection to auto-generate labels by editing `config/pipeline.yaml`:
```yaml
programmatic_labels:
  enabled: true  # Set to true
```

This adds labels from TA-Lib CDL* functions alongside your human annotations.
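
To make the idea of a programmatic label concrete, here is a simplified textbook rule for a bullish engulfing pattern written in plain Python. It is not TA-Lib's implementation (the CDL* functions apply their own body-size and trend criteria), and the output shape, including the `"programmatic"` source value, is an assumption modeled on the export format shown in Step 2.

```python
def bullish_engulfing_labels(candles):
    """Label two-candle spans where a bullish body engulfs a bearish one.

    candles is a time-ordered list of dicts with time, open, and close.
    Simplified rule for illustration; not TA-Lib's exact logic.
    """
    labels = []
    for prev, cur in zip(candles, candles[1:]):
        prev_bearish = prev["close"] < prev["open"]
        cur_bullish = cur["close"] > cur["open"]
        engulfs = cur["open"] <= prev["close"] and cur["close"] >= prev["open"]
        if prev_bearish and cur_bullish and engulfs:
            labels.append({
                "start_time": prev["time"],
                "end_time": cur["time"],
                "label": "Bullish Engulfing",
                "source": "programmatic",  # hypothetical source value
            })
    return labels


if __name__ == "__main__":
    candles = [
        {"time": 0, "open": 10.0, "close": 9.0},    # bearish
        {"time": 60, "open": 8.5, "close": 10.5},   # bullish, engulfs previous body
        {"time": 120, "open": 10.5, "close": 11.0},
    ]
    print(bullish_engulfing_labels(candles))
```

Rule-based labels like these are noisy, so it is worth reviewing a sample of them against the chart before relying on them for training.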

### Active Learning Loop

1. Get predictions on new data
2. Review disagreements (model missed patterns you saw, or vice versa)
3. Correct predictions by adding new annotations
4. Re-export annotations
5. Retrain model
6. Repeat

The model improves iteratively through this feedback cycle.
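
The review in step 2 can be sketched as a span-overlap comparison. The code below assumes human annotations and predictions both carry `start_time`, `end_time`, and `label`, and treats any time overlap as a match. Both the prediction shape and the matching rule are assumptions, and `find_disagreements` is a hypothetical helper.

```python
def overlaps(a, b):
    """True if the two closed time spans share at least one instant."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def find_disagreements(human, predicted):
    """Return (missed, extra, mislabeled) span lists.

    missed:     human spans with no overlapping prediction
    extra:      predictions that overlap no human span
    mislabeled: human spans whose overlapping predictions all disagree on label
    """
    missed, mislabeled = [], []
    matched = set()
    for h in human:
        hits = [i for i, p in enumerate(predicted) if overlaps(h, p)]
        matched.update(hits)
        if not hits:
            missed.append(h)
        elif all(predicted[i]["label"] != h["label"] for i in hits):
            mislabeled.append(h)
    extra = [p for i, p in enumerate(predicted) if i not in matched]
    return missed, extra, mislabeled


if __name__ == "__main__":
    human = [{"start_time": 0, "end_time": 10, "label": "Doji"}]
    predicted = [{"start_time": 60, "end_time": 70, "label": "Hammer"}]
    print(find_disagreements(human, predicted))
```

The `extra` list is the most useful one to review: each entry is either a false positive or a pattern you have not annotated yet.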

## Quick Reference

```bash
# Export annotations
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

# Export candle data (-T disables pseudo-TTY allocation for clean redirected output)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml

# Restart inference service
docker-compose restart ml-service

# Check model loaded
curl http://localhost:8001/model/info

# View MLflow UI (macOS; use xdg-open on Linux)
open http://localhost:5000
```