# ML Pipeline Quickstart Guide

This guide walks you through training your first model from scratch.

## Prerequisites

Make sure all services are running:

```bash
docker-compose up -d
```

This starts:
- **candle-annotator** (Next.js app) - http://localhost:3000
- **ml-service** (FastAPI) - http://localhost:8001
- **mlflow** (MLflow UI) - http://localhost:5000
- **postgres** (Database)

## Step 1: Create Training Data (Annotate Patterns)

You need at least 20-30 annotated patterns for initial training.

### 1.1 Upload Candle Data

1. Open http://localhost:3000
2. Click "Choose CSV File" and upload your OHLCV data
3. Verify the chart displays correctly

### 1.2 Annotate Patterns

1. Click on "Manage Span Label Types" in the sidebar
2. Create pattern labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
3. Return to main page
4. Select a label type from the span tools
5. Click and drag on the chart to create span annotations
6. Annotate at least 20-30 patterns (more is better)

**Tip:** Aim for diverse patterns across different market conditions.

## Step 2: Export Annotations

### 2.1 Export via API

```bash
# Export span annotations in ML pipeline format
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json
```

### 2.2 Verify Export

```bash
# Pretty-print the exported file to confirm it is valid JSON
jq '.' services/ml/data/annotations/export.json
```

You should see JSON with your annotations:
```json
{
  "annotations": [
    {
      "start_time": 1700000000,
      "end_time": 1700003600,
      "label": "Bullish Engulfing",
      "confidence": 1.0,
      "source": "human"
    }
  ]
}
```
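
Before training, it can help to count annotations per label and catch spans whose times are reversed. The following is a minimal sketch that assumes the export has exactly the shape shown above; `summarize_annotations` is a hypothetical helper, not part of the pipeline.

```python
import json
from collections import Counter


def summarize_annotations(export):
    """Count annotations per label and collect spans whose times are reversed."""
    annotations = export.get("annotations", [])
    label_counts = Counter(a["label"] for a in annotations)
    bad_spans = [a for a in annotations if a["start_time"] > a["end_time"]]
    return label_counts, bad_spans


# Same shape as the sample export above. In practice, load the real file from
# services/ml/data/annotations/export.json with json.load().
export = json.loads("""
{
  "annotations": [
    {"start_time": 1700000000, "end_time": 1700003600,
     "label": "Bullish Engulfing", "confidence": 1.0, "source": "human"}
  ]
}
""")

label_counts, bad_spans = summarize_annotations(export)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
print(f"total: {sum(label_counts.values())}, invalid spans: {len(bad_spans)}")
```

If the total is below the 20-30 patterns recommended in Step 1, or one label dominates the counts, consider annotating more before running the pipeline.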

## Step 3: Prepare Raw OHLCV Data

Copy your candle data to the ML pipeline:

```bash
# Export candles from the database
# (-T disables pseudo-TTY allocation so the redirected CSV keeps plain line endings)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv
```

Or manually copy your CSV file:
```bash
cp your_data.csv services/ml/data/raw/OHLCV.csv
```

**Format required:**
```csv
time,open,high,low,close,volume
1700000000,1.0500,1.0520,1.0490,1.0510,1000
1700000060,1.0510,1.0530,1.0505,1.0525,1200
```
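
A quick sanity check on the CSV can save a failed pipeline run. The sketch below checks the header against the required format and flags rows with non-numeric values. The extra checks (integer `time`, strictly increasing timestamps, high/low bounding open/close) are assumptions based on the sample rows, not documented pipeline requirements, and `validate_ohlcv` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ["time", "open", "high", "low", "close", "volume"]


def validate_ohlcv(lines):
    """Return a list of problems found in OHLCV CSV text; empty means it looks fine.

    `lines` is any iterable of CSV lines, for example an open file object.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [f"header is {reader.fieldnames}, expected {REQUIRED_COLUMNS}"]

    problems = []
    previous_time = None
    for row_number, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            time = int(row["time"])
            open_, high, low, close = (
                float(row[key]) for key in ("open", "high", "low", "close")
            )
            float(row["volume"])
        except (TypeError, ValueError):
            problems.append(f"line {row_number}: non-numeric or missing value")
            continue
        # Assumed checks: the guide does not state these as requirements.
        if high < max(open_, low, close) or low > min(open_, high, close):
            problems.append(f"line {row_number}: high/low do not bound open/close")
        if previous_time is not None and time <= previous_time:
            problems.append(f"line {row_number}: time is not strictly increasing")
        previous_time = time
    return problems


sample = io.StringIO(
    "time,open,high,low,close,volume\n"
    "1700000000,1.0500,1.0520,1.0490,1.0510,1000\n"
    "1700000060,1.0510,1.0530,1.0505,1.0525,1200\n"
)
print(validate_ohlcv(sample))  # prints [] for the two sample rows above
```

To check the real file, pass an open file handle for `services/ml/data/raw/OHLCV.csv` instead of the in-memory sample.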

## Step 4: Run the ML Pipeline

### 4.1 Enter ML Service Container

```bash
docker-compose exec ml-service bash
```

### 4.2 Run Full Pipeline

```bash
# Run all stages: feature engineering → annotation ingestion → training
python pipeline.py --config config/pipeline.yaml
```

This runs three stages:
1. **Feature Engineering** - Compute TA-Lib indicators (RSI, MACD, etc.)
2. **Annotation Ingestion** - Convert annotations to labeled dataset
3. **Training** - Train RandomForest model and log to MLflow

**Expected output** (counts and scores will vary with your data):
```
[INFO] Starting pipeline...
[INFO] Stage: feature_engineering
[INFO] Computing TA-Lib indicators...
[INFO] Computed 42 features
[INFO] Stage: annotation_ingestion
[INFO] Loaded 25 human annotations
[INFO] Created 25 training samples
[INFO] Stage: training
[INFO] Training RandomForest with 200 estimators
[INFO] Training complete. F1 macro: 0.78
[INFO] Model saved to models/best_model.pkl
```
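
The `F1 macro` figure in the log is the macro-averaged F1 score: the F1 score is computed separately for each label and the results are averaged with equal weight, so rare patterns count as much as common ones. The pure-Python sketch below illustrates the metric only; the pipeline's own implementation is not shown in this guide, and `f1_macro` here is a hypothetical helper.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("need at least one sample")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero because
        # each label appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


y_true = ["Doji", "Doji", "Hammer", "Hammer"]
y_pred = ["Doji", "Hammer", "Hammer", "Hammer"]
print(round(f1_macro(y_true, y_pred), 3))  # prints 0.733
```

In this example "Doji" scores 2/3 and "Hammer" scores 0.8, so the macro average is about 0.733.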

### 4.3 Verify Model Created

```bash
ls -lh models/best_model.pkl
```

## Step 5: Configure Inference Service

### 5.1 Check Model Path

The inference service looks for the model at `models/best_model.pkl` by default (configured in `config/pipeline.yaml`).

### 5.2 Restart Inference Service

Exit the container and restart the ml-service:

```bash
exit
docker-compose restart ml-service
```

### 5.3 Verify Model Loaded

```bash
# Check health
curl http://localhost:8001/health

# Get model info
curl http://localhost:8001/model/info | jq '.'
```

You should see a response like this (your values will differ):
```json
{
  "model_info": {
    "model_name": "candlestick_pattern_v1",
    "model_version": "...",
    "model_type": "RandomForest",
    "trained_at": "...",
    "feature_count": 42,
    "label_names": ["Bullish Engulfing", "Doji", "Hammer"]
  },
  "metrics": {
    "accuracy": 0.85,
    "f1_macro": 0.78,
    "per_class": {...}
  }
}
```
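
The same check can be scripted, which is convenient after each retrain. This sketch assumes the `/model/info` response has the shape shown above; `fetch_json` and `describe_model` are hypothetical helpers, and the `fetch` parameter exists only so the summary logic can be exercised without a running service.

```python
import json
import urllib.request

ML_SERVICE_URL = "http://localhost:8001"


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def describe_model(fetch=fetch_json, base_url=ML_SERVICE_URL):
    """Return a one-line summary built from the /model/info response."""
    info = fetch(f"{base_url}/model/info")
    model = info["model_info"]
    metrics = info["metrics"]
    return (
        f"{model['model_name']} ({model['model_type']}, "
        f"{model['feature_count']} features, "
        f"{len(model['label_names'])} labels), "
        f"f1_macro={metrics['f1_macro']}"
    )
```

With the services running, `print(describe_model())` prints the summary. A connection error or a `KeyError` suggests the service is down or the model did not load.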

## Step 6: Get Predictions in UI

1. Open http://localhost:3000
2. Scroll down in the left sidebar to the **Predictions** section
3. Toggle "Show" to enable predictions
4. Click "Run on Visible" to predict patterns on visible candles
5. Predictions appear as colored histogram overlays on the chart

### Prediction Controls

- **Confidence Threshold** - Slider to filter low-confidence predictions
- **Filter by Label** - Checkboxes to show/hide specific patterns
- **Prediction Summary** - Shows agreement/disagreement with human annotations
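
Conceptually, the threshold slider and label checkboxes combine into a single filter over the prediction list. The sketch below illustrates that idea; the prediction record shape (`label`, `confidence`), the inclusive comparison, and the default threshold are assumptions, not the UI's actual code.

```python
def filter_predictions(predictions, min_confidence=0.5, visible_labels=None):
    """Keep predictions at or above the threshold whose label is enabled.

    `visible_labels=None` means every label is shown.
    """
    return [
        p for p in predictions
        if p["confidence"] >= min_confidence
        and (visible_labels is None or p["label"] in visible_labels)
    ]


predictions = [
    {"label": "Doji", "confidence": 0.91},
    {"label": "Hammer", "confidence": 0.42},
    {"label": "Bullish Engulfing", "confidence": 0.77},
]
print(filter_predictions(predictions, min_confidence=0.7, visible_labels={"Doji", "Hammer"}))
# prints [{'label': 'Doji', 'confidence': 0.91}]
```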

## Troubleshooting

### "No module named 'talib'"

The TA-Lib C library is not installed in the container. Rebuild the image:

```bash
docker-compose build --no-cache ml-service
docker-compose up -d ml-service
```

### "Model file not found"

The pipeline did not produce a model file. Check the ml-service logs for errors:

```bash
docker-compose logs ml-service
```

Make sure you have enough annotations (Step 1 recommends at least 20-30).

### "Not enough data for training"

You need more annotated spans. Go back to Step 1 and add more annotations.

### "MLflow connection refused"

The MLflow service is not running. Start it, then restart ml-service:

```bash
docker-compose up -d mlflow
docker-compose restart ml-service
```

### Predictions are all wrong

The model most likely needs more diverse training data. Add annotations for:
- Different pattern types
- Various market conditions (uptrend, downtrend, sideways)
- Different timeframes

Then retrain:

```bash
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml
docker-compose restart ml-service
```

## View Training Experiments

Open MLflow UI at http://localhost:5000 to:
- View all training runs
- Compare model metrics
- Download artifacts (confusion matrix, feature importance)
- Register models

## Next Steps

### Improve Model Performance

1. **Add More Annotations** - Annotate 100+ patterns for better accuracy
2. **Tune Hyperparameters** - Edit `config/pipeline.yaml` and experiment
3. **Try XGBoost** - Change `model_type: "xgboost"` in config
4. **Add Custom Features** - Write custom feature functions in `features/custom_loader.py`

### Use Programmatic Labels

To auto-generate labels with TA-Lib pattern detection, enable programmatic labels in `config/pipeline.yaml`:
```yaml
programmatic_labels:
  enabled: true  # Set to true
```

This adds labels from TA-Lib CDL* functions alongside your human annotations.

### Active Learning Loop

1. Get predictions on new data
2. Review disagreements (model missed patterns you saw, or vice versa)
3. Correct predictions by adding new annotations
4. Re-export annotations
5. Retrain model
6. Repeat

The model improves iteratively through this feedback cycle.
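
Step 2 of the loop, reviewing disagreements, can be sketched as a comparison of human spans against predicted spans. This example assumes predictions carry the same `start_time`, `end_time`, and `label` fields as the exported annotations, and treats two spans as matching when they share a label and overlap in time. Both the matching rule and the `find_disagreements` helper are illustrative assumptions.

```python
def spans_overlap(a, b):
    """True when two spans share at least one instant (inclusive bounds)."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def find_disagreements(annotations, predictions):
    """Split spans into those the model missed and those it added.

    A span counts as matched when a span with the same label on the
    other side overlaps it in time.
    """
    def unmatched(spans, others):
        return [
            s for s in spans
            if not any(o["label"] == s["label"] and spans_overlap(s, o) for o in others)
        ]

    return {
        "missed_by_model": unmatched(annotations, predictions),
        "not_annotated": unmatched(predictions, annotations),
    }


annotations = [
    {"start_time": 100, "end_time": 200, "label": "Doji"},
    {"start_time": 300, "end_time": 400, "label": "Hammer"},
]
predictions = [
    {"start_time": 150, "end_time": 250, "label": "Doji"},
    {"start_time": 500, "end_time": 600, "label": "Hammer"},
]
result = find_disagreements(annotations, predictions)
print(len(result["missed_by_model"]), len(result["not_annotated"]))  # prints 1 1
```

Spans under `missed_by_model` are patterns you annotated that the model did not find. Spans under `not_annotated` are predictions to review and either annotate or reject.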

## Quick Reference

```bash
# Export annotations
curl http://localhost:3000/api/span-annotations/export > services/ml/data/annotations/export.json

# Export candle data (-T disables pseudo-TTY allocation for clean redirected output)
docker-compose exec -T candle-annotator sh -c "sqlite3 /app/data/candles.db -csv -header 'SELECT time, open, high, low, close, volume FROM candles ORDER BY time;'" > services/ml/data/raw/OHLCV.csv

# Train model
docker-compose exec ml-service python pipeline.py --config config/pipeline.yaml

# Restart inference service
docker-compose restart ml-service

# Check model loaded
curl http://localhost:8001/model/info

# View MLflow UI (macOS; on Linux use xdg-open)
open http://localhost:5000
```