archive: candle-backend change complete

2026-02-16 11:44:53 +01:00
commit 28e3f83cf7
parent 7e0579f65d
11 changed files with 836 additions and 0 deletions
--- a/openspec/changes/archive/2026-02-16-candle-backend/.openspec.yaml
+++ b/openspec/changes/archive/2026-02-16-candle-backend/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-02-15
--- a/openspec/changes/archive/2026-02-16-candle-backend/design.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/design.md
@@ -0,0 +1,133 @@
+## Context
+
+The Candle Annotator is a Next.js app with SQLite storage that lets users annotate candlestick charts with pattern labels. It currently has no ML capabilities — annotations are created manually and exported as CSV/JSON, but there's no way to train models or get predictions back into the UI.
+
+The existing stack is: Next.js 16 (App Router), React 19, lightweight-charts v4, SQLite via Drizzle ORM, Docker deployment. The app runs as a single container on port 3000.
+
+We need to add a Python ML service that sits alongside the Next.js app, connected via HTTP. The Python ecosystem (scikit-learn, XGBoost, TA-Lib, MLflow) is the right tool for this job: Node.js has no comparably mature equivalents for this stack.
+
+## Goals / Non-Goals
+
+**Goals:**
+
+- Stand up a Python FastAPI service at `services/ml/` that handles feature engineering, annotation ingestion, training, and inference
+- Use TA-Lib for programmatic candlestick pattern detection (CDL* functions)
+- Train tree-based models (RandomForest, XGBoost) with MLflow tracking
+- Serve predictions via REST API on port 8001
+- Proxy inference requests through Next.js API routes to avoid CORS
+- Render model predictions on the chart as a distinct visual layer
+- Version datasets with DVC
+
+**Non-Goals:**
+
+- Deep learning models (LSTM, GRU, transformer) — architecture should accommodate them later, but not implemented now
+- Multi-user or multi-tenant support
+- Real-time streaming predictions (batch/on-demand only)
+- Automated retraining pipelines or CI/CD for model deployment
+- GPU inference or training optimization
+
+## Decisions
+
+### 1. Separate Python service vs. embedded in Next.js
+
+**Decision**: Standalone Python FastAPI service in `services/ml/`, communicating via HTTP.
+
+**Alternatives considered**:
+- Python subprocess spawned by Next.js — fragile process management, no independent scaling
+- Python WASM in browser — TA-Lib and scikit-learn don't work in WASM
+- Shared SQLite access from Python — SQLite doesn't handle concurrent writers well
+
+**Rationale**: Clean separation of concerns. The Next.js app owns the UI and annotation data; the Python service owns ML. They communicate through well-defined REST APIs. Each can be developed, tested, and deployed independently.
+
+### 2. Directory structure: `services/ml/` in the monorepo
+
+**Decision**: Place the Python service at `services/ml/` within the existing repo.
+
+**Alternatives considered**:
+- Separate repository — adds overhead for a single-developer project
+- Top-level `ml/` directory — `services/` namespace leaves room for future services
+
+**Rationale**: Monorepo keeps everything together. The `services/` prefix signals it's a separate deployable unit, not part of the Next.js app.
+
+### 3. Pipeline config via YAML
+
+**Decision**: Single `config/pipeline.yaml` controls all pipeline stages (feature engineering, annotation ingestion, training, inference). Each stage has an `enabled` flag.
+
+**Rationale**: Makes experiments reproducible — the full config is logged as an MLflow artifact with each training run. Stages can be toggled independently (e.g., skip feature engineering, use only programmatic labels).
+
+### 4. MLflow for experiment tracking, DVC for data versioning
+
+**Decision**: MLflow tracks experiments, metrics, models. DVC versions datasets.
+
+**Alternatives considered**:
+- Weights & Biases — heavier, cloud-dependent
+- Plain file logging — loses queryability and model registry
+- Git LFS for data — doesn't handle dataset lineage
+
+**Rationale**: MLflow runs locally (no cloud dependency), provides a model registry, and has native integrations with scikit-learn and XGBoost. DVC handles data versioning without bloating the git repo.
+
+### 5. Annotation export format: JSON from existing API
+
+**Decision**: The Python pipeline reads annotation data by calling the existing Next.js API endpoints (`GET /api/annotations`, span annotation exports) or from exported JSON/CSV files in `data/annotations/`.
+
+**Alternatives considered**:
+- Direct SQLite read from Python — concurrent access issues
+- Shared PostgreSQL — overkill for single-user tool
+
+**Rationale**: Using the existing API or file exports keeps the services decoupled. The annotation tool already has export functionality. For training, batch export to `data/annotations/` is sufficient.
+
+### 6. Label encoding: windowed classification first, BIO later
+
+**Decision**: Start with fixed-window classification (each annotation span → one training sample of N candles). BIO sequence labeling is accounted for in the design but not implemented in v1.
+
+**Rationale**: Window classification works with tree-based models (RandomForest, XGBoost), which are the initial model types. BIO encoding is needed for sequence models (BiLSTM-CRF), which are a non-goal for now.
+
+### 7. Next.js proxy routes for inference
+
+**Decision**: Next.js API routes at `/api/predict`, `/api/predict/batch`, `/api/model/info` proxy to the Python service.
+
+**Rationale**: Avoids CORS configuration. Lets us add auth or rate-limiting on the Next.js side later. The frontend only talks to one origin.
+
+### 8. Prediction rendering: histogram series overlay
+
+**Decision**: Use a lightweight-charts histogram series to render predictions as colored bars behind candles. Each bar's color maps to a predicted pattern label.
+
+**Alternatives considered**:
+- Custom canvas plugin — more control but significantly more code
+- Series markers only — no area highlighting, just point markers
+
+**Rationale**: Histogram series is the simplest approach that gives visual area coverage. Can upgrade to a canvas plugin later for hatched/dashed styling. Markers are added for label text with confidence scores.
+
+### 9. Docker: multi-container with docker-compose
+
+**Decision**: Add an `ml-service` container to the existing docker-compose, an `mlflow` container for the tracking server, and a `postgres` container for ML service state. Shared volume for `data/`.
+
+```
+services:
+  candle-annotator:  # existing
+  ml-service:        # new - FastAPI on 8001
+  mlflow:            # new - tracking server on 5000
+  postgres:          # new - PostgreSQL for ML service state
+```
+
+**Rationale**: Each service has its own Dockerfile and dependencies. The shared `data/` volume lets both services access OHLCV and annotation files.
+
+## Risks / Trade-offs
+
+**[TA-Lib C library dependency]** → TA-Lib requires installing a system-level C library before the Python wrapper works. Mitigated by installing it in the Dockerfile (`apt-get install libta-lib-dev`) and providing clear setup instructions for local development.
+
+**[MLflow storage growth]** → MLflow artifacts (models, plots, configs) accumulate over time. Mitigated by using a local `mlruns/` directory with periodic manual cleanup. Not a concern at single-user scale.
+
+**[Preprocessing parity]** → Feature engineering during inference must exactly match training. If the pipeline config changes between training and inference, predictions are invalid. Mitigated by logging the full pipeline config as an MLflow artifact and loading it during inference to replicate preprocessing.
+
+**[Class imbalance]** → Pattern classes will be heavily imbalanced (mostly "no pattern"). Mitigated by using `class_weights: balanced` and tracking per-class precision/recall, not just accuracy.
+
+**[SQLite concurrent access]** → If both the Next.js app and Python service try to access the SQLite DB simultaneously, writes can fail. Mitigated by keeping Python read-only on annotation data (via API calls or file exports), never writing to the Next.js SQLite DB directly.
+
+**[Temporal data leakage]** → Random train/test splits on time series data leak future information. Mitigated by defaulting to temporal splits (the split strategy is configurable, but temporal is the default).
+
+## Resolved Questions
+
+- **Python service database**: PostgreSQL — the Python service uses its own Postgres instance for storing training run references, pipeline configs, and any service-specific state. Added to docker-compose.
+- **DVC remote storage**: Local backend — datasets versioned on the local filesystem, simplest setup for single-developer workflow.
+- **Prediction persistence**: Ephemeral — predictions are fetched on demand from the inference API, not persisted in any database. The frontend caches them in memory keyed by time range + model version.
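The temporal-split mitigation described under "Risks / Trade-offs" in design.md above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `temporal_split` and the default ratios are assumptions, and the rows are assumed to be sorted by time already. The ratios stand in for the `stages.training.test_split` and `stages.training.validation_split` config keys.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T],
    test_split: float = 0.15,        # hypothetical default
    validation_split: float = 0.15,  # hypothetical default
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered rows into (train, validation, test) without shuffling.

    The earliest rows become the training set, the middle rows the
    validation set, and the most recent rows the test set.
    """
    if test_split < 0 or validation_split < 0 or test_split + validation_split >= 1:
        raise ValueError("split ratios must be non-negative and sum to less than 1")

    n = len(rows)
    n_test = int(n * test_split)
    n_val = int(n * validation_split)
    n_train = n - n_val - n_test

    train = list(rows[:n_train])
    validation = list(rows[n_train:n_train + n_val])
    test = list(rows[n_train + n_val:])
    return train, validation, test
```

Because the slices never overlap and preserve order, every validation row is later than every training row, and every test row is later than every validation row.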
--- a/openspec/changes/archive/2026-02-16-candle-backend/proposal.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/proposal.md
@@ -0,0 +1,39 @@
+## Why
+
+The annotation tool currently creates labeled datasets but has no way to train models on them or get predictions back. Adding a Python ML backend closes the loop: annotations become training data, models produce predictions, and predictions guide further annotation — creating an active learning cycle for candlestick pattern recognition.
+
+## What Changes
+
+- Add a Python service (`services/ml/`) alongside the existing Next.js app, using FastAPI for the REST API
+- Implement TA-Lib-based candlestick pattern recognition to auto-generate annotations programmatically
+- Build a configurable ML training pipeline (feature engineering → annotation ingestion → training → evaluation) with MLflow tracking and DVC for data versioning
+- Support multiple model types: RandomForest and XGBoost initially, with architecture ready for LSTM/GRU and transformer-based models later
+- Serve trained models via a FastAPI inference API that accepts OHLCV candles and returns pattern predictions with confidence scores
+- Add Next.js API proxy routes (`/api/predict`, `/api/predict/batch`, `/api/model/info`) to connect the frontend to the Python backend
+- Add prediction visualization layer on the chart (distinct from human annotations) with confidence filtering and disagreement detection
+- Add a prediction controls panel for toggling predictions, filtering by label/confidence, and viewing per-class model metrics
+- Implement a feedback loop: users can confirm, correct, or dismiss model predictions as new annotations
+
+## Capabilities
+
+### New Capabilities
+
+- `feature-engineering`: TA-Lib indicator computation and candle feature extraction from raw OHLCV data, producing enriched datasets for training and inference
+- `annotation-ingestion`: Converting span annotations (human and programmatic) into labeled training datasets with BIO or windowed encoding, including TA-Lib CDL* pattern auto-labeling
+- `ml-training`: Configurable model training pipeline with temporal splits, class balancing, MLflow experiment tracking, artifact logging, and model registry integration
+- `ml-inference`: REST API serving trained models — accepts OHLCV candles, runs preprocessing, returns predictions with confidence scores and model metadata
+- `prediction-ui`: Frontend prediction layer with chart visualization, controls panel, confidence filtering, disagreement detection, and feedback loop for active learning
+
+### Modified Capabilities
+
+- `backend-api`: New proxy routes (`/api/predict`, `/api/predict/batch`, `/api/model/info`) added to forward requests to the Python inference service
+- `span-annotation`: Span export format consumed by the ML pipeline for training; prediction-confirmed spans can be saved as new annotations
+
+## Impact
+
+- **New dependencies**: Python 3.11+, FastAPI, uvicorn, scikit-learn, XGBoost, TA-Lib (C library + Python wrapper), MLflow, DVC, pandas, numpy, joblib
+- **New service**: Python FastAPI service running on port 8001, which needs to be added to docker-compose
+- **Data flow**: Annotation JSON/CSV exports feed into Python pipeline; inference results flow back to the frontend via Next.js proxy routes
+- **Infrastructure**: MLflow tracking server (port 5000), DVC remote storage for dataset versioning
+- **Existing code changes**: New API routes in Next.js, new React components for prediction panel, chart overlay modifications for prediction rendering
+- **Config**: Pipeline YAML config (`config/pipeline.yaml`) controls all ML stages; env vars for inference API URL and feature flags
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/annotation-ingestion/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/annotation-ingestion/spec.md
@@ -0,0 +1,85 @@
+## ADDED Requirements
+
+### Requirement: Load annotations from JSON export
+The system SHALL load annotation data from JSON files exported by the annotation tool, located at `data.annotations_path`. The expected format is a JSON object with an `annotations` array where each annotation has: `id`, `start_time`, `end_time`, `label`, `confidence` (nullable), `outcome` (nullable), and `sub_spans` (nullable).
+
+#### Scenario: Load valid annotations JSON
+- **WHEN** `data.annotations_path` points to a valid JSON file with annotations
+- **THEN** the system loads all annotation objects into memory for processing
+
+#### Scenario: Missing annotations file
+- **WHEN** `data.annotations_path` points to a file that does not exist and annotation ingestion is enabled
+- **THEN** the system SHALL fail with an error message identifying the missing file path
+
+#### Scenario: Filter by confidence
+- **WHEN** `stages.annotation_ingestion.min_confidence` is set to 3
+- **THEN** annotations with confidence below 3 SHALL be excluded from the labeled dataset
+
+### Requirement: Windowed classification encoding
+When `stages.annotation_ingestion.label_encoding` is "window", the system SHALL convert each annotation span into a fixed-size window of candles. The window size is defined by `stages.annotation_ingestion.window_size`. If the annotation span is shorter than window_size, the system SHALL pad with context candles (centered on the span). If the span is longer, the system SHALL use the full span. Each window becomes one row in the output with flattened OHLCV + feature columns.
+
+#### Scenario: Span shorter than window
+- **WHEN** an annotation spans 10 candles and window_size is 30
+- **THEN** the system extracts 30 candles centered on the annotation (10 before, 10 span, 10 after) and flattens them into a single row
+
+#### Scenario: Span longer than window
+- **WHEN** an annotation spans 50 candles and window_size is 30
+- **THEN** the system uses all 50 candles and flattens them into a single row
+
+#### Scenario: Span near dataset boundary
+- **WHEN** an annotation is near the start of the dataset and there aren't enough candles for padding
+- **THEN** the system SHALL pad with as many candles as available (no error), filling missing positions with NaN
+
+### Requirement: BIO sequence labeling encoding
+When `stages.annotation_ingestion.label_encoding` is "bio", the system SHALL assign a BIO tag to each candle in the dataset based on annotations. The first candle of an annotation span gets `B-{label}`, subsequent candles in the span get `I-{label}`, and candles outside any annotation get `O`.
+
+#### Scenario: Single annotation BIO tags
+- **WHEN** a "bull_flag" annotation spans candles at times T5 through T8
+- **THEN** candle T5 gets tag `B-bull_flag`, candles T6-T8 get `I-bull_flag`, all other candles get `O`
+
+#### Scenario: Overlapping annotations
+- **WHEN** two annotations overlap in time range
+- **THEN** the system SHALL create multiple tag columns (`bio_tag_1`, `bio_tag_2`) to represent both annotations
+
+### Requirement: Programmatic TA-Lib pattern labels
+When `stages.annotation_ingestion.programmatic_labels.enabled` is true, the system SHALL run TA-Lib CDL* pattern recognition functions listed in `talib_patterns` on the OHLC data. Each CDL function returns a positive integer (typically +100) for a bullish match, a negative integer (typically -100) for a bearish match, or 0 for no pattern. The system SHALL convert non-zero results to label names (e.g., `CDL_ENGULFING` with +100 → `bullish_engulfing`).
+
+#### Scenario: Detect engulfing pattern
+- **WHEN** `CDL_ENGULFING` is in the talib_patterns list and the OHLC data contains an engulfing pattern
+- **THEN** the system generates a label `bullish_engulfing` or `bearish_engulfing` for the corresponding candle
+
+#### Scenario: No pattern detected
+- **WHEN** a CDL function returns 0 for a candle
+- **THEN** no programmatic label is assigned to that candle
+
+### Requirement: Human and programmatic label merge
+When both human annotations and programmatic labels exist for the same candle, the system SHALL merge them using the strategy in `stages.annotation_ingestion.merge_strategy`: "human_priority" keeps the human label, "programmatic_priority" keeps the TA-Lib label, "both" keeps both as separate label columns.
+
+#### Scenario: Human priority merge
+- **WHEN** merge_strategy is "human_priority" and a candle has human label "bull_flag" and programmatic label "bullish_engulfing"
+- **THEN** the output label for that candle is "bull_flag"
+
+#### Scenario: Both labels merge
+- **WHEN** merge_strategy is "both" and a candle has both human and programmatic labels
+- **THEN** the output has two separate label columns: `label_human` and `label_programmatic`
+
+### Requirement: Context padding
+The system SHALL include `stages.annotation_ingestion.context_padding` candles before and after each annotation span in the labeled output. This provides trend context for models.
+
+#### Scenario: Add padding candles
+- **WHEN** context_padding is 20 and an annotation spans candles T10 to T15
+- **THEN** the output includes candles from T-10 (or dataset start) through T35 (or dataset end) associated with that annotation
+
+### Requirement: Dataset statistics logging
+After annotation ingestion completes, the system SHALL log: total annotations by label, class distribution percentages, average span length per label, and agreement rate between human and programmatic labels (when both are enabled).
+
+#### Scenario: Log class distribution
+- **WHEN** annotation ingestion completes with 50 "bull_flag", 30 "bear_flag", and 200 "O" labels
+- **THEN** the system logs the counts and percentages for each class
+
+### Requirement: Labeled CSV output
+The system SHALL write the labeled dataset to `data.labeled_path` in CSV format. The output SHALL contain all feature columns plus the target label column(s).
+
+#### Scenario: Write labeled CSV
+- **WHEN** annotation ingestion completes successfully
+- **THEN** the labeled CSV is written to `data.labeled_path` with all feature and label columns
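The BIO requirement in the annotation-ingestion spec above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bio_tags` is hypothetical, annotations are plain dicts with `start_time`, `end_time`, and `label`, and span bounds are inclusive. Overlapping annotations are not handled here (a later annotation simply overwrites an earlier one), whereas the spec calls for separate `bio_tag_1`, `bio_tag_2` columns in that case.

```python
from typing import Dict, List, Sequence


def bio_tags(times: Sequence[int], annotations: Sequence[Dict]) -> List[str]:
    """Assign one BIO tag per candle time.

    The first candle inside a span gets B-{label}, the remaining candles
    inside the span get I-{label}, and all other candles get O.
    """
    tags = ["O"] * len(times)
    for ann in annotations:
        # Indices of candles whose time falls inside the span (inclusive).
        inside = [
            i for i, t in enumerate(times)
            if ann["start_time"] <= t <= ann["end_time"]
        ]
        for position, i in enumerate(inside):
            prefix = "B-" if position == 0 else "I-"
            tags[i] = prefix + ann["label"]
    return tags
```

For the spec's "Single annotation BIO tags" scenario, a `bull_flag` span from T5 through T8 yields `B-bull_flag` at T5 and `I-bull_flag` at T6 through T8.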
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/backend-api/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/backend-api/spec.md
@@ -0,0 +1,38 @@
+## ADDED Requirements
+
+### Requirement: Predict proxy endpoint
+The system SHALL provide a `POST /api/predict` Next.js API route that proxies requests to the Python inference service at `${INFERENCE_API_URL}/predict`. The route SHALL forward the request body (pair, timeframe, candles array) and return the Python service's response. If the inference service is unreachable, the route SHALL return HTTP 503 with `{ "error": "Inference service unavailable" }`.
+
+#### Scenario: Successful prediction proxy
+- **WHEN** POST /api/predict is called with valid candle data and the Python service is running
+- **THEN** the route forwards the request to the inference service and returns the prediction response with HTTP 200
+
+#### Scenario: Inference service down
+- **WHEN** POST /api/predict is called but the Python inference service is unreachable
+- **THEN** the route returns HTTP 503 with `{ "error": "Inference service unavailable" }`
+
+#### Scenario: Inference service error
+- **WHEN** the Python inference service returns an error status (4xx or 5xx)
+- **THEN** the route forwards the error status and message to the client
+
+### Requirement: Batch predict proxy endpoint
+The system SHALL provide a `POST /api/predict/batch` Next.js API route that proxies batch prediction requests to `${INFERENCE_API_URL}/predict/batch`. The route SHALL forward pair, timeframe, start_date, and end_date.
+
+#### Scenario: Successful batch prediction
+- **WHEN** POST /api/predict/batch is called with valid parameters
+- **THEN** the route forwards to the inference service and returns the batch prediction response
+
+#### Scenario: Timeout on large batch
+- **WHEN** the batch prediction takes longer than INFERENCE_BATCH_TIMEOUT
+- **THEN** the route returns HTTP 504 with `{ "error": "Batch prediction timed out" }`
+
+### Requirement: Model info proxy endpoint
+The system SHALL provide a `GET /api/model/info` Next.js API route that proxies to `${INFERENCE_API_URL}/model/info`. This endpoint returns model metadata and per-class metrics.
+
+#### Scenario: Successful model info
+- **WHEN** GET /api/model/info is called and the inference service is running
+- **THEN** the route returns the model metadata JSON
+
+#### Scenario: No model available
+- **WHEN** GET /api/model/info is called and the inference service returns 503
+- **THEN** the route returns HTTP 503 with `{ "error": "No model available" }`
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/feature-engineering/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/feature-engineering/spec.md
@@ -0,0 +1,60 @@
+## ADDED Requirements
+
+### Requirement: TA-Lib indicator computation
+The system SHALL compute technical indicators from raw OHLCV data using TA-Lib. The pipeline config's `stages.feature_engineering.talib_indicators` list defines which indicators to compute. Each indicator entry specifies a `name` (TA-Lib function name) and `params` (dictionary of function parameters). Computed indicators SHALL be appended as new columns to the output CSV using lowercase naming: `{indicator}_{param}` (e.g., `rsi_14`, `ema_20`, `macd`, `macd_signal`, `macd_hist`, `bbands_upper`, `bbands_middle`, `bbands_lower`).
+
+#### Scenario: Compute RSI indicator
+- **WHEN** the config includes `{ name: "RSI", params: { timeperiod: 14 } }` and feature engineering is enabled
+- **THEN** the system computes RSI with period 14 and appends a `rsi_14` column to the enriched CSV
+
+#### Scenario: Compute multi-output indicator
+- **WHEN** the config includes `{ name: "MACD", params: { fastperiod: 12, slowperiod: 26, signalperiod: 9 } }`
+- **THEN** the system appends `macd`, `macd_signal`, and `macd_hist` columns to the enriched CSV
+
+#### Scenario: TA-Lib not installed
+- **WHEN** feature engineering is enabled but the TA-Lib C library is not installed on the system
+- **THEN** the system SHALL fail with a clear error message including installation instructions for the user's platform, and SHALL NOT silently skip the stage
+
+#### Scenario: Feature engineering disabled
+- **WHEN** `stages.feature_engineering.enabled` is false
+- **THEN** the system SHALL skip indicator computation entirely and pass raw OHLCV data to the next stage
+
+### Requirement: Candle feature extraction
+When `stages.feature_engineering.candle_features` is true, the system SHALL compute derived candle features for each row: `body_size` (abs(close - open)), `body_direction` (1 if close >= open, else -1), `upper_wick` (high - max(open, close)), `lower_wick` (min(open, close) - low), `wick_ratio` (upper_wick / lower_wick), `body_to_range` (body_size / (high - low)), `gap` (open - previous close), and `range` (high - low).
+
+#### Scenario: Compute candle features
+- **WHEN** `candle_features` is true and feature engineering is enabled
+- **THEN** the system appends columns `body_size`, `body_direction`, `upper_wick`, `lower_wick`, `wick_ratio`, `body_to_range`, `gap`, `range` to the enriched CSV
+
+#### Scenario: Division by zero handling
+- **WHEN** a candle has `lower_wick` equal to 0 (for `wick_ratio`) or `high` equal to `low` (for `body_to_range`)
+- **THEN** the system SHALL set the result to 0.0 instead of raising an error
+
+#### Scenario: Gap for first candle
+- **WHEN** computing `gap` for the first candle in the dataset (no previous close)
+- **THEN** the system SHALL set gap to 0.0
+
+### Requirement: Custom feature functions
+When `stages.feature_engineering.custom_features` is configured, the system SHALL dynamically import each listed dotted Python path and call the function it resolves to. Each custom feature function SHALL accept a pandas DataFrame (the full OHLCV + computed features so far) and return a pandas Series. The returned Series SHALL be appended as a new column named after the function.
+
+#### Scenario: Load custom feature
+- **WHEN** the config includes `custom_features: ["features.custom.trend_slope"]`
+- **THEN** the system imports `features.custom.trend_slope`, calls it with the DataFrame, and appends the result as a `trend_slope` column
+
+#### Scenario: Custom feature import error
+- **WHEN** a custom feature module path cannot be imported
+- **THEN** the system SHALL fail with an error message naming the unresolvable module path
+
+### Requirement: NaN handling for warmup periods
+After computing all indicators, the system SHALL handle NaN values introduced by indicator warmup periods. Rows with NaN values in indicator columns SHALL be dropped from the output. The system SHALL log how many rows were dropped.
+
+#### Scenario: Drop warmup rows
+- **WHEN** RSI with period 14 produces NaN for the first 14 rows
+- **THEN** those rows are dropped from the enriched CSV and a log message reports "Dropped 14 rows due to indicator warmup"
+
+### Requirement: Enriched CSV output
+The system SHALL write the enriched dataset (original OHLCV columns + all computed feature columns) to the path specified by `data.enriched_path` in CSV format. The output SHALL preserve the original column order with new feature columns appended.
+
+#### Scenario: Write enriched CSV
+- **WHEN** feature engineering completes successfully
+- **THEN** the enriched CSV is written to `data.enriched_path` with all original and computed columns
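The "Candle feature extraction" requirement in the feature-engineering spec above can be sketched per candle as follows. This is a minimal, standard-library-only illustration: the `Candle` dataclass and the `candle_features` and `safe_div` helpers are hypothetical names, and a real pipeline would more likely compute these as vectorized DataFrame columns.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator != 0 else 0.0


def candle_features(candle: Candle, prev_close: Optional[float] = None) -> Dict[str, float]:
    """Compute the derived candle features listed in the spec for one candle."""
    body_size = abs(candle.close - candle.open)
    upper_wick = candle.high - max(candle.open, candle.close)
    lower_wick = min(candle.open, candle.close) - candle.low
    candle_range = candle.high - candle.low
    return {
        "body_size": body_size,
        "body_direction": 1 if candle.close >= candle.open else -1,
        "upper_wick": upper_wick,
        "lower_wick": lower_wick,
        "wick_ratio": safe_div(upper_wick, lower_wick),
        "body_to_range": safe_div(body_size, candle_range),
        # The first candle has no previous close, so its gap is 0.0.
        "gap": 0.0 if prev_close is None else candle.open - prev_close,
        "range": candle_range,
    }
```

The `safe_div` helper covers both division-by-zero scenarios in the spec: a zero `lower_wick` for `wick_ratio` and `high == low` for `body_to_range`.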
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/ml-inference/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/ml-inference/spec.md
@@ -0,0 +1,107 @@
+## ADDED Requirements
+
+### Requirement: Model loading from MLflow registry
+When `stages.inference.model_source` is "mlflow", the system SHALL load the model from the MLflow model registry using the model name (`stages.inference.mlflow_model_name`) and stage (`stages.inference.mlflow_model_stage`).
+
+#### Scenario: Load production model
+- **WHEN** model_source is "mlflow", model name is "candlestick_pattern_v1", and stage is "Production"
+- **THEN** the system loads the model registered as "candlestick_pattern_v1" at the "Production" stage from MLflow
+
+#### Scenario: Model not found in registry
+- **WHEN** the specified model name or stage does not exist in the MLflow registry
+- **THEN** the system SHALL return a clear error indicating the model was not found
+
+### Requirement: Model loading from local file
+When `stages.inference.model_source` is "local", the system SHALL load the model from the file path specified by `stages.inference.local_model_path` using joblib.
+
+#### Scenario: Load local model
+- **WHEN** model_source is "local" and local_model_path is "models/best_model.pkl"
+- **THEN** the system loads the model from that file path
+
+#### Scenario: Local model file missing
+- **WHEN** the local_model_path does not exist
+- **THEN** the system SHALL return an error indicating the model file was not found
+
+### Requirement: Preprocessing parity
+The inference service SHALL replicate the exact preprocessing (feature engineering) used during training. The system SHALL load the pipeline config artifact from the MLflow run that produced the model and apply the same feature engineering steps (TA-Lib indicators, candle features) with the same parameters.
+
+#### Scenario: Matching preprocessing
+- **WHEN** the model was trained with RSI(14) and EMA(20) features
+- **THEN** inference SHALL compute RSI(14) and EMA(20) on the input candles before running the model
+
+#### Scenario: Config mismatch warning
+- **WHEN** the current pipeline config differs from the config stored with the model
+- **THEN** the system SHALL log a warning about the mismatch
+
+### Requirement: Predict endpoint
+The system SHALL provide a `POST /predict` endpoint on the FastAPI service (port 8001). The endpoint SHALL accept a JSON body with `pair` (string), `timeframe` (string), and `candles` (array of objects with `time`, `open`, `high`, `low`, `close`, `volume`). It SHALL return predictions with per-candle labels and confidence scores, prediction spans (grouped continuous predictions), and model metadata.
+
+#### Scenario: Successful prediction
+- **WHEN** POST /predict is called with 100 valid candle objects
+- **THEN** the system returns a JSON response with `predictions` array (one entry per candle with `time`, `label`, `confidence`), `spans` array (continuous same-label predictions grouped with `start_time`, `end_time`, `label`, `avg_confidence`), and `model_info` object
+
+#### Scenario: Empty candles array
+- **WHEN** POST /predict is called with an empty candles array
+- **THEN** the system returns HTTP 400 with an error message
+
+#### Scenario: Invalid candle data
+- **WHEN** POST /predict is called with candle objects missing required fields
+- **THEN** the system returns HTTP 422 with validation error details
+
+### Requirement: Batch predict endpoint
+The system SHALL provide a `POST /predict/batch` endpoint that accepts `pair`, `timeframe`, `start_date`, and `end_date`. The system SHALL load OHLCV data from its own data store for the specified range, process in chunks of `stages.inference.batch_size`, and return predictions for the full range.
+
+#### Scenario: Batch prediction
+- **WHEN** POST /predict/batch is called with pair "EURUSD", timeframe "1H", start_date and end_date spanning 6 months
+- **THEN** the system loads the data, processes in batches, and returns predictions for the full range
+
+#### Scenario: No data for range
+- **WHEN** the requested date range has no OHLCV data available
+- **THEN** the system returns HTTP 404 with a message indicating no data found for the range
+
+### Requirement: Model info endpoint
+The system SHALL provide a `GET /model/info` endpoint that returns metadata about the currently loaded model: model_name, model_version, model_type, trained_at, dataset_version, feature_engineering enabled status, list of all labels the model knows, and per-class metrics (precision, recall, F1, training sample count for each label).
+
+#### Scenario: Get model info
+- **WHEN** GET /model/info is called and a model is loaded
+- **THEN** the system returns JSON with model metadata and per-class metrics
+
+#### Scenario: No model loaded
+- **WHEN** GET /model/info is called and no model has been loaded
+- **THEN** the system returns HTTP 503 with a message indicating no model is available
+
+### Requirement: Model labels endpoint
+The system SHALL provide a `GET /model/labels` endpoint that returns the list of all pattern labels the current model can predict, along with their display colors.
+
+#### Scenario: Get model labels
+- **WHEN** GET /model/labels is called
+- **THEN** the system returns a JSON array of label objects with `name` and `color` fields
+
+### Requirement: Health check endpoint
+The system SHALL provide a `GET /health` endpoint that returns the service status including whether a model is loaded, the MLflow connection status, and the PostgreSQL connection status.
+
+#### Scenario: Healthy service
+- **WHEN** GET /health is called and all dependencies are available
+- **THEN** the system returns HTTP 200 with `{ "status": "healthy", "model_loaded": true, "mlflow": "connected", "database": "connected" }`
+
+#### Scenario: Degraded service
+- **WHEN** GET /health is called but the MLflow server is unreachable
+- **THEN** the system returns HTTP 200 with `{ "status": "degraded", "model_loaded": true, "mlflow": "disconnected", "database": "connected" }`
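One way to derive the `/health` payload from the three dependency checks is sketched below. The function name `health_status` is hypothetical, and treating any failed check (including a missing model) as "degraded" is an assumption; the spec only defines the MLflow-unreachable case.

```python
from typing import Dict, Union


def health_status(
    model_loaded: bool, mlflow_ok: bool, database_ok: bool
) -> Dict[str, Union[str, bool]]:
    """Build the health response body; the endpoint returns HTTP 200 either way."""
    all_ok = model_loaded and mlflow_ok and database_ok
    return {
        # Assumption: any failing dependency downgrades the status to "degraded".
        "status": "healthy" if all_ok else "degraded",
        "model_loaded": model_loaded,
        "mlflow": "connected" if mlflow_ok else "disconnected",
        "database": "connected" if database_ok else "disconnected",
    }
```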
+
+### Requirement: Prediction confidence scores
+Each prediction SHALL include a confidence score between 0.0 and 1.0 derived from the model's probability output. For tree-based models, this is the max class probability from `predict_proba()`.
+
+#### Scenario: Confidence from predict_proba
+- **WHEN** the model predicts class "bull_flag" with probability 0.87
+- **THEN** the prediction confidence for that candle is 0.87
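The confidence rule can be illustrated with one row of class probabilities. In scikit-learn, the columns of `predict_proba()` follow the order of the classifier's `classes_` attribute; the helper name `label_and_confidence` is hypothetical.

```python
from typing import Sequence, Tuple


def label_and_confidence(
    probabilities: Sequence[float], classes: Sequence[str]
) -> Tuple[str, float]:
    """Return the most probable class and its probability for one candle."""
    if len(probabilities) != len(classes):
        raise ValueError("probabilities and classes must have the same length")
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best], float(probabilities[best])
```

For the scenario above, a row `[0.05, 0.87, 0.08]` over classes `["O", "bull_flag", "bear_flag"]` yields the label "bull_flag" with confidence 0.87.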
+
+### Requirement: Prediction span grouping
+The system SHALL group consecutive candle predictions with the same non-"O" label into prediction spans. Each span SHALL have `start_time`, `end_time`, `label`, and `avg_confidence` (mean confidence of candles in the span).
+
+#### Scenario: Group consecutive predictions
+- **WHEN** candles at T1, T2, T3 are all predicted as "bull_flag" with confidences 0.85, 0.90, 0.80
+- **THEN** the system creates one span: `{ start_time: T1, end_time: T3, label: "bull_flag", avg_confidence: 0.85 }`
+
+#### Scenario: Break on label change
+- **WHEN** candle T1 is "bull_flag" and candle T2 is "bear_flag"
+- **THEN** the system creates two separate spans
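A minimal sketch of the span grouping described above, assuming per-candle predictions arrive as `(time, label, confidence)` tuples sorted by time with no gaps. The `Span` dataclass and `group_spans` are illustrative names, not the service's actual types.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Sequence, Tuple


@dataclass
class Span:
    start_time: int
    end_time: int
    label: str
    avg_confidence: float


def group_spans(predictions: Sequence[Tuple[int, str, float]]) -> List[Span]:
    """Merge runs of consecutive candles that share a non-"O" label."""
    spans: List[Span] = []
    # groupby only merges adjacent items, so a label change starts a new run.
    for label, run in groupby(predictions, key=lambda p: p[1]):
        if label == "O":
            continue
        candles = list(run)
        avg = sum(conf for _, _, conf in candles) / len(candles)
        spans.append(Span(candles[0][0], candles[-1][0], label, avg))
    return spans
```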
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/ml-training/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/ml-training/spec.md
@ -0,0 +1,92 @@
+## ADDED Requirements
+
+### Requirement: Temporal train/test splitting
+The system SHALL split the labeled dataset into train, validation, and test sets using temporal ordering. Data SHALL be sorted by time. The earliest portion is the training set, the middle portion is the validation set, and the latest portion is the test set. Split ratios are defined by `stages.training.test_split` and `stages.training.validation_split`. By default the system SHALL NOT shuffle financial time series data; shuffling occurs only when a random split is explicitly requested.
+
+#### Scenario: Temporal split
+- **WHEN** test_split is 0.2, validation_split is 0.1, and the dataset has 1000 rows sorted by time
+- **THEN** the first 700 rows are training, next 100 are validation, last 200 are test
+
+#### Scenario: Random split option
+- **WHEN** split_method is "random"
+- **THEN** the system uses standard random splitting (sklearn train_test_split) but logs a warning that this is not recommended for financial data
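The temporal split can be sketched as plain slicing over rows that are already sorted by time. The function name and the rounding of split sizes are assumptions; the spec only fixes the ordering and the ratios.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T], test_split: float, validation_split: float
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-sorted rows into (train, validation, test) without shuffling."""
    n = len(rows)
    n_test = int(round(n * test_split))
    n_val = int(round(n * validation_split))
    n_train = n - n_val - n_test
    if n_train <= 0:
        raise ValueError("split ratios leave no rows for training")
    train = list(rows[:n_train])
    validation = list(rows[n_train:n_train + n_val])
    test = list(rows[n_train + n_val:])
    return train, validation, test
```

With 1000 rows, `test_split=0.2`, and `validation_split=0.1`, this returns 700, 100, and 200 rows, matching the scenario above.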
+
+### Requirement: Class weight balancing
+The system SHALL apply class weighting to handle imbalanced pattern labels. When `stages.training.class_weights` is "balanced", the system SHALL compute inverse-frequency weights so rare pattern classes receive higher training weight.
+
+#### Scenario: Balanced weights
+- **WHEN** class_weights is "balanced" and the dataset has 500 "O" labels and 50 "bull_flag" labels
+- **THEN** the model trains with class weights inversely proportional to class frequency
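As a worked example of inverse-frequency weighting, the sketch below uses the same heuristic scikit-learn applies for `class_weight="balanced"`: `n_samples / (n_classes * count)`. The helper name is hypothetical, and whether the service computes weights itself or delegates to the library is not stated in the spec.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

For 500 "O" labels and 50 "bull_flag" labels this gives weights of 0.55 and 5.5, so the rare class counts ten times as much per sample. `RandomForestClassifier` accepts `class_weight="balanced"` directly; for XGBoost, per-sample weights can be passed to `fit` through `sample_weight`.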
+
+### Requirement: Model training dispatch
+The system SHALL train the model type specified in `stages.training.model_type` using the hyperparameters in `stages.training.hyperparameters`. Supported model types for v1: "random_forest" (scikit-learn RandomForestClassifier) and "xgboost" (XGBClassifier).
+
+#### Scenario: Train XGBoost model
+- **WHEN** model_type is "xgboost" with hyperparameters n_estimators=500, max_depth=6, learning_rate=0.01
+- **THEN** the system trains an XGBClassifier with those parameters on the training set
+
+#### Scenario: Train RandomForest model
+- **WHEN** model_type is "random_forest"
+- **THEN** the system trains a RandomForestClassifier with the configured hyperparameters
+
+#### Scenario: Unsupported model type
+- **WHEN** model_type is a value not supported in v1 (e.g., "lstm", "transformer")
+- **THEN** the system SHALL fail with an error message listing the supported model types
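The dispatch requirement might look like the registry below. This is a sketch under assumptions: `MODEL_FACTORIES` and `build_model` are hypothetical names, and the libraries are imported lazily so that an unsupported type fails before any ML dependency is loaded.

```python
from typing import Any, Callable, Dict, Mapping


def _make_random_forest(params: Dict[str, Any]) -> Any:
    from sklearn.ensemble import RandomForestClassifier  # imported on demand
    return RandomForestClassifier(**params)


def _make_xgboost(params: Dict[str, Any]) -> Any:
    from xgboost import XGBClassifier  # imported on demand
    return XGBClassifier(**params)


# Maps the configured model_type to a constructor for that model.
MODEL_FACTORIES: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "random_forest": _make_random_forest,
    "xgboost": _make_xgboost,
}


def build_model(model_type: str, hyperparameters: Mapping[str, Any]) -> Any:
    """Instantiate the configured model or fail listing the supported types."""
    try:
        factory = MODEL_FACTORIES[model_type]
    except KeyError:
        supported = ", ".join(sorted(MODEL_FACTORIES))
        raise ValueError(
            f"Unsupported model_type {model_type!r}; supported types: {supported}"
        ) from None
    return factory(dict(hyperparameters))
```

Adding a later model type (for example a deep learning model) would then mean registering one more factory rather than changing the training entry point.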
+
+### Requirement: MLflow experiment tracking
+The system SHALL log all training runs to MLflow. Each run SHALL log: the full pipeline YAML config as an artifact, dataset version (DVC hash if available), total samples, number of classes, model type, window size, per-class sample counts, and all hyperparameters.
+
+#### Scenario: Log training run
+- **WHEN** a training run starts
+- **THEN** the system creates an MLflow run under the experiment name from `stages.training.mlflow.experiment_name` and logs all parameters
+
+#### Scenario: MLflow server unavailable
+- **WHEN** the MLflow tracking URI is unreachable
+- **THEN** the system SHALL fail with an error message indicating the MLflow server cannot be reached at the configured URI
+
+### Requirement: Training metrics logging
+After training, the system SHALL evaluate the model on the test set and log metrics to MLflow: overall accuracy, macro F1, weighted F1, and per-class precision, recall, and F1 for each label.
+
+#### Scenario: Log overall metrics
+- **WHEN** model evaluation completes
+- **THEN** the system logs accuracy, f1_macro, and f1_weighted to MLflow
+
+#### Scenario: Log per-class metrics
+- **WHEN** model evaluation completes with labels "bull_flag", "bear_flag", and "O"
+- **THEN** the system logs precision_bull_flag, recall_bull_flag, f1_bull_flag (and same for each other label) to MLflow
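The per-class metric names in the scenario above can be produced as a flat dictionary, which is a convenient shape for logging one metric per key. The sketch computes the values by hand to stay dependency-free; in practice scikit-learn's `precision_recall_fscore_support` gives the same numbers. The function name is hypothetical.

```python
from typing import Dict, Sequence


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Return precision_<label>, recall_<label>, and f1_<label> for each label."""
    metrics: Dict[str, float] = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (no predictions or no true samples) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[f"precision_{label}"] = precision
        metrics[f"recall_{label}"] = recall
        metrics[f"f1_{label}"] = f1
    return metrics
```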
+
+### Requirement: Training artifact logging
+When `stages.training.mlflow.log_artifacts` is true, the system SHALL log to MLflow: a confusion matrix plot (PNG), a feature importance plot (PNG, for tree-based models), and a classification report (text).
+
+#### Scenario: Log confusion matrix
+- **WHEN** log_artifacts is true and training completes
+- **THEN** the system generates and logs a confusion matrix plot as "confusion_matrix.png" to MLflow
+
+#### Scenario: Log feature importance
+- **WHEN** log_artifacts is true and the model has `feature_importances_` attribute
+- **THEN** the system generates and logs a feature importance plot as "feature_importance.png" to MLflow
+
+### Requirement: Model registration
+When `stages.training.mlflow.register_model` is true, the system SHALL register the trained model in the MLflow model registry under the name specified by `stages.inference.mlflow_model_name`.
+
+#### Scenario: Register model
+- **WHEN** register_model is true and training completes
+- **THEN** the system registers the model in MLflow registry with the configured model name
+
+### Requirement: PostgreSQL training metadata storage
+The system SHALL store training run metadata in the PostgreSQL database. Each training run record SHALL include: run_id (MLflow run ID), model_type, experiment_name, pipeline_config_hash, dataset_version, metrics summary (JSON), status, and timestamps (created_at, completed_at).
+
+#### Scenario: Store training run record
+- **WHEN** a training run completes successfully
+- **THEN** the system inserts a record into the PostgreSQL `training_runs` table with the run metadata
+
+#### Scenario: Query training history
+- **WHEN** the system queries training runs
+- **THEN** it returns records from PostgreSQL ordered by created_at descending
+
+### Requirement: Pipeline config logging
+The system SHALL log the full pipeline YAML config as an MLflow artifact with each training run. This config SHALL be used during inference to replicate the exact preprocessing steps.
+
+#### Scenario: Config artifact logged
+- **WHEN** a training run starts
+- **THEN** the full pipeline.yaml content is logged as "pipeline_config.yaml" artifact in the MLflow run
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/prediction-ui/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/prediction-ui/spec.md
@ -0,0 +1,130 @@
+## ADDED Requirements
+
+### Requirement: Prediction state management
+The system SHALL maintain a separate prediction state alongside the existing annotation state. The prediction state SHALL include: spans (array of prediction spans), isLoading, error, modelInfo, visible (toggle), confidenceThreshold (filter), selectedLabels (filter), and autoPredict (toggle). Prediction state SHALL be independent from annotation state.
+
+#### Scenario: Initial prediction state
+- **WHEN** the app loads
+- **THEN** predictions are empty, visible is true, confidenceThreshold defaults to 0.70, autoPredict is false, and selectedLabels includes all labels
+
+### Requirement: On-demand prediction fetching
+The system SHALL fetch predictions on demand when the user clicks "Run on Visible". The system SHALL send the currently visible candles to `/api/predict` and update the prediction state with results. Predictions are ephemeral — not persisted, re-fetched on demand.
+
+#### Scenario: Run on visible candles
+- **WHEN** user clicks "Run on Visible" button
+- **THEN** the system sends the visible candle range to /api/predict, shows a loading state, and renders returned predictions on the chart
+
+#### Scenario: Batch predict all
+- **WHEN** user clicks "Predict All" button
+- **THEN** the system sends a batch request to /api/predict/batch for the full dataset and renders all returned predictions
+
+### Requirement: Prediction caching
+The system SHALL cache predictions in memory keyed by `${pair}_${timeframe}_${startTime}_${endTime}_${modelVersion}`. When the user scrolls to a range with cached predictions, the system SHALL use the cache instead of re-fetching. Cache SHALL be invalidated when the model version changes.
+
+#### Scenario: Cache hit
+- **WHEN** user scrolls back to a previously predicted range with the same model version
+- **THEN** the system renders cached predictions without making an API call
+
+#### Scenario: Cache invalidation on model change
+- **WHEN** the model version changes (detected via /api/model/info)
+- **THEN** all cached predictions are cleared
+
+### Requirement: Prediction rendering on chart
+The system SHALL render model predictions as a visual layer on the lightweight-charts instance, visually distinct from human annotations. Predictions SHALL use a histogram series with per-bar colors mapped to predicted pattern labels at reduced opacity (10-20%). Series markers SHALL be added at the start of each prediction span showing `{label} ({confidence}%)` positioned below bars.
+
+#### Scenario: Render prediction spans
+- **WHEN** predictions are loaded and visible is true
+- **THEN** colored histogram bars appear behind candles for predicted patterns, with markers showing labels and confidence
+
+#### Scenario: Predictions hidden
+- **WHEN** the user toggles predictions off (visible = false)
+- **THEN** the prediction histogram series and markers are removed from the chart
+
+#### Scenario: Visual distinction from annotations
+- **WHEN** both human annotations and model predictions exist for the same range
+- **THEN** human annotations render as solid colored rectangles (above bars) and predictions render as low-opacity histogram bars (below bars) — they are visually distinguishable
+
+### Requirement: Confidence threshold filter
+The system SHALL filter displayed predictions by confidence. Only predictions with confidence >= `confidenceThreshold` SHALL be rendered. The threshold is adjustable via a slider in the controls panel (range 0.0 to 1.0).
+
+#### Scenario: Filter low confidence
+- **WHEN** confidenceThreshold is 0.70 and a prediction has confidence 0.55
+- **THEN** that prediction is not rendered on the chart
+
+#### Scenario: Adjust threshold
+- **WHEN** user moves the confidence slider from 0.70 to 0.50
+- **THEN** previously hidden predictions with confidence between 0.50 and 0.70 become visible
+
+### Requirement: Label type filter
+The system SHALL allow users to toggle visibility of individual pattern labels via checkboxes in the controls panel. Only predictions for checked labels are rendered.
+
+#### Scenario: Hide specific label
+- **WHEN** user unchecks "double_bottom" in the label filter
+- **THEN** all "double_bottom" predictions are hidden from the chart
+
+### Requirement: Prediction controls panel
+The system SHALL display a prediction controls panel in the sidebar with: master on/off toggle, model info (name, version, type, training date), action buttons ("Run on Visible", "Predict All"), auto-predict toggle, confidence threshold slider, label checkboxes with per-class precision/recall metrics, prediction count, agreement count, and a "Show only disagreements" filter.
+
+#### Scenario: Display model info
+- **WHEN** the prediction panel loads and the inference API is available
+- **THEN** the panel fetches /api/model/info and displays model name, version, type, and training date
+
+#### Scenario: Inference API unavailable
+- **WHEN** the prediction panel loads and /api/model/info returns an error
+- **THEN** the panel shows "Model server offline — predictions unavailable" and all controls are disabled
+
+#### Scenario: Per-class metrics display
+- **WHEN** model info includes per-class metrics
+- **THEN** each label checkbox shows precision and recall values (e.g., "bull_flag (P:0.89 R:0.76)")
+
+### Requirement: Disagreement detection
+The system SHALL compare human annotation spans with model prediction spans to identify disagreements. For each human annotation, the system SHALL check whether any prediction span overlaps it by more than 50% in time. The disagreement types are: "missed_by_model" (human annotated, model predicted "O"), "missed_by_human" (model predicted a pattern, no human annotation), and "label_mismatch" (both mark a pattern but with different labels).
+
+#### Scenario: Missed by model
+- **WHEN** a human annotation exists at T10-T20 but no prediction span overlaps it
+- **THEN** the system identifies this as "missed_by_model"
+
+#### Scenario: Missed by human
+- **WHEN** a prediction span exists at T30-T40 with no overlapping human annotation
+- **THEN** the system identifies this as "missed_by_human"
+
+#### Scenario: Label mismatch
+- **WHEN** a human annotation labels T10-T20 as "bull_flag" and the prediction labels the same range as "wedge_up"
+- **THEN** the system identifies this as "label_mismatch"
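The detection logic is language-neutral, so it is sketched here in Python even though the UI would implement it in TypeScript. Two points are assumptions rather than spec: overlap is measured relative to the shorter of the two spans, and span times are treated as numeric with `end_time > start_time`. `Span`, `overlap_fraction`, and `find_disagreements` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Span:
    start_time: int
    end_time: int
    label: str


def overlap_fraction(a: Span, b: Span) -> float:
    """Shared duration divided by the shorter span's duration (assumption)."""
    overlap = min(a.end_time, b.end_time) - max(a.start_time, b.start_time)
    shorter = min(a.end_time - a.start_time, b.end_time - b.start_time)
    if overlap <= 0 or shorter <= 0:
        return 0.0  # disjoint or zero-length spans never count as overlapping
    return overlap / shorter


Disagreement = Tuple[str, Optional[Span], Optional[Span]]


def find_disagreements(
    human: Sequence[Span], predicted: Sequence[Span], threshold: float = 0.5
) -> List[Disagreement]:
    """Classify spans as missed_by_model, label_mismatch, or missed_by_human."""
    results: List[Disagreement] = []
    matched = set()  # indices of predictions that overlap some human span
    for h in human:
        hits = [
            i for i, p in enumerate(predicted)
            if overlap_fraction(h, p) > threshold
        ]
        matched.update(hits)
        if not hits:
            results.append(("missed_by_model", h, None))
        for i in hits:
            if predicted[i].label != h.label:
                results.append(("label_mismatch", h, predicted[i]))
    for i, p in enumerate(predicted):
        if i not in matched:
            results.append(("missed_by_human", None, p))
    return results
```

Single-candle spans have zero duration under this definition, so a real implementation would likely treat `end_time` as exclusive or add one candle interval before comparing.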
+
+### Requirement: Disagreement rendering
+The system SHALL render disagreements with distinct visual styles: "missed_by_model" shows a red dashed border around the human annotation, "missed_by_human" shows a yellow highlight around the prediction, "label_mismatch" shows an orange border with both labels displayed.
+
+#### Scenario: Render missed_by_human highlight
+- **WHEN** a "missed_by_human" disagreement is detected and disagreement rendering is enabled
+- **THEN** the prediction span is highlighted with a yellow border/glow to draw attention
+
+#### Scenario: Show only disagreements
+- **WHEN** user clicks "Show only disagreements" filter
+- **THEN** only prediction spans involved in disagreements are rendered, hiding agreement spans
+
+### Requirement: Prediction-to-annotation feedback
+When a user clicks on a "missed_by_human" prediction, the system SHALL open the span annotation dialog pre-filled with the prediction's start_time, end_time, and label. The user can confirm (save as new annotation), correct (change label, then save), or dismiss.
+
+#### Scenario: Confirm prediction as annotation
+- **WHEN** user clicks a "missed_by_human" prediction and clicks Save in the pre-filled dialog
+- **THEN** the system creates a new span annotation with the model's suggested label and timestamps
+
+#### Scenario: Correct and save
+- **WHEN** user clicks a "missed_by_human" prediction, changes the label in the dialog, and clicks Save
+- **THEN** the system creates a new span annotation with the corrected label
+
+#### Scenario: Dismiss as not-a-pattern
+- **WHEN** user clicks a "missed_by_human" prediction and clicks "Not a pattern"
+- **THEN** the system saves a negative annotation with label "O", source "human_correction", and records the model's original prediction and confidence
+
+### Requirement: Inference API connection monitoring
+The system SHALL poll `/api/model/info` every 30 seconds when the inference API is unavailable. When the API becomes available, the system SHALL auto-reconnect and enable prediction controls. Human annotation SHALL never be blocked by inference API availability.
+
+#### Scenario: Auto-reconnect
+- **WHEN** the inference API was unavailable and becomes reachable
+- **THEN** the prediction panel re-enables controls and shows "Model server online"
+
+#### Scenario: Annotation independence
+- **WHEN** the inference API is unavailable
+- **THEN** all human annotation tools continue to work normally
--- a/openspec/changes/archive/2026-02-16-candle-backend/specs/span-annotation/spec.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/specs/span-annotation/spec.md
@ -0,0 +1,34 @@
+## ADDED Requirements
+
+### Requirement: Span annotation JSON export for ML pipeline
+The system SHALL provide a `GET /api/span-annotations/export` endpoint that exports all span annotations for a given chart as JSON in the format expected by the ML pipeline. The output SHALL be a JSON object with an `annotations` array where each entry has: `id`, `start_time` (Unix timestamp), `end_time` (Unix timestamp), `label`, `confidence` (nullable), `outcome` (nullable), and `sub_spans` (nullable). The endpoint SHALL accept an optional `chartId` query parameter.
+
+#### Scenario: Export span annotations as JSON
+- **WHEN** GET /api/span-annotations/export?chartId=3 is called
+- **THEN** the system returns a JSON object with all span annotations for chart 3 in the ML pipeline format
+
+#### Scenario: Export without chartId
+- **WHEN** GET /api/span-annotations/export is called without chartId
+- **THEN** the system exports span annotations for the most recently created chart
+
+### Requirement: Prediction-sourced span annotation creation
+The system SHALL support creating span annotations with a `source` field indicating whether the annotation was created by a human ("human"), confirmed from a model prediction ("model_confirmed"), or corrected from a model prediction ("model_corrected"). A fourth value, "human_correction", is used for negative annotations and is defined in the negative annotation requirement of this spec. The existing POST endpoint for span annotations SHALL accept an optional `source` field (default: "human") and an optional `model_prediction` field (an object with the `label` and `confidence` of the original prediction).
+
+#### Scenario: Create human annotation
+- **WHEN** a span annotation is created without a source field
+- **THEN** the source defaults to "human"
+
+#### Scenario: Confirm model prediction
+- **WHEN** a user confirms a model prediction as an annotation
+- **THEN** the span annotation is created with source "model_confirmed" and model_prediction containing the original predicted label and confidence
+
+#### Scenario: Correct model prediction
+- **WHEN** a user changes the label of a model prediction before saving
+- **THEN** the span annotation is created with source "model_corrected" and model_prediction containing the original predicted label and confidence
+
+### Requirement: Negative annotation for dismissed predictions
+The system SHALL support saving negative annotations when a user dismisses a model prediction as "not a pattern". A negative annotation SHALL have label "O", source "human_correction", and a `model_prediction` field recording what the model originally predicted.
+
+#### Scenario: Save negative annotation
+- **WHEN** user dismisses a "bull_flag" prediction with confidence 0.72
+- **THEN** the system creates a span annotation with label "O", source "human_correction", and model_prediction `{ "label": "bull_flag", "confidence": 0.72 }`
--- a/openspec/changes/archive/2026-02-16-candle-backend/tasks.md
+++ b/openspec/changes/archive/2026-02-16-candle-backend/tasks.md
@ -0,0 +1,116 @@
+## 1. Project Scaffolding & Infrastructure
+
+- [x] 1.1 Create `services/ml/` directory structure: `config/`, `features/`, `features/custom/`, `training/`, `training/models/`, `inference/`, `data/raw/`, `data/enriched/`, `data/labeled/`, `data/annotations/`
+- [x] 1.2 Create `services/ml/pyproject.toml` (or `requirements.txt`) with dependencies: fastapi, uvicorn, scikit-learn, xgboost, pandas, numpy, joblib, mlflow, pyyaml, ta-lib, dvc, sqlalchemy, psycopg2-binary, pydantic
+- [x] 1.3 Create `services/ml/Dockerfile` with Python 3.11, TA-Lib C library installation (`libta-lib-dev`), and pip install of dependencies
+- [x] 1.4 Create `config/pipeline.yaml` with the full pipeline configuration (all stages, default hyperparameters, MLflow/DVC settings)
+- [x] 1.5 Add PostgreSQL, ml-service, and mlflow containers to `docker-compose.yml` with shared data volume
+- [x] 1.6 Initialize DVC in `services/ml/` with local remote storage backend
+- [x] 1.7 Create PostgreSQL database schema: `training_runs` table (run_id, model_type, experiment_name, pipeline_config_hash, dataset_version, metrics_summary JSON, status, created_at, completed_at)
+- [x] 1.8 Create `services/ml/app/db.py` — SQLAlchemy engine and session setup for PostgreSQL connection
+
+## 2. Pipeline Config & Entry Point
+
+- [x] 2.1 Create `services/ml/app/config.py` — Pydantic model for pipeline YAML config with validation (stages, data paths, hyperparameters)
+- [x] 2.2 Create `services/ml/pipeline.py` — main orchestrator that reads config and runs enabled stages in sequence
+- [x] 2.3 Add CLI argument parsing: `--config`, `--stage` (run individual stage), support for `python pipeline.py --config config/pipeline.yaml`
+
+## 3. Feature Engineering Stage
+
+- [x] 3.1 Create `services/ml/features/talib_features.py` — compute TA-Lib indicators from config list, append columns with `{indicator}_{param}` naming, fail with clear error if TA-Lib not installed
+- [x] 3.2 Create `services/ml/features/candle_features.py` — compute body_size, body_direction, upper_wick, lower_wick, wick_ratio, body_to_range, gap, range with division-by-zero handling
+- [x] 3.3 Create `services/ml/features/custom_loader.py` — dynamic import of custom feature functions from config paths, call with DataFrame, append result as column
+- [x] 3.4 Implement NaN warmup row handling — drop rows with NaN in indicator columns, log count of dropped rows
+- [x] 3.5 Wire feature engineering into `pipeline.py` — read raw OHLCV CSV, run enabled feature steps, write enriched CSV to `data.enriched_path`
+
+## 4. Annotation Ingestion Stage
+
+- [x] 4.1 Create `services/ml/app/annotation_ingestion.py` — load annotations JSON from `data.annotations_path`, filter by min_confidence
+- [x] 4.2 Implement windowed classification encoding — extract fixed-size windows centered on each annotation span, flatten into single rows, handle boundary padding
+- [x] 4.3 Implement BIO sequence labeling encoding — assign B-{label}/I-{label}/O tags per candle, handle overlapping annotations with multiple tag columns
+- [x] 4.4 Implement TA-Lib CDL* programmatic labeling — run configured CDL functions, convert +100/-100 to label names (bullish_/bearish_ prefix)
+- [x] 4.5 Implement human/programmatic label merge strategies — human_priority, programmatic_priority, both (separate columns)
+- [x] 4.6 Implement context padding — include N candles before/after each annotation span
+- [x] 4.7 Add dataset statistics logging — counts per label, class distribution %, avg span length, human/programmatic agreement rate
+- [x] 4.8 Wire annotation ingestion into `pipeline.py` — read enriched CSV + annotations JSON, run encoding, write labeled CSV to `data.labeled_path`
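The BIO encoding from task 4.3 can be illustrated on a single tag column. This is a simplified sketch with hypothetical names: annotations are `(start_time, end_time, label)` tuples with inclusive bounds, and overlapping annotations, which the task handles with multiple tag columns, are not covered here (a later annotation simply overwrites an earlier one).

```python
from typing import List, Sequence, Tuple


def bio_tags(
    candle_times: Sequence[int],
    annotations: Sequence[Tuple[int, int, str]],
) -> List[str]:
    """Tag each candle as B-<label>, I-<label>, or O."""
    tags = ["O"] * len(candle_times)
    for start, end, label in annotations:
        inside = False  # becomes True after the first candle of this span
        for i, t in enumerate(candle_times):
            if start <= t <= end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags
```

For candles at times 1 through 5 and one "bull_flag" annotation covering 2 to 4, the tags are `O, B-bull_flag, I-bull_flag, I-bull_flag, O`.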
+
+## 5. Training Stage
+
+- [x] 5.1 Create `services/ml/training/train.py` — main training entry point: load labeled CSV, split, train, evaluate, log to MLflow
+- [x] 5.2 Implement temporal train/validation/test splitting with configurable ratios, warn on random split
+- [x] 5.3 Create `services/ml/training/models/random_forest.py` — RandomForestClassifier wrapper with class_weights support
+- [x] 5.4 Create `services/ml/training/models/xgboost_model.py` — XGBClassifier wrapper with class_weights support
+- [x] 5.5 Implement model dispatch — select model class based on `model_type` config, fail with supported types list for unknown types
+- [x] 5.6 Implement MLflow experiment tracking — create run, log config artifact, dataset params, per-class sample counts, all hyperparameters
+- [x] 5.7 Implement metrics logging — accuracy, f1_macro, f1_weighted, per-class precision/recall/F1
+- [x] 5.8 Create `services/ml/training/evaluation.py` — generate confusion matrix plot, feature importance plot, classification report text
+- [x] 5.9 Implement MLflow artifact logging — log confusion_matrix.png, feature_importance.png, classification_report.txt, pipeline_config.yaml
+- [x] 5.10 Implement MLflow model registration — log model with sklearn/xgboost flavor, register in registry if configured
+- [x] 5.11 Store training run metadata in PostgreSQL `training_runs` table
+- [x] 5.12 Wire training into `pipeline.py`
+
+## 6. Inference Service (FastAPI)
+
+- [x] 6.1 Create `services/ml/app/main.py` — FastAPI app with CORS, startup event to load model
+- [x] 6.2 Implement model loading — from MLflow registry (by name + stage) or from local .pkl file via joblib
+- [x] 6.3 Implement preprocessing parity — load pipeline config from MLflow artifact, apply same feature engineering as training
+- [x] 6.4 Create `POST /predict` endpoint — accept candles array, run preprocessing, predict, return per-candle labels + confidence + spans + model_info
+- [x] 6.5 Implement prediction span grouping — group consecutive same-label non-"O" predictions into spans with avg_confidence
+- [x] 6.6 Create `POST /predict/batch` endpoint — accept pair/timeframe/date range, load data, process in batch_size chunks, return predictions
+- [x] 6.7 Create `GET /model/info` endpoint — return model metadata, per-class metrics from MLflow
+- [x] 6.8 Create `GET /model/labels` endpoint — return label names and colors
+- [x] 6.9 Create `GET /health` endpoint — check model loaded status, MLflow connection, PostgreSQL connection
+- [x] 6.10 Add Pydantic request/response models for all endpoints (PredictRequest, PredictResponse, BatchPredictRequest, ModelInfoResponse)
+
+## 7. Next.js API Proxy Routes
+
+- [x] 7.1 Create `src/app/api/predict/route.ts` — POST proxy to `${INFERENCE_API_URL}/predict` with timeout handling
+- [x] 7.2 Create `src/app/api/predict/batch/route.ts` — POST proxy to `${INFERENCE_API_URL}/predict/batch` with INFERENCE_BATCH_TIMEOUT
+- [x] 7.3 Create `src/app/api/model/info/route.ts` — GET proxy to `${INFERENCE_API_URL}/model/info`
+- [x] 7.4 Add environment variables to `.env.local`: INFERENCE_API_URL, INFERENCE_API_TIMEOUT, INFERENCE_BATCH_TIMEOUT, NEXT_PUBLIC_PREDICTIONS_ENABLED
+
+## 8. Span Annotation Export & Feedback
+
+- [x] 8.1 Create `src/app/api/span-annotations/export/route.ts` — GET endpoint exporting span annotations as JSON in ML pipeline format
+- [x] 8.2 Add `source` and `model_prediction` fields to span annotation schema (Drizzle migration) — source defaults to "human", model_prediction is nullable JSON
+- [x] 8.3 Update span annotation POST endpoint to accept optional `source` and `model_prediction` fields
+- [x] 8.4 Support negative annotations — span with label "O", source "human_correction", and model_prediction metadata
+
+## 9. Prediction UI — State & Controls
+
+- [x] 9.1 Create `src/types/predictions.ts` — PredictionSpan, PredictionState, ModelInfoResponse interfaces
+- [x] 9.2 Create prediction state management in page.tsx (or dedicated context) — spans, isLoading, error, modelInfo, visible, confidenceThreshold, selectedLabels, autoPredict
+- [x] 9.3 Create `src/components/PredictionPanel.tsx` — controls panel with master toggle, model info display, action buttons, confidence slider, label checkboxes with metrics
+- [x] 9.4 Implement on-demand prediction fetching — "Run on Visible" sends visible candles to /api/predict, "Predict All" sends batch request
+- [x] 9.5 Implement prediction caching — Map keyed by pair_timeframe_range_modelVersion, invalidate on model version change
+
+## 10. Prediction UI — Chart Rendering
+
+- [x] 10.1 Add histogram series to CandleChart for prediction rendering — per-bar colors from label config at 10-20% opacity
+- [x] 10.2 Add series markers for prediction span labels — show `{label} ({confidence}%)` below bars at span start
+- [x] 10.3 Implement confidence threshold filtering — only render predictions above threshold
+- [x] 10.4 Implement label type filtering — toggle visibility per label from PredictionPanel checkboxes
+- [x] 10.5 Implement prediction layer visibility toggle — show/hide histogram series and markers
+
+## 11. Prediction UI — Disagreements & Feedback
+
+- [x] 11.1 Implement disagreement detection — compare human spans vs prediction spans with >50% overlap, classify as missed_by_model, missed_by_human, label_mismatch
+- [x] 11.2 Render disagreement highlights — red dashed border (missed_by_model), yellow highlight (missed_by_human), orange border (label_mismatch)
+- [x] 11.3 Add "Show only disagreements" filter toggle in PredictionPanel
+- [x] 11.4 Implement prediction-to-annotation feedback — click missed_by_human prediction opens span annotation dialog pre-filled with predicted label/times
+- [x] 11.5 Add "Not a pattern" dismiss action — saves negative annotation with label "O" and model_prediction metadata
+- [x] 11.6 Display prediction summary in PredictionPanel — prediction count, agreement count, disagreement count
+
+## 12. Inference API Connection & Error Handling
+
+- [x] 12.1 Implement inference API health polling — poll /api/model/info every 30 seconds when API unavailable, auto-reconnect
+- [x] 12.2 Show "Model server offline" banner when inference API unavailable, disable prediction controls
+- [x] 12.3 Ensure annotation tools work independently — prediction API errors never block human annotation
+- [x] 12.4 Add loading states for prediction fetching — skeleton/shimmer overlay during prediction requests
+
+## 13. Documentation & Deployment
+
+- [x] 13.1 Update docker-compose.yml with all service environment variables and health checks
+- [x] 13.2 Update DEPLOYMENT.md with Python service setup instructions, TA-Lib installation, MLflow server, PostgreSQL, DVC init
+- [x] 13.3 Update README.md with ML pipeline overview, architecture diagram, and usage instructions
+- [x] 13.4 Update CLAUDE_DESCRIPTION.md with new ML service capabilities and file structure