candle-annotator/openspec/changes/archive/2026-02-18-ml-ui-connection/design.md
2026-02-18 10:21:05 +01:00

## Context
The Candle Annotator has a fully implemented ML pipeline (feature engineering, annotation ingestion, training, inference) in a separate FastAPI service (`services/ml/`, port 8001), and a standalone TA-Lib pattern detection script (`generate_talib_annotations.py`). The Next.js frontend already has a `PredictionPanel` that can run predictions and display results, but TA-Lib pattern detection and model training are CLI-only operations.

The sidebar layout is a fixed 240px column with vertically stacked sections: ChartSelector → FileUpload → Toolbox → SpanAnnotationList → PredictionPanel → Export. All state lives in the root `Home` component and flows down as props.
## Goals / Non-Goals
**Goals:**
- Let users run TA-Lib CDL pattern detection from the sidebar and see results as span annotations on the chart
- Let users trigger ML training runs from the UI with model type and basic parameter selection
- Let users switch between trained models for prediction
- Keep the sidebar usable — panels must be collapsible since we're adding significant new UI surface

**Non-Goals:**
- Advanced hyperparameter tuning UI (grid search, cross-validation config)
- Real-time training progress (websockets) — polling is sufficient for now
- MLflow dashboard embedding — just link out to it
- Custom TA-Lib indicator configuration (RSI, EMA, etc.) — only CDL pattern functions
- Multi-chart training (training always uses the exported annotation dataset)
## Decisions
### 1. TA-Lib pattern detection runs server-side via new FastAPI endpoint
**Decision:** Add a `POST /patterns/detect` endpoint to the FastAPI service that accepts candle data and a list of pattern names, runs TA-Lib CDL functions, and returns span annotations.

**Rationale:** The `generate_talib_annotations.py` script already has the detection logic. Extracting it into a FastAPI endpoint reuses that code and keeps TA-Lib (C library) on the Python side only. The frontend sends candles the same way it does for `/predict`.

**Alternatives considered:**
- Running TA-Lib in a Next.js API route via child process — adds complexity, fragile
- Pre-computing all patterns on upload — wasteful, users want to select specific patterns
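The conversion step inside such an endpoint could look roughly like the sketch below. It relies on the fact that TA-Lib CDL functions return one integer per candle (0 for no pattern, positive for bullish, negative for bearish). Everything else is an assumption made for illustration: the `SpanAnnotation` field names, the `signals_to_spans` helper, and the rule that consecutive same-direction signals merge into one span are not taken from `generate_talib_annotations.py`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SpanAnnotation:
    # Hypothetical shape; the real span annotation schema may differ.
    label: str          # pattern name, e.g. "CDLENGULFING"
    start_index: int    # first candle of the span
    end_index: int      # last candle of the span (inclusive)
    direction: str      # "bullish" or "bearish"
    source: str = "talib"


def _make_span(pattern: str, start: int, end: int, sign: int) -> SpanAnnotation:
    direction = "bullish" if sign > 0 else "bearish"
    return SpanAnnotation(pattern, start, end, direction)


def signals_to_spans(pattern: str, signals: Sequence[int]) -> List[SpanAnnotation]:
    """Group consecutive non-zero signals of the same sign into spans."""
    spans: List[SpanAnnotation] = []
    start, sign = 0, 0  # sign == 0 means "not inside a span"
    for i, value in enumerate(signals):
        v = int(value)  # also accepts NumPy integer scalars
        current = (v > 0) - (v < 0)
        if current == sign:
            continue
        if sign != 0:
            # The run that started at `start` ended on the previous candle.
            spans.append(_make_span(pattern, start, i - 1, sign))
        start, sign = i, current
    if sign != 0:
        # Close a run that extends to the last candle.
        spans.append(_make_span(pattern, start, len(signals) - 1, sign))
    return spans
```

With TA-Lib installed, `signals` would be the array returned by a CDL function for the submitted candles; the helper itself has no TA-Lib dependency, so it can be unit-tested without the C library.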
### 2. Detected patterns become span annotations with `source: "talib"`
**Decision:** When patterns are detected, save them directly as span annotations in the Next.js database via `POST /api/span-annotations` with `source: "talib"`. This allows bulk delete by source and makes them immediately visible on the chart.

**Rationale:** The `span_annotations` table already has a `source` field supporting arbitrary strings. Using `"talib"` as a new source value enables filtering and bulk operations without schema changes.

**Alternatives considered:**
- Separate "patterns" table — unnecessary duplication, patterns are just a special kind of span annotation
- Client-side only (no persistence) — lose patterns on page reload
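To illustrate why a `source` value is enough for bulk operations, here is a minimal sketch using an in-memory SQLite database as a stand-in. The column names other than `source`, the `"manual"` source value, and the helper functions are hypothetical; the real table lives in the Next.js database and its schema is not shown here.

```python
import sqlite3
from typing import Iterable, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical columns; only `source` is taken from the design.
    conn.execute(
        """
        CREATE TABLE span_annotations (
            id INTEGER PRIMARY KEY,
            label TEXT NOT NULL,
            start_index INTEGER NOT NULL,
            end_index INTEGER NOT NULL,
            source TEXT NOT NULL
        )
        """
    )


def insert_spans(
    conn: sqlite3.Connection,
    spans: Iterable[Tuple[str, int, int]],
    source: str,
) -> None:
    """Insert (label, start_index, end_index) rows tagged with one source."""
    conn.executemany(
        "INSERT INTO span_annotations (label, start_index, end_index, source) "
        "VALUES (?, ?, ?, ?)",
        [(label, start, end, source) for label, start, end in spans],
    )


def delete_by_source(conn: sqlite3.Connection, source: str) -> int:
    """Delete every annotation with the given source; return the row count."""
    cur = conn.execute("DELETE FROM span_annotations WHERE source = ?", (source,))
    return cur.rowcount
```

Because detected patterns differ from other annotations only in their `source` value, clearing a detection run is a single filtered delete and leaves all other rows untouched.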
### 3. Training triggered via new FastAPI endpoint, status via polling
**Decision:** Add `POST /training/start` (triggers training in a background thread) and `GET /training/runs` (returns training history from the `training_runs` table). The frontend polls `/training/runs` every 5s while a run is active.

**Rationale:** Training takes minutes, not seconds. A background thread with DB status tracking is the simplest approach. The `training_runs` table already tracks status (`running`/`completed`). Polling is simpler than websockets and sufficient for an operation that runs infrequently.

**Alternatives considered:**
- Synchronous training endpoint — would time out (training takes minutes)
- Celery/Redis task queue — over-engineered for single-user tool
- WebSocket progress — complex, not worth it for rare operations
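The start-then-poll flow can be sketched with the standard library alone. In this sketch an in-memory dict stands in for the `training_runs` table, and `start_training`, `list_runs`, `wait_until_done`, the `failed` status, and the record fields are all illustrative assumptions rather than the service's actual code.

```python
import threading
import time
import uuid
from typing import Any, Callable, Dict, List

# In-memory stand-in for the training_runs table (assumption for this sketch).
_runs: Dict[str, Dict[str, Any]] = {}
_runs_lock = threading.Lock()


def start_training(train_fn: Callable[[str], Dict[str, float]], model_type: str) -> str:
    """Record a 'running' row, start training in a thread, return the run id."""
    run_id = uuid.uuid4().hex
    with _runs_lock:
        _runs[run_id] = {"run_id": run_id, "model_type": model_type,
                         "status": "running", "metrics": None}

    def worker() -> None:
        try:
            update: Dict[str, Any] = {"status": "completed",
                                      "metrics": train_fn(model_type)}
        except Exception as exc:  # "failed" is an assumed extra status
            update = {"status": "failed", "error": str(exc)}
        with _runs_lock:
            _runs[run_id].update(update)

    threading.Thread(target=worker, daemon=True).start()
    return run_id


def list_runs() -> List[Dict[str, Any]]:
    """Return copies of all run records (what GET /training/runs might serve)."""
    with _runs_lock:
        return [dict(run) for run in _runs.values()]


def wait_until_done(run_id: str, interval: float = 5.0, timeout: float = 600.0) -> Dict[str, Any]:
    """Poll the run list until the run leaves the 'running' state."""
    deadline = time.monotonic() + timeout
    while True:
        run = next(r for r in list_runs() if r["run_id"] == run_id)
        if run["status"] != "running":
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still running after {timeout}s")
        time.sleep(interval)
```

The request that starts training returns immediately with a run id, and the caller learns the outcome only through the run list, which mirrors the 5s polling loop described above.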
### 4. Model selection via training run list
**Decision:** The model selector shows completed training runs from the `training_runs` table. Selecting a model sends its `run_id` to a new `POST /model/load` endpoint that loads the model (from MLflow or a local path). The prediction panel then reflects the newly loaded model.

**Rationale:** Training runs are already tracked with metadata (model type, metrics, timestamps). Each run produces a model artifact. Loading by `run_id` is unambiguous.

**Alternatives considered:**
- Scanning filesystem for `.pkl` files — fragile, no metadata
- MLflow model registry stages (staging/production) — too formal for this use case
### 5. Collapsible sidebar sections with accordion pattern
**Decision:** Wrap the TA-Lib, Training, and Prediction panels in collapsible `<details>`/accordion sections. Keeping only one ML-related panel open at a time is recommended but not enforced.

**Rationale:** Adding two new panels to an already full sidebar requires space management. Collapsible sections are the simplest approach, consistent with the existing compact sidebar design.
### 6. Next.js API routes as proxies (existing pattern)
**Decision:** All new FastAPI endpoints are proxied through Next.js API routes (`/api/patterns/*`, `/api/training/*`, `/api/model/*`), following the existing pattern used by `/api/predict` and `/api/model/info`.

**Rationale:** Keeps the frontend hitting a single origin, avoids CORS issues, and maintains consistency with the established architecture.
## Risks / Trade-offs
- **Training blocks the FastAPI process** — Training runs in a background thread sharing the same process. A long training run could affect prediction latency. → Mitigation: Training is rare and the tool is single-user. If it becomes a problem, move to subprocess.
- **Model hot-swap during prediction** — Loading a new model while a prediction request is in-flight could cause errors. → Mitigation: Use a lock around model swap; prediction requests wait briefly.
- **TA-Lib pattern detection on large datasets** — Running 50 patterns on 10k+ candles could be slow. → Mitigation: Let users select specific patterns rather than "run all". Show loading state.
- **Training data preparation** — Training requires an exported/labeled dataset CSV. The UI must make clear that training uses the annotation export, not the raw chart data. → Mitigation: Show which dataset file will be used, with candle/annotation counts.
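The lock-based mitigation for model hot-swap could be sketched as follows. `ModelHolder`, its method names, and the idea of passing a `loader` callable are assumptions for illustration; the model is treated as a plain callable, which may not match the real model object.

```python
import threading
from typing import Any, Callable, Optional, Tuple


class ModelHolder:
    """Holds the active model and guards swaps with a lock (illustrative)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._model: Optional[Callable[[Any], Any]] = None
        self._run_id: Optional[str] = None

    def load(self, run_id: str, loader: Callable[[str], Callable[[Any], Any]]) -> None:
        # Load outside the lock so a slow artifact read does not stall
        # predictions; only the reference swap itself is guarded.
        new_model = loader(run_id)
        with self._lock:
            self._model, self._run_id = new_model, run_id

    def predict(self, features: Any) -> Tuple[str, Any]:
        # Holding the lock for the whole call means a swap waits for the
        # in-flight prediction, and predictions also serialize with each other.
        with self._lock:
            if self._model is None or self._run_id is None:
                raise RuntimeError("no model loaded")
            return self._run_id, self._model(features)
```

Returning the `run_id` alongside the prediction lets the caller confirm which model produced a result. Serializing predictions is acceptable for a single-user tool; a reader-writer scheme would be needed if concurrent predictions mattered.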
## Open Questions
- Should we add a "run full pipeline" button (feature engineering → annotation ingestion → training) or keep these as separate steps?
- Should pattern detection results be auto-saved or require explicit "save to chart" action?