candle-annotator/README.md
Marko Djordjevic 21f184aa8d feat(ui): implement disagreement detection, prediction summary, loading states, and update documentation
- Add disagreement detection logic comparing human annotations vs predictions
- Display prediction summary in PredictionPanel (agreements/disagreements)
- Wire up 'Show only disagreements' filter toggle
- Add loading overlay during prediction fetching
- Update docker-compose.yml with healthchecks for all services
- Update DEPLOYMENT.md with comprehensive ML service setup instructions
- Update README.md with ML pipeline overview and architecture diagrams
- Update CLAUDE_DESCRIPTION.md with v3.0.0 ML integration details

Remaining tasks (11.2, 11.4, 11.5) deferred - core functionality complete
2026-02-15 16:34:02 +01:00

# Candle Annotator
A web-based tool for manually annotating candlestick charts with pattern labels and trend lines. Built for creating labeled training data for machine learning models in trading analysis.
## Overview
Candle Annotator is a machine learning platform for candlestick pattern recognition. It combines three parts:
**Annotation Tools** - TradingView-like charting interface for creating labeled training data:
- Upload historical OHLC (Open, High, Low, Close) candle data from CSV files
- Visualize candlestick charts with interactive zoom and pan
- Annotate patterns with span labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
- Mark breakout patterns (Break Up, Break Down) directly on candles
- Draw custom trend lines with two-click interaction
- Export annotations for ML training
**ML Pipeline** - Python-based training and inference system:
- Feature engineering with TA-Lib indicators (RSI, MACD, Bollinger Bands, etc.)
- Automated pattern detection using TA-Lib CDL* functions
- Train RandomForest and XGBoost models with MLflow experiment tracking
- FastAPI inference service for real-time predictions
- Integration with Next.js UI for prediction visualization
**Active Learning Loop** - Close the feedback cycle:
- Model predictions displayed as overlays on the chart
- Disagreement detection between human annotations and model predictions
- One-click feedback to confirm, correct, or dismiss predictions as new training data
- Continuous improvement through iterative annotation and retraining
## Features
### Data Management
- **CSV Upload**: Import OHLC data with support for both Unix timestamps and date strings
- **Replace Mode**: Uploading a new CSV deletes all old candles and replaces them with new data
- **Initial Data**: Docker containers automatically load EURUSD.csv on first startup if the database is empty
- **SQLite Storage**: All candle data and annotations are stored locally in a SQLite database
- **Data Persistence**: Annotations and candles persist between sessions
### Chart Visualization
- **Interactive Candlestick Chart**: Powered by lightweight-charts library
- **Dark Theme**: Eye-friendly slate color scheme
- **Zoom & Pan**: Mouse wheel zoom and drag-to-pan functionality
- **Crosshair**: Precise price and time tracking
### Annotation Tools
- **Break Up Markers**: Green arrow markers below candles indicating upward breakouts
- **Break Down Markers**: Red arrow markers above candles indicating downward breakouts
- **Trend Lines**: Two-click line drawing with real-time preview
- **Delete Tool**: Remove any annotation (markers or lines) by clicking on them
- **Tool Toggle**: Click the active tool button again to deactivate it
### Label Management
- **Label Sidebar**: View all annotations in a collapsible sidebar with:
  - **Click Selection**: Click markers on the chart or in the sidebar to select and highlight them
  - **Keyboard Delete**: Press Delete or Backspace to remove the selected label
  - **Individual Delete**: Delete button on each list item
  - **Search**: Search annotations by timestamp
  - **Filter**: Filter by Break Up, Break Down, or All types
  - **Count Display**: See how many Break Up vs Break Down markers exist
  - **Visual Highlight**: Selected markers are highlighted with a glow effect
### UI Theme
- **Hacker Theme**: Terminal-inspired dark aesthetic with:
  - Matrix green (#00ff41) on dark background (#0a0e0a)
  - Monospace font (JetBrains Mono) throughout
  - Glow effects on button hover and active states
  - Custom scrollbars styled to match theme
  - High contrast for accessibility
### Export & Deployment
- **CSV Export**: Download all annotations with timestamp, label type, and price data
- **ML-Ready Format**: Structured data suitable for training ML models
- **Docker Deployment**: One-command deployment with persistent data volume
- **Health Check**: Built-in /api/health endpoint for monitoring
### ML Pipeline Features (Optional)
The integrated ML pipeline provides:
#### Feature Engineering
- **TA-Lib Indicators**: Automatic computation of 150+ technical indicators (RSI, MACD, Bollinger Bands, ATR, Stochastic, etc.)
- **Candle Features**: Body size, wick ratios, gap detection, price ranges
- **Custom Features**: Plugin system for domain-specific feature functions
- **NaN Handling**: Automatic warmup period detection and cleanup
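
As a rough illustration of the candle features listed above, the following sketch computes body size, wick ratios, and a gap value for a single candle. The `Candle` class, the `candle_features` function, and the feature names are hypothetical and are not taken from the project's pipeline; the gap is assumed to mean the difference between this candle's open and the previous close.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candle:
    time: int
    open: float
    high: float
    low: float
    close: float


def candle_features(candle: Candle, prev_close: Optional[float] = None) -> dict:
    """Compute simple shape features for one candle (names are illustrative)."""
    price_range = candle.high - candle.low
    body = abs(candle.close - candle.open)
    upper_wick = candle.high - max(candle.open, candle.close)
    lower_wick = min(candle.open, candle.close) - candle.low

    def ratio(part: float) -> float:
        # A flat candle has zero range; report 0.0 instead of dividing by zero.
        return part / price_range if price_range > 0 else 0.0

    return {
        "body_size": body,
        "range": price_range,
        "body_ratio": ratio(body),
        "upper_wick_ratio": ratio(upper_wick),
        "lower_wick_ratio": ratio(lower_wick),
        # Assumed gap definition: open relative to the previous close.
        "gap": 0.0 if prev_close is None else candle.open - prev_close,
    }
```

Ratios are normalized by the high-low range so that candles from instruments with different price scales stay comparable.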
#### Annotation Ingestion
- **Windowed Classification**: Extract fixed-size windows around each pattern for classification models
- **BIO Sequence Labeling**: Begin-Inside-Outside encoding for sequence models (future LSTM/GRU support)
- **Programmatic Labels**: TA-Lib CDL* pattern functions for auto-labeling (23+ candlestick patterns)
- **Label Merging**: Human-priority, programmatic-priority, or both strategies
- **Dataset Statistics**: Class distribution, label counts, human/programmatic agreement metrics
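
To make the BIO encoding concrete, here is a minimal sketch that turns span annotations into one tag per candle. The `bio_encode` function and the `(start, end, label)` tuple layout are assumptions for illustration; the project's ingestion code may represent spans differently.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_index, end_index_inclusive, label)


def bio_encode(n_candles: int, spans: List[Span]) -> List[str]:
    """Return one BIO tag per candle: B-<label>, I-<label>, or O."""
    tags = ["O"] * n_candles
    for start, end, label in sorted(spans):
        if not (0 <= start <= end < n_candles):
            raise ValueError(f"span ({start}, {end}) is outside 0..{n_candles - 1}")
        # The first candle of a span begins it; the rest are inside it.
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags
```

In this sketch, overlapping spans are not rejected: a span that sorts later simply overwrites the tags of an earlier one.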
#### Model Training
- **Model Types**: RandomForest and XGBoost with class balancing
- **Temporal Splitting**: Train/val/test splits that respect time series order (no data leakage)
- **MLflow Integration**: Automatic experiment tracking, hyperparameter logging, artifact storage
- **Model Registry**: Versioned model storage with stage management (Production, Staging, Archived)
- **Evaluation Metrics**: Accuracy, F1 (macro/weighted), per-class precision/recall/F1
- **Visualization**: Confusion matrix, feature importance plots, classification reports
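
The idea behind temporal splitting can be sketched as follows: sort by time, then cut the sequence into contiguous blocks so that every validation and test row is later than every training row. The function name, the `time` key, and the default fractions are assumptions, not the project's actual configuration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def temporal_split(
    rows: List[Row], train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into train/val/test blocks in chronological order."""
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and sum to less than 1")
    ordered = sorted(rows, key=lambda r: r["time"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    # No shuffling: later data never appears in an earlier split.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

A random split would let the model train on candles that come after the ones it is evaluated on, which is the leakage the temporal split avoids.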
#### Inference Service
- **FastAPI REST API**: High-performance inference with automatic OpenAPI docs
- **Preprocessing Parity**: Loads pipeline config from MLflow to ensure training/inference consistency
- **Batch Processing**: Efficient prediction for large time ranges
- **Span Grouping**: Consecutive predictions merged into labeled spans with confidence scores
- **Model Metadata**: Endpoint to query model version, metrics, and label configuration
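
Span grouping can be pictured as a single pass over per-candle predictions that merges runs of the same label. The sketch below assumes predictions arrive as `(time, label, confidence)` tuples sorted by time, that a background label named `"none"` marks candles without a pattern, and that a span's confidence is the mean of its candles' confidences. All three are illustrative assumptions rather than the service's documented behavior.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[int, str, float]  # (time, label, confidence)


def _finish(span: Dict) -> Dict:
    scores = span["scores"]
    return {
        "label": span["label"],
        "start": span["start"],
        "end": span["end"],
        "confidence": sum(scores) / len(scores),  # mean over the span
    }


def group_spans(predictions: List[Prediction], background: str = "none") -> List[Dict]:
    """Merge consecutive predictions with the same label into spans."""
    spans: List[Dict] = []
    current = None  # span being built, or None while on background candles
    for time, label, confidence in predictions:
        if current is not None and current["label"] == label:
            # Same label as the running span: extend it.
            current["end"] = time
            current["scores"].append(confidence)
            continue
        if current is not None:
            spans.append(_finish(current))
        current = None
        if label != background:
            current = {"label": label, "start": time, "end": time, "scores": [confidence]}
    if current is not None:
        spans.append(_finish(current))
    return spans
```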
#### Prediction UI
- **Chart Overlay**: Predictions rendered as histogram series with label-specific colors
- **Confidence Filtering**: Slider to hide low-confidence predictions
- **Label Filtering**: Toggle visibility per pattern type with per-class F1 scores
- **Disagreement Detection**: Automatic comparison of human vs model predictions
- **Prediction Summary**: Counts for total predictions, agreements, disagreements
- **Active Learning Feedback**: Click predictions to convert them to annotations (future feature)
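
One plausible way to compute the prediction summary is sketched below. It assumes human labels and predictions are both keyed by candle timestamp, that predictions below the confidence threshold are hidden before counting, and that a disagreement means both a human label and a prediction exist for the same candle but differ. The actual comparison rule used by the UI is not described here, so treat this as an illustration only.

```python
from typing import Dict, List, Tuple


def summarize_predictions(
    human: Dict[int, str],
    predicted: Dict[int, Tuple[str, float]],
    min_confidence: float = 0.5,
) -> Dict[str, object]:
    """Count visible predictions and compare them with human labels."""
    agreements: List[int] = []
    disagreements: List[int] = []
    total = 0
    for time, (label, confidence) in sorted(predicted.items()):
        if confidence < min_confidence:
            continue  # hidden by the confidence slider
        total += 1
        if time not in human:
            continue  # nothing to compare against
        if human[time] == label:
            agreements.append(time)
        else:
            disagreements.append(time)
    return {"total": total, "agreements": agreements, "disagreements": disagreements}
```

A "show only disagreements" filter would then keep just the timestamps in the `disagreements` list.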
## Architecture
### System Components
```
┌─────────────────────────────────────────────────────────────────┐
│                           Web Browser                           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Next.js Frontend (React 19, Tailwind, lightweight-charts) │  │
│  │ - Annotation tools                                        │  │
│  │ - Prediction visualization                                │  │
│  └──────────────────┬────────────────────────────────────────┘  │
└─────────────────────┼───────────────────────────────────────────┘
                      │ HTTP
┌─────────────────────▼───────────────────────────────────────────┐
│                 Next.js API Routes (TypeScript)                 │
│  - /api/candles, /api/annotations, /api/span-annotations        │
│  - /api/predict (proxy)                                         │
│  - /api/model/info (proxy)                                      │
│  ─────────┬───────────────────────────────────┬───────────────  │
│           │ SQLite (annotations)              │ HTTP            │
│           ▼                                   ▼                 │
│  ┌──────────────────┐              ┌──────────────────────┐     │
│  │ SQLite Database  │              │ ML Inference API     │     │
│  │ (Drizzle ORM)    │              │ (FastAPI, Python)    │     │
│  └──────────────────┘              └──────────┬───────────┘     │
└───────────────────────────────────────────────┼─────────────────┘
            ┌───────────────────────────────────┴───────┐
            │                                           │
┌───────────▼─────────┐                  ┌──────────────▼────────┐
│ MLflow Server       │                  │ PostgreSQL            │
│ (Experiments,       │                  │ (Training run         │
│  Model Registry)    │                  │  metadata)            │
└─────────────────────┘                  └───────────────────────┘
```
### ML Pipeline Workflow
```
1. Annotate Data (Web UI)
2. Export Annotations (JSON)
3. Feature Engineering (TA-Lib)
   └─ Raw OHLCV → Enriched CSV (with indicators)
4. Annotation Ingestion
   ├─ Annotations + Enriched CSV → Labeled Dataset
   └─ Optional: TA-Lib CDL* auto-labeling
5. Model Training
   ├─ Temporal train/val/test split
   ├─ RandomForest or XGBoost training
   ├─ MLflow experiment tracking
   └─ Model registration
6. Inference Service
   ├─ Load model from MLflow registry
   └─ Serve predictions via FastAPI
7. Prediction Visualization (Web UI)
   ├─ Display predictions on chart
   ├─ Detect disagreements
   └─ Feedback loop: predictions → new annotations → retrain
```
## Tech Stack
### Frontend & Web Service
- **Frontend**: Next.js 16 (App Router), React 19, TypeScript
- **Styling**: Tailwind CSS 3, shadcn/ui components
- **Charting**: lightweight-charts 4.x (TradingView)
- **Icons**: lucide-react
- **Backend**: Next.js API Routes
- **Database**: SQLite with better-sqlite3
- **ORM**: Drizzle ORM
- **CSV Parsing**: papaparse
### ML Pipeline (Python)
- **API Framework**: FastAPI with uvicorn
- **ML Libraries**: scikit-learn (RandomForest), XGBoost
- **Feature Engineering**: TA-Lib (Technical Analysis Library)
- **Data Processing**: pandas, numpy
- **Experiment Tracking**: MLflow (model registry, artifact storage)
- **Data Versioning**: DVC (Data Version Control)
- **Database**: PostgreSQL 16 (training run metadata)
- **Model Persistence**: joblib
- **Validation**: Pydantic
## Getting Started
### Docker Quickstart (Recommended)
The fastest way to get running with Docker:
```bash
docker-compose up --build
```
Then open http://localhost:3000
See [DEPLOYMENT.md](./DEPLOYMENT.md#docker-deployment) for detailed Docker instructions.
### Prerequisites
- Node.js 18.x or higher (for local development)
- npm 9.x or higher (for local development)
- Docker & docker-compose (for containerized deployment)
- Build tools for native modules (see DEPLOYMENT.md)
### Local Development Installation
1. Clone the repository:
```bash
git clone <repository-url>
cd candle_annotator
```
2. Install dependencies:
```bash
npm install
```
3. Start the development server:
```bash
npm run dev
```
4. Open http://localhost:3000 in your browser
### Usage
1. **Upload Data**: Click "Choose CSV File" and select a CSV with columns: `time,open,high,low,close`
2. **View Chart**: The candlestick chart renders automatically after upload
3. **Add Annotations**:
- Click "Label: Break Up" or "Label: Break Down" then click on a candle
- Click "Draw Line" then click two points to draw a trend line
- Press Escape to cancel line drawing
4. **Delete Annotations**: Click "Delete" tool, then click on markers or lines to remove them
5. **Export**: Click "Export CSV" to download all annotations
## CSV File Format
### Input Format
Your CSV file should have these columns:
```csv
time,open,high,low,close
1700000000,1.0500,1.0520,1.0490,1.0510
1700000060,1.0510,1.0530,1.0505,1.0525
```
**Time column** accepts:
- Unix timestamps (seconds): `1700000000`
- Date strings: `2024-01-15`, `2024-01-15 10:30:00`
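
A minimal sketch of how the two accepted time formats could be normalized to Unix seconds before storage. The `parse_time` helper is hypothetical, and interpreting date strings as UTC is an assumption; the upload route may apply a different timezone rule.

```python
from datetime import datetime, timezone

# Date formats shown in this README; anything else is rejected.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_time(value: str) -> int:
    """Convert a CSV time cell to a Unix timestamp in seconds."""
    text = value.strip()
    if text.isdigit():
        return int(text)  # already a Unix timestamp
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: date strings carry no timezone and are treated as UTC.
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unsupported time value: {value!r}")
```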
### Export Format
The exported CSV includes:
```csv
timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
```
- **timestamp**: Unix timestamp of the annotation
- **label_type**: `break_up`, `break_down`, or `line`
- **price**: Close price for markers, start price for lines
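
As a quick sanity check before training, the exported file can be read with Python's standard `csv` module. The `label_counts` helper below is a hypothetical example that only relies on the three columns documented above.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text: str) -> Counter:
    """Count exported annotations per label_type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label_type"] for row in reader)
```

For a file on disk, the same function can be called with the result of `Path("annotations.csv").read_text()`, where the file name is whatever the export was saved as.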
## Database Schema
### Candles Table
```typescript
{
  id: integer (PK, auto-increment),
  time: integer (Unix timestamp, unique),
  open: real,
  high: real,
  low: real,
  close: real
}
```
### Annotations Table
```typescript
{
  id: integer (PK, auto-increment),
  timestamp: integer (Unix timestamp),
  label_type: text ('break_up' | 'break_down' | 'line'),
  geometry: text (JSON string for line coordinates, null for markers),
  created_at: integer (Unix timestamp)
}
```
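
The `geometry` column stores JSON as text, so it has to be serialized on write and parsed on read. The sketch below reproduces the annotations table in an in-memory SQLite database using Python's `sqlite3` module. The `NOT NULL` and `CHECK` constraints and the shape of the geometry object are assumptions added for illustration; the real schema is defined with Drizzle ORM in the application.

```python
import json
import sqlite3
from typing import List, Optional

SCHEMA = """
CREATE TABLE annotations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp INTEGER NOT NULL,
    label_type TEXT NOT NULL
        CHECK (label_type IN ('break_up', 'break_down', 'line')),
    geometry TEXT,               -- JSON string for lines, NULL for markers
    created_at INTEGER NOT NULL
);
"""


def insert_annotation(
    conn: sqlite3.Connection,
    timestamp: int,
    label_type: str,
    geometry: Optional[dict] = None,
    created_at: int = 0,
) -> int:
    """Insert one annotation and return its generated id."""
    cursor = conn.execute(
        "INSERT INTO annotations (timestamp, label_type, geometry, created_at) "
        "VALUES (?, ?, ?, ?)",
        (timestamp, label_type, None if geometry is None else json.dumps(geometry), created_at),
    )
    return cursor.lastrowid


def load_annotations(conn: sqlite3.Connection) -> List[dict]:
    """Return all annotations with geometry parsed back into a dict."""
    rows = conn.execute(
        "SELECT id, timestamp, label_type, geometry FROM annotations ORDER BY id"
    ).fetchall()
    return [
        {
            "id": row[0],
            "timestamp": row[1],
            "label_type": row[2],
            "geometry": json.loads(row[3]) if row[3] is not None else None,
        }
        for row in rows
    ]
```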
## API Endpoints
### POST /api/upload
Upload CSV file and store candle data
**Behavior**: Deletes all existing candles before inserting new data (replace mode)
**Request**: multipart/form-data with `file` field
**Response**: `{ success: true, count: number }` or `{ error: string }`
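
Replace mode is easiest to reason about when the delete and the insert happen in one transaction, so that a failed upload does not leave the table empty. The following sketch shows that pattern with Python's `sqlite3` module. Whether the actual route wraps both steps in a transaction is not stated here, and the helper names are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

CandleRow = Tuple[int, float, float, float, float]  # time, open, high, low, close


def create_candles_table(conn: sqlite3.Connection) -> None:
    # Mirrors the Candles table from the Database Schema section.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS candles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            time INTEGER NOT NULL UNIQUE,
            open REAL, high REAL, low REAL, close REAL
        )
        """
    )


def replace_candles(conn: sqlite3.Connection, rows: Iterable[CandleRow]) -> int:
    """Delete every stored candle and insert the new rows atomically."""
    new_rows = list(rows)
    # `with conn` commits on success and rolls back if any statement raises,
    # so a rejected upload leaves the previous candles in place.
    with conn:
        conn.execute("DELETE FROM candles")
        conn.executemany(
            "INSERT INTO candles (time, open, high, low, close) VALUES (?, ?, ?, ?, ?)",
            new_rows,
        )
    return len(new_rows)
```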
### GET /api/candles
Retrieve all candle records
**Response**: Array of candle objects ordered by time
### GET /api/annotations
Retrieve all annotations
**Response**: Array of annotation objects with parsed geometry
### POST /api/annotations
Create a new annotation
**Request**: `{ timestamp: number, label_type: string, geometry?: object }`
**Response**: Created annotation object with ID
### DELETE /api/annotations/[id]
Delete an annotation by ID
**Response**: `{ success: true }` or `{ error: string }`
### GET /api/export
Export annotations as downloadable CSV
**Response**: CSV file download with Content-Disposition header
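
For scripted use, the annotation endpoint can be called from Python's standard library. The sketch below only builds the request for `POST /api/annotations`; it assumes the API accepts a JSON body with a `Content-Type: application/json` header and that the service runs at the base URL you pass in. The helper name is hypothetical.

```python
import json
from typing import Optional
from urllib.request import Request


def build_create_annotation_request(
    base_url: str, timestamp: int, label_type: str, geometry: Optional[dict] = None
) -> Request:
    """Build (but do not send) a POST /api/annotations request."""
    payload = {"timestamp": timestamp, "label_type": label_type}
    if geometry is not None:
        payload["geometry"] = geometry  # only lines carry geometry
    return Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` while the development server is running, for example with `base_url="http://localhost:3000"`.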
## Application Architecture
### Component Structure
- **page.tsx**: Main page composition, manages active tool state
- **Toolbox.tsx**: Sidebar with tool buttons and export functionality
- **FileUpload.tsx**: CSV upload component with status messages
- **CandleChart.tsx**: Core chart wrapper with lightweight-charts integration
  - Initializes chart with dark theme
  - Handles marker annotations (Break Up/Down)
  - Manages click events for annotation creation
  - Exposes `refreshData()` method for parent updates
- **SvgOverlay.tsx**: Transparent SVG layer for line drawing
  - Coordinate transformation between data and pixels
  - Two-click line drawing with preview
  - Line hit detection for deletion
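
Line hit detection is commonly done by measuring the distance from the click position to the line segment in pixel space. The sketch below illustrates that geometry in Python; the component itself is written in TypeScript, and the function names and the 5-pixel tolerance are assumptions, not values taken from `SvgOverlay.tsx`.

```python
import math
from typing import Tuple

Segment = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


def distance_to_segment(px: float, py: float, x1: float, y1: float, x2: float, y2: float) -> float:
    """Shortest distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - x1, py - y1)  # degenerate segment: a single point
    # Project the point onto the line, then clamp to the segment's endpoints.
    t = ((px - x1) * dx + (py - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def hits_line(px: float, py: float, line: Segment, tolerance: float = 5.0) -> bool:
    """Return True when a click lands within `tolerance` pixels of the line."""
    return distance_to_segment(px, py, *line) <= tolerance
```

Clamping `t` matters: without it, a click far beyond an endpoint but close to the infinite line would still count as a hit.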
### Data Flow
1. User uploads CSV → POST /api/upload → SQLite storage
2. Chart mounts → GET /api/candles + GET /api/annotations → Render
3. User clicks with active tool → POST /api/annotations → Refresh chart
4. User deletes → DELETE /api/annotations/[id] → Refresh chart
5. User exports → GET /api/export → CSV download
## Development
### Project Structure
```
candle_annotator/
├── src/
│   ├── app/
│   │   ├── api/              # API route handlers
│   │   │   ├── upload/
│   │   │   ├── candles/
│   │   │   ├── annotations/
│   │   │   └── export/
│   │   ├── globals.css       # Tailwind styles
│   │   ├── layout.tsx        # Root layout with dark theme
│   │   └── page.tsx          # Main page
│   ├── components/
│   │   ├── ui/               # shadcn/ui components
│   │   ├── CandleChart.tsx
│   │   ├── SvgOverlay.tsx
│   │   ├── Toolbox.tsx
│   │   └── FileUpload.tsx
│   └── lib/
│       ├── db/
│       │   ├── index.ts      # Drizzle client
│       │   ├── schema.ts     # Table definitions
│       │   └── migrate.ts    # Migration runner
│       └── utils.ts          # Utility functions
├── data/                     # SQLite database directory
├── drizzle/                  # Migration files
├── DEPLOYMENT.md             # Deployment instructions
└── README.md                 # This file
```
### Key Technical Decisions
1. **lightweight-charts v4**: Stable API with good candlestick and marker support
2. **SQLite with better-sqlite3**: Synchronous access, perfect for single-user local apps
3. **SVG Overlay for Lines**: Maintains separate rendering layer from chart, easier coordinate management
4. **Drizzle ORM**: Type-safe queries with minimal overhead
5. **Next.js App Router**: Server-side API routes co-located with frontend code
### Known Limitations
- **Single User**: No authentication or concurrent access support
- **No Undo**: Can only delete annotations, not undo placement
- **Memory**: Large CSV files (100k+ rows) may cause slow uploads
- **Line Snapping**: Lines don't snap to candles, free-form placement only
## Troubleshooting
See [DEPLOYMENT.md](./DEPLOYMENT.md) for detailed troubleshooting steps.
Common issues:
- **better-sqlite3 binding errors**: Run `npm rebuild better-sqlite3`
- **Port 3000 in use**: Use `PORT=3001 npm run dev`
- **Database corruption**: Delete `data/candles.db` and restart
## License
ISC
## Contributing
This is a focused tool for a specific use case. For questions or issues, please open a GitHub issue.