
# Candle Annotator

A web-based tool for manually annotating candlestick charts with pattern labels and trend lines. Built for creating labeled training data for machine learning models in trading analysis.

## Overview

Candle Annotator is a complete machine learning platform for candlestick pattern recognition, combining:

**Annotation Tools** - TradingView-like charting interface for creating labeled training data:
- Upload historical OHLC (Open, High, Low, Close) candle data from CSV files
- Visualize candlestick charts with interactive zoom and pan
- Annotate patterns with span labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
- Mark breakout patterns (Break Up, Break Down) directly on candles
- Draw custom trend lines with two-click interaction
- Export annotations for ML training

**ML Pipeline** - Python-based training and inference system:
- Feature engineering with TA-Lib indicators (RSI, MACD, Bollinger Bands, etc.)
- Automated pattern detection using TA-Lib CDL* functions
- Train RandomForest and XGBoost models with MLflow experiment tracking
- FastAPI inference service for real-time predictions
- Integration with Next.js UI for prediction visualization

**Active Learning Loop** - Close the feedback cycle:
- Model predictions displayed as overlays on the chart
- Disagreement detection between human annotations and model predictions
- One-click feedback to confirm, correct, or dismiss predictions as new training data
- Continuous improvement through iterative annotation and retraining

## Features

### Data Management
- **CSV Upload**: Import OHLC data with support for both Unix timestamps and date strings
  - **Replace Mode**: Uploading a new CSV deletes all old candles and replaces them with new data
  - **Initial Data**: Docker containers automatically load EURUSD.csv on first startup if the database is empty
- **SQLite Storage**: All candle data and annotations are stored locally in a SQLite database
- **Data Persistence**: Annotations and candles persist between sessions

### Chart Visualization
- **Interactive Candlestick Chart**: Powered by lightweight-charts library
- **Dark Theme**: Eye-friendly slate color scheme
- **Zoom & Pan**: Mouse wheel zoom and drag-to-pan functionality
- **Crosshair**: Precise price and time tracking

### Annotation Tools
- **Break Up Markers**: Green arrow markers below candles indicating upward breakouts
- **Break Down Markers**: Red arrow markers above candles indicating downward breakouts
- **Trend Lines**: Two-click line drawing with real-time preview
- **Delete Tool**: Remove any annotation (markers or lines) by clicking on them
- **Tool Toggle**: Click the active tool's button again to deactivate it

### Label Management
- **Label Sidebar**: View all annotations in a collapsible sidebar with:
  - **Click Selection**: Click markers on chart or in sidebar to select/highlight
  - **Keyboard Delete**: Press Delete or Backspace to remove selected label
  - **Individual Delete**: Delete button on each list item
  - **Search**: Search annotations by timestamp
  - **Filter**: Filter by Break Up, Break Down, or All types
  - **Count Display**: See how many Break Up vs Break Down markers exist
  - **Visual Highlight**: Selected markers highlighted with glow effect

### UI Theme
- **Hacker Theme**: Terminal-inspired dark aesthetic with:
  - Matrix green (#00ff41) on dark background (#0a0e0a)
  - Monospace font (JetBrains Mono) throughout
  - Glow effects on button hover and active states
  - Custom scrollbars styled to match theme
  - High contrast for accessibility

### Export & Deployment
- **CSV Export**: Download all annotations with timestamp, label type, and price data
- **ML-Ready Format**: Structured data suitable for training ML models
- **Docker Deployment**: One-command deployment with persistent data volume
- **Health Check**: Built-in /api/health endpoint for monitoring

### ML Pipeline Features (Optional)

The integrated ML pipeline provides:

#### Feature Engineering
- **TA-Lib Indicators**: Automatic computation of 150+ technical indicators (RSI, MACD, Bollinger Bands, ATR, Stochastic, etc.)
- **Candle Features**: Body size, wick ratios, gap detection, price ranges
- **Custom Features**: Plugin system for domain-specific feature functions
- **NaN Handling**: Automatic warmup period detection and cleanup
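
As a rough illustration of the candle features listed above, the sketch below computes body size, wick ratios, range, and gap for a single candle. The `Candle` class, the `candle_features` function, and the exact feature definitions are assumptions made for this example, not the pipeline's actual code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Candle:
    """Hypothetical container mirroring the OHLC columns of the input CSV."""
    time: int
    open: float
    high: float
    low: float
    close: float


def candle_features(candle: Candle, prev_close: Optional[float] = None) -> Dict[str, float]:
    """Compute simple shape features for one candle (illustrative definitions)."""
    price_range = candle.high - candle.low
    body = abs(candle.close - candle.open)
    upper_wick = candle.high - max(candle.open, candle.close)
    lower_wick = min(candle.open, candle.close) - candle.low
    # Guard against zero-range candles (flat bars) to avoid division by zero.
    denom = price_range if price_range > 0 else 1.0
    return {
        "body_size": body,
        "range": price_range,
        "body_ratio": body / denom,
        "upper_wick_ratio": upper_wick / denom,
        "lower_wick_ratio": lower_wick / denom,
        # Gap: this open minus the previous close; 0.0 when there is no previous candle.
        "gap": 0.0 if prev_close is None else candle.open - prev_close,
    }
```

Ratios are normalized by the candle's high-low range so that they stay comparable across instruments with different price scales.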

#### Annotation Ingestion
- **Windowed Classification**: Extract fixed-size windows around each pattern for classification models
- **BIO Sequence Labeling**: Begin-Inside-Outside encoding for sequence models (future LSTM/GRU support)
- **Programmatic Labels**: TA-Lib CDL* pattern functions for auto-labeling (23+ candlestick patterns)
- **Label Merging**: Human-priority, programmatic-priority, or both strategies
- **Dataset Statistics**: Class distribution, label counts, human/programmatic agreement metrics
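
To make the BIO encoding concrete, here is a minimal sketch that turns span annotations into one tag per candle. The function name `bio_encode` and the `(start_index, end_index, label)` span format are assumptions for illustration; the real ingestion code may represent spans differently.

```python
from typing import List, Tuple


def bio_encode(n_candles: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Return one BIO tag per candle.

    Each span is (start_index, end_index_inclusive, label). Candles outside
    every span are tagged "O". If spans overlap, the later span (in sorted
    order) overwrites the earlier one; that policy is an assumption.
    """
    tags = ["O"] * n_candles
    for start, end, label in sorted(spans):
        if not (0 <= start <= end < n_candles):
            raise ValueError(f"span ({start}, {end}) is outside 0..{n_candles - 1}")
        tags[start] = f"B-{label}"          # first candle of the pattern
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"          # remaining candles of the pattern
    return tags
```

For example, a three-candle "hammer" span starting at index 1 in a six-candle series yields `B-hammer, I-hammer, I-hammer` at indices 1 to 3 and `O` elsewhere.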

#### Model Training
- **Model Types**: RandomForest and XGBoost with class balancing
- **Temporal Splitting**: Train/val/test splits that respect time series order, so later candles never leak into the training set
- **MLflow Integration**: Automatic experiment tracking, hyperparameter logging, artifact storage
- **Model Registry**: Versioned model storage with stage management (Production, Staging, Archived)
- **Evaluation Metrics**: Accuracy, F1 (macro/weighted), per-class precision/recall/F1
- **Visualization**: Confusion matrix, feature importance plots, classification reports
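
The idea behind temporal splitting can be sketched in a few lines: sort by time, then cut the series at fixed positions instead of shuffling. The function name `temporal_split` and the 70/15/15 default fractions are assumptions for this example, not values taken from the pipeline.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def temporal_split(
    rows: Sequence[Row], train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into train/val/test by time order, without shuffling."""
    if not (0 < train_frac and 0 <= val_frac and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for a test set")
    ordered = sorted(rows, key=lambda r: r["time"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    # Every training row precedes every validation row, which precedes every test row.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Because the cut points are positional, the model is always evaluated on candles that come after everything it was trained on.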

#### Inference Service
- **FastAPI REST API**: High-performance inference with automatic OpenAPI docs
- **Preprocessing Parity**: Loads pipeline config from MLflow to ensure training/inference consistency
- **Batch Processing**: Efficient prediction for large time ranges
- **Span Grouping**: Consecutive predictions merged into labeled spans with confidence scores
- **Model Metadata**: Endpoint to query model version, metrics, and label configuration
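
Span grouping could work roughly as follows: walk the per-candle predictions in time order and merge runs of the same label into one span. This is a sketch under assumptions: the `Span` class, the `"O"` background label, and the use of the mean as the span confidence are all illustrative choices, not the service's documented behavior.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Span:
    label: str
    start: int         # time of the first candle in the run
    end: int           # time of the last candle in the run
    confidence: float  # mean per-candle confidence (assumed aggregation)


def group_spans(
    predictions: Sequence[Tuple[int, str, float]], background: str = "O"
) -> List[Span]:
    """Merge consecutive (time, label, confidence) predictions into spans.

    `predictions` must already be sorted by time. Runs of the background
    label are dropped.
    """
    spans: List[Span] = []
    run: List[Tuple[int, str, float]] = []

    def flush() -> None:
        # Close the current run and record it unless it is background.
        if run and run[0][1] != background:
            scores = [conf for _, _, conf in run]
            spans.append(Span(run[0][1], run[0][0], run[-1][0], sum(scores) / len(scores)))
        run.clear()

    for item in predictions:
        if run and item[1] != run[-1][1]:
            flush()
        run.append(item)
    flush()
    return spans
```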

#### Prediction UI
- **Chart Overlay**: Predictions rendered as histogram series with label-specific colors
- **Confidence Filtering**: Slider to hide low-confidence predictions
- **Label Filtering**: Toggle visibility per pattern type with per-class F1 scores
- **Disagreement Detection**: Automatic comparison of human vs model predictions
- **Prediction Summary**: Counts for total predictions, agreements, disagreements
- **Active Learning Feedback**: Click predictions to convert them to annotations (future feature)
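
The confidence filter and disagreement detection can be illustrated together. The sketch below assumes human labels and model predictions are both keyed by candle timestamp; `compare_labels`, its arguments, and the summary keys are hypothetical names chosen for this example.

```python
from typing import Dict, List, Tuple


def compare_labels(
    human: Dict[int, str],
    model: Dict[int, Tuple[str, float]],
    min_confidence: float = 0.5,
) -> Tuple[Dict[str, int], List[Tuple[int, str, str]]]:
    """Compare model predictions with human annotations at matching timestamps.

    Returns a summary of counts and a list of (time, human_label, model_label)
    disagreements. Predictions below `min_confidence` are ignored, which
    mimics the confidence slider.
    """
    summary = {"total": 0, "agreements": 0, "disagreements": 0}
    disagreements: List[Tuple[int, str, str]] = []
    for time, (label, confidence) in sorted(model.items()):
        if confidence < min_confidence:
            continue  # hidden by the confidence filter
        summary["total"] += 1
        human_label = human.get(time)
        if human_label is None:
            continue  # nothing to compare against at this candle
        if human_label == label:
            summary["agreements"] += 1
        else:
            summary["disagreements"] += 1
            disagreements.append((time, human_label, label))
    return summary, disagreements
```

Predictions at candles with no human annotation count toward the total but are neither agreements nor disagreements; these are natural candidates for review.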

## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────────────┐
│                         Web Browser                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Next.js Frontend (React 19, Tailwind, lightweight-charts)│  │
│  │  - Annotation tools                                        │  │
│  │  - Prediction visualization                                │  │
│  └──────────────────┬──────────────────────────────────────────┘  │
└─────────────────────┼──────────────────────────────────────────────┘
                      │ HTTP
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Next.js API Routes (TypeScript)                                 │
│  - /api/candles, /api/annotations, /api/span-annotations        │
│  - /api/predict (proxy)                                          │
│  - /api/model/info (proxy)                                       │
│  └───────────┬─────────────────────────────────────┬───────────  │
│              │ SQLite (annotations)                │ HTTP        │
│              ▼                                     ▼             │
│  ┌──────────────────┐                 ┌──────────────────────┐  │
│  │  SQLite Database │                 │  ML Inference API    │  │
│  │  (Drizzle ORM)   │                 │  (FastAPI, Python)   │  │
│  └──────────────────┘                 └──────────┬───────────┘  │
└────────────────────────────────────────────────────┼──────────────┘
                                                     │
                      ┌──────────────────────────────┴───────────────┐
                      │                                              │
          ┌───────────▼─────────┐                    ┌──────────────▼────────┐
          │  MLflow Server      │                    │  PostgreSQL           │
          │  (Experiments,      │                    │  (Training run        │
          │   Model Registry)   │                    │   metadata)           │
          └─────────────────────┘                    └───────────────────────┘
```

### ML Pipeline Workflow

```
1. Annotate Data (Web UI)
   ↓
2. Export Annotations (JSON)
   ↓
3. Feature Engineering (TA-Lib)
   ├─ Raw OHLCV → Enriched CSV (with indicators)
   ↓
4. Annotation Ingestion
   ├─ Annotations + Enriched CSV → Labeled Dataset
   ├─ Optional: TA-Lib CDL* auto-labeling
   ↓
5. Model Training
   ├─ Temporal train/val/test split
   ├─ RandomForest or XGBoost training
   ├─ MLflow experiment tracking
   ├─ Model registration
   ↓
6. Inference Service
   ├─ Load model from MLflow registry
   ├─ Serve predictions via FastAPI
   ↓
7. Prediction Visualization (Web UI)
   ├─ Display predictions on chart
   ├─ Detect disagreements
   ├─ Feedback loop: predictions → new annotations → retrain
```
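
Step 4 can combine human annotations with TA-Lib auto-labels. The sketch below shows one plausible reading of the three merge strategies named under Annotation Ingestion. The function `merge_labels`, the strategy names, and the semantics of each strategy are assumptions for illustration; check the pipeline code for the actual behavior.

```python
from typing import Dict, List


def merge_labels(
    human: Dict[int, str], programmatic: Dict[int, str], strategy: str = "human"
) -> Dict[int, List[str]]:
    """Merge two label sources keyed by candle timestamp.

    strategy:
      "human"         human label wins where both exist
      "programmatic"  programmatic label wins where both exist
      "both"          keep every distinct label at each timestamp
    """
    if strategy not in ("human", "programmatic", "both"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    merged: Dict[int, List[str]] = {}
    for time in sorted(set(human) | set(programmatic)):
        h, p = human.get(time), programmatic.get(time)
        if strategy == "human":
            merged[time] = [h if h is not None else p]
        elif strategy == "programmatic":
            merged[time] = [p if p is not None else h]
        else:
            merged[time] = sorted({label for label in (h, p) if label is not None})
    return merged
```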

## Tech Stack

### Frontend & Web Service
- **Frontend**: Next.js 16 (App Router), React 19, TypeScript
- **Styling**: Tailwind CSS 3, shadcn/ui components
- **Charting**: lightweight-charts 4.x (TradingView)
- **Icons**: lucide-react
- **Backend**: Next.js API Routes
- **Database**: SQLite with better-sqlite3
- **ORM**: Drizzle ORM
- **CSV Parsing**: papaparse

### ML Pipeline (Python)
- **API Framework**: FastAPI with uvicorn
- **ML Libraries**: scikit-learn (RandomForest), XGBoost
- **Feature Engineering**: TA-Lib (Technical Analysis Library)
- **Data Processing**: pandas, numpy
- **Experiment Tracking**: MLflow (model registry, artifact storage)
- **Data Versioning**: DVC (Data Version Control)
- **Database**: PostgreSQL 16 (training run metadata)
- **Model Persistence**: joblib
- **Validation**: Pydantic

## Getting Started

### Docker Quickstart (Recommended)

The fastest way to get running with Docker:

```bash
docker-compose up --build
```

Then open http://localhost:3000

See [DEPLOYMENT.md](./DEPLOYMENT.md#docker-deployment) for detailed Docker instructions.

### Prerequisites

- Node.js 20.9 or higher (for local development; Next.js 16 no longer supports Node.js 18)
- npm 9.x or higher (for local development)
- Docker & docker-compose (for containerized deployment)
- Build tools for native modules (see DEPLOYMENT.md)

### Local Development Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd candle_annotator
   ```

2. Install dependencies:
   ```bash
   npm install
   ```

3. Start the development server:
   ```bash
   npm run dev
   ```

4. Open http://localhost:3000 in your browser

### Usage

1. **Upload Data**: Click "Choose CSV File" and select a CSV with columns: `time,open,high,low,close`
2. **View Chart**: The candlestick chart renders automatically after upload
3. **Add Annotations**:
   - Click "Label: Break Up" or "Label: Break Down" then click on a candle
   - Click "Draw Line" then click two points to draw a trend line
   - Press Escape to cancel line drawing
4. **Delete Annotations**: Click "Delete" tool, then click on markers or lines to remove them
5. **Export**: Click "Export CSV" to download all annotations

## CSV File Format

### Input Format

Your CSV file should have these columns:

```csv
time,open,high,low,close
1700000000,1.0500,1.0520,1.0490,1.0510
1700000060,1.0510,1.0530,1.0505,1.0525
```

**Time column** accepts:
- Unix timestamps (seconds): `1700000000`
- Date strings: `2024-01-15`, `2024-01-15 10:30:00`
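
If you need to pre-process a file before upload, the accepted time formats can be normalized to Unix seconds along these lines. This is a sketch, not the uploader's actual parsing logic: the function name `to_unix_seconds` is hypothetical, and treating date strings as UTC is an assumption.

```python
from datetime import datetime, timezone

# Formats taken from the examples above; date strings are assumed to be UTC.
_DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def to_unix_seconds(value: str) -> int:
    """Convert a time cell (Unix seconds or a date string) to Unix seconds."""
    text = value.strip()
    if text.isdigit():
        return int(text)  # already a Unix timestamp in seconds
    for fmt in _DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"Unrecognized time value: {value!r}")
```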

### Export Format

The exported CSV includes:

```csv
timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
```

- **timestamp**: Unix timestamp of the annotation
- **label_type**: `break_up`, `break_down`, or `line`
- **price**: Close price for markers, start price for lines
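
A downstream script can load the export with Python's standard `csv` module. The sketch below parses the three documented columns and counts annotations per label type; `read_export` and `label_counts` are hypothetical helper names.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, Union

ExportRow = Dict[str, Union[int, float, str]]


def read_export(csv_text: str) -> List[ExportRow]:
    """Parse exported annotations into typed rows."""
    rows: List[ExportRow] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {
                "timestamp": int(row["timestamp"]),
                "label_type": row["label_type"],
                "price": float(row["price"]),
            }
        )
    return rows


def label_counts(rows: List[ExportRow]) -> Counter:
    """Count annotations per label type, e.g. to check class balance."""
    return Counter(row["label_type"] for row in rows)
```

To read from disk instead, pass `open(path, newline="").read()` as `csv_text`.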

## Database Schema

### Candles Table

```typescript
{
  id: integer (PK, auto-increment),
  time: integer (Unix timestamp, unique),
  open: real,
  high: real,
  low: real,
  close: real
}
```

### Annotations Table

```typescript
{
  id: integer (PK, auto-increment),
  timestamp: integer (Unix timestamp),
  label_type: text ('break_up' | 'break_down' | 'line'),
  geometry: text (JSON string for line coordinates, null for markers),
  created_at: integer (Unix timestamp)
}
```

## API Endpoints

### POST /api/upload
Upload CSV file and store candle data

- **Behavior**: Deletes all existing candles before inserting new data (replace mode)
- **Request**: multipart/form-data with `file` field
- **Response**: `{ success: true, count: number }` or `{ error: string }`
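
Replace mode is easiest to reason about as a single transaction: delete, insert, and commit together, so a failed upload leaves the old candles in place. The app itself uses Drizzle with better-sqlite3; the Python `sqlite3` sketch below only illustrates the idea, and whether the real route wraps both steps in one transaction is an assumption. The table definition follows the Candles schema in this README.

```python
import sqlite3
from typing import List, Tuple

CandleRow = Tuple[int, float, float, float, float]  # time, open, high, low, close

CANDLES_DDL = """
CREATE TABLE IF NOT EXISTS candles (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    time  INTEGER UNIQUE,
    open  REAL,
    high  REAL,
    low   REAL,
    close REAL
)
"""


def replace_candles(conn: sqlite3.Connection, candles: List[CandleRow]) -> int:
    """Delete all existing candles and insert the new ones atomically."""
    # `with conn` commits on success and rolls back if any statement raises,
    # so a bad row (e.g. a duplicate time) does not wipe the existing data.
    with conn:
        conn.execute("DELETE FROM candles")
        conn.executemany(
            "INSERT INTO candles (time, open, high, low, close) VALUES (?, ?, ?, ?, ?)",
            candles,
        )
    return len(candles)
```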

### GET /api/candles
Retrieve all candle records

**Response**: Array of candle objects ordered by time

### GET /api/annotations
Retrieve all annotations

**Response**: Array of annotation objects with parsed geometry

### POST /api/annotations
Create a new annotation

- **Request**: `{ timestamp: number, label_type: string, geometry?: object }`
- **Response**: Created annotation object with ID

### DELETE /api/annotations/[id]
Delete an annotation by ID

**Response**: `{ success: true }` or `{ error: string }`

### GET /api/export
Export annotations as downloadable CSV

**Response**: CSV file download with Content-Disposition header
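
For scripted access, the annotation endpoints can be called with the Python standard library. The sketch below only builds the requests, so it can be inspected without a running server. The helper names are hypothetical, the base URL assumes the default local setup, and sending the body as JSON is an assumption based on the request shape shown above.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:3000"  # default address from Getting Started


def create_annotation_request(
    timestamp: int,
    label_type: str,
    geometry: Optional[dict] = None,
    base_url: str = BASE_URL,
) -> Request:
    """Build a POST /api/annotations request (assumes a JSON body)."""
    payload = {"timestamp": timestamp, "label_type": label_type}
    if geometry is not None:
        # The geometry object's fields are not specified in this README.
        payload["geometry"] = geometry
    return Request(
        f"{base_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_annotation_request(annotation_id: int, base_url: str = BASE_URL) -> Request:
    """Build a DELETE /api/annotations/[id] request."""
    return Request(f"{base_url}/api/annotations/{annotation_id}", method="DELETE")
```

With the server running, pass either request to `urllib.request.urlopen` and decode the JSON response.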

## Frontend Architecture

### Component Structure

- **page.tsx**: Main page composition, manages active tool state
- **Toolbox.tsx**: Sidebar with tool buttons and export functionality
- **FileUpload.tsx**: CSV upload component with status messages
- **CandleChart.tsx**: Core chart wrapper with lightweight-charts integration
  - Initializes chart with dark theme
  - Handles marker annotations (Break Up/Down)
  - Manages click events for annotation creation
  - Exposes `refreshData()` method for parent updates
- **SvgOverlay.tsx**: Transparent SVG layer for line drawing
  - Coordinate transformation between data and pixels
  - Two-click line drawing with preview
  - Line hit detection for deletion
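
Line hit detection typically reduces to a point-to-segment distance check in pixel space. The sketch below shows that geometry in Python for brevity; the actual component is TypeScript, and the function names and the 6-pixel tolerance are assumptions for illustration.

```python
import math


def distance_to_segment(
    px: float, py: float, x1: float, y1: float, x2: float, y2: float
) -> float:
    """Shortest distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - x1, py - y1)
    # Project the point onto the line, then clamp to the segment's extent.
    t = ((px - x1) * dx + (py - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def hits_line(
    px: float, py: float, x1: float, y1: float, x2: float, y2: float,
    tolerance_px: float = 6.0,
) -> bool:
    """True if a click at (px, py) lands within `tolerance_px` of the line."""
    return distance_to_segment(px, py, x1, y1, x2, y2) <= tolerance_px
```

Clamping `t` to the range 0 to 1 matters: without it, clicks far beyond either endpoint but near the infinite extension of the line would still register as hits.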

### Data Flow

1. User uploads CSV → POST /api/upload → SQLite storage
2. Chart mounts → GET /api/candles + GET /api/annotations → Render
3. User clicks with active tool → POST /api/annotations → Refresh chart
4. User deletes → DELETE /api/annotations/[id] → Refresh chart
5. User exports → GET /api/export → CSV download

## Development

### Project Structure

```
candle_annotator/
├── src/
│   ├── app/
│   │   ├── api/              # API route handlers
│   │   │   ├── upload/
│   │   │   ├── candles/
│   │   │   ├── annotations/
│   │   │   └── export/
│   │   ├── globals.css       # Tailwind styles
│   │   ├── layout.tsx        # Root layout with dark theme
│   │   └── page.tsx          # Main page
│   ├── components/
│   │   ├── ui/               # shadcn/ui components
│   │   ├── CandleChart.tsx
│   │   ├── SvgOverlay.tsx
│   │   ├── Toolbox.tsx
│   │   └── FileUpload.tsx
│   └── lib/
│       ├── db/
│       │   ├── index.ts      # Drizzle client
│       │   ├── schema.ts     # Table definitions
│       │   └── migrate.ts    # Migration runner
│       └── utils.ts          # Utility functions
├── data/                     # SQLite database directory
├── drizzle/                  # Migration files
├── DEPLOYMENT.md             # Deployment instructions
└── README.md                 # This file
```

### Key Technical Decisions

1. **lightweight-charts v4**: Stable API with good candlestick and marker support
2. **SQLite with better-sqlite3**: Synchronous access, well suited to single-user local apps
3. **SVG Overlay for Lines**: Keeps line rendering in a layer separate from the chart, which simplifies coordinate management
4. **Drizzle ORM**: Type-safe queries with minimal overhead
5. **Next.js App Router**: Server-side API routes co-located with frontend code

### Known Limitations

- **Single User**: No authentication or concurrent access support
- **No Undo**: Can only delete annotations, not undo placement
- **Large Files**: CSV files with 100k+ rows may upload slowly
- **Line Snapping**: Lines do not snap to candles; placement is free-form only

## Troubleshooting

See [DEPLOYMENT.md](./DEPLOYMENT.md) for detailed troubleshooting steps.

Common issues:
- **better-sqlite3 binding errors**: Run `npm rebuild better-sqlite3`
- **Port 3000 in use**: Use `PORT=3001 npm run dev`
- **Database corruption**: Delete `data/candles.db` and restart

## License

ISC

## Contributing

This is a focused tool for a specific use case. For questions or issues, please open a GitHub issue.