# Candle Annotator

A web-based tool for manually annotating candlestick charts with pattern labels and trend lines. Built for creating labeled training data for machine learning models in trading analysis.

## Overview

Candle Annotator is a complete machine learning platform for candlestick pattern recognition, combining:

**Annotation Tools** - TradingView-like charting interface for creating labeled training data:
- Upload historical OHLC (Open, High, Low, Close) candle data from CSV files
- Visualize candlestick charts with interactive zoom and pan
- Annotate patterns with span labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
- Mark breakout patterns (Break Up, Break Down) directly on candles
- Draw custom trend lines with two-click interaction
- Export annotations for ML training

**ML Pipeline** - Python-based training and inference system:
- Feature engineering with TA-Lib indicators (RSI, MACD, Bollinger Bands, etc.)
- Automated pattern detection using TA-Lib CDL* functions
- Train RandomForest and XGBoost models with MLflow experiment tracking
- FastAPI inference service for real-time predictions
- Integration with Next.js UI for prediction visualization

**Active Learning Loop** - Close the feedback cycle:
- Model predictions displayed as overlays on the chart
- Disagreement detection between human annotations and model predictions
- One-click feedback to confirm, correct, or dismiss predictions as new training data
- Continuous improvement through iterative annotation and retraining

## Features

### User Accounts & Authentication
- **Multi-User Support**: Each user works in an isolated, per-user workspace
- **Auth.js v5 Integration**: Flexible authentication system supporting multiple sign-in methods
- **Sign-In Methods**:
  - **Credentials (Email/Password)**: Traditional email and password authentication with bcryptjs hashing
  - **Google OAuth**: One-click sign-in with Google accounts
- **Registration**: Self-service account creation with email validation and password requirements (minimum 8 characters)
- **Settings Page**: Update display name, change password for credential users, or delete account with confirmation
- **Default Admin Account**: Database seeding with default admin credentials for initial setup
- **Per-User Data Isolation**: All charts, annotations, and ML models are scoped to individual users

### Data Management
- **CSV Upload**: Import OHLC data with support for both Unix timestamps and date strings
  - **Replace Mode**: Uploading a new CSV deletes all old candles and replaces them with new data
  - **Initial Data**: Docker containers automatically load EURUSD.csv on first startup if the database is empty
- **PostgreSQL Storage**: All candle data and annotations are stored in a PostgreSQL database
- **Shared Database**: Frontend and ML service use the same database for seamless data access
- **Data Persistence**: Annotations and candles persist between sessions

### Chart Visualization
- **Interactive Candlestick Chart**: Powered by lightweight-charts library
- **Dark Theme**: Eye-friendly slate color scheme
- **Zoom & Pan**: Mouse wheel zoom and drag-to-pan functionality
- **Crosshair**: Precise price and time tracking

### Annotation Tools
- **Break Up Markers**: Green arrow markers below candles indicating upward breakouts
- **Break Down Markers**: Red arrow markers above candles indicating downward breakouts
- **Trend Lines**: Two-click line drawing with real-time preview
- **Delete Tool**: Remove any annotation (markers or lines) by clicking on them
- **Tool Toggle**: Click the active tool's button again to deactivate it

### Label Management
- **Label Sidebar**: View all annotations in a collapsible sidebar with:
  - **Click Selection**: Click markers on chart or in sidebar to select/highlight
  - **Keyboard Delete**: Press Delete or Backspace to remove selected label
  - **Individual Delete**: Delete button on each list item
  - **Search**: Search annotations by timestamp
  - **Filter**: Filter by Break Up, Break Down, or All types
  - **Count Display**: See how many Break Up vs Break Down markers exist
  - **Visual Highlight**: Selected markers highlighted with glow effect

### UI Theme
- **Hacker Theme**: Terminal-inspired dark aesthetic with:
  - Matrix green (#00ff41) on dark background (#0a0e0a)
  - Monospace font (JetBrains Mono) throughout
  - Glow effects on button hover and active states
  - Custom scrollbars styled to match theme
  - High contrast for accessibility

### Export & Deployment
- **CSV Export**: Download all annotations with timestamp, label type, and price data
- **ML-Ready Format**: Structured data suitable for training ML models
- **Docker Deployment**: One-command deployment with persistent data volume
- **Health Check**: Built-in /api/health endpoint for monitoring

### ML Pipeline Features (Optional)

The integrated ML pipeline provides:

#### Feature Engineering
- **TA-Lib Indicators**: Automatic computation of 150+ technical indicators (RSI, MACD, Bollinger Bands, ATR, Stochastic, etc.)
- **Candle Features**: Body size, wick ratios, gap detection, price ranges
- **Custom Features**: Plugin system for domain-specific feature functions
- **NaN Handling**: Automatic warmup period detection and cleanup
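
As a rough illustration of the candle features listed above, here is a minimal standard-library sketch. The `Candle` class, the `candle_features` function, and the returned key names are hypothetical and chosen for this example; they are not taken from the project's code, which may define these features differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float


def candle_features(candle: Candle, prev_close: Optional[float] = None) -> dict:
    """Compute body size, wick ratios, price range, and gap for one candle."""
    price_range = candle.high - candle.low
    body = abs(candle.close - candle.open)
    upper_wick = candle.high - max(candle.open, candle.close)
    lower_wick = min(candle.open, candle.close) - candle.low

    def ratio(part: float) -> float:
        # A flat candle has zero range; report 0.0 instead of dividing by zero.
        return part / price_range if price_range > 0 else 0.0

    return {
        "range": price_range,
        "body": body,
        "body_ratio": ratio(body),
        "upper_wick_ratio": ratio(upper_wick),
        "lower_wick_ratio": ratio(lower_wick),
        # Assumption: the gap is this open minus the previous close.
        "gap": 0.0 if prev_close is None else candle.open - prev_close,
    }
```

Calling `candle_features(Candle(1.0, 3.0, 1.0, 2.0), prev_close=0.5)` yields a body ratio of 0.5, an upper wick ratio of 0.5, and a gap of 0.5.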

#### Annotation Ingestion
- **Windowed Classification**: Extract fixed-size windows around each pattern for classification models
- **BIO Sequence Labeling**: Begin-Inside-Outside encoding for sequence models (future LSTM/GRU support)
- **Programmatic Labels**: TA-Lib CDL* pattern functions for auto-labeling (23+ candlestick patterns)
- **Label Merging**: Human-priority, programmatic-priority, or both strategies
- **Dataset Statistics**: Class distribution, label counts, human/programmatic agreement metrics
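
The two ingestion modes can be sketched as follows. This is an illustrative example only: `bio_encode` and `extract_window` are hypothetical names, spans are assumed to be inclusive `(start_time, end_time, label)` tuples, and windows that do not fit inside the data are skipped by returning `None`. The real pipeline may handle edges and overlapping spans differently.

```python
def bio_encode(times, spans):
    """Tag each candle time as B-<label>, I-<label>, or O.

    times: sorted candle timestamps.
    spans: list of (start_time, end_time, label), both ends inclusive.
    """
    tags = ["O"] * len(times)
    for start, end, label in spans:
        inside = False
        for i, t in enumerate(times):
            if start <= t <= end:
                # The first candle in a span begins it; the rest are inside it.
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


def extract_window(rows, center, half_width):
    """Return the 2 * half_width + 1 rows centered on index `center`.

    Returns None when the window would run past either end, so every
    returned window has the same fixed size.
    """
    start, stop = center - half_width, center + half_width + 1
    if start < 0 or stop > len(rows):
        return None
    return rows[start:stop]
```

For candle times `[0, 60, 120, 180, 240]` and a single span `(60, 120, "Hammer")`, `bio_encode` returns `["O", "B-Hammer", "I-Hammer", "O", "O"]`.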

#### Model Training
- **Model Types**: RandomForest and XGBoost with class balancing
- **Temporal Splitting**: Train/val/test splits that respect time series order, so later candles never leak into the training set
- **MLflow Integration**: Automatic experiment tracking, hyperparameter logging, artifact storage
- **Model Registry**: Versioned model storage with stage management (Production, Staging, Archived)
- **Evaluation Metrics**: Accuracy, F1 (macro/weighted), per-class precision/recall/F1
- **Visualization**: Confusion matrix, feature importance plots, classification reports
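
A temporal split can be sketched as below. The function name `temporal_split`, the `"time"` key, and the default fractions are assumptions made for this example, not the project's actual configuration. The idea is simply that rows are sorted by time and cut at two boundaries, so validation and test data always come after the training data.

```python
def temporal_split(rows, train_frac=0.7, val_frac=0.15):
    """Split rows into (train, val, test) in chronological order.

    rows: list of dicts with a sortable "time" key (hypothetical layout).
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test share")

    ordered = sorted(rows, key=lambda row: row["time"])
    n = len(ordered)
    # round() keeps boundaries stable against small floating-point errors.
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Unlike a shuffled split, every validation row is later than every training row, which mirrors how the model is used at inference time.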

#### Inference Service
- **FastAPI REST API**: High-performance inference with automatic OpenAPI docs
- **Preprocessing Parity**: Loads pipeline config from MLflow to ensure training/inference consistency
- **Batch Processing**: Efficient prediction for large time ranges
- **Span Grouping**: Consecutive predictions merged into labeled spans with confidence scores
- **Model Metadata**: Endpoint to query model version, metrics, and label configuration
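
Span grouping can be illustrated with the following sketch. It assumes per-candle predictions arrive as time-ordered `(time, label, confidence)` tuples, that a background label (here the hypothetical `"none"`) marks candles without a pattern, and that a span's confidence is the mean of its members. The service's actual aggregation rule is not documented here and may differ.

```python
def group_spans(predictions, background="none"):
    """Merge consecutive same-label predictions into labeled spans.

    predictions: time-ordered list of (time, label, confidence).
    """
    spans = []
    current = None
    for time, label, confidence in predictions:
        if current is not None and label == current["label"]:
            # Same label as the open span: extend it.
            current["end_time"] = time
            current["scores"].append(confidence)
            continue
        if current is not None:
            spans.append(current)
        if label == background:
            current = None
        else:
            current = {"label": label, "start_time": time,
                       "end_time": time, "scores": [confidence]}
    if current is not None:
        spans.append(current)

    return [
        {
            "label": s["label"],
            "start_time": s["start_time"],
            "end_time": s["end_time"],
            # Assumption: span confidence is the mean per-candle confidence.
            "confidence": sum(s["scores"]) / len(s["scores"]),
        }
        for s in spans
    ]
```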

#### Prediction UI
- **Chart Overlay**: Predictions rendered as histogram series with label-specific colors
- **Confidence Filtering**: Slider to hide low-confidence predictions
- **Label Filtering**: Toggle visibility per pattern type with per-class F1 scores
- **Disagreement Detection**: Automatic comparison of human vs model predictions
- **Prediction Summary**: Counts for total predictions, agreements, disagreements
- **Active Learning Feedback**: Click predictions to convert them to annotations (future feature)
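
The comparison behind the prediction summary might look like the sketch below. It is an assumption-laden illustration: `compare_spans` is a hypothetical name, spans are dicts with `start_time`, `end_time`, and `label` keys (plus `confidence` for predictions), and a prediction with no overlapping human annotation is counted in the total but as neither an agreement nor a disagreement. The UI may define these counts differently.

```python
def overlaps(a, b):
    """True when two inclusive time spans share at least one instant."""
    return a["start_time"] <= b["end_time"] and b["start_time"] <= a["end_time"]


def compare_spans(human, model, min_confidence=0.0):
    """Count agreements and disagreements between human and model spans."""
    # Confidence filtering: drop predictions below the slider threshold.
    kept = [p for p in model if p["confidence"] >= min_confidence]
    agreements = 0
    disagreements = 0
    for pred in kept:
        matches = [h for h in human if overlaps(h, pred)]
        if not matches:
            continue  # Model-only prediction: nothing to compare against.
        if any(h["label"] == pred["label"] for h in matches):
            agreements += 1
        else:
            disagreements += 1
    return {"total": len(kept), "agreements": agreements,
            "disagreements": disagreements}
```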

## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────────────┐
│                         Web Browser                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Next.js Frontend (React 19, Tailwind, lightweight-charts)│  │
│  │  - Annotation tools                                        │  │
│  │  - Prediction visualization                                │  │
│  └──────────────────┬──────────────────────────────────────────┘  │
└─────────────────────┼──────────────────────────────────────────────┘
                      │ HTTP
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Next.js API Routes (TypeScript)                                 │
│  - /api/candles, /api/annotations, /api/span-annotations        │
│  - /api/predict (proxy)                                          │
│  - /api/model/info (proxy)                                       │
│  └───────────┬─────────────────────────────────────┬───────────  │
│              │ PostgreSQL                          │ HTTP        │
│              ▼                                     ▼             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │          PostgreSQL Database (Shared)                    │   │
│  │  - Frontend tables (candles, annotations, span_annotations) │
│  │  - ML tables (training_runs)                             │   │
│  │  - Accessed by: Next.js (Drizzle ORM)                    │   │
│  │                 ML Service (SQLAlchemy)                  │   │
│  └──────────────────────┬──────────────────────────────────┘    │
│                         │                                        │
│              ┌──────────┴──────────┐                             │
│              │                     │                             │
│  ┌───────────▼─────────┐  ┌───────▼───────────────────┐         │
│  │  ML Inference API   │  │    MLflow Server          │         │
│  │  (FastAPI, Python)  │  │  (Experiments, Registry)  │         │
│  └─────────────────────┘  └───────────────────────────┘         │
└─────────────────────────────────────────────────────────────────┘
```

### ML Pipeline Workflow

```
1. Annotate Data (Web UI)
   ↓
2. Export Annotations (JSON)
   ↓
3. Feature Engineering (TA-Lib)
   ├─ Raw OHLCV → Enriched CSV (with indicators)
   ↓
4. Annotation Ingestion
   ├─ Annotations + Enriched CSV → Labeled Dataset
   ├─ Optional: TA-Lib CDL* auto-labeling
   ↓
5. Model Training
   ├─ Temporal train/val/test split
   ├─ RandomForest or XGBoost training
   ├─ MLflow experiment tracking
   ├─ Model registration
   ↓
6. Inference Service
   ├─ Load model from MLflow registry
   ├─ Serve predictions via FastAPI
   ↓
7. Prediction Visualization (Web UI)
   ├─ Display predictions on chart
   ├─ Detect disagreements
   ├─ Feedback loop: predictions → new annotations → retrain
```

## Tech Stack

### Frontend & Web Service
- **Frontend**: Next.js 16 (App Router), React 19, TypeScript
- **Styling**: Tailwind CSS 3, shadcn/ui components
- **Charting**: lightweight-charts 4.x (TradingView)
- **Icons**: lucide-react
- **Backend**: Next.js API Routes
- **Database**: PostgreSQL 16 with pg driver
- **ORM**: Drizzle ORM (PostgreSQL dialect)
- **CSV Parsing**: papaparse

### ML Pipeline (Python)
- **API Framework**: FastAPI with uvicorn
- **ML Libraries**: scikit-learn (RandomForest), XGBoost
- **Feature Engineering**: TA-Lib (Technical Analysis Library)
- **Data Processing**: pandas, numpy
- **Experiment Tracking**: MLflow (model registry, artifact storage)
- **Data Versioning**: DVC (Data Version Control)
- **Database**: PostgreSQL 16 (shared with frontend - reads candles/annotations, writes training runs)
- **ORM**: SQLAlchemy (for training runs) + table reflection (for frontend data)
- **Model Persistence**: joblib
- **Validation**: Pydantic

## Getting Started

### Docker Quickstart (Recommended)

The fastest way to get running with Docker:

```bash
docker-compose up --build
```

Then open http://localhost:3000

See [DEPLOYMENT.md](./DEPLOYMENT.md#docker-deployment) for detailed Docker instructions.

### Prerequisites

- Node.js 20.9 or higher (the minimum supported by Next.js 16; for local development)
- npm 9.x or higher (for local development)
- PostgreSQL 16 or higher (for local development)
- Docker & docker-compose (for containerized deployment)

### Local Development Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd candle_annotator
   ```

2. Install dependencies:
   ```bash
   npm install
   ```

3. Setup PostgreSQL database:
   ```bash
   createdb candle_annotator
   createuser -P ml_user
   # Enter password: ml_password
   psql -c "GRANT ALL PRIVILEGES ON DATABASE candle_annotator TO ml_user;"
   ```

4. Create `.env` file:
   ```bash
   cp .env.example .env
   # Edit .env to set DATABASE_URL=postgresql://ml_user:ml_password@localhost:5432/candle_annotator
   ```

5. Start the development server:
   ```bash
   npm run dev
   ```

6. Open http://localhost:3000 in your browser

### Usage

1. **Upload Data**: Click "Choose CSV File" and select a CSV with columns: `time,open,high,low,close`
2. **View Chart**: The candlestick chart renders automatically after upload
3. **Add Annotations**:
   - Click "Label: Break Up" or "Label: Break Down" then click on a candle
   - Click "Draw Line" then click two points to draw a trend line
   - Press Escape to cancel line drawing
4. **Delete Annotations**: Click "Delete" tool, then click on markers or lines to remove them
5. **Export**: Click "Export CSV" to download all annotations

## CSV File Format

### Input Format

Your CSV file should have these columns:

```csv
time,open,high,low,close
1700000000,1.0500,1.0520,1.0490,1.0510
1700000060,1.0510,1.0530,1.0505,1.0525
```

**Time column** accepts:
- Unix timestamps (seconds): `1700000000`
- Date strings: `2024-01-15`, `2024-01-15 10:30:00`
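
To make the two accepted time formats concrete, here is a small sketch of how such a column could be normalized to Unix seconds. The `parse_time` function is hypothetical and only covers the layouts shown above; treating zone-less date strings as UTC is an assumption, and the uploader's actual behavior may differ.

```python
from datetime import datetime, timezone

# Date layouts taken from the examples above; other layouts are not handled.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_time(value: str) -> int:
    """Return Unix seconds for a numeric timestamp or a supported date string."""
    text = value.strip()
    if text.isdigit():
        return int(text)
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Assumption: date strings without a zone are interpreted as UTC.
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"Unrecognized time value: {value!r}")
```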

### Export Format

The exported CSV includes:

```csv
timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
```

- **timestamp**: Unix timestamp of the annotation
- **label_type**: `break_up`, `break_down`, or `line`
- **price**: Close price for markers, start price for lines
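
A downstream script could load the export with the standard library as sketched below. The function names are hypothetical; the column names and sample rows come from the format shown above.

```python
import csv
import io
from collections import Counter

SAMPLE_EXPORT = """timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
"""


def load_export(text: str) -> list:
    """Parse exported annotation CSV text into typed row dicts."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "timestamp": int(record["timestamp"]),
            "label_type": record["label_type"],
            "price": float(record["price"]),
        })
    return rows


def count_labels(rows: list) -> Counter:
    """Count annotations per label type, e.g. to check class balance."""
    return Counter(row["label_type"] for row in rows)
```

For a real export, read the downloaded file and pass its contents to `load_export` in place of `SAMPLE_EXPORT`.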

## Database Schema

### Candles Table (PostgreSQL)

```typescript
{
  id: serial (PK, auto-increment),
  chart_id: integer (FK to charts.id),
  time: timestamp (not null, indexed with chart_id),
  open: double precision,
  high: double precision,
  low: double precision,
  close: double precision
}
```

### Annotations Table (Point Annotations)

```typescript
{
  id: serial (PK, auto-increment),
  chart_id: integer (FK to charts.id),
  timestamp: timestamp (not null),
  label_type: text ('line' | 'rectangle'),
  geometry: jsonb (for line/rectangle coordinates, nullable),
  color: text (default '#3b82f6'),
  created_at: timestamp (default now())
}
```

### Span Annotations Table (Pattern Labels)

```typescript
{
  id: serial (PK, auto-increment),
  chart_id: integer (FK to charts.id),
  start_time: timestamp (not null),
  end_time: timestamp (not null),
  label: text (pattern name, e.g., 'Bullish Engulfing'),
  confidence: integer (nullable),
  outcome: text (nullable),
  notes: text (nullable),
  sub_spans: jsonb (nullable),
  color: text (default '#2196F3'),
  source: text (default 'human'),  // 'human' | 'model' | 'hybrid'
  model_prediction: jsonb (nullable),
  created_at: timestamp (default now())
}
```

### Training Runs Table (ML Service)

```typescript
{
  id: serial (PK, auto-increment),
  run_id: text (unique, MLflow run ID),
  model_type: text (e.g., 'RandomForest', 'XGBoost'),
  experiment_name: text,
  pipeline_config_hash: text,
  dataset_version: text,
  metrics_summary: jsonb,
  status: text (e.g., 'running', 'completed', 'failed'),
  created_at: timestamp (default now()),
  completed_at: timestamp (nullable)
}
```

## API Endpoints

### POST /api/upload
Upload CSV file and store candle data

- **Behavior**: Deletes all existing candles before inserting new data (replace mode)
- **Request**: multipart/form-data with `file` field
- **Response**: `{ success: true, count: number }` or `{ error: string }`

### GET /api/candles
Retrieve all candle records

**Response**: Array of candle objects ordered by time

### GET /api/annotations
Retrieve all annotations

**Response**: Array of annotation objects with parsed geometry

### POST /api/annotations
Create a new annotation

- **Request**: `{ timestamp: number, label_type: string, geometry?: object }`
- **Response**: Created annotation object with ID
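
For scripted use, a request matching the shape above could be built as in this sketch. The `build_annotation_request` helper is hypothetical, the base URL is the local address from the quickstart, and the JSON content type is an assumption about what the route expects. Because the app uses per-user authentication, a real call will most likely also need a valid session cookie, which is not shown here.

```python
import json
import urllib.request

# Local address from the Docker quickstart; adjust for other deployments.
BASE_URL = "http://localhost:3000"


def build_annotation_request(timestamp, label_type, geometry=None):
    """Build (but do not send) a POST /api/annotations request."""
    payload = {"timestamp": timestamp, "label_type": label_type}
    if geometry is not None:
        # geometry is optional in the documented request shape.
        payload["geometry"] = geometry
    return urllib.request.Request(
        f"{BASE_URL}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; for example, `build_annotation_request(1700000000, "break_up")` targets a Break Up marker at that timestamp.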

### DELETE /api/annotations/[id]
Delete an annotation by ID

**Response**: `{ success: true }` or `{ error: string }`

### GET /api/export
Export annotations as downloadable CSV

**Response**: CSV file download with Content-Disposition header

## Frontend Architecture

### Component Structure

- **page.tsx**: Main page composition, manages active tool state
- **Toolbox.tsx**: Sidebar with tool buttons and export functionality
- **FileUpload.tsx**: CSV upload component with status messages
- **CandleChart.tsx**: Core chart wrapper with lightweight-charts integration
  - Initializes chart with dark theme
  - Handles marker annotations (Break Up/Down)
  - Manages click events for annotation creation
  - Exposes `refreshData()` method for parent updates
- **SvgOverlay.tsx**: Transparent SVG layer for line drawing
  - Coordinate transformation between data and pixels
  - Two-click line drawing with preview
  - Line hit detection for deletion
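
Line hit detection for deletion typically reduces to a point-to-segment distance check in pixel space. The sketch below shows that geometry in Python for illustration only; the component itself is TypeScript, and the function names and the 5-pixel tolerance are assumptions rather than values from the source.

```python
import math


def distance_to_segment(px, py, x1, y1, x2, y2):
    """Shortest distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - x1, py - y1)
    # Project the point onto the line, then clamp to the segment's ends.
    t = ((px - x1) * dx + (py - y1) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def hits_line(px, py, line, tolerance=5.0):
    """True when a click lands within `tolerance` pixels of the line."""
    return distance_to_segment(px, py, *line) <= tolerance
```

Clamping `t` matters: without it, clicks far beyond an endpoint but close to the infinite extension of the line would still count as hits.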

### Data Flow

1. User uploads CSV → POST /api/upload → PostgreSQL storage
2. Chart mounts → GET /api/candles + GET /api/annotations → Render
3. User clicks with active tool → POST /api/annotations → Refresh chart
4. User deletes → DELETE /api/annotations/[id] → Refresh chart
5. User exports → GET /api/export → CSV download

## Development

### Project Structure

```
candle_annotator/
├── src/
│   ├── app/
│   │   ├── api/              # API route handlers
│   │   │   ├── upload/
│   │   │   ├── candles/
│   │   │   ├── annotations/
│   │   │   └── export/
│   │   ├── globals.css       # Tailwind styles
│   │   ├── layout.tsx        # Root layout with dark theme
│   │   └── page.tsx          # Main page
│   ├── components/
│   │   ├── ui/               # shadcn/ui components
│   │   ├── CandleChart.tsx
│   │   ├── SvgOverlay.tsx
│   │   ├── Toolbox.tsx
│   │   └── FileUpload.tsx
│   └── lib/
│       ├── db/
│       │   ├── index.ts      # Drizzle client
│       │   ├── schema.ts     # Table definitions
│       │   └── migrate.ts    # Migration runner
│       └── utils.ts          # Utility functions
├── data/                     # Data directory
├── drizzle/                  # Migration files
├── DEPLOYMENT.md             # Deployment instructions
└── README.md                 # This file
```

### Key Technical Decisions

1. **lightweight-charts v4**: Stable API with good candlestick and marker support
2. **PostgreSQL**: Shared database enables ML service to directly query candle/annotation data without CSV exports
3. **SVG Overlay for Lines**: Keeps line rendering in a layer separate from the chart, which simplifies coordinate management
4. **Drizzle ORM**: Type-safe queries with minimal overhead, PostgreSQL dialect for production-grade features
5. **Next.js App Router**: Server-side API routes co-located with frontend code

### Known Limitations

- **No Undo**: Annotations can be deleted, but placement cannot be undone
- **Large Files**: Large CSV files (100k+ rows) may upload slowly
- **Line Snapping**: Lines don't snap to candles; placement is free-form only

## Troubleshooting

See [DEPLOYMENT.md](./DEPLOYMENT.md) for detailed troubleshooting steps.

Common issues:
- **PostgreSQL connection errors**: Check `DATABASE_URL` environment variable and verify PostgreSQL is running
- **Port 3000 in use**: Use `PORT=3001 npm run dev`
- **Migration errors**: Ensure PostgreSQL is accessible before starting the application

## License

ISC

## Contributing

This is a focused tool for a specific use case. For questions or issues, please open a GitHub issue.