# Candle Annotator
A web-based tool for manually annotating candlestick charts with pattern labels and trend lines. Built for creating labeled training data for machine learning models in trading analysis.
## Overview
Candle Annotator is a complete machine learning platform for candlestick pattern recognition, combining:
**Annotation Tools** - TradingView-like charting interface for creating labeled training data:
- Upload historical OHLC (Open, High, Low, Close) candle data from CSV files
- Visualize candlestick charts with interactive zoom and pan
- Annotate patterns with span labels (e.g., "Bullish Engulfing", "Doji", "Hammer")
- Mark breakout patterns (Break Up, Break Down) directly on candles
- Draw custom trend lines with two-click interaction
- Export annotations for ML training
**ML Pipeline** - Python-based training and inference system:
- Feature engineering with TA-Lib indicators (RSI, MACD, Bollinger Bands, etc.)
- Automated pattern detection using TA-Lib CDL* functions
- Train RandomForest and XGBoost models with MLflow experiment tracking
- FastAPI inference service for real-time predictions
- Integration with Next.js UI for prediction visualization
**Active Learning Loop** - Close the feedback cycle:
- Model predictions displayed as overlays on the chart
- Disagreement detection between human annotations and model predictions
- One-click feedback to confirm, correct, or dismiss predictions as new training data
- Continuous improvement through iterative annotation and retraining
## Features
### User Accounts & Authentication
- **Multi-User Support**: Each user has isolated data with per-user workspace
- **Auth.js v5 Integration**: Flexible authentication system supporting multiple sign-in methods
- **Sign-In Methods**:
  - **Credentials (Email/Password)**: Traditional email and password authentication with bcryptjs hashing
  - **Google OAuth**: One-click sign-in with Google accounts
- **Registration**: Self-service account creation with email validation and password requirements (minimum 8 characters)
- **Settings Page**: Update display name, change password for credential users, or delete account with confirmation
- **Default Admin Account**: Database seeding with default admin credentials for initial setup
- **Per-User Data Isolation**: All charts, annotations, and ML models are scoped to individual users
### Data Management
- **CSV Upload**: Import OHLC data with support for both Unix timestamps and date strings
- **Replace Mode**: Uploading a new CSV deletes all old candles and replaces them with new data
- **Initial Data**: Docker containers automatically load EURUSD.csv on first startup if database is empty
- **PostgreSQL Storage**: All candle data and annotations stored in PostgreSQL database
- **Shared Database**: Frontend and ML service use the same database for seamless data access
- **Data Persistence**: Annotations and candles persist between sessions
### Chart Visualization
- **Interactive Candlestick Chart**: Powered by lightweight-charts library
- **Dark Theme**: Eye-friendly slate color scheme
- **Zoom & Pan**: Mouse wheel zoom and drag-to-pan functionality
- **Crosshair**: Precise price and time tracking
### Annotation Tools
- **Break Up Markers**: Green arrow markers below candles indicating upward breakouts
- **Break Down Markers**: Red arrow markers above candles indicating downward breakouts
- **Trend Lines**: Two-click line drawing with real-time preview
- **Delete Tool**: Remove any annotation (markers or lines) by clicking on them
- **Tool Toggle**: Click tool button again to deactivate
### Label Management
- **Label Sidebar**: View all annotations in collapsible sidebar with:
  - **Click Selection**: Click markers on chart or in sidebar to select/highlight
  - **Keyboard Delete**: Press Delete or Backspace to remove selected label
  - **Individual Delete**: Delete button on each list item
  - **Search**: Search annotations by timestamp
  - **Filter**: Filter by Break Up, Break Down, or All types
  - **Count Display**: See how many Break Up vs Break Down markers exist
  - **Visual Highlight**: Selected markers highlighted with glow effect
### UI Theme
- **Hacker Theme**: Terminal-inspired dark aesthetic with:
  - Matrix green (#00ff41) on dark background (#0a0e0a)
  - Monospace font (JetBrains Mono) throughout
  - Glow effects on button hover and active states
  - Custom scrollbars styled to match theme
  - High contrast for accessibility
### Export & Deployment
- **CSV Export**: Download all annotations with timestamp, label type, and price data
- **ML-Ready Format**: Structured data suitable for training ML models
- **Docker Deployment**: One-command deployment with persistent data volume
- **Health Check**: Built-in /api/health endpoint for monitoring
### ML Pipeline Features (Optional)
The integrated ML pipeline provides:
#### Feature Engineering
- **TA-Lib Indicators**: Automatic computation of 150+ technical indicators (RSI, MACD, Bollinger Bands, ATR, Stochastic, etc.)
- **Candle Features**: Body size, wick ratios, gap detection, price ranges
- **Custom Features**: Plugin system for domain-specific feature functions
- **NaN Handling**: Automatic warmup period detection and cleanup
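The candle features listed above can be illustrated with a minimal sketch. The `Candle` class, the `candle_features` function, and the feature names are hypothetical and chosen for illustration; they are not taken from the pipeline's code, and the TA-Lib indicators are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candle:
    time: int
    open: float
    high: float
    low: float
    close: float


def candle_features(candle: Candle, prev_close: Optional[float] = None) -> dict:
    """Compute simple shape features for one candle (hypothetical feature names)."""
    body = abs(candle.close - candle.open)
    price_range = candle.high - candle.low
    upper_wick = candle.high - max(candle.open, candle.close)
    lower_wick = min(candle.open, candle.close) - candle.low
    # A zero-range candle has no meaningful wick ratio, so report NaN instead of dividing by zero.
    safe_range = price_range if price_range > 0 else float("nan")
    return {
        "body_size": body,
        "range": price_range,
        "upper_wick_ratio": upper_wick / safe_range,
        "lower_wick_ratio": lower_wick / safe_range,
        # Gap: this candle's open minus the previous close (0.0 when there is no previous candle).
        "gap": candle.open - prev_close if prev_close is not None else 0.0,
    }


# Usage with the two sample rows from the CSV input format.
first = Candle(1700000000, 1.0500, 1.0520, 1.0490, 1.0510)
second = Candle(1700000060, 1.0510, 1.0530, 1.0505, 1.0525)
features = candle_features(second, prev_close=first.close)
```

Ratios are taken relative to the candle's high-low range so that they are comparable across instruments with different price scales.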
#### Annotation Ingestion
- **Windowed Classification**: Extract fixed-size windows around each pattern for classification models
- **BIO Sequence Labeling**: Begin-Inside-Outside encoding for sequence models (future LSTM/GRU support)
- **Programmatic Labels**: TA-Lib CDL* pattern functions for auto-labeling (23+ candlestick patterns)
- **Label Merging**: Human-priority, programmatic-priority, or both strategies
- **Dataset Statistics**: Class distribution, label counts, human/programmatic agreement metrics
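As a rough illustration of BIO sequence labeling, the sketch below tags each candle with `B-<label>` at the start of a span, `I-<label>` inside it, and `O` elsewhere. The function name `bio_encode` and the `(start_time, end_time, label)` tuple layout are assumptions made for this example. It also assumes span ends are inclusive and that spans do not overlap (if they do, the later span overwrites the earlier one).

```python
from typing import List, Tuple

# (start_time, end_time, label); end_time is treated as inclusive in this sketch.
Span = Tuple[int, int, str]


def bio_encode(times: List[int], spans: List[Span]) -> List[str]:
    """Return one BIO tag per candle time; candles outside every span get "O"."""
    tags = ["O"] * len(times)
    for start, end, label in spans:
        first = True
        for i, t in enumerate(times):
            if start <= t <= end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Usage: five one-minute candles with one two-candle "Hammer" span.
candle_times = [0, 60, 120, 180, 240]
tags = bio_encode(candle_times, [(60, 120, "Hammer")])
```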
#### Model Training
- **Model Types**: RandomForest and XGBoost with class balancing
- **Temporal Splitting**: Train/val/test splits that respect time series order (no data leakage)
- **MLflow Integration**: Automatic experiment tracking, hyperparameter logging, artifact storage
- **Model Registry**: Versioned model storage with stage management (Production, Staging, Archived)
- **Evaluation Metrics**: Accuracy, F1 (macro/weighted), per-class precision/recall/F1
- **Visualization**: Confusion matrix, feature importance plots, classification reports
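Temporal splitting can be sketched as slicing time-ordered rows into contiguous blocks, so that every validation and test row is later than every training row. The function name `temporal_split` and the 70/15/15 default fractions are assumptions for illustration, not the pipeline's actual configuration.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T], train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[List[T], List[T], List[T]]:
    """Split rows that are already sorted by time into train/val/test blocks."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    n = len(rows)
    # round() avoids losing a row to floating-point error (e.g. 13.999... instead of 14).
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return list(rows[:train_end]), list(rows[train_end:val_end]), list(rows[val_end:])


# Usage: 20 time-ordered rows, represented here by their index.
train, val, test = temporal_split(list(range(20)))
```

Unlike a shuffled split, no row in `train` comes after a row in `val` or `test`, which is what prevents look-ahead leakage in time series data.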
#### Inference Service
- **FastAPI REST API**: High-performance inference with automatic OpenAPI docs
- **Preprocessing Parity**: Loads pipeline config from MLflow to ensure training/inference consistency
- **Batch Processing**: Efficient prediction for large time ranges
- **Span Grouping**: Consecutive predictions merged into labeled spans with confidence scores
- **Model Metadata**: Endpoint to query model version, metrics, and label configuration
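Span grouping can be pictured as collapsing runs of consecutive candles that share a predicted label. The sketch below is one possible approach, not the service's implementation: the `PredictedSpan` type, the `"O"` background label, and averaging the per-candle confidence are all assumptions made for this example.

```python
from itertools import groupby
from typing import List, NamedTuple, Tuple


class PredictedSpan(NamedTuple):
    start_time: int
    end_time: int
    label: str
    confidence: float  # mean confidence over the merged candles (an assumption)


def group_spans(
    predictions: List[Tuple[int, str, float]], background: str = "O"
) -> List[PredictedSpan]:
    """Merge runs of consecutive (time, label, confidence) rows that share a label.

    Rows must be sorted by time; runs of the background label are dropped.
    """
    spans = []
    for label, run in groupby(predictions, key=lambda row: row[1]):
        if label == background:
            continue
        candles = list(run)
        mean_conf = sum(conf for _, _, conf in candles) / len(candles)
        spans.append(PredictedSpan(candles[0][0], candles[-1][0], label, mean_conf))
    return spans


# Usage: per-candle predictions for five one-minute candles.
per_candle = [
    (0, "O", 0.9),
    (60, "Hammer", 0.8),
    (120, "Hammer", 0.6),
    (180, "O", 0.7),
    (240, "Doji", 0.5),
]
spans = group_spans(per_candle)
```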
#### Prediction UI
- **Chart Overlay**: Predictions rendered as histogram series with label-specific colors
- **Confidence Filtering**: Slider to hide low-confidence predictions
- **Label Filtering**: Toggle visibility per pattern type with per-class F1 scores
- **Disagreement Detection**: Automatic comparison of human vs model predictions
- **Prediction Summary**: Counts for total predictions, agreements, disagreements
- **Active Learning Feedback**: Click predictions to convert them to annotations (future feature)
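How disagreements are defined is not spelled out here, so the following is only a sketch under one plausible definition: a human span and a model span disagree when their time ranges overlap but their labels differ. The `Span` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start_time: int
    end_time: int
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True when the two inclusive time ranges share at least one instant."""
    return a.start_time <= b.end_time and b.start_time <= a.end_time


def find_disagreements(human: List[Span], model: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (human, model) pairs that overlap in time but carry different labels."""
    return [(h, m) for h in human for m in model if overlaps(h, m) and h.label != m.label]


# Usage: the first human span overlaps a model span with a different label.
human_spans = [Span(0, 120, "Hammer"), Span(300, 360, "Doji")]
model_spans = [Span(60, 180, "Doji"), Span(300, 360, "Doji"), Span(500, 560, "Hammer")]
disagreements = find_disagreements(human_spans, model_spans)
```

The pairwise comparison is quadratic in the number of spans, which is acceptable for the span counts a single chart typically holds; a sorted sweep would be the natural next step for larger inputs.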
## Architecture
### System Components
```
┌──────────────────────────────────────────────────────────────┐
│ Web Browser                                                  │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Next.js Frontend (React 19, Tailwind, lightweight-charts)│ │
│ │ - Annotation tools                                       │ │
│ │ - Prediction visualization                               │ │
│ └────────────────────────────┬─────────────────────────────┘ │
└──────────────────────────────┼───────────────────────────────┘
                               │ HTTP
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ Next.js API Routes (TypeScript)                              │
│ - /api/candles, /api/annotations, /api/span-annotations      │
│ - /api/predict (proxy)                                       │
│ - /api/model/info (proxy)                                    │
└──────────────┬───────────────────────────────┬───────────────┘
               │ PostgreSQL                    │ HTTP
               ▼                               ▼
┌──────────────────────────────────────────────────────────────┐
│ PostgreSQL Database (Shared)                                 │
│ - Frontend tables (candles, annotations, span_annotations)   │
│ - ML tables (training_runs)                                  │
│ - Accessed by: Next.js (Drizzle ORM)                         │
│                ML Service (SQLAlchemy)                       │
└──────────────────────────────┬───────────────────────────────┘
                               │
               ┌───────────────┴───────────────┐
  ┌────────────▼─────────────┐   ┌─────────────▼─────────────┐
  │ ML Inference API         │   │ MLflow Server             │
  │ (FastAPI, Python)        │   │ (Experiments, Registry)   │
  └──────────────────────────┘   └───────────────────────────┘
```
### ML Pipeline Workflow
```
1. Annotate Data (Web UI)
2. Export Annotations (JSON)
3. Feature Engineering (TA-Lib)
├─ Raw OHLCV → Enriched CSV (with indicators)
4. Annotation Ingestion
├─ Annotations + Enriched CSV → Labeled Dataset
├─ Optional: TA-Lib CDL* auto-labeling
5. Model Training
├─ Temporal train/val/test split
├─ RandomForest or XGBoost training
├─ MLflow experiment tracking
├─ Model registration
6. Inference Service
├─ Load model from MLflow registry
├─ Serve predictions via FastAPI
7. Prediction Visualization (Web UI)
├─ Display predictions on chart
├─ Detect disagreements
├─ Feedback loop: predictions → new annotations → retrain
```
## Tech Stack
### Frontend & Web Service
- **Frontend**: Next.js 16 (App Router), React 19, TypeScript
- **Styling**: Tailwind CSS 3, shadcn/ui components
- **Charting**: lightweight-charts 4.x (TradingView)
- **Icons**: lucide-react
- **Backend**: Next.js API Routes
- **Database**: PostgreSQL 16 with pg driver
- **ORM**: Drizzle ORM (PostgreSQL dialect)
- **CSV Parsing**: papaparse
### ML Pipeline (Python)
- **API Framework**: FastAPI with uvicorn
- **ML Libraries**: scikit-learn (RandomForest), XGBoost
- **Feature Engineering**: TA-Lib (Technical Analysis Library)
- **Data Processing**: pandas, numpy
- **Experiment Tracking**: MLflow (model registry, artifact storage)
- **Data Versioning**: DVC (Data Version Control)
- **Database**: PostgreSQL 16 (shared with frontend - reads candles/annotations, writes training runs)
- **ORM**: SQLAlchemy (for training runs) + table reflection (for frontend data)
- **Model Persistence**: joblib
- **Validation**: Pydantic
## Getting Started
### Docker Quickstart (Recommended)
The fastest way to get running with Docker:
```bash
docker-compose up --build
```
Then open http://localhost:3000
See [DEPLOYMENT.md](./DEPLOYMENT.md#docker-deployment) for detailed Docker instructions.
### Prerequisites
- Node.js 20.9 or higher (for local development; Next.js 16 does not support Node.js 18)
- npm 9.x or higher (for local development)
- PostgreSQL 16 or higher (for local development)
- Docker & docker-compose (for containerized deployment)
### Local Development Installation
1. Clone the repository:
```bash
git clone <repository-url>
cd candle_annotator
```
2. Install dependencies:
```bash
npm install
```
3. Setup PostgreSQL database:
```bash
createdb candle_annotator
createuser -P ml_user
# Enter password: ml_password
psql -c "GRANT ALL PRIVILEGES ON DATABASE candle_annotator TO ml_user;"
```
4. Create `.env` file:
```bash
cp .env.example .env
# Edit .env to set DATABASE_URL=postgresql://ml_user:ml_password@localhost:5432/candle_annotator
```
5. Start the development server:
```bash
npm run dev
```
6. Open http://localhost:3000 in your browser
### Usage
1. **Upload Data**: Click "Choose CSV File" and select a CSV with columns: `time,open,high,low,close`
2. **View Chart**: The candlestick chart renders automatically after upload
3. **Add Annotations**:
- Click "Label: Break Up" or "Label: Break Down" then click on a candle
- Click "Draw Line" then click two points to draw a trend line
- Press Escape to cancel line drawing
4. **Delete Annotations**: Click "Delete" tool, then click on markers or lines to remove them
5. **Export**: Click "Export CSV" to download all annotations
## CSV File Format
### Input Format
Your CSV file should have these columns:
```csv
time,open,high,low,close
1700000000,1.0500,1.0520,1.0490,1.0510
1700000060,1.0510,1.0530,1.0505,1.0525
```
**Time column** accepts:
- Unix timestamps (seconds): `1700000000`
- Date strings: `2024-01-15`, `2024-01-15 10:30:00`
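The accepted time formats can be normalized to Unix seconds with a small helper. This is a minimal sketch, not the application's parser: the function name `parse_time` is hypothetical, and it assumes date strings are in UTC, which the format description does not state.

```python
from datetime import datetime, timezone

# Formats taken from the examples above; the more specific one is tried first.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_time(value: str) -> int:
    """Convert a CSV time cell to Unix seconds (date strings are assumed to be UTC)."""
    text = value.strip()
    if text.isdigit():
        return int(text)
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unrecognised time value: {value!r}")


# Usage: all three accepted forms map to integer Unix seconds.
stamps = [parse_time(v) for v in ("1700000000", "2024-01-15", "2024-01-15 10:30:00")]
```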
### Export Format
The exported CSV includes:
```csv
timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
```
- **timestamp**: Unix timestamp of the annotation
- **label_type**: `break_up`, `break_down`, or `line`
- **price**: Close price for markers, start price for lines
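For downstream use, the exported file can be read back into typed rows. The sketch below uses only the Python standard library; `load_annotations` and `VALID_LABELS` are names introduced for this example, and the set of valid labels is taken from the list above.

```python
import csv
import io
from typing import Dict, List, Union

VALID_LABELS = {"break_up", "break_down", "line"}


def load_annotations(csv_text: str) -> List[Dict[str, Union[int, float, str]]]:
    """Parse exported annotation CSV text into rows with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["label_type"] not in VALID_LABELS:
            raise ValueError(f"unexpected label_type: {row['label_type']!r}")
        rows.append(
            {
                "timestamp": int(row["timestamp"]),
                "label_type": row["label_type"],
                "price": float(row["price"]),
            }
        )
    return rows


# Usage with the sample export shown above.
sample = """timestamp,label_type,price
1700000000,break_up,1.0510
1700000120,break_down,1.0505
1700000000,line,1.0500
"""
annotations = load_annotations(sample)
```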
## Database Schema
### Candles Table (PostgreSQL)
```typescript
{
id: serial (PK, auto-increment),
chart_id: integer (FK to charts.id),
time: timestamp (not null, indexed with chart_id),
open: double precision,
high: double precision,
low: double precision,
close: double precision
}
```
### Annotations Table (Point Annotations)
```typescript
{
id: serial (PK, auto-increment),
chart_id: integer (FK to charts.id),
timestamp: timestamp (not null),
label_type: text ('line' | 'rectangle'),
geometry: jsonb (for line/rectangle coordinates, nullable),
color: text (default '#3b82f6'),
created_at: timestamp (default now())
}
```
### Span Annotations Table (Pattern Labels)
```typescript
{
id: serial (PK, auto-increment),
chart_id: integer (FK to charts.id),
start_time: timestamp (not null),
end_time: timestamp (not null),
label: text (pattern name, e.g., 'Bullish Engulfing'),
confidence: integer (nullable),
outcome: text (nullable),
notes: text (nullable),
sub_spans: jsonb (nullable),
color: text (default '#2196F3'),
source: text (default 'human'), // 'human' | 'model' | 'hybrid'
model_prediction: jsonb (nullable),
created_at: timestamp (default now())
}
```
### Training Runs Table (ML Service)
```typescript
{
id: serial (PK, auto-increment),
run_id: text (unique, MLflow run ID),
model_type: text (e.g., 'RandomForest', 'XGBoost'),
experiment_name: text,
pipeline_config_hash: text,
dataset_version: text,
metrics_summary: jsonb,
status: text (e.g., 'running', 'completed', 'failed'),
created_at: timestamp (default now()),
completed_at: timestamp (nullable)
}
```
## API Endpoints
### POST /api/upload
Upload CSV file and store candle data
**Behavior**: Deletes all existing candles before inserting new data (replace mode)
**Request**: multipart/form-data with `file` field
**Response**: `{ success: true, count: number }` or `{ error: string }`
### GET /api/candles
Retrieve all candle records
**Response**: Array of candle objects ordered by time
### GET /api/annotations
Retrieve all annotations
**Response**: Array of annotation objects with parsed geometry
### POST /api/annotations
Create a new annotation
**Request**: `{ timestamp: number, label_type: string, geometry?: object }`
**Response**: Created annotation object with ID
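A request matching the shape above can be built from a script with the Python standard library. This is a sketch under assumptions: `BASE_URL` is the local development address from Getting Started, `build_annotation_request` is a hypothetical helper, and `break_up` is assumed to be an accepted `label_type` because it appears in the export format. Since the application has user accounts, a real call will most likely also need an authenticated session, which is not shown.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:3000"  # assumption: local dev server


def build_annotation_request(
    timestamp: int, label_type: str, geometry: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/annotations request."""
    payload = {"timestamp": timestamp, "label_type": label_type}
    if geometry is not None:
        payload["geometry"] = geometry  # optional, as in the request shape above
    return urllib.request.Request(
        f"{BASE_URL}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_annotation_request(1700000000, "break_up")
# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       created = json.load(response)
```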
### DELETE /api/annotations/[id]
Delete an annotation by ID
**Response**: `{ success: true }` or `{ error: string }`
### GET /api/export
Export annotations as downloadable CSV
**Response**: CSV file download with Content-Disposition header
## Frontend Architecture
### Component Structure
- **page.tsx**: Main page composition, manages active tool state
- **Toolbox.tsx**: Sidebar with tool buttons and export functionality
- **FileUpload.tsx**: CSV upload component with status messages
- **CandleChart.tsx**: Core chart wrapper with lightweight-charts integration
  - Initializes chart with dark theme
  - Handles marker annotations (Break Up/Down)
  - Manages click events for annotation creation
  - Exposes `refreshData()` method for parent updates
- **SvgOverlay.tsx**: Transparent SVG layer for line drawing
  - Coordinate transformation between data and pixels
  - Two-click line drawing with preview
  - Line hit detection for deletion
### Data Flow
1. User uploads CSV → POST /api/upload → PostgreSQL storage
2. Chart mounts → GET /api/candles + GET /api/annotations → Render
3. User clicks with active tool → POST /api/annotations → Refresh chart
4. User deletes → DELETE /api/annotations/[id] → Refresh chart
5. User exports → GET /api/export → CSV download
## Development
### Project Structure
```
candle_annotator/
├── src/
│   ├── app/
│   │   ├── api/              # API route handlers
│   │   │   ├── upload/
│   │   │   ├── candles/
│   │   │   ├── annotations/
│   │   │   └── export/
│   │   ├── globals.css       # Tailwind styles
│   │   ├── layout.tsx        # Root layout with dark theme
│   │   └── page.tsx          # Main page
│   ├── components/
│   │   ├── ui/               # shadcn/ui components
│   │   ├── CandleChart.tsx
│   │   ├── SvgOverlay.tsx
│   │   ├── Toolbox.tsx
│   │   └── FileUpload.tsx
│   └── lib/
│       ├── db/
│       │   ├── index.ts      # Drizzle client
│       │   ├── schema.ts     # Table definitions
│       │   └── migrate.ts    # Migration runner
│       └── utils.ts          # Utility functions
├── data/                     # Local data directory
├── drizzle/                  # Migration files
├── DEPLOYMENT.md             # Deployment instructions
└── README.md                 # This file
```
### Key Technical Decisions
1. **lightweight-charts v4**: Stable API with good candlestick and marker support
2. **PostgreSQL**: Shared database enables ML service to directly query candle/annotation data without CSV exports
3. **SVG Overlay for Lines**: Maintains separate rendering layer from chart, easier coordinate management
4. **Drizzle ORM**: Type-safe queries with minimal overhead, PostgreSQL dialect for production-grade features
5. **Next.js App Router**: Server-side API routes co-located with frontend code
### Known Limitations
- **No Undo**: Can only delete annotations, not undo placement
- **Memory**: Large CSV files (100k+ rows) may cause slow uploads
- **Line Snapping**: Lines don't snap to candles, free-form placement only
## Troubleshooting
See [DEPLOYMENT.md](./DEPLOYMENT.md) for detailed troubleshooting steps.
Common issues:
- **PostgreSQL connection errors**: Check `DATABASE_URL` environment variable and verify PostgreSQL is running
- **Port 3000 in use**: Use `PORT=3001 npm run dev`
- **Migration errors**: Ensure PostgreSQL is accessible before starting the application
## License
ISC
## Contributing
This is a focused tool for a specific use case. For questions or issues, please open a GitHub issue.