candle-annotator/openspec/changes/ml-db-consolidation/specs/ml-training/spec.md
Marko Djordjevic 5f70f13da3 feat: migrate from SQLite to PostgreSQL - complete schema and API updates
- Remove better-sqlite3, add pg driver
- Convert schema to PostgreSQL types (serial, timestamp, boolean, jsonb)
- Generate fresh PostgreSQL migrations
- Update database connection layer with pg.Pool
- Fix all API routes: remove JSON.parse/stringify, use native timestamps and booleans
- Update drizzle.config.ts and .env.example for PostgreSQL
2026-02-17 13:43:06 +01:00

37 lines
2 KiB
Markdown

## MODIFIED Requirements
### Requirement: PostgreSQL training metadata storage
The system SHALL store training run metadata in the PostgreSQL database. Each training run record SHALL include: run_id (MLflow run ID), model_type, experiment_name, pipeline_config_hash, dataset_version, metrics summary (JSON), status, and timestamps (created_at, completed_at).
#### Scenario: Store training run record
- **WHEN** a training run completes successfully
- **THEN** the system inserts a record into the PostgreSQL `training_runs` table with the run metadata
#### Scenario: Query training history
- **WHEN** the system queries training runs
- **THEN** it returns records from PostgreSQL ordered by created_at descending
#### Scenario: Database name updated
- **WHEN** the ML service connects to PostgreSQL
- **THEN** it connects to the `candle_annotator` database (not `ml_db`)
## ADDED Requirements
### Requirement: Direct annotation data access
The ML service SHALL read candle and annotation data directly from PostgreSQL instead of requiring CSV/JSON file exports. The ML service SHALL query the `candles`, `annotations`, `span_annotations`, and `charts` tables for training data.
#### Scenario: Query candle data for training
- **WHEN** the ML training pipeline needs OHLC data for a chart
- **THEN** it queries the `candles` table in PostgreSQL filtered by `chart_id`, ordered by `time`
#### Scenario: Query span annotations for labels
- **WHEN** the ML training pipeline needs labeled spans for training
- **THEN** it queries the `span_annotations` table in PostgreSQL filtered by `chart_id` and optionally by `source`
#### Scenario: No CSV/JSON export required
- **WHEN** the ML training pipeline starts
- **THEN** it does not require pre-exported CSV or JSON files — all data is read from PostgreSQL
#### Scenario: Shared database connection
- **WHEN** the ML service reads candle/annotation data
- **THEN** it uses the same PostgreSQL connection (same database, same credentials) as for `training_runs`