chore: archive ml-db-consolidation change and sync specs

- Archived change to openspec/changes/archive/2026-02-17-ml-db-consolidation/
- Created new postgres-data-layer spec with PostgreSQL connection, schema definitions, Drizzle migrations, npm deps, and SQLite migration requirements
- Updated docker-deployment spec: Docker Compose now PostgreSQL-based (postgres dependency, ml-data volume, DATABASE_URL); env vars updated (DATABASE_URL added, DATABASE_PATH removed); database persistence updated to PostgreSQL volumes; health check updated to PostgreSQL
- Updated ml-training spec: added database name scenario (candle_annotator) and new direct annotation data access requirement

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Marko Djordjevic 2026-02-17 18:22:28 +01:00
parent 0e8dcc6707
commit 38df874255
10 changed files with 532 additions and 31 deletions

View file

@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-17

View file

@ -0,0 +1,110 @@
## Context
The candle annotator runs two databases:
1. **SQLite** (`data/candles.db`) — serves the Next.js frontend via Drizzle ORM (`better-sqlite3` driver). Contains 6 tables: charts, candles, annotations, annotation_types, span_annotations, span_label_types.
2. **PostgreSQL** (`postgres:5432/ml_db`) — serves the Python ML service via SQLAlchemy. Contains 1 table: training_runs.
The ML service cannot directly query annotation/candle data. Data flows through CSV/JSON file exports. PostgreSQL already runs in Docker for the ML service, so consolidating means adding frontend tables there — not introducing a new service.
## Goals / Non-Goals
**Goals:**
- Single PostgreSQL instance for all application data
- Drizzle ORM continues to manage frontend schema (just switches dialect)
- ML service gains direct read access to candle/annotation tables
- Simplified Docker setup (one fewer volume, one database to back up)
- One-time data migration path from SQLite to PostgreSQL
**Non-Goals:**
- Changing the ML service ORM (SQLAlchemy stays)
- Merging Drizzle and SQLAlchemy migration systems (each manages its own tables)
- Changing API route logic or query patterns beyond what's needed for the dialect switch
- Multi-tenant or schema separation (all tables go in the `public` schema)
- Migrating away from Drizzle ORM
## Decisions
### 1. Drizzle PostgreSQL driver: `drizzle-orm/node-postgres` with `pg`
**Choice**: Use `pg` (node-postgres) as the driver.
**Why**: `pg` is the most mature PostgreSQL driver for Node.js. Drizzle supports it natively via `drizzle-orm/node-postgres`. The `postgres` (postgres.js) driver is also an option but `pg` has broader ecosystem support and is easier to debug.
**Alternative considered**: `postgres` (postgres.js) — lighter, promise-native, but less battle-tested with Drizzle migrations.
### 2. Shared database, single `public` schema
**Choice**: All tables (frontend + ML) live in the same database (`ml_db`) and the default `public` schema.
**Why**: The table sets don't overlap (frontend has charts/candles/annotations, ML has training_runs). Separate schemas add complexity with no benefit for 7 total tables. The ML service already connects to `ml_db`.
**Alternative considered**: Separate PostgreSQL schemas (`app` and `ml`) — cleaner isolation but adds schema-prefix complexity to queries and cross-schema references. Not worth it at this scale.
### 3. Rename database from `ml_db` to `candle_annotator`
**Choice**: Rename the PostgreSQL database to `candle_annotator` since it now serves the whole application, not just ML.
**Why**: `ml_db` is misleading when the database holds frontend data too. Renaming during consolidation is the natural time to do it.
**Alternative considered**: Keep `ml_db` — avoids a rename step but creates lasting confusion.
### 4. Fresh Drizzle migrations (drop SQLite migrations)
**Choice**: Delete all existing SQLite migrations in `drizzle/`, rewrite the schema file with `pgTable` equivalents, and run `drizzle-kit generate` to produce a fresh initial PostgreSQL migration.
**Why**: SQLite migrations are dialect-specific (e.g., `integer` for booleans, no native timestamps). Converting them one-by-one is fragile. A clean start from the PostgreSQL schema is simpler and produces idiomatic SQL.
**Alternative considered**: Manually converting each SQLite migration to PostgreSQL — error-prone and provides no benefit since there's no production data that needs incremental migration history.
### 5. Type mappings: SQLite → PostgreSQL
| SQLite type | PostgreSQL type | Notes |
|---|---|---|
| `integer` (PK, autoIncrement) | `serial` | Auto-incrementing integer |
| `integer` (timestamps) | `timestamp` | Use `defaultNow()` where applicable |
| `integer` (booleans like `is_active`) | `boolean` | True PostgreSQL booleans |
| `real` | `doublePrecision` | OHLC price data |
| `text` | `text` | No change |
| `text` (JSON strings) | `jsonb` | For `geometry`, `sub_spans`, `model_prediction` |
### 6. Connection management for Next.js
**Choice**: Use a connection pool via `pg.Pool` with `max: 10` connections. Connection string from `DATABASE_URL` env var.
**Why**: SQLite was single-file, no pooling needed. PostgreSQL requires connection pooling for concurrent API requests. 10 connections is reasonable for the frontend workload.
### 7. ML service direct access to frontend tables
**Choice**: The ML service reads frontend tables (candles, annotations, span_annotations) directly via SQLAlchemy using its existing connection. No new SQLAlchemy models needed — raw SQL queries or lightweight table reflections are sufficient for read-only access.
**Why**: The ML service only needs to read training data. Adding full SQLAlchemy models for tables owned by Drizzle creates a dual-ownership problem. Raw queries or `Table` reflections keep it simple.
## Risks / Trade-offs
**[Schema drift between Drizzle and SQLAlchemy]** → Both ORMs manage tables in the same database. Drizzle owns frontend tables, SQLAlchemy owns ML tables. Neither should modify the other's tables. This is enforced by convention, not tooling.
**[Connection pool exhaustion]** → Adding the frontend's database traffic to the same PostgreSQL instance increases load. Mitigation: PostgreSQL 16 handles far more concurrent connections than SQLite. The `pg.Pool` max of 10 plus SQLAlchemy's pool of 5 is well within PostgreSQL's default `max_connections` of 100.
**[Data loss during migration]** → SQLite data must be migrated before switching. Mitigation: Write a migration script that exports SQLite data and imports to PostgreSQL. Run before deploying the new code. Keep the SQLite file as backup.
**[Drizzle push/generate differences]** → PostgreSQL dialect may generate slightly different migration SQL than expected. Mitigation: Review generated migrations before applying. Use `drizzle-kit push` for development, `drizzle-kit generate` + `drizzle-kit migrate` for production.
**[Boolean conversion]** → SQLite uses `0/1` for booleans, PostgreSQL uses `true/false`. Mitigation: The migration script handles conversion. Drizzle's `boolean()` type handles this transparently at the ORM level going forward.
## Migration Plan
1. **Update schema and dependencies** — Rewrite Drizzle schema for PostgreSQL, swap npm packages
2. **Generate fresh migrations**`drizzle-kit generate` from the new PostgreSQL schema
3. **Update docker-compose.yml** — Rename database, add frontend dependency on postgres, remove `candle-data` volume
4. **Update environment variables**`DATABASE_URL` for the frontend service
5. **Write data migration script**`scripts/migrate-sqlite-to-postgres.ts` that reads SQLite and inserts into PostgreSQL with type conversions
6. **Update db/index.ts** — Switch from `better-sqlite3` to `pg` pool, update migration runner
7. **Test locally** — Run migrations, migrate data, verify API routes work
8. **Deploy** — Stop current services, run PostgreSQL migrations, run data migration, deploy new code
9. **Rollback** — If issues arise, revert docker-compose and code, restore SQLite volume. The SQLite file is kept as backup for 1 week post-migration.
## Open Questions
- Should the ML service user (`ml_user`) have write access to frontend tables, or should we create a separate read-only role? (Recommendation: keep `ml_user` with full access for simplicity, revisit if the team grows.)
- Do we need to preserve SQLite migration history in git for reference, or delete the `drizzle/` folder contents entirely? (Recommendation: delete and start fresh.)

View file

@ -0,0 +1,35 @@
## Why
The project currently runs two separate database servers: SQLite (via Drizzle ORM) for the Next.js frontend and PostgreSQL for the ML service. This creates unnecessary operational complexity — two different ORMs, two migration systems, two backup strategies, and no ability for the ML service to directly query annotation/candle data. Consolidating to PostgreSQL as the single database simplifies deployment, enables direct cross-service data access, and reduces the infrastructure footprint.
## What Changes
- **BREAKING**: Replace SQLite/better-sqlite3/Drizzle with PostgreSQL/Drizzle (pg driver) for the Next.js frontend
- Remove the `candle-data` Docker volume (SQLite file storage) and `DATABASE_PATH` env var
- Migrate all frontend tables (charts, candles, annotations, annotation_types, span_annotations, span_label_types) into the existing PostgreSQL instance
- Update Drizzle schema and config to target PostgreSQL instead of SQLite
- Regenerate Drizzle migrations for PostgreSQL dialect (column types change: `integer``serial`, `real``double precision`, timestamps as proper `timestamp` types, etc.)
- Update the ML service to share the same PostgreSQL database (or a separate schema within it) so it can directly query candle/annotation data instead of relying on CSV/JSON exports
- Update docker-compose.yml to remove SQLite volume dependency and point the frontend at PostgreSQL
- Update environment variables: frontend gets `DATABASE_URL` pointing to PostgreSQL
## Capabilities
### New Capabilities
- `postgres-data-layer`: Unified PostgreSQL data access layer for the Next.js frontend, replacing the SQLite/better-sqlite3 setup with Drizzle's PostgreSQL driver
### Modified Capabilities
- `docker-deployment`: Container configuration changes — remove SQLite volume, add PostgreSQL dependency for the frontend service, update environment variables
- `ml-training`: ML service can now query annotations and candle data directly from PostgreSQL instead of requiring CSV/JSON file exports
## Impact
- **Database schema**: All 6 frontend tables move to PostgreSQL with type adaptations (SQLite integers → PostgreSQL serial/integer/timestamp)
- **ORM layer**: `src/lib/db/index.ts` switches from `better-sqlite3` to `postgres` driver; schema types in `src/lib/db/schema.ts` change to PostgreSQL equivalents
- **Dependencies**: Remove `better-sqlite3`, add `postgres` (or `pg`) npm package for Drizzle's PostgreSQL adapter
- **Migrations**: Existing SQLite migrations become obsolete; new PostgreSQL migrations needed
- **Docker**: `candle-annotator` service gains `depends_on: postgres`, loses `candle-data` volume mount
- **Environment**: `.env` and `.env.example` updated with PostgreSQL connection string for frontend
- **ML service**: `services/ml/app/db.py` gains access to frontend tables (candles, annotations) for direct querying
- **Data migration**: Existing SQLite data needs a one-time migration script to PostgreSQL
- **API routes**: All Next.js API routes using `db` from `src/lib/db` continue working (Drizzle abstracts the driver change), but queries using SQLite-specific syntax may need adjustment

View file

@ -0,0 +1,81 @@
## MODIFIED Requirements
### Requirement: Docker Compose configuration
The project SHALL include docker-compose.yml for simplified deployment orchestration.
#### Scenario: Service definition
- **WHEN** docker-compose.yml is parsed
- **THEN** defines service named 'candle-annotator' using Dockerfile from current directory
#### Scenario: Port mapping
- **WHEN** docker-compose up runs
- **THEN** maps host port 3000 to container port 3000
#### Scenario: Volume mounting for ML data
- **WHEN** docker-compose up runs
- **THEN** mounts named volume 'ml-data' to /app/ml-data in the candle-annotator container
#### Scenario: Frontend depends on PostgreSQL
- **WHEN** docker-compose up runs
- **THEN** the candle-annotator service starts only after the postgres service is healthy (`depends_on: postgres: condition: service_healthy`)
#### Scenario: Frontend DATABASE_URL
- **WHEN** the candle-annotator service starts
- **THEN** the `DATABASE_URL` environment variable is set to `postgresql://ml_user:ml_password@postgres:5432/candle_annotator`
#### Scenario: Restart policy
- **WHEN** container crashes or stops
- **THEN** docker-compose automatically restarts container unless explicitly stopped (restart: unless-stopped)
#### Scenario: No SQLite volume
- **WHEN** docker-compose.yml is parsed
- **THEN** there is no `candle-data` volume defined or mounted
### Requirement: Environment variable configuration
The project SHALL use environment variables for runtime configuration.
#### Scenario: .env.example file
- **WHEN** repository is cloned
- **THEN** includes .env.example file documenting all configurable environment variables with example values
#### Scenario: DATABASE_URL configuration
- **WHEN** `DATABASE_URL` environment variable is set
- **THEN** the Next.js application connects to the PostgreSQL database at the specified URL
#### Scenario: No DATABASE_PATH variable
- **WHEN** environment variables are inspected
- **THEN** there is no `DATABASE_PATH` variable (SQLite path is removed)
#### Scenario: PORT configuration
- **WHEN** PORT environment variable is set
- **THEN** Next.js server listens on specified port (default: 3000)
#### Scenario: NODE_ENV configuration
- **WHEN** NODE_ENV environment variable is set to 'production'
- **THEN** Next.js runs in production mode with optimizations enabled
### Requirement: Database persistence
The deployment SHALL ensure PostgreSQL data persists across container restarts.
#### Scenario: PostgreSQL volume
- **WHEN** docker-compose up runs
- **THEN** the `postgres-data` named volume is mounted to `/var/lib/postgresql/data` in the postgres container
#### Scenario: Container restart preserves data
- **WHEN** the postgres container is stopped and restarted
- **THEN** all database tables and data remain intact
#### Scenario: PostgreSQL database name
- **WHEN** the postgres service starts
- **THEN** the `POSTGRES_DB` environment variable is set to `candle_annotator`
### Requirement: Health check endpoint
The API SHALL provide a health check endpoint for container orchestration.
#### Scenario: Health check endpoint responds
- **WHEN** GET request sent to `/api/health`
- **THEN** system returns 200 status with JSON `{ status: 'ok', timestamp: <unix_timestamp> }`
#### Scenario: Database connection check
- **WHEN** GET request sent to `/api/health?check=db`
- **THEN** system attempts a PostgreSQL query and returns 200 if successful, 503 if database unavailable

View file

@ -0,0 +1,37 @@
## MODIFIED Requirements
### Requirement: PostgreSQL training metadata storage
The system SHALL store training run metadata in the PostgreSQL database. Each training run record SHALL include: run_id (MLflow run ID), model_type, experiment_name, pipeline_config_hash, dataset_version, metrics summary (JSON), status, and timestamps (created_at, completed_at).
#### Scenario: Store training run record
- **WHEN** a training run completes successfully
- **THEN** the system inserts a record into the PostgreSQL `training_runs` table with the run metadata
#### Scenario: Query training history
- **WHEN** the system queries training runs
- **THEN** it returns records from PostgreSQL ordered by created_at descending
#### Scenario: Database name updated
- **WHEN** the ML service connects to PostgreSQL
- **THEN** it connects to the `candle_annotator` database (not `ml_db`)
## ADDED Requirements
### Requirement: Direct annotation data access
The ML service SHALL read candle and annotation data directly from PostgreSQL instead of requiring CSV/JSON file exports. The ML service SHALL query the `candles`, `annotations`, `span_annotations`, and `charts` tables for training data.
#### Scenario: Query candle data for training
- **WHEN** the ML training pipeline needs OHLC data for a chart
- **THEN** it queries the `candles` table in PostgreSQL filtered by `chart_id`, ordered by `time`
#### Scenario: Query span annotations for labels
- **WHEN** the ML training pipeline needs labeled spans for training
- **THEN** it queries the `span_annotations` table in PostgreSQL filtered by `chart_id` and optionally by `source`
#### Scenario: No CSV/JSON export required
- **WHEN** the ML training pipeline starts
- **THEN** it does not require pre-exported CSV or JSON files — all data is read from PostgreSQL
#### Scenario: Shared database connection
- **WHEN** the ML service reads candle/annotation data
- **THEN** it uses the same PostgreSQL connection (same database, same credentials) as for `training_runs`

View file

@ -0,0 +1,80 @@
## ADDED Requirements
### Requirement: PostgreSQL connection via Drizzle ORM
The Next.js application SHALL connect to PostgreSQL using Drizzle ORM with the `node-postgres` (`pg`) driver. The connection SHALL use a pool with a configurable maximum number of connections (default: 10). The connection string SHALL be read from the `DATABASE_URL` environment variable.
#### Scenario: Successful connection
- **WHEN** the application starts with a valid `DATABASE_URL` pointing to a running PostgreSQL instance
- **THEN** Drizzle ORM establishes a connection pool and the `db` export is ready for queries
#### Scenario: Missing DATABASE_URL
- **WHEN** the `DATABASE_URL` environment variable is not set
- **THEN** the application SHALL fail to start with an error message indicating the missing variable
#### Scenario: Database unreachable
- **WHEN** the PostgreSQL instance is not reachable at the configured URL
- **THEN** the application SHALL fail to start with a connection error
### Requirement: PostgreSQL schema definitions
The Drizzle schema SHALL define all frontend tables using `pgTable` from `drizzle-orm/pg-core`. The following tables SHALL be defined: `charts`, `candles`, `annotation_types`, `annotations`, `span_label_types`, `span_annotations`.
#### Scenario: Charts table schema
- **WHEN** the schema is loaded
- **THEN** the `charts` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `created_at` (timestamp, not null, default now)
#### Scenario: Candles table schema
- **WHEN** the schema is loaded
- **THEN** the `candles` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `time` (timestamp, not null), `open` (double precision, not null), `high` (double precision, not null), `low` (double precision, not null), `close` (double precision, not null), with a unique index on `(chart_id, time)`
#### Scenario: Annotation types table schema
- **WHEN** the schema is loaded
- **THEN** the `annotation_types` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `display_name` (text, not null), `color` (text, not null), `category` (text, not null), `icon` (text, nullable), `is_active` (boolean, not null, default true), `created_at` (timestamp, not null, default now)
#### Scenario: Annotations table schema
- **WHEN** the schema is loaded
- **THEN** the `annotations` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `timestamp` (timestamp, not null), `label_type` (text, not null), `geometry` (jsonb, nullable), `color` (text, default '#3b82f6'), `created_at` (timestamp, not null, default now)
#### Scenario: Span label types table schema
- **WHEN** the schema is loaded
- **THEN** the `span_label_types` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `display_name` (text, not null), `color` (text, not null), `hotkey` (text, nullable), `is_active` (boolean, not null, default true), `sort_order` (integer, not null, default 0), `created_at` (timestamp, not null, default now)
#### Scenario: Span annotations table schema
- **WHEN** the schema is loaded
- **THEN** the `span_annotations` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `start_time` (timestamp, not null), `end_time` (timestamp, not null), `label` (text, not null), `confidence` (integer, nullable), `outcome` (text, nullable), `notes` (text, nullable), `sub_spans` (jsonb, nullable), `color` (text, not null, default '#2196F3'), `source` (text, not null, default 'human'), `model_prediction` (jsonb, nullable), `created_at` (timestamp, not null, default now)
### Requirement: PostgreSQL migrations via Drizzle Kit
The project SHALL use Drizzle Kit to generate and apply PostgreSQL migrations. The `drizzle.config.ts` SHALL target the `postgresql` dialect. Existing SQLite migrations SHALL be removed.
#### Scenario: Generate migrations
- **WHEN** `drizzle-kit generate` is executed
- **THEN** a new SQL migration file is created in the `drizzle/` directory with PostgreSQL-dialect DDL
#### Scenario: Apply migrations at startup
- **WHEN** the application starts (not during build phase)
- **THEN** Drizzle runs pending migrations against the PostgreSQL database
#### Scenario: Skip migrations during build
- **WHEN** `NEXT_PHASE` is `phase-production-build` or `phase-development-build`
- **THEN** migration execution is skipped
### Requirement: npm dependency changes
The project SHALL remove `better-sqlite3` and `@types/better-sqlite3` from dependencies and add `pg` and `@types/pg`.
#### Scenario: Dependencies updated
- **WHEN** `package.json` is inspected
- **THEN** `better-sqlite3` and `@types/better-sqlite3` are absent, and `pg` and `@types/pg` are present in dependencies
### Requirement: Data migration from SQLite to PostgreSQL
The project SHALL include a one-time migration script at `scripts/migrate-sqlite-to-postgres.ts` that reads all data from the SQLite database and inserts it into PostgreSQL with appropriate type conversions.
#### Scenario: Migrate all tables
- **WHEN** the migration script is executed with both databases accessible
- **THEN** all rows from charts, candles, annotation_types, annotations, span_label_types, and span_annotations are transferred to PostgreSQL
#### Scenario: Type conversions applied
- **WHEN** data is migrated
- **THEN** SQLite integer timestamps are converted to PostgreSQL timestamps, integer booleans (0/1) are converted to PostgreSQL booleans, and text JSON fields are inserted as jsonb
#### Scenario: Idempotent execution
- **WHEN** the migration script is run a second time on an already-migrated database
- **THEN** the script either skips existing data or clears and re-inserts (with a flag), without creating duplicates

View file

@ -0,0 +1,53 @@
## 1. Dependencies and Configuration
- [x] 1.1 Remove `better-sqlite3` and `@types/better-sqlite3` from package.json
- [x] 1.2 Add `pg` and `@types/pg` to package.json dependencies
- [x] 1.3 Run `npm install` to update node_modules and lockfile
- [x] 1.4 Update `drizzle.config.ts` to target `postgresql` dialect with `DATABASE_URL` env var
- [x] 1.5 Update `.env.example` — replace `DATABASE_PATH` with `DATABASE_URL=postgresql://ml_user:ml_password@postgres:5432/candle_annotator`
## 2. Drizzle Schema Migration (SQLite → PostgreSQL)
- [x] 2.1 Rewrite `src/lib/db/schema.ts` — replace all `sqliteTable` with `pgTable`, apply type mappings (integer→serial, integer→timestamp, integer→boolean, real→doublePrecision, text JSON→jsonb)
- [x] 2.2 Delete all existing SQLite migration files in `drizzle/` directory
- [x] 2.3 Run `drizzle-kit generate` to produce fresh PostgreSQL migration SQL
- [x] 2.4 Review generated migration SQL for correctness
## 3. Database Connection Layer
- [x] 3.1 Rewrite `src/lib/db/index.ts` — replace `better-sqlite3` driver with `pg.Pool` (max: 10), read `DATABASE_URL` from env, fail if missing
- [x] 3.2 Update migration runner to use PostgreSQL-compatible execution (skip during build phase via `NEXT_PHASE` check)
- [x] 3.3 Update all imports if any changed (verify `db` export still works for API routes)
## 4. API Route Adjustments
- [x] 4.1 Audit all Next.js API routes using `db` for SQLite-specific syntax (e.g., integer booleans, raw SQL fragments)
- [x] 4.2 Fix any SQLite-specific query patterns to work with PostgreSQL (boolean handling, timestamp handling, jsonb operations)
- [x] 4.3 Update health check endpoint (`/api/health`) to verify PostgreSQL connectivity
## 5. Docker and Deployment
- [x] 5.1 Update `docker-compose.yml` — rename `POSTGRES_DB` to `candle_annotator`, add `DATABASE_URL` env to candle-annotator service, add `depends_on: postgres` with health check condition
- [x] 5.2 Remove `candle-data` volume from `docker-compose.yml` (SQLite volume)
- [x] 5.3 Update `Dockerfile` if it references SQLite or `DATABASE_PATH`
- [x] 5.4 Update ML service database connection — change database name from `ml_db` to `candle_annotator` in environment config
## 6. ML Service Direct Data Access
- [x] 6.1 Add SQLAlchemy table reflections or raw queries in the ML service for reading `candles`, `annotations`, `span_annotations`, `charts` tables
- [x] 6.2 Update ML training pipeline to query candle/annotation data from PostgreSQL instead of CSV/JSON exports
- [x] 6.3 Remove or deprecate any CSV/JSON export code paths that are no longer needed
## 7. Data Migration Script
- [x] 7.1 Create `scripts/migrate-sqlite-to-postgres.ts` — read all 6 tables from SQLite, apply type conversions (timestamps, booleans, JSON→jsonb), insert into PostgreSQL
- [x] 7.2 Make the script idempotent (skip or clear+re-insert with flag)
- [x] 7.3 Test migration script with existing SQLite data
## 8. Testing and Verification
- [x] 8.1 Run the full application locally with PostgreSQL — verify all API routes work
- [x] 8.2 Verify ML service can query candle/annotation data from shared database
- [x] 8.3 Run `docker compose up` and verify all services start correctly with new configuration
- [x] 8.4 Update `DEPLOYMENT.md` with new deployment steps (PostgreSQL migration, data migration script, rollback procedure)
- [x] 8.5 Update `README.md` and `CLAUDE_DESCRIPTION.md` with database architecture changes

View file

@ -32,27 +32,31 @@ The project SHALL include docker-compose.yml for simplified deployment orchestra
#### Scenario: Service definition
- **WHEN** docker-compose.yml is parsed
- **THEN** defines single service named 'candle-annotator' using Dockerfile from current directory
- **THEN** defines service named 'candle-annotator' using Dockerfile from current directory
#### Scenario: Port mapping
- **WHEN** docker-compose up runs
- **THEN** maps host port 3000 to container port 3000 (configurable via PORT environment variable)
- **THEN** maps host port 3000 to container port 3000
#### Scenario: Volume mounting for database
#### Scenario: Volume mounting for ML data
- **WHEN** docker-compose up runs
- **THEN** mounts named volume 'candle-data' to /app/data in container for SQLite database persistence
- **THEN** mounts named volume 'ml-data' to /app/ml-data in the candle-annotator container
#### Scenario: Environment variable configuration
- **WHEN** docker-compose.yml is used
- **THEN** supports loading environment variables from .env file for NODE_ENV, PORT, and other configs
#### Scenario: Frontend depends on PostgreSQL
- **WHEN** docker-compose up runs
- **THEN** the candle-annotator service starts only after the postgres service is healthy (`depends_on: postgres: condition: service_healthy`)
#### Scenario: Frontend DATABASE_URL
- **WHEN** the candle-annotator service starts
- **THEN** the `DATABASE_URL` environment variable is set to `postgresql://ml_user:ml_password@postgres:5432/candle_annotator`
#### Scenario: Restart policy
- **WHEN** container crashes or stops
- **THEN** docker-compose automatically restarts container unless explicitly stopped (restart: unless-stopped)
#### Scenario: Volume declaration
#### Scenario: No SQLite volume
- **WHEN** docker-compose.yml is parsed
- **THEN** declares 'candle-data' as named volume in volumes section
- **THEN** there is no `candle-data` volume defined or mounted
### Requirement: Environment variable configuration
The project SHALL use environment variables for runtime configuration.
@ -61,6 +65,14 @@ The project SHALL use environment variables for runtime configuration.
- **WHEN** repository is cloned
- **THEN** includes .env.example file documenting all configurable environment variables with example values
#### Scenario: DATABASE_URL configuration
- **WHEN** `DATABASE_URL` environment variable is set
- **THEN** the Next.js application connects to the PostgreSQL database at the specified URL
#### Scenario: No DATABASE_PATH variable
- **WHEN** environment variables are inspected
- **THEN** there is no `DATABASE_PATH` variable (SQLite path is removed)
#### Scenario: PORT configuration
- **WHEN** PORT environment variable is set
- **THEN** Next.js server listens on specified port (default: 3000)
@ -69,14 +81,6 @@ The project SHALL use environment variables for runtime configuration.
- **WHEN** NODE_ENV environment variable is set to 'production'
- **THEN** Next.js runs in production mode with optimizations enabled
#### Scenario: Database path configuration
- **WHEN** DATABASE_PATH environment variable is set (optional)
- **THEN** SQLite database file is created at specified path (default: ./data/candles.db)
#### Scenario: HOSTNAME configuration
- **WHEN** HOSTNAME environment variable is set
- **THEN** Next.js server binds to specified hostname (default: 0.0.0.0 for containers)
### Requirement: Health check endpoint
The API SHALL provide a health check endpoint for container orchestration.
@ -86,7 +90,7 @@ The API SHALL provide a health check endpoint for container orchestration.
#### Scenario: Database connection check
- **WHEN** GET request sent to `/api/health?check=db`
- **THEN** system attempts simple database query and returns 200 if successful, 503 if database unavailable
- **THEN** system attempts a PostgreSQL query and returns 200 if successful, 503 if database unavailable
#### Scenario: Health check in Dockerfile
- **WHEN** Dockerfile defines HEALTHCHECK
@ -138,23 +142,19 @@ The Docker image SHALL be optimized for production use with minimal size.
- **THEN** final image size is under 200MB (excluding data volume)
### Requirement: Database persistence
The deployment SHALL ensure SQLite database persists across container restarts.
The deployment SHALL ensure PostgreSQL data persists across container restarts.
#### Scenario: Volume mounting
- **WHEN** container runs with volume mount
- **THEN** /app/data directory is mounted from host or Docker volume
#### Scenario: Database file location
- **WHEN** application initializes database
- **THEN** SQLite file is created at /app/data/candles.db (inside mounted volume)
#### Scenario: PostgreSQL volume
- **WHEN** docker-compose up runs
- **THEN** the `postgres-data` named volume is mounted to `/var/lib/postgresql/data` in the postgres container
#### Scenario: Container restart preserves data
- **WHEN** container is stopped and restarted
- **THEN** existing database file is reused and all annotations and candles remain intact
- **WHEN** the postgres container is stopped and restarted
- **THEN** all database tables and data remain intact
#### Scenario: File permissions
- **WHEN** container creates database file
- **THEN** appuser has read/write permissions to /app/data directory
#### Scenario: PostgreSQL database name
- **WHEN** the postgres service starts
- **THEN** the `POSTGRES_DB` environment variable is set to `candle_annotator`
### Requirement: Deployment documentation
DEPLOYMENT.md SHALL include comprehensive Docker deployment instructions.

View file

@ -84,6 +84,29 @@ The system SHALL store training run metadata in the PostgreSQL database. Each tr
- **WHEN** the system queries training runs
- **THEN** it returns records from PostgreSQL ordered by created_at descending
#### Scenario: Database name updated
- **WHEN** the ML service connects to PostgreSQL
- **THEN** it connects to the `candle_annotator` database (not `ml_db`)
### Requirement: Direct annotation data access
The ML service SHALL read candle and annotation data directly from PostgreSQL instead of requiring CSV/JSON file exports. The ML service SHALL query the `candles`, `annotations`, `span_annotations`, and `charts` tables for training data.
#### Scenario: Query candle data for training
- **WHEN** the ML training pipeline needs OHLC data for a chart
- **THEN** it queries the `candles` table in PostgreSQL filtered by `chart_id`, ordered by `time`
#### Scenario: Query span annotations for labels
- **WHEN** the ML training pipeline needs labeled spans for training
- **THEN** it queries the `span_annotations` table in PostgreSQL filtered by `chart_id` and optionally by `source`
#### Scenario: No CSV/JSON export required
- **WHEN** the ML training pipeline starts
- **THEN** it does not require pre-exported CSV or JSON files — all data is read from PostgreSQL
#### Scenario: Shared database connection
- **WHEN** the ML service reads candle/annotation data
- **THEN** it uses the same PostgreSQL connection (same database, same credentials) as for `training_runs`
### Requirement: Pipeline config logging
The system SHALL log the full pipeline YAML config as an MLflow artifact with each training run. This config SHALL be used during inference to replicate the exact preprocessing steps.

View file

@ -0,0 +1,80 @@
## ADDED Requirements
### Requirement: PostgreSQL connection via Drizzle ORM
The Next.js application SHALL connect to PostgreSQL using Drizzle ORM with the `node-postgres` (`pg`) driver. The connection SHALL use a pool with a configurable maximum number of connections (default: 10). The connection string SHALL be read from the `DATABASE_URL` environment variable.
#### Scenario: Successful connection
- **WHEN** the application starts with a valid `DATABASE_URL` pointing to a running PostgreSQL instance
- **THEN** Drizzle ORM establishes a connection pool and the `db` export is ready for queries
#### Scenario: Missing DATABASE_URL
- **WHEN** the `DATABASE_URL` environment variable is not set
- **THEN** the application SHALL fail to start with an error message indicating the missing variable
#### Scenario: Database unreachable
- **WHEN** the PostgreSQL instance is not reachable at the configured URL
- **THEN** the application SHALL fail to start with a connection error
### Requirement: PostgreSQL schema definitions
The Drizzle schema SHALL define all frontend tables using `pgTable` from `drizzle-orm/pg-core`. The following tables SHALL be defined: `charts`, `candles`, `annotation_types`, `annotations`, `span_label_types`, `span_annotations`.
#### Scenario: Charts table schema
- **WHEN** the schema is loaded
- **THEN** the `charts` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `created_at` (timestamp, not null, default now)
#### Scenario: Candles table schema
- **WHEN** the schema is loaded
- **THEN** the `candles` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `time` (timestamp, not null), `open` (double precision, not null), `high` (double precision, not null), `low` (double precision, not null), `close` (double precision, not null), with a unique index on `(chart_id, time)`
#### Scenario: Annotation types table schema
- **WHEN** the schema is loaded
- **THEN** the `annotation_types` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `display_name` (text, not null), `color` (text, not null), `category` (text, not null), `icon` (text, nullable), `is_active` (boolean, not null, default true), `created_at` (timestamp, not null, default now)
#### Scenario: Annotations table schema
- **WHEN** the schema is loaded
- **THEN** the `annotations` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `timestamp` (timestamp, not null), `label_type` (text, not null), `geometry` (jsonb, nullable), `color` (text, default '#3b82f6'), `created_at` (timestamp, not null, default now)
#### Scenario: Span label types table schema
- **WHEN** the schema is loaded
- **THEN** the `span_label_types` table has columns: `id` (serial, primary key), `name` (text, unique, not null), `display_name` (text, not null), `color` (text, not null), `hotkey` (text, nullable), `is_active` (boolean, not null, default true), `sort_order` (integer, not null, default 0), `created_at` (timestamp, not null, default now)
#### Scenario: Span annotations table schema
- **WHEN** the schema is loaded
- **THEN** the `span_annotations` table has columns: `id` (serial, primary key), `chart_id` (integer, foreign key to charts.id, not null), `start_time` (timestamp, not null), `end_time` (timestamp, not null), `label` (text, not null), `confidence` (integer, nullable), `outcome` (text, nullable), `notes` (text, nullable), `sub_spans` (jsonb, nullable), `color` (text, not null, default '#2196F3'), `source` (text, not null, default 'human'), `model_prediction` (jsonb, nullable), `created_at` (timestamp, not null, default now)
### Requirement: PostgreSQL migrations via Drizzle Kit
The project SHALL use Drizzle Kit to generate and apply PostgreSQL migrations. The `drizzle.config.ts` SHALL target the `postgresql` dialect. Existing SQLite migrations SHALL be removed.
#### Scenario: Generate migrations
- **WHEN** `drizzle-kit generate` is executed
- **THEN** a new SQL migration file is created in the `drizzle/` directory with PostgreSQL-dialect DDL
#### Scenario: Apply migrations at startup
- **WHEN** the application starts (not during build phase)
- **THEN** Drizzle runs pending migrations against the PostgreSQL database
#### Scenario: Skip migrations during build
- **WHEN** `NEXT_PHASE` is `phase-production-build` or `phase-development-build`
- **THEN** migration execution is skipped
### Requirement: npm dependency changes
The project SHALL remove `better-sqlite3` and `@types/better-sqlite3` from dependencies and add `pg` and `@types/pg`.
#### Scenario: Dependencies updated
- **WHEN** `package.json` is inspected
- **THEN** `better-sqlite3` and `@types/better-sqlite3` are absent, and `pg` and `@types/pg` are present in dependencies
### Requirement: Data migration from SQLite to PostgreSQL
The project SHALL include a one-time migration script at `scripts/migrate-sqlite-to-postgres.ts` that reads all data from the SQLite database and inserts it into PostgreSQL with appropriate type conversions.
#### Scenario: Migrate all tables
- **WHEN** the migration script is executed with both databases accessible
- **THEN** all rows from charts, candles, annotation_types, annotations, span_label_types, and span_annotations are transferred to PostgreSQL
#### Scenario: Type conversions applied
- **WHEN** data is migrated
- **THEN** SQLite integer timestamps are converted to PostgreSQL timestamps, integer booleans (0/1) are converted to PostgreSQL booleans, and text JSON fields are inserted as jsonb
#### Scenario: Idempotent execution
- **WHEN** the migration script is run a second time on an already-migrated database
- **THEN** the script either skips existing data or clears and re-inserts (with a flag), without creating duplicates