candle-annotator/openspec/changes/ml-db-consolidation/design.md at 5f70f13da3090870f253b6ea22119b65a9a5a837

Marko Djordjevic 5f70f13da3 feat: migrate from SQLite to PostgreSQL - complete schema and API updates

- Remove better-sqlite3, add pg driver
- Convert schema to PostgreSQL types (serial, timestamp, boolean, jsonb)
- Generate fresh PostgreSQL migrations
- Update database connection layer with pg.Pool
- Fix all API routes: remove JSON.parse/stringify, use native timestamps and booleans
- Update drizzle.config.ts and .env.example for PostgreSQL

2026-02-17 13:43:06 +01:00


Context

The candle annotator runs two databases:

SQLite (data/candles.db) — serves the Next.js frontend via Drizzle ORM (better-sqlite3 driver). Contains 6 tables: charts, candles, annotations, annotation_types, span_annotations, span_label_types.
PostgreSQL (postgres:5432/ml_db) — serves the Python ML service via SQLAlchemy. Contains 1 table: training_runs.

The ML service cannot query annotation or candle data directly; instead, data reaches it through CSV/JSON file exports. PostgreSQL already runs in Docker for the ML service, so consolidating means adding the frontend tables there rather than introducing a new service.

Goals / Non-Goals

Goals:

Single PostgreSQL instance for all application data
Drizzle ORM continues to manage frontend schema (just switches dialect)
ML service gains direct read access to candle/annotation tables
Simplified Docker setup (one fewer volume, one database to back up)
One-time data migration path from SQLite to PostgreSQL

Non-Goals:

Changing the ML service ORM (SQLAlchemy stays)
Merging Drizzle and SQLAlchemy migration systems (each manages its own tables)
Changing API route logic or query patterns beyond what's needed for the dialect switch
Multi-tenant or schema separation (all tables go in the public schema)
Migrating away from Drizzle ORM

Decisions

1. Drizzle PostgreSQL driver: `drizzle-orm/node-postgres` with `pg`

Choice: Use pg (node-postgres) as the driver.

Why: pg is the most mature PostgreSQL driver for Node.js. Drizzle supports it natively via drizzle-orm/node-postgres. The postgres (postgres.js) driver is also an option but pg has broader ecosystem support and is easier to debug.

Alternative considered: postgres (postgres.js) — lighter, promise-native, but less battle-tested with Drizzle migrations.

2. Shared database, single `public` schema

Choice: All tables (frontend + ML) live in the same database (ml_db) and the default public schema.

Why: The table sets don't overlap (frontend has charts/candles/annotations, ML has training_runs). Separate schemas add complexity with no benefit for 7 total tables. The ML service already connects to ml_db.

Alternative considered: Separate PostgreSQL schemas (app and ml) — cleaner isolation but adds schema-prefix complexity to queries and cross-schema references. Not worth it at this scale.

3. Rename database from `ml_db` to `candle_annotator`

Choice: Rename the PostgreSQL database to candle_annotator since it now serves the whole application, not just ML.

Why: ml_db is misleading when the database holds frontend data too. Renaming during consolidation is the natural time to do it.

Alternative considered: Keep ml_db — avoids a rename step but creates lasting confusion.

4. Fresh Drizzle migrations (drop SQLite migrations)

Choice: Delete all existing SQLite migrations in drizzle/, rewrite the schema file with pgTable equivalents, and run drizzle-kit generate to produce a fresh initial PostgreSQL migration.

Why: SQLite migrations are dialect-specific (e.g., integer for booleans, no native timestamps). Converting them one-by-one is fragile. A clean start from the PostgreSQL schema is simpler and produces idiomatic SQL.

Alternative considered: Manually converting each SQLite migration to PostgreSQL — error-prone and provides no benefit since there's no production data that needs incremental migration history.

5. Type mappings: SQLite → PostgreSQL

| SQLite type | PostgreSQL type | Notes |
| --- | --- | --- |
| `integer` (PK, autoIncrement) | `serial` | Auto-incrementing integer |
| `integer` (timestamps) | `timestamp` | Use `defaultNow()` where applicable |
| `integer` (booleans like `is_active`) | `boolean` | True PostgreSQL booleans |
| `real` | `doublePrecision` | OHLC price data |
| `text` | `text` | No change |
| `text` (JSON strings) | `jsonb` | For `geometry`, `sub_spans`, `model_prediction` |
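
As a rough illustration of how these mappings apply to row values during the one-time data migration, the sketch below converts a SQLite row into PostgreSQL-ready values. The helper names and the `created_at` column are hypothetical, and the code assumes SQLite timestamps are stored as Unix epoch seconds; if the existing schema stores milliseconds, the multiplication would need to be dropped.

```typescript
// Value-level conversions implied by the mapping table (illustrative only).
// Assumption: SQLite timestamp columns hold Unix epoch *seconds*.

type Row = Record<string, unknown>;

/** SQLite stores booleans as 0/1 integers; PostgreSQL expects true/false. */
function toBoolean(value: unknown): boolean | null {
  if (value === null || value === undefined) return null;
  return value === 1 || value === true;
}

/** Convert an integer epoch (seconds) into a Date for a `timestamp` column. */
function toTimestamp(value: unknown): Date | null {
  if (value === null || value === undefined) return null;
  if (typeof value !== "number") {
    throw new TypeError(`expected a number, got ${typeof value}`);
  }
  return new Date(value * 1000);
}

/** Parse JSON text into a structured value destined for a `jsonb` column. */
function toJsonb(value: unknown): unknown {
  if (value === null || value === undefined) return null;
  if (typeof value !== "string") return value; // already structured
  return JSON.parse(value);
}

/** Apply the conversions to one row; untouched columns pass through as-is. */
function convertRow(row: Row): Row {
  return {
    ...row,
    created_at: toTimestamp(row["created_at"]), // hypothetical column name
    is_active: toBoolean(row["is_active"]),
    geometry: toJsonb(row["geometry"]),
  };
}

const converted = convertRow({
  id: 1,
  created_at: 1700000000,
  is_active: 1,
  geometry: '{"x":1}',
});
console.log(converted);
```

Columns of type `real` and plain `text` need no value conversion, which is why they do not appear in `convertRow`.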

6. Connection management for Next.js

Choice: Use a connection pool via pg.Pool with max: 10 connections. Connection string from DATABASE_URL env var.

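Why: SQLite was a single file accessed in-process, so no pooling was needed. PostgreSQL connections are comparatively expensive to open (each one is served by its own backend process), so a pool lets concurrent API requests reuse a small set of connections. A cap of 10 is a reasonable starting point for the frontend workload.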
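
A minimal sketch of how the pool settings could be derived and validated before the pool is created. The `buildPoolSettings` helper and the example credentials are hypothetical; `connectionString` and `max` are the option names accepted by `pg.Pool`.

```typescript
// Sketch: derive pg.Pool settings from the environment and fail fast on bad input.
interface PoolSettings {
  connectionString: string;
  max: number;
}

function buildPoolSettings(env: Record<string, string | undefined>): PoolSettings {
  const connectionString = env["DATABASE_URL"];
  if (!connectionString) {
    throw new Error("DATABASE_URL is not set");
  }
  // Catch a leftover SQLite path or other non-PostgreSQL URL early.
  const isPostgresUrl =
    connectionString.startsWith("postgres://") ||
    connectionString.startsWith("postgresql://");
  if (!isPostgresUrl) {
    throw new Error("DATABASE_URL must be a postgres:// or postgresql:// URL");
  }
  return { connectionString, max: 10 };
}

// Example with made-up credentials:
const settings = buildPoolSettings({
  DATABASE_URL: "postgres://app_user:secret@postgres:5432/candle_annotator",
});
console.log(settings.max);

// In db/index.ts the result would be used roughly like this (not executed here):
//   import { Pool } from "pg";
//   import { drizzle } from "drizzle-orm/node-postgres";
//   const pool = new Pool(buildPoolSettings(process.env));
//   const db = drizzle(pool);
```

Validating the URL at startup surfaces a misconfigured environment immediately instead of on the first API request.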

7. ML service direct access to frontend tables

Choice: The ML service reads frontend tables (candles, annotations, span_annotations) directly via SQLAlchemy using its existing connection. No new SQLAlchemy models needed — raw SQL queries or lightweight table reflections are sufficient for read-only access.

Why: The ML service only needs to read training data. Adding full SQLAlchemy models for tables owned by Drizzle creates a dual-ownership problem. Raw queries or Table reflections keep it simple.

Risks / Trade-offs

[Schema drift between Drizzle and SQLAlchemy] → Both ORMs manage tables in the same database. Drizzle owns frontend tables, SQLAlchemy owns ML tables. Neither should modify the other's tables. This is enforced by convention, not tooling.

[Connection pool exhaustion] → Moving the frontend's database traffic onto the same PostgreSQL instance increases its load. Mitigation: PostgreSQL 16 is built for many concurrent client connections, and the pg.Pool max of 10 plus SQLAlchemy's pool of 5 totals 15 connections, well within PostgreSQL's default max_connections of 100. These limits apply per process, so running several frontend or ML service processes multiplies the total.
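
The connection budget can be checked with simple arithmetic. The sketch below assumes one frontend process and one ML service process; the `connectionHeadroom` helper and the process counts are illustrative, not part of the existing code.

```typescript
// Sketch: worst-case connection usage versus PostgreSQL's max_connections.
interface PoolBudget {
  name: string;
  maxPerProcess: number; // pool size configured in each process
  processes: number; // how many processes open such a pool (assumed)
}

function connectionHeadroom(pools: PoolBudget[], maxConnections: number): number {
  const worstCase = pools.reduce(
    (sum, pool) => sum + pool.maxPerProcess * pool.processes,
    0,
  );
  return maxConnections - worstCase;
}

const headroom = connectionHeadroom(
  [
    { name: "Next.js frontend (pg.Pool)", maxPerProcess: 10, processes: 1 },
    { name: "ML service (SQLAlchemy)", maxPerProcess: 5, processes: 1 },
  ],
  100, // PostgreSQL default max_connections
);
console.log(headroom); // 85
```

A negative result would indicate that the pools can collectively request more connections than the server allows.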

[Data loss during migration] → SQLite data must be migrated before switching. Mitigation: Write a migration script that exports the SQLite data and imports it into PostgreSQL. Run it before deploying the new code, and keep the SQLite file as a backup.

[Drizzle push/generate differences] → PostgreSQL dialect may generate slightly different migration SQL than expected. Mitigation: Review generated migrations before applying. Use drizzle-kit push for development, drizzle-kit generate + drizzle-kit migrate for production.

[Boolean conversion] → SQLite uses 0/1 for booleans, PostgreSQL uses true/false. Mitigation: The migration script handles conversion. Drizzle's boolean() type handles this transparently at the ORM level going forward.

Migration Plan

1. Update schema and dependencies: rewrite the Drizzle schema for PostgreSQL and swap npm packages.
2. Generate fresh migrations: run drizzle-kit generate from the new PostgreSQL schema.
3. Update docker-compose.yml: rename the database, add the frontend dependency on postgres, and remove the candle-data volume.
4. Update environment variables: set DATABASE_URL for the frontend service.
5. Write the data migration script: scripts/migrate-sqlite-to-postgres.ts reads SQLite and inserts into PostgreSQL with type conversions.
6. Update db/index.ts: switch from better-sqlite3 to a pg pool and update the migration runner.
7. Test locally: run migrations, migrate data, and verify that the API routes work.
8. Deploy: stop current services, run PostgreSQL migrations, run the data migration, and deploy the new code.
9. Rollback: if issues arise, revert docker-compose and code, and restore the SQLite volume. The SQLite file is kept as a backup for 1 week post-migration.
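
For the data migration script, one possible shape is to build parameterized batch inserts and then realign each `serial` sequence, since inserting rows with explicit ids does not advance the sequence on its own. The helpers `buildInsert` and `buildSequenceReset` and the `name` column are hypothetical; `setval` and `pg_get_serial_sequence` are standard PostgreSQL functions, and the `{ text, values }` shape matches the query config object accepted by `pg`.

```typescript
// Sketch: SQL builders a migration script could use. Table and column names are
// interpolated as identifiers, so they must come from trusted constants only.
interface Query {
  text: string;
  values: unknown[];
}

/** Build one multi-row INSERT with $1..$n placeholders. */
function buildInsert(table: string, columns: string[], rows: unknown[][]): Query {
  const values: unknown[] = [];
  const tuples = rows.map((row) => {
    if (row.length !== columns.length) {
      throw new Error(`row has ${row.length} values, expected ${columns.length}`);
    }
    const placeholders = row.map((value) => {
      values.push(value);
      return `$${values.length}`;
    });
    return `(${placeholders.join(", ")})`;
  });
  const columnList = columns.map((c) => `"${c}"`).join(", ");
  return {
    text: `INSERT INTO "${table}" (${columnList}) VALUES ${tuples.join(", ")}`,
    values,
  };
}

/** After copying explicit ids, make the next generated id MAX(id) + 1. */
function buildSequenceReset(table: string, idColumn: string = "id"): string {
  return (
    `SELECT setval(pg_get_serial_sequence('${table}', '${idColumn}'), ` +
    `COALESCE((SELECT MAX("${idColumn}") FROM "${table}"), 0) + 1, false)`
  );
}

const insert = buildInsert("charts", ["id", "name"], [
  [1, "BTC-USD 1h"],
  [2, "ETH-USD 1h"],
]);
console.log(insert.text);
console.log(buildSequenceReset("charts"));
```

Running the insert and the sequence reset for each table inside a single transaction would let a failed migration roll back cleanly, leaving the SQLite file as the only source of truth until the copy succeeds.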

Open Questions

Should the ML service user (ml_user) have write access to frontend tables, or should we create a separate read-only role? (Recommendation: keep ml_user with full access for simplicity, revisit if the team grows.)
Do we need to preserve SQLite migration history in git for reference, or delete the drizzle/ folder contents entirely? (Recommendation: delete and start fresh.)
