candle-annotator/openspec/changes/ml-db-consolidation/design.md
Marko Djordjevic 5f70f13da3 feat: migrate from SQLite to PostgreSQL - complete schema and API updates
- Remove better-sqlite3, add pg driver
- Convert schema to PostgreSQL types (serial, timestamp, boolean, jsonb)
- Generate fresh PostgreSQL migrations
- Update database connection layer with pg.Pool
- Fix all API routes: remove JSON.parse/stringify, use native timestamps and booleans
- Update drizzle.config.ts and .env.example for PostgreSQL
2026-02-17 13:43:06 +01:00

Context

The candle annotator runs two databases:

  1. SQLite (data/candles.db) — serves the Next.js frontend via Drizzle ORM (better-sqlite3 driver). Contains 6 tables: charts, candles, annotations, annotation_types, span_annotations, span_label_types.
  2. PostgreSQL (postgres:5432/ml_db) — serves the Python ML service via SQLAlchemy. Contains 1 table: training_runs.

The ML service cannot directly query annotation/candle data. Data flows through CSV/JSON file exports. PostgreSQL already runs in Docker for the ML service, so consolidating means adding frontend tables there — not introducing a new service.

Goals / Non-Goals

Goals:

  • Single PostgreSQL instance for all application data
  • Drizzle ORM continues to manage frontend schema (just switches dialect)
  • ML service gains direct read access to candle/annotation tables
  • Simplified Docker setup (one fewer volume, one database to back up)
  • One-time data migration path from SQLite to PostgreSQL

Non-Goals:

  • Changing the ML service ORM (SQLAlchemy stays)
  • Merging Drizzle and SQLAlchemy migration systems (each manages its own tables)
  • Changing API route logic or query patterns beyond what's needed for the dialect switch
  • Multi-tenant or schema separation (all tables go in the public schema)
  • Migrating away from Drizzle ORM

Decisions

1. Drizzle PostgreSQL driver: drizzle-orm/node-postgres with pg

Choice: Use pg (node-postgres) as the driver.

Why: pg is the most mature PostgreSQL driver for Node.js. Drizzle supports it natively via drizzle-orm/node-postgres. The postgres (postgres.js) driver is also an option but pg has broader ecosystem support and is easier to debug.

Alternative considered: postgres (postgres.js) — lighter, promise-native, but less battle-tested with Drizzle migrations.

2. Shared database, single public schema

Choice: All tables (frontend + ML) live in the same database (ml_db) and the default public schema.

Why: The table sets don't overlap (frontend has charts/candles/annotations, ML has training_runs). Separate schemas add complexity with no benefit for 7 total tables. The ML service already connects to ml_db.

Alternative considered: Separate PostgreSQL schemas (app and ml) — cleaner isolation but adds schema-prefix complexity to queries and cross-schema references. Not worth it at this scale.

3. Rename database from ml_db to candle_annotator

Choice: Rename the PostgreSQL database to candle_annotator since it now serves the whole application, not just ML.

Why: ml_db is misleading when the database holds frontend data too. Renaming during consolidation is the natural time to do it.

Alternative considered: Keep ml_db — avoids a rename step but creates lasting confusion.

4. Fresh Drizzle migrations (drop SQLite migrations)

Choice: Delete all existing SQLite migrations in drizzle/, rewrite the schema file with pgTable equivalents, and run drizzle-kit generate to produce a fresh initial PostgreSQL migration.

Why: SQLite migrations are dialect-specific (e.g., integer for booleans, no native timestamps). Converting them one-by-one is fragile. A clean start from the PostgreSQL schema is simpler and produces idiomatic SQL.

Alternative considered: Manually converting each SQLite migration to PostgreSQL — error-prone and provides no benefit since there's no production data that needs incremental migration history.

5. Type mappings: SQLite → PostgreSQL

| SQLite type | PostgreSQL type | Notes |
|---|---|---|
| integer (PK, autoIncrement) | serial | Auto-incrementing integer |
| integer (timestamps) | timestamp | Use defaultNow() where applicable |
| integer (booleans like is_active) | boolean | True PostgreSQL booleans |
| real | doublePrecision | OHLC price data |
| text | text | No change |
| text (JSON strings) | jsonb | For geometry, sub_spans, model_prediction |
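
The mappings above can be applied value by value during the one-time data migration. The following is a minimal sketch, not the project's actual script. The helper names (convertValue, convertRow) and the created_at and close columns are hypothetical, and it assumes SQLite timestamps are stored as Unix epoch seconds, which this document does not state; if they are stored as milliseconds, the multiplication by 1000 must be dropped.

```typescript
// Column kinds that need conversion when moving a row from SQLite to PostgreSQL.
type ColumnKind = "boolean" | "timestamp" | "jsonb" | "passthrough";

type SqliteValue = number | string | null;
type PgValue = number | string | boolean | Date | object | null;

// Convert one SQLite value according to the target PostgreSQL column kind.
// ASSUMPTION: integer timestamps are Unix epoch seconds (not milliseconds).
function convertValue(value: SqliteValue, kind: ColumnKind): PgValue {
  if (value === null) return null; // NULL stays NULL for every column kind
  switch (kind) {
    case "boolean":
      return Number(value) !== 0; // SQLite 0/1 becomes false/true
    case "timestamp":
      return new Date(Number(value) * 1000); // epoch seconds to Date
    case "jsonb":
      return JSON.parse(String(value)); // JSON text to a structured value
    case "passthrough":
      return value; // serial, doublePrecision and text need no change
  }
}

// Convert a whole row; columns not listed in `kinds` pass through unchanged.
function convertRow(
  row: Record<string, SqliteValue>,
  kinds: Record<string, ColumnKind>,
): Record<string, PgValue> {
  const out: Record<string, PgValue> = {};
  for (const [column, value] of Object.entries(row)) {
    out[column] = convertValue(value, kinds[column] ?? "passthrough");
  }
  return out;
}

// Example with hypothetical column values.
const converted = convertRow(
  { id: 7, is_active: 1, created_at: 1700000000, geometry: '{"x":1}', close: 101.5 },
  { is_active: "boolean", created_at: "timestamp", geometry: "jsonb" },
);
```

Parsing the JSON text up front also surfaces malformed geometry or sub_spans values during the migration rather than later at query time.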

6. Connection management for Next.js

Choice: Use a connection pool via pg.Pool with max: 10 connections. Connection string from DATABASE_URL env var.

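Why: SQLite was a single file opened in-process, so no pooling was needed. With PostgreSQL, every connection is a network session with its own server process, so a pool lets concurrent API requests reuse a small set of connections instead of opening one per request. A cap of 10 connections is reasonable for the frontend workload.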
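
As a rough illustration of this decision, the pool settings can be derived from the environment in one place. The helper name buildPoolConfig and the example connection string (including its password) are hypothetical. The returned object uses the connectionString and max options that pg.Pool accepts.

```typescript
// Options object in the shape accepted by `new Pool(...)` from the `pg` package.
interface PoolSettings {
  connectionString: string;
  max: number;
}

// Hypothetical helper: read DATABASE_URL and cap the pool at 10 connections.
function buildPoolConfig(env: Record<string, string | undefined>): PoolSettings {
  const connectionString = env["DATABASE_URL"];
  if (!connectionString) {
    // Fail at startup rather than on the first query.
    throw new Error("DATABASE_URL is not set");
  }
  return { connectionString, max: 10 };
}

// Example with a placeholder URL; real credentials would come from the environment.
const exampleConfig = buildPoolConfig({
  DATABASE_URL: "postgres://ml_user:secret@postgres:5432/candle_annotator",
});

// In db/index.ts this would be used roughly as follows (not executed here):
//   import { Pool } from "pg";
//   import { drizzle } from "drizzle-orm/node-postgres";
//   const pool = new Pool(buildPoolConfig(process.env));
//   export const db = drizzle(pool);
```

Keeping the settings in a plain function makes the pool size easy to check without a running database.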

7. ML service direct access to frontend tables

Choice: The ML service reads frontend tables (candles, annotations, span_annotations) directly via SQLAlchemy using its existing connection. No new SQLAlchemy models needed — raw SQL queries or lightweight table reflections are sufficient for read-only access.

Why: The ML service only needs to read training data. Adding full SQLAlchemy models for tables owned by Drizzle creates a dual-ownership problem. Raw queries or Table reflections keep it simple.

Risks / Trade-offs

[Schema drift between Drizzle and SQLAlchemy] → Both ORMs manage tables in the same database. Drizzle owns frontend tables, SQLAlchemy owns ML tables. Neither should modify the other's tables. This is enforced by convention, not tooling.

[Connection pool exhaustion] → Adding the frontend's database traffic to the same PostgreSQL instance increases load. Mitigation: PostgreSQL 16 is built to serve many concurrent client connections. The pg.Pool max of 10 plus SQLAlchemy's pool of 5 totals 15 connections, well within PostgreSQL's default max_connections of 100.
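
The arithmetic behind this mitigation can be written out as a small check. The pool sizes and the limit come from the text above; the function name connectionHeadroom is hypothetical, and the sketch assumes each service runs as a single process so that each pool is counted once.

```typescript
// Pool sizes quoted in this document; PostgreSQL's default max_connections is 100.
const pools: Record<string, number> = { frontendPgPool: 10, mlSqlAlchemyPool: 5 };
const maxConnections = 100;

// Return how many connections remain after every pool is fully used.
// ASSUMPTION: one process per service, so each pool is counted exactly once.
function connectionHeadroom(poolSizes: Record<string, number>, limit: number): number {
  const used = Object.values(poolSizes).reduce((sum, size) => sum + size, 0);
  return limit - used;
}

// 100 - (10 + 5)
const headroom = connectionHeadroom(pools, maxConnections);
```

Two things can shrink this margin. SQLAlchemy's QueuePool has a max_overflow setting (default 10), so an ML service left on that default could briefly hold up to 15 connections on its own. And if the frontend ever runs as several processes, each one would open its own pool of up to 10.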

[Data loss during migration] → SQLite data must be migrated before switching. Mitigation: Write a migration script that exports SQLite data and imports to PostgreSQL. Run before deploying the new code. Keep the SQLite file as backup.

[Drizzle push/generate differences] → PostgreSQL dialect may generate slightly different migration SQL than expected. Mitigation: Review generated migrations before applying. Use drizzle-kit push for development, drizzle-kit generate + drizzle-kit migrate for production.

[Boolean conversion] → SQLite uses 0/1 for booleans, PostgreSQL uses true/false. Mitigation: The migration script handles conversion. Drizzle's boolean() type handles this transparently at the ORM level going forward.

Migration Plan

  1. Update schema and dependencies: rewrite the Drizzle schema for PostgreSQL, swap npm packages
  2. Generate fresh migrations: run drizzle-kit generate from the new PostgreSQL schema
  3. Update docker-compose.yml: rename the database, add a frontend dependency on postgres, remove the candle-data volume
  4. Update environment variables: set DATABASE_URL for the frontend service
  5. Write data migration script: scripts/migrate-sqlite-to-postgres.ts, which reads SQLite and inserts into PostgreSQL with type conversions
  6. Update db/index.ts: switch from better-sqlite3 to a pg pool, update the migration runner
  7. Test locally: run migrations, migrate data, verify that API routes work
  8. Deploy: stop current services, run PostgreSQL migrations, run the data migration, deploy new code
  9. Rollback: if issues arise, revert docker-compose and code, and restore the SQLite volume. The SQLite file is kept as backup for 1 week post-migration.
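
Step 5 could build its inserts along these lines. This is a sketch under stated assumptions, not the contents of scripts/migrate-sqlite-to-postgres.ts: the helper name buildInsert and the sample candle columns are hypothetical. The $1, $2, ... placeholders are PostgreSQL's positional parameter syntax, and the { text, values } shape matches the query config object that pg accepts.

```typescript
// A parameterized statement in the { text, values } shape accepted by pg's query().
interface InsertStatement {
  text: string;
  values: unknown[];
}

// Double-quote an identifier, escaping any embedded double quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build a parameterized INSERT for one already-converted row.
// Values travel separately from the SQL text, so no value escaping is needed.
function buildInsert(table: string, row: Record<string, unknown>): InsertStatement {
  const columns = Object.keys(row);
  if (columns.length === 0) {
    throw new Error(`No columns to insert into ${table}`);
  }
  const columnList = columns.map(quoteIdent).join(", ");
  const placeholders = columns.map((_, i) => `$${i + 1}`).join(", ");
  return {
    text: `INSERT INTO ${quoteIdent(table)} (${columnList}) VALUES (${placeholders})`,
    values: columns.map((c) => row[c]),
  };
}

// Example with hypothetical candle columns.
const statement = buildInsert("candles", { id: 1, open: 100.5, close: 101.25 });

// The script would then run something like: await pool.query(statement);
```

Because the script inserts explicit id values into serial columns, the backing sequences are not advanced by the import. Resetting each sequence to the highest imported id afterwards (for example with setval) avoids duplicate-key errors on the first new insert.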

Open Questions

  • Should the ML service user (ml_user) have write access to frontend tables, or should we create a separate read-only role? (Recommendation: keep ml_user with full access for simplicity, revisit if the team grows.)
  • Do we need to preserve SQLite migration history in git for reference, or delete the drizzle/ folder contents entirely? (Recommendation: delete and start fresh.)