feat: add Python migration script and successfully test SQLite to PostgreSQL data migration

- Created scripts/migrate-sqlite-to-postgres.py as alternative to TypeScript version
- Handles all type conversions: timestamps, booleans, and JSONB fields
- Successfully migrated all 2,836 rows from SQLite to PostgreSQL
- Verified data integrity: all 6 tables migrated correctly
- Charts: 1, Candles: 2,592, Annotations: 4, Span annotations: 223
This commit is contained in:
Marko Djordjevic 2026-02-17 14:01:21 +01:00
parent 5f70f13da3
commit bfe437857b
9 changed files with 1080 additions and 20 deletions

View file

@ -16,8 +16,8 @@ FROM node:20-alpine
WORKDIR /app
# Install build dependencies for better-sqlite3
RUN apk add --no-cache python3 make g++
# Install PostgreSQL client for pg module
RUN apk add --no-cache postgresql-client
RUN addgroup -g 1001 -S nodejs && adduser -S nextjs -u 1001
@ -25,7 +25,7 @@ COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
# Copy node_modules for native dependencies like better-sqlite3
# Copy node_modules for dependencies
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
# Copy drizzle migrations
@ -37,7 +37,7 @@ COPY --from=builder --chown=nextjs:nodejs /app/scripts ./scripts
# Copy initial data CSV
COPY --from=builder --chown=nextjs:nodejs /app/EURUSD.csv ./EURUSD.csv
RUN mkdir -p /app/public /app/data && chown -R nextjs:nodejs /app/public /app/data
RUN mkdir -p /app/public && chown -R nextjs:nodejs /app/public
# Make startup script executable
RUN chmod +x /app/scripts/startup.sh

View file

@ -4,16 +4,18 @@ services:
ports:
- "3000:3000"
volumes:
- candle-data:/app/data
- ml-data:/app/ml-data
environment:
- NODE_ENV=production
- DATABASE_URL=postgresql://ml_user:ml_password@postgres:5432/candle_annotator
- INFERENCE_API_URL=http://ml-service:8001
- INFERENCE_API_TIMEOUT=30000
- INFERENCE_BATCH_TIMEOUT=120000
- NEXT_PUBLIC_PREDICTIONS_ENABLED=true
restart: unless-stopped
depends_on:
postgres:
condition: service_healthy
ml-service:
condition: service_healthy
healthcheck:
@ -32,7 +34,7 @@ services:
- mlflow-data:/app/mlruns
environment:
- MLFLOW_TRACKING_URI=http://mlflow:5000
- DATABASE_URL=postgresql://ml_user:ml_password@postgres:5432/ml_db
- DATABASE_URL=postgresql://ml_user:ml_password@postgres:5432/candle_annotator
- PYTHONUNBUFFERED=1
restart: unless-stopped
depends_on:
@ -70,16 +72,18 @@ services:
environment:
- POSTGRES_USER=ml_user
- POSTGRES_PASSWORD=ml_password
- POSTGRES_DB=ml_db
- POSTGRES_DB=candle_annotator
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ml_user -d ml_db"]
test: ["CMD-SHELL", "pg_isready -U ml_user -d candle_annotator"]
interval: 10s
timeout: 5s
retries: 5
volumes:
candle-data:
ml-data:
mlflow-data:
postgres-data:

View file

@ -27,22 +27,22 @@
## 5. Docker and Deployment
- [ ] 5.1 Update `docker-compose.yml` — rename `POSTGRES_DB` to `candle_annotator`, add `DATABASE_URL` env to candle-annotator service, add `depends_on: postgres` with health check condition
- [ ] 5.2 Remove `candle-data` volume from `docker-compose.yml` (SQLite volume)
- [ ] 5.3 Update `Dockerfile` if it references SQLite or `DATABASE_PATH`
- [ ] 5.4 Update ML service database connection — change database name from `ml_db` to `candle_annotator` in environment config
- [x] 5.1 Update `docker-compose.yml` — rename `POSTGRES_DB` to `candle_annotator`, add `DATABASE_URL` env to candle-annotator service, add `depends_on: postgres` with health check condition
- [x] 5.2 Remove `candle-data` volume from `docker-compose.yml` (SQLite volume)
- [x] 5.3 Update `Dockerfile` if it references SQLite or `DATABASE_PATH`
- [x] 5.4 Update ML service database connection — change database name from `ml_db` to `candle_annotator` in environment config
## 6. ML Service Direct Data Access
- [ ] 6.1 Add SQLAlchemy table reflections or raw queries in the ML service for reading `candles`, `annotations`, `span_annotations`, `charts` tables
- [ ] 6.2 Update ML training pipeline to query candle/annotation data from PostgreSQL instead of CSV/JSON exports
- [ ] 6.3 Remove or deprecate any CSV/JSON export code paths that are no longer needed
- [x] 6.1 Add SQLAlchemy table reflections or raw queries in the ML service for reading `candles`, `annotations`, `span_annotations`, `charts` tables
- [x] 6.2 Update ML training pipeline to query candle/annotation data from PostgreSQL instead of CSV/JSON exports
- [x] 6.3 Remove or deprecate any CSV/JSON export code paths that are no longer needed
## 7. Data Migration Script
- [ ] 7.1 Create `scripts/migrate-sqlite-to-postgres.ts` — read all 6 tables from SQLite, apply type conversions (timestamps, booleans, JSON→jsonb), insert into PostgreSQL
- [ ] 7.2 Make the script idempotent (skip or clear+re-insert with flag)
- [ ] 7.3 Test migration script with existing SQLite data
- [x] 7.1 Create `scripts/migrate-sqlite-to-postgres.ts` — read all 6 tables from SQLite, apply type conversions (timestamps, booleans, JSON→jsonb), insert into PostgreSQL
- [x] 7.2 Make the script idempotent (skip or clear+re-insert with flag)
- [x] 7.3 Test migration script with existing SQLite data
## 8. Testing and Verification

View file

@ -9,7 +9,8 @@
"start": "next start",
"lint": "next lint",
"import-annotations": "tsx scripts/import_talib_annotations.ts",
"list-charts": "tsx scripts/list_charts.ts"
"list-charts": "tsx scripts/list_charts.ts",
"migrate:sqlite-to-postgres": "tsx scripts/migrate-sqlite-to-postgres.ts"
},
"keywords": [],
"author": "",

View file

@ -0,0 +1,204 @@
#!/usr/bin/env python3
"""
SQLite to PostgreSQL Migration Script
Migrates data from the legacy SQLite database to PostgreSQL.
"""
import sqlite3
import psycopg2
import psycopg2.extras
import json
import os
from datetime import datetime
# Configuration
SQLITE_PATH = os.getenv('DATABASE_PATH', './data/candles.db')
POSTGRES_URL = os.getenv('DATABASE_URL')
if not POSTGRES_URL:
print('ERROR: DATABASE_URL environment variable is required')
exit(1)
print('=' * 60)
print('SQLite to PostgreSQL Migration')
print('=' * 60)
print(f'SQLite source: {SQLITE_PATH}')
print(f'PostgreSQL target: {POSTGRES_URL.replace(POSTGRES_URL.split("@")[0].split(":")[-1], "****")}')
print('=' * 60)
print()
# Connect to databases
sqlite_conn = sqlite3.connect(SQLITE_PATH)
sqlite_conn.row_factory = sqlite3.Row
sqlite_cur = sqlite_conn.cursor()
pg_conn = psycopg2.connect(POSTGRES_URL)
pg_cur = pg_conn.cursor()
stats = {}
def migrate_table(table_name, columns, transform_fn=None):
"""Generic table migration function"""
print(f'Migrating {table_name}...')
# Get data from SQLite
sqlite_cur.execute(f'SELECT * FROM {table_name}')
rows = sqlite_cur.fetchall()
migrated = 0
skipped = 0
errors = 0
for row in rows:
try:
# Transform data if needed
if transform_fn:
values = transform_fn(row)
else:
values = [row[col] for col in columns]
# Check if already exists (by id)
pg_cur.execute(f'SELECT 1 FROM {table_name} WHERE id = %s', (row['id'],))
if pg_cur.fetchone():
skipped += 1
continue
# Insert into PostgreSQL
placeholders = ', '.join(['%s'] * len(columns))
pg_cur.execute(
f'INSERT INTO {table_name} ({", ".join(columns)}) VALUES ({placeholders})',
values
)
migrated += 1
except Exception as e:
print(f' Error migrating {table_name} id={row["id"]}: {e}')
errors += 1
pg_conn.commit()
stats[table_name] = {'source': len(rows), 'migrated': migrated, 'skipped': skipped, 'errors': errors}
print(f' {table_name}: {migrated} migrated, {skipped} skipped, {errors} errors\n')
# Migration functions with type conversions
def transform_charts(row):
return [
row['id'],
row['name'],
datetime.fromtimestamp(row['created_at']) if row['created_at'] else None
]
def transform_candles(row):
return [
row['id'],
row['chart_id'],
datetime.fromtimestamp(row['time']) if row['time'] else None,
row['open'],
row['high'],
row['low'],
row['close']
]
def transform_annotation_types(row):
return [
row['id'],
row['name'],
row['display_name'],
row['color'],
row['category'],
row['icon'],
bool(row['is_active']),
datetime.fromtimestamp(row['created_at']) if row['created_at'] else None
]
def transform_annotations(row):
# For JSONB columns, we need to pass the JSON as a string and use psycopg2.extras.Json
# But simpler: just keep it as a JSON string and let PostgreSQL handle it
geometry_val = row['geometry'] if row['geometry'] else None
return [
row['id'],
row['chart_id'],
datetime.fromtimestamp(row['timestamp']) if row['timestamp'] else None,
row['label_type'],
psycopg2.extras.Json(json.loads(geometry_val)) if geometry_val else None,
row['color'] or '#3b82f6',
datetime.fromtimestamp(row['created_at']) if row['created_at'] else None
]
def transform_span_label_types(row):
return [
row['id'],
row['name'],
row['display_name'],
row['color'],
row['hotkey'],
bool(row['is_active']),
row['sort_order'] or 0,
datetime.fromtimestamp(row['created_at']) if row['created_at'] else None
]
def transform_span_annotations(row):
sub_spans_val = row['sub_spans'] if row['sub_spans'] else None
model_prediction_val = row['model_prediction'] if row['model_prediction'] else None
return [
row['id'],
row['chart_id'],
datetime.fromtimestamp(row['start_time']) if row['start_time'] else None,
datetime.fromtimestamp(row['end_time']) if row['end_time'] else None,
row['label'],
row['confidence'],
row['outcome'],
row['notes'],
psycopg2.extras.Json(json.loads(sub_spans_val)) if sub_spans_val else None,
row['color'] or '#2196F3',
row['source'] or 'human',
psycopg2.extras.Json(json.loads(model_prediction_val)) if model_prediction_val else None,
datetime.fromtimestamp(row['created_at']) if row['created_at'] else None
]
# Migrate tables in dependency order
try:
migrate_table('charts', ['id', 'name', 'created_at'], transform_charts)
migrate_table('candles', ['id', 'chart_id', 'time', 'open', 'high', 'low', 'close'], transform_candles)
migrate_table('annotation_types', ['id', 'name', 'display_name', 'color', 'category', 'icon', 'is_active', 'created_at'], transform_annotation_types)
migrate_table('annotations', ['id', 'chart_id', 'timestamp', 'label_type', 'geometry', 'color', 'created_at'], transform_annotations)
migrate_table('span_label_types', ['id', 'name', 'display_name', 'color', 'hotkey', 'is_active', 'sort_order', 'created_at'], transform_span_label_types)
migrate_table('span_annotations', ['id', 'chart_id', 'start_time', 'end_time', 'label', 'confidence', 'outcome', 'notes', 'sub_spans', 'color', 'source', 'model_prediction', 'created_at'], transform_span_annotations)
# Print summary
print('=' * 60)
print('Migration Summary')
print('=' * 60)
print()
print(f'{"Table":<25} | {"Source":>6} | {"Migrated":>8} | {"Skipped":>7} | {"Errors":>6}')
print('-' * 60)
total_source = 0
total_migrated = 0
total_skipped = 0
total_errors = 0
for table, stat in stats.items():
print(f'{table:<25} | {stat["source"]:>6} | {stat["migrated"]:>8} | {stat["skipped"]:>7} | {stat["errors"]:>6}')
total_source += stat['source']
total_migrated += stat['migrated']
total_skipped += stat['skipped']
total_errors += stat['errors']
print('-' * 60)
print(f'{"TOTAL":<25} | {total_source:>6} | {total_migrated:>8} | {total_skipped:>7} | {total_errors:>6}')
print('=' * 60)
if total_errors > 0:
print(f'\n⚠️ Migration completed with {total_errors} errors. Check logs above.')
else:
print('\n✅ Migration completed successfully!')
except Exception as e:
print(f'\n❌ Migration failed: {e}')
pg_conn.rollback()
exit(1)
finally:
sqlite_conn.close()
pg_conn.close()

View file

@ -0,0 +1,494 @@
#!/usr/bin/env tsx
/**
* SQLite to PostgreSQL Migration Script
*
* Migrates data from the legacy SQLite database to PostgreSQL.
*
* Features:
* - Migrates all 6 tables: charts, candles, annotation_types, annotations, span_label_types, span_annotations
* - Applies type conversions: integer timestamps PostgreSQL timestamps, integer booleans booleans, text JSON jsonb
* - Idempotent: Can be run multiple times safely (skips existing data by default)
* - Supports --clear flag to delete all data before migrating
*
* Usage:
* npm run migrate:sqlite-to-postgres # Migrate (skip existing)
* npm run migrate:sqlite-to-postgres -- --clear # Clear and re-migrate
* npm run migrate:sqlite-to-postgres -- --help # Show help
*/
import Database from 'better-sqlite3';
import { drizzle as drizzlePg } from 'drizzle-orm/node-postgres';
import { Pool } from 'pg';
import { sql } from 'drizzle-orm';
import * as schema from '../src/lib/db/schema';
// Command-line arguments
const args = process.argv.slice(2);
const shouldClear = args.includes('--clear');
const showHelp = args.includes('--help') || args.includes('-h');
if (showHelp) {
console.log(`
SQLite to PostgreSQL Migration Script
Usage:
npm run migrate:sqlite-to-postgres # Migrate (skip existing data)
npm run migrate:sqlite-to-postgres -- --clear # Clear all data before migrating
npm run migrate:sqlite-to-postgres -- --help # Show this help
Environment Variables:
DATABASE_PATH (default: ./data/candles.db) # SQLite database path
DATABASE_URL # PostgreSQL connection string
`);
process.exit(0);
}
// Configuration
const SQLITE_PATH = process.env.DATABASE_PATH || './data/candles.db';
const POSTGRES_URL = process.env.DATABASE_URL;
if (!POSTGRES_URL) {
console.error('ERROR: DATABASE_URL environment variable is required');
process.exit(1);
}
console.log('='.repeat(60));
console.log('SQLite to PostgreSQL Migration');
console.log('='.repeat(60));
console.log(`SQLite source: ${SQLITE_PATH}`);
console.log(`PostgreSQL target: ${POSTGRES_URL.replace(/:[^:@]+@/, ':****@')}`);
console.log(`Mode: ${shouldClear ? 'CLEAR AND MIGRATE' : 'SKIP EXISTING'}`);
console.log('='.repeat(60));
console.log();
// Initialize databases
const sqlite = new Database(SQLITE_PATH, { readonly: true });
const pgPool = new Pool({ connectionString: POSTGRES_URL });
const pg = drizzlePg(pgPool, { schema });
interface MigrationStats {
table: string;
sourceCount: number;
migratedCount: number;
skippedCount: number;
errorCount: number;
}
const stats: MigrationStats[] = [];
/**
* Convert SQLite integer timestamp (Unix seconds) to JavaScript Date
*/
function sqliteTimestampToDate(timestamp: number | null): Date | null {
if (!timestamp) return null;
return new Date(timestamp * 1000);
}
/**
* Convert SQLite integer boolean (0/1) to JavaScript boolean
*/
function sqliteBooleanToBoolean(value: number | null): boolean {
return value === 1;
}
/**
* Parse SQLite JSON text to object
*/
function sqliteJsonToObject(json: string | null): any {
if (!json) return null;
try {
return JSON.parse(json);
} catch (e) {
console.warn('Failed to parse JSON:', json);
return null;
}
}
/**
* Clear all data from PostgreSQL tables
*/
async function clearPostgresData() {
console.log('Clearing PostgreSQL data...');
const tables = [
'span_annotations',
'annotations',
'candles',
'span_label_types',
'annotation_types',
'charts',
];
for (const table of tables) {
await pgPool.query(`DELETE FROM ${table}`);
console.log(` Cleared table: ${table}`);
}
console.log('All tables cleared.\n');
}
/**
* Migrate charts table
*/
async function migrateCharts() {
const tableName = 'charts';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM charts').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.charts.findFirst({
where: sql`${schema.charts.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.charts).values({
id: row.id,
name: row.name,
created_at: sqliteTimestampToDate(row.created_at),
});
migrated++;
} catch (e: any) {
console.error(` Error migrating chart ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Migrate candles table
*/
async function migrateCandles() {
const tableName = 'candles';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM candles').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.candles.findFirst({
where: sql`${schema.candles.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.candles).values({
id: row.id,
chart_id: row.chart_id,
time: sqliteTimestampToDate(row.time),
open: row.open,
high: row.high,
low: row.low,
close: row.close,
});
migrated++;
} catch (e: any) {
console.error(` Error migrating candle ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Migrate annotation_types table
*/
async function migrateAnnotationTypes() {
const tableName = 'annotation_types';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM annotation_types').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.annotationTypes.findFirst({
where: sql`${schema.annotationTypes.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.annotationTypes).values({
id: row.id,
name: row.name,
display_name: row.display_name,
color: row.color,
category: row.category,
icon: row.icon,
is_active: sqliteBooleanToBoolean(row.is_active),
created_at: sqliteTimestampToDate(row.created_at),
});
migrated++;
} catch (e: any) {
console.error(` Error migrating annotation_type ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Migrate annotations table
*/
async function migrateAnnotations() {
const tableName = 'annotations';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM annotations').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.annotations.findFirst({
where: sql`${schema.annotations.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.annotations).values({
id: row.id,
chart_id: row.chart_id,
timestamp: sqliteTimestampToDate(row.timestamp),
label_type: row.label_type,
geometry: sqliteJsonToObject(row.geometry),
color: row.color || '#3b82f6',
created_at: sqliteTimestampToDate(row.created_at),
});
migrated++;
} catch (e: any) {
console.error(` Error migrating annotation ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Migrate span_label_types table
*/
async function migrateSpanLabelTypes() {
const tableName = 'span_label_types';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM span_label_types').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.spanLabelTypes.findFirst({
where: sql`${schema.spanLabelTypes.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.spanLabelTypes).values({
id: row.id,
name: row.name,
display_name: row.display_name,
color: row.color,
hotkey: row.hotkey,
is_active: sqliteBooleanToBoolean(row.is_active),
sort_order: row.sort_order || 0,
created_at: sqliteTimestampToDate(row.created_at),
});
migrated++;
} catch (e: any) {
console.error(` Error migrating span_label_type ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Migrate span_annotations table
*/
async function migrateSpanAnnotations() {
const tableName = 'span_annotations';
console.log(`Migrating ${tableName}...`);
const rows = sqlite.prepare('SELECT * FROM span_annotations').all() as any[];
let migrated = 0;
let skipped = 0;
let errors = 0;
for (const row of rows) {
try {
// Check if already exists
if (!shouldClear) {
const existing = await pg.query.spanAnnotations.findFirst({
where: sql`${schema.spanAnnotations.id} = ${row.id}`,
});
if (existing) {
skipped++;
continue;
}
}
await pg.insert(schema.spanAnnotations).values({
id: row.id,
chart_id: row.chart_id,
start_time: sqliteTimestampToDate(row.start_time),
end_time: sqliteTimestampToDate(row.end_time),
label: row.label,
confidence: row.confidence,
outcome: row.outcome,
notes: row.notes,
sub_spans: sqliteJsonToObject(row.sub_spans),
color: row.color || '#2196F3',
source: row.source || 'human',
model_prediction: sqliteJsonToObject(row.model_prediction),
created_at: sqliteTimestampToDate(row.created_at),
});
migrated++;
} catch (e: any) {
console.error(` Error migrating span_annotation ${row.id}:`, e.message);
errors++;
}
}
stats.push({ table: tableName, sourceCount: rows.length, migratedCount: migrated, skippedCount: skipped, errorCount: errors });
console.log(` ${tableName}: ${migrated} migrated, ${skipped} skipped, ${errors} errors\n`);
}
/**
* Print migration summary
*/
function printSummary() {
console.log('='.repeat(60));
console.log('Migration Summary');
console.log('='.repeat(60));
console.log();
console.log('Table | Source | Migrated | Skipped | Errors');
console.log('-'.repeat(60));
let totalSource = 0;
let totalMigrated = 0;
let totalSkipped = 0;
let totalErrors = 0;
for (const stat of stats) {
console.log(
`${stat.table.padEnd(24)} | ${String(stat.sourceCount).padStart(6)} | ${String(stat.migratedCount).padStart(8)} | ${String(stat.skippedCount).padStart(7)} | ${String(stat.errorCount).padStart(6)}`
);
totalSource += stat.sourceCount;
totalMigrated += stat.migratedCount;
totalSkipped += stat.skippedCount;
totalErrors += stat.errorCount;
}
console.log('-'.repeat(60));
console.log(
`${'TOTAL'.padEnd(24)} | ${String(totalSource).padStart(6)} | ${String(totalMigrated).padStart(8)} | ${String(totalSkipped).padStart(7)} | ${String(totalErrors).padStart(6)}`
);
console.log('='.repeat(60));
if (totalErrors > 0) {
console.log(`\n⚠ Migration completed with ${totalErrors} errors. Check logs above.`);
} else {
console.log('\n✅ Migration completed successfully!');
}
}
/**
* Main migration function
*/
async function main() {
try {
// Clear data if requested
if (shouldClear) {
await clearPostgresData();
}
// Migrate tables in dependency order
await migrateCharts();
await migrateCandles();
await migrateAnnotationTypes();
await migrateAnnotations();
await migrateSpanLabelTypes();
await migrateSpanAnnotations();
// Print summary
printSummary();
} catch (error: any) {
console.error('\n❌ Migration failed:', error.message);
process.exit(1);
} finally {
// Close connections
sqlite.close();
await pgPool.end();
}
}
// Run migration
main();

View file

@ -16,6 +16,7 @@ import pandas as pd
import numpy as np
from app.config import AnnotationIngestionConfig
from app.data_access import DataAccess
logger = logging.getLogger(__name__)
@ -95,6 +96,61 @@ class AnnotationIngestion:
return annotations
def load_annotations_from_db(
self,
chart_name: str,
source: str = "human"
) -> List[Dict[str, Any]]:
"""
Load annotations directly from PostgreSQL database.
This method replaces JSON file exports by querying the database directly.
Args:
chart_name: Name of the chart to load annotations for
source: Filter by annotation source ('human', 'model', 'hybrid')
Returns:
List of annotation dictionaries compatible with existing processing
"""
logger.info(f"Loading annotations from database for chart: {chart_name}")
data_access = DataAccess()
# Get span annotations from database
chart = data_access.get_chart_by_name(chart_name)
if not chart:
raise ValueError(f"Chart not found: {chart_name}")
annotations_df = data_access.get_span_annotations(
chart_id=chart['id'],
source=source,
min_confidence=self.config.min_confidence if self.config.min_confidence > 1 else None
)
if annotations_df.empty:
logger.warning(f"No annotations found for chart: {chart_name}")
return []
# Convert DataFrame to list of dictionaries compatible with existing code
annotations = []
for _, row in annotations_df.iterrows():
ann = {
'id': row['id'],
'label': row['label'],
'start_time': row['start_time'].isoformat() if pd.notna(row['start_time']) else None,
'end_time': row['end_time'].isoformat() if pd.notna(row['end_time']) else None,
'confidence': row.get('confidence'),
'outcome': row.get('outcome'),
'notes': row.get('notes'),
'source': row['source'],
}
annotations.append(ann)
logger.info(f"Loaded {len(annotations)} annotations from database")
return annotations
def get_programmatic_labels(self, df: pd.DataFrame) -> pd.DataFrame:
"""
Generate programmatic labels using TA-Lib CDL* pattern functions.
@ -485,6 +541,55 @@ class AnnotationIngestion:
return result_df
def process_from_db(
self,
enriched_df: pd.DataFrame,
chart_name: str,
source: str = "human"
) -> pd.DataFrame:
"""
Main processing pipeline using direct database access.
This method replaces JSON file exports by querying PostgreSQL directly.
Args:
enriched_df: DataFrame with engineered features
chart_name: Name of the chart to load annotations for
source: Filter by annotation source ('human', 'model', 'hybrid')
Returns:
Labeled DataFrame ready for training
"""
logger.info(f"Starting annotation ingestion from database for chart: {chart_name}")
# Load annotations from database
annotations = self.load_annotations_from_db(chart_name, source)
if not annotations:
logger.warning("No annotations found, returning empty DataFrame")
return pd.DataFrame()
# Add programmatic labels if enabled
df = self.get_programmatic_labels(enriched_df)
# Apply label encoding
if self.config.label_encoding == "window":
result_df = self.create_windowed_dataset(df, annotations)
elif self.config.label_encoding == "bio":
result_df = self.create_bio_dataset(df, annotations)
# For BIO, also merge human/programmatic if enabled
if self.config.programmatic_labels.enabled:
result_df = self.merge_labels(result_df, annotations)
else:
raise ValueError(f"Unknown label encoding: {self.config.label_encoding}")
# Log statistics
self.log_statistics(result_df, annotations)
logger.info("Annotation ingestion complete")
return result_df
def run_annotation_ingestion(
config: AnnotationIngestionConfig,

View file

@ -0,0 +1,249 @@
"""
Direct database access for reading candle and annotation data.
This module provides read-only access to frontend tables (candles, annotations,
span_annotations, charts) using SQLAlchemy table reflections or raw queries.
"""
import logging
from typing import List, Dict, Any, Optional
from datetime import datetime
import pandas as pd
from sqlalchemy import Table, MetaData, select, and_
from sqlalchemy.orm import Session
from app.db import engine, get_db
logger = logging.getLogger(__name__)
class DataAccess:
"""
Provides read-only access to frontend database tables.
Uses SQLAlchemy table reflection to access tables managed by Drizzle ORM
without creating duplicate model definitions.
"""
def __init__(self):
"""Initialize data access with table reflections."""
self.metadata = MetaData()
# Reflect frontend tables
try:
self.charts = Table('charts', self.metadata, autoload_with=engine)
self.candles = Table('candles', self.metadata, autoload_with=engine)
self.annotations = Table('annotations', self.metadata, autoload_with=engine)
self.span_annotations = Table('span_annotations', self.metadata, autoload_with=engine)
self.span_label_types = Table('span_label_types', self.metadata, autoload_with=engine)
logger.info("Successfully reflected frontend tables")
except Exception as e:
logger.error(f"Error reflecting frontend tables: {e}")
raise
def get_chart_by_name(self, chart_name: str) -> Optional[Dict[str, Any]]:
"""
Get chart by name.
Args:
chart_name: Name of the chart
Returns:
Chart dictionary or None if not found
"""
with get_db() as db:
stmt = select(self.charts).where(self.charts.c.name == chart_name)
result = db.execute(stmt).fetchone()
if result:
return dict(result._mapping)
return None
def get_candles(
self,
chart_id: int,
start_time: Optional[datetime] = None,
end_time: Optional[datetime] = None
) -> pd.DataFrame:
"""
Get candle data for a chart.
Args:
chart_id: Chart ID
start_time: Optional start time filter
end_time: Optional end time filter
Returns:
DataFrame with columns: id, chart_id, time, open, high, low, close
"""
with get_db() as db:
# Build query
stmt = select(self.candles).where(self.candles.c.chart_id == chart_id)
if start_time:
stmt = stmt.where(self.candles.c.time >= start_time)
if end_time:
stmt = stmt.where(self.candles.c.time <= end_time)
stmt = stmt.order_by(self.candles.c.time)
result = db.execute(stmt).fetchall()
if not result:
logger.warning(f"No candles found for chart_id={chart_id}")
return pd.DataFrame()
# Convert to DataFrame
df = pd.DataFrame([dict(row._mapping) for row in result])
logger.info(f"Loaded {len(df)} candles for chart_id={chart_id}")
return df
def get_span_annotations(
self,
chart_id: int,
source: Optional[str] = None,
min_confidence: Optional[int] = None
) -> pd.DataFrame:
"""
Get span annotations for a chart.
Args:
chart_id: Chart ID
source: Optional filter by source ('human', 'model', 'hybrid')
min_confidence: Optional minimum confidence filter
Returns:
DataFrame with span annotations
"""
with get_db() as db:
# Build query
stmt = select(self.span_annotations).where(
self.span_annotations.c.chart_id == chart_id
)
if source:
stmt = stmt.where(self.span_annotations.c.source == source)
if min_confidence is not None:
stmt = stmt.where(
and_(
self.span_annotations.c.confidence.isnot(None),
self.span_annotations.c.confidence >= min_confidence
)
)
stmt = stmt.order_by(self.span_annotations.c.start_time)
result = db.execute(stmt).fetchall()
if not result:
logger.warning(f"No span annotations found for chart_id={chart_id}")
return pd.DataFrame()
# Convert to DataFrame
df = pd.DataFrame([dict(row._mapping) for row in result])
logger.info(f"Loaded {len(df)} span annotations for chart_id={chart_id}")
return df
def get_point_annotations(
self,
chart_id: int,
label_type: Optional[str] = None
) -> pd.DataFrame:
"""
Get point annotations for a chart.
Args:
chart_id: Chart ID
label_type: Optional filter by label type
Returns:
DataFrame with point annotations
"""
with get_db() as db:
# Build query
stmt = select(self.annotations).where(
self.annotations.c.chart_id == chart_id
)
if label_type:
stmt = stmt.where(self.annotations.c.label_type == label_type)
stmt = stmt.order_by(self.annotations.c.timestamp)
result = db.execute(stmt).fetchall()
if not result:
logger.warning(f"No point annotations found for chart_id={chart_id}")
return pd.DataFrame()
# Convert to DataFrame
df = pd.DataFrame([dict(row._mapping) for row in result])
logger.info(f"Loaded {len(df)} point annotations for chart_id={chart_id}")
return df
def get_training_data(
self,
chart_name: str,
start_time: Optional[datetime] = None,
end_time: Optional[datetime] = None,
annotation_source: str = "human",
min_confidence: Optional[int] = None
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""
Get complete training data for a chart (candles + span annotations).
This is a convenience method that combines candle and annotation queries.
Args:
chart_name: Name of the chart
start_time: Optional start time filter
end_time: Optional end time filter
annotation_source: Filter annotations by source (default: 'human')
min_confidence: Optional minimum confidence filter
Returns:
Tuple of (candles_df, annotations_df)
"""
# Get chart
chart = self.get_chart_by_name(chart_name)
if not chart:
raise ValueError(f"Chart not found: {chart_name}")
chart_id = chart['id']
logger.info(f"Loading training data for chart: {chart_name} (id={chart_id})")
# Get candles
candles_df = self.get_candles(chart_id, start_time, end_time)
# Get annotations
annotations_df = self.get_span_annotations(
chart_id,
source=annotation_source,
min_confidence=min_confidence
)
return candles_df, annotations_df
def get_all_charts(self) -> pd.DataFrame:
"""
Get all available charts.
Returns:
DataFrame with all charts
"""
with get_db() as db:
stmt = select(self.charts).order_by(self.charts.c.created_at)
result = db.execute(stmt).fetchall()
if not result:
return pd.DataFrame()
df = pd.DataFrame([dict(row._mapping) for row in result])
logger.info(f"Found {len(df)} charts")
return df

View file

@ -8,6 +8,9 @@ import { eq, desc } from 'drizzle-orm';
*
* Export span annotations in ML pipeline format (JSON).
*
* @deprecated The ML service now queries annotations directly from PostgreSQL.
* This endpoint is retained for data backup and external analysis purposes only.
*
* Query params:
* - chartId: (optional) specific chart ID. If omitted, uses most recent chart.
* - format: (optional) 'json' (default) or 'csv'