candle-annotator/openspec/changes/archive/2026-02-20-code-review-fix/design.md
Marko Djordjevic 925e7284e3 Archive code-review-fix change and sync specs to main
- Synced 14 capability delta specs to main specs
- Created 6 new main specs: api-authentication, error-boundary, input-validation, security-headers, shared-types
- Updated 8 existing specs with security, validation, and performance requirements
- Archived change to openspec/changes/archive/2026-02-20-code-review-fix/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 08:54:59 +01:00

8.1 KiB

Context

The Candle Annotator is a full-stack application (Next.js frontend + FastAPI ML service + PostgreSQL) for annotating candlestick chart data and training ML models. A code review found 93 issues (13 Critical, 35 Major, 45 Minor). The app currently runs in Docker Compose with no authentication, hardcoded credentials, exposed ports, and multiple frontend stability problems. Zero tests exist.

The application is used internally for ML development — it is not yet public-facing, but the security posture must be production-ready before any deployment beyond localhost.

Goals / Non-Goals

Goals:

  • Eliminate all 13 critical issues (security + frontend stability)
  • Fix all 35 major issues (API hardening, code quality, infrastructure)
  • Fix all 45 minor issues (polish, accessibility, deprecated APIs)
  • Establish a type-safe foundation with centralized shared types
  • Make the Docker deployment secure by default (localhost bindings, non-root, no leaked creds)
  • Add basic input validation on all API boundaries

Non-Goals:

  • Full test suite (tracked separately — this change fixes code, not adds tests)
  • Migrating away from joblib/pickle to ONNX/safetensors (integrity checks are sufficient for now)
  • Replacing the ML service framework (FastAPI stays)
  • Adding a full RBAC/user management system (API key auth is sufficient)
  • Rewriting CandleChart.tsx from scratch (targeted fixes only, no full decomposition into hooks)
  • Adding Celery/RQ task queue for training (out of scope, timeout + size limits are sufficient)

Decisions

D-1. Authentication: API Key via Environment Variable

Choice: Simple X-API-Key header checked in Next.js middleware and FastAPI Depends().

Alternatives considered:

  • NextAuth.js session auth — overkill for a single-user internal tool
  • OAuth2 — unnecessary complexity for the use case
  • mTLS — too heavy for dev/staging environments

Rationale: API key is the simplest approach that closes the "zero auth" gap. The key lives in .env (which will be gitignored). Next.js middleware intercepts all /api/* routes. FastAPI uses a shared dependency. Both read from API_KEY env var.

D-2. Input Validation: Zod on Next.js, Pydantic Already on FastAPI

Choice: Add Zod schemas in each proxy route file. FastAPI already uses Pydantic models.

Alternatives considered:

  • Shared validation library — no shared runtime between TS and Python
  • tRPC — would require significant refactoring of the API layer

Rationale: Zod is the standard for Next.js API route validation. Each proxy route gets a schema that validates before forwarding to the ML service. The run_id validation (alphanumeric + hyphens + underscores) is applied at both layers.

D-3. Shared Types: src/types/index.ts Barrel File

Choice: Create src/types/ with domain type files (annotations.ts, candles.ts, predictions.ts, charts.ts) and a barrel index.ts. Replace all 6+ duplicate interface definitions.

Alternatives considered:

  • Zod schemas as single source of truth (infer types) — good but adds Zod dependency to components that don't validate
  • Keep types co-located — current approach, causes drift

Rationale: Separate type files keep imports clean. Components import types, API routes import both types and Zod schemas. No circular dependencies.

D-4. Error Handling: Generic Messages to Client, Structured Server Logs

Choice: All API routes return { error: "Internal server error" } for 500s. Full error logged via console.error (Next.js) and logging.error (FastAPI) with request context.

Alternatives considered:

  • Error tracking service (Sentry) — good addition later, not needed to fix the leak now
  • Custom error codes — over-engineering for this stage

Rationale: Minimal change. Replace error.message with static string in catch blocks. The actual error is already logged in most routes.

D-5. Frontend Stale Closures: Refs for Mutable State

Choice: Use useRef for values that change frequently and are read in event handlers (modelInfo, selectedSpanId, drawingState, dragState, annotations). Keep useState for values that trigger re-renders.

Alternatives considered:

  • Zustand store — solves the closure problem but is a large refactor
  • useReducer — consolidates state but doesn't fix closure staleness

Rationale: Refs are the targeted fix. Each stale closure issue (C-9, C-11, C-12) is fixed by reading from a ref instead of the closure variable. Minimal blast radius.

D-6. Docker Security: Localhost Bindings + Env Var Interpolation

Choice: Bind all ports to 127.0.0.1, use ${POSTGRES_PASSWORD} interpolation in docker-compose.yml, add .env to .gitignore.

Alternatives considered:

  • Docker secrets — better for production swarm, overkill for compose
  • Vault — enterprise-grade, way too heavy

Rationale: Env var interpolation is native to docker-compose and is the standard pattern. .env file is read automatically by compose. Binding to localhost prevents external access without requiring firewall rules.

D-7. AbortController for Prediction Requests

Choice: Each prediction fetch creates an AbortController. A ref holds the current controller. New requests abort the previous one. Stale responses are discarded by checking if the signal was aborted.

Alternatives considered:

  • Request ID tracking (increment counter, discard if stale) — works but doesn't free network resources
  • Debouncing — reduces frequency but doesn't prevent race conditions

Rationale: AbortController is the browser-native solution. It both prevents stale responses AND frees up network resources. Combined with the ref pattern from D-5, this cleanly solves C-10.

D-8. Security Headers via next.config.js

Choice: Add headers() function to next.config.js with CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy.

Alternatives considered:

  • Middleware-based headers — works but next.config.js is the canonical Next.js approach
  • Reverse proxy (nginx) headers — adds infrastructure complexity

Rationale: Next.js config headers() is zero-dependency and applies to all routes automatically.

Risks / Trade-offs

[API key auth is minimal] → Sufficient for internal tool. Can upgrade to session auth later without breaking changes. The middleware pattern makes swapping auth strategies straightforward.

[Refs pattern increases cognitive complexity] → Developers must remember to update both state and ref. Mitigated by co-locating ref updates next to setState calls and adding brief comments.

[93 issues is a large changeset] → Risk of introducing regressions. Mitigated by phased approach (security first, then stability, then polish) and the model-tier task dispatch (complex changes get opus-level attention).

[No tests to catch regressions] → This is explicitly a non-goal for this change, but it's the biggest risk. Mitigated by making each fix small and targeted, avoiding broad refactors.

[Breaking docker-compose.yml env var pattern] → Existing users with hardcoded .env files will need to update. Mitigated by updating .env.example with clear instructions.

Migration Plan

  1. Phase 1 (Security Critical): .gitignore, credentials, port bindings, run_id validation, upload limits, CORS — all low-effort, high-impact
  2. Phase 2 (Before Production): Auth middleware, Zod validation, transaction fixes, error message cleanup, security headers, Dockerfile hardening
  3. Phase 3 (Frontend Stability): AbortController, stale closure fixes, memory leaks, response.ok checks, shared types, Error Boundary
  4. Phase 4 (Polish): Rate limiting, accessibility, deprecated APIs, dark theme, debounce, constants extraction

Each phase can be deployed independently. Rollback is git revert on the phase branch.

Open Questions

  • API key distribution: How will the API key be shared with legitimate clients? Manual .env copy for now?
  • Rate limiting implementation: Next.js middleware-based or external (e.g., nginx)? Deferred to Phase 4.
  • Model integrity: SHA256 manifest file checked into git, or computed on first load? Leaning toward manifest file.