candle-annotator/openspec/changes/candle-backend/specs/annotation-ingestion/spec.md

ADDED Requirements

Requirement: Load annotations from JSON export

The system SHALL load annotation data from JSON files exported by the annotation tool, located at data.annotations_path. The expected format is a JSON object with an annotations array where each annotation has: id, start_time, end_time, label, confidence (nullable), outcome (nullable), and sub_spans (nullable).

Scenario: Load valid annotations JSON

  • WHEN data.annotations_path points to a valid JSON file with annotations
  • THEN the system loads all annotation objects into memory for processing

Scenario: Missing annotations file

  • WHEN data.annotations_path points to a file that does not exist and annotation ingestion is enabled
  • THEN the system SHALL fail with an error message identifying the missing file path
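
The two scenarios above can be sketched with the standard library alone. The function name `load_annotations` is hypothetical and not part of the spec; the only assumption about the file is the top-level `annotations` array described in the requirement.

```python
import json
from pathlib import Path


def load_annotations(annotations_path):
    """Load the `annotations` array from a JSON export (hypothetical helper)."""
    path = Path(annotations_path)
    if not path.is_file():
        # Name the missing path in the error, as the scenario requires.
        raise FileNotFoundError(f"Annotations file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        payload = json.load(fh)
    # Assumption: the export is an object with a top-level "annotations" array.
    return list(payload["annotations"])
```

A caller would pass the configured value of data.annotations_path and let the exception propagate so that the pipeline stops with the offending path in the message.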

Scenario: Filter by confidence

  • WHEN stages.annotation_ingestion.min_confidence is set to 3
  • THEN annotations with confidence below 3 SHALL be excluded from the labeled dataset
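
A minimal sketch of the confidence filter follows. The spec does not say how a null confidence should be treated; this sketch assumes a null value is not "below" the threshold and keeps the annotation. The function name is hypothetical.

```python
def filter_by_confidence(annotations, min_confidence):
    """Drop annotations whose confidence is below min_confidence."""
    kept = []
    for ann in annotations:
        confidence = ann.get("confidence")
        # Assumption: null confidence is kept, since it is not below the threshold.
        if confidence is None or confidence >= min_confidence:
            kept.append(ann)
    return kept
```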

Requirement: Windowed classification encoding

When stages.annotation_ingestion.label_encoding is "window", the system SHALL convert each annotation span into a fixed-size window of candles. The window size is defined by stages.annotation_ingestion.window_size. If the annotation span is shorter than window_size, the system SHALL pad with context candles (centered on the span). If the span is longer, the system SHALL use the full span. Each window becomes one row in the output with flattened OHLCV + feature columns.

Scenario: Span shorter than window

  • WHEN an annotation spans 10 candles and window_size is 30
  • THEN the system extracts 30 candles centered on the annotation (10 before, 10 span, 10 after) and flattens them into a single row

Scenario: Span longer than window

  • WHEN an annotation spans 50 candles and window_size is 30
  • THEN the system uses all 50 candles and flattens them into a single row

Scenario: Span near dataset boundary

  • WHEN an annotation is near the start or end of the dataset and too few candles exist on that side to fill the padding
  • THEN the system SHALL pad with as many candles as available (no error), filling missing positions with NaN
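
The three window scenarios can be illustrated as below. This is a sketch under stated assumptions: spans are given as inclusive integer candle indices, candles are plain tuples such as (open, high, low, close, volume), and when the padding is odd the extra candle goes after the span. Both function names are hypothetical.

```python
import math


def window_indices(span_start, span_end, window_size, n_candles):
    """Return candle indices for one window; None marks a missing position."""
    span_len = span_end - span_start + 1
    if span_len >= window_size:
        # Span at least as long as the window: use the full span.
        return list(range(span_start, span_end + 1))
    extra = window_size - span_len
    before = extra // 2  # assumption: an odd remainder goes after the span
    after = extra - before
    wanted = range(span_start - before, span_end + after + 1)
    return [i if 0 <= i < n_candles else None for i in wanted]


def flatten_window(candles, indices):
    """Flatten per-candle tuples into one row, using NaN for missing candles."""
    width = len(candles[0])
    row = []
    for i in indices:
        row.extend(candles[i] if i is not None else [math.nan] * width)
    return row
```

For a 10-candle span at indices 10 to 19 with window_size 30, `window_indices` yields indices 0 to 29, matching the 10 before, 10 span, 10 after split in the first scenario.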

Requirement: BIO sequence labeling encoding

When stages.annotation_ingestion.label_encoding is "bio", the system SHALL assign a BIO tag to each candle in the dataset based on annotations. The first candle of an annotation span gets B-{label}, subsequent candles in the span get I-{label}, and candles outside any annotation get O.

Scenario: Single annotation BIO tags

  • WHEN a "bull_flag" annotation spans candles at times T5 through T8
  • THEN candle T5 gets tag B-bull_flag, candles T6-T8 get I-bull_flag, all other candles get O

Scenario: Overlapping annotations

  • WHEN two annotations overlap in time range
  • THEN the system SHALL create multiple tag columns (bio_tag_1, bio_tag_2) to represent both annotations
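
One possible way to produce the tags, including the extra columns for overlapping spans, is sketched here. It assumes annotations have already been mapped to inclusive candle indices and places each annotation in the first column where its span is still free; the spec does not prescribe this placement rule, and the function name is hypothetical.

```python
def bio_tag_columns(n_candles, annotations):
    """annotations: list of (start_idx, end_idx, label) with inclusive indices."""
    columns = [["O"] * n_candles]
    for start, end, label in sorted(annotations):
        span = range(start, end + 1)
        # Assumption: use the first column whose cells are all "O" for this span.
        target = next(
            (col for col in columns if all(col[i] == "O" for i in span)), None
        )
        if target is None:
            target = ["O"] * n_candles
            columns.append(target)
        target[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            target[i] = f"I-{label}"
    return {f"bio_tag_{k}": col for k, col in enumerate(columns, start=1)}
```

With a single annotation the result has only `bio_tag_1`; a second column appears only when a span collides with one already placed.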

Requirement: Programmatic TA-Lib pattern labels

When stages.annotation_ingestion.programmatic_labels.enabled is true, the system SHALL run the TA-Lib CDL* pattern recognition functions listed in talib_patterns on the OHLC data. Each CDL function returns one integer per candle: a positive value (typically +100) for a bullish pattern, a negative value (typically -100) for a bearish pattern, or 0 for no pattern. A few patterns can return other magnitudes, such as ±200. The system SHALL convert non-zero results to label names (e.g., CDL_ENGULFING with +100 → bullish_engulfing).

Scenario: Detect engulfing pattern

  • WHEN CDL_ENGULFING is in the talib_patterns list and the OHLC data contains an engulfing pattern
  • THEN the system generates a label bullish_engulfing or bearish_engulfing for the corresponding candle

Scenario: No pattern detected

  • WHEN a CDL function returns 0 for a candle
  • THEN no programmatic label is assigned to that candle
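
The conversion from CDL output to label names might look like the sketch below. It does not call TA-Lib itself; `values` stands for the integer series a CDL function returns (in the Python wrapper, for example, `talib.CDLENGULFING(open, high, low, close)`). Treating any positive value as bullish and any negative value as bearish is an assumption of this sketch, as is the helper name.

```python
def labels_from_cdl(pattern_name, values):
    """Map CDL integer output to label names; None means no pattern."""
    # "CDL_ENGULFING" and "CDLENGULFING" both reduce to "engulfing".
    base = pattern_name.upper().removeprefix("CDL").lstrip("_").lower()
    labels = []
    for value in values:
        if value > 0:
            labels.append(f"bullish_{base}")
        elif value < 0:
            labels.append(f"bearish_{base}")
        else:
            labels.append(None)  # 0: no programmatic label for this candle
    return labels
```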

Requirement: Human and programmatic label merge

When both human annotations and programmatic labels exist for the same candle, the system SHALL merge them using the strategy in stages.annotation_ingestion.merge_strategy: "human_priority" keeps the human label, "programmatic_priority" keeps the TA-Lib label, "both" keeps both as separate label columns.

Scenario: Human priority merge

  • WHEN merge_strategy is "human_priority" and a candle has human label "bull_flag" and programmatic label "bullish_engulfing"
  • THEN the output label for that candle is "bull_flag"

Scenario: Both labels merge

  • WHEN merge_strategy is "both" and a candle has both human and programmatic labels
  • THEN the output has two separate label columns: label_human and label_programmatic
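
A per-candle sketch of the three strategies is shown below. The spec only defines the case where both labels exist; falling back to whichever label is present when the other is missing is an assumption here, and the function name and the single `label` output key are hypothetical.

```python
def merge_labels(human, programmatic, strategy):
    """Merge one candle's labels; either input may be None."""
    if strategy == "human_priority":
        # Assumption: fall back to the programmatic label if no human label exists.
        return {"label": human if human is not None else programmatic}
    if strategy == "programmatic_priority":
        return {"label": programmatic if programmatic is not None else human}
    if strategy == "both":
        return {"label_human": human, "label_programmatic": programmatic}
    raise ValueError(f"Unknown merge_strategy: {strategy!r}")
```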

Requirement: Context padding

The system SHALL include stages.annotation_ingestion.context_padding candles before and after each annotation span in the labeled output. This provides trend context for models.

Scenario: Add padding candles

  • WHEN context_padding is 20 and an annotation spans candles T10 to T15
  • THEN the output includes candles from index 10 − 20 = −10, which is clamped to the dataset start, through T35 (15 + 20), or the dataset end if that comes first, all associated with that annotation
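
The clamping in this scenario reduces to two bounds checks. The sketch assumes zero-based, inclusive candle indices; the function name is hypothetical.

```python
def context_range(span_start, span_end, context_padding, n_candles):
    """Return the inclusive (first, last) candle indices including padding."""
    first = max(0, span_start - context_padding)  # clamp to dataset start
    last = min(n_candles - 1, span_end + context_padding)  # clamp to dataset end
    return first, last
```

For the scenario above, `context_range(10, 15, 20, 100)` gives `(0, 35)`: the lower bound of -10 is clamped to the dataset start.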

Requirement: Dataset statistics logging

After annotation ingestion completes, the system SHALL log: total annotations by label, class distribution percentages, average span length per label, and agreement rate between human and programmatic labels (when both are enabled).

Scenario: Log class distribution

  • WHEN annotation ingestion completes with 50 "bull_flag", 30 "bear_flag", and 200 "O" labels
  • THEN the system logs the counts and percentages for each class
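
As an illustration, class counts, percentages and the agreement rate could be computed as follows. The logger name and both function names are hypothetical, and measuring agreement only over candles that carry both a human and a programmatic label is an assumption; the spec does not define the denominator.

```python
import logging
from collections import Counter

logger = logging.getLogger("annotation_ingestion")  # hypothetical logger name


def class_distribution(labels):
    """Return {label: (count, percentage)} and log one line per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    dist = {label: (n, 100.0 * n / total) for label, n in counts.items()}
    for label, (n, pct) in dist.items():
        logger.info("label=%s count=%d pct=%.1f%%", label, n, pct)
    return dist


def agreement_rate(human, programmatic):
    """Fraction of candles where both labels exist and are equal, or None."""
    # Assumption: candles missing either label are left out of the rate.
    pairs = [(h, p) for h, p in zip(human, programmatic)
             if h is not None and p is not None]
    if not pairs:
        return None
    return sum(h == p for h, p in pairs) / len(pairs)
```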

Requirement: Labeled CSV output

The system SHALL write the labeled dataset to data.labeled_path in CSV format. The output SHALL contain all feature columns plus the target label column(s).

Scenario: Write labeled CSV

  • WHEN annotation ingestion completes successfully
  • THEN the labeled CSV is written to data.labeled_path with all feature and label columns
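
Writing the output can be sketched with the standard `csv` module. The sketch assumes each row is a dict holding both feature and label columns and that the first row defines the column order; the function name is hypothetical.

```python
import csv


def write_labeled_csv(rows, labeled_path):
    """Write dict rows (features plus label columns) to a CSV file."""
    # Assumption: every row has the same keys as the first row.
    fieldnames = list(rows[0].keys())
    with open(labeled_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```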