candle-annotator/openspec/changes/candle-backend/specs/annotation-ingestion/spec.md

## ADDED Requirements
### Requirement: Load annotations from JSON export
The system SHALL load annotation data from JSON files exported by the annotation tool, located at `data.annotations_path`. The expected format is a JSON object with an `annotations` array where each annotation has: `id`, `start_time`, `end_time`, `label`, `confidence` (nullable), `outcome` (nullable), and `sub_spans` (nullable).
#### Scenario: Load valid annotations JSON
- **WHEN** `data.annotations_path` points to a valid JSON file with annotations
- **THEN** the system loads all annotation objects into memory for processing
#### Scenario: Missing annotations file
- **WHEN** `data.annotations_path` points to a file that does not exist and annotation ingestion is enabled
- **THEN** the system SHALL fail with an error message identifying the missing file path
#### Scenario: Filter by confidence
- **WHEN** `stages.annotation_ingestion.min_confidence` is set to 3
- **THEN** annotations with confidence below 3 SHALL be excluded from the labeled dataset
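
The loading and filtering scenarios above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `load_annotations` and the choice to keep annotations whose `confidence` is null are assumptions, not part of the requirement.

```python
import json
from pathlib import Path


def load_annotations(annotations_path, min_confidence=None):
    """Load the `annotations` array and optionally drop low-confidence items.

    `load_annotations` is a hypothetical helper name used for illustration.
    """
    path = Path(annotations_path)
    if not path.is_file():
        # Name the missing path in the error, as the scenario requires.
        raise FileNotFoundError(f"Annotations file not found: {path}")

    with path.open(encoding="utf-8") as fh:
        payload = json.load(fh)

    annotations = payload["annotations"]
    if min_confidence is None:
        return annotations

    # Assumption: a null confidence is kept, because the requirement only
    # excludes annotations whose confidence is *below* the threshold.
    return [
        a for a in annotations
        if a.get("confidence") is None or a["confidence"] >= min_confidence
    ]
```

If null-confidence annotations should be dropped instead, only the final list comprehension needs to change.
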
### Requirement: Windowed classification encoding
When `stages.annotation_ingestion.label_encoding` is "window", the system SHALL convert each annotation span into a window of candles whose target size is `stages.annotation_ingestion.window_size`. If the annotation span is shorter than `window_size`, the system SHALL pad it with context candles so that the span is centered in the window. If the span is longer than `window_size`, the system SHALL use the full span. Each window becomes one output row with flattened OHLCV and feature columns.
#### Scenario: Span shorter than window
- **WHEN** an annotation spans 10 candles and window_size is 30
- **THEN** the system extracts 30 candles centered on the annotation (10 before, the 10-candle span, 10 after) and flattens them into a single row
#### Scenario: Span longer than window
- **WHEN** an annotation spans 50 candles and window_size is 30
- **THEN** the system uses all 50 candles and flattens them into a single row
#### Scenario: Span near dataset boundary
- **WHEN** an annotation is near the start of the dataset and too few candles precede it to fill the padding
- **THEN** the system SHALL use as many context candles as are available, fill the missing positions with NaN, and not raise an error
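
The three windowing scenarios can be sketched for a single column of values as follows. The helper names, the use of integer candle indices with inclusive ends, and the rule that an odd amount of padding puts the extra candle after the span are all assumptions made for illustration; flattening across OHLCV and feature columns is not shown.

```python
import math


def window_indices(span_start, span_end, window_size):
    """Return inclusive (lo, hi) candle indices for one annotation window."""
    span_len = span_end - span_start + 1
    if span_len >= window_size:
        # Span at least as long as the window: use the full span.
        return span_start, span_end
    extra = window_size - span_len
    before = extra // 2  # assumption: any odd remainder goes after the span
    lo = span_start - before
    return lo, lo + window_size - 1


def extract_window(values, span_start, span_end, window_size):
    """Slice one column into a window, using NaN where no candle exists."""
    lo, hi = window_indices(span_start, span_end, window_size)
    return [
        values[i] if 0 <= i < len(values) else math.nan
        for i in range(lo, hi + 1)
    ]
```

For a 10-candle span at indices 10 through 19 with `window_size` 30, `window_indices` yields indices 0 through 29, matching the "10 before, 10 after" scenario.
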
### Requirement: BIO sequence labeling encoding
When `stages.annotation_ingestion.label_encoding` is "bio", the system SHALL assign a BIO tag to each candle in the dataset based on annotations. The first candle of an annotation span gets `B-{label}`, subsequent candles in the span get `I-{label}`, and candles outside any annotation get `O`.
#### Scenario: Single annotation BIO tags
- **WHEN** a "bull_flag" annotation spans candles at times T5 through T8
- **THEN** candle T5 gets tag `B-bull_flag`, candles T6-T8 get `I-bull_flag`, all other candles get `O`
#### Scenario: Overlapping annotations
- **WHEN** two annotations overlap in time range
- **THEN** the system SHALL create multiple tag columns (`bio_tag_1`, `bio_tag_2`) to represent both annotations
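
A minimal sketch of the BIO encoding, including the overlap case, is shown below. It assumes annotation times have already been mapped to inclusive candle indices, and it places each annotation in the first tag column where its span is still free; that placement rule and the name `bio_tags` are assumptions, since the requirement does not say how overlapping annotations are assigned to columns.

```python
def bio_tags(n_candles, annotations):
    """Build BIO tag columns.

    annotations: list of (start_idx, end_idx, label) with inclusive indices.
    Returns a dict mapping column names (bio_tag_1, bio_tag_2, ...) to lists
    of n_candles tags.
    """
    columns = []
    for start, end, label in sorted(annotations):
        # Reuse the first column whose candles in this span are all "O".
        for col in columns:
            if all(tag == "O" for tag in col[start:end + 1]):
                break
        else:
            # No free column (or none yet): open a new one.
            col = ["O"] * n_candles
            columns.append(col)
        col[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            col[i] = f"I-{label}"
    if not columns:
        columns.append(["O"] * n_candles)
    return {f"bio_tag_{i + 1}": col for i, col in enumerate(columns)}
```

With no overlaps the result has a single `bio_tag_1` column; a second column appears only when two spans share at least one candle.
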
### Requirement: Programmatic TA-Lib pattern labels
When `stages.annotation_ingestion.programmatic_labels.enabled` is true, the system SHALL run the TA-Lib CDL* pattern recognition functions listed in `talib_patterns` on the OHLC data. Each CDL function returns one integer per candle: a positive value (typically +100) for a bullish pattern, a negative value (typically -100) for a bearish pattern, or 0 for no pattern. The system SHALL convert non-zero results to label names (e.g., `CDL_ENGULFING` with +100 → `bullish_engulfing`).
#### Scenario: Detect engulfing pattern
- **WHEN** `CDL_ENGULFING` is in the talib_patterns list and the OHLC data contains an engulfing pattern
- **THEN** the system generates a label `bullish_engulfing` or `bearish_engulfing` for the corresponding candle
#### Scenario: No pattern detected
- **WHEN** a CDL function returns 0 for a candle
- **THEN** no programmatic label is assigned to that candle
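
The conversion from CDL output to label names might look like the sketch below. It deliberately does not call TA-Lib: it assumes the integer output for one pattern has already been computed and is passed in as a sequence. The naming rule (strip the `CDL_` prefix, lowercase, prepend `bullish_` or `bearish_` by sign) is inferred from the `bullish_engulfing` example, and `pattern_labels` is a hypothetical helper name.

```python
def pattern_labels(pattern_name, cdl_output):
    """Map one pattern's per-candle integers to label names (or None)."""
    # Assumption: config names look like "CDL_ENGULFING".
    base = pattern_name[4:] if pattern_name.startswith("CDL_") else pattern_name
    base = base.lower()

    labels = []
    for value in cdl_output:
        if value > 0:
            labels.append(f"bullish_{base}")
        elif value < 0:
            labels.append(f"bearish_{base}")
        else:
            labels.append(None)  # 0 means no pattern, so no label
    return labels
```

Testing the sign rather than comparing against exactly +100 or -100 keeps the mapping valid for any non-zero magnitude.
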
### Requirement: Human and programmatic label merge
When both human annotations and programmatic labels exist for the same candle, the system SHALL merge them using the strategy in `stages.annotation_ingestion.merge_strategy`: "human_priority" keeps the human label, "programmatic_priority" keeps the TA-Lib label, "both" keeps both as separate label columns.
#### Scenario: Human priority merge
- **WHEN** merge_strategy is "human_priority" and a candle has human label "bull_flag" and programmatic label "bullish_engulfing"
- **THEN** the output label for that candle is "bull_flag"
#### Scenario: Both labels merge
- **WHEN** merge_strategy is "both" and a candle has both human and programmatic labels
- **THEN** the output has two separate label columns: `label_human` and `label_programmatic`
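
The three merge strategies can be sketched per candle as follows. The single-column name `label` and the fallback to the other source when the preferred label is missing are assumptions; the requirement only defines behavior when both labels exist.

```python
def merge_labels(human, programmatic, merge_strategy):
    """Merge one candle's labels; returns a dict of output column values."""
    if merge_strategy == "human_priority":
        # Assumption: fall back to the programmatic label if no human label.
        return {"label": human if human is not None else programmatic}
    if merge_strategy == "programmatic_priority":
        # Assumption: fall back to the human label if no programmatic label.
        return {"label": programmatic if programmatic is not None else human}
    if merge_strategy == "both":
        return {"label_human": human, "label_programmatic": programmatic}
    raise ValueError(f"Unknown merge_strategy: {merge_strategy!r}")
```

Rejecting unknown strategy names early surfaces configuration typos instead of silently producing unlabeled rows.
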
### Requirement: Context padding
The system SHALL include `stages.annotation_ingestion.context_padding` candles before and after each annotation span in the labeled output. This provides trend context for models.
#### Scenario: Add padding candles
- **WHEN** context_padding is 20 and an annotation spans candles T10 to T15
- **THEN** the output associates with that annotation the candles from 20 before T10 (or the dataset start, if fewer exist) through T35 (or the dataset end, if fewer exist)
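
The clamping in this scenario can be sketched as below, assuming candles are addressed by zero-based integer index and the span ends are inclusive. `padded_range` is a hypothetical helper name.

```python
def padded_range(span_start, span_end, context_padding, n_candles):
    """Return inclusive (lo, hi) indices of the span plus context padding."""
    lo = max(0, span_start - context_padding)            # clamp to dataset start
    hi = min(n_candles - 1, span_end + context_padding)  # clamp to dataset end
    return lo, hi
```

For a span at indices 10 through 15 with `context_padding` 20 in a 100-candle dataset, the lower bound clamps to index 0 and the upper bound is index 35.
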
### Requirement: Dataset statistics logging
After annotation ingestion completes, the system SHALL log: total annotations by label, class distribution percentages, average span length per label, and agreement rate between human and programmatic labels (when both are enabled).
#### Scenario: Log class distribution
- **WHEN** annotation ingestion completes with 50 "bull_flag", 30 "bear_flag", and 200 "O" labels
- **THEN** the system logs the counts and percentages for each class
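
A minimal sketch of the class distribution part of this requirement is shown below. The helper names and the log line format are assumptions; average span length and agreement rate are not covered here.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: {"count": n, "percent": p}} ordered by frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: {"count": n, "percent": 100.0 * n / total}
        for label, n in counts.most_common()
    }


def format_distribution(dist):
    """Render one log line per class (format chosen for illustration)."""
    return [
        f"{label}: {stats['count']} ({stats['percent']:.1f}%)"
        for label, stats in dist.items()
    ]
```

Each returned line could then be passed to the pipeline's logger, for example via `logging.getLogger(__name__).info(line)`.
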
### Requirement: Labeled CSV output
The system SHALL write the labeled dataset to `data.labeled_path` in CSV format. The output SHALL contain all feature columns plus the target label column(s).
#### Scenario: Write labeled CSV
- **WHEN** annotation ingestion completes successfully
- **THEN** the labeled CSV is written to `data.labeled_path` with all feature and label columns
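
Writing the labeled output could be sketched with the standard `csv` module as follows. The function name, the row-as-dict representation, and the column ordering (features first, then labels) are assumptions for illustration.

```python
import csv


def write_labeled_csv(rows, labeled_path, feature_columns, label_columns):
    """Write rows (dicts) to CSV with feature columns followed by labels."""
    fieldnames = list(feature_columns) + list(label_columns)
    with open(labeled_path, "w", newline="", encoding="utf-8") as fh:
        # DictWriter raises ValueError if a row has a key not in fieldnames,
        # which catches unexpected columns early.
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```
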