candle-annotator/openspec/changes/candle-backend/specs/annotation-ingestion/spec.md at 1a653c58663693c748fccf61f45b776b7ad83561

Marko Djordjevic 1a653c5866 feat: add ML service scaffolding with Python FastAPI, Docker, and MLflow setup

2026-02-15 11:58:31 +01:00


ADDED Requirements

Requirement: Load annotations from JSON export

The system SHALL load annotation data from JSON files exported by the annotation tool, located at data.annotations_path. The expected format is a JSON object with an annotations array where each annotation has: id, start_time, end_time, label, confidence (nullable), outcome (nullable), and sub_spans (nullable).

Scenario: Load valid annotations JSON

WHEN data.annotations_path points to a valid JSON file with annotations
THEN the system loads all annotation objects into memory for processing

Scenario: Missing annotations file

WHEN data.annotations_path points to a file that does not exist and annotation ingestion is enabled
THEN the system SHALL fail with an error message identifying the missing file path

Scenario: Filter by confidence

WHEN stages.annotation_ingestion.min_confidence is set to 3
THEN annotations with confidence below 3 SHALL be excluded from the labeled dataset
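
The loading and filtering scenarios above could be sketched as follows. This is a minimal illustration, not the service's actual code: the function name `load_annotations` is hypothetical, and keeping annotations whose confidence is null is an assumption, since the requirement does not say how the filter treats them.

```python
import json
from pathlib import Path


def load_annotations(annotations_path, min_confidence=None):
    """Load annotation objects and optionally drop low-confidence ones."""
    path = Path(annotations_path)
    if not path.is_file():
        # Fail loudly and name the path, as the missing-file scenario requires.
        raise FileNotFoundError(f"Annotations file not found: {path}")

    with path.open(encoding="utf-8") as handle:
        payload = json.load(handle)

    annotations = payload["annotations"]
    if min_confidence is None:
        return annotations

    # Assumption: annotations with a null confidence are kept, because the
    # spec does not say how the filter treats missing values.
    return [
        a for a in annotations
        if a.get("confidence") is None or a["confidence"] >= min_confidence
    ]
```

Calling `load_annotations(path, min_confidence=3)` would then return only annotations rated 3 or higher, plus any that carry no rating.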

Requirement: Windowed classification encoding

When stages.annotation_ingestion.label_encoding is "window", the system SHALL convert each annotation span into a window of candles. The target window size is defined by stages.annotation_ingestion.window_size. If the annotation span is shorter than window_size, the system SHALL pad it with context candles so that the span is centered in the window. If the span is longer than window_size, the system SHALL use the full span. Each window becomes one row in the output with flattened OHLCV + feature columns.

Scenario: Span shorter than window

WHEN an annotation spans 10 candles and window_size is 30
THEN the system extracts 30 candles centered on the annotation (10 before, 10 span, 10 after) and flattens them into a single row

Scenario: Span longer than window

WHEN an annotation spans 50 candles and window_size is 30
THEN the system uses all 50 candles and flattens them into a single row

Scenario: Span near dataset boundary

WHEN an annotation is near the start of the dataset and there are not enough candles for padding
THEN the system SHALL pad with as many candles as are available and fill the missing positions with NaN, without raising an error
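
The three window scenarios above can be illustrated with a small sketch. The helper names `window_bounds` and `extract_window` are hypothetical, candles are assumed to be a list of equal-length numeric rows, and sending an odd leftover candle to the trailing side is an assumption the spec does not settle.

```python
import math


def window_bounds(span_start, span_end, window_size):
    """Return (first, last) inclusive candle indices for one annotation.

    span_start and span_end are inclusive candle indices.
    """
    span_len = span_end - span_start + 1
    if span_len >= window_size:
        return span_start, span_end  # long spans are used in full
    extra = window_size - span_len
    before = extra // 2  # assumption: an odd remainder goes after the span
    first = span_start - before
    return first, first + window_size - 1


def extract_window(candles, span_start, span_end, window_size):
    """Flatten a window of candle rows, using NaN where no candle exists."""
    first, last = window_bounds(span_start, span_end, window_size)
    width = len(candles[0])
    row = []
    for i in range(first, last + 1):
        if 0 <= i < len(candles):
            row.extend(candles[i])
        else:
            # Position falls outside the dataset: fill with NaN, no error.
            row.extend([math.nan] * width)
    return row
```

For a 10-candle span at indices 20 to 29 and a window_size of 30, `window_bounds` gives indices 10 through 39, matching the "10 before, 10 span, 10 after" scenario.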

Requirement: BIO sequence labeling encoding

When stages.annotation_ingestion.label_encoding is "bio", the system SHALL assign a BIO tag to each candle in the dataset based on annotations. The first candle of an annotation span gets B-{label}, subsequent candles in the span get I-{label}, and candles outside any annotation get O.

Scenario: Single annotation BIO tags

WHEN a "bull_flag" annotation spans candles at times T5 through T8
THEN candle T5 gets tag B-bull_flag, candles T6-T8 get I-bull_flag, all other candles get O

Scenario: Overlapping annotations

WHEN two annotations overlap in time range
THEN the system SHALL create multiple tag columns (bio_tag_1, bio_tag_2) to represent both annotations
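
One way to read the BIO requirement, including the overlap scenario, is sketched below. The function name `bio_tag_columns` is hypothetical, spans are given as inclusive candle indices rather than timestamps, and the rule of placing each annotation in the first column where its span is still free is an assumption; the spec only says that multiple tag columns are created.

```python
def bio_tag_columns(num_candles, annotations):
    """Assign BIO tags, opening an extra column when spans overlap.

    annotations: iterable of (start_index, end_index, label), inclusive.
    Returns a dict such as {"bio_tag_1": [...], "bio_tag_2": [...]}.
    """
    columns = []
    for start, end, label in sorted(annotations):
        # Assumption: reuse the first column whose candles in this span
        # are all still "O"; otherwise open a new column.
        for tags in columns:
            if all(t == "O" for t in tags[start:end + 1]):
                break
        else:
            tags = ["O"] * num_candles
            columns.append(tags)
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    if not columns:
        columns.append(["O"] * num_candles)
    return {f"bio_tag_{n}": tags for n, tags in enumerate(columns, start=1)}
```

With a single "bull_flag" annotation over indices 5 to 8, only `bio_tag_1` is produced, with `B-bull_flag` at index 5 and `I-bull_flag` at indices 6 to 8.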

Requirement: Programmatic TA-Lib pattern labels

When stages.annotation_ingestion.programmatic_labels.enabled is true, the system SHALL run the TA-Lib CDL* pattern recognition functions listed in talib_patterns on the OHLC data. Each CDL function returns a positive value (typically +100) for a bullish pattern, a negative value (typically -100) for a bearish pattern, or 0 for no pattern. The system SHALL convert non-zero results to label names (e.g., CDL_ENGULFING with +100 → bullish_engulfing).

Scenario: Detect engulfing pattern

WHEN CDL_ENGULFING is in the talib_patterns list and the OHLC data contains an engulfing pattern
THEN the system generates a label bullish_engulfing or bearish_engulfing for the corresponding candle

Scenario: No pattern detected

WHEN a CDL function returns 0 for a candle
THEN no programmatic label is assigned to that candle
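
The conversion from CDL output values to label names might look like the sketch below. It deliberately takes the per-candle integers as input instead of calling TA-Lib, so it runs without that dependency; in the Python TA-Lib wrapper the values would come from a call such as `talib.CDLENGULFING(open, high, low, close)`. The function name `pattern_labels` and the rule for deriving the label stem from the configured pattern name are assumptions.

```python
def pattern_labels(pattern_name, values):
    """Map CDL output values to label names; 0 becomes None (no label).

    pattern_name: config entry such as "CDL_ENGULFING".
    values: per-candle integers as returned by a TA-Lib CDL function.
    """
    # Assumption: the label stem is the pattern name without its "CDL"
    # prefix and any underscore that follows it, lowercased.
    stem = pattern_name.upper().removeprefix("CDL").lstrip("_").lower()
    labels = []
    for value in values:
        if value > 0:
            labels.append(f"bullish_{stem}")
        elif value < 0:
            labels.append(f"bearish_{stem}")
        else:
            labels.append(None)  # no pattern, so no programmatic label
    return labels
```

Testing the sign of the value, not equality with 100, keeps the mapping valid for any CDL function that reports a different magnitude.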

Requirement: Human and programmatic label merge

When both human annotations and programmatic labels exist for the same candle, the system SHALL merge them using the strategy in stages.annotation_ingestion.merge_strategy: "human_priority" keeps the human label, "programmatic_priority" keeps the TA-Lib label, "both" keeps both as separate label columns.

Scenario: Human priority merge

WHEN merge_strategy is "human_priority" and a candle has human label "bull_flag" and programmatic label "bullish_engulfing"
THEN the output label for that candle is "bull_flag"

Scenario: Both labels merge

WHEN merge_strategy is "both" and a candle has both human and programmatic labels
THEN the output has two separate label columns: label_human and label_programmatic
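
The three merge strategies can be expressed per candle as in the following sketch. The function name `merge_labels` and the single `label` output column are hypothetical, and falling back to the other source when the preferred one has no label is an assumption; the requirement only covers candles where both labels exist.

```python
def merge_labels(human, programmatic, merge_strategy):
    """Merge two per-candle labels and return the output columns."""
    if merge_strategy == "both":
        return {"label_human": human, "label_programmatic": programmatic}
    if merge_strategy == "human_priority":
        preferred, fallback = human, programmatic
    elif merge_strategy == "programmatic_priority":
        preferred, fallback = programmatic, human
    else:
        raise ValueError(f"Unknown merge_strategy: {merge_strategy!r}")
    # Assumption: when only one source labels the candle, that label is used.
    return {"label": preferred if preferred is not None else fallback}
```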

Requirement: Context padding

The system SHALL include stages.annotation_ingestion.context_padding candles before and after each annotation span in the labeled output. This provides trend context for models.

Scenario: Add padding candles

WHEN context_padding is 20 and an annotation spans candles T10 to T15
THEN the output includes candles from T-10 through T35 associated with that annotation, with the start clamped to the dataset start and the end clamped to the dataset end
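
The padding arithmetic in this scenario amounts to clamping an index range. A minimal sketch, assuming candles are addressed by zero-based index and using the hypothetical helper name `padded_range`:

```python
def padded_range(span_start, span_end, context_padding, num_candles):
    """Return the inclusive (first, last) indices of a span plus padding.

    The range is clamped so it never leaves the dataset.
    """
    first = max(0, span_start - context_padding)
    last = min(num_candles - 1, span_end + context_padding)
    return first, last
```

For a span from index 10 to 15 with context_padding set to 20 in a 100-candle dataset, the unclamped start would be -10, so the result is the range 0 through 35.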

Requirement: Dataset statistics logging

After annotation ingestion completes, the system SHALL log: total annotations by label, class distribution percentages, average span length per label, and agreement rate between human and programmatic labels (when both are enabled).

Scenario: Log class distribution

WHEN annotation ingestion completes with 50 "bull_flag", 30 "bear_flag", and 200 "O" labels
THEN the system logs the counts and percentages for each class
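
Counts and percentages per class could be computed and logged as sketched below. The logger name, the log message layout, and the function names are all hypothetical, and the sketch covers only the class distribution, not span lengths or agreement rate.

```python
import logging
from collections import Counter

logger = logging.getLogger("annotation_ingestion")


def class_distribution(labels):
    """Return {label: (count, percentage)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, 100.0 * count / total)
        for label, count in counts.items()
    }


def log_class_distribution(labels):
    """Log one line per class with its count and share of the dataset."""
    for label, (count, pct) in sorted(class_distribution(labels).items()):
        logger.info("label=%s count=%d pct=%.1f%%", label, count, pct)
```

In the scenario above the total is 280 labels, so "O" accounts for 200 of 280, or about 71.4%.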

Requirement: Labeled CSV output

The system SHALL write the labeled dataset to data.labeled_path in CSV format. The output SHALL contain all feature columns plus the target label column(s).

Scenario: Write labeled CSV

WHEN annotation ingestion completes successfully
THEN the labeled CSV is written to data.labeled_path with all feature and label columns
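
Writing the labeled dataset could be as simple as the following sketch, which uses the standard-library csv module. The function name `write_labeled_csv` and the representation of rows as dictionaries are assumptions; placing the feature columns before the label columns is also a choice made here, not something the spec mandates.

```python
import csv


def write_labeled_csv(labeled_path, rows, feature_columns, label_columns):
    """Write rows (dicts) to CSV with feature columns, then label columns."""
    fieldnames = list(feature_columns) + list(label_columns)
    with open(labeled_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```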
