Building Curistat: Technical Architecture
The Data Pipeline
Curistat processes over 3.5 million one-minute bars of ES and NQ futures data spanning 10 years (2016-2026). The data originates from DataBento, arrives as raw OHLCV bars with nanosecond timestamps, and flows through a multi-stage pipeline before it becomes a volatility forecast.
The pipeline is strictly incremental. Every data fetch checks what already exists locally, downloads only the delta (new rows since the last fetch), deduplicates, and appends. A full re-download never happens. This design was born from hard experience: early in development, a single "let me just re-download everything" operation took hours and introduced subtle data mismatches that took days to debug. Incremental-only is now an immutable rule.
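The incremental rule can be sketched in a few lines of pandas. This is a minimal illustration with synthetic bars, not Curistat's actual code; `incremental_append` is a hypothetical helper name:

```python
import pandas as pd

def incremental_append(cache: pd.DataFrame, fetched: pd.DataFrame) -> pd.DataFrame:
    """Merge newly fetched bars into the local cache: keep only rows newer
    than the last cached timestamp, deduplicate, and append in order."""
    if not cache.empty:
        last_ts = cache.index.max()
        fetched = fetched[fetched.index > last_ts]  # delta only, never a full re-download
    merged = pd.concat([cache, fetched])
    return merged[~merged.index.duplicated(keep="first")].sort_index()

# Synthetic 1-minute bars standing in for raw OHLCV data
idx_old = pd.date_range("2024-01-02 18:00", periods=3, freq="1min", tz="US/Eastern")
idx_new = pd.date_range("2024-01-02 18:02", periods=3, freq="1min", tz="US/Eastern")
cache = pd.DataFrame({"close": [1.0, 2.0, 3.0]}, index=idx_old)
fetched = pd.DataFrame({"close": [3.0, 4.0, 5.0]}, index=idx_new)

updated = incremental_append(cache, fetched)  # 5 rows: the overlap is deduplicated
```

The key property is idempotence: running the same fetch twice leaves the cache unchanged, which is what makes "incremental-only" safe to automate.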
The stages: raw 1-minute bars are loaded and validated (checking for gaps, duplicate timestamps, and contract rollovers). Bars are assembled into trading sessions -- 23-hour windows from 6 PM to 5 PM Eastern, reflecting the actual MES/MNQ trading schedule with the 1-hour daily maintenance break. Each session becomes a single row in the sessions cache: open, high, low, close, volume, bar count, and the raw bar data needed for intraday feature extraction.
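Session assembly can be illustrated with a small pandas sketch. Shifting each timestamp forward six hours maps a 6 PM-to-5 PM session onto a single calendar date, which then serves as the session key (synthetic data; `build_sessions` is a hypothetical name, not the production function):

```python
import pandas as pd

def build_sessions(bars: pd.DataFrame) -> pd.DataFrame:
    """Roll 1-minute bars into 23-hour sessions (6 PM ET open, 5 PM ET close).
    A +6h shift places every bar of a session on one calendar date."""
    session_date = (bars.index + pd.Timedelta(hours=6)).date
    return bars.groupby(session_date).agg(
        open=("open", "first"), high=("high", "max"),
        low=("low", "min"), close=("close", "last"),
        volume=("volume", "sum"), bar_count=("close", "size"),
    )

idx = pd.to_datetime([
    "2024-01-02 18:00", "2024-01-03 09:30",   # same session (opens Jan 2, 6 PM)
    "2024-01-03 16:59", "2024-01-03 18:00",   # last bar of one session, first of the next
]).tz_localize("US/Eastern")
bars = pd.DataFrame({
    "open":  [4700.0, 4710.0, 4705.0, 4708.0],
    "high":  [4702.0, 4715.0, 4706.0, 4709.0],
    "low":   [4699.0, 4708.0, 4704.0, 4707.0],
    "close": [4701.0, 4712.0, 4705.5, 4708.5],
    "volume": [100, 250, 80, 60],
}, index=idx)

sessions = build_sessions(bars)  # two sessions: 3 bars and 1 bar
```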
Market data from other sources (VIX, VVIX, VIX9D, VIX3M, SKEW, SPY, QQQ, DXY) is fetched separately, backfilled to 2015, and merged with the session data by date. Economic event data (CPI, FOMC, NFP, GDP, and dozens of others) is maintained in a calendar that maps each trading day to its scheduled events. Every data source has a documented staleness threshold -- if any source is more than 24 hours behind, the system raises an alert.
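A staleness check of the kind described is simple to express. The sketch below assumes a uniform 24-hour threshold across sources, though per-source thresholds would work the same way:

```python
from datetime import datetime, timedelta, timezone

STALENESS_HOURS = 24  # documented threshold (assumed uniform here)

def stale_sources(last_updated: dict, now: datetime) -> list:
    """Return the names of data sources more than STALENESS_HOURS behind."""
    limit = timedelta(hours=STALENESS_HOURS)
    return [name for name, ts in last_updated.items() if now - ts > limit]

now = datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)
stale = stale_sources({
    "VIX": datetime(2024, 1, 5, 6, 0, tzinfo=timezone.utc),   # 6 h old: fine
    "DXY": datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc),  # 48 h old: alert
}, now=now)
# stale == ["DXY"] -- this source would trigger the alert
```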
Feature Engineering
Each trading session is described by 62 engineered features. After ablation testing -- systematically removing features and measuring the impact on model accuracy -- 42 remain active. The other 20 are zeroed out: they added noise without improving prediction.
The features span several categories:
Volatility features: prior session standard deviation, plus 3-, 5-, and 10-session average standard deviations. The single most predictive feature in the entire model is the 3-session average standard deviation (correlation r=0.84 with next-session realized volatility). Volatility clustering -- the tendency of volatile days to follow volatile days -- is the statistical foundation the entire model is built on.
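These clustering features reduce to shifted rolling means of the per-session standard deviation. A minimal sketch (the `shift(1)` ensures each feature uses only prior sessions, avoiding look-ahead):

```python
import pandas as pd

def volatility_features(session_std: pd.Series) -> pd.DataFrame:
    """Rolling averages of per-session standard deviation.
    shift(1) restricts every feature to already-completed sessions."""
    prior = session_std.shift(1)
    return pd.DataFrame({
        "prior_std":  prior,
        "avg_std_3":  prior.rolling(3).mean(),
        "avg_std_5":  prior.rolling(5).mean(),
        "avg_std_10": prior.rolling(10).mean(),
    })

std = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
feats = volatility_features(std)
# feats["avg_std_3"].iloc[3] averages sessions 0-2: (1 + 2 + 3) / 3 = 2.0
```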
Term structure features: VIX level, VIX-to-VVIX ratio, VIX 9-day to 3-month spread, contango/backwardation state. The VIX term structure captures market expectations about future volatility and whether those expectations are rising or falling.
Momentum and positioning features: CTA positioning proxy (price relative to key moving averages), consecutive up/down day count, gap size from prior close, overnight range. These capture the flow dynamics that create volatility independent of news events.
Economic event features: binary flags for scheduled events (CPI, FOMC, NFP, etc.) weighted by their historical impact on volatility. After calibration, 15 of 17 economic event weights were zeroed because their impact was already captured by other features (the market prices in event risk through VIX and term structure before the event occurs).
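The weighted-flag scheme can be sketched as a lookup table. The weights below are invented for illustration only; the actual calibrated values are not published here:

```python
# Hypothetical calibrated weights: most zeroed because VIX and the term
# structure already price in the event risk before the event occurs.
EVENT_WEIGHTS = {"FOMC": 0.8, "CPI": 0.5, "NFP": 0.0, "GDP": 0.0}

def event_feature(scheduled_events: list) -> float:
    """Sum the calibrated weights of the day's scheduled events.
    Zero-weight events contribute nothing to the feature."""
    return sum(EVENT_WEIGHTS.get(e, 0.0) for e in scheduled_events)

# An FOMC day with an NFP release scores 0.8: only FOMC carries weight
score = event_feature(["FOMC", "NFP"])
```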
The feature engineering process was iterative and driven by data, not theory. Features were added based on hypotheses, retained based on ablation testing, and pruned aggressively. The philosophy: a smaller model with strong features outperforms a large model with redundant ones.
Model Training and Validation
The model is trained using differential evolution -- a global optimization algorithm that searches for the best feature weights without getting trapped in local optima the way gradient descent methods can. The objective function is the correlation between predicted and actual session volatility.
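With SciPy, an objective of this shape fits directly into `scipy.optimize.differential_evolution`. This sketch uses synthetic features and a correlation objective; it is not the production training code:

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # 4 illustrative features
true_w = np.array([0.9, 0.3, 0.0, 0.0])             # only two carry signal
y = X @ true_w + rng.normal(scale=0.3, size=200)    # synthetic "realized vol"

def neg_corr(w):
    """Negative Pearson correlation between the linear prediction and target
    (negated because the optimizer minimizes)."""
    pred = X @ w
    if pred.std() < 1e-12:
        return 1.0  # degenerate all-zero weights: worst possible score
    return -np.corrcoef(pred, y)[0, 1]

result = differential_evolution(neg_corr, bounds=[(-1, 1)] * 4, seed=0, tol=1e-6)
achieved = -result.fun  # correlation found by the global search
```

Because differential evolution evaluates a whole population of candidate weight vectors each generation, it explores the weight space broadly instead of following a single gradient path into a local optimum.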
Validation uses rolling cross-validation with 37 windows. Each window trains on approximately 2,000 sessions and tests on the next 60. The windows slide forward by 60 sessions each step, covering the full 10-year dataset. This prevents look-ahead bias: the model is always tested on data it has never seen during training.
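The window arithmetic is easy to get wrong, so it helps to see it explicitly. A sketch of the index generation (hypothetical helper; note the test range always starts where the train range ends):

```python
def rolling_windows(n_sessions: int, train: int = 2000, test: int = 60):
    """Yield (train_indices, test_indices) pairs. Each window trains on
    `train` sessions, tests on the next `test`, then slides forward by `test`."""
    start = 0
    while start + train + test <= n_sessions:
        yield (range(start, start + train),
               range(start + train, start + train + test))
        start += test

# 2,000 training sessions plus 37 test blocks of 60 yields exactly 37 windows
windows = list(rolling_windows(n_sessions=2000 + 37 * 60))
```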
The results for ES: training correlation r=0.88, test correlation r=0.78, rolling cross-validation mean r=0.72 across all 37 windows. For context, a correlation of 0.72 means the model explains roughly 52% of the variance in next-session volatility. That leaves 48% unexplained -- the irreducible uncertainty from unforeseeable events, random fluctuation, and factors the model does not capture.
We do not pretend the model is perfect. A 0.72 correlation is strong by financial forecasting standards (most published models in academic literature report 0.3-0.5), but it means roughly 1 in 4 forecasts will miss by more than 1 rating point. We publish our accuracy metrics because transparency about limitations is more valuable than marketing claims about precision.
The trained weights are stored as protected artifacts with SHA-256 hash verification. Any unauthorized modification -- even a single byte change -- triggers an alert. This is not paranoia; it is production hygiene. A weight file silently corrupted by a bad merge or an accidental overwrite would produce subtly wrong forecasts that might not be caught for days.
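The integrity check itself is a few lines of standard-library Python. A minimal sketch of the startup verification, using in-memory bytes in place of the real weight files:

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest of an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, expected_hash: str) -> bool:
    """Startup integrity check: even a one-byte change alters the digest."""
    return digest(data) == expected_hash

weights = b"\x00\x01weights-v1"           # stand-in for a trained weight file
expected = digest(weights)                 # recorded when the artifact is blessed

ok = verify_artifact(weights, expected)            # True: untouched artifact
tampered = verify_artifact(weights + b"\x00", expected)  # False: one byte appended
```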
Production Deployment
The platform runs on Railway, a cloud deployment service that provides always-on hosting with automatic restarts. The production stack is Python-based: 28+ modules handling everything from data ingestion to forecast generation to Telegram delivery.
The daily cycle operates on two schedules:
Morning (7 AM ET): Fresh DataBento bars from the prior session become available (DataBento has a roughly 24-hour embargo on historical data). The system fetches the new bars, rebuilds the latest session, extracts features, runs the model, and generates a morning forecast. This is the most accurate forecast because it incorporates the most recent data.
Evening (6 PM ET): Market data (VIX, term structure, economic calendar) is refreshed for the upcoming session. An evening forecast is generated using the latest available features. During high-volatility weeks, the evening forecast may underestimate by 2-3 rating points because it does not yet have the current session's bar data -- the morning run corrects this.
Both ES and NQ get independent forecasts. The underlying model architecture is identical, but each product has its own trained weights, its own sessions cache, and its own feature history. NQ tends to produce higher volatility ratings than ES on the same day because NQ (Nasdaq futures) is inherently more volatile than ES (S&P futures).
Forecasts are delivered via Telegram to subscribers, published to a web dashboard, and logged for ongoing accuracy tracking. A 30-day rolling validation report runs continuously, comparing forecasted ratings against realized volatility to monitor model drift.
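The drift monitor described above amounts to a rolling correlation between forecast and outcome. A sketch with synthetic series (the 0.8 scaling and noise level are arbitrary, chosen only to produce a plausible signal):

```python
import numpy as np
import pandas as pd

def rolling_accuracy(forecast: pd.Series, realized: pd.Series,
                     window: int = 30) -> pd.Series:
    """30-day rolling correlation between forecast and realized volatility.
    A sustained drop below the historical baseline signals model drift."""
    return forecast.rolling(window).corr(realized)

rng = np.random.default_rng(1)
realized = pd.Series(rng.normal(size=120))
forecast = realized * 0.8 + rng.normal(scale=0.4, size=120)  # noisy forecast

drift = rolling_accuracy(forecast, realized)
# The first 29 values are NaN (incomplete window); the rest track accuracy
</imports_placeholder>```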
What We Learned Building It
Building a production forecasting system taught lessons that no textbook covers:
Data quality is 80% of the work. The models and features took weeks. The data pipeline -- handling contract rollovers, detecting gaps, managing timezone conversions, validating bar counts, reconciling multiple data sources -- took months. Every data anomaly we found (and we found dozens: interleaved contract codes, forward-filled bars masking gaps, a year-2020 contract code bug that dropped 251 trading days including the COVID crash) was discovered not by error messages but by asking "why is this number different from what we expected?" Curiosity about anomalies is the most important quality in data engineering.
Simplicity survives. The first version of the model had 120+ features, ensemble methods, and neural network components. The production model has 42 features and a linear combination with optimized weights. It outperforms the complex version because every additional feature is a potential source of overfitting, every ensemble method is a potential source of bugs, and every layer of complexity is a potential source of silent failure. We pruned ruthlessly and the model improved.
Protect your artifacts. Trained model weights, rated session caches, and calibrated parameters are the product. We learned this the hard way: an accidental weight overwrite during a routine code change produced wrong forecasts for an entire day before anyone noticed. Now every protected artifact has a SHA-256 hash, every modification requires explicit permission, and automated checks verify integrity at every system startup.
Transparency beats marketing. We publish our accuracy metrics, our methodology, our limitations, and our data sources. This makes us vulnerable to criticism -- anyone can point out that our model misses 25% of the time. But it also builds trust, because the traders who use our platform know exactly what they are getting. No black box, no "proprietary AI" hand-waving, no accuracy claims without evidence. The data is the argument.
This article is for educational purposes only and does not constitute trading or financial advice. Always do your own analysis and manage your own risk.