ALGORITHMIC TRADING INFRASTRUCTURE
May 18, 2026

Most institutional algo teams would generate higher risk-adjusted returns by investing 80% of their infrastructure budget into data cleaning, normalization, and regime-labeling pipelines rather than into lower-latency execution, because the alpha lost to dirty data and misclassified market regimes dwarfs the alpha gained from shaving microseconds off fill times.

Donald Pierre
Founder, Vhalanx Core

Most institutional algo teams are spending their budgets in exactly the wrong place. They pour capital into shaving microseconds off execution while feeding their models data that is riddled with gaps, misaligned timestamps, and regime labels that were stale two volatility shifts ago. The result is a fast system that is precisely wrong.

I have watched this play out repeatedly. A team will spend eighteen months and seven figures building a colocated execution stack that saves them 40 microseconds per fill. Then they deploy a strategy built on top of price data that still has uncorrected stock splits, volume spikes from exchange test prints, and corporate action adjustments that silently broke three months ago. They celebrate the latency win. They never audit the data. And they cannot figure out why their Sharpe ratio in live trading is 0.4 when the backtest showed 1.8.

The thesis is simple. If you run an institutional algo operation, you would generate higher risk-adjusted returns by allocating 80% of your infrastructure budget to data cleaning, normalization, and regime labeling rather than to lower-latency execution. The alpha destroyed by dirty data and misclassified regimes dwarfs whatever edge you gain from faster fills.

This is not a popular position. The industry has a deep bias toward execution infrastructure because latency is measurable, vendible, and impressive in pitch decks. Data quality is none of those things. You cannot show an investor a chart of how clean your OHLCV bars are. But you can show them your rack in Equinix NY5. So the money flows to what is visible, not to what matters.

Let me be specific about what dirty data actually costs you. Consider a mid-frequency systematic equity strategy trading US large caps on a 15-minute signal. If even 0.3% of your adjusted close prices are wrong because of botched corporate action adjustments, your signal generates false entries. Not occasionally. Structurally. Those errors propagate through your return series, corrupt your covariance estimates, and bias your position sizing. The downstream effect is not a 0.3% drag on returns. It is chronic overallocation to positions that your model believes are uncorrelated but are not. I have seen a single bad dividend adjustment on a top-ten holding blow out a portfolio's realized volatility by 200 basis points over a quarter.
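To make the propagation concrete, here is a minimal sketch using only the Python standard library. Everything in it is an illustrative assumption: the two synthetic return series, the shared-factor construction, the missed 2-for-1 split, and the 25% jump threshold in the hypothetical `flag_suspect_jumps` helper. It is not a production corporate-action check, only a demonstration that one unadjusted bar can make two genuinely correlated names look far less related.

```python
import math
import random
import statistics

def simple_returns(prices):
    """Bar-over-bar simple returns."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def to_prices(returns, start=100.0):
    """Compound a return series into a price path."""
    prices = [start]
    for r in returns:
        prices.append(prices[-1] * (1.0 + r))
    return prices

def flag_suspect_jumps(prices, threshold=0.25):
    """Price indices whose one-bar return exceeds the threshold in magnitude.
    The threshold is an assumed value for illustration."""
    return [i + 1 for i, r in enumerate(simple_returns(prices)) if abs(r) > threshold]

# Two synthetic names driven mostly by a shared factor (assumed parameters).
random.seed(7)
common = [random.gauss(0, 0.01) for _ in range(250)]
prices_a = to_prices([c + random.gauss(0, 0.004) for c in common])
prices_b = to_prices([c + random.gauss(0, 0.004) for c in common])

# Simulate a 2-for-1 split in B at price index 125 that the pipeline missed.
split_bar = 125
dirty_b = prices_b[:split_bar] + [p / 2.0 for p in prices_b[split_bar:]]

clean_corr = correlation(simple_returns(prices_a), simple_returns(prices_b))
dirty_corr = correlation(simple_returns(prices_a), simple_returns(dirty_b))
suspects = flag_suspect_jumps(dirty_b)

print(f"correlation on clean prices: {clean_corr:.2f}")
print(f"correlation with one missed split: {dirty_corr:.2f}")
print(f"bars flagged for review: {suspects}")
```

One corrupted bar out of 250 is enough to pull the estimated correlation well below its clean value, which is the direction that leads a sizing model to overallocate. A crude jump screen like this one catches a missed split, but it would not catch a small dividend error, which is part of why these pipelines need dedicated reference data rather than heuristics alone.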

Now layer on the regime problem. Most teams that implement regime detection treat it as a classification task. They use something like a Hidden Markov Model with two or three states, trained on realized volatility or a mix of vol and correlation. The standard Hamilton framework gives you a filtered probability of being in each state at each time step. Clean and elegant. And almost always late.

Here is why. The HMM transition matrix is estimated from historical data. It assumes that the process generating regime switches is stationary. It is not. The mechanism that drives a shift from low vol to crisis was not the same in 2008, in 2020, and in 2023. In 2008, the trigger was levered credit products forcing correlated liquidation. In 2020, it was an exogenous shock compressing the entire transition from calm to panic into eleven trading days. In 2023, it was a slow, grinding regime shift driven by rate expectations repricing over months. A two-state HMM trained on pooled history will smear across all of these and give you a posterior probability that is confidently wrong during exactly the moments when being right matters most.
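The filtered probability mentioned above can be sketched in a few lines. This is a minimal two-state filter with Gaussian emissions, written with the standard library only. The transition matrix, state volatilities, initial probabilities, and the synthetic return path are all hand-picked assumptions for illustration, not estimates from any dataset, and a real implementation would fit them (for example by EM) rather than fix them.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def hamilton_filter(observations, transition, means, stds, initial):
    """Filtered P(state_t = j | obs_1..t) for a Gaussian-emission HMM.

    transition[i][j] is P(state_t = j | state_{t-1} = i).
    """
    n_states = len(initial)
    prev = list(initial)
    filtered = []
    for x in observations:
        # Prediction: push yesterday's probabilities through the transition matrix.
        predicted = [
            sum(prev[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        # Update: weight each state by the likelihood of today's observation.
        joint = [predicted[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(n_states)]
        total = sum(joint)
        prev = [p / total for p in joint]
        filtered.append(prev)
    return filtered

# Synthetic returns: 40 calm bars, then volatility that builds gradually.
calm = [0.004 if t % 2 == 0 else -0.004 for t in range(40)]
ramp_sizes = [0.006, 0.008, 0.010, 0.012, 0.015, 0.020, 0.030, 0.030]
ramp = [s if t % 2 == 0 else -s for t, s in enumerate(ramp_sizes)]
observations = calm + ramp

filtered = hamilton_filter(
    observations,
    transition=[[0.98, 0.02], [0.05, 0.95]],  # assumed, highly persistent states
    means=[0.0, 0.0],
    stds=[0.005, 0.02],                        # state 0 = calm, state 1 = crisis
    initial=[0.9, 0.1],
)
crisis_prob = [row[1] for row in filtered]
first_flag = next(t for t, p in enumerate(crisis_prob) if p > 0.5)

print(f"shift begins at bar 40, crisis probability crosses 0.5 at bar {first_flag}")
```

Because the calm state is assumed to be highly persistent, the filter needs several increasingly large returns before the crisis probability dominates. The size of that lag depends entirely on the assumed parameters, which is the point: parameters pooled from past episodes decide how quickly the next, different episode gets recognized.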

The fix is not a better model. It is better data infrastructure upstream of the model. You need regime labeling pipelines that incorporate multiple orthogonal signals. Realized vol, yes. But also cross-sectional dispersion, term structure slope of implied vol, credit spreads, and funding rates. Each of these captures a different dimension of market state. And each requires its own cleaning and normalization pipeline because the raw data comes from different vendors, at different frequencies, with different error profiles. If you do not invest in building that infrastructure, your regime model is doing sophisticated math on top of unreliable inputs. You are polishing the telescope lens while pointing it at the ground.
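As a rough sketch of what that upstream plumbing involves, the snippet below aligns signals that arrive at different frequencies onto one calendar, normalizes each against its own trailing history, and combines them only when enough inputs are live. The function names, the staleness limit, the window length, and the equal-weight average are all hypothetical choices made for illustration, and dates are plain integer bar indices to keep the example self-contained.

```python
import statistics

def align_forward_fill(dates, series, max_staleness):
    """Map a sparse {date: value} series onto every date, carrying the last
    observation forward for at most max_staleness bars, else None."""
    aligned = []
    last_value, last_seen = None, None
    for d in dates:
        if d in series:
            last_value, last_seen = series[d], d
        if last_seen is not None and d - last_seen <= max_staleness:
            aligned.append(last_value)
        else:
            aligned.append(None)  # too stale to trust, or never observed
    return aligned

def rolling_zscore(values, window):
    """Z-score of each value against the preceding `window` valid values.
    Only past data is used, so there is no lookahead."""
    out, history = [], []
    for v in values:
        if v is None or len(history) < window:
            out.append(None)
        else:
            recent = history[-window:]
            mu = statistics.fmean(recent)
            sd = statistics.pstdev(recent)
            out.append((v - mu) / sd if sd > 0 else None)
        if v is not None:
            history.append(v)
    return out

def composite_stress(zscores_by_signal, min_signals=2):
    """Equal-weight average of available z-scores per bar.
    Returns None for a bar when fewer than min_signals inputs are live."""
    n = len(next(iter(zscores_by_signal.values())))
    composite = []
    for t in range(n):
        live = [zs[t] for zs in zscores_by_signal.values() if zs[t] is not None]
        composite.append(statistics.fmean(live) if len(live) >= min_signals else None)
    return composite

# Example: a daily realized-vol series and a weekly credit-spread series.
dates = list(range(10))
realized_vol = {d: 0.10 + 0.01 * d for d in dates}   # observed every bar
credit_spread = {0: 1.0, 5: 1.2}                     # observed every fifth bar

aligned = {
    "realized_vol": align_forward_fill(dates, realized_vol, max_staleness=1),
    "credit_spread": align_forward_fill(dates, credit_spread, max_staleness=4),
}
zscores = {name: rolling_zscore(vals, window=4) for name, vals in aligned.items()}
print(composite_stress(zscores))
```

The design choice worth noting is that missing or stale inputs become explicit `None` values instead of being silently filled, so a vendor outage shows up as a gap in the composite rather than as a confident but unsupported label.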

The latency argument breaks down further when you examine where most institutional alpha actually lives. If you are a market maker or a stat arb shop operating at sub-second horizons, yes, latency is existential. But for the vast majority of institutional systematic strategies, holding periods range from hours to weeks. At those horizons, the difference between a 50-microsecond fill and a 500-microsecond fill is noise. It does not register against the magnitude of signal decay, transaction cost modeling error, or the kind of catastrophic misallocation that comes from a strategy operating in what it believes is a mean-reverting regime when the market has already transitioned to trending.
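A back-of-envelope calculation shows the scale of that gap. The sketch below assumes a 20% annualized volatility, a 252-day trading year with 6.5-hour sessions, and square-root-of-time scaling of price moves. Those are simplifying assumptions, and square-root scaling in particular is crude at microsecond horizons, where tick size and bid-ask bounce dominate. Treat the output as an order of magnitude, not a measurement.

```python
import math

# Assumed calendar: 252 trading days of 6.5 hours each.
TRADING_SECONDS_PER_DAY = 6.5 * 3600
TRADING_SECONDS_PER_YEAR = 252 * TRADING_SECONDS_PER_DAY

def expected_move_bps(annual_vol, horizon_seconds):
    """One-standard-deviation price move over the horizon, in basis points,
    under square-root-of-time scaling (an assumption, not a market fact)."""
    return annual_vol * math.sqrt(horizon_seconds / TRADING_SECONDS_PER_YEAR) * 1e4

annual_vol = 0.20                    # assumed large-cap volatility
latency_gap_seconds = 500e-6 - 50e-6  # 450 microseconds between the two fills

latency_move = expected_move_bps(annual_vol, latency_gap_seconds)
one_day_move = expected_move_bps(annual_vol, TRADING_SECONDS_PER_DAY)

print(f"typical move during the latency gap: {latency_move:.4f} bps")
print(f"typical move over a one-day hold:    {one_day_move:.1f} bps")
print(f"ratio: {one_day_move / latency_move:,.0f}x")
```

Under these assumptions the price drift attributable to the extra 450 microseconds is on the order of hundredths of a basis point, against roughly 126 basis points of typical movement over a one-day hold. The ratio is in the thousands before any data error or regime mislabel enters the picture.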

I built Vhalanx Core around this conviction. Not because data infrastructure is glamorous. It is the opposite. It is tedious, thankless engineering that nobody wants to fund until the blowup happens. But it is where the actual edge compounds. A clean, well-normalized, correctly labeled data environment makes every downstream model better. It makes your backtest trustworthy. It makes your risk estimates honest. It makes your live performance converge to your simulated performance, which is the only thing that actually matters.

The question I keep returning to is this. If most systematic funds already know their backtests degrade in production, and they know the primary source of that degradation is data and regime misspecification rather than execution speed, why does the budget allocation almost never reflect that knowledge?
