The Problem with Historical Data

Final alternative data essay of the summer, exploring the plight of data scientists and mathematicians everywhere when correlation matrices break down. Given the number of geopolitical and economic paradigm shifts underway, and given the new kinds of data becoming available for risk analysis, this is far from an academic exercise.

See the ESSAY on LinkedIn HERE. Full content appears below.


As August holidays approach, it is difficult to avoid the sense that a paradigm shift is approaching in the autumn. Multiple macro shifts are poised to hit high gear in September, from monetary policy and climate/energy policy to digital currency policy. Policy activities in these areas will operate amid an unsettled macro environment that includes U.S. mid-term elections, leadership changes in England and Italy, and of course an intensifying war in Ukraine.


Many are bracing for market volatility when autumn arrives, just as firms seek to position their trading books for year-end profit and loss statements. Basis risks will increase across multiple dimensions. Headline risks regarding monetary policy, climate finance/transition policy, renewable energy, and digital currency policy are all expressed verbally, while market participants must measure and price risks quantitatively.


Market reactions to headline risk reflect uncertainty regarding potential future outcomes. The uncertainty is expressed through rapid, discontinuous asset repricing. In other words, during periods of volatility, historical correlations break down, rendering historical data of limited utility to capital markets.
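To make the point concrete, here is a minimal sketch (simulated returns and a hypothetical volatility shock, not any specific market episode) showing how a rolling correlation that looks stable in calm markets can collapse once the regime changes:

import numpy as np
import pandas as pd

# Illustrative only: two return series that share a common driver in a
# "calm" regime, then decouple after a hypothetical volatility shock.
rng = np.random.default_rng(42)
n_calm, n_shock = 250, 60

common = rng.normal(0, 0.01, n_calm)
calm_a = common + rng.normal(0, 0.003, n_calm)
calm_b = common + rng.normal(0, 0.003, n_calm)

shock_a = rng.normal(0, 0.03, n_shock)   # repricing: independent,
shock_b = rng.normal(0, 0.03, n_shock)   # higher-variance moves

returns = pd.DataFrame({
    "asset_a": np.concatenate([calm_a, shock_a]),
    "asset_b": np.concatenate([calm_b, shock_b]),
})

# 60-day rolling correlation: high and stable in the calm regime,
# collapsing toward zero once the regime changes.
rolling_corr = returns["asset_a"].rolling(60).corr(returns["asset_b"])
print(rolling_corr.iloc[100], rolling_corr.iloc[-1])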


Fortunately, some alternative data is standing in the wings, ready to support strategic decisions for professionals who do their best work amid dynamically shifting environments (portfolio managers, particularly volatility traders, and advocates), those seeking to maximize opportunities (particularly thematic investors), and those responsible for helping investors make sense of a situation (particularly research analysts, strategists, and chief economists).


The Historical Data Conundrum



Markets demand multiple years of historical data for a simple, rational purpose. Long time series provide the foundation for spotting repeatable patterns. Those patterns can be internal to a dataset, or the dataset can be combined with other datasets to spot previously under-appreciated relationships (e.g., correlations, covariances). The point is to make better data-driven decisions about likely outcomes.


Machine learning (ML) and artificial intelligence (AI) capabilities amplify the capacity to spot patterns and anomalies at scale, driving the recent frenzy among financial firms to acquire alternative data. Advanced computing and the growing availability of exhaust data thus drove the first phase of alternative data expansion. Increasingly, ML/AI capabilities powered by platforms like Snowflake mean that no two firms will take the same path even through the same dataset.


But as this essay series has been discussing throughout 2022, the market for data continues to evolve rapidly. An individual new dataset may generate unique insights to support a specific investment thesis, but the informational advantage decays with time as more firms acquire the dataset. Firms therefore need to acquire large datasets as well as a multiplicity of datasets.


At its core, industrial-strength pattern/anomaly detection depends crucially on a key underlying assumption: that the past provides perspective into the future. Validating datasets thus requires that firms have the capacity to identify which correlations they seek as a litmus test for reliability. Advanced firms offload that analysis to automated processes.
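As a rough illustration of what such an automated litmus test might look like (the alignment logic, expected sign, and correlation threshold below are assumptions for the sketch, not an industry standard), a firm might screen a candidate dataset against a benchmark series:

import pandas as pd

def passes_litmus_test(candidate: pd.Series,
                       benchmark: pd.Series,
                       expected_sign: int = 1,
                       min_abs_corr: float = 0.3) -> bool:
    # Align the two series on their shared date index and drop missing days.
    aligned = pd.concat([candidate, benchmark], axis=1, join="inner").dropna()
    corr = aligned.iloc[:, 0].corr(aligned.iloc[:, 1])
    # Pass only if the correlation shows the expected sign and minimum strength.
    return bool(corr * expected_sign > 0 and abs(corr) >= min_abs_corr)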

The system works well for single-channel data (e.g., emissions or foot/parking lot activity in front of a retail establishment). When consumer patterns shifted to online buying, analogous metrics (like website views, mobile app downloads, and in-app activity) were crafted.

But what happens when new situations arise for which no ready metric exists?
And what happens when the next generation of data becomes available for which no previous history exists to provide benchmarks and validation parameters?

In many ways, these are familiar challenges to capital markets. It is no secret that correlation matrices break down when individuals (economic actors, policymakers) begin making decisions in different ways. The economically optimal or desirable outcome may not be available or chosen. Identifying opportunity and risk in this context can become exceedingly challenging.


The Data Validation Conundrum

When a paradigm shift arrives, those clinging to historical data are caught flat-footed.

We are facing just that kind of moment now in the capital markets.

COVID-19, the climate transition, the war in Ukraine, digital currency markets, and the current inflation/labor market structural shifts are rendering many historical datasets of limited utility.

Paradigm shifts are not limited to the global macroeconomic context. The data economy is poised to provide markets with entirely new analytical frameworks by delivering metrics that have never existed before.



Consider data generated from language.


First forays into this space have focused on familiar sentiment analysis, driven by technology giants' need to understand consumer sentiment and capital markets' need to understand investor sentiment. But a broader universe awaits beyond sentiment analysis.



Our own patented process provides a good example. We generate data that the world has never seen before by measuring what political scientists and lobbyists have known for centuries: that momentum in public policy reflects a daily multivariate reaction function between what policymakers say and do and what journalists report.
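For intuition only, and emphatically not the patented PolicyScope methodology, a naive lead/lag correlation between two hypothetical daily count series (policymaker activity and media coverage) hints at what a "reaction function" between the two might look like:

import pandas as pd

def lead_lag_profile(policymaker_activity: pd.Series,
                     media_coverage: pd.Series,
                     max_lag_days: int = 5) -> pd.Series:
    # Correlate daily policymaker-activity counts with media-coverage counts
    # shifted by each lag; positive lags test whether coverage tends to
    # follow policy activity, negative lags the reverse.
    profile = {
        lag: policymaker_activity.corr(media_coverage.shift(lag))
        for lag in range(-max_lag_days, max_lag_days + 1)
    }
    return pd.Series(profile)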


Backtesting public policy data is particularly challenging because the underlying activity being measured (policy formation processes) is by definition designed to create a break in the time series rather than deliver a mean reversion.

Policy processes at their core have an uneasy relationship with history and precedent because the purpose of public policy is to change the status quo.
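A stylized example (simulated data, not PolicyScope output) shows why this matters for backtesting: once a deliberate policy change shifts the level of an activity series, a mean calibrated on the full history stops being a useful anchor:

import numpy as np
import pandas as pd

# Illustrative only: a simulated activity series whose level jumps when a
# hypothetical policy change takes effect, rather than reverting to its mean.
rng = np.random.default_rng(0)
before = rng.normal(loc=10, scale=2, size=200)   # pre-change regime
after = rng.normal(loc=18, scale=2, size=100)    # post-change regime
series = pd.Series(np.concatenate([before, after]))

# A model calibrated on the full history expects reversion toward this level...
full_sample_mean = series.mean()

# ...but the post-break regime settles at a different level and stays there.
post_break_mean = series.iloc[200:].mean()
print(round(full_sample_mean, 1), round(post_break_mean, 1))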

Quantitative data generated from the public policy process follows its own identifiable rhythm, which can vary from issue to issue. It is path-dependent, if not always linear. More importantly, policy activity triggers market reaction functions. Markets know this instinctively, because they automatically acquire and analyze institutional news feeds.



But to date their data intake has been one-dimensional (the news feed). We are honored to partner with two legendary news organizations, Dow Jones and Bloomberg, because the news flow is a crucial element of policy data. But it is only one dimension of the policy process. Our data provides visibility into the rest of the process.

Institutional news feed customers and Bloomberg Terminal customers start seeing the policy process in technicolor when they use our quantitative multivariate time series data alongside these news market leaders.

Our initial backtests from the first two years of data generated by the patented PolicyScope process illustrate the kinds of informational advantages that accrue to firms that adopt this approach today. They are transitioning to a new analytical framework.


The Data Transition

Financial markets are accustomed to managing the basis risk between verbal and quantitative triggers for risk pricing.

--They hire subject-matter experts and top-quality strategic analysts to help them decode the policy cycle into the language of fundamental equity analysis and price arbitrage opportunities.

--They acquire institutional news feeds.

--They apply sophisticated natural language processing capabilities to glean automated insights regarding analyst sentiment and issuer sentiment.

--They invest in sophisticated computing technology to help strategists and portfolio managers connect the dots faster and better.



But the 24/7 news cycle turbo-charged by social media and advanced computing capabilities ironically creates information overload for strategic analysts and advocates alike even with these sophisticated tools.


Increasingly, these human readers are seeing that the language is the data and that measuring momentum delivers significant triage opportunities, even before we start using our curated language data to train ML/AI decision intelligence tools. Both layers of patented data (language and charts) are available immediately to Bloomberg Terminal users via {APPS PLCY <GO>}.

Objective, volume-based signals of public policy volatility help investors and advocates identify inflection points well in advance of the news cycle. They can identify with pinpoint precision which specific policy issues are gaining traction on any given day, accelerating their capacity to take strategic action from Wall Street to the halls of their local legislature and regulator.
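For intuition only (again, not the patented methodology), a naive volume-based inflection flag might look something like this, where the rolling window and threshold are illustrative assumptions:

import pandas as pd

def flag_inflection_points(daily_volume: pd.Series,
                           window: int = 30,
                           z_threshold: float = 2.0) -> pd.Series:
    # Compare each day's policy-activity volume to its trailing norm and
    # flag days that spike well above it.
    trailing_mean = daily_volume.rolling(window).mean()
    trailing_std = daily_volume.rolling(window).std()
    z_score = (daily_volume - trailing_mean) / trailing_std
    return z_score > z_threshold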

The point is NOT to ignore historical data. The point is to choose carefully WHICH benchmarks to backtest against AND to align time horizons.

The patterns present in consumer activity, consumer sentiment, Twitter or Reddit sentiment, and tick-by-tick market activity are only useful as training data for those contexts.


New activities newly visible via advanced technology, for which no historical data exists, require analysts, strategists, and the people programming ML/AI explorations to take out a blank sheet of paper and start thinking for themselves. The only thing worse than blindly relying on historical data amid a paradigm shift is trying to shoehorn new datasets into tired, old analytical frameworks. The new class of data requires data scientists, analysts, and advocates to ask a new set of questions, including:

  • How do you evaluate a data series for which no history exists?

  • Do you attempt to backfill?

  • How do you compensate for the observation effect?

These are the questions that appear at the innovation frontier, and we address them daily at my company.


Mindlessly, robotically testing new datasets against the same time horizon, without considering when and how a dataset delivers value, produces problematic results.

Entirely new datasets generated by machines or generated from language require different analytical approaches. Consider the framework financial firms currently use to measure credit risks. It is different from market risk measurement because the underlying behavior and data are different. It took a decade to settle on an industry standard; I know because I was there. The new class of alternative data requires comparable context-driven analytics to maximize the information and signal value of the data. Hopefully, the availability of ML/AI techniques will shorten considerably the analytical cycle.


BCMstrategy, Inc. generates quantitative momentum and volatility data from the public policy process using patented technology. Our PolicyScope data helps portfolio managers, investment analysts, global macro strategists, and advocates anticipate market volatility and connect the dots faster. Charts and underlying language data (in PDF form) are available to human strategists via the Bloomberg Terminal at {APPS PLCY <GO>}.