
AI Training Data 101 Guide: Synthetic Data

By BCMstrategy, Inc.

2025 is the year when data buyers and vendors become serious and discerning regarding AI training data. The proliferation of quantitative and language data used to train a range of AI models for expanding use cases may also generate confusion as buyers attempt to distinguish among different kinds of data.


And so we start 2025 by providing an AI Training Data 101 Guide. This series of three posts describes the main kinds of data; this first installment covers synthetic data.



Synthetic Data 101 -- The Basics

Synthetic data is ... not real. It consists of computer-generated artificial data.


Technically, artificial data has long existed in the analog world in the form of "dummy variables" and "test data" as well as estimated outputs from stress tests and scenario analysis. Synthetic data serves the same purpose as its analog cousin. In both cases, the pretend data illuminates for humans a range of potential outcomes that would be impractical to explore without machines. From system failures to the range of outcomes that could unfold from currently known facts, synthetic data adds context and shines a spotlight on potential future vulnerabilities.


The climate and energy transition provides an excellent example of how synthetic data can be used to help humans make better data-driven decisions.

  • In March 2024, NVIDIA released an "Earth Climate Digital Twin" capability that enables users to simulate and visualize potential future shifts in weather and climate with greater computational efficiency and accuracy, based on historical weather patterns.

  • Also in 2024, the team at riskthinking.AI launched a Climate Digital Twin (CDT) that provides capital markets with the ability to connect the dots between climate developments and financial analysis by applying robust, well-accepted stochastic and other forward-looking capital market risk measurements (e.g., value at risk, or VaR) to corporate and climate data.

In both cases, these market leaders are deploying the unparalleled pattern-matching power of AI across a range of datasets in order to deliver next-generation analysis of climate-related physical and financial risks.


The model outputs are "synthetic" in the sense that they are computer-generated, and all the usual caveats regarding scenario analysis apply when interpreting them. It is crucial to understand the underlying assumptions and parameterization. It matters greatly whether the models were designed to deliver linear or exponential outputs (which of course would skew results). It is crucial to know what kind of conditional probabilities and reaction functions were programmed into the models during foundation-level training. These, however, are NOT the risks specific to using synthetic data in the AI context; those are discussed below.


In fact, synthetic data augments robust model testing. Unsupervised learning and agentic AI include embedded processes that defy audit trails and other efforts to see every step in the process. The capacity of AI to mimic the randomness of human decision-making in this way can help deliver more realistic predictive analytics by incorporating illogical or unclear leaps of logic.


Synthetic data facilitates model testing by enabling a human to compare the model's outputs against the results the synthetic data was constructed to produce. It can accelerate determinations about whether or not the AI agent is fit for purpose. It is also important to use synthetic data if the risk exists that an automated AI agent may alter, transform, or reconfigure underlying data during the testing phase. As discussed below, intermingling even appropriately generated synthetic data on a par with volume and velocity data measured in the real world creates real concerns regarding the veracity of subsequent outputs.
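
To make that concrete, here is a minimal sketch of such an evaluation loop. The agent interface, field names, and tolerance are illustrative assumptions rather than any particular product's API; the point is simply that, because the expected answer is baked into each synthetic case, fitness for purpose can be checked without exposing any real-world data to the agent.

```python
# Minimal sketch: checking an AI agent against synthetic test cases whose
# expected outcomes are known in advance. The agent, field names, and
# tolerance are illustrative assumptions, not a specific product's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SyntheticCase:
    features: dict          # computer-generated inputs
    expected_output: float  # the outcome the case was constructed to produce

def fitness_report(agent: Callable[[dict], float],
                   cases: list[SyntheticCase],
                   tolerance: float = 0.05) -> dict:
    """Compare agent outputs against the known answers baked into the synthetic data."""
    failures = []
    for case in cases:
        predicted = agent(case.features)
        if abs(predicted - case.expected_output) > tolerance:
            failures.append((case, predicted))
    return {
        "cases": len(cases),
        "failures": len(failures),
        "fit_for_purpose": len(failures) == 0,
    }

# Toy usage with a trivial "agent":
toy_cases = [SyntheticCase({"x": 1.0}, 2.0), SyntheticCase({"x": 2.0}, 4.0)]
print(fitness_report(lambda f: 2 * f["x"], toy_cases))
```

Because the synthetic cases are expendable, the agent can alter or transform them during testing without contaminating any real-world dataset.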


Additional good reasons for using synthetic data arise in the insurance sector and in any sector that must protect personally identifiable information in order to preserve privacy. Anonymized data on its own is not synthetic; it merely masks identities. But anonymized data can be used as the foundation for creating a large amount of synthetic data at scale in order to estimate possible outcomes for a larger population of similar entities.
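
As a hedged illustration of that scale-up step, the sketch below fits a simple distribution to a handful of anonymized records and samples a much larger synthetic population from it. The column meanings and the Gaussian assumption are purely illustrative; production pipelines use far richer generative models.

```python
# Illustrative sketch: scaling anonymized records into a synthetic population.
# Assumes the anonymized fields are roughly jointly normal, which real
# insurance data generally is not; shown only to make the mechanism concrete.
import numpy as np

rng = np.random.default_rng(seed=0)

# Anonymized records: identities masked, values real (e.g., age, annual claims).
anonymized = np.array([
    [34, 1200.0],
    [45, 2300.0],
    [29,  800.0],
    [52, 3100.0],
])

mean = anonymized.mean(axis=0)
cov = np.cov(anonymized, rowvar=False)

# Sample a much larger synthetic population with the same summary statistics.
synthetic_population = rng.multivariate_normal(mean, cov, size=10_000)

print("synthetic records:", synthetic_population.shape)        # (10000, 2)
print("synthetic mean:   ", synthetic_population.mean(axis=0))  # close to `mean`
```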


However, commingling synthetic and real data or obscuring when synthetic data was used for model training creates real risks. In a world increasingly worried about "deep fakes," and with processing speeds growing faster by the day, the ability to distinguish between real data (e.g., volume and velocity measurements, signal data) and synthetic data can disappear in a millisecond if the datasets have not been appropriately labeled. Those risks potentially increase exponentially in the language data/generative AI context.

We neither generate nor use synthetic data at BCMstrategy, Inc. due to the risks discussed below.

Synthetic Data Used as AI Training Data -- Three Key Risks


Risk One -- Mingled Data: As the data revolution advances, it will become increasingly difficult to detect the difference between synthetic data and real data that measures volume and velocity, as well as legitimate derivative data. Non-benign scenarios include:

  • entities that complete their datasets with manufactured inputs;

  • entities that use scenario analysis outputs as the foundation for strategic decisions and/or training data.


Strong internal controls, data labeling, and testing protocols must protect against the artificial data being commingled with, and treated on a par with, real data.


Data buyers that require veracity and integrity in the datasets they purchase may seek disclosures regarding the use of synthetic data. Buyers prioritizing data integrity may also learn to tolerate some gaps in datasets rather than accept datasets that contain manufactured inputs. In the scenario analysis context, it will become increasingly important to identify clearly which data points have been estimated, which assumptions underpin the estimate, and which data points were used as inputs for the scenario analysis.
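
One minimal way to operationalize that kind of labeling and disclosure is to attach a provenance tag and an assumptions note to every data point, so synthetic or estimated values can be filtered out or audited downstream. The field names and three-way labeling scheme below are assumptions for illustration, not an industry standard.

```python
# Minimal provenance-labeling sketch. Field names and the three-way
# provenance scheme are illustrative assumptions, not a standard.
from dataclasses import dataclass
from typing import Literal

Provenance = Literal["measured", "estimated", "synthetic"]

@dataclass(frozen=True)
class DataPoint:
    value: float
    provenance: Provenance
    assumptions: str = ""   # for estimated/synthetic points, record how they were produced

def real_only(points: list[DataPoint]) -> list[DataPoint]:
    """Keep only data measured in the real world, e.g. before model training."""
    return [p for p in points if p.provenance == "measured"]

dataset = [
    DataPoint(101.2, "measured"),
    DataPoint( 99.8, "synthetic", "sampled from a fitted distribution"),
    DataPoint(100.5, "estimated", "interpolated between two observed readings"),
]
print(len(real_only(dataset)))  # 1
```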


Risk Two -- Tangled Training: Both quantitative AI processes and language-based generative AI processes "learn" from input data. Models trained on synthetic data may need to "unlearn" the path towards the correct answer when they ingest real-world data rather than synthetic data.


Restarting the learning process with a "clean" or original process that uses the same code but has never interacted with synthetic data may not be the answer either. Models that deliver the expected outcome when using synthetic data may not deliver the same outcome when processing real-world data.
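
A simple way to surface that divergence, sketched below under assumed interfaces and an arbitrary threshold, is to score the same model against both a synthetic holdout and a real-world holdout and flag any large gap between the two.

```python
# Sketch: detect divergence between a model's behavior on synthetic data and
# on real-world data. The metric, threshold, and model interface are assumed.
from typing import Callable, Sequence, Tuple

def mean_abs_error(model: Callable[[float], float],
                   data: Sequence[Tuple[float, float]]) -> float:
    return sum(abs(model(x) - y) for x, y in data) / len(data)

def passes_divergence_check(model, synthetic_holdout, real_holdout, max_gap=0.1) -> bool:
    """Flag models that look fine on synthetic data but drift on real-world data."""
    gap = abs(mean_abs_error(model, real_holdout) - mean_abs_error(model, synthetic_holdout))
    return gap <= max_gap

# Toy usage: a model fit to a clean synthetic relationship y = 2x ...
model = lambda x: 2 * x
synthetic_holdout = [(1.0, 2.0), (2.0, 4.0)]
real_holdout = [(1.0, 2.4), (2.0, 3.1)]   # messier, real-world-like observations
print(passes_divergence_check(model, synthetic_holdout, real_holdout))  # False: re-test or retrain
```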


The real-world consequences of failing to address this tangle of training first on synthetic and then on real-world data are significant. Some scenarios to consider:


  • An AI agent trained on synthetic data regarding carbon emissions levels in one jurisdiction incorrectly assumes that all entities in all jurisdictions follow the same emissions reduction protocols. It advises policymakers to set carbon taxes, emissions allowances, and other policies with the sole goal of equalizing outcomes across jurisdictions. When the AI system ingests real-world data from jurisdictions where carbon emissions are lower or higher than in the synthetic datasets, it throws exception errors.


  • A health insurance company trains an AI agent on synthetic data in order to identify the optimal premium or standard of care for a particular condition. Having been trained on perfect synthetic data, the AI agent incorrectly or inappropriately identifies anomalies or throws exceptions when confronted by real-world data.


  • An AI agent trained on synthetic data regarding historical stock market prices and news headlines draws conclusions about causation as well as correlation. But when deployed into the real-world context with a more heterogeneous set of data inputs, it misallocates capital because it misprices the underlying risk.



Risk Three -- Mangled Language: In many ways, this is the most pernicious of the risks.


Today's leading generative AI applications were trained on vast swathes of the language available on the internet, often without respecting copyrights. They were trained on the assumption that all human language in all contexts is the same. The training data was not synthetic, but it was also not context specific.


Many times, synthetic language data may read more smoothly than real-world language data. But that does not make it more accurate or appropriate for the use case; it may actually degrade the signal value of the language.


Market participants are thus moving quickly towards Small Language Models with well-defined lexicons and robust ontologies specific to individual contexts. This shift towards targeted, specialized language eliminates the potential need for synthetic language data to supplement data inputs.


The risk horizon here focuses on using synthetic, computer-generated language as training data for generative AI solutions. The mismatch should be intuitively obvious. If the purpose of generative AI is to deliver answers in a syntax and lexicon that humans can understand, using language generated by AI processes as training data undermines the foundational veracity and accuracy of the training process. Human language will always be messier than a machine's effort to mimic human lexical interaction.


Less benign examples exist, of course. The most malicious examples involve AI trained on targeted synthetic language data for the purpose of delivering outputs that encourage humans to take inappropriate actions (e.g., complete a financial transaction, vote in a particular manner).


Computer-generated language is already a reality. The problem is not computer-generated language. The problem is using that language as training data for downstream generative AI applications. It is therefore increasingly important to distinguish between (i) appropriate chatbot outputs based on robust underlying real human communications and (ii) inappropriately using those legitimate outputs to undertake further language model training.


Using AI processes to generate synthetic generative AI training data may save companies the fees associated with respecting copyrights, but it will NOT serve as a solid foundation for generative AI training. The risk of creating a turbo-charged echo chamber is exceedingly high, even if the hallucination risk can be contained and managed. In addition, humans convey a large amount of information through non-verbal channels both in writing and in the spoken word. It is far from clear that AI systems can detect those non-sentiment cues much less replicate them for purposes of generating copyright-compliant training data.


Conclusion

The data and AI industries continue to evolve quickly. We have no doubt that robust risk management processes can be crafted to mitigate, manage, and contain (if not eliminate) the risks identified above regarding synthetic data. We also fully expect that savvy data buyers that care about data governance, integrity, and veracity will increasingly begin to ask pointed questions about whether and how synthetic data is being used by current and potential future data vendors.


At BCMstrategy, Inc., we ingest official sector language and, with appropriate data mining licenses, fact-checked journalism language that moves markets and changes people's lives. We have deep reverence for the integrity of the input data. We do not use synthetic data to train our processes. We do not include synthetic data in our datasets. And we strongly advise our clients to avoid using synthetic data, particularly with respect to language training.

 

BCMstrategy, Inc. generates language-derived training data to support predictive analytics and generative AI applications focused on public policy using award-winning, patented technology. We currently generate quantitative and language datasets in three main verticals: global macro (monetary policy, trade policy, reserve currency/de-dollarization policy), climate/energy policy (including renewables and carbon emissions reduction initiatives), and digital currency policy (including asset tokenization, crypto, CBDC, blockchain and stablecoins). We also deploy the data to supply dashboards and automated research assistants powered by generative AI.


(c) 2024 BCMstrategy, Inc.
