The first post in this series explained why and how Natural Language Processing (NLP) delivers superior mechanisms for analyzing language used in the policy process. You can find that post HERE.
Today’s post focuses on structured data derived from that language.
It is so exciting to be working on the frontier of innovation! One of the most thrilling parts of working with NLP-derived data is that it is entirely new data that no one has ever seen before. We are about to learn new things about the communication process. In the public policy space, where BCMstrategy, Inc. operates, for the first time we will be able to draw direct, visible connections between words and actions (not just sentiment).
This will revolutionize how people think about policy trend projection and political intelligence.
The Data Revolution – A Primer
After The Economist declared that “data is the new oil,” technology gurus recoiled. After all, why would anyone in the digital economy want to be associated with unpopular carbon-based energy that also happens to be a depleting resource? Wired contested the idea and jumped right to hot-button issues like data privacy and data brokering before concluding that “No, data isn’t the new oil. And it never will be, because the biggest data repositories don’t want it to be.” Ignoring the rather circular logic, Venture Beat conceded the utility of the analogy but only with respect to how data is used (e.g., to operate a machine) rather than as a commodity in its own right. TechCrunch seemed to like the idea of data as a commodity asset class, advocating that companies build up a strategic reserve of data in order to power AI-driven analytical engines.
The intensity of the debate and the search for an appropriate analogy illustrate well that we are on the cusp of a revolution in perspective with objective, concrete data at its core.
The data revolution provides economic agents with structured information regarding individual habits harvested from devices (e.g., mobile phones, search data), apps, and programs (e.g., Excel spreadsheets, sales data). “Smart” objects generate streams of usage data regarding everything from thermostat levels (Nest), home occupancy (security systems), and driving (cars, even before they become autonomous) to food consumption (smart refrigerators and kitchen appliances) and fitness patterns. These devices transmit data to other computers for storage and analysis, generating autonomous internet usage patterns among inanimate objects (the “Internet of Things” or “IoT”).
Companies obsessed with customer discovery, more effective marketing, product innovation, and supply chain management are avid consumers of this data as they seek to support enhanced efficiencies as well as customer satisfaction throughout the life cycle of a product from the design/build phase to the call center complaint phase. Financial markets also seek to capitalize on the data revolution in order to make smarter investment decisions as well as to design better financial products for consumers and reach new customers for their savings, securities, insurance, and banking products.
This synopsis illustrates well why data is more like wind than oil. It is a renewable, readily available resource which changes with various conditions. But not all data is created equal.
User data generated by electronic devices and electronic computer chips embedded in consumer devices is commonly referred to as “alternative” data because it provides analysts with previously unavailable insight into behavior patterns at an individual level.
However, most “alternative” data collected from smart objects constitutes an evolutionary step from, not a fundamental break with, preexisting data collection efforts. Consider:
Supply Chain Management: Businesses and analysts for centuries have collected and analyzed data regarding shipments in an effort to find new efficiencies. Attaching an RFID tag to an item or a container is merely a more efficient mechanism to accomplish the same tracking.
Insurance: Insurance companies for over a century have sought insight into a potential insured’s habits (Smoker? How many glasses of wine a week? Exercise? Age? Family medical history? How many car accidents in the last five years?) in order to quote a premium. Fitbits and electronic automobile records merely provide more detailed data which generate a more accurate foundation for underwriting.
Marketing: Stores and manufacturers have long monitored foot traffic patterns, returns patterns, conversations with sales clerks, maintenance/repair patterns, television watching patterns (through Nielsen boxes), and warranty registration questionnaires in order to adjust how they present and market their products. Tracking internet usage, clicks, surfing patterns, and consumption patterns merely accelerates the depth and speed of access to the same kind of information.
Customer Service and Sales: Companies have long monitored and quantified the kinds of customer service requests and complaints received regarding their products. NLP-powered chatbots not only make the customer service (and sales) processes more efficient but also provide a mechanism to monitor, measure, and analyze the customer service or sales interaction with a level of precision and objectivity previously unavailable.
Let’s leave aside the significant privacy questions that arise from monitoring consumers in this manner. The point is that data has been collected by businesses for decades if not centuries.
Technology empowers companies to shift how they collect, store, and analyze data regarding these interactions. Consequently, this aspect of the data revolution represents only an evolution of time-honored business management methods.
The Data Revolution Frontier -- Unstructured Data
If you seek the real data revolution, you must look past the evolutionary shifts and focus on the frontier where words and images (now viewed as “unstructured data”) are being converted into structured data (integers) in order to support analysis and artificial intelligence.
Tech-savvy readers understand well the challenge this translation process presents. Regarding images, artificial intelligence systems continue to experience difficulty determining when a picture shows a cat (NYT link) as opposed to a leopard print sofa or a bowl of guacamole.
Within language, NLP faces comparable challenges in understanding context, meaning, and nuance. Consider this blog post from Towards Data Science describing the programming challenges associated with identifying context correctly or this 11-minute YouTube video illustrating the challenges associated with algorithmically discerning emotion (positive or negative) implied by words. Put simply: this is not easy.
Most efforts to create structured data from language involve counting words. Classic examples include word clouds and sentiment analysis. It is rather rudimentary. How many times is a specific word used? If certain positive or negative words are used, then the speaker or writer must have a positive or negative normative view of the issue at hand and frequency determines the strength of the sentiment.
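To make the limitation concrete, here is a minimal sketch of count-based sentiment scoring. The two word lists and the sample sentence are invented purely for illustration (they are not drawn from any real lexicon or from our process), and real sentiment engines are more elaborate. The point is only that a frequency count turns a neutral, descriptive sentence into a "negative" score.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons, invented for this illustration only.
POSITIVE_WORDS = {"growth", "stable", "improve", "strong"}
NEGATIVE_WORDS = {"inflation", "risk", "crisis", "weak"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_sentiment(text):
    """Return (score, word_counts), where score = positive hits - negative hits."""
    counts = Counter(tokenize(text))
    positive_hits = sum(counts[word] for word in POSITIVE_WORDS)
    negative_hits = sum(counts[word] for word in NEGATIVE_WORDS)
    return positive_hits - negative_hits, counts


# An invented sentence of the kind a central banker might say.
speech = (
    "Monetary policy must address inflation. "
    "Inflation expectations remain stable."
)

score, counts = naive_sentiment(speech)
print("sentiment score:", score)
print("most frequent word:", counts.most_common(1)[0])
```

The sentence simply describes the speaker's job, yet the counter scores it as net negative because "inflation" appears twice and "stable" only once.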
The approach delivers many false positives. For example, central bank speeches will always register a high incidence of the words “monetary policy” and often (but not always) negative sentiment regarding “inflation.” A financial regulator can be expected to use frequently the words “financial stability.” A data privacy supervisor can be expected to have a high incidence of the words “data privacy” and “data protection.” Counting words tells the strategic analyst nothing interesting.
The task of identifying positive or negative sentiments is a minefield. Normative values can be elusive. It is true that all people can agree at a high level that certain public policy goals are desirable (e.g., freedom, fairness). But reasonable people can and do hold differing views on whether individual policies are a good idea. Sentiment applied to public policy thus creates a high risk that the programmer’s preferred perspective (or bias) is hardwired into the system from the beginning.
Consider Brexit. Applying sentiment analysis to statements made regarding Brexit only provides a megaphone to the loudest or most prolific contributors to the debate. Identifying sentiment accurately amid high levels of sarcasm and specific idiomatic uses of language in England presents additional challenges.
For example, check out this 30-second brilliant Brexit explainer from BBC Scotland. It is guaranteed to explode every NLP sentiment analysis engine on the planet.
It is easy to see why so much effort has been directed towards automated NLP systems that support customer service and sales functions. Algorithms currently can anticipate the next word in highly stylized interactions where the language options are limited either to product options (color, size) and functionality or to articulating often binary emotions (positive/negative) about the product. When they perform these functions, they generate an alternative data trail which itself generates additional insight about communication patterns.
Expanded interaction with smart devices may also generate shifts in human language. Consider the voice interactions with Siri, Alexa, or any smart car. Effective communication with the smart device requires human adaptation in order to generate the appropriate action by the device. Often, this means starting with or emphasizing the verb (call, order) or specific interrogatories (how much, when, where).
Communication on social media platforms, particularly Twitter, is also changing how people communicate by encouraging brevity and reliance on hashtags and emojis.
The Data Frontier
The data frontier lies beyond these customer service and sales functions. The data frontier exists where entirely new kinds of data are being generated through the human/machine interface.
Consider the data created by our patented process. Our process uses metadata tagging and NLP to identify the action type embedded in the language of public policy and assign an integer to various actions. No one has systematically quantified the language of policy before.
Compare the patterns presented just last week regarding the U.S. ratification dramas surrounding the United States-Mexico-Canada free trade agreement (USMCA). Volume levels by themselves only tell part of the story.
It may seem counterintuitive to see declining volumes of action and leaks amid much rhetorical grandstanding from policymakers regarding the trade deal. Interpreting the data requires understanding not only the aggregate volume, but also its distribution across different activity types and the underlying details.
In this example, the week started off with a single action: Mexico passed labor market reforms. Lack of labor law reform had served as a significant roadblock to U.S. ratification of the U.S./Mexico/Canada free trade agreement (USMCA). The rest of the week was filled with a reaction function as policymakers in Washington DC reacted by pivoting towards the next set of issues important for ratification.
Interpreted properly, the chart above thus illustrates that the most important development of the week occurred at the beginning. It also illustrates the challenge of applying classic data expectations to the policy context: "big" data regarding public policy will be very different (and lower in volume) than in the retail consumer web context.
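The difference between reading aggregate volume and reading the distribution across activity types can be sketched in a few lines. The records below are toy values invented for illustration, not the data behind the chart, and the labels ("action", "leak", "rhetoric") and function names are assumptions chosen for this example rather than the categories our platform uses.

```python
from collections import Counter

# Hypothetical (day, activity_type) records, invented for this illustration.
events = [
    ("Mon", "action"),
    ("Tue", "rhetoric"), ("Tue", "rhetoric"), ("Tue", "leak"),
    ("Wed", "rhetoric"), ("Wed", "rhetoric"), ("Wed", "rhetoric"),
]


def volume_by_day(records):
    """Count all activity per day, ignoring what kind of activity it was."""
    return Counter(day for day, _ in records)


def distribution_by_day(records):
    """Count activity per day, broken out by activity type."""
    breakdown = {}
    for day, activity_type in records:
        breakdown.setdefault(day, Counter())[activity_type] += 1
    return breakdown


volume = volume_by_day(events)
distribution = distribution_by_day(events)

for day in ("Mon", "Tue", "Wed"):
    print(day, "total:", volume[day], "by type:", dict(distribution[day]))
```

In this toy week, Monday has the lowest total yet contains the only concrete action. A reader looking at volume alone would rank it as the least eventful day.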
Quantifying the language of policy provides opportunities to assess objectively and transparently the trajectory policymakers are taking on any given day. These are very early days, of course. It will take quite some time before we have sufficient data to start charting trajectories algorithmically with the data. But we are building our data lake daily and automatically with this goal in mind.
What Comes Next
We are not the only company creating innovative uses for language quantification (although we do have the only patent for quantifying the language of public policy). We are not the only company asking hard questions and experimenting with the information value of quantified language.
A universe of additional analytical and computational challenges is only just opening up for exploration. Key questions include:
How much data is sufficient to achieve statistically significant observations for testing hypotheses and for charting relationships within the data?
How can we verify that observed correlations, covariances, and divergences in the data are correct? Collecting historical data to generate backtests will provide some of the answers, of course. This likely needs to be undertaken before deploying unsupervised learning models with no audit trail.
Then there is the NLP version of the Heisenberg Uncertainty Principle. How can we assess whether communication patterns are changing due to the increased scrutiny available from computational linguistics?
We are looking forward to sorting through these and other issues with our counterparts on the innovation frontier in NLP.