Why Language (policy) + Data (structured) = Predictive Analytics (Part 2 of 4)


The first post in this series explained why and how Natural Language Processing (NLP) delivers superior mechanisms for analyzing language used in the policy process. You can find that post HERE.

Today’s post focuses on structured data derived from that language.

It is so exciting to be working on the frontier of innovation! One of the most thrilling parts of working with NLP-derived data is that it is entirely new data that no one has ever seen before. We are about to learn new things about the communication process. In the public policy space, where BCMstrategy, Inc. operates, for the first time we will be able to draw direct, visible connections between words and actions (not just sentiment).

This will revolutionize how people think about policy trend projection and political intelligence.

The Data Revolution – A Primer

After The Economist declared that “data is the new oil,” technology gurus recoiled. After all, why would anyone in the digital economy want to be associated with unpopular carbon-based energy that also happens to be a depleting resource? Wired contested the idea, jumping straight to hot-button issues like data privacy and data brokering before concluding that “No, data isn’t the new oil. And it never will be, because the biggest data repositories don’t want it to be.” Leaving aside the rather circular logic, VentureBeat conceded the utility of the analogy, but only with respect to how data is used (e.g., to operate a machine) rather than as a commodity in its own right. TechCrunch seemed to like the idea of data as a commodity asset class, advocating that companies build up a strategic reserve of data in order to power AI-driven analytical engines.

The intensity of the debate and the search for an appropriate analogy illustrate that we are on the cusp of a revolution in perspective with objective, concrete data at its core.

The data revolution provides economic agents with structured information regarding individual habits harvested from devices (e.g., mobile phones, search data), apps, and programs (e.g., Excel spreadsheets, sales data). “Smart” objects generate streams of usage data regarding everything from thermostat levels (Nest), home occupancy (security systems), and driving (cars, even before they become autonomous) to food consumption (smart refrigerators and kitchen appliances) and fitness patterns. These devices transmit data to other computers for storage and analysis, generating autonomous internet usage patterns among inanimate objects (the “Internet of Things” or “IoT”).

Companies obsessed with customer discovery, more effective marketing, product innovation, and supply chain management are avid consumers of this data as they pursue efficiencies and customer satisfaction throughout a product’s life cycle, from the design/build phase to the call center complaint phase. Financial markets also seek to capitalize on the data revolution in order to make smarter investment decisions, design better financial products for consumers, and reach new customers for their savings, securities, insurance, and banking products.

This synopsis illustrates why data is more like wind than oil: it is a renewable, readily available resource that changes with conditions. But not all data is created equal.

Alternative Data

User data generated by electronic devices and by computer chips embedded in consumer products is commonly referred to as “alternative” data because it provides analysts with previously unavailable insight into behavior patterns at an individual level.

However, most “alternative” data collected from smart objects constitutes an evolutionary step beyond, not a fundamental break with, preexisting data collection efforts. Consider:

  • Supply Chain Management: For centuries, businesses and analysts have collected and analyzed data regarding shipments in an effort to find new efficiencies. Attaching an RFID tag to an item or a container is merely a more efficient mechanism for accomplishing the same tracking.

  • Insurance: For over a century, insurance companies have sought insight into a potential insured’s habits (Smoker? How many glasses of wine a week? Exercise? Age? Family medical history? How many car accidents in the last five years?) in order to quote a premium. Fitbits and electronic automobile records merely provide more detailed data, which generates a more accurate foundation for underwriting.

  • Marketing: Stores and manufacturers have long monitored foot traffic patterns, returns patterns, conversations with sales clerks, maintenance/repair patterns, television watching patterns (through Nielsen boxes), and warranty registration questionnaires in order to adjust how they present and market their products. Tracking internet usage, clicks, surfing patterns, and consumption patterns merely increases the depth and speed of access to the same kind of information.

  • Customer Service and Sales: Companies have long monitored and quantified the kinds of customer service requests and complaints received regarding their products. NLP-powered chatbots not only make the customer service (and sales) processes more efficient. They also provide a mechanism to monitor, measure, and analyze the customer service or sales interaction with a level of precision and objectivity previously unavailable.

Let’s leave aside the significant privacy questions that arise from monitoring consumers in this manner. The point is that data has been collected by businesses for decades if not centuries.

Technology empowers companies to shift how they collect, store, and analyze data regarding these interactions. Consequently, this aspect of the data revolution represents only an evolution of time-honored business management methods.

The Data Revolution Frontier – Unstructured Data

If you seek the real data revolution, you must look past the evolutionary shifts and focus on the frontier where words and images (now viewed as “unstructured data”) are being converted into structured data (integers) in order to support analysis and artificial intelligence.

Tech-savvy readers understand well the challenge this translation process presents. Regarding images, artificial intelligence systems continue to have difficulty determining whether a picture shows a cat (NYT link) as opposed to a leopard-print sofa or a bowl of guacamole.

With language, NLP faces comparable challenges in understanding context, meaning, and nuance. Consider this blog post from Towards Data Science describing the programming challenges associated with identifying context correctly, or this 11-minute YouTube video illustrating the challenges of algorithmically discerning the emotion (positive or negative) implied by words. Put simply: this is not easy.

Most efforts to create structured data from language involve counting words. Classic examples include word clouds and sentiment analysis. The approach is rather rudimentary: how many times is a specific word used? If certain positive or negative words appear, the speaker or writer is assumed to hold a correspondingly positive or negative normative view of the issue at hand, and frequency determines the strength of the sentiment.
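
To make the mechanics concrete, here is a minimal Python sketch of that word-counting approach. The positive and negative word lists are invented for illustration, not drawn from any real sentiment lexicon, and the word_counts and naive_sentiment helpers are hypothetical names, not anyone’s production code.

    from collections import Counter
    import re

    # Illustrative word lists, invented for this sketch (not a real sentiment lexicon).
    POSITIVE = {"growth", "stability", "progress", "agreement"}
    NEGATIVE = {"risk", "crisis", "inflation", "uncertainty"}

    def word_counts(text):
        """Count how often each word appears (the basis of a word cloud)."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def naive_sentiment(text):
        """Score sentiment by counting lexicon hits; frequency sets the strength."""
        counts = word_counts(text)
        pos = sum(counts[w] for w in POSITIVE)
        neg = sum(counts[w] for w in NEGATIVE)
        return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

    speech = "Inflation risk remains elevated, but financial stability has improved."
    print(word_counts(speech).most_common(3))   # the most frequent words
    print(naive_sentiment(speech))              # negative score for a routine statement

Even this toy example hints at the problem described next: routine policy vocabulary dominates the counts.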

The approach delivers many false positives. For example, central bank speeches will always register a high incidence of the words “monetary policy” and often (but not always) negative sentiment regarding “inflation.” A financial regulator can be expected to use the words “financial stability” frequently. A data privacy supervisor can be expected to have a high incidence of the words “data privacy” and “data protection.” Counting words tells the strategic analyst nothing interesting.
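
Here is a small illustration of that false-positive problem, using two invented one-sentence “speeches” with opposite policy implications; the excerpts and the phrase_count helper are assumptions for this sketch, not real central bank language.

    import re

    # Two invented excerpts pointing in opposite policy directions.
    hawkish = "Monetary policy must tighten because inflation threatens financial stability."
    dovish = "Monetary policy can ease because inflation no longer threatens financial stability."

    def phrase_count(text, phrase):
        """Count literal occurrences of a phrase, ignoring case."""
        return len(re.findall(re.escape(phrase.lower()), text.lower()))

    for label, speech in (("hawkish", hawkish), ("dovish", dovish)):
        print(label,
              phrase_count(speech, "monetary policy"),
              phrase_count(speech, "inflation"),
              phrase_count(speech, "financial stability"))
    # Both excerpts produce identical counts even though the policy signals diverge.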

The task of identifying positive or negative sentiments is a minefield. Normative values can be elusive. It is true that all people can agree at a high level that certain public policy goals are desirable (e.g., freedom, fairness). But reasonable people can and do hold differing views on whether individual policies are a good idea. Sentiment applied to public policy thus creates a high risk that the programmer’s preferred perspective (or bias) is hardwired into the system from the beginning.

Consider Brexit. Applying sentiment analysis to statements made regar