Sentiment is a key social attribute – we’re talking feelings, opinions, and emotions expressed about brands, products, and personalities – but measurement is not yet a science. The sentiment-analysis solution space lacks standard definitions, metrics, and methods, and many solutions offer only crude, low-grade analytics. The result? Business users are hard-pressed to weigh the many vendor claims they hear, first and foremost surrounding accuracy.
We know for certain that social-media sentiment analysis is a must-do and that accuracy is a must-have. But does your solution provider (or the providers you’re looking at) do sentiment right? How accurate are their findings compared to competitors’? Will they help you make better business decisions and do a better job engaging your customers and social followers? The answers are not so simple, but I’ll take a shot at them by exploring the angles that can help you improve your social strategy and make informed provider choices.
Solution providers ask the same questions, by the way, or at least those wishing to stay ahead of the competition do. This article was spurred by questions from one vendor that hopes to leapfrog rivals, social-analytics start-up Metavana. CMO Romi Mahajan is slated to appear on the Innovators and Innovation panel at the upcoming Sentiment Analysis Symposium, a conference that focuses on the business of sentiment/social/behavioral analytics. Romi’s query –
Do you agree that general accuracy in the social-analytics industry is about 60-65%? Do you have good references?
And here the complications begin. It’s a difficult question, not because it can’t be answered but rather because “accuracy” and “sentiment analysis” mean different things to different people. They’re unfortunately subject to marketer abuse. I’ll explain, although first I’ll do you (and Romi) a favor by answering the question as asked –
It is generally accepted that tools will be 50%-70% “accurate” out of the box, although I’ve heard stories of accuracy as low as 30%-35%.
How to boost tool accuracy
To elaborate (getting a bit into the underlying technology): you can and should expect better than the out-of-the-box accuracy delivered by general-purpose tools. Many vendors credibly claim to hit the 80%-90% range via:
- domain adaptation (for instance, via lexicons, taxonomies, “word nets,” and language rules generated for particular industries),
- algorithm choice and tuning to source type (for short-message sources such as Twitter versus long-message sources such as blogs), and
- strong tool training using machine learning.
Many (but not all) vendors will let you, the business user, customize and train the tool yourself, by importing a taxonomy, using your own training set, or creating your own search patterns and language rules. If you can’t hit at least 80% accuracy, defining accuracy according to your own business needs, find another tool.
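To make the train-it-yourself option concrete, here is a minimal sketch of fitting a crude, Naive-Bayes-style classifier to your own labeled examples. The toy training set and the word-counting model are my illustrative assumptions, not any vendor’s actual pipeline:

```python
from collections import Counter

def train(texts, labels):
    """Count how often each word appears under each label -- a crude,
    trainable model built entirely from your own labeled examples."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Label a new message by which class its words favor."""
    words = text.lower().split()
    pos = sum(counts["positive"][w] for w in words)
    neg = sum(counts["negative"][w] for w in words)
    return "positive" if pos >= neg else "negative"

# A (toy) training set of labeled messages from your own domain.
texts = [
    "love this product works great",
    "fantastic support quick response",
    "terrible quality broke in a week",
    "worst purchase ever",
]
labels = ["positive", "positive", "negative", "negative"]

model = train(texts, labels)
print(predict(model, "great support love it"))  # positive
```

With real data you would hold out a test set and score the model against your own, business-defined labels, which is exactly the “defining accuracy according to your own business needs” test above.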
And do express skepticism of vendors that claim 97%-98% accuracy. Those figures are highly doubtful: two people won’t generally agree on, well, just about anything subjective at such a high rate. Claims of unbelievably high accuracy rates suggest that the tool has been tailored to fit a very specific set of source materials and applications, or that results will be so general as to prove unusable. But even reasonable accuracy claims may be problematic. I’ll present four reasons why.
Challenge #1: Comparable-seeming tools don’t all measure the same thing.
Some do only document-level sentiment (where document = a tweet, e-mail message, online review, article, status update) while others resolve sentiment to the feature level (feature = named entity, topic, concept). A tool that does a good job of measuring the valence (positive, negative, neutral) of a tweet may miss the sentiment displayed toward a particular entity of interest.
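The distinction is easy to see in a hypothetical example (the tweet, the “BrandX” entity, and the scores below are all invented for illustration):

```python
# A single message can carry different sentiment at the document
# and feature (entity/topic) levels.
tweet = "Love my new phone, but the BrandX battery is a disappointment."

# Document-level analysis yields one valence for the whole message.
document_sentiment = "positive"  # the overall tone leans positive

# Feature-level analysis resolves sentiment per named entity or topic.
feature_sentiment = {
    "phone": "positive",
    "BrandX battery": "negative",
}

# A document-level tool would report this tweet as positive and miss
# the negative signal toward BrandX, the entity a brand manager cares about.
assert feature_sentiment["BrandX battery"] != document_sentiment
```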
What do you need to measure to accomplish your own business tasks?
Challenge #2: Tools don’t all use the same measurement scale.
Some give you only a positive/negative/neutral rating while others factor in intensity to score sentiment on a scale, for instance, from -5 to 5. Some rate only valence where others will measure mood (happy, sad, angry, frustrated) or emotion (e.g., http://www.discoveringpeace.com/the-abraham-hicks-emotional-guidance-scale.html). A tool that delivers B-grade emotion classification may be more business-useful than one that offers A-grade valence.
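One practical consequence of the scale choice: a finer-grained score can always be collapsed to a coarser one, but never the reverse. A minimal sketch, where the cutoffs are my arbitrary assumptions rather than any industry standard:

```python
def valence_from_score(score: int) -> str:
    """Collapse a -5..5 intensity score to positive/negative/neutral.

    The cutoffs here are illustrative; every tool draws them differently.
    """
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# An intensity-scoring tool can always report plain valence too...
assert valence_from_score(4) == "positive"
assert valence_from_score(-2) == "negative"
# ...but a valence-only tool cannot recover how positive a +4 message was.
```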
What scale will help you translate sentiment measurements into business decisions?
Challenge #3: Accuracy is, itself, too often ill-defined.
Accuracy in information retrieval is typically measured via an f-score that takes into account both “precision” and “recall.” Precision refers to how well you’ve classified or rated the cases before you, while recall means how many of the eligible cases you’ve actually caught. Most people, in talking about sentiment accuracy, focus almost exclusively on precision. How good is a tool that (to apply the popular metaphor) finds only two needles in a haystack that contains twenty? Actually, such a tool could be quite good, if your business can thrive with two new, high-value customers but doesn’t have the capacity to handle more, or if the cost of acquisition of 100% of prospects in the pool is too high. If you’re working on counterterrorism, two out of twenty is not so good.
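To make the definitions concrete, here is a minimal sketch of precision, recall, and the balanced f-score, applied to the needles-in-a-haystack example; the assumption that everything the tool flagged really was a needle (no false alarms) is mine:

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Of the cases the tool flagged, what fraction were right?"""
    return true_pos / (true_pos + false_pos)

def recall(true_pos: int, false_neg: int) -> float:
    """Of the eligible cases, what fraction did the tool catch?"""
    return true_pos / (true_pos + false_neg)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (the balanced f-score)."""
    return 2 * p * r / (p + r)

# Two needles found in a haystack containing twenty, with no false alarms:
p = precision(true_pos=2, false_pos=0)  # 1.0 -- everything flagged was a needle
r = recall(true_pos=2, false_neg=18)    # 0.1 -- eighteen needles were missed
print(round(f1(p, r), 3))               # 0.182
```

A tool can thus boast perfect precision while its f-score reveals dismal recall, which is why quoting a single undifferentiated “accuracy” number tells you so little.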
I’d argue that, for sentiment analysis, relevance should be factored into accuracy scoring, alongside precision and recall. I mean *contextual* relevance that incorporates timeliness, influence, activities, and lots of other still-fuzzy *social* notions. Analysis results are relevant if they can help me respond to a business challenge.
An emerging form of analysis, intention analysis, can help with relevance. What do people’s statements and sentiment say about their plans? Companies such as Aiaioo Labs, Expert System, and OpenAmplify have focused on going beyond sentiment to intent. I recently profiled an application of another tool, Neurolingo, that distinguishes predictions, feelings, and wishes. This ability to distinguish makes for more relevant sentiment analysis.
Another take on relevance: Not all sentiment has equal value. If a long-time, high-value customer is mildly displeased with your product or service, better to address that case than to spend time on a one-time buyer who’s angry that a product didn’t function the way he expected.
What accuracy measures fit your own business needs?
Challenge #4: Not all inaccuracies have equal business impact.
If you’re looking at aggregate statistics, rating a positive opinion as negative is more consequential than rating a positive or negative opinion as neutral. But if you’re doing engagement, I’d argue that it’s better to find sentiment, that positive or negative rating, than to miss it by wrongly rating a message as neutral.
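The asymmetry can be illustrated with a hypothetical cost weighting; the error counts and per-error weights below are invented to make the point, not measured:

```python
# Hypothetical per-error business costs for two use cases.
# Error types: "flipped" = positive rated negative (or vice versa),
#              "missed"  = positive/negative rated neutral.
COSTS = {
    "aggregate reporting": {"flipped": 2.0, "missed": 1.0},  # a flip skews the average twice as far
    "engagement":          {"flipped": 1.0, "missed": 2.0},  # a miss means no one responds at all
}

def total_cost(use_case: str, flipped: int, missed: int) -> float:
    """Weight a tool's confusion counts by use-case-specific error costs."""
    w = COSTS[use_case]
    return flipped * w["flipped"] + missed * w["missed"]

# The same error counts hurt different use cases differently:
errors = {"flipped": 10, "missed": 30}
print(total_cost("aggregate reporting", **errors))  # 50.0
print(total_cost("engagement", **errors))           # 70.0
```

Two tools with identical raw accuracy can therefore deliver very different business value depending on which errors they tend to make.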
Accuracy beyond market messaging
In sum, sentiment-analysis accuracy is a tempting messaging point for a marketer trying to differentiate his company’s tools. The market does look for uncomplicated, understandable ways to assess tool capabilities. The challenge is that there’s no fixed, objective, industry-standard way to measure accuracy. Be careful how you read accuracy claims: understand what’s being measured and how, and always look beyond the numbers to business impact. Regardless of vendor claims, you need capabilities and performance that match your own, particular business needs. Insist on them, because without business impact, accuracy means nothing.
Seth Grimes is a strategy consultant and industry analyst with Washington DC-based Alta Plana Corporation, founding chair of the Sentiment Analysis Symposium and the Text Analytics Summit, and an InformationWeek contributing editor. Seth consults, writes, and speaks on business intelligence, text mining, data visualization, and their application to meet current-day business challenges. Follow him on Twitter at @sethgrimes.