Why You Shouldn't Trust Automated Sentiment Scoring
by Jason Falls

Automated sentiment scoring has become the feature du jour of social media monitoring platforms of late. So many services were offering it that even big players like Radian6 had to bring it to the table for fear of losing prospects. It's not enough to tell brands how many conversations are being had. Social media monitoring services now must also report on whether or not people like us.

A few months ago, we discussed automated sentiment scoring with Jeffrey Caitlin, CEO of Lexalytics, a leading natural language processing firm that serves as the engine behind many monitoring services' offerings. His assertion was that while natural language processing can get you close to scoring the sentiment and tone of a given piece of content, teaching computers to recognize sarcasm, false positives and the like is a significant challenge. Looking at sentiment scoring across a large data set gives you a more accurate view, but it is still an estimate, not an exact measure.
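
To make the sarcasm problem concrete, here is a toy, lexicon-counting scorer. It is purely illustrative (a sketch of the naive approach, not Lexalytics' actual engine, and the word lists and function are my own invention), and it happily reads a sarcastic complaint as glowing praise:

    # Toy lexicon-based sentiment scorer (illustrative only, not any vendor's real engine).
    # It simply counts "positive" and "negative" words and has no concept of sarcasm or context.
    POSITIVE = {"great", "love", "awesome", "thanks"}
    NEGATIVE = {"hate", "broken", "terrible", "worst"}

    def naive_sentiment(text):
        words = text.lower().replace(",", " ").replace(".", " ").split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    # A sarcastic complaint about a dead console scores as praise:
    print(naive_sentiment("Oh great, thanks Xbox, I love the red rings on my console."))
    # -> "positive"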

So many believe that sentiment and tone scores from social media monitoring services are pretty good, but not perfect. Still, it's the accepted mechanism for knowing whether or not people like you. Human analysis is better, but more expensive and potentially cost-prohibitive, particularly for small and medium businesses. So automated scoring is accepted as an industry feature and we're all happy, right?

Not so fast. Scott Marticke of Sentiment360 reached out to me after my assertion earlier this month that human deciphering of information is something the monitoring services don't offer. He let me know he had some surprising data comparing automated sentiment scoring with human analysis that I might like to see. Sure, he wanted to make sure I knew about Sentiment360, which adds a layer of human analysis to social media monitoring, but it was the disparity between the analyses that struck me in our chat.

In a recent comparison for CBS Television on conversations around the show NCIS, Sentiment360 found as much as a 50 percent swing in sentiment, or lack thereof, between machine and human analysis. Essentially, once you let humans analyze the data, the machine-produced results are crap.

Sentiment360's human vs. machine sentiment analysis

Here’s a run-down of their comparison:

  • Sentiment360 used an unnamed social media monitoring service to collect the data. (They’re tool agnostic and say they use several different ones depending upon the client need. Companies they report as part of their arsenal include Radian6 and ScoutLabs.)
  • The results showed 50,000 conversations around the show in a given month with the search performed in March of this year.
  • According to the service's automated scoring, of the 50,000 conversations, 84 percent were neutral or passive mentions, 11 percent were positive and 5 percent were negative. NCIS is talked about a lot, and more positively than negatively, but the vast majority of the conversations don't hold an identifiable opinion.
  • Sentiment360 pulled a sample of 3,000 of those conversations and had their analysts go to work. So the results show just the human analysis of the sample, but the sample is six percent of the total data set, far better than most market research firms offer.

And here’s what Sentiment360’s analysis found. Some of these numbers astounded me:

  • 23 percent of the entries were irrelevant. They mentioned NCIS or linked to the show, but contained no other qualifying information about the show or were spam sites.
  • Once the irrelevant entries were removed, only 30 percent of the entries reviewed were found to be neutral or passive, a 54-point difference from the machine analysis.
  • Human analysis found that 63 percent of the online conversation around NCIS was positive, not 11 percent as the machine asserted.

Looking at the comparison, a couple of thoughts came to mind. In using several monitoring solutions, I've noticed that a great deal of the automated scores I see are passive or neutral, which is largely useless to a brand. I've also noticed an awful lot of irrelevant posts appearing in searches, almost regardless of how finely you tune your keyword queries. If human analysis shows that 23 percent of the results are irrelevant and that more than half of the passive/neutral results can actually be scored, then automated scoring needs to get a LOT better.
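
For the arithmetic-minded, here is a quick back-of-the-envelope recap of those figures. This is my own illustrative sketch of the published numbers, not Sentiment360's methodology:

    # Recap of the NCIS comparison (illustrative arithmetic only, not Sentiment360's method).
    # Human percentages are of the 3,000-post sample after irrelevant entries were removed.
    machine = {"neutral": 0.84, "positive": 0.11, "negative": 0.05}
    human = {"neutral": 0.30, "positive": 0.63}
    irrelevant_share = 0.23  # entries the analysts threw out as spam or off-topic

    neutral_gap = (machine["neutral"] - human["neutral"]) * 100
    positive_gap = (human["positive"] - machine["positive"]) * 100
    print("Neutral gap: %.0f points" % neutral_gap)    # ~54 points
    print("Positive gap: %.0f points" % positive_gap)  # ~52 points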

Caitlin was right. Automated sentiment scoring can only get you so far, and if this experiment is representative of what would happen with your brand, I'd say automated sentiment scoring doesn't get us very far at all.

Don't get me wrong: I'm not saying the people behind automated scoring aren't working hard or making a difficult task easier to accomplish. I am saying, however, that we need to be clear that letting a machine supply us with this particular piece of marketing intelligence is flawed. It's not that we shouldn't do it, but that if we do, we must understand the limitations and prioritize the intelligence accordingly.

You can certainly question the human analysis. Sentiment360 uses analysts from the Philippines to provide their analysis. Outsourcing overseas is cheaper and allows them to offer human analysis at a much lower price point than the big research firms. But they claim all their analysts are educated at the graduate or post-graduate level, and they point out that the Philippines is the third-largest English-speaking country in the world. It was also once a U.S. colony, so there's a faint cultural commonality, too.

But what you can't really question is who's believing the results. Saatchi & Saatchi's New York operation just named Sentiment360 as their preferred social media listening provider, with VP for Digital Strategy Shel Kimen saying, “Sentiment360 demonstrated that their combination of machine listening and human analysis provided us with excellent intelligence. We had looked at a number of their competitors and Sentiment360 excelled in quality of the analysis, ROI and delivery time.”

Not bad for a firm that only came into existence in December.

Sentiment360 is going to run you around $7,000 per month, so it's still cost-prohibitive for small and medium businesses. But more importantly, they are helping us all see that the machines are good but not great (or, depending upon your perspective, not really all that good at all) and that natural language processing has a long way to go.

I don't see Sentiment360 as a competitor for many social media monitoring services because of their price point. They'll hobnob with the major brands and do well, but for companies that need to keep monitoring under $2,000 per month (which is the majority of companies), Sentiment360 doesn't fit.

What I do see them doing, however, is forcing companies like Lexalytics, or even the core social media monitoring services, to either improve their algorithms faster or add a layer of human analysis on top of what they offer.

Have you or your company conducted similar experiments with machine vs. human analysis? How about service vs. service analysis? If so, please share your results or thoughts in the comments. If not, go try it and report back. It will make the industry better as a whole.


About the Author

Jason Falls
Jason Falls is the founder of Social Media Explorer and one of the most notable and outspoken voices in the social media marketing industry. He is a noted marketing keynote speaker, author of two books and unapologetic bourbon aficionado. He can also be found at JasonFalls.com.

  • Facundo

    Great post Jason. It really shed some light for me in relation to the conversations not holding an “identifiable opinion”. We use mainly Radian6 and indeed that happens. As you say, it’s probably a matter of acknowledging the limitations and conveying that to the clients. Certainly that term they’ve coined is going to help me :)

  • I agree with the other comments on the fact that this is a very insightful and honest post based on true and realistic data.
    Thank you very much for sharing.

  • trfitzgibbon

    Thanks for the insightful post, Jason. It is a great start to a discussion of the many problems there are with human and machine Sentiment Analysis.

    To back up what some have said here, especially Zoe (zoeDisco), I question the value of measuring sentiment even if said measurement were perfect. In fact, rather than “human vs. machine?”, I think the real question should be “measure vs. discover”.

    Users and developers of text analytics started jumping on the sentiment bandwagon a few years ago before anyone really had a good solution or even a good use in mind. But, it was an attractive idea, so interest grew on both sides.

    However, interest alone does not produce results. The main reason that everyone has such a hard time with Sentiment Analysis is that we are trying to assign rigid categories and hard numbers to something that is, by nature and intent, very subjective. So, should we be trying to measure it at all?

    “Sentiment” is only one particular type of meaning that can be communicated through text; it is the “how” people are talking instead of the “what”. But, there is no inherent reason why we should single out the “how”. What I hear most from users of text analytics applied to social media is that they simply want to know what people are saying. Sentiment Analysis is one possible solution to that problem, but certainly not the only one. A better one would be to roll the “how” and the “what” up into one type of analysis.

    The human being to whom we are reporting our results is, hands down, the BEST judge of sentiment in the world because it is her domain and her judgement of sentiment that matters most. Thus, I think the solution is to deliver themes discovered by machines, what we at Networked Insights call “Discovery Insights”, whose sentiment (as well as other types of meaning) is self-evident. That way, the consumer of the Insight can “measure” or judge the sentiment however she sees fit.

    In other words, remove Sentiment Analysis from your mind for a second and start from scratch. If the machine were to summarize the conversation around, let's say, laptop X with topics similar to the following, would you really need a sentiment number?

    “no camera” – 10% of posts
    “battery problems” – 5% of posts
    “recommended it” – 20% of posts

    Though we’ll continue providing traditional sentiment analysis, at Networked Insights we think our Discovery Insights provide a better approach to understanding the conversation. And, the advancements to Discovery Insights that we are working on now (more along the lines of sentiment-evident Insights) will be far more fruitful than pursuing incremental improvements in our ability to measure positive, negative and neutral sentiment.

    – T.R. Fitz-Gibbon

  • Thanks for this wonderfully honest post. We've used many buzz monitoring systems and none of them do SA well. For all of our initial audience research projects with a new client, we add a significant (and costly) layer of human analysis. Most of our clients are willing to pay for this on a quarterly basis on a limited scale (i.e., not all keywords, but for some of the most important questions the monitoring is meant to answer) because they see pretty quickly how misleading the results can be without it (all it takes is one or two dives into the results).

    I agree with Margaret that the cost of doing human analysis can quickly get out of control, but I disagree that the trade-off is insignificant, especially when the data collected is being used to make decisions about product. Take the example you gave about NCIS. If the machine had said that 11% of the conversation about NCIS was negative and it was really 63%, that would be a pretty big miss with some heavy potential consequences.

    There are ways to mitigate some of the problems that lead to too many irrelevant records (which can be a big factor in skewing the sentiment). And there are also ways to approach human analysis which reduce the scope of the human piece. Cost/benefit is a tricky balance for sure, but at the end of the day, I wouldn't make important decisions based on buzz monitoring data that had not been scrubbed and analyzed by humans, at least to some extent.

  • To reply to some of those posters who question the Sentiment360 $7K pricing and what is received for that price: $7K is the high end of our pricing; our reporting starts as low as $2K per month for detailed reporting. $7K allows for a very deep dive with daily monitoring for up to four brands. We also provide the opportunity to adjust the parameters of the search at any time. Our analysts can search for trends, connections between diverse and disparate conversations, cross-referencing and more. They can also gauge sentiment on variables such as videos, imagery, audio, etc. An analogy might be that of automobiles and horses in 1900. Sure, cars were fun and new, but they were unreliable and the roads were bad. No one depended solely on them for reliable transportation. A few years down the road and it was a different story…we'll get there but we aren't there yet.

  • Hi Jason,

    Fantastic analysis! I've had my share of hand-coding. At NetBase, we built our semantic analysis algorithms by first “walking 100 miles” in our customers' shoes. We felt the pain of having to do so much reading and not having enough time to read it all. So I think what you're going to find with the analysis in our upcoming tool is higher precision at detecting not only the positives and negatives but the reasons for the sentiment. I just posted about this today and I included some examples of our analysis. My post is at http://netbase.com/blog/?p=104.

    Michael

  • Jason: The best conversations are always taking place on this blog. I want to make a couple of observations about sentiment and scoring at scale, which I imagine apply to many vendors beyond just Scout Labs.

    1) There is no magic bullet for sentiment scoring, human or machine. In our mass tests (with companies like Dolores Labs) we haven't been able to get humans to agree with each other more than about 85% of the time. And machines are not as good as people: they can't recognize all the potential patterns in the English language, nor all permutations of vocabulary. I think automated sentiment scoring will likely incrementally improve over time as the science improves, but it's never going to be a 100% kind of science. Here's post I wrote about it: http://www.scoutlabs.com/2009/02/26/how-does-se

    2) People like automated sentiment scoring because there isn't any cost effective way to do it with humans. Especially when you get beyond basic brand monitoring. Who wants to pay $7K/month for one word? Imagine you're a big company, with dozens of brands, products, competitors. You've got industry issues, consumer segments, ingredients, partners, celebrity spokespeople- you're likely to want to look at sentiment for dozens of entities. Who can take the time, the expense, the overhead of having all that done by humans? Most companies we've talked to will make a small tradeoff in the ability to score by custom guidelines to get fast, actionable, real time sentiment analysis.

    3) Scale is where this is going. We, and likely every other vendor in this space, are seeing social media data proliferate at an astonishing rate. And it's snowballing. If you're Netflix or Coca-Cola or Wii or even Jason Falls, the tens of thousands of comments and observations and RTs and so forth about you are getting so big that without automated tools, you aren't going to be able to see the forest for the trees.

    The job of platforms like Scout Labs is to take as much of the grunt work out of social media analysis as possible, so that humans, whether agencies or employees, can focus on driving the strategies needed to grow their business. http://www.scoutlabs.com/2009/02/26/how-does-se

  • matt14

    Would be interesting to see results if Sentiment360 ran a human-to-human experiment as a control – it's easy to show humans and machines don't grade the same, but historically getting a room full of human reviewers to grade anything (positive/negative, relevance, etc.) similarly is also an issue. A further test would be their offshore graders vs. watchers of the show. Lots of variables here that make the human-machine measures less impactful.
    Second, in the interest of helping the IR/Text Analysis community, Sentiment360 should release their test set complete with sentiment rankings (machine and human) as a test set similar to how TREC is used in information retrieval tasks. The community could use more baseline measures to compare across techniques. Of course these results would be subject to my first point too.
    Matt S

  • marksysomos

    Jason,

    Much like social media ROI, the accuracy and relevancy of sentiment is getting more attention as people demand more from monitoring services. And while sentiment technology is not perfect, it is still early days, and improvements are happening on a steady basis, so it will become more accurate.

    In the meantime, the strength of sentiment technology is its ability to filter through millions of conversations so people have a better idea of what those conversations are – good, bad or neutral. Without sentiment technology, there's no way people could do this amount of filtering.

    At the same time, there is a place for people when it does come to sentiment by giving them the ability to easily and manually make adjustments and changes. This makes overall sentiment scores better and more accurate, while giving people the ability to take into account things such as nuances, sarcasm, etc.

    Truth be told, sentiment technology isn't perfect but it will get better and more accurate.

    Mark

    Mark Evans
    Director of Communications
    Sysomos Inc.

  • KDPaine

    Great post and great discussion Jason. And Sean, thanks for the shout out. As both of you know, I have long regarded automated sentiment analysis tools as the worst thing to happen to communications in a long time. The notion that we should base strategic business decisions on systems that are only about 65% accurate makes me nuts, and it reflects a professional laziness that only reinforces all the negative stereotypes of marketing and PR.
    But I also know how to make these systems better. You use humans, properly trained to achieve a minimum of 90% intercoder reliability scores, to teach computers how to interpret sentiment. You do this on a client-by-client basis — since obviously the interpretation of positive and negative can vary dramatically between a toilet paper company and an oil company. You then test and tweak the system until the accuracy of the automated sentiment tool meets or beats the reliability and accuracy of the humans. That's what SAS SMA has done and we've helped them do it. So is there an automated sentiment system I can love? Yes.
    On the other hand, I don't recommend it for everyone. As you so brilliantly pointed out, the majority of stuff that shows up in social media monitoring systems is truly irrelevant. Sure we may get 5000 “items” for a client, but after you delete the 2000 coupons, the 2000 items that have nothing to do with the client, you're left with a very manageable 1000 items that can easily be coded in about 25 hours by trained humans. With an emphasis on trained. Humans make mistakes, and even our humans at KDPaine & Partners make mistakes but we have systems to catch and fix those mistakes, just as a good automated sentiment analysis system does. But you have to start with strict coding instructions, a supervisor that essentially re-reads everything for the first month or two, and don't trust anyone until they get to 90% accuracy. (And by the way, we do it in America, creating jobs in the north country of New Hampshire, not by exploiting low-wage workers overseas).
    Now let's talk about the “humans are too expensive” argument. Our typical measurement program averages about $2,000-$3,000 a month, for which you get a report that actually interprets the data and provides some insight as to how you might improve. It's called research, and that represents about 10% of a $200,000-$300,000 PR/Social Media/Marketing program. Don't you want to spend 10% to find out whether it's worth spending the other 90%?

  • Jason,

    I help distribute and train people over here in Mexico on SMM and more specifically Radian6. I must say your post is brilliant and right to the point. I encounter a lot of people over here sold on automated sentiment as the holy grail of SMM analysis. It's amazing how many people lose sight of the fact that you are analyzing human interactions, which are subjective and contextual.
    I agree with previous comments that automated sentiment can be used as an eagle-eye indicator, but we must strive to tell people that without real human analysis they are just wasting time.

  • Good article Jason. My one caution is against lumping “sentiment analysis” into one activity; as they say, there are many ways to skin a cat. Most sentiment analysis tools are just “counting words” and measuring their proximity to brand mentions. This statistical approach has an awful lot of problems because humans don't always use words in the same way or to mean the same thing. If someone says that a product is “bad ass” you run the risk of that being interpreted as two negative words in close proximity to the brand. Hence the importance of a real sentence parser that understands what sentences mean. This is why ConsumerBase is so groundbreaking: we're not measuring sentiment, we're measuring emotional involvement with brands based on the real conversation that is happening in the marketplace.

    Have you had a demo of our product yet Jason? if not ping me and let's get one set up, cheers.

  • Jason, apparently, you also can get cost-effective human coding from @KDPaine's outfit…

    NLP is still young, but it is interesting. Media content analysis is such a big business that it was a matter of time before some smart person applied it to social media. I'm not a fan of the automated version, as I wrote last year http://bit.ly/ae7OZY+ — but at National City, we did see improvements in the automated tone scores as we made the manual adjustments. The system did learn from our efforts. It still couldn't grasp subtleties, but it worked well enough for our purposes. People are still best.

  • Jason,

    Just to set the record straight: we believe that sentiment is the tip of the iceberg of the intelligence that can be gleaned by smart analysis. The true value is in the details and, to the point of many of the posts, in the ability to use the information to create a strong strategy. To that end, our Sentiment360 product, combined with a strong marketing department and/or agency, is where the benefit lies.

  • Jason, I appreciate the analysis of the automated sentiment scoring pros and cons. One of the things that I look at is the “emotion” behind the sentiment – I don't believe automated systems (and possibly non-native-language sentiment scoring) can effectively determine the sentiment of “snarky,” tongue-in-cheek, slang, or facetious comments. For example, a Tweet I read recently was something like, “Thanks, Xbox, for the beautiful red rings on my console.” While I might be inclined to interpret this as a positive sentiment – new graphics on the Xbox exterior? – my son would interpret this (correctly) to mean someone was not happy that their Xbox was no longer functioning (aka the “red ring of death”). Fortunately, the brands I deal with are generally manageable by human sentiment analysis, but it's good to know that there are solutions when my brands get large enough to need them.

  • I've really only found Sentiment to be valuable when it is:
    1. Applied to a segment of conversation, rather than one brand/keyword
    2. Used as a leading indicator, comparing trends over time, and
    3. Backed up by action. What good is monitoring sentiment, if you have no plan to change it?

  • JenZingsheim

    Jason, great post and thanks for bringing this up. CustomScoop gets this question quite often (about automated sentiment analysis) and thus far, we have stayed away from automated sentiment for many of the reasons you outline above.

    I think the main question comes down to, once again, what you are looking to do with the data and information you collect. How big is your data set? If you are looking for broad brush strokes over a very large pile of information, then automated sentiment probably makes sense, especially financially.

    But if you are looking to find out what people think of your redesigned widget, or introducing a new product to a niche audience, accuracy matters–and false positives will taint your data far more in a small data set than it will a larger one. Yes, even humans will differ on sentiment, but to me this argument doesn't hold much water. Our analysts know that “wicked” in the Northeast means something different than it does elsewhere–and if they are ever stumped by a new term/phrase, they can suss it out somehow, whether by determining the context in which the phrase was used, looking at other content by the same author, or going to Urban Dictionary. I have yet to see a machine with that level of research skill, or free will for that matter. ;-)

    I don't think there's a right way and a wrong way on this–just different levels of comfort in trusting a machine, and what you plan on doing with the data.

  • As someone who worked for a brand whose detractors were often very snarky and sarcastic, automatic sentiment SUCKS. Ultimately, sifting through by human – whether contracted out or a team of hungry and underpaid interns – is the best way.

    • JenZingsheim

      George, this comment made me LOL!

  • Good post Jason.

    I have done a lot of research in sentiment scoring, using dozens of tools over the past five years. This simply drives the human vs. machine point home.

    A simple test for any automated scoring is to unleash it on the tonality and reference points it cannot index: the human relationship and perspective. If I know you personally, whether through frequent reading or personal interaction, the quality of the words changes dramatically.

    In regards to one of your points about “conversation spam” – there is almost no way to decipher the credibility of relevant or meaningful interactions without being able to identify the originator viewpoint.

    A true spam conversation saying “it's crap” could be disregarded.

    A comment from an industry leader saying “it's crap” could be a death sentence.

    In perspective, sites like Digg, Stumbleupon, and even Twitter do a better job of classifying a specific topic.

  • These social media monitoring companies will improve in time, as will their sentiment scoring analysis.

    In the meantime, I'm sticking with a Netvibes setup of selected and tweaked RSS feeds and other tools developed over the course of four or five years so far.

    A five minute skim daily and I can see what is being said about the college I work for on most of the visible Web. What isn't detected is seemingly obscure. That doesn't make those items worthless, but it does mean they aren't easily detected by the world at large, either.

    In addition, this approach is budget-friendly :)

    If someone is willing to pay $7k a month to human-score mentions for their company, let me know ;)

  • Certainly some interesting points raised in the post and comments. Since 2007 our tools, services and philosophy have been deeply rooted in the “human advantage”; however, there are a number of factors that have led to vendors (ourselves included) moving in the direction of automation. Real-time monitoring has played a persuasive role in cautiously bringing it to the SMM table.

    Some folks in past discussions on the topic of sentiment analysis/scoring have raised excellent points on the limitations of using a three-category scoring (positive/neutral/negative). We can certainly throw a fourth – a “mixed” value – into the mix for incidents carrying a positive stream until the topic of “outrageous costs” lands in the post or the comment stream. But is that enough? What about eliminating “false” values with a “no-doubt” value? Is that even possible when we're talking about a blog which continues to publish comments months or years after the original post? How about walled gardens that allow only small groups of people to carry the conversation in a “private” group discussion? Do we dismiss these and only focus on incidents that stand a chance of showing up in a vanity search?

    And what about the handling of a post? Do we keep a bucket handy and run a calculation on the number of positive/negative incidents and arrive at a ratio, even though a number of the positive posts were repetitive and the strongest, most convincing points in the comments happened to be carrying a negative stream?

    Do we come up with a 10-point grading scale, and start assigning values to the title, the post, and the comments and/or handling? I can show you more than just a few example incidents where the title pretty much deals a knockout blow, leaving no doubt about sentiment. And others with sentiment nested at the halfway mark of an audio track or video – or an image conveying the worst kind of brand assault which couldn't possibly be scored accurately unless a person read every single incident. To say nothing of the way these types of incidents deliberately mask themselves to choke up content discovery efforts.

    Again, I concur on the advantages of human review, and would remark that automated sentiment still serves some usefulness as a windsock to gauge direction. However when speaking in terms of interpreting risk/context, I think it is vitally important that the analyst be knee deep in the data/incident repository, with a depth of understanding from a sourcing and historical perspective, to really be able to find the most use and answer the “what next” question.

    For all these reasons (and more), this notion of comparing tools using sample data, while interesting, doesn't serve sentiment analysis in a way that can be applied uniformly and qualitatively. Until a report can accurately depict the complex relationships between the data, the analyst(s) and the duty/requirements, we overlook the benefits that truly reveal the chasm of difference between machines and humans.

    Joseph
    @RepuTrack

  • laurenvargas

    As always, you know how to spark a brilliant conversation! You have people talking about why and how people should be analyzing data and not relying on a magic button to spit out results. Automated sentiment is just a launch pad for human analysis. Why are you sorting the sentiment as you do? What is the context of the data being sorted? This analysis is tedious. So, answer why you need to do sentiment analysis before checking off that item on your SMM tool review checklist as a feature you must have.

    Lauren Vargas
    Community Manager at Radian6
    @VargasL

  • michmski

    Really interesting post, Jason. I'd go even a couple of steps further to say that social media monitoring shouldn't just “report on whether or not they like us”, but also WHY people do/don't like a brand or one of its products.

    Synthesio, for example, insists on having a layer of human analysis, either using our teams or our clients', because we provide reports and alerts along with our monitoring to explain the various analyses. In terms of sentiment, for example, we look at the overall article's sentiment but also at the different tags within it. For example, one article might praise one feature while saying how much the author hates another feature of the same product.
    Christine makes a great point, too, that even human sentiment scoring can have its problems. Whoever is doing the scoring needs to be well-briefed on the company and task at hand.

    I think the new question is going to be – how do you like your social media monitoring? We all want analyses as quickly as possible, but where do you draw the line of accuracy/intricacy vs. speed?

    @Zoe – thanks for the link, I’m going to check it out
    @Hugh – Absolutely agree that sentiment analysis is only one feature of social media monitoring. Now brands can use it to carry out market research and actually listen for insights about where to go in the future. We have one client, for example, that is listening to HOW people use a certain pharmaceutical.
    @Jason – I’d love to chat more ;)
    @All – great comments, I’m looking forward to some more

    Best, Michelle @Synthesio

  • Great call Jason! Many of the people we've interviewed for our book, including folks at Radian6 and Alterian, observe that the ability of a solution to produce accurate sentiment is more an art than a science, and that the core issues are 1) the language and 2) the use of ever-changing colloquialisms. Not sure if / when we'll be able to take humans out of the process.

    I'm off to check out the Sentiment360 people! Thanks!

  • jeffespo

    Jason – this is a great post. One of my larger issues when selecting a new monitoring tool was the auto sentiment. There is a lot of time spent grading mentions, and while there are a large number of posts that fall into the neutral category, it is not as large as the auto-sentiment dictates.

    When testing out the new vendors to make a decision (I posted here: http://bit.ly/drnNYm), the test pitted the auto sentiment against the human tagging we did in our old tool, which was R6 (great tool, but we changed vendors). I didn't care about the info found; it was just about the accuracy of the sentiment, as we were looking to pull in a SIM score against our competition.

    Some were close, others not so much, but the computer couldn't match the human accuracy. At the end of the day we selected Scout Labs, who were the best of the vendors we looked at in terms of AI accuracy, but we still need to do manual work.

    Great insight as usual Jason.

  • Hi Jason:

    I'm going to stick my $0.02 in here too. Sentiment analysis (like most things) seems easy until you really do it.

    Regarding the studies above, it strikes me as unusual that 70% of the messages (in the Sentiment360 analysis) express emotions. We have been doing this work across all categories for 7 years and that is much higher than we typically see.

    It isn't that neutral messages are irrelevant or don't tell you anything. Placed in context (category, competitors, etc.), the percentage of neutral messages is indeed an interesting data point. Some brands (Toyota) are mostly rational while others (Ford) are much more emotional. There are implications to this.

    The answer is neither machine alone, nor human alone. You need smart machines guided by smart people (context, category and project dependent linguistic models for scoring.)

    Tom O'Brien
    MotiveQuest LLC
    @tomob

  • richardbagnall

    Hi Jason,

    An interesting post and it's great to have a commentator with as much authority and knowledge in this field draw attention to this issue.

    Your post validates Metrica's approach entirely in that we use automated analysis as a starting point only. Our team of London based international media analysts who have been analysing media clips since our founding in 1993 then check a significant portion of the coverage to ensure the accuracy of our data.

    Consequently we are able to provide tailored metrics that reflect our clients' business objectives and blend this information in with mainstream media findings too. On top of all of this we provide consultancy, reporting and alerts – all other services that are too often overlooked as companies seek to understand what is being said about them online.

    All best, Richard

  • Hi Jason,

    Interesting post and always such a hot topic. I don't think you'll be surprised by me saying this post could very easily have been titled “Why You Shouldn't Trust Human Sentiment Scoring.” While both humans and computers could do a portion of the job of sentiment scoring, and should work together to maximize results, we know for a fact that computers don't change their minds.

    If an automated system reads the same material 100 times, it will score it the same 100 times. This is perfect for processing huge amounts of data – it can reach the higher accuracy numbers the more it has to work with. If your project requires that every single document be scored for sentiment then you will want to layer that with human analysis.

    Perhaps you let automated analysis strip out the neutral and let humans focus on the edges of sentiment. The BBC is currently using our system to do just that to monitor the UK elections. The journalist is making his opinion available by following the sentiment trends on the highly positive and negative sides of the discussion and getting to the analysis faster by using our automated system.

    On the other end of the scale, Thomson Reuters uses it fully automated to help with algorithmic trading systems. No humans intervene at all as it processes a pipe full of content and needs to make quick decisions.

    Since your post focuses on social media monitoring I couldn't agree more that humans should play a role along the way, however companies like ours let those humans do their job a bit faster by providing automated tools to process and strip out the noise. It all depends on how much data you have, the source of the data and what your end goal is before you can say whether it is for you or not. Companies should always ask for a proof of concept, or do a trial of the tools. It's probably the best way to determine which avenue you should take.

    Best Regards,
    Christine
    @christinelexa
    Lexalytics

  • richardbrown2000

    Excellent post. We have been manually monitoring sentiment across the web, and resisting the blandishments of the autobots, for the past five years. We see context as being the biggest problem for the machines. Semantic analysis is all well and good, and OK for single posts, but once you get into conversations the challenge starts. What do you do with a single post in a discussion thread perhaps hundreds of posts long that just says “Me too!” unless you can read the thread context?

    Perhaps more importantly, your point about actionable information is spot on.

  • I'm having one of those mornings when I hate you, Jason. Because struggling with the accuracy (or lack thereof) of automated sentiment analysis makes me grumpy, and you've gone and poked the sore spot. Good post.

    As for Scott's question of who benefits from sentiment analysis: if your activity for a client is largely mitigating customer frustrations because their product is poorly understood and misapplied, it's actually a decent gauge of how well you're educating their customers out of their aggravation. As we learned from Kathy Sierra way back in the day, “Frustrated Users = People Who Won't Buy Your Product Next Time.” It's a passably good metric for online community management work, and for consumer education work, in short.

  • Hi Jason,

    Great post. I'm always interested to hear the varying points of view on this topic – it's surprisingly controversial.

    It's something that's also really important to us in the Measurement Science department at Syncapse. We recently wrote a white paper on a guerilla experiment we conducted in sentiment analysis. If you are interested in reading it, here is the link – http://www.syncapse.com/media/syncapse_sentimen

    Our biggest finding was that even human sentiment scoring isn't that accurate or reliable. This is because there are many versions of the truth.

    But let's ignore the quality factor, and pretend that we found a way to measure sentiment that was consistent and accurate.

    One question I would ask is, what good is sentiment scoring going to do for your business? I mean, when considering the insights and intelligence we can learn from a good measurement strategy, what can you do with “Yes, 75% of these people like you.”

    I hear 'sentiment analysis' and I think 'so what?'. There are more effective ways to bring analysis to the table, analysis that we can actually act on and change, analysis that enables us to improve and respond to our customers.

    When someone asks for sentiment analysis I feel as though they are just looking to check off social media on their 'To Do' list. We, as practitioners, need to steer away from this thinking and show what true measurement science is.

    z


    zoë siskos
    @zoedisco

    • Zoe – The scientific study you reference in your comment is incredibly insightful and disturbingly revealing. I encourage all brand marketers to check it out.

      Syncapse wanted “to demonstrate that a panel of humans, given equivalent AMBIGUOUS [Hugh's emphasis] instructions that most marketers are given when they log into a listening platform, could not unanimously agree on the Generic Sentiment Score.” You appear to have shown exactly that. But I do think one of the key findings of this study is that making instructions as unambiguous as possible is vital to having objective results. By instructions I am referring to the guidelines given to the analyst concerning SPECIFICALLY WHAT the sentiment analyst is assessing for sentiment. In the case of your survey, the instructions relating to sentiment were:
      “Please indicate whether you believe the statements below are positive, neutral, or negative – towards books.”

      I would be very curious to see whether you came to the same conclusion regarding the Generic Sentiment Score not being reliable if your instructions were “less ambiguous” whatever that might mean. I propose my own suggestion below as to what I think that might mean.

      It seems to me one way to resolve this problem of subjectivity is to ask (1) the client and (2) the analyst to grade the exact same sample of data with the exact same instructions. If their sentiment scores agree using one set of instructions then we would use that set of instructions for the analyst to evaluate all data. If their sentiment scores are not in agreement then the instructions need to be further clarified.

      Clarifying relative terms and making sure the client and analyst are on exactly the same page, therefore, is hugely important in sentiment analysis. That's my main take away from the excellent study you cite.

      My other take away, from your comment above (not the study) is that you are questioning the value of sentiment analysis in the first place. I'm not in agreement with you there (if that is what you are saying) but I would admittedly like to see more use cases for us to practically rather than theoretically debate this issue. I do agree there is a major problem with the way we currently “do” sentiment analysis today that needs to be addressed. I hope my suggestion above regarding instructions is a helpful one.

      S360 – would love to hear your input on this…

      • thanks for the reply Hugh; happy to hear you found value in the white paper.

        I agree with needing to clarify what we are assessing for sentiment. It would be interesting to do another study with much more in-depth instructions.

        Another area I would like to expand on in an experiment is the length of what they are evaluating. Twitter was easiest, obviously. But I wonder what kind of results we would have in scoring much lengthier blog posts.

        As for the relevance of even doing sentiment analysis – I don't totally discount it…if there is purpose behind it. Many just want to know the sentiment and nothing else. I question the value of this because there is no action that can really be taken.

        Again, thanks for the feedback; I always enjoy healthy conversations :)

        z

  • scottmarticke

    Thanks Jason, it was a pleasure talking with you last week.

    I'd like to answer some of the commenters regarding using tools for sentiment only. You are absolutely right…a simple sentiment score should not be the “be all/end all” of research. What we are doing at Sentiment360 goes beyond mere sentiment ratings. Our analysts are trained to connect dots, probe for trends, discover and monitor potential issues and emerging crises and certainly measure the effectiveness of marketing and media campaigns relative to the online world.

    So I agree with those who say sentiment analysis produces a limited result. The extent of knowledge that can be gained by constantly refreshing the monitoring parameters (as we do) goes far beyond that of simple sentiment.

  • Thanks for such a thorough analysis, Jason. I've wondered/worried about sentiment analysis ever since I started playing with Radian6, SM2 and other monitoring tools. It always seemed like an included feature without much substance.

    The data you share here pretty much confirms it: Counting on a machine to interpret tone or sentiment is a big, dumb crapshoot.

    By the way…who exactly benefits from sentiment analysis? Is it the PR/ad agency that can say “Look, they used to hate you, and now they love you! We did such a good job for you!”

    Seems like you'd have to spend a lot of money to get reliable data just to make that sort of claim. I suppose when you're dealing with ad budgets in the millions of dollars, $7,000 a month here and there is no big deal.

    Is this really what it all boils down to? Determining on a macro scale whether people like us?

  • gianandreafacchini

    This post is music to my ears. My monitoring company is doing exactly this: focusing on human capability, not on plain algorithms.

  • donnalehman

    Thank you for sharing this comparison of 'man versus machine'. It's interesting to track the development (and effectiveness) of natural-language tracking tools for marketing purposes. Aka 'Semantic Web'. Will soon see a demo of a new-ish entry – Netbase – and will be sure to ask if they've run comparisons with human input. They call their method 'Netnography', and started as a science/innovation research program. Seems promising – but will keep your comments in mind.
    Cheers!

    • Thanks for the comment Donna, I'm with Netbase and the actual product you will see a demo of will be ConsumerBase, which is our brand insight tool. We are currently using human analysis to test ourselves to make sure what our NLP is interpreting and reporting is correct. We don't just do sentiment analysis but actually understand what emotional involvement consumers have with brands through the emotional language they use. We help companies understand likes/dislikes/emotions/behaviors etc. so we are not just measuring the usual sentiment triad of positive/neutral/negative, which IMHO is un-actionable for most organizations. Hope you have a good demo and I'd love your feedback on it as well.

  • jonnybgood

    Interesting points well put across, Jason. I would just add my 5 cents' worth that some human analysis services can also provide you with actionable insights drawn from metrics beyond sentiment: who is expressing the opinion (a present, former or potential customer, a competitor's customer, an employee, a journalist); the motive or type of opinion (question, suggestion, complaint, recommendation); and where the opinion is being expressed (linking in web metrics such as traffic, links, interactivity, originality of content, etc.). If you go to the trouble (and expense) of using human analysis then you need to be able to get more out of it than mere sentiment. If I were a potential buyer of these services then I would want to know exactly what is good, bad or indifferent about me, who is saying this, why, and what I can do about it. The people in your web space all day every day are the people who can help you with this.

    • Good point. This falls into the issue of evaluating sentiment priority. Source influence analytics are important to the overall analytics equation, and Radian6, for one, does an outstanding job at this in particular: they break down influencer influence with an easily adjustable algorithm that uses 8 viewable key metrics (such as inbound links to the post) if you are especially concerned about more granularity in this area. Also important and somewhat related are (1) the reputation of the source and (2) the intensity of the sentiment that is expressed. (See Maria Ogneva's insightful post at http://digg.com/d31Otu3 for more on this.)

      One additional observation: What about multimedia conversations taking place online that do not make use of text but instead make use of voice (blogtalkradio), video (youtube), pictures (flickr, slideshare), etc… I'm curious: Does Sentiment360 do any sentiment analytics for this part of the social web? Might they or you have any stats on the percentage of conversations online that are multimedia vs. text-based? This is sort of a blind spot for the analytics technology providers when it comes to sentiment. No?

      Jason – keep up the great work. Thanks for being an outstanding resource for those of us who live and breathe in this space every day! It's evolving so rapidly and it really helps to have your perspective and those of the community here who add insightful comments. (I guess Sentiment360 could grade the sentiment of this post as “2 thumbs up” :) )