I’ll report here on a LinkedIn error — it’s not a bug, it’s a flawed algorithm, significant although far from earth-shattering — that shows how difficult automated matching can be. I’ll then offer practical steps LinkedIn could take toward accurate matching.
Why should you read on? Not only because (I’m guessing) you have a LinkedIn profile, but also because in an “omni-channel” world, data matching — also known as data integration, data fusion, record linkage, and synthesis — is central to meeting everyday social and enterprise business challenges.
Check out the snippet to the right, from my own hand-entered LinkedIn profile. You’ll see that I wrote for Intelligent Enterprise magazine for quite a few years. (Jeannette Boyne, thanks again for the recommendation!)
But what’s with the Intel Corporation information that appears when I hover over “Intelligent Enterprise magazine (CMP)”? It’s a LinkedIn-generated misconnection. Intelligent Enterprise magazine was folded into InformationWeek a few years back. IE ceased to exist as a free-standing brand. LinkedIn must have taken the first five characters of the magazine’s name and decided that Intel is the best match.
This matching error is significant because for LinkedIn, a connections company, social-graph accuracy is gold. Members hand-craft their networks based on past, current, and hoped-for future business relationships. LinkedIn derives People You May Know recommendations from our employment histories and interests, but recommendations are only suggestions. We see that when LinkedIn asserts connections, as the platform does in the example I show, the company gets into trouble.
(LinkedIn also misses certain computable connections, more on which later.)
LinkedIn actually mismatches a second of my former employers, Magnet Interactive. I didn’t work for the company you’ll see in the mouseover pop-up when you visit my profile, or for any company related to it. Just because two company names look the same doesn’t mean they’re the same company!
So we have two examples of “entity resolution” false positives in just my profile. (LinkedIn should pay me a product-quality bounty. See also My Search for Relevance on LinkedIn, posted in March; my April via-Twitter reporting of incorrect rendering of HTML character entities, here and here; a February 2013 Twitter thread about the lack of needed LinkedIn profile spell-checking; and my July 2012 LinkedIn, Please Take on Group Spammers.)
I found additional examples by looking at profiles of others who were formerly employed by now-defunct companies. Here’s one such example, in the image to the right. Hyperion Software was acquired by Arbor Software (as this person’s profile states!), which in turn was acquired by Oracle.
Funny thing: Somehow, as you can see toward the top of the image, LinkedIn did get right that PeopleSoft was acquired by Oracle.
Automated Matching is Hard
Yup, automated matching is hard. Direct marketers and others have been working on the problem for years, for instance, in order to merge and deduplicate mailing lists. Software tools may declare record matches when the values of several fields in pairs of records line up (for instance, first initial + last name + address), and they may tolerate abbreviations, misspellings, and data variations (PA = Penna. = Pennsylvania = Pennslvania) or even exploit phonetic similarity in names. Some even determine, via data profiling based on scans of the fields’ values, that fields in different databases have the same meaning despite different field names. The matches are sometimes fuzzy, decided by a probability judgment.
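To make the mailing-list case concrete, here is a minimal Python sketch of that kind of multi-field matching. Everything in it is an assumption for illustration: the field names, the tiny state-variant table, the 0.85 cutoff, and the equal weighting of last name and address. It leaves out phonetic matching entirely and uses the standard library's difflib for string similarity, which is just one of many possible measures.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical variant table; a real tool would carry a far larger one.
STATE_VARIANTS = {"pa": "PA", "penna.": "PA", "penna": "PA", "pennsylvania": "PA"}


def normalize_state(raw: str) -> str:
    """Map a known variant, or a near-miss spelling of one, to a canonical code."""
    key = raw.strip().lower()
    if key in STATE_VARIANTS:
        return STATE_VARIANTS[key]
    # Tolerate misspellings such as "Pennslvania" via closest-variant lookup.
    close = get_close_matches(key, list(STATE_VARIANTS), n=1, cutoff=0.85)
    return STATE_VARIANTS[close[0]] if close else raw.strip().upper()


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a: dict, b: dict) -> float:
    """Score a pair of records: first initial and state must agree exactly
    (after normalization); last name and address are compared fuzzily."""
    if a["first"][0].lower() != b["first"][0].lower():
        return 0.0
    if normalize_state(a["state"]) != normalize_state(b["state"]):
        return 0.0
    return (similarity(a["last"], b["last"]) + similarity(a["address"], b["address"])) / 2


# Two made-up records that plausibly describe the same person.
r1 = {"first": "Jane", "last": "Smith", "address": "12 Main Street", "state": "Penna."}
r2 = {"first": "J.", "last": "Smyth", "address": "12 Main St", "state": "Pennslvania"}
score = match_score(r1, r2)
```

The score is the "probability judgment" in miniature: a tool would declare a match above some threshold, and choosing that threshold is where the hard trade-off between false positives and false negatives lives.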
Check out database-systems wizard Mike Stonebraker’s latest, Tamr, which aims to overcome the data disconnect.
FirstRain, which extracts, aggregates, and organizes business information from online and social sources, provides an even better example of semantic matching done right. As described in words pulled from FirstRain’s Web site: “Selling to GE Locomotives? You won’t want to read a generic newsfeed on GE Aviation or GE Capital — and with FirstRain, you won’t. You will only see what’s relevant based on how you sell and market to each specific business line within a company.” That is, the company (claims it) has successfully addressed the semantic-matching problem at a level of granularity, the division level, that exceeds LinkedIn’s matching needs.
I wrote about FirstRain and a number of other semantic-matching successes — Tableau, Attivio, Google, and the now-defunct Extractiv — in my 2011 InformationWeek article, 5 Paths To The New Data Integration. (I’m linking you to page 2, which features Attivio and FirstRain.)
A Company Graph
My prescription for LinkedIn: Create a company graph, an ontology that recognizes factors that include:
- naming variations (e.g., General Motors = General Motors Corporation = (sometimes) GM);
- hierarchy (Chevrolet is a GM division);
- temporality and geography (Digital Equipment Corporation was founded in Massachusetts in 1957 and existed under that name through 1998, with a first international office opening in West Germany (now just Germany, of course) in 1963);
- transactions (Arbor Software merged with Hyperion Software, formerly IMRS, to form Hyperion Solutions Corporation, which was in turn acquired by Oracle);
- multiple uses of a single name (polysemy) (SAS is both an enterprise software company and an airline); and
- identity shifts (SAS, as in the software company, once stood for Statistical Analysis System, and the company was called SAS Institute; SAS, the airline, was once Scandinavian Airlines System).
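A minimal Python sketch of how such a company graph might be represented follows. The class, the keys, the industry labels, and the relation names are all my own illustrative assumptions, not anything LinkedIn is known to use. Relations are stored as (relation, target) pairs, so "merged_into" or "acquired_by" edges could sit alongside the "division_of" edge shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Company:
    key: str                                       # hypothetical internal identifier
    names: Set[str]                                # naming variations and former names
    industry: str                                  # helps separate polysemous names
    active: Tuple[int, Optional[int]] = (0, None)  # (first year, last year); 0/None = unknown/open
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target key)


GRAPH = {
    "gm": Company("gm", {"General Motors", "General Motors Corporation", "GM"}, "automotive"),
    "chevrolet": Company("chevrolet", {"Chevrolet"}, "automotive",
                         relations=[("division_of", "gm")]),
    "dec": Company("dec", {"Digital Equipment Corporation"}, "computing", active=(1957, 1998)),
    "sas_software": Company("sas_software", {"SAS", "SAS Institute"}, "software"),
    "sas_airline": Company("sas_airline", {"SAS", "Scandinavian Airlines System"}, "airline"),
}


def resolve(name: str, year: Optional[int] = None, industry: Optional[str] = None) -> List[str]:
    """Return keys of every company whose known names include `name`,
    filtered by the year and industry context when those are supplied."""
    wanted = name.strip().lower()
    hits = []
    for company in GRAPH.values():
        if wanted not in {n.lower() for n in company.names}:
            continue
        start, end = company.active
        if year is not None and (year < start or (end is not None and year > end)):
            continue
        if industry is not None and industry != company.industry:
            continue
        hits.append(company.key)
    return sorted(hits)


def parents(key: str) -> List[str]:
    """Follow hierarchy edges one level up."""
    return [target for relation, target in GRAPH[key].relations if relation == "division_of"]
```

Note the design choice: resolve returns a list, not a single answer. A bare "SAS" yields two candidates, and only context (an industry, a date) narrows the field. A name that is not in the graph yields an empty list, not a guess.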
LinkedIn has the data to create just such a company graph, via text mining, and surely its employees have the data-science smarts. How-to examples? Two:
That profile pictured above that lists Hyperion Software as an employer: It explicitly states, “Company acquired by Arbor Software.” Text analytics will identify “Company” as a contextual anaphoric reference to Hyperion Software. (Sorry about the jargon. Pronouns such as he and she are other commonly found anaphora.) Text analytics will identify Arbor Software as a named entity and will discern and extract the “acquired by” relationship. Further, I’d bet there are other LinkedIn profiles that corroborate this particular corporate acquisition.
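Here is a toy Python sketch of that extraction step, under heavy assumptions. A single regular expression stands in for real named-entity recognition and relationship extraction, the list of generic subjects is invented, and the second profile note in the usage lines is hypothetical. It would trip over trailing text such as "in 1998", which a real text-analytics pipeline would handle.

```python
import re
from collections import Counter

# Toy pattern: a subject, optional "was", the phrase "acquired by", then a capitalized name.
ACQUIRED_BY = re.compile(
    r"^(?P<subject>.+?)\s+(?:was\s+)?acquired by\s+(?P<acquirer>[A-Z][\w&.\- ]*\w)"
)

# Invented list of generic words that refer back to the profile's employer.
GENERIC_SUBJECTS = {"company", "the company", "it", "firm", "the firm"}


def extract_acquisition(employer, note):
    """Return (acquired, "acquired_by", acquirer), or None if the pattern is absent."""
    found = ACQUIRED_BY.search(note.strip())
    if found is None:
        return None
    subject = found.group("subject").strip()
    if subject.lower() in GENERIC_SUBJECTS:
        subject = employer  # resolve the anaphoric reference to the listed employer
    return (subject, "acquired_by", found.group("acquirer").strip())


def corroboration(profiles):
    """Count how many (employer, note) pairs yield each extracted relationship."""
    triples = (extract_acquisition(employer, note) for employer, note in profiles)
    return Counter(t for t in triples if t is not None)


counts = corroboration([
    ("Hyperion Software", "Company acquired by Arbor Software."),
    ("Hyperion Software", "Hyperion Software was acquired by Arbor Software"),  # hypothetical second profile
])
```

The count is the corroboration: a relationship asserted independently in several profiles deserves more confidence than one asserted once.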
Now refer back to my first image, above. Jeannette Boyne, who recommended me, lists in her profile that she worked as “Senior Editor, Intelligent Enterprise” for “CMP Media (div of United Business Media)” from September 1998 to September 2005, overlapping the years I list for my “Intelligent Enterprise magazine (CMP)” association. Our profiles, and profiles of other former associates with whom we’re both linked, provide data that supports high-confidence entity (company name) and subsidiary-relationship resolution. LinkedIn, lacking sufficient semantic smarts, despite Jeannette’s and my first-degree connection and her recommendation and the similarity of the employer names, failed to infer an obvious connection.
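A rough Python sketch of how that evidence could be pooled follows. The stopword list and the scoring weights are arbitrary assumptions of mine. The years on my own position are placeholders chosen only so the ranges overlap; they are not my actual dates. The idea is simply that shared name tokens, overlapping dates, a first-degree connection, and a recommendation each add support.

```python
import re

# Arbitrary stopword list: words too generic to count as evidence.
STOPWORDS = {"of", "the", "div", "magazine", "media", "inc", "corp"}


def tokens(text):
    """Lowercased word tokens, minus generic words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def years_overlap(a, b):
    """True if two (start, end) year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def link_evidence(pos_a, pos_b, connected, recommended):
    """Return (score, shared tokens). The weights are illustrative, not tuned."""
    shared = tokens(pos_a["employer"]) & (tokens(pos_b["employer"]) | tokens(pos_b["title"]))
    score = 0
    score += 2 if len(shared) >= 2 else 0  # e.g. brand name plus parent abbreviation
    score += 1 if years_overlap(pos_a["years"], pos_b["years"]) else 0
    score += 1 if connected else 0
    score += 1 if recommended else 0
    return score, shared


mine = {"employer": "Intelligent Enterprise magazine (CMP)", "title": "",
        "years": (2000, 2004)}  # PLACEHOLDER years, not the actual dates
hers = {"employer": "CMP Media (div of United Business Media)",
        "title": "Senior Editor, Intelligent Enterprise", "years": (1998, 2005)}
intel = {"employer": "Intel Corporation", "title": "", "years": (1998, 2005)}

strong, shared_tokens = link_evidence(mine, hers, connected=True, recommended=True)
weak, _ = link_evidence(mine, intel, connected=False, recommended=False)
```

In this sketch "intel" and "intelligent" are different tokens, so the Intel pairing gets no credit from the names at all; whatever it scores comes only from the date overlap.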
The Unreasonable Effectiveness of Data… and Analytics
Consider this column a call to action, for LinkedIn and for other data-rich, capable organizations.
Do you want to lead in digital? Simplistic approaches (assuming associations based only on a match in the first few characters of two names!?) don’t cut it. Use your data. Create and apply knowledge structures (graphs, ontologies, semantic networks) to resolve and disambiguate names and extract relationships. Apply multiple methods, performing cross-checks until you’re reasonably certain about inferences.
LinkedIn, if you’re not going to take these steps: Better to provide no results — skip the possibly mis-inferred connections — rather than erroneous ones. But consider that high-quality data, and high-value results, are worth the extra effort. Users will thank you.
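To close, a small Python sketch of "cross-check, and abstain when unsure." It is a sketch under stated assumptions: the three checks, the suffix list, and the thresholds are mine, and the prefix test is only my guess at the kind of shortcut behind the Intel result, not LinkedIn's known method.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal-form words to ignore when comparing names.
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "co", "company", "ltd", "llc"}


def prefix_match(a, b, n=5):
    """The simplistic test: do the first n characters agree?"""
    return a.lower()[:n] == b.lower()[:n]


def normalize(name):
    """Lowercased word list with legal-form words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", name.lower()) if w not in LEGAL_SUFFIXES]


def checks(a, b):
    """Run three independent name comparisons and report each verdict."""
    na, nb = normalize(a), normalize(b)
    sa, sb = set(na), set(nb)
    jaccard = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    ratio = SequenceMatcher(None, " ".join(na), " ".join(nb)).ratio()
    return {
        "exact": bool(na) and na == nb,
        "tokens": jaccard >= 0.5,
        "similarity": ratio >= 0.85,
    }


def resolve_link(a, b, min_votes=2):
    """Assert the link only when enough checks agree; otherwise abstain (None)."""
    votes = sum(checks(a, b).values())
    return b if votes >= min_votes else None


naive = prefix_match("Intelligent Enterprise magazine (CMP)", "Intel Corporation")
careful = resolve_link("Intelligent Enterprise magazine (CMP)", "Intel Corporation")
variant = resolve_link("General Motors", "General Motors Corporation")
```

The prefix test happily pairs my magazine with Intel, while the cross-checked version abstains and still links the General Motors naming variants. Name checks alone would not catch a case like Magnet Interactive, where two different companies share an identical name; that takes non-name evidence such as dates, locations, and shared connections.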