By now you've probably heard the news: Google is working on an algorithm, not yet incorporated into the search engine, that would allow them to score web sources based on their trustworthiness.
That is, all else being equal, if one webpage or site has a reputation for producing factually accurate information, it will be more likely to show up in search results.
Google published a paper on the algorithm, which they call Knowledge-Based Trust (KBT), at arXive.org.
While the news that Google plans to rank sites based on facts has certainly made the rounds, I haven't seen much in-depth discussion of how the algorithm works. Aside from Aaron Bradely's thoughtful post, I haven't seen any discussions that seem to be inspired by the paper itself, as opposed to the New Scientist article and the spin-offs it has inspired.
With that in mind, I thought I'd go ahead and give the paper a read, and summarize what I came back with. What follows is going to be a bit technical, but I shouldn't have to apologize for that in an industry that uses the name "optimization" in its title. That said, I like to think I've kept things informal enough for the relative layman to understand what I'm getting at.
For more depth, there's always the paper itself.
Let's get started.
To build a database of facts, and to test whether a page has accurate facts, Google has to extract knowledge from pages. This is accomplished by extracting triples. A triple is a statement of the form (subject, predicate, object), where:
- A subject is a real world entity that is assigned an ID.
- A predicate is a predefined attribute that entities can have.
- An object is an entity, string, numerical value, or date.
The KBT paper mentions that the Knowledge Vault extracts triples from webpages using 16 different methods. Only 4 of these are mentioned explicitly in the Knowledge Vault paper. They are called extractors:
- Text document extraction: These extractors are machine-learning algorithms trained on selected data sets with predetermined correct outcomes. At the time of the Knowledge Vault paper, these extractors had been trained on 4469 predicates, using logistic regression and a variety of other techniques. The correct answers for these predicates were pulled from Freebase.
- HTML DOM (Document Object Model) Trees: DOM is a language-independent convention for working with objects in HTML. Google's DOM extractors are, similarly, machine-learning algorithms trained on selected data with predetermined correct answers.
- HTML Tables: These extractors are designed somewhat differently. Relationships between columns of tables are inferred based on comparisons with Freebase.
- Human Annotated Pages: This is based on extraction from formats such as schema.org, microformats.org, opengraphprotocol.org, and so on.
The important thing to understand is that several types of extractors are used, and cross-referenced with one another. These can then be fused in order to achieve a greater overall confidence score.
The extractors are then combined with something else: priors. Priors are models built on data from a source outside of the web at large. Again, in the Knowledge Vault paper, the source was Freebase.
They trained the priors on Freebase data in order to predict which kinds of extractions were more (or less) plausible. For example, if two people share a child, they are more likely to be married.
They then fuse this prior knowledge with the extractors in order to achieve more accurate fact extraction.
Even without understanding all of the technical details, you can probably guess that there are some fundamental flaws with fact extraction of this kind: it's essentially a popularity vote. The probability that a fact is correct is determined by the percentage of sources that agree on it. As uncomfortable as I am with that kind of reasoning, it's almost certainly the best we can hope for in the foreseeable future.
As we'll get into in a bit, the KBT algorithm makes some improvements on this method of determining the accuracy of facts, but it's still based on a form of popularity.
2. The KBT Algorithm
Central to the algorithm is this tautology, quoted from the paper:
"Inference is an iterative process, since we believe a source is accurate if its facts are correct, and we believe the facts are correct if they are extracted from an accurate source." [Emphasis mine]
In other words, the Knowledge Vault algorithm for identifying facts is modified slightly. Initial estimates of facts are based essentially on popular vote. These estimates are then used to estimate the accuracy of sources. This estimate for the accuracy of each source is then used to update the estimates for each fact, and so on.
But this alone isn't enough, because the Knowledge Vault's extractors are far from perfect.
In order to tell whether a source is trustworthy, it's not enough to look at the facts extracted from that page. Google's extractors screw up the triple much more often than the triple is actually wrong, and the paper makes it clear that they are well aware of this problem.
In order to deal with this issue, the KBT algorithm has to estimate the probability of a failed extraction.
I don't want to dive too deep into the math, but I promised an explanation of how the algorithm works, and without at least a bit of math, I feel like I'm cheating.
KBT is an iterative (repeating) algorithm that updates two series of data, Z and theta. Once the algorithm reaches a point where each update to these series returns the same values, the algorithm stops.
Z = (V, C)
Here, V is a series of latent variables that represent whether any given triple is "true."
C is a series of latent variables representing whether any given web source actually provides a given triple.
Both V and C are series of binary variables. Any given triple is either true or false, and the question of whether any given web source provides a given triple is either true or false.
Now for theta:
theta = (theta_1, theta_2)
Here, theta_1 is a series of variables representing the probability that a given web source provides a "true" triple.
The second series, theta_2, actually contains two series within it. Here's a screenshot, taken from the paper:
If you're not familiar with higher math, don't get scared by all the unrecognizable symbols. The first series, P_e, represents the precision for each extractor; out of all of the extracted triples, it represents the percentage of them that are extracted correctly. R_e represents the recall for each extractor; out of all of the existing triples, it represents the percentage of them that are extracted. This Wikipedia image should help explain:
Without going any deeper into the math, all of these series of variables are related to each other by joint probabilities. Default values are chosen for the variables at the beginning of the algorithm. Each time the algorithm loops, the values for the variables update one another.
As the algorithm loops, its estimates get closer to a final series of values. A source doesn't start looking untrustworthy until the probability that the source is wrong starts to outweigh the probability that the extractors are wrong. The trustworthiness of each page then influences how big its overall vote is in determining the "truthfulness" of each possible triple, which in turn influences the other factors, and so on.
I'm Going to End it There
Rather than try to shoehorn some strategic advice into this post, I'd rather just let the algorithm speak for itself. If you have thoughts on the algorithm, I welcome them in the comment section.
Image credit: emdot