Researchers at Wikimedia are using machine learning to predict whether—and why—any given sentence on Wikipedia may need a citation in order to help editors identify areas of content violating the verifiability policy.
One of the key mechanisms that allows Wikipedia to maintain its high quality is the use of inline citations. Through citations, readers and editors make sure that information in an article accurately reflects its source. As Wikipedia’s verifiability policy mandates, “material challenged or likely to be challenged, and all quotations, must be attributed to a reliable, published source”, and unsourced material should be removed or challenged with a citation needed flag.
However, deciding which sentences need citations may not be a trivial task. On the one hand, editors are urged to avoid adding citations for information that is obvious or common knowledge—like the fact that the sky is blue. On the other hand, sometimes the sky doesn’t actually appear blue—so perhaps we need a citation for that after all? Scale up this problem to the size of an entire encyclopedia, and it may become intractable.
One of the more compelling use cases for AI is automating mission-critical tasks that humans don’t want to do, or can’t do. Wikipedia ran into just such a problem with its citations. With crowdsourced content, citations are crucial to providing accuracy and reliability in the site’s vast ocean of articles, but according to a blog post from the Wikimedia Foundation, around 25% of Wikipedia’s English-language articles lack a single citation. “This suggests that while around 350,000 articles contain one or more ‘citation needed’ flags, we are probably missing many more,” reads the post.
Anyone who’s spent time on Wikipedia has seen that more citations, generally, would be helpful, especially considering the site’s verifiability policy that states in part, “All quotations, and any material whose verifiability has been challenged or is likely to be challenged, must include an inline citation that directly supports the material.” In an email interview, Jonathan Morgan, Senior Design Researcher and co-author of Wikimedia’s “Citation Needed” study, noted accuracy isn’t the only advantage. “Citations not only allow Wikipedia readers and editors to fact-check information, they also provide jumping-off points for people who want to learn more about a topic,” he said.
The challenge for Wikipedia is not merely adding more citations, though; it’s understanding where citations are needed in the first place. That’s a laborious process in and of itself. To solve this twofold problem, Wikimedia developed a twofold solution. Step one was to create a framework for understanding where citations need to go and to build a data set from it. Step two was to train a machine learning classifier to scan and flag those items across Wikipedia’s hundreds of thousands of articles.
How they got there
A roster of 36 English, Italian, and French Wikipedia editors was given text samples and asked to put together a taxonomy of reasons why you would need a citation, and reasons why you wouldn’t. For example, if “the statement contains statistics or data” or “the statement contains technical or scientific claims,” you’d need a citation. If “the statement only contains common knowledge” or “the statement is about a plot or character of a book/movie that is the main subject of the article,” you would not.
With a set of guidelines in place, Wikimedia’s researchers created a data set upon which to train a recurrent neural network (RNN). In the blog post, the researchers said, “We created a data set of English Wikipedia’s ‘featured’ articles, the encyclopedia’s designation for articles that are of the highest quality—and also the most well-sourced with citations.” The setup for the training was fairly simple: when a line in a given featured article had a citation, it was marked as “positive,” and a line that did not have a citation was marked as “negative.” Then, based on the sequence of words in a given sentence, the RNN was able to classify citation needs with 90% accuracy, according to Wikimedia’s post.
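The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Wikimedia’s actual pipeline: the function name `label_sentences` and the citation patterns (wikitext `<ref>` tags and rendered footnote markers such as `[12]`) are chosen for the example and do not come from the study.

```python
import re

# Assumed citation markers (not taken from the study): self-closing <ref ... /> tags,
# paired <ref>...</ref> tags, and rendered footnote markers such as [12].
CITATION_PATTERN = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>|\[\d+\]", re.DOTALL
)


def label_sentences(sentences):
    """Return (clean_text, label) pairs: 1 if the sentence carried a citation, else 0."""
    examples = []
    for sentence in sentences:
        has_citation = CITATION_PATTERN.search(sentence) is not None
        # Strip the citation markup so a model would see only the words of the sentence.
        clean_text = CITATION_PATTERN.sub("", sentence).strip()
        examples.append((clean_text, 1 if has_citation else 0))
    return examples


sample = [
    "The population was estimated at 3 million in 2010.<ref>Census 2010</ref>",
    "The sky is blue.",
]
for text, label in label_sentences(sample):
    print(label, text)
```

Removing the markup before training matters in a setup like this: if the citation tags stayed in the text, a classifier could learn to detect the tag itself rather than the wording that tends to call for a source.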
But why is the model up to 90% accurate? What is the algorithm looking at when deciding whether a sentence needs a citation?
Explaining algorithmic predictions
To help interpret these results, we took a sample of sentences needing citations for different reasons, and highlighted the words the model weighted most heavily when it classified the sentences. In the case of “opinion” statements, for example, the model assigned the highest weight to the word “claimed”. In the “statistics” citation reason, the most important words to the model are verbs that are often used in reporting numbers (such as “estimated”). In the case of scientific citation reasons, the model pays more attention to domain-specific words like “quantum”.
Above: Keywords flagged for needing citations
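As a rough illustration of this kind of highlighting, the sketch below normalizes per-word scores and marks the highest-weighted words. The scores are made up for the example, and both the softmax normalization and the helper name `highlight_top_words` are assumptions; the source does not describe how the model’s word weights are actually computed.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def highlight_top_words(tokens, scores, k=1):
    """Wrap the k highest-weighted tokens in square brackets."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    weights = softmax(scores)
    # Rank positions (not words) so repeated words are treated independently.
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    top = set(ranked[:k])
    return " ".join(f"[{tok}]" if i in top else tok for i, tok in enumerate(tokens))


tokens = ["Critics", "claimed", "the", "film", "was", "overrated"]
scores = [0.8, 2.5, 0.1, 0.6, 0.1, 1.2]  # hypothetical values, for illustration only
print(highlight_top_words(tokens, scores))  # Critics [claimed] the film was overrated
```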
To take the process a step further, Wikimedia’s researchers created a second model that could add reasons to its citation classifications. Using Amazon’s Mechanical Turk, they pulled in human minds for the task and gave the participants some 4,000 sentences that had citations. The researchers found that sentences are more likely to need citations when they are related to scientific or historical facts, or when they reflect direct or indirect quotations.
The participants were asked to apply one of eight labels, such as “historical” or “opinion,” to show the reason why a citation was needed. With that data in hand, the researchers modified the neural network designed in the previous study so that it can classify an unsourced sentence into one of those eight citation reason categories. Put another way, they retrained this network using the crowdsourced labeled data and found that it provides reasonable accuracy (precision at 0.62) in predicting citation reasons, especially for classes with a substantial amount of training data.
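For readers unfamiliar with the metric, precision for a given label is the share of sentences predicted to have that label that really do have it. The sketch below computes it per class on a handful of made-up labels; the function name `per_class_precision` and the sample data are hypothetical, and the source does not say how the reported 0.62 was averaged across the eight categories.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision per label: correct predictions of a label / all predictions of it."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    # Only labels that were predicted at least once have a defined precision.
    return {label: correct[label] / predicted[label] for label in predicted}


# Hypothetical crowd labels and model predictions, for illustration only.
y_true = ["historical", "opinion", "statistics", "historical", "opinion"]
y_pred = ["historical", "historical", "statistics", "historical", "opinion"]
for label, value in sorted(per_class_precision(y_true, y_pred).items()):
    print(f"{label}: {value:.2f}")
```

In this toy example, “historical” is predicted three times but is right only twice, so its precision is about 0.67, while the other two labels score 1.0.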
So far, the model is trained only on English-language Wikipedia content, but Wikimedia is working on expanding it to more languages. Given how the data acquisition was performed, there are some obvious potential challenges with other languages that are structured differently than English. “We don’t have to start from scratch, but the amount of work may vary by language,” said Miriam Redi, research scientist at the Wikimedia Foundation and lead author on the paper. “To train our models, we use ‘word-vectors,’ namely language characteristics of the article text and structure. These word vectors can be easily extracted from text of any language existing in Wikipedia.”
She added that in some cases, they would need to collect new samples from other “featured articles” and would have to rely on the Wikipedia editors who work in those languages. Morgan added that they have processes for “translating English words that we know are associated with sentences that are likely to need citations into other languages.”
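To make the idea of word vectors concrete, here is a minimal sketch of how a sentence might be turned into a sequence of vectors before being fed to a sequence model. The three-dimensional toy vectors, the zero vector for unknown words, and the function name `embed_sentence` are all assumptions for illustration; real word vectors are learned from large text corpora and have many more dimensions.

```python
def embed_sentence(tokens, vectors, dim):
    """Map each token to its vector, falling back to a zero vector for unknown words."""
    unknown = [0.0] * dim
    return [list(vectors.get(token.lower(), unknown)) for token in tokens]


# Toy lookup table with made-up values; a real table would be trained per language.
toy_vectors = {
    "estimated": [0.9, 0.1, 0.0],
    "claimed": [0.1, 0.9, 0.0],
}

sequence = embed_sentence(["Officials", "estimated", "losses"], toy_vectors, dim=3)
print(sequence)
```

Because the lookup depends only on the table passed in, swapping in a table built from another language’s text leaves the rest of the code unchanged, which is consistent with Redi’s point about not starting from scratch.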
Even with some AI involved, the lion’s share of the work falls on the shoulders of a group of dedicated volunteer Wikipedia editors. Creating a mass of hundreds of thousands of accurate citation flags is informative, but humans will need to tackle them all one at a time. But at least, now they know where to start.
Ideally, the researchers believe that this AI can help Wikipedia editors understand where information needs to be verified and why, and show readers what content is especially trustworthy. Once the code is open sourced, they hope it will encourage other volunteer software developers to make more tools that can increase the quality of Wikipedia articles.
But there are larger implications, said Morgan: “Outside the Wikimedia movement, we hope that other researchers (such as members of the Credibility Coalition) use our code and data to develop tools for detecting claims in other online news and information sources that need to be backed up with evidence.”
Source: (Wikimedia Blog, VentureBeat, Wikipedia)