It turns out April 1st is good for at least one thing – helping language researchers identify more malicious “fake news.” No joke.
Ph.D. student Edward Dearden and his advisor Dr. Alistair Baron, of Lancaster University’s School of Computing and Communications, have come up with a constructive way of using April Fools’ Day hoax stories that appear on the Internet. The pair are using them to study deceptive language in the hope of offering insight into ways of spotting what many have come to call “fake news.”
Of course, fake news is a complicated term – one that can even become dangerous when leveled against authors and readers who simply have a different outlook. In this case, the definition is narrower and refers to deliberate and malicious misinformation disguised as legitimate journalism.
Dearden notes that April Fools articles provided “a verifiable body of deceptive texts” and clarified in an e-mail:
There’s also the problem of subjectivity and people’s right to express their opinions. Labelling things as true or false can be a slippery slope to go down. One attractive thing about April Fools hoaxes is that everybody can agree that they aren’t true.
The authors, who study online disinformation and deception more generally, collected 14 years’ worth of April Fools articles from over 370 websites. They ended up with over 500 articles and compared these hoax pieces to legitimate articles written in the same general time period. Their analysis revealed that writers attempting to pass fiction off as fact use some of the same stylistic techniques, whether the goal is an April 1st joke or malicious deception.
In comparing the spoof articles to legitimate news, Dearden and Baron paid special attention to the amount of detail used, the vagueness of the language, the formality of the author’s writing style, and the complexity of their language.
Next, they took a data set from a 2017 study on “fake news” (and, more specifically, on how such stories are titled) that identified common characteristics of deceptive and malicious news stories. While not a sure-fire way of identifying such stories, the researchers found that much “fake news” – when compared to legitimate, non-deceptive news – is shorter, easier to read, written in simplistic language, and less formal (often using first names). These stories (and their titles) also contain more proper nouns, first-person pronouns, profanity, and spelling errors, and fewer punctuation marks and dates.
When compared to news stories not designed to deceive readers, April Fools stories were also shorter, easier to read, and used more first-person pronouns. But they also contained more unique words, longer sentences, and fewer proper nouns. Spoof stories also tended to refer vaguely to events in the future, contained more references to the present, and mentioned fewer past events.
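To make the comparison concrete, here is a minimal sketch of how a few of the stylistic cues described above – story length, sentence length, first-person pronoun use, and unique-word ratio – could be extracted from a text. The feature names and the pronoun list are illustrative choices, not the researchers’ actual code or feature set.

```python
import re

# Illustrative set of first-person pronouns (an assumption, not the study's list)
FIRST_PERSON = {"i", "we", "me", "us", "my", "our", "mine", "ours"}

def stylistic_features(text: str) -> dict:
    """Extract a handful of simple stylistic cues from a news story."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "num_words": len(words),                                  # story length
        "avg_sentence_len": len(words) / max(len(sentences), 1),  # sentence length
        "first_person_rate": sum(w in FIRST_PERSON for w in words) / n_words,
        "unique_word_ratio": len(set(words)) / n_words,           # lexical diversity
    }

feats = stylistic_features("We announced a flying car today. It arrives soon!")
```

Per the findings above, a hoax piece would tend to score higher on first-person rate and unique-word ratio than a genuine story of similar topic, and lower on total length.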
It’s not surprising that names, places, and specific dates and times – which are integral to contextualizing a news story – appear less often in both April Fools stories and “fake news.” It also makes sense that proper nouns, such as the names of politicians, are more commonly found in “fake news.”
But the researchers did point out that the prominence of first-person pronouns, such as “we,” in deceptive stories was unexpected: prior deception research suggests that those trying to obscure the truth tend to avoid them.
Dearden and Baron’s next task was to create a machine learning classifier to identify whether an article was an April Fools hoax, fake news, or genuine news. Their algorithm managed to identify April Fools articles accurately 75% of the time and fake news with a 72% accuracy rate.
But the real challenge was to see if they could train the classifier on data from April Fools stories and then use it to predict fake news – that would show just how useful those April 1st stories are for improving our understanding of linguistic similarities between the two types of stories. In the end, April Fools data allowed the classifier to identify other fake news with an accuracy rate exceeding 65%.
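The cross-domain setup described above – fit a model on one kind of deceptive text, then test it on another – can be sketched in a few lines. The toy feature vectors and the nearest-centroid rule here are stand-ins for illustration; the study’s actual model, features, and data were more elaborate.

```python
# Toy feature rows: [first_person_rate, avg_sentence_len] (hypothetical values)
april_fools = [[0.08, 22.0], [0.07, 25.0]]  # training class 1: hoax stories
genuine     = [[0.01, 15.0], [0.02, 14.0]]  # training class 2: real news

def centroid(rows):
    """Average each feature column to get one prototype vector per class."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit(hoax_rows, genuine_rows):
    return {"hoax": centroid(hoax_rows), "genuine": centroid(genuine_rows)}

def predict(model, row):
    """Assign the label whose class centroid is nearest (squared distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sqdist(model[label], row))

model = fit(april_fools, genuine)

# "Fake news" test set, never seen during training (all truly deceptive here)
fake_news = [[0.06, 21.0], [0.09, 24.0]]
accuracy = sum(predict(model, r) == "hoax" for r in fake_news) / len(fake_news)
```

The point of the exercise is the evaluation shape, not the numbers: if features learned from April Fools stories transfer to fake news at better-than-chance accuracy, the two genres share measurable linguistic habits.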
There is a large body of research in the branch of artificial intelligence that deals with machine-learning systems for processing natural language. Algorithm-based systems that identify linguistic cues are under development at multiple universities and companies in the hope of building “fake news” detectors to fight blatant and willful misinformation. Human editors simply can’t keep up, so researchers are actively working to combat some of the most dangerous deception campaigns, which often spread quickly through social media. Machines that can perform linguistic analyses on quantifiable attributes like grammar, word choice, and punctuation have a much better chance of intercepting potentially detrimental stories.
But researchers are also aware of the difficulties in judging what counts as “fake” or legitimate. One of the reasons we don’t yet have a plethora of fake news detectors to rely on is that researchers remain committed to rigorous testing of the AI algorithms, as well as to careful ways of collecting the data used to train such systems.
As Dearden explained to me, their research is just a piece of the puzzle and not a comprehensive checklist that people can reliably use to identify misleading copy with perfect accuracy:
The aim of our research is to try and understand the language being used in hoax news articles and to see how this relates to the kinds of disinformation that we’ve come to refer to as ‘fake news.’ None of the features we discuss in our paper are the silver bullet for detecting fake news.
But the research can help people spot some warning signs and become more aware of what they are reading. While instructing people in ways to think critically and fact-check news is beyond the scope of this particular research study, Dearden mentioned that their work in teaching computers to spot deceptive text is an important piece of the puzzle.
There’s a lot of really interesting work in fighting disinformation at the moment. It’s particularly important because society’s really struggling to adjust to the volume of information that is now available and it’s having real consequences. Hopefully, the research community can develop methods that help us tackle this problem and minimise its effects in the future.
Dearden and Baron will present this research at the 20th International Conference on Computational Linguistics and Intelligent Text Processing, to be held in La Rochelle, France, later in April.