Following Part 1, let’s find out in this next part whether computers are as smart as humans at reading comprehension.

Refining the Recipe

Like any good recipe, BERT was soon adapted by cooks to their own tastes. In the spring of 2019, there was a period “when Microsoft and Alibaba were leapfrogging each other week by week, continuing to tune their models and trade places at the number one spot on the leaderboard,” Bowman recalled. When an improved version of BERT called RoBERTa first came on the scene in August, the DeepMind researcher Sebastian Ruder dryly noted the occasion in his widely read NLP newsletter: “Another month, another state-of-the-art pretrained language model.”

BERT’s “pie crust” incorporates a number of structural design decisions that affect how well it works. These include the size of the neural network being baked, the amount of pretraining data, how that pretraining data is masked and how long the neural network gets to train on it. Subsequent recipes like RoBERTa result from researchers tweaking these design decisions, much like chefs refining a dish.
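To make the “masking” ingredient concrete, here is a minimal sketch, in plain Python, of how a masked-language pretraining example might be assembled. It illustrates the general idea rather than BERT’s actual training code; the masking proportions follow the figures reported in the original BERT paper, while the function name and toy sentence are our own.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy sketch of masked-language pretraining: hide a fraction of the
    input tokens and record which words the model must reconstruct."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # the model must predict this word
            roll = random.random()
            if roll < 0.8:
                inputs.append("[MASK]")              # usually hide it behind a mask symbol
            elif roll < 0.9:
                inputs.append(random.choice(vocab))  # occasionally swap in a random word
            else:
                inputs.append(tok)                   # occasionally leave it untouched
        else:
            inputs.append(tok)
            targets.append(None)                     # nothing to predict here
    return inputs, targets

sentence = "scientific studies have shown a link between smoking and cancer".split()
masked, answers = mask_tokens(sentence, vocab=sentence)
print(masked)   # e.g. ['scientific', '[MASK]', 'have', ...]
```

Tweaking knobs like mask_prob, or how long a model churns through such examples, is exactly the kind of adjustment that separates one “recipe” from the next.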

In RoBERTa’s case, researchers at Facebook and the University of Washington increased some ingredients (more pretraining data, longer input sequences, more training time), took one away (a “next sentence prediction” task, originally included in BERT, that actually degraded performance) and modified another (they made the masked-language pretraining task harder). The result? First place on GLUE – briefly. Six weeks later, researchers from Microsoft and the University of Maryland added their own tweaks to RoBERTa and eked out a new win. As of this writing, yet another model called ALBERT, short for “A Lite BERT,” has taken GLUE’s top spot by further adjusting BERT’s basic design.

“We’re still figuring out what recipes work and which ones don’t,” said Facebook’s Ott, who worked on RoBERTa.

Still, just as perfecting your pie-baking technique isn’t likely to teach you the principles of chemistry, incrementally optimizing BERT doesn’t necessarily impart much theoretical knowledge about advancing NLP. “I’ll be perfectly honest with you: I don’t follow these papers, because they are extremely boring to me,” said Linzen, the computational linguist from Johns Hopkins. “There is a scientific puzzle there,” he grants, but it doesn’t lie in figuring out how to make BERT and all its spawn smarter, or even in figuring out how they got smart in the first place. Instead, “we are trying to understand to what extent these models are really understanding language,” he said, and not “picking up weird tricks that happen to work on the data sets that we commonly evaluate our models on.”

In other words: BERT is doing something right. But what if it’s for the wrong reasons?

Clever but Not Smart

In July 2019, two researchers from Taiwan’s National Cheng Kung University used BERT to achieve an impressive result on a relatively obscure natural language understanding benchmark called the argument reasoning comprehension task. Performing the task requires selecting the appropriate implicit premise (called a warrant) that will back up a reason for arguing some claim. For example, to argue that “smoking causes cancer” (the claim) because “scientific studies have shown a link between smoking and cancer” (the reason), you need to presume that “scientific studies are credible” (the warrant), as opposed to “scientific studies are expensive” (which may be true, but makes no sense in the context of the argument). Got all that?

If not, don’t worry. Even human beings don’t do particularly well on this task without practice: The average baseline score for an untrained person is 80 out of 100. BERT got 77 – “surprising,” in the authors’ understated opinion.

But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 – equivalent to random guessing. An article in The Gradient, a machine-learning magazine published out of the Stanford Artificial Intelligence Laboratory, compared BERT to Clever Hans, the horse with the phony powers of arithmetic.
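As a rough illustration of the kind of probe the authors ran, a deliberately shallow baseline can check whether a single surface cue, such as the presence of “not,” predicts the right answer far above chance. The data format and helper names below are assumptions made for the sake of the sketch, not the authors’ code.

```python
def not_heuristic(warrant_a, warrant_b):
    """Pick whichever warrant contains the word 'not'; guess the first otherwise.
    If such a trivial rule beats chance, the data set is leaking the answer."""
    a_has = " not " in f" {warrant_a.lower()} "
    b_has = " not " in f" {warrant_b.lower()} "
    if a_has and not b_has:
        return 0
    if b_has and not a_has:
        return 1
    return 0  # fixed fallback when the cue is absent or appears in both warrants

def cue_accuracy(examples):
    """examples: list of (warrant_a, warrant_b, correct_index) tuples."""
    hits = sum(not_heuristic(a, b) == gold for a, b, gold in examples)
    return hits / len(examples)
```

A model that merely rediscovers a cue like this will look impressive on the benchmark while understanding nothing about the argument.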

In another paper called “Right for the Wrong Reasons,” Linzen and his coauthors published evidence that BERT’s high performance on certain GLUE tasks might also be attributed to spurious cues in the training data for those tasks. (The paper included an alternative data set designed to specifically expose the kind of shortcut that Linzen suspected BERT was using on GLUE. The data set’s name: Heuristic Analysis for Natural-Language-Inference Systems, or HANS.)

So is BERT, and all of its benchmark-busting siblings, essentially a sham? Bowman agrees with Linzen that some of GLUE’s training data is messy – shot through with subtle biases introduced by the humans who created it, all of which are potentially exploitable by a powerful BERT-based neural network. “There’s no single ‘cheap trick’ that will let it solve everything [in GLUE], but there are lots of shortcuts it can take that will really help,” Bowman said, “and the model can pick up on those shortcuts.” But he doesn’t think BERT’s foundation is built on sand, either. “It seems like we have a model that has really learned something substantial about language,” he said. “But it’s definitely not understanding English in a comprehensive and robust way.”

According to Yejin Choi, a computer scientist at the University of Washington and the Allen Institute, one way to encourage progress toward robust understanding is to focus not just on building a better BERT, but also on designing better benchmarks and training data that lower the possibility of Clever Hans–style cheating. Her work explores an approach called adversarial filtering, which uses algorithms to scan NLP training data sets and remove examples that are overly repetitive or that otherwise introduce spurious cues for a neural network to pick up on. After this adversarial filtering, “BERT’s performance can reduce significantly,” she said, while “human performance does not drop so much.”
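The sketch below gives a simplified picture of that idea, under some assumptions: examples are represented as pre-computed feature vectors, a shallow scikit-learn classifier stands in for the “adversary,” and the partition count and threshold are illustrative rather than taken from the published filtering procedure. Examples the shallow model solves too reliably are treated as suspect and dropped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def adversarial_filter(X, y, n_partitions=20, threshold=0.75):
    """Estimate how predictable each example is for a weak 'adversary' model
    across many random splits, then keep only the examples it cannot solve
    reliably; highly predictable examples likely carry spurious cues."""
    X, y = np.asarray(X), np.asarray(y)
    hits = np.zeros(len(y))
    seen = np.zeros(len(y))
    for _ in range(n_partitions):
        train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.5)
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        hits[test_idx] += (clf.predict(X[test_idx]) == y[test_idx])
        seen[test_idx] += 1
    predictability = hits / np.maximum(seen, 1)
    return np.flatnonzero(predictability < threshold)  # indices of examples to keep
```

The design choice is the same one Choi describes: make the data set hard for cheap shortcuts, so that only genuine language understanding, if a model has any, can close the gap with humans.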

Still, some NLP researchers believe that even with better training, neural language models may still face a fundamental obstacle to real understanding. Even with its powerful pretraining, BERT is not designed to perfectly model language in general. Instead, after fine-tuning, it models “a specific NLP task, or even a specific data set for that task,” said Anna Rogers, a computational linguist at the Text Machine Lab at the University of Massachusetts, Lowell. And it’s likely that no training data set, no matter how comprehensively designed or carefully filtered, can capture all the edge cases and unforeseen inputs that humans effortlessly cope with when we use natural language.

Bowman points out that it’s hard to know how we would ever be fully convinced that a neural network achieves anything like real understanding. Standardized tests, after all, are supposed to reveal something intrinsic and generalizable about the test-taker’s knowledge. But as anyone who has taken an SAT prep course knows, tests can be gamed. “We have a hard time making tests that are hard enough and trick-proof enough that solving [them] really convinces us that we’ve fully solved some aspect of AI or language technology,” he said.

Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that’s specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) that happens, does it mean that machines can really understand language any better than before? Or does it just mean that science has gotten better at teaching machines to the test?

“That’s a good analogy,” Bowman said. “We figured out how to solve the LSAT and the MCAT, and we might not actually be qualified to be doctors and lawyers.” Still, he added, this seems to be the way that artificial intelligence research moves forward. “Chess felt like a serious test of intelligence until we figured out how to write a chess program,” he said. “We’re definitely in an era where the goal is to keep coming up with harder problems that represent language understanding, and keep figuring out how to solve those problems.”

Source: Wired
