Why there must be universal grammar

The Guardian ran an interview with Daniel Everett yesterday. Everett is a linguist most famous for his claim that universal grammar (the belief that some rules of grammar are "hard wired" into the brain), as popularized by Chomsky, is false. Specifically, he believes that the Pirahã language lacks recursion.

His claims are quite controversial, but one thing worth mentioning is that universal grammar is (for a reasonable definition of "proof") provably correct. By this I mean:

Theorem: Learning grammar is so hard that the only way humans (or anyone) can do it is if they have innate structures.

This is related to Chomsky's poverty of the stimulus argument.

It can be proven in the following way: suppose we restrict ourselves to just the subset of English sentences consisting only of nouns and verbs. "I like John" and "You are here" would be two examples. These both follow the pattern "noun verb noun". A sentence like "jump run you" is non-grammatical, because "verb verb noun" is not an acceptable pattern in English.

Now let's consider how long it would take a learner to learn these patterns. There are 2^3 = 8 possible patterns of length three, so if a learner thinks they're all possible, it will have to test out all eight of them. ("Mommy, is 'jump run you' a sentence?")

Most sentences have many more than three words, of course, so a learner will need to test out the 2^4 = 16 four-word patterns, the 2^5 = 32 five-word patterns, etc. In general, there are 2^n possible patterns with n words, meaning that the number of tests the learner will need to run is exponential in the number of words.
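To make the counting concrete, here is a minimal Python sketch that enumerates every noun/verb pattern of a given length. The names WORD_CLASSES and patterns are my own, chosen for illustration, and the two-class vocabulary is the simplifying assumption from the example above.

```python
from itertools import product

# Assumption: only two word classes, as in the simplified example.
WORD_CLASSES = ("noun", "verb")

def patterns(length, classes=WORD_CLASSES):
    """Return every sequence of word classes with the given length."""
    return list(product(classes, repeat=length))

# The number of candidate patterns doubles with each extra word.
for n in range(3, 8):
    print(n, "words:", len(patterns(n)), "patterns")
```

A blank-slate learner would have to ask about every tuple in patterns(n), including ones like ("verb", "verb", "noun").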

The Cobham-Edmonds thesis holds that a problem is feasible in practice only if it can be solved in polynomial time. In other words, any problem which takes exponential time is, in practice, unsolvable.

Why is this true? There are, depending on your definition of "part of speech", about 20 parts of speech in English, so there are really 20^n patterns of n words to test. If you tested one grammar per second, it would take you about a month to learn all the five-word grammars. The six-word grammars would take you two years, and you would be forty before you learned all the seven-word grammars. That last sentence had 22 words, and it would take you about 10^21 years to test all of the 22-word grammars. The universe is only about 10^10 years old.
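The arithmetic is easy to check. This is a rough sketch, assuming 20 parts of speech and one test per second as above; the function name years_to_test and the 365.25-day year are my own choices.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def years_to_test(sentence_length, parts_of_speech=20, tests_per_second=1.0):
    """Years needed to test every pattern of the given length, one at a time."""
    pattern_count = parts_of_speech ** sentence_length
    return pattern_count / tests_per_second / SECONDS_PER_YEAR

for n in (5, 6, 7, 22):
    print(f"{n:>2} words: {years_to_test(n):.3g} years")
```

Five words comes out to about a tenth of a year, six to about two years, seven to about forty, and 22 to a little over 10^21.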

So who knows whether all languages are recursive. But it seems unlikely that human children consider all possible grammars equally. They must use some shortcuts and those shortcuts must, by definition, be innate.

A Simple Proof: Occam's Razor

How do you know that I'm not a robot? How do you know we're not living in the matrix?

The usual resolution is some form of Occam's razor: sure, it's possible that I'm a robot, but the simpler explanation is that I'm human, and simpler explanations are preferable.[1]

This just pushes the question back: why are simpler explanations better?

There is a straightforward proof that comes from Computer Science, of all places, which I hope to explain here.


Suppose I enter the world as a blank slate - I have a "bag" of hypotheses about how things work, and I consider them all equally probable. As I perform experiments, I disprove some of my hypotheses, while others remain. As time goes on, my bag of plausible hypotheses gets smaller and smaller.

If I eventually reach a point at which I only have two hypotheses remaining and I randomly choose one to believe, I'm 50% certain that I've got the right one. But if I randomly believe one out of a hundred possible hypotheses, I've almost certainly chosen wrong (i.e. I've probably selected a hypothesis that by luck happened to fit with all the observed data, even though it's in fact wrong).

Believe it or not, this concludes the proof.

If I have a simple hypothesis ("fire is hot") there's really only one other hypothesis that could be in my bag ("fire is not hot"), so I can rapidly determine which is the right one. If my hypothesis is complicated ("fire is hot, provided it's the first full moon of a year with zodiac symbol ...") there are tons of equally complex hypotheses, and some of them are bound to fit the data, so I'm unlikely to have chosen the right one.
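Here is a toy version of the bag in Python. It is my own illustration, not a formal proof: I assume a hypothesis is just a yes/no answer for each of some number of circumstances, so the bag holds every possible truth table. The name surviving_hypotheses and the choice of eight circumstances with three observed are hypothetical.

```python
from itertools import product

def surviving_hypotheses(num_cases, truth, observed):
    """Count truth tables over num_cases circumstances that agree
    with the real answers (truth) on every observed circumstance."""
    survivors = 0
    for table in product((False, True), repeat=num_cases):
        if all(table[i] == truth[i] for i in observed):
            survivors += 1
    return survivors

# Simple bag: one question ("is fire hot?"), one observation.
simple = surviving_hypotheses(1, truth=(True,), observed=[0])

# Complex bag: the answer may depend on 8 circumstances; I have seen 3.
complex_bag = surviving_hypotheses(8, truth=(True,) * 8, observed=[0, 1, 2])

print("simple: chance of picking the right one =", 1 / simple)
print("complex: chance of picking the right one =", 1 / complex_bag)
```

With the simple bag, one observation leaves a single survivor. With the complex bag, the same kind of evidence still leaves 32 hypotheses that fit everything I have seen, so a random pick is right only about 3% of the time.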


In my job, I spend some time in the back rooms at medical offices, which means I hear nurses complain about doctors, and doctors complain about patients. One conversation I had with a dietitian sticks in my memory: she was complaining about patients who expect the faddish, complicated dietary advice you hear on TV ("good" carbs, antioxidants, etc.), when all she does is give people a calorie target and recommend eating more fresh fruits and vegetables.

I told her to give her patients a brochure on Occam's razor. I doubt they've implemented my suggestion.

Postscript: This proof is a vague mishmash of the motivation for Bonferroni correction and VC theory. Any book on computational learning theory will have a better one, but you can see de Wolf's thesis for an explicit application of PAC learning to Occam's razor. You might also like my post on why you will never see an eight-sided snowflake.

  1. That's not true. The usual resolution is to ignore the problem.