The Philadelphia Inquirer offers a nice roundup of the Words of the Year for 2008. My favorites: likeable enough, palling around, toxic assets, shovel ready, Auto-Tune, bleeping, and Philly's own new rallying cry, why can't us?
linguistics
Words of the year
SJSU courses in language technology
From 2003 until 2008 I was a part-time faculty member in the Linguistics Department at San Jose State University, where I mainly taught courses in language technology (speech and language processing). Occasionally I also got the chance to teach Introduction to Linguistics, as well as a general education course called "Sound and Communication" that covered animal communication as well as human language.
Lecture notes for some of my courses are available here:
The Stuff of Thought
Steven Pinker's fascinating new book, The Stuff of Thought, is about conceptual semantics and what modern linguistics reveals about how human beings think. The book is in some sense an integration of his previous books The Language Instinct, Words and Rules, and How the Mind Works.
Human thought, Pinker argues, is built around certain primitive concepts, including space, force, dominance, agency, animacy, sex, and contamination. In the most interesting chapters he shows how our human conceptions of space, time, and matter are reflected in linguistic features like tense, aspect, and the count/mass distinction. The relatively recent research results of Beth Levin and her colleagues in the area of lexical semantics, summarized in Chapter 2, are particularly illuminating, as they reveal how seemingly random variations in verb subcategorization patterns actually reflect deep, underlying conceptual schemas in the mind.
In the final chapters Pinker offers the optimistic conclusion that we need not be permanently shackled by our limited primate brains; scientific progress relies on our remarkable ability to extend our knowledge to new domains through the use of metaphor, analogy, and linguistic combinatorics. The goal of education, Pinker concludes, is to make up for the shortcomings in our instinctive ways of thinking about the physical and social world.
Speech and Language Processing, 2nd edition
I finally have my hands on the long-awaited second edition of Jurafsky & Martin's Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Before it arrived I'd been handing out draft preprints of the manuscript to students in my SJSU courses.
As the title suggests, this is a big book (just under a thousand pages) and unbelievably comprehensive. On the whole, the book is a major improvement over its predecessor. The first edition was plagued with typos on seemingly every page, and was also way too thin in certain places. I seem to remember them rushing through phonetics in a single page or two, and describing optimality theory in just a couple of sentences! The second edition's coverage of the field is significantly broader and deeper. Phonetics now gets a good 15 pages. The typos are gone and the appearance of the book is also much improved, with nice-looking black-and-white diagrams on nearly every page.
I have one pedagogical quibble with the new edition. The first edition introduced readers to the Bayesian noisy channel model by applying it to the problem of spelling correction, as implemented in the classic paper by Kernighan et al. Because noisy channel spelling correction is so fiendishly simple, and the paper is so readable, this was the perfect way to introduce a student to Bayesian models of language. In the second edition, however, the authors decided to jump straight into noisy channel POS tagging, a much more challenging topic, and to relegate spelling correction to an Advanced (?) section at the end of Chapter 5. This strikes me as a big mistake; they really should have started with spelling correction and then moved to tagging.
Quibbles aside, this book is a spectacular achievement. The first edition of Speech and Language Processing was a breathtaking synthesis of material, and it helped to unify the field of language technology, despite its flaws. This greatly updated second edition is a big improvement and will be the standard text in the field for years to come.
Japanese particles in newswire: a mystery
Recently I've been playing around with Japanese newspaper corpora and with the MeCab morphological analyzer. In one experiment, I tagged about 100 million tokens of newspaper text and looked at the mean proportions of nouns, verbs, and particles by sentence length. Here are the results:

I was curious why the relative proportion of particles increases with sentence length. For example, sentences of 11 tokens or more contain about 25% particles on average, whereas 8-token sentences contain about 22% particles, and 6-token sentences average only 18% particles.
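For anyone who wants to reproduce this kind of tabulation, here is a minimal sketch in Python. It is not the script I actually used: it assumes the corpus has already been tagged and reduced to lists of (token, tag) pairs, and the function name, the coarse tag labels "noun", "verb", and "particle", and the toy sentences are all made up for illustration.

```python
from collections import defaultdict

def mean_pos_proportions(tagged_sentences, tags=("noun", "verb", "particle")):
    """Map sentence length -> {tag: mean per-sentence proportion of that tag}."""
    sums = defaultdict(lambda: defaultdict(float))  # length -> tag -> summed proportions
    counts = defaultdict(int)                       # length -> number of sentences
    for sentence in tagged_sentences:
        n = len(sentence)
        if n == 0:
            continue  # skip empty sentences
        counts[n] += 1
        for tag in tags:
            hits = sum(1 for _, pos in sentence if pos == tag)
            sums[n][tag] += hits / n
    return {n: {tag: sums[n][tag] / counts[n] for tag in tags} for n in counts}

# Toy input (hypothetical, romanized): two 3-token sentences and one 4-token sentence.
toy = [
    [("neko", "noun"), ("ga", "particle"), ("neru", "verb")],
    [("inu", "noun"), ("mo", "particle"), ("hashiru", "verb")],
    [("ame", "noun"), ("ga", "particle"), ("furu", "verb"), ("yo", "particle")],
]
result = mean_pos_proportions(toy)
for length in sorted(result):
    print(length, result[length])
```

The proportion is computed per sentence and then averaged within each length bucket, which matches "mean proportions by sentence length"; pooling all tokens of a given length would give the same numbers here, since every sentence in a bucket has the same length.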
After a little thought the explanation came to me. Given that some particles serve to connect clauses, one would expect the proportion of particles to rise. Suppose that all Japanese clauses were of the form N-p V. Then sentences would always be 1/3 nouns, 1/3 verbs, and 1/3 particles, independent of the number of clauses. Now suppose that every clause has a discourse particle connecting it to the following clause: for example, N-p V-p N-p V. Then a sentence containing n clauses would have n verbs and n nouns, but 2n-1 particles. The proportion of particles increases with sentence length (approaching 1/2 in this hypothetical example).
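The arithmetic in this hypothetical is easy to check with a few lines of Python. This is only a sketch of the toy model above, not of real Japanese; the function name and the `connected` flag are my own labels.

```python
from fractions import Fraction

def particle_proportion(n_clauses, connected=True):
    """Proportion of particles in a sentence of n clauses of the form N-p V.

    If connected is True, every clause except the last also carries a
    clause-connecting particle (N-p V-p N-p V ...).
    """
    nouns = verbs = n_clauses
    particles = n_clauses + (n_clauses - 1 if connected else 0)  # 2n-1 or n
    return Fraction(particles, nouns + verbs + particles)

for n in (1, 2, 5, 50):
    p = particle_proportion(n)
    print(n, p, round(float(p), 3))
```

With connecting particles the proportion is (2n-1)/(4n-1), which starts at 1/3 for a single clause and climbs toward 1/2 as n grows; without them it stays at 1/3 regardless of n.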
Here are the data plotted by particle type:

Some observations:
- The rise in conjunctions, or at least the clause conjunctions, is explained by the reasoning above. However, conjunctive particles include both noun and verb particles; it would be good to separate them.
- No surprise that the proportion of sentence-final particles declines, since they are sentence-final, not clause-final.
- Also no surprise that ha (and perhaps the other adverbial particles too?) declines, since topics typically scope over sentences, not clauses.
- But why does the proportion of case particles increase? I can't think of an explanation for this.
Here's a breakdown by specific case particle:
