hacks

About those quotations...

Drupal lets you specify a "mission statement" to be displayed on the main page of your web site. I like this idea, but I couldn't think up a clever enough "mission" for myself (although I was tempted to use "Building a fighting force of extraordinary magnitude").

In the end I decided to make the server execute the venerable Unix fortune program to generate a random "mission statement" every time the main page is reloaded.

The fortune database contains hundreds of quotations, aphorisms, and adages. None of them were written by me, and none are under my control, so the usual disclaimers apply.

Here's the relevant modification to the Drupal file page.tpl.php:

<?php if ($mission) { ?>
  <div id="mission">
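  <?php
     // Run fortune in short mode (-s); exec() stores one output line per array element.
     exec('/usr/games/fortune -s', $fortune);
     // Escape each line for HTML and join the lines with spaces.
     foreach ($fortune as $line) { print htmlspecialchars($line)." "; }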
  ?></div><?php } ?>

Serving Drupal content as application/xhtml+xml

Drupal, like virtually all other CMSs (and virtually all web developers, for that matter), declares an XHTML doctype but then sets the MIME type to text/html rather than application/xhtml+xml. Arguments by Ian Hickson and Anne van Kesteren have convinced me that this is a Bad Idea. They suggest that most sites should eschew XHTML and stick to HTML.

I prefer XHTML. I'd rather keep Drupal's XHTML and serve it up as application/xhtml+xml as the W3C recommends. Of course, Internet Exploder doesn't support XHTML (and perhaps never will), which is why everyone had to resort to that text/html hack in the first place.

In any case, it's straightforward to hack Drupal's includes/common.inc to sniff the capabilities of the user agent on the server side and then set the most appropriate content type. Here's how I did it:

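$content_type = "text/html";  /* default */
$preferred_types = array("application/xhtml+xml", "application/xml", "text/xml");
/* Some clients send no Accept header at all, so avoid an undefined index. */
$http_accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
foreach ($preferred_types as $type) {
  /* Case-insensitive substring match; q-values in the header are ignored. */
  if (stristr($http_accept, $type)) {
    $content_type = $type;
    break;
  }
}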

Then interpolate $content_type into the Content-Type header that Drupal sends. Note that this only works with Drupal's page caching turned off, because a page served from the cache bypasses this check.
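One limitation of a plain substring check is that it accepts a type even when the browser lists it with q=0, which means "not acceptable". The following is a minimal sketch of a q-value-aware version of the same selection logic, written in Ruby rather than PHP purely to illustrate the idea. The names parse_accept and choose_content_type are hypothetical, and wildcards such as */* are deliberately ignored so that browsers that only send */* still get text/html.

```ruby
# Preferred types, in the same order as the PHP snippet.
PREFERRED_TYPES = ["application/xhtml+xml", "application/xml", "text/xml"].freeze

# Parse an Accept header into a hash of { media type => q-value }.
# Entries without an explicit q parameter default to 1.0.
def parse_accept(header)
  header.to_s.split(",").each_with_object({}) do |entry, accepted|
    type, *params = entry.strip.downcase.split(/\s*;\s*/)
    next if type.nil? || type.empty?

    q = params.map { |param| param[/\Aq=([\d.]+)\z/, 1] }.compact.first
    accepted[type] = q ? q.to_f : 1.0
  end
end

# Return the first preferred type the client explicitly accepts with q > 0,
# falling back to text/html (wildcards are not treated as a match).
def choose_content_type(accept_header)
  accepted = parse_accept(accept_header)
  PREFERRED_TYPES.find { |type| accepted.fetch(type, 0.0) > 0.0 } || "text/html"
end

if __FILE__ == $PROGRAM_NAME
  puts choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
  puts choose_content_type("*/*")
end
```

A response that varies by Accept header should normally also send "Vary: Accept" so that intermediate caches keep the variants apart.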

Stochastic part-of-speech tagger in Ruby

I wrote a simple bigram POS tagger in Ruby. The program runs slower than molasses, but it illustrates some basic concepts from statistical NLP, such as Viterbi decoding and Good-Turing smoothing. It scores about 94% tagging accuracy on the Penn Treebank, and could easily score above 95% with some simple morphological checking of unknown words (instead of just treating them as nouns by default).
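To give a flavor of Viterbi decoding over a bigram model, here is a minimal toy sketch. It is not the tagger described above: the three-tag tagset and every probability are invented for illustration, Good-Turing smoothing is replaced by a fixed floor value, and unknown words are simply treated as nouns.

```ruby
# Toy bigram HMM tagger decoded with the Viterbi algorithm.
# All tags and probabilities below are made up for illustration.
TAGS = %w[DT NN VB].freeze

# P(tag | previous tag); "<s>" marks the start of the sentence.
TRANSITION = {
  "<s>" => { "DT" => 0.6,  "NN" => 0.3, "VB" => 0.1 },
  "DT"  => { "DT" => 0.05, "NN" => 0.9, "VB" => 0.05 },
  "NN"  => { "DT" => 0.1,  "NN" => 0.3, "VB" => 0.6 },
  "VB"  => { "DT" => 0.5,  "NN" => 0.4, "VB" => 0.1 }
}.freeze

# P(word | tag)
EMISSION = {
  "DT" => { "the" => 0.7, "a" => 0.3 },
  "NN" => { "dog" => 0.4, "walk" => 0.2, "park" => 0.4 },
  "VB" => { "walk" => 0.5, "barks" => 0.5 }
}.freeze

FLOOR = 1e-6 # stand-in for real smoothing

def known_word?(word)
  EMISSION.values.any? { |words| words.key?(word) }
end

# Log emission probability; unknown words are treated as nouns by default.
def emission_logprob(tag, word)
  return Math.log(tag == "NN" ? 1.0 : FLOOR) unless known_word?(word)

  Math.log(EMISSION[tag].fetch(word, FLOOR))
end

# Return the most probable tag sequence for an array of words.
def viterbi(words)
  return [] if words.empty?

  # best[i][tag] = [log-prob of the best path ending in tag at word i, backpointer]
  best = []
  words.each_with_index do |word, i|
    column = {}
    TAGS.each do |tag|
      emit = emission_logprob(tag, word)
      if i.zero?
        column[tag] = [Math.log(TRANSITION["<s>"][tag]) + emit, nil]
      else
        prev_tag, score = TAGS
          .map { |prev| [prev, best[i - 1][prev][0] + Math.log(TRANSITION[prev][tag])] }
          .max_by { |_, s| s }
        column[tag] = [score + emit, prev_tag]
      end
    end
    best << column
  end

  # Follow the backpointers from the best final tag.
  tag = TAGS.max_by { |t| best.last[t][0] }
  path = [tag]
  (words.length - 1).downto(1) do |i|
    tag = best[i][tag][1]
    path.unshift(tag)
  end
  path
end

if __FILE__ == $PROGRAM_NAME
  puts viterbi(%w[the dog barks]).join(" ") # prints "DT NN VB"
end
```

Note how "walk" comes out as a noun after "the" but as a verb after "dog": the bigram transition probabilities resolve the ambiguity.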

Nuance batchrec analysis

My Nuance batchrec analysis script is a tool for developers who use the Nuance speech recognition engine. The tool helps you test grammars and optimize parameter settings in order to improve recognition performance.

Specifically, the script reads in the results of a Nuance batch recognition run, calculates Word Error Rate (WER) using the sclite package, and stores the results in a database for subsequent analysis. Once the results of several batch runs have been entered into the database, you can write SQL queries to compare them. For example, you might want to find out which Nuance grammar or parameter settings resulted in the lowest overall WER, or produced the most accurate semantic slots, or ran the fastest.
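For reference, WER is the word-level edit distance between the reference transcript and the recognizer's hypothesis (substitutions plus deletions plus insertions), divided by the number of reference words. The sketch below computes it directly in Ruby instead of calling sclite, so sclite's own alignment and scoring options are not reproduced. The function name and the sample utterances are made up for illustration.

```ruby
# Word Error Rate = (substitutions + deletions + insertions) / reference length,
# computed with a standard dynamic-programming edit distance over words.
def word_error_rate(reference, hypothesis)
  ref = reference.split
  hyp = hypothesis.split
  raise ArgumentError, "reference must not be empty" if ref.empty?

  # previous[j] = distance between the reference words seen so far
  # (excluding the current one) and the first j hypothesis words.
  previous = (0..hyp.length).to_a
  ref.each_with_index do |ref_word, i|
    current = [i + 1]
    hyp.each_with_index do |hyp_word, j|
      cost = ref_word == hyp_word ? 0 : 1
      current << [
        previous[j] + cost,  # match or substitution
        previous[j + 1] + 1, # deletion (reference word missing from hypothesis)
        current[j] + 1       # insertion (extra hypothesis word)
      ].min
    end
    previous = current
  end

  previous.last.to_f / ref.length
end

if __FILE__ == $PROGRAM_NAME
  # One substitution (john -> jon) and one deletion (at): 2 errors / 5 words.
  puts word_error_rate("call john smith at home", "call jon smith home") # prints 0.4
end
```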

Parallel Japanese-English corpus

I've assembled a large parallel Japanese-English corpus in the domain of technology news. The parallel news stories are collected from RSS feeds, and the corpus is updated daily.

Because material on the Web is subject to copyright restrictions, I cannot distribute the news stories directly. Instead, I've published a list of URL pairs, so that you can download the stories yourself for personal use. The format is numbered, tab-separated parallel URLs, the same as the STRAND Bilingual Databases. The STRAND software for retrieving the data from the URLs also works on my data.
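If you would rather read the list with your own code, a minimal Ruby sketch follows. It assumes that each line holds a numeric ID and two URLs separated by tabs; that exact column layout is an assumption based on the description above, so check it against the actual file. The UrlPair struct, the parse_url_pairs name, and the example.com URLs are hypothetical.

```ruby
# One entry of the list: a numeric ID plus the two parallel URLs.
# Which column is Japanese and which is English is not assumed here.
UrlPair = Struct.new(:id, :first_url, :second_url)

# Parse the (already gunzipped) list text, skipping blank or malformed lines.
def parse_url_pairs(text)
  text.each_line.map do |line|
    fields = line.chomp.split("\t")
    next unless fields.length == 3 && fields[0].match?(/\A\d+\z/)

    UrlPair.new(fields[0].to_i, fields[1], fields[2])
  end.compact
end

if __FILE__ == $PROGRAM_NAME
  sample = "1\thttp://example.com/ja/story1\thttp://example.com/en/story1\n" \
           "2\thttp://example.com/ja/story2\thttp://example.com/en/story2\n"
  parse_url_pairs(sample).each do |pair|
    puts "#{pair.id}: #{pair.first_url} <-> #{pair.second_url}"
  end
end
```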

The list published here is updated weekly by a cron job. You're free to use this list for any purpose as long as you accept that it comes with NO WARRANTY OF ANY KIND.

The method used to assemble the corpus is described in the following paper:

Fry, John. Assembling a parallel corpus from RSS news feeds, in Proceedings of the Workshop on Example-Based Machine Translation, MT Summit X, Phuket, Thailand, September 2005.  [PDF]

If you use the list for published research, please cite the above paper, rather than this web page.

List of parallel Japanese-English URLs [gzipped]

UPDATE JULY 2006: Sorry, I am no longer maintaining or updating this corpus, since I'm no longer working in the field of machine translation!