Statistical NLP: Determine Language of Text (English, French, Italian, German example)

It is relatively simple to determine the language in which a text is written. For this example, I collected a small corpus of text in English, French, Italian, and German. I counted the occurrences of each unique word in each language's corpus and rated a word as indicating a language if its number of occurrences in that language was greater than its occurrences in the other three languages combined.
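
The rating rule above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code I used to build the lists: the function names `count_words` and `hot_words` are hypothetical, the tokenizer is a simple lowercase letter-run split, and I assume a word's weight is simply its raw count in the language it indicates.

```python
import re
from collections import Counter


def count_words(text):
    """Count occurrences of each lowercase word (runs of letters) in a text."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def hot_words(corpora):
    """Rate words per language.

    corpora maps a language name to its corpus text. A word is kept for a
    language only if it occurs there more often than in all the other
    languages combined. The weight is the raw count (an assumption).
    """
    counts = {lang: count_words(text) for lang, text in corpora.items()}
    rated = {}
    for lang, lang_counts in counts.items():
        rated[lang] = {}
        for word, n in lang_counts.items():
            # A Counter returns 0 for words it has never seen.
            others = sum(c[word] for other, c in counts.items() if other != lang)
            if n > others:
                rated[lang][word] = n
    return rated


# Tiny made-up corpora, just to exercise the rule.
corpora = {
    "english": "the cat and the dog",
    "french": "le chat et le chien the",
}
rated = hot_words(corpora)
```

With these toy corpora, "the" occurs twice in the English text and once in the French text, so it is rated as an English word and dropped from the French list.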

Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example. Below is a sample of the weighted words, first for English and then for French:

	the 29185
	of 14812
	to 14065
	and 10908
	in 4369
	said 5418
	for 5186

	de 30183
	la 7483
	le 8205
	et 10137
	les 9617
	des 8848
	en 7527

To determine what language an input text is written in, load each of the text files in this ZIP file into a hash map (words are the keys, weightings are the values). For each word in your input text, look it up in each of the four hash maps and add any weighting you find to a running score for that language. The language with the highest score "wins."
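
Here is a minimal Python sketch of that lookup-and-score loop. It makes some assumptions: each file is taken to hold one "word weight" pair per line, as in the sample above, and the names `parse_hot_words` and `detect_language` are hypothetical. For brevity it uses only the English and French sample words shown earlier instead of the full files.

```python
import re


def parse_hot_words(lines):
    """Build a word -> weight map from 'word weight' lines.

    Accepts any iterable of strings, including an open file. Lines that do
    not look like a word followed by an integer are skipped.
    """
    weights = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            weights[parts[0]] = int(parts[1])
    return weights


def detect_language(text, tables):
    """Return (best_language, scores) for the input text.

    tables maps a language name to its word -> weight map. Each input word
    adds its weight to the score of every language that lists it.
    """
    scores = {lang: 0 for lang in tables}
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, weights in tables.items():
            scores[lang] += weights.get(word, 0)
    return max(scores, key=scores.get), scores


# Sample weights copied from the excerpt above; the real files are larger.
tables = {
    "english": parse_hot_words(
        ["the 29185", "of 14812", "to 14065", "and 10908",
         "in 4369", "said 5418", "for 5186"]),
    "french": parse_hot_words(
        ["de 30183", "la 7483", "le 8205", "et 10137",
         "les 9617", "des 8848", "en 7527"]),
}

best_en, scores_en = detect_language("The cat sat in the garden", tables)
best_fr, scores_fr = detect_language("Le chat est dans la maison", tables)
```

If no input word appears in any map, every score is zero and `max` simply returns the first language, so a real program may want to treat an all-zero result as "unknown."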

This data is licensed for reuse under the LGPL. Attribution is appreciated.

Sample Ontology for news stories

News ontology

Note: I created this ontology in 2004 using the Protégé modeling tool.