Statistical NLP: Determine Language of Text (English, French, Italian, German example)

It is relatively simple to determine the language a text is written in. For this example, I collected a small corpus of text in each of English, French, Italian, and German. I counted the number of occurrences of each unique word in each language's corpus and rated a word as indicating a language if its count in that language was greater than its count in the other three languages combined.
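The rating rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code I used to build the data: the function name rate_hot_words, the whitespace tokenization, and the tiny toy corpora are all hypothetical.

```python
from collections import Counter

def rate_hot_words(corpora):
    """Return, per language, the words whose count in that language
    exceeds their combined count in all the other languages.

    corpora maps a language name to a list of word tokens
    (hypothetical input format, assumed for this sketch).
    """
    counts = {lang: Counter(words) for lang, words in corpora.items()}
    hot = {lang: {} for lang in corpora}
    for lang, counter in counts.items():
        for word, n in counter.items():
            # Counter returns 0 for words a language never uses.
            others = sum(counts[other][word] for other in counts if other != lang)
            if n > others:
                hot[lang][word] = n  # the weight is the raw count
    return hot

# Toy corpora, far too small for real use; they only exercise the rule.
corpora = {
    "english": "the cat and the dog in the house".split(),
    "french": "le chat et le chien de la maison".split(),
    "italian": "il gatto e il cane in casa".split(),
    "german": "die katze und der hund in dem haus".split(),
}
hot_words = rate_hot_words(corpora)
```

In this toy data, "in" appears once each in the English, Italian, and German samples, so it is not rated as a hot word for any language, while "the" is rated for English.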

Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example. Here is a sample of the weighted words, first for English and then for French:

the 29185
of 14812
to 14065
and 10908
in 4369
said 5418
for 5186
de 30183
la 7483
le 8205
et 10137
les 9617
des 8848
en 7527

To determine what language an input text is written in, load each of the text files in this ZIP file into a hash map (words are keys, weights are the map values). For each word in your input text, look it up in each of the four hash maps and add any weight you find to the score for that language. The language with the highest score "wins."
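Here is a minimal Python sketch of that lookup-and-score loop. Several things in it are assumptions on my part for illustration: the function names load_hot_words and detect_language, the "word weight" one-pair-per-line file layout (inferred from the sample above), and the lowercase whitespace tokenization, which does not strip punctuation. To keep the sketch self-contained, the usage below builds two small tables from the sample weights shown earlier instead of reading the files from the ZIP.

```python
def load_hot_words(path):
    """Load one 'word weight' pair per line into a dict (hash map).

    The file layout is assumed from the sample listing; lines that do
    not split into exactly two fields are skipped.
    """
    weights = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                weights[parts[0]] = int(parts[1])
    return weights

def detect_language(text, tables):
    """Score text against each language table and return (winner, scores)."""
    scores = {lang: 0 for lang in tables}
    for word in text.lower().split():
        for lang, weights in tables.items():
            # Words that are not hot words for a language add nothing.
            scores[lang] += weights.get(word, 0)
    return max(scores, key=scores.get), scores

# Tables built from the sample weights listed above (English and French only).
tables = {
    "english": {"the": 29185, "of": 14812, "to": 14065, "and": 10908,
                "in": 4369, "said": 5418, "for": 5186},
    "french": {"de": 30183, "la": 7483, "le": 8205, "et": 10137,
               "les": 9617, "des": 8848, "en": 7527},
}
winner_en, scores_en = detect_language("the cat said hello to the dog", tables)
winner_fr, scores_fr = detect_language("le chat et le chien de la maison", tables)
```

With these two tables, the first sentence scores highest for English and the second for French. If no word in the input matches any table, every score is zero and the sketch simply returns the first language, so a real application would want to treat that case as "unknown."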

This data is licensed for reuse under the LGPL. Attribution is appreciated.

Sample Ontology for news stories

News ontology

Note: I created this ontology in 2004 using the Protégé modeling tool.