It is relatively simple to determine which language a piece of text is written in. For this example, I collected a small corpus of text in English, French, Italian, and German. I counted the number of occurrences of each unique word in each language's corpus and rated a word as indicating a language if its count in that language was greater than its count in the other three languages combined.
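As a rough illustration of the rating rule just described, here is a minimal Python sketch. The function name `rate_hot_words`, the tiny made-up counts, and the choice to use the raw count as the word's weight are all my assumptions for illustration; the original corpus and the exact weighting used to build the data files are not reproduced here.

```python
from collections import Counter
from typing import Dict


def rate_hot_words(counts_by_lang: Dict[str, Counter]) -> Dict[str, Dict[str, int]]:
    """Keep a word for a language only if its count there is greater than
    its count in all the other languages combined.

    Assumption: the kept weight is simply the raw count in that language.
    """
    hot: Dict[str, Dict[str, int]] = {lang: {} for lang in counts_by_lang}
    for lang, counts in counts_by_lang.items():
        for word, n in counts.items():
            # Counter returns 0 for words a language never saw.
            others = sum(
                c[word] for other, c in counts_by_lang.items() if other != lang
            )
            if n > others:
                hot[lang][word] = n
    return hot


# Tiny made-up counts, only to exercise the rule (not real corpus data).
toy_counts = {
    "english": Counter({"the": 10, "in": 3, "a": 2}),
    "french": Counter({"la": 5, "le": 7}),
    "italian": Counter({"in": 4, "la": 6, "a": 2}),
    "german": Counter({"in": 2, "der": 9}),
}
toy_hot = rate_hot_words(toy_counts)
```

In this toy data, "in" appears in three languages and none of them outnumbers the other two combined, so it is not rated for any language, while "la" is rated for Italian only.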
Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example. Here is an example of weighted words for English and French:
English         French
the   29185     de   30183
of    14812     la    7483
to    14065     le    8205
and   10908     et   10137
in     4369     les   9617
said   5418     des   8848
for    5186     en    7527
To determine what language input text is written in, load each of the text files in this ZIP file into its own hash map (words are the keys, weightings are the values). For each word in your input text, look it up in each of the four hash maps and, where it is found, add its weighting to a running score for that language. The language with the highest score "wins."
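The lookup-and-score procedure can be sketched in Python as follows. This is only a sketch under stated assumptions: the helper names `parse_hot_words` and `detect_language` are mine, the "one word and one integer weight per line" file layout is a guess at the format of the files in the ZIP, and the inline tables hold only the English and French sample weights shown above rather than the full data for all four languages.

```python
import re
from typing import Dict, Iterable, Tuple


def parse_hot_words(lines: Iterable[str]) -> Dict[str, int]:
    """Build a word -> weight map from lines of the form 'word weight'.

    Assumption: one whitespace-separated word/weight pair per line.
    """
    weights: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, weight = parts
        try:
            weights[word.lower()] = int(weight)
        except ValueError:
            continue  # skip lines whose weight is not an integer
    return weights


def detect_language(
    text: str, tables: Dict[str, Dict[str, int]]
) -> Tuple[str, Dict[str, int]]:
    """Sum the weights of known words per language; highest total wins.

    If no word matches any table, every score is 0 and the first
    language in `tables` is returned.
    """
    scores = {lang: 0 for lang in tables}
    # Split the text into runs of letters (Unicode-aware), lowercased.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, weights in tables.items():
            scores[lang] += weights.get(word, 0)
    return max(scores, key=scores.get), scores


# Sample weights copied from the English/French example above.
TABLES = {
    "english": parse_hot_words(
        ["the 29185", "of 14812", "to 14065", "and 10908",
         "in 4369", "said 5418", "for 5186"]
    ),
    "french": parse_hot_words(
        ["de 30183", "la 7483", "le 8205", "et 10137",
         "les 9617", "des 8848", "en 7527"]
    ),
}

best_en, scores_en = detect_language("The cat sat in the garden", TABLES)
best_fr, scores_fr = detect_language("Le chat est dans le jardin et la maison", TABLES)
```

With the real data you would call `parse_hot_words` once per file (for example on an open file handle) and pass all four resulting maps to `detect_language`.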
This data is licensed for reuse under the LGPL. Attribution is appreciated.
Note: I created this ontology in 2004 using the Protégé modeling tool.