It is relatively simple to determine which language a piece of text is written in. For this example, I collected a small corpus of text in English, French, Italian, and German. I counted the occurrences of each unique word in each language's corpus and rated a word as indicating a language if its count in that language was greater than its count in the other three languages combined.
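The rating rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code used to build the word lists: the function name rate_hot_words is hypothetical, words are split on whitespace and lowercased, and the raw count is kept as the weight.

```python
from collections import Counter


def rate_hot_words(corpora):
    """Return {language: {word: count}} for words that indicate a language.

    corpora maps a language name to its corpus text (hypothetical input shape).
    A word is kept for a language only if its count there is greater than
    its count in all the other languages combined.
    """
    # Count word occurrences per language (naive whitespace tokenization).
    counts = {lang: Counter(text.lower().split()) for lang, text in corpora.items()}

    hot = {lang: {} for lang in corpora}
    for lang, counter in counts.items():
        for word, n in counter.items():
            # A Counter returns 0 for words it has never seen.
            others = sum(counts[other][word] for other in counts if other != lang)
            if n > others:
                hot[lang][word] = n
    return hot
```

Calling rate_hot_words with a dictionary of four corpus strings returns one word-to-count map per language. Words shared fairly evenly between languages drop out of every map.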
Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example. The following two lines show some of the weighted words for English (first line) and French (second line):
the 29185 of 14812 to 14065 and 10908 in 4369 said 5418 for 5186
de 30183 la 7483 le 8205 et 10137 les 9617 des 8848 en 7527
To determine which language input text is written in, load each of the text files in this ZIP file into a hash map (words are the keys, weights are the values). For each word in your input text, look it up in each of the four hash maps and add its weight to a running score for the corresponding language. The language with the highest score "wins."
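The lookup-and-score procedure above might look like this in Python. Treat it as a sketch under assumptions. It assumes each file holds alternating word and weight tokens separated by whitespace, as in the sample lines shown earlier. The names parse_weights and detect_language are hypothetical, and the two sample strings stand in for the contents of the real files. Punctuation is not stripped, which a real implementation would probably want to handle.

```python
# Sample data copied from the English and French example lines above;
# in practice each string would be read from one of the files in the ZIP.
ENGLISH_SAMPLE = "the 29185 of 14812 to 14065 and 10908 in 4369 said 5418 for 5186"
FRENCH_SAMPLE = "de 30183 la 7483 le 8205 et 10137 les 9617 des 8848 en 7527"


def parse_weights(text):
    """Turn 'word weight word weight ...' into a {word: weight} map."""
    tokens = text.split()
    return {word: int(weight) for word, weight in zip(tokens[0::2], tokens[1::2])}


def detect_language(text, weights_by_language):
    """Return (best_language, scores) for the given input text."""
    scores = {lang: 0 for lang in weights_by_language}
    for word in text.lower().split():
        for lang, weights in weights_by_language.items():
            # Words missing from a language's map contribute nothing.
            scores[lang] += weights.get(word, 0)
    # On a tie, max() returns the first language in the dictionary.
    best = max(scores, key=scores.get)
    return best, scores


weights = {
    "english": parse_weights(ENGLISH_SAMPLE),
    "french": parse_weights(FRENCH_SAMPLE),
}
best, scores = detect_language("the house of the king", weights)
print(best, scores)
```

With all four word lists loaded, the same call scores English, French, Italian, and German together. If no word in the input appears in any list, every score is zero and the result is not meaningful.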
This data is licensed for reuse under the LGPL. Attribution is appreciated.
Note: I created this ontology in 2004 using the Protege modeling tool.