Traditional knowledge management tools relied on structured data, often stored in relational databases. Adding new relations to this data required changing the schemas used to store it, which could negatively impact existing systems that used that data. Relationships between data in traditional systems were predefined by the structure/schema of the stored data. With RDF and OWL based data modeling, relationships are explicitly defined in the data itself. Semantic data is inherently flexible and extensible: adding new data and relationships is less likely to break older systems that relied on previous versions of the data.
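
To make the extensibility point concrete, here is a minimal sketch in plain Python. It is not real RDF tooling: it only mimics the subject/predicate/object shape of RDF statements with tuples, and every identifier in it (ex:story1, ex:title, ex:mentions, and so on) is hypothetical.

```python
# Each fact is a (subject, predicate, object) triple, loosely modeled on RDF.
# All names here are made up for illustration.
triples = [
    ("ex:story1", "ex:title", "Markets rally"),
    ("ex:story1", "ex:author", "ex:jane"),
]


def objects(store, subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


def story_title(store, story):
    """An 'older' consumer that only knows about the ex:title relationship."""
    titles = objects(store, story, "ex:title")
    return titles[0] if titles else None


before = story_title(triples, "ex:story1")

# Add a new kind of relationship. No schema is altered: the relationship
# is just another statement in the data.
triples.append(("ex:story1", "ex:mentions", "ex:acme_corp"))

after = story_title(triples, "ex:story1")
```

The older consumer returns the same title before and after the new ex:mentions statement is added, because it simply ignores predicates it does not know about.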

A complementary technology for knowledge management is the automated processing of unstructured text data into semantic data using natural language processing (NLP) and statistics-based text analytics.

Statistical NLP: Determine Language of Text (English, French, Italian, German example)

It is relatively simple to determine the language that text is written in. For this example, I collected a small corpus of text in English, French, Italian, and German. I counted the number of occurrences of each unique word in each language corpus and rated a word as indicating a language if its number of occurrences in that language was greater than its occurrences in the other three languages combined.
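
The rating step described above can be sketched roughly as follows. This is not the original code used to build the word lists. The function names, the tokenizer, and the choice to use the raw occurrence count as a word's weight are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters (an assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rate_hot_words(corpora):
    """Map each language to its 'hot words'.

    corpora: dict of language name -> corpus text.
    A word is hot for a language when it occurs there more often than in
    all of the other languages combined. The weight stored for each hot
    word is its count in that language (an assumption).
    """
    counts = {lang: Counter(tokenize(text)) for lang, text in corpora.items()}
    hot = {lang: {} for lang in corpora}
    for lang, counter in counts.items():
        for word, n in counter.items():
            # A Counter returns 0 for words it has never seen.
            others = sum(c[word] for other, c in counts.items() if other != lang)
            if n > others:
                hot[lang][word] = n
    return hot


# Tiny made-up corpora; a real corpus would be far larger.
sample_corpora = {
    "english": "the cat and the dog",
    "french": "le chat et la souris et le chien",
    "italian": "il gatto e la casa e il cane",
    "german": "der hund und die katze und die maus",
}
sample_hot_words = rate_hot_words(sample_corpora)
```

In the sample data, "la" occurs once in French and once in Italian, so it is not rated as a hot word for either language.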

Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example.

To determine what language input text is written in, load each of the text files in this ZIP file into a hash map (words are keys, weightings are the map values). For each word in your input text, look it up in each of the four hash maps and accumulate a score for each of the four example languages. The language with the highest score "wins."
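
A minimal sketch of this lookup-and-score procedure follows. The file format is an assumption: each line is taken to hold a word followed by a numeric weight, separated by whitespace, which may not match the files in the ZIP archive. The function names and the sample weights are hypothetical.

```python
import re


def parse_hot_words(lines):
    """Build a word -> weight map from lines of the assumed form 'word weight'."""
    weights = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        try:
            # A line with only a word gets a default weight of 1.0.
            weight = float(parts[1]) if len(parts) > 1 else 1.0
        except ValueError:
            continue  # skip lines whose weight is not a number
        weights[parts[0].lower()] = weight
    return weights


def score_languages(text, hot_words):
    """Accumulate one score per language over every word in the text."""
    scores = {lang: 0.0 for lang in hot_words}
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for lang, weights in hot_words.items():
            scores[lang] += weights.get(word, 0.0)
    return scores


def detect_language(text, hot_words):
    """Return the highest-scoring language, or None if no hot word matched."""
    scores = score_languages(text, hot_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# In practice each list of lines would come from one file in the archive,
# for example: parse_hot_words(open(path, encoding="utf-8")).
# The weights below are invented for the example.
hot_words = {
    "english": parse_hot_words(["the 12", "and 7"]),
    "french": parse_hot_words(["le 11", "et 9"]),
    "italian": parse_hot_words(["il 10", "e 6"]),
    "german": parse_hot_words(["der 9", "und 8", "die 10"]),
}
guess = detect_language("Der Hund und die Katze", hot_words)
```

Returning None when no hot word matches is a design choice of this sketch, so that text in an unsupported language is not silently assigned to whichever language happens to be listed first.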

Sample Ontology for news stories

News ontology

Note: this ontology was created in 2004 using the Protege modeling tool.

