a sole proprietorship company owned by Mark Watson as a vehicle for developing and monetizing Knowledge Management, Artificial Intelligence (AI), NLP, and Semantic Web technologies.

Traditional knowledge management tools relied on structured data, often stored in relational databases. Adding new relations to this data required changing the schemas used to store it, which could negatively impact existing systems that used that data. Relationships between data in traditional systems were predefined by the structure/schema of the stored data. With RDF and OWL based data modeling, relationships are explicitly defined in the data itself. Semantic data is inherently flexible and extensible: adding new data and relationships is less likely to break older systems that relied on previous versions of the data.
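As a rough illustration of the point above, here is a minimal sketch that stores facts as (subject, predicate, object) triples in the style of RDF, using plain Python tuples rather than a real RDF library. All identifiers (`ex:article1`, `ex:mentions`, `ex:headquarteredIn`, and the helper functions) are made up for this example.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All names below are hypothetical and exist only for illustration.
triples = {
    ("ex:article1", "ex:title", "Markets rally"),
    ("ex:article1", "ex:mentions", "ex:AcmeCorp"),
}

def objects(store, subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}

def titles(store, subject):
    """An 'older' consumer that only knows about the ex:title relation."""
    return objects(store, subject, "ex:title")

before = titles(triples, "ex:article1")

# Add a new kind of relationship. No schema change is involved:
# the relation is just another triple in the data.
triples.add(("ex:AcmeCorp", "ex:headquarteredIn", "ex:Chicago"))

after = titles(triples, "ex:article1")
```

The older `titles` consumer returns the same result before and after the new relation is added, because it only looks at the predicates it already understands.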

A complementary technology for knowledge management is the automated processing of unstructured text into semantic data using natural language processing (NLP) and statistics-based text analytics.

We will help you integrate semantic web and text analytics technologies into your organization by working with your staff in a mentoring role and by helping as needed with initial development, all at reasonable consulting rates.

Technologies:


Statistical NLP: Determine Language of Text (English, French, Italian, German example)

It is relatively simple to determine the language in which a text is written. For this example, I collected a small corpus of text in English, French, Italian, and German. I counted the number of occurrences of each unique word in each language corpus and rated a word as indicating a language if its number of occurrences in that language was greater than its occurrences in the other three languages combined.
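The rating step described above might be sketched roughly as follows. This is an assumption-laden illustration, not the code used to build the data: the function names, the whitespace tokenization, the use of the raw count as the weight, and the tiny corpora are all made up for the example.

```python
from collections import Counter

def count_words(text):
    """Lowercase the text and count whitespace-separated words."""
    return Counter(text.lower().split())

def rate_hot_words(corpora):
    """corpora maps a language name to its corpus text.

    Returns {language: {word: count}}, keeping only words whose count in
    that language exceeds their combined count in all other languages.
    Using the raw count as the weight is an assumption of this sketch.
    """
    counts = {lang: count_words(text) for lang, text in corpora.items()}
    hot = {}
    for lang, lang_counts in counts.items():
        hot[lang] = {}
        for word, n in lang_counts.items():
            # Counter returns 0 for words it has never seen.
            others = sum(c[word] for other, c in counts.items() if other != lang)
            if n > others:
                hot[lang][word] = n
    return hot

# Tiny made-up corpora, for illustration only.
corpora = {
    "English": "the cat and the dog",
    "French": "le chat et le chien",
    "Italian": "il gatto e il cane",
    "German": "die katze und der hund",
}
hot_words = rate_hot_words(corpora)
```

With a realistic corpus, words shared across languages are filtered out because their count in any one language rarely exceeds the combined count in the others.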

Here is a ZIP file containing the rated 'hot words' for each of the four languages that I use in this example.

To determine what language input text is written in, load each of the text files in this ZIP file into a hash map (words are keys, the weighting is the map value). For each word in your input text, look it up in each of the four hash maps and accumulate a score for each of the four example languages. The language with the highest score "wins."
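A minimal sketch of the lookup-and-score procedure described above follows. The file format of the hot word lists is not specified here, so the "word weight" line format, the function names, and the sample weights are assumptions made for illustration.

```python
def load_hot_words(lines):
    """Parse 'word weight' lines into a dict (the line format is assumed)."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[parts[0].lower()] = float(parts[1])
    return table

def detect_language(text, tables):
    """Sum the weights of matching words per language; highest score wins."""
    scores = {lang: 0.0 for lang in tables}
    for word in text.lower().split():
        for lang, table in tables.items():
            scores[lang] += table.get(word, 0.0)
    best = max(scores, key=scores.get)
    return best, scores

# Made-up weights standing in for the contents of the four text files.
tables = {
    "English": load_hot_words(["the 10", "and 7"]),
    "French": load_hot_words(["le 9", "et 6"]),
    "Italian": load_hot_words(["il 8", "e 5"]),
    "German": load_hot_words(["der 9", "und 7"]),
}
language, scores = detect_language("the dog and the cat", tables)
```

In practice each table would be loaded from one of the files, for example by passing an open file object to `load_hot_words`. Note that if no hot words match, every score is zero and this sketch simply returns the first language, so a caller may want to treat an all-zero result as "unknown".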

This data is licensed for reuse under the LGPL. Attribution is appreciated.

Sample Ontology for news stories

News ontology

Note: this ontology was created in 2004 using the Protege modeling tool.

About

The business is owned as a sole proprietorship by Mark and Carol Watson.

Mark Watson is an author of 16 published books and a consultant specializing in the JVM platform (Java, Scala, JRuby, and Clojure), artificial intelligence, and the Semantic Web.

Carol Watson helps prepare training data and serves as the editor for Mark's published books.

Carol and Mark in India

Privacy policy: this site collects no personal data or information about site visitors.