As part of the SAP acquisition of Business Objects, SAP got a “bonus” acquisition — Inxight. Inxight allows systems to transform text into actionable information with a suite of federated search, text analysis and visualization technologies.
We’re going to talk primarily about text analysis in this blog, but if you’re interested in federated search or visualization, we can talk about those, too.
What’s text analysis? Simply stated, it is the ability to “read” text and structure that text for downstream operations. Business Objects offers this capability in a SOAP server form (BusinessObjects Text Analysis, previously known as “SmartDiscovery”) and as C++ SDKs. It breaks down into three capabilities:
Linguistic Analysis: In its SDK form, this is known as LinguistX, and has been licensed for many years by SAP’s TREX team to help enable multilingual search for 30+ languages (a look at the TREX documentation will basically give you a list of these!). This is where it all starts: detecting a document’s language, finding each “word” (a challenge in languages like Chinese), tagging the part of speech of each word (including “noun phrases”), and reducing each word to its stem or decompounded form.
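To make those steps a little more concrete, here is a toy sketch in Python. It is not LinguistX and does not use its API; the stopword lists, the suffix list, and the function names are all made up for illustration, and the tokenizer only works for languages that put spaces between words.

```python
import re

# Hypothetical, tiny stopword lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}


def tokenize(text):
    """Split text into lowercase word tokens (whitespace-delimited languages only)."""
    return re.findall(r"\w+", text.lower())


def guess_language(tokens):
    """Pick the language whose stopword list overlaps the tokens the most."""
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for real stemming."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The dogs are searching")
language = guess_language(tokens)
stems = [naive_stem(token) for token in tokens]
```

A real linguistic engine relies on dictionaries and language-specific rules rather than a suffix list, which is exactly why segmenting Chinese or decompounding German is hard to do well.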
(As a side note, I recently had the pleasure of meeting members of the TREX, SAP ES, and NetWeaver teams in Walldorf. It was great to find out what exciting things SAP has planned in search, and we look forward to being part of it!)
Entity, Relation, and Event Extraction: In its SDK form, this is called ThingFinder Professional, and it’s a part of Text Analysis. It uses linguistic analysis along with built-in lexicons to determine the “meaning” of words. For example, “Bob Smith” is a PERSON, “Bob Smith, Inc.” is a COMPANY, “SAP bought Business Objects” is an M&A event, and so forth. This functionality is currently available in 9 major languages (English, French, German, Spanish, Arabic, Persian/Farsi, Korean, Russian, and Simplified Chinese). There is also a language called CGUL, along with workbenches, that can be used to extend extraction to new applications or to languages that are not yet covered.
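As a rough illustration of the PERSON versus COMPANY example, the Python sketch below combines a tiny lexicon with a pattern. It is not ThingFinder or CGUL; the first-name lexicon, the company suffixes, and the function name are assumptions made up for this post, and it handles entities only, not relations or events.

```python
import re

# Hypothetical mini-lexicon of first names; real lexicons are far larger.
FIRST_NAMES = {"Bob", "Alice"}

# A run of capitalized words, e.g. "Bob Smith".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
# A name followed by a legal suffix, e.g. "Bob Smith, Inc."
COMPANY_RE = re.compile(rf"{NAME},? (?:Inc|Corp|Ltd)\.")
NAME_RE = re.compile(NAME)


def extract_entities(text):
    """Return (entity text, entity type) pairs found in the text."""
    entities = []
    company_spans = []
    # First pass: company names, recognized by their legal suffix.
    for match in COMPANY_RE.finditer(text):
        entities.append((match.group(), "COMPANY"))
        company_spans.append(match.span())
    # Second pass: person names that are not part of a company name.
    for match in NAME_RE.finditer(text):
        in_company = any(s <= match.start() < e for s, e in company_spans)
        if not in_company and match.group().split()[0] in FIRST_NAMES:
            entities.append((match.group(), "PERSON"))
    return entities


found = extract_entities("Bob Smith founded Bob Smith, Inc. last year.")
```

The point of the two passes is that the same words can mean different things depending on context: “Bob Smith” on its own is tagged as a person, but the same words followed by “, Inc.” are tagged as a company.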
Categorization: This is the only capability not currently available as a standalone external SDK. This is the ability to categorize an entire document into one or more categories according to a taxonomy. For example, one might write a rule that says a document containing the word “football” is a “sports” document. It’s a hybrid system allowing for rules-based and learn-by-example categorization.
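The “football” rule can be sketched in a few lines of Python. This shows only the rules-based half, not learn-by-example, and it is not how our categorizer is implemented; the taxonomy, the keywords, and the function name are invented for the example.

```python
import re

# Hypothetical taxonomy: each category maps to the keywords that trigger it.
RULES = {
    "sports": {"football", "goal", "league"},
    "finance": {"acquisition", "shares", "revenue"},
}


def categorize(document):
    """Return every category whose keywords appear in the document, sorted by name."""
    words = set(re.findall(r"\w+", document.lower()))
    return sorted(
        category for category, keywords in RULES.items() if words & keywords
    )


categories = categorize("The football league announced record revenue")
```

Note that a document can land in more than one category, which matches the “one or more categories” behavior described above.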
So what is this all good for? Well, it’s used most commonly today in counter-terrorism, search, legal discovery, and other similar applications. We’re developing modules to enable it to be used for things like voice of the customer analysis, buzz analysis, and call center analysis, too.
In the weeks to come, we’ll talk more about those applications, as well as the research, development, and internal mashups we are doing on the technologies.