Using SAP BO Data Services Text Data Processing for entities extraction from RSS feeds
This is another blog entry of the unstructured data series, dedicated to retrieval and processing of unstructured (or, rather, semi-structured) data by SAP BusinessObjects Data Services, using JSONAdapter and Text Data Processing. After consumer sentiment analysis of Twitter messages, I would like to focus more on richer content — news articles.
An RSS news feed, “Yahoo! Australia & NZ Finance”, has been selected as the data source. Earlier in this blog series I demonstrated how JSONAdapter retrieves data from RSS sources. As I mentioned there, Yahoo! only gives away a short snippet of each RSS item, not the full text of a news article – a URL for the latter is provided, though. It would be good to have a special “web crawler” Data Services adapter that could reach a specified URL from a Function Call and return the full text of an article, and at some point in the future I may develop one. Until then, curl.exe has to do the job (with a few extra crutches), so I created a Data Services Job that generates a BAT-file and loads every article into an HTML-file named NNNN.html, where NNNN corresponds to the ID of the RSS item in the database:
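For illustration only, the BAT-file generation step can be sketched outside Data Services. The snippet below is a minimal Python sketch, not the actual job: the function name `build_download_script`, the `articles` output folder and the sample URLs are hypothetical, and it assumes each RSS item is available as an (ID, URL) pair.

```python
from typing import Iterable, Tuple


def build_download_script(items: Iterable[Tuple[int, str]],
                          out_dir: str = "articles") -> str:
    """Return the text of a Windows batch file with one curl call per article.

    `items` holds (RSS item ID, article URL) pairs; both the structure and
    the `out_dir` folder name are assumptions made for this sketch.
    """
    lines = ["@echo off"]
    for item_id, url in items:
        # Zero-pad the ID so the file is named NNNN.html
        target = f"{out_dir}\\{item_id:04d}.html"
        # A literal % must be doubled inside a batch file
        safe_url = url.replace("%", "%%")
        # -L follows redirects, -s hides the progress meter, -o sets the output file
        lines.append(f'curl.exe -L -s -o "{target}" "{safe_url}"')
    return "\r\n".join(lines) + "\r\n"


sample_items = [(1, "http://example.com/a"), (27, "http://example.com/b")]
script_text = build_download_script(sample_items)
```

Writing `script_text` to a `.bat` file and running it would then fetch the pages one by one; error handling and retries are left out of the sketch.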
That way, 1000 news articles have been collected over the period 1st-16th of May, 2012. Then I ran a simple Text Data Processing job over those files, getting ~81000 entities in the “raw” analysis table:
The Entity Extraction transform delivers ‘spaghetti’ data consisting of entities and facts: from a sentence ‘Apples Ltd. acquired Oranges Inc.’, the transform would extract two entities of type Organization and one fact of type BuyEvent, all listed in the result dataset one after another. These records, however, have attributes that help to navigate the context:
- ID uniquely identifies an entity in the document;
- PARENT_ID links a simpler entity to a more complex one: let’s say that, in the example above, the BuyEvent would have an ID=12 – then, both Apples Ltd. and Oranges Inc. will get PARENT_ID=12;
- PARAGRAPH_ID provides the paragraph number in the text, where the entity occurs. Data Services counts paragraphs not only for plain text, but for HTML, as well;
- SENTENCE_ID provides the sentence number in the text where the entity occurs;
- OFFSET provides the entity’s exact position in the text. If an entity is extracted twice by different dictionaries or rules (which is allowed), the offset helps to catch those duplicates.
Making appropriate joins will then do the trick. Later in this blog I will demonstrate how that works.
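As a rough illustration of those joins, here is a minimal Python sketch using the ‘Apples Ltd. acquired Oranges Inc.’ example. The rows are made up, and the column names TYPE and TEXT are assumptions for this sketch (only ID, PARENT_ID and OFFSET are named above); in Data Services the same logic would be expressed as joins in a data flow.

```python
from collections import defaultdict

# Hypothetical output rows for one document; the last row imitates the same
# entity being extracted a second time by another dictionary or rule.
rows = [
    {"ID": 10, "PARENT_ID": 12, "TYPE": "ORGANIZATION", "TEXT": "Apples Ltd.", "OFFSET": 0},
    {"ID": 11, "PARENT_ID": 12, "TYPE": "ORGANIZATION", "TEXT": "Oranges Inc.", "OFFSET": 21},
    {"ID": 12, "PARENT_ID": None, "TYPE": "BuyEvent",
     "TEXT": "Apples Ltd. acquired Oranges Inc.", "OFFSET": 0},
    {"ID": 13, "PARENT_ID": 12, "TYPE": "ORGANIZATION", "TEXT": "Oranges Inc.", "OFFSET": 21},
]


def dedupe(rows):
    """Keep the first row for every (OFFSET, TEXT) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["OFFSET"], row["TEXT"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def facts_with_members(rows):
    """Self-join on PARENT_ID = ID: map each fact to the entities it contains."""
    children = defaultdict(list)
    for row in rows:
        if row["PARENT_ID"] is not None:
            children[row["PARENT_ID"]].append(row["TEXT"])
    return {
        (row["ID"], row["TYPE"]): children[row["ID"]]
        for row in rows
        if row["ID"] in children
    }


result = facts_with_members(dedupe(rows))
```

Here `result` maps the BuyEvent (ID 12) to its two participating organizations, with the duplicate mention of Oranges Inc. dropped beforehand.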
Another issue that one has to handle when working with unstructured content is the natural variability of entity names. At the heart of the data flows consolidating those varying names lies the Match transform of Data Quality. It is probably the most complex transform in Data Services to configure – luckily, SAP provides a few blueprints that may serve as a good starting point.
A person may be mentioned by their full name or by family name – business news style does not usually allow mentioning a person by their first name only. Obviously, there may be two persons sharing the same family name. A news article from my set typically referred to someone as John Doe in the beginning and then just Doe or Mr Doe throughout the rest of the article. I used that fact in the name-cleansing dataflow and consolidated mentions of persons (there is of course a chance that two persons sharing the same family name are mentioned in the same news article, but let us ignore it for this exercise). The screenshot below demonstrates how different forms of mentioning Tony Abbott are consolidated into a standardized record – the left pane lists ‘consolidated’ persons, and the right one leads to mentions of each person in all documents of the set:
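The per-article rule described above (a full name first, then the family name alone) can be illustrated with a small Python sketch. This is a plain stand-in for the idea, not the Match transform configuration; the function name, the list of titles and the sample mentions are assumptions.

```python
def consolidate_persons(mentions):
    """Resolve person mentions from ONE article to full names where possible.

    `mentions` is a list of strings in order of appearance. A mention with
    more than one name token is treated as a full name; later mentions with
    the same family name are mapped to it. Two people sharing a family name
    in the same article are deliberately not handled.
    """
    titles = {"mr", "mrs", "ms", "dr"}  # hypothetical list of courtesy titles
    full_names = {}  # family name (lowercase) -> first full name seen
    resolved = []
    for mention in mentions:
        tokens = [t for t in mention.split()
                  if t.rstrip(".").lower() not in titles]
        if not tokens:
            resolved.append(mention)
            continue
        family = tokens[-1].lower()
        if len(tokens) > 1:
            full_names.setdefault(family, " ".join(tokens))
        resolved.append(full_names.get(family, mention))
    return resolved


article_mentions = ["Tony Abbott", "Mr Abbott", "Abbott", "Gillard"]
consolidated = consolidate_persons(article_mentions)
```

In this sample the three Abbott mentions resolve to "Tony Abbott", while "Gillard" stays as it is because no full name was seen for it in the article.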
Company names have been processed in a similar way. Actually, the idea was not only to consolidate them between the news articles, but also to standardize them, using the externally acquired ASX list as master data (whether the ASX list contains legally correct names is another question), and to enrich them with ASX tickers, the 3-5 letter acronyms of company names. That way, the structured data – ASX trading history, in this case – was made available to augment the unstructured part using joins on those ticker fields. Below, the left pane, again, shows consolidated and standardized company names and the right pane leads to individual entries in the articles:
Company names are somewhat trickier than people’s names. In the screenshot above you may see that I set Qantas and Qantas Airways to count as the same company. However, the same setting made Bank Of Queensland and Bank Of China count as the same company, too. This could be solved by having a list of ‘generic’ parts of company names – Bank Of, in this case – which would lead to a special matching process. Another finding was that a company name not accompanied by Ltd/Inc/Co/etc. is sometimes recognized by Data Services as PROP_MISC, a generic catch-all bucket. Data Quality, obviously, may help to extract those, too.
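To make the ‘generic parts’ idea concrete, here is a minimal Python sketch. It is not how the Match transform works internally; the `GENERIC_PARTS` list and the subset rule are assumptions chosen only to reproduce the two cases mentioned above.

```python
# Hypothetical list of name parts that carry no identity on their own
GENERIC_PARTS = {"bank", "of", "ltd", "inc", "co"}


def distinctive_tokens(name):
    """Lowercase the name, drop dots, and remove the generic parts."""
    tokens = name.lower().replace(".", "").split()
    return {t for t in tokens if t not in GENERIC_PARTS}


def same_company(a, b):
    """Treat two names as one company if the distinctive tokens of one
    are fully contained in the distinctive tokens of the other."""
    da, db = distinctive_tokens(a), distinctive_tokens(b)
    if not da or not db:
        # Names made only of generic parts are never matched
        return False
    return da <= db or db <= da


qantas_match = same_company("Qantas", "Qantas Airways")
banks_match = same_company("Bank Of Queensland", "Bank Of China")
```

With this rule Qantas and Qantas Airways still match, while the two banks no longer do, because only Queensland and China are left to compare once Bank Of is set aside.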
There is a large number of NOUN_GROUP entities, which by themselves do not seem to be of much use, but can add some context – which is why I refer to them further as Topics. Using Data Quality Match and Associate functionality, those topics have been grouped into clusters – a trick from freely available blueprints by SAP. My data collection time interval included an event when the Reserve Bank Of Australia cut its cash rate. Surely, that event was reflected in the context:
Particular settings of the process to cluster the topics may differ, but you get the idea.
One thing I consciously did not do was create custom dictionaries. They would have helped to extract more information and structure it better; however, that development requires a deeper dive into the subject area and the more mundane task of collecting the related terms. In a real-world case, that would be a requirement.
Of many approaches to analysis of this data, so far I am picking the simplest one, a sort of descriptive/profiling one. I focused on the following cases:
- Company performance at ASX — simply, for a Company extracted from the news articles, the graph of its ASX trading history to be produced.
- Companies and Topics – I defined a ‘relation’ here as co-occurrence of Topics and Companies in the same paragraph or the same article – the latter criterion, obviously, being more relaxed. I could then quantify the number of Topic mentions and produce the Top 5 Topics for the given Company and Date. The same model has been implemented for Companies and People. That quantification is prone to some skewing: for example, if a news article is duplicated in the RSS feed (and I have seen such cases), all entities mentioned in it will get a bump to their mention score.
- Related Events – a subset of facts, namely, BuyEvent, SellEvent or Action that may be related (again, on the basis of co-occurrence in the text) to be picked for the given Company.
- People Events/Relationships – here, for each Person I searched for facts extracted from the unstructured data and reconstructed the Events (Hire or Resign) these people might have participated in, or Organization relationships they might have.
- Topics and People – for each Topic, a list of possibly related Persons is to be produced; the frequency of mentions might then be quantified. Technically, it’s the same as the Related Topics exercise.
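The co-occurrence counting behind the Companies and Topics case can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout, the sample mentions and the function name `top_topics` are hypothetical, the Date dimension is left out, and duplicated articles are not filtered.

```python
from collections import Counter

# Hypothetical cleansed mentions: (document ID, paragraph ID, kind, name)
mentions = [
    (1, 1, "COMPANY", "Qantas"),
    (1, 1, "TOPIC", "fuel costs"),
    (1, 2, "TOPIC", "cash rate"),
    (2, 1, "COMPANY", "Qantas"),
    (2, 1, "TOPIC", "fuel costs"),
    (3, 1, "TOPIC", "cash rate"),
]


def top_topics(mentions, company, limit=5, same_paragraph=True):
    """Count Topics that co-occur with `company` and return the top `limit`.

    With same_paragraph=True the Topic must share a paragraph with the
    Company; with False, sharing the article is enough (the relaxed rule).
    """
    def scope(doc_id, paragraph_id):
        return (doc_id, paragraph_id) if same_paragraph else (doc_id,)

    # All paragraphs (or articles) in which the company is mentioned
    company_scopes = {
        scope(doc, par)
        for doc, par, kind, name in mentions
        if kind == "COMPANY" and name == company
    }
    counts = Counter(
        name
        for doc, par, kind, name in mentions
        if kind == "TOPIC" and scope(doc, par) in company_scopes
    )
    return counts.most_common(limit)


strict = top_topics(mentions, "Qantas")
relaxed = top_topics(mentions, "Qantas", same_paragraph=False)
```

Swapping "TOPIC" for a person kind would give the Companies and People variant; removing duplicated articles before counting would reduce the skew mentioned above.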
A couple of BusinessObjects Universes have been created to report on the processed data. In order to make the data more interactive, I have created a Dashboard (Xcelsius). It turned out to be a good pick, with one exception: display of textual data grids in Xcelsius is limited. On the other hand, it allowed me to take the best from both worlds: MS Excel grid and formulas, and live data connectivity.
Xcelsius enabled navigation between analysis cases outlined above: as soon as I selected a Company, from its ASX performance graph I could navigate to Company-related Topics, People or Events:
The Company-related Events part for that selection was empty, hence not displayed. Here’s the Events part for another company, David Jones. The Events chart in the screenshot below is supposed to display a star for every day for which events are recorded, and navigate to the captured details in the box on the right:
More on People Events – I made it possible in Xcelsius to get a list of Events by Person or by Date:
As I mentioned, these analyses belong more to data profiling of some sort, rather than to quantitative analysis. However, parsed and cleansed unstructured data may provide a foundation for analytical scenarios, especially if integrated with ‘traditional’ structured data – forming what is buzzing these days as Big Data.
Data volumes and the potential complexity of joins promise to be high and, in the SAP ecosystem, suggest using HANA to boost the performance of analytics: not only is in-memory data access generally faster, but also a) as of SPS4 HANA is capable of full text search and b) it includes a few statistical libraries that can be integrated with SAP BusinessObjects Predictive Analysis or R. As for unstructured data processing, this blog series concludes here.
– Roman Bukarev