Big Data at The Globe And Mail: Hadoop, HANA, & Cloud
Like most media organizations around the world, Toronto-based The Globe and Mail has struggled to make a profitable transition from physical newspapers to online journalism. But now a combination of Hadoop and SAP HANA in the cloud is helping the company make critical decisions about how and when to charge readers for online access to articles.
In print for 167 years, The Globe is Canada’s largest newspaper, with more than 300 journalists covering national, international, business, technology, arts, entertainment and lifestyle news for around 3.5 million readers a week across the country. Over the last decade, the company has invested in comprehensive data gathering and analysis systems, starting with SAP ERP in 2002 and a full enterprise data warehouse using SAP BW in 2007.
Figure 1: The Globe and Mail paywall project architecture, featuring Hadoop on Amazon AWS and SAP HANA
In early 2012, data analysis became an urgent business priority because of the company’s paywall project. The company knew casual readers were coming to the Web site and needed to work out how many articles to let them read before asking them to pay.
“If we set the bar too high, we won’t have enough people to pay for our content,” Sandy Yang, a functional analyst at The Globe, said. “But if too low, they might never come back, and then we might lose a big chunk of our advertising revenue.”
Ideally some readers wouldn’t even know about the paywall, while others would think that their money is well spent.
To find the right balance, the company uses Omniture to get insight into which articles readers are interested in, as well as key statistics, such as page views per period or unique visits per period per section. But answering more complex — and important — questions required further analysis on the raw clickstream data.
The internal IT teams first tried to import the web data from Omniture into a traditional relational database. But the data was complicated: it was stored in tab-delimited text files with millions of lines, each with around 500 fields, and was growing at a rate of several gigabytes a day. The company turned to Hadoop to process the web data but wasn’t ready to buy and maintain its own servers, so it used Amazon Elastic MapReduce and stored the results in Amazon S3.
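As a rough illustration of the kind of aggregation a MapReduce job performs on tab-delimited clickstream rows, here is a minimal Python sketch. It is not The Globe's actual job: the column positions, the three-field sample rows and the function names are all hypothetical, since the article does not describe the layout of the roughly 500 fields.

```python
import csv
import io
from collections import Counter

# Hypothetical column positions; the real export layout is not given in the article.
VISITOR_COL = 0
SECTION_COL = 2

def map_line(fields):
    """Map step: emit a ((visitor, section), 1) pair for one clickstream row."""
    return (fields[VISITOR_COL], fields[SECTION_COL]), 1

def count_views(lines):
    """Reduce step: sum the emitted counts per (visitor, section) key."""
    totals = Counter()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) <= max(VISITOR_COL, SECTION_COL):
            continue  # skip short or malformed rows
        key, value = map_line(fields)
        totals[key] += value
    return totals

# Made-up sample rows standing in for a clickstream file.
sample = io.StringIO(
    "v1\t2012-12-01\tbusiness\n"
    "v1\t2012-12-01\tbusiness\n"
    "v2\t2012-12-01\tarts\n"
)
views = count_views(sample)
for (visitor, section), n in sorted(views.items()):
    print(visitor, section, n)
```

On a real cluster the map and reduce steps would run in parallel across many files; the single-process version above only shows the shape of the computation.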
“The result is a whole lot of numbers,” Yang said. “Every time a job finished, I had to add column headers and reformat the data to explain what it meant.”
The numbers still couldn’t answer the company’s “what if” questions, and the batch process didn’t allow for drill-down.
“I didn’t have better options,” Yang said.
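To make the “what if” question concrete, the sketch below estimates what share of visitors would reach the paywall under different free-article limits. It is an illustrative assumption, not the analysis The Globe ran: the function name and the monthly article counts are invented for the example.

```python
def share_hitting_meter(articles_per_visitor, free_limit):
    """Return the fraction of visitors who read more than free_limit articles."""
    if not articles_per_visitor:
        return 0.0
    blocked = sum(1 for n in articles_per_visitor if n > free_limit)
    return blocked / len(articles_per_visitor)

# Made-up monthly article counts for ten visitors (illustrative only).
monthly_counts = [1, 1, 2, 3, 3, 5, 8, 12, 20, 40]

for limit in (5, 10, 20):
    share = share_hitting_meter(monthly_counts, limit)
    print(f"free limit {limit:>2}: {share:.0%} of visitors would reach the paywall")
```

Rerunning a calculation like this for each candidate limit is the sort of interactive comparison that a batch job with fixed output makes slow.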
Figure 2: An example of a correlation analysis using SAP HANA Studio on the Globe and Mail clickstream data preprocessed in Hadoop
On Time and On Budget
HANA ONE, a version of SAP’s new in-memory platform that runs in the Amazon cloud, surprised Yang with its simplicity, easy implementation and low cost. It also bridged the gap between The Globe’s big data and its creative business people, she said.
“The speed of the product — the real-time aspect instead of batch processing — was delivered as advertised,” Yang told Insider Profiles. “Now we can show people what their data looks like or perform data analysis tasks right in the meeting room — instead of saying, ‘I’ll get back to you as soon as possible,’ like we did previously.”
That helped make the implementation an easy business decision. Companies usually build a business case for a purchase, and then buy and implement the products. But HANA ONE is offered on a pay-as-you-go basis, so businesses don’t have to buy the product upfront.
“I can use it whenever I want, and all I pay for is the time we use it, nothing more,” Yang said. “For small businesses, and companies with no budgets, that’s extremely important.”
Yang spent less than C$100 (€73) on the solution in December: C$25 (€18) for HANA ONE and C$63 (€46) for AWS cluster servers.
For more details about The Globe’s Hadoop and SAP HANA ONE project on Amazon AWS, watch this on-demand Web seminar.