Data Harvesting using Python Script for GNIP

Former Member · ‎09-29-2014

Disclaimer

This tutorial is intended as a guide for the creation of demo/test data only. The sample script provided is not intended for use in a productive system.

Purpose

This tutorial explains one way of harvesting Twitter data through GNIP. The Python interpreter that is pre-installed with the SAP HANA client is used to execute a Python script from SAP HANA Studio. The script harvests the data from GNIP, extracts the useful fields, and stores them in the Business Suite Foundation database tables SOCIALDATA and SOCIALUSERINFO. As provided, the script runs indefinitely. To stop harvesting, stop the execution of the script manually in SAP HANA Studio. You can, however, modify the script to run for a specific period of time. To run the script, you also need to make a few customizing and configuration settings in order to use the PyDev plugin in SAP HANA Studio.

Prerequisites

Make sure that the following prerequisites are met before you start:

• Installation of SAP HANA Studio and SAP HANA Client
Install SAP HANA Studio and SAP HANA Client, and apply for a HANA user with read, write, and update authorization for the foundation database tables SOCIALDATA and SOCIALUSERINFO.

• Create a GNIP account

• Data Stream configuration in your GNIP account

Create a data stream for a source (such as Twitter or Facebook) in your GNIP account. A data stream harvests data from a single source only, so you need a separate data stream for each data source. After creating a data stream, define rules in the ‘Rules’ tab to filter the data that you receive from GNIP. For details on writing rules, see http://support.gnip.com/apis/powertrack/rules.html

Setup

1. Configuring Python in SAP HANA Studio Client

Python version 2.6 is already embedded in the SAP HANA client, so you do not need to install Python from scratch. To configure the Python API to connect to SAP HANA, proceed as follows:

1. Copy the following files from C:\Program Files\SAP\hdbclient\hdbcli to C:\Program Files\SAP\hdbclient\Python\Lib

a. __init__.py
b. dbapi.py
c. resultrow.py

2. Copy the following files from C:\Program Files\SAP\hdbclient to C:\Program Files\SAP\hdbclient\Python\Lib

a. pyhdbcli.pdb
b. pyhdbcli.pyd

Note:

On Windows, the default installation path for a 64-bit installation of SAP HANA Studio and the SAP HANA database client is C:\Program Files\SAP\..

If you opted for a 32-bit installation, the default path is C:\Program Files (x86)\sap\..
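
If you prefer not to copy the files by hand, the two copy steps above can be sketched in Python. This is a minimal sketch that assumes the default 64-bit installation path; the constant HDBCLIENT and the function copy_client_files are illustrative names, not part of the SAP HANA client. Note that the first file name is written with double underscores (__init__.py).

```python
import os
import shutil

# Assumed default 64-bit install location; adjust to your own installation.
HDBCLIENT = r"C:\Program Files\SAP\hdbclient"

# (source subfolder, file name) pairs taken from the two copy steps above.
FILES = [
    ("hdbcli", "__init__.py"),
    ("hdbcli", "dbapi.py"),
    ("hdbcli", "resultrow.py"),
    ("", "pyhdbcli.pdb"),
    ("", "pyhdbcli.pyd"),
]


def copy_client_files(client_dir=HDBCLIENT):
    """Copy the HANA Python API files into the Python/Lib folder below client_dir.

    Returns the names of the files that were copied; missing sources are skipped.
    """
    target = os.path.join(client_dir, "Python", "Lib")
    copied = []
    for subdir, name in FILES:
        source = os.path.join(client_dir, subdir, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(target, name))
            copied.append(name)
    return copied


if __name__ == "__main__":
    print(copy_client_files())
```

Writing below C:\Program Files usually requires administrator rights, so the sketch may need to be run from an elevated prompt.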

2. Setting up the Editor to run the file

2.1. Install the PyDev plugin to use the Python IDE for Eclipse

The preferred method is to use the Eclipse IDE from SAP HANA Studio. To be able to run the Python script, you first need to install the PyDev plugin in SAP HANA Studio.

a. Open SAP HANA Studio. Click Help in the menu bar and select Install New Software
b. Click the Add button and enter the following information

Name : pydev

Location : http://pydev.org/updates

c. Select the settings as shown in this screenshot.

d. Press Next twice

e. Accept the license agreements, then press Finish.

f. Restart SAP HANA Studio.

2.2. Configure the Python Interpreter

In SAP HANA Studio, carry out the following steps:

a. Select the menu entries Window -> Preferences

b. Select PyDev -> Interpreters -> Python Interpreter

c. Click the New button and type in an interpreter name. In the field Interpreter Executable, enter the executable file C:\Program Files\SAP\hdbclient\Python\Python.exe. Press OK twice.

2.3. Create a Python project

In SAP HANA Studio, carry out the following steps:

a. Click File -> New -> Project, then select PyDev Project

b. Type in a project name, then press Finish

c. Right-click your project, click New -> File, type a file name, and press Finish.
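
Before pasting the full script, it can help to confirm that the configured interpreter finds the HANA Python API. The following is a minimal sketch; module_available is an illustrative helper name, and the check simply assumes that dbapi.py was copied to the interpreter's Lib folder as described in the setup above.

```python
import sys


def module_available(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Shows which Python executable PyDev is actually running
    print("Interpreter: %s" % sys.executable)
    print("dbapi importable: %s" % module_available("dbapi"))
```

If the second line reports False, check the copied files and the interpreter path configured in PyDev.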

Customizing and Running the Script

1. Customizing the Python script

Copy and paste the code provided below into the newly created Python file, then enter values for the following parameters in the file:

a. URL – unique URL of the data stream you created in your GNIP account

(for example: 'https://stream.gnip.com/accounts/<GNIP_USERNAME>/publishers/<STREAM>/streams/track/dev.json')

b. username_gnip – your GNIP account username

c. password_gnip – your GNIP account password

d. server – HANA server name (for example: lddbq7d.wdf.sap.corp)

e. port – HANA server port

f. username_hana – HANA server username

g. password_hana – HANA server password

h. schema – schema name

i. client – client number

j. socialmediachannel – code of the social media channel, written to the SOCIALMEDIACHANNEL column of both tables
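
The script sends the GNIP username and password as an HTTP Basic Authorization header. As a minimal sketch of how such a header value is built, the helper below (basic_auth_header is an illustrative name, and the credentials are made up) Base64-encodes "username:password" with base64.b64encode, which does not append a trailing newline to the result.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    # b64encode returns the token without a trailing newline
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic %s" % token


# Hypothetical credentials, for illustration only
print(basic_auth_header("user@example.com", "secret"))
```

A header value that ends in a newline can corrupt the HTTP request, so it is worth checking the generated value once before connecting.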



import urllib2
import base64
import zlib
import threading
from threading import Lock
import sys
import ssl
import json
from datetime import datetime
import calendar
import dbapi
from wsgiref.handlers import format_date_time

CHUNKSIZE = 4*1024
GNIPKEEPALIVE = 30
NEWLINE = '\r\n'

# GNIP stream settings (fill in before running)
URL = ''
username_gnip = ''
password_gnip = ''

HEADERS = { 'Accept': 'application/json',
            'Connection': 'Keep-Alive',
            'Accept-Encoding' : 'gzip',
            # b64encode, unlike encodestring, does not append a newline to the header value
            'Authorization' : 'Basic %s' % base64.b64encode('%s:%s' % (username_gnip, password_gnip))  }

# SAP HANA connection settings (fill in before running)
server = ''
port = 0    # placeholder: replace 0 with your HANA server port
username_hana = ''
password_hana = ''
schema = ''
client = ''
socialmediachannel = ''

print_lock = Lock()
err_lock = Lock()

class procEntry(threading.Thread):
    def __init__(self, buf):
        self.buf = buf
        threading.Thread.__init__(self)

    def unicodeToAscii(self, word):
        # Drop any characters that cannot be represented in ASCII
        return word.encode('ascii', 'ignore')

    def run(self):
        for rec in [x.strip() for x in self.buf.split(NEWLINE) if x.strip() != '']:
            try:
                jrec = json.loads(rec)
                with print_lock:
                    verb = jrec['verb']
                    verb = self.unicodeToAscii(verb)

                    # SOCIALUSERINFO DETAILS
                    socialUser = jrec['actor']['id'].split(':')[2]
                    socialUser = self.unicodeToAscii(socialUser)
                    socialUserProfileLink = jrec['actor']['link']
                    socialUserProfileLink = self.unicodeToAscii(socialUserProfileLink)
                    socialUserAccount = jrec['actor']['preferredUsername']
                    socialUserAccount = self.unicodeToAscii(socialUserAccount)
                    friendsCount = jrec['actor']['friendsCount']
                    followersCount = jrec['actor']['followersCount']
                    postedTime = jrec['postedTime']
                    postedTime = self.unicodeToAscii(postedTime)
                    displayName = jrec['actor']['displayName']
                    displayName = self.unicodeToAscii(displayName)
                    image = jrec['actor']['image']
                    image = self.unicodeToAscii(image)

                    # SOCIALDATA DETAILS
                    socialpost = jrec['id'].split(':')[2]
                    socialpost = self.unicodeToAscii(socialpost)
                    createdbyuser = socialUser
                    creationdatetime = postedTime
                    socialpostlink = jrec['link']
                    creationusername = displayName
                    socialpostsearchtermtext = jrec['gnip']['matching_rules'][0]['value']
                    socialpostsearchtermtext = self.unicodeToAscii(socialpostsearchtermtext)

                    # CREATEDAT: current UTC time as YYYYMMDDhhmmss
                    d = datetime.utcnow()
                    time = d.strftime("%Y%m%d%H%M%S")

                    # CREATIONDATETIME_UTC: postedTime without its last 5 characters, as YYYYMMDDhhmmss
                    creationdatetime_utc = datetime.strptime(postedTime[:-5], "%Y-%m-%dT%H:%M:%S")
                    creationdatetime_utc = creationdatetime_utc.strftime("%Y%m%d%H%M%S")

                    # CREATIONDATETIME: HTTP-style date with ' GMT' replaced by ' +0000'
                    stamp = calendar.timegm(datetime.strptime(creationdatetime[:-5], "%Y-%m-%dT%H:%M:%S").timetuple())
                    creationdatetime = format_date_time(stamp)
                    creationdatetime = creationdatetime[:-4] + ' +0000'

                    if verb == 'post':
                        socialdatauuid = jrec['object']['id'].split(':')[2]
                        socialdatauuid = self.unicodeToAscii(socialdatauuid)
                        socialposttext = jrec['object']['summary']
                        socialposttext = self.unicodeToAscii(socialposttext)
                    elif verb == 'share':
                        # A share wraps the original post one level deeper
                        socialdatauuid = jrec['object']['object']['id'].split(':')[2]
                        socialdatauuid = self.unicodeToAscii(socialdatauuid)
                        socialposttext = jrec['object']['object']['summary']
                        socialposttext = self.unicodeToAscii(socialposttext)
                    else:
                        # Skip other verbs; they have no post data to store
                        continue

                    res = '\t'.join([client, socialmediachannel, socialUser, socialUserAccount, str(friendsCount),
                                     str(followersCount), postedTime, displayName, displayName.upper(),
                                     socialUserProfileLink, image])
                    print(res)

                    hdb_target = dbapi.connect(server, port, username_hana, password_hana)
                    cursor_target = hdb_target.cursor()

                    sql = ('upsert ' + schema + '.SOCIALUSERINFO(CLIENT, SOCIALMEDIACHANNEL, SOCIALUSER, SOCIALUSERPROFILELINK, SOCIALUSERACCOUNT, '
                           'NUMBEROFSOCIALUSERCONTACTS, SOCIALUSERINFLUENCESCOREVALUE, CREATIONDATETIME, SOCIALUSERNAME, SOCIALUSERNAME_UC, SOCIALUSERIMAGELINK, CREATEDAT) '
                           'values (?,?,?,?,?,?,?,?,?,?,?,?) with primary key')
                    cursor_target.execute(sql, (client, socialmediachannel, socialUser, socialUserProfileLink, socialUserAccount, friendsCount,
                                                followersCount, creationdatetime, displayName, displayName.upper(), image, time))
                    hdb_target.commit()

                    sql = ('upsert ' + schema + '.SOCIALDATA(CLIENT, SOCIALDATAUUID, SOCIALPOST, SOCIALMEDIACHANNEL, CREATEDBYUSER, CREATIONDATETIME, '
                           'SOCIALPOSTLINK, CREATIONUSERNAME, SOCIALPOSTSEARCHTERMTEXT, SOCIALPOSTTEXT, CREATEDAT, CREATIONDATETIME_UTC) '
                           'VALUES(?,?,?,?,?,?,?,?,?,?,?,?) WITH PRIMARY KEY')
                    cursor_target.execute(sql, (client, socialdatauuid, socialpost, socialmediachannel, createdbyuser, creationdatetime, socialpostlink,
                                                creationusername, socialpostsearchtermtext, socialposttext, time, creationdatetime_utc))
                    hdb_target.commit()

                    # Release the connection opened for this record
                    cursor_target.close()
                    hdb_target.close()
            except (ValueError, KeyError, IndexError), e:
                # Malformed JSON or a record without the expected fields
                with err_lock:
                    sys.stderr.write("Error processing JSON: %s (%s)\n"%(str(e), rec))

def getStream():
    # Adjust the proxy host and port to your network
    proxy = urllib2.ProxyHandler({'http': 'http://proxy:8080', 'https': 'https://proxy:8080'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)
    req = urllib2.Request(URL, headers=HEADERS)
    response = urllib2.urlopen(req, timeout=(1+GNIPKEEPALIVE))
    # 16+MAX_WBITS tells zlib to expect a gzip header
    decompressor = zlib.decompressobj(16+zlib.MAX_WBITS)
    remainder = ''
    while True:
        tmp = decompressor.decompress(response.read(CHUNKSIZE))
        if tmp == '':
            return
        remainder = remainder + tmp
        # Wait for more data until at least one complete record has arrived
        if NEWLINE not in remainder:
            continue
        [records, remainder] = remainder.rsplit(NEWLINE, 1)
        procEntry(records).start()

if __name__ == "__main__":
    print('Started...')
    while True:
        try:
            getStream()
        except ssl.SSLError, e:
            with err_lock:
                sys.stderr.write("Connection failed: %s\n"%(str(e)))
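
The date handling inside run() is the least obvious part of the script, so here it is on its own as a minimal sketch. It assumes that postedTime arrives in the form '2014-09-29T10:15:30.000Z', which is what the [:-5] slice in the script implies; convert_posted_time is an illustrative name and the sample timestamp is made up.

```python
import calendar
from datetime import datetime
from wsgiref.handlers import format_date_time


def convert_posted_time(posted_time):
    """Convert a postedTime string such as '2014-09-29T10:15:30.000Z'.

    Returns (compact_utc, rfc_style), mirroring the two formats the script stores
    in CREATIONDATETIME_UTC and CREATIONDATETIME.
    """
    # Drop the trailing '.000Z' (5 characters) before parsing
    parsed = datetime.strptime(posted_time[:-5], "%Y-%m-%dT%H:%M:%S")
    compact_utc = parsed.strftime("%Y%m%d%H%M%S")
    # timegm treats the time tuple as UTC and returns seconds since the epoch
    stamp = calendar.timegm(parsed.timetuple())
    # format_date_time ends in ' GMT'; swap that suffix for a numeric offset
    rfc_style = format_date_time(stamp)[:-4] + " +0000"
    return compact_utc, rfc_style


print(convert_posted_time("2014-09-29T10:15:30.000Z"))
```

If your stream delivers timestamps in a different layout, the slice and the format string need to be adapted accordingly.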

2. Run the script from your editor
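
The script as provided keeps running until you stop it in SAP HANA Studio. One possible way to limit a run to a fixed period is a deadline loop, sketched below. The name run_for and its parameters are illustrative assumptions; in the script, step would be a function that calls getStream() and handles connection errors.

```python
import time


def run_for(duration_seconds, step, clock=time.time):
    """Call step() repeatedly until duration_seconds have elapsed.

    Returns the number of times step() was called.
    """
    deadline = clock() + duration_seconds
    calls = 0
    while clock() < deadline:
        step()
        calls += 1
    return calls


# Example: run a trivial step for a tenth of a second
print(run_for(0.1, lambda: time.sleep(0.02)) > 0)
```

Because getStream() only returns when the connection ends, the deadline is checked only between connections; for a stricter limit, the same check would have to go inside the read loop of getStream().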

3. Check the results in the database tables SOCIALDATA and SOCIALUSERINFO
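
A quick way to check the results is to count the rows in both tables before and after a run. The sketch below works with any DB-API cursor; count_rows is an illustrative name, and the commented usage assumes the same dbapi.connect call and settings as the script.

```python
def count_rows(cursor, schema, table):
    """Return the number of rows in schema.table using a DB-API cursor."""
    # Schema and table names cannot be bound as parameters, so they are
    # formatted into the statement; only pass trusted names here.
    cursor.execute("SELECT COUNT(*) FROM %s.%s" % (schema, table))
    return cursor.fetchone()[0]


# Assumed usage with the settings from the script:
#   hdb = dbapi.connect(server, port, username_hana, password_hana)
#   cursor = hdb.cursor()
#   print(count_rows(cursor, schema, 'SOCIALDATA'))
#   print(count_rows(cursor, schema, 'SOCIALUSERINFO'))
#   hdb.close()
```

You can also inspect the table contents directly in SAP HANA Studio.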

Other blog posts on connecting Social Channels:

Twitter connector to harvest tweets into Social Intelligence tables using Python script

http://scn.sap.com/docs/DOC-53824

Historical data harvesting from GNIP using Python scripts

http://scn.sap.com/community/crm/marketing/blog/2014/10/16/historical-data-harvesting-from-gnip-usin...

Demo Social and Sentiment data generation using Python script

http://scn.sap.com/community/crm/marketing/blog/2015/01/12/demo-social-and-sentiment-data-generation...

(If you find any mistakes or have any questions about this blog, please leave a comment.)
