Disclaimer

This tutorial is intended as a guide for the creation of demo/test data only. The sample script provided is not intended for use in a productive system.

Purpose

This tutorial explains how to create demo data for the Business Suite Foundation database tables SOCIALDATA and SMI_VOICE_CUST using a Python script. The data is saved as Excel files. For more information about Analyze Sentiment, a Fiori app from Social Intelligence, see the post "New videos on SAP Sentiment Analysis on YouTube available".

That post gives you the context for this tutorial and a basic idea of what Social Intelligence is about.

Prerequisites/Setup

Make sure that the following prerequisites are met before you start:

• Installation of Python 2.x for Windows

Install Python 2.x for your platform from the Download Python page on Python.org.
PS: During installation, select the option to add Python's installation directory to the Windows PATH variable.

Install the required Python modules: setuptools, jdcal, openpyxl, xlrd.
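
Before running the scripts, it can help to confirm that the modules are actually importable. The following is a minimal sketch, not part of the original scripts; the helper name missing_modules is made up for illustration.

```python
# Sketch: report which of the required modules cannot be imported.
import importlib

REQUIRED_MODULES = ['setuptools', 'jdcal', 'openpyxl', 'xlrd']


def missing_modules(names):
    """Return the entries of `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling missing_modules(REQUIRED_MODULES) returns an empty list when everything is installed; any name it returns still needs to be installed.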


Specifying Input and Customizing the scripts

There are two variations of the script that can be used depending on the use case.


Script 1 – gen_posts_count.py


When to use: Use this script when you have a list of search terms, the time range, and the average number of posts per week for which you want to generate the demo data. With this script you cannot control the sentiment of the posts. Sentiment indicates whether the social user is saying something good, neutral, or bad in the social post. This script generates posts with random sentiment.

Input File: post_count_per_week.xlsx, in which you maintain the products and the corresponding number of posts per week to be generated.

See the attached screenshot – post_count_total.PNG
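
The input layout can also be pictured as plain rows of data. The sketch below is not part of the original script. It assumes, based on how gen_posts_count.py reads the file, that column A holds the product name and column B the posts per week, with no header row. The function name split_products_and_counts and the sample counts are made up for illustration.

```python
def split_products_and_counts(rows):
    """Split [product, count] rows into two parallel lists."""
    products = []
    counts = []
    for row in rows:
        products.append(row[0])
        # xlrd returns numeric cells as floats, so convert to int
        counts.append(int(row[1]))
    return products, counts


# Hypothetical sheet content: product name, average posts per week
sample_rows = [
    ['Oz Automotive', 12.0],
    ['Samba Motors', 7.0],
]
products, counts = split_products_and_counts(sample_rows)
```

If your file does have a header row, the int() conversion fails on it, so either remove the header or skip the first row.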

Modification to the script: The time range has to be specified at the end of the Python script. Open the script in a text editor and modify the gen_posts call. It takes the start date and the end date, each as [day, month, year], and the number of weeks that the time span comprises, for example: gen_posts([1, 12, 2013], [29, 1, 2014], 8)
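
If you do not want to count the weeks by hand, the third argument can be derived from the two dates. This is only a sketch under the assumption that whole weeks, rounded down, are what you want; weeks_in_span is a hypothetical helper and not part of the script.

```python
from datetime import date


def weeks_in_span(s_date, e_date):
    """Whole weeks between two [day, month, year] dates (at least 1)."""
    start = date(s_date[2], s_date[1], s_date[0])
    end = date(e_date[2], e_date[1], e_date[0])
    if end < start:
        raise ValueError('end date lies before start date')
    return max(1, (end - start).days // 7)
```

For the example dates, weeks_in_span([1, 12, 2013], [29, 1, 2014]) gives 8, which matches the value passed to gen_posts.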


#!/usr/bin/env python
# Generates a collection of dummy social media data
from random import choice, randint
from datetime import timedelta, datetime
from openpyxl import Workbook
import xlrd
# Reads product names (column A) and posts per week (column B) from the first sheet
def get_products_and_counts():
    book = xlrd.open_workbook('post_count_per_week.xlsx')
    sh = book.sheet_by_index(0)
    products = []
    counts = []
    for rownum in range(sh.nrows):
        products.append(sh.row_values(rownum)[0])
        counts.append(sh.row_values(rownum)[1])
    return products, counts
# Returns prefix followed by a random number with exactly ndigits digits
def randomN(prefix, ndigits):
    range_start = 10**(ndigits-1)
    range_end = (10**ndigits)-1
    return prefix + str(randint(range_start, range_end))
# Returns a random datetime between start and end (one-second resolution)
def random_date(start, end):
    return start + timedelta(
        seconds=randint(0, int((end - start).total_seconds())))
def gen_posts(s_date, e_date, no_of_weeks):
    social_filename = 'SOCIALDATA' + '.xlsx'
    voice_filename = 'SMI_VOICE_CUST' + '.xlsx'
    # Note: newer openpyxl releases name this option write_only instead of optimized_write
    social_book = Workbook(optimized_write = True)
    social_sheet = social_book.create_sheet()
    voice_book = Workbook(optimized_write = True)
    voice_sheet = voice_book.create_sheet()
    start_datetime = datetime(s_date[2], s_date[1], s_date[0], 0, 0, 0)
    end_datetime = datetime(e_date[2], e_date[1], e_date[0], 0, 0, 0)
    client_list = ['005']
    user_list = ['Ashwin', 'Saiprabha', 'Anupama', 'Debasish', 'Ajalesh', 'Raghav', 'Dilip', 'Rajesh', 'Saju', 'Ranjit', 'Anindita', 'Mayank', 'Santosh', 'Kavya', 'Jithu']
    #product_list = ['Oz Automotive', 'Samba Motors', 'Smoggy Auto', 'Camenbert Cars', 'Curry Cars', 'Driftar', 'eRacer', 'Rouble Motor Company', 'MoonRider', 'Bumble']
    channel_list = ['TW', 'FB']
    adj_set = {"good" : ['good', 'zippy', 'beautiful'],
          "very_good" : ['exuberant'],
          "neutral" : ['ok'],
          "bad" : ['bad', 'annoying'],
          "very_bad" : ['awful']}
    adj_kind_from_senti = { 2 : "very_good",
                1 : "good",
                0 : "neutral",
                -1 : "bad",
                -2 : "very_bad"}
    post_templates = {"very_good" : ["Hey guys, try {0}, it is {1}! Dont miss!",
                      "People, I got the new {0} - {1}!! Brilliant performance! Give a try!",
                      "If you havent yet, try {0}. The speed is fantastic, It is {1}!",
                      "The brandnew {0} - The product quality is impressive!! Verdict - {1}",
                      "{0} is {1}. Highly recommended"],
            "good"      : ["Today I tried {0}. It is {1}.",
                            "The new {0}. Product quality is top, is {1} and worth a try",
                            "Did you checkout {0}?, {1} thing.",
                            "Latest version of {0} is {1}. Excellent performance for me!",
                            "Didnt know {0} is {1} stuff. Superb speed!. Do try it."],
            "neutral"  : ["Checked out {0}. It is {1}",
                            "The new {0} is {1}. Dont expect much.",
                            "Difficult to judge the new {0}. It is {1}.",
                            "Heard the new {0} is {1}. Any first hand info on the performance?",
                            "Anyone know how is {0}, reviews say it is {1}. Quality is what matters"],
            "very_bad"  : ["OMG!! Tried {0}. Its performance is damn too low. It is {1}",
                            "Never go for {0}, the speed is very less, {1} thing.",
                            "Oh, such a {1} thing {0} is!",
                            "Dont ever think of getting a {0}, very bad product quality. It is {1}",
                            "Why do we have {1} products like {0}? :("],
            "bad"      : ["Tried the new {0}. It is not recommended - {1}",
                            "Shouldnt have gone for the {1} {0}. Pathetic product quality.",
                            "First hand experience: {0} is {1}!",
                            "My {0} is {1}. The speed is way too less. Is it just me?!",
                            "The new {0} is {1}. Performance is disappointing. Fail!!"]}
    products, counts = get_products_and_counts()
    for j in range(len(products)):
        product = products.pop()
        count = int(counts.pop()) * no_of_weeks
        print product, count
        for k in range(count):
            sentiment = randint(-2, 2)
            sentiment_valuation = sentiment + 3 if sentiment else sentiment
            adj_kind = adj_kind_from_senti[sentiment]
            adj = choice(adj_set[adj_kind])
            client = choice(client_list)
            guid = randomN('POB', 29)
            user = choice(user_list)
            channel = choice(channel_list)
            post_template = choice(post_templates[adj_kind])
            posted_on = random_date(start_datetime, end_datetime)
            post = post_template.format(product, adj)
            social_sheet.append([client, guid, channel[:2].upper() + str(randomN('',6)), 'English', channel, user, posted_on.strftime("%a, %d %b %Y %H:%M:%S +0000"),'','','','','','','','','','','', product,'', post])
            voice_sheet.append([client, guid, 'Text Analysis', 'Sentiment', '', sentiment, sentiment_valuation,'', '', posted_on.strftime("%Y%m%d%H%M%S")])
            voice_sheet.append([client, guid, 'Text Analysis', 'PRODUCT', product, sentiment, sentiment_valuation,'', '', posted_on.strftime("%Y%m%d%H%M%S")])
    social_book.save(social_filename)
    voice_book.save(voice_filename)
    print 'Demo data saved in SOCIALDATA.xlsx, SMI_VOICE_CUST.xlsx'
# Modify this line => gen_posts(start_date, end_date, number of weeks for which data is to be generated); dates are [day, month, year]
gen_posts([1, 12, 2013], [28, 1, 2014], 8)

PS: You can also configure other aspects such as usernames, channels, countries, locations, adjectives, and post templates.
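
To see how a sentiment value turns into the fields written for each post, here is a small sketch that mirrors the rule used in the script: the sentiment runs from -2 to 2, selects an adjective category, and is shifted by 3 to form the valuation unless it is 0. The function names sentiment_valuation and describe are made up for illustration. What the valuation scale means in the target table is not stated in this post.

```python
ADJ_KIND_FROM_SENTI = {
    2: 'very_good',
    1: 'good',
    0: 'neutral',
    -1: 'bad',
    -2: 'very_bad',
}


def sentiment_valuation(sentiment):
    """Mirror the script's rule: shift non-zero values by 3, keep 0 as 0."""
    return sentiment + 3 if sentiment else sentiment


def describe(sentiment):
    """Return (adjective category, valuation) for a sentiment value."""
    return ADJ_KIND_FROM_SENTI[sentiment], sentiment_valuation(sentiment)
```

So a sentiment of -2 yields the category very_bad with valuation 1, and a sentiment of 2 yields very_good with valuation 5.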

Script 2 – gen_senti_count.py


When to use: Use this script when you have a list of search terms, the time range, and the number of positive, negative, and neutral posts to be generated for each product in that time span. With this script you can control the sentiment of the posts.
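
The way the three counts steer the sentiment can be sketched as follows. This mirrors the logic in the script: positive posts get a random value of 1 or 2, negative posts a random value of -1 or -2, and neutral posts the value 0. The helper name plan_sentiments is hypothetical.

```python
from random import randint


def plan_sentiments(pos, neg, neu):
    """Return one sentiment value per post: positives, then negatives, then neutrals."""
    sentiments = []
    sentiments.extend(randint(1, 2) for _ in range(pos))
    sentiments.extend(randint(-2, -1) for _ in range(neg))
    sentiments.extend(0 for _ in range(neu))
    return sentiments
```

The counts are exact, but whether a positive post is good or very good is still random.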

Input File: senti_count_total.xlsx (the file name the script opens), in which you maintain the products and the corresponding numbers of positive, negative, and neutral posts to be generated. See the attached screenshot – senti_count_total.PNG

Modification to the script: The time range has to be specified at the end of the Python script. Open the script in a text editor and modify the gen_posts call to give the start and end dates, each as [day, month, year]. This variant takes no week count, for example: gen_posts([22, 5, 2014], [5, 6, 2014])
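
Script 2 treats the end date as inclusive by moving the end of the range to the following midnight. Here is a minimal sketch of that conversion, assuming a hypothetical helper to_datetime_range. It adds a timedelta of one day instead of incrementing the day number, so an end date on the last day of a month also works.

```python
from datetime import datetime, timedelta


def to_datetime_range(s_date, e_date):
    """Convert two [day, month, year] lists to (start, end), end date inclusive."""
    start = datetime(s_date[2], s_date[1], s_date[0])
    # Midnight after the end date, so posts can fall anywhere on the end date itself
    end = datetime(e_date[2], e_date[1], e_date[0]) + timedelta(days=1)
    return start, end
```

For example, an end date of [31, 5, 2014] produces an upper bound of 1 June 2014, 00:00.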


#!/usr/bin/env python
# Generates a collection of dummy social media data
from random import choice, randint
from datetime import timedelta, datetime
from openpyxl import Workbook
import xlrd
# Reads rows like "NIKE 23 14 45" from senti_count_total.xlsx: the counts of positive, negative, and neutral posts to be generated for NIKE in the given period
def get_products_and_senti_num():
    book = xlrd.open_workbook('senti_count_total.xlsx')
    sh = book.sheet_by_index(0)
    products = []
    senti_num = []
    for rownum in range(sh.nrows):
        products.append(sh.row_values(rownum)[0])
        senti_num.append(sh.row_values(rownum)[1:4])
    return products, senti_num
#Returns prefix + ndigits
def randomN(prefix, ndigits):
    range_start = 10**(ndigits-1)
    range_end = (10**ndigits)-1
    return prefix + str(randint(range_start, range_end))
# Returns a random datetime between start and end (one-second resolution)
def random_date(start, end):
    return start + timedelta(
        seconds=randint(0, int((end - start).total_seconds())))
def gen_posts(s_date, e_date):
    social_book = Workbook(optimized_write = True)
    social_sheet = social_book.create_sheet()
    voice_book = Workbook(optimized_write = True)
    voice_sheet = voice_book.create_sheet()
    start_datetime = datetime(s_date[2], s_date[1], s_date[0], 0, 0, 0)
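    # Add one day with timedelta so that the end date is included; incrementing the day number fails on the last day of a month
    end_datetime = datetime(e_date[2], e_date[1], e_date[0], 0, 0, 0) + timedelta(days=1)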
    client_list = ['001']
    user_list = ['John', 'William', 'James', 'Jacob', 'Ryan', 'Joshua', 'Michael', 'Jayden', 'Ethan', 'Christopher', 'Samuel', 'Daniel', 'Kevin', 'Elijah']
    channel_list = ['TW', 'FB']
    countries = ['India', 'Germany', 'France', 'The United States']
    locations = {"India" : ["Bangalore", "Chennai", "Delhi", "Mumbai"],
                "Germany": ["Berlin", "Munich", "Stuttgart", "Frankfurt"],
                "France": ["Paris", "Marseille", "Lyon"],
                "The United States": ["Florida", "Washington DC", "Texas", "Dallas"]}
    country_codes = {"India": "IN",
                    "Germany" : "DE",
                    "France" : "FR",
                    "The United States": "US"}
    # The adj_set has the adjectives that will be used in the posts.
    adj_set = {"good" : ['good', 'nice'],
          "very_good" : ['refreshing', 'magical'],
          "neutral" : ['ok'],
          "bad" : ['not good', 'substandard', 'unpleasant', 'poor'],
          "very_bad" : ['awful', 'horrible', 'terrible']}
    adj_kind_from_senti = { 2 : "very_good",
                1 : "good",
                0 : "neutral",
                -1 : "bad",
                -2 : "very_bad"}
    post_templates = {"very_good" : ["Hey guys, try {0}, it is {1}! Dont miss!",
                      "People, I got the new {0} - {1}!! Brilliant! Give a try!",
                      "I'm loving {0}!!",
                      "Using {0} feels great!!",
                      "{0} is {1}. My body feels so refreshing",
                      "{0} - The product quality is impressive!! Verdict - {1}",
                      "{0} is {1}. Highly recommended",
                      "{0} gives instant refreshing moisturizing effect!"],
            "good"      : ["Today I tried {0}. It is {1}.",
                            "The new {0}. Product quality is top, is {1} and worth a try",
                            "Did you checkout {0}?, {1} thing.",
                            "I like {0}. It smells nice and so soft",
                            "Didnt know {0} is {1} stuff. Superb!. Do try it."],
            "neutral"  : ["Checked out {0}. It is {1}",
                            "The new {0} is {1}. Dont expect much.",
                            "Heard the new {0} is {1}. Any first hand info on it?",
                            "Anyone know how is {0}, reviews say it is {1}. Quality is what matters"],
            "very_bad"  : ["OMG!! Tried {0}. Its not for you. It is {1}",
                            "Never go for {0}, the quality is very less, {1} thing.",
                            "Oh, such a {1} thing {0} is!",
                            "{0} is sold out in my area - Sad!",
                            "Couldnt find {0} in my local store. Bad that I cant get that.",
                            "Local stores have sold out {0}, please send in more!!",
                            "We need more stock of {0} in here. Out of stock everywhere I check",
                            "{0} is out of stock - So sad!",
                            "Dont ever think of getting a {0}, very bad product. It is {1}",
                            "Why do we have {1} products like {0}? :("],
            "bad"      : ["Tried the new {0}. It is not recommended - {1}",
                            "Shouldnt have gone for the {1} {0}. Pathetic product quality.",
                            "First hand experience: {0} is {1}!",
                            "10 stores and no {0}. I want it desperately",
                            "Tried finding {0}. Can't find it in any stores in my area.",
                            "My {0} is {1}. The quality is way too less. Is it just me?!",
                            "The new {0} is {1}. It is disappointing. Fail!!"]}
    products, senti_num = get_products_and_senti_num()
    for j in range(len(products)):
        product = products.pop()
        senti = senti_num.pop()
        pos = int(senti[0])
        neg = int(senti[1])
        neu = int(senti[2])
        print product, "-", pos, neg, neu, " posts created."
        for k in range(pos + neg + neu):
            if pos:
                sentiment = randint(1,2)
                pos -= 1
            elif neg:
                sentiment = randint(-2,-1)
                neg -= 1
            else:
                sentiment = 0
                neu -= 1
            sentiment_valuation = sentiment + 3 if sentiment else sentiment
            adj_kind = adj_kind_from_senti[sentiment]
            adj = choice(adj_set[adj_kind])
            client = choice(client_list)
            guid = randomN('POB', 29)
            user = choice(user_list)
            channel = choice(channel_list)
            post_template = choice(post_templates[adj_kind])
            posted_on = random_date(start_datetime, end_datetime)
            post = post_template.format(product, adj)
            num_of_votes = str(randint(0, 150))
            if channel == 'TW':
                post_link = 'http://twitter.com/' + user + randomN('', 5)
            elif channel == 'FB':
                post_link = 'http://facebook.com/' + user + randomN('', 5)
            post_type = choice(['Status', 'Link', 'Photo', 'Video'])
            country = choice(countries)
            location = choice(locations[country])
            country_code = country_codes[country]
            latitude = str(randomN("", 2) + '.' + str(randint(2, 20)))
            longitude = str(randomN("", 2) + '.' + str(randint(2, 20)))
            social_sheet.append([client, guid, channel[:2].upper() + str(randomN('',6)), 'English', channel, user, posted_on.strftime("%a, %d %b %Y %H:%M:%S +0000"), post_type, post_link, num_of_votes, location, country, latitude, longitude, '3', 'Demo post', user, 'Demo User Retrieval', product, posted_on.strftime("%Y%m%d%H%M%S"), post, posted_on.strftime("%Y%m%d%H%M%S"), 'Demo Post Parent', "DemoJ", country_code, 'DS'])
            voice_sheet.append([client, guid, 'TextAnalysis', 'Sentiment', 'DEMO', sentiment, sentiment_valuation, 'J', posted_on.strftime("%Y%m%d"), posted_on.strftime("%Y%m%d%H%M%S")])
            voice_sheet.append([client, guid, 'TextAnalysis', 'PRODUCT', product, sentiment, sentiment_valuation, 'J', posted_on.strftime("%Y%m%d"), posted_on.strftime("%Y%m%d%H%M%S")])
    social_book.save('SOCIALDATA.xlsx')
    voice_book.save('SMI_VOICE_CUST.xlsx')
    print 'Demo data saved in SOCIALDATA.xlsx, SMI_VOICE_CUST.xlsx'
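# Modify this line => gen_posts(start_date, end_date); dates are [day, month, year]
# Write day and month numbers without leading zeros: 08 and 09 are invalid literals in Python 2
gen_posts([22, 5, 2014], [5, 6, 2014])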



Running the script

Both of the above scripts can be run in the following manner:


1) Save the script and the input Excel file in the same directory.

2) Open that directory in Windows Explorer, hold down the Shift key, and right-click.

3) Select 'Open command window here'.

4) At the command line, type: python <scriptname>

5) Done. If everything worked as expected, you will have SOCIALDATA.xlsx and SMI_VOICE_CUST.xlsx files generated in that folder with the dummy data.
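
If you want to check the result from Python rather than by looking in the folder, a small sketch like the following would do. It only tests that the two output files exist; missing_outputs is a hypothetical helper and not part of the scripts.

```python
import os

EXPECTED_OUTPUTS = ['SOCIALDATA.xlsx', 'SMI_VOICE_CUST.xlsx']


def missing_outputs(directory):
    """Return the expected output files that are not present in `directory`."""
    return [name for name in EXPECTED_OUTPUTS
            if not os.path.isfile(os.path.join(directory, name))]
```

Calling missing_outputs('.') from the script directory returns an empty list when both files were generated.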


Tailpiece

As mentioned in the disclaimer already, these scripts should be used only for demo purposes.

The attached screenshots show what the input Excel files should look like.

If you run into any issues during the setup or execution of the script, please let me know in the comments section.
