Innovation Weekend and the irony of “fixing SCN from within”
I’m not really sure how I ended up at Innovation Weekend 2010 in Vegas.
Background and history
For me it goes back to Hackers night 2009 in Vienna – when Craig Cmehil (our very own Simon Cowell) and I were talking late at night, not long before Craig, Tom Jung and Duane Chaos and I got kicked out the MesseCentrum building. We were discussing the RIA hackers night, which I felt had turned into a late night learning session based on our own laptops. And whilst it was still a cool thing to do, I for one felt that it was missing something. Mostly, real hacking and a deliverable. And real purpose. And what I didn’t realise at that time was that Marilyn Pratt had already taken a load of hackers into the BPX slam and those guys were really hacking.
Craig and I kept in touch and whilst I don’t know how much I shaped the end product – he probably already had a plan, he usually does – the end result was the Innovation Weekend. An all night hackers night running from Sunday to Monday. I wasn’t able to attend in Berlin and, as luck would have it, the cheap tickets to Vegas were on Saturday, which meant that I had just about recovered from jetlag by Sunday at 1pm.
Now I have to take you back again 6 months to the SAP Inside Track in London that Darren Hague kindly organised. I met a PhD student called Sarah Otner, who was researching the recognition system in the SAP Community Network. I loved her passion and interest in the system, and she was really frustrated because she needed data to mine for her thesis. SAP were blocking her attempts to get the data out, either for technical or legal reasons. I don’t think that it was an orchestrated attack – but rather the typical problems that you see in a large corporation.
I saw her in Berlin last week and she looked slightly downtrodden – no progress on data in the 6 months since I saw her at SIT London. I felt that for SAP there was no downside – free research and exposure for one of the most exciting community networks in the world.
Fast forward to Vegas
… and I found myself in the amazing Innovation Weekend masterminded by Marilyn Pratt and Craig Cmehil. Without those guys it would be nothing.
They had prepared 8 BPX focussed business cases and one of these was as follows:
8. “Physician: Heal Thyself”: Improving the SCN from within!
Posted by: [Sarah Otner | http://www.sdn.sap.com/irj/scn/bc?u=oczhwhdeywc%3d]
GOAL: Improve the recognition systems of SCN by examining the historical data
– Does the SCN recognition system reward the right kinds of behaviors and contributions?
– What’s the real value of being a Top Contributor?
PROBLEM: Initial attempts to pull the source data already available on SCN into Excel failed as they only returned 10 lines and the same 10 lines upon each request (a problem when one Top Contributor table has 17,000 individuals).
CHALLENGE:
– A database of community members and their activity year-on-year for as many years as is available.
– Scrape the Contributor Recognition Program, the Top Contributors’ lists, the Topic Leaders’ lists, and the Mentors’ rosters into a format easily manipulable (by me! :-)) for analysis
What next?
Fellow SAP Mentor Thorsten Franz turned up at the table along with a number of other great individuals. And it became clear that this was a pretty easy technology challenge, provided we could get the data. So I set about getting the data whilst Thorsten, Arun, Laurent and others worked on analytics and presentation.
Mounting a DoS on SCN (aka making friends and influencing people)
So it turns out that the only way to get points data out of SCN is to read the RSS feeds on the contributor pages. Only the contributor page version is broken. The company version does however work, and it is possible to see points – by Company, by Person, by Year and by Development Area. Can you see where I am headed?
So if you want to find out the contributors for Bluefinsolutions.com – for 2010 and for Mobile – you can go here:
feed://www.sdn.sap.com/irj/sdn/topcontributorsrss?periodid=y10&minimumpointscount=20&areaids=g&organization=bluefinsolutions.com
So all I needed to do was to write a script to get this for all companies, all years and all points areas. Simple, right? Here’s the bash script to do it:
for year in `cat ../year`; do for devel in `cat ../devel`; do for comp in `cat ../companynames`; do wget -O $year,$devel,$comp 'http://www.sdn.sap.com/irj/sdn/topcontributorsrss?periodid='$year'&minimumPointsCount=20&areaIds='$devel'&organization='$comp; done; done; done
Note that I downloaded the years, company names and development areas using the same techniques and put them in files – and note that each output filename is itself comma-separated (year, development area, company), so that it can later be appended to each row of the CSV. But… the first time round I forgot to escape the & by surrounding it in inverted commas. The shell treats a bare & as “run this command in the background”, so instead of waiting for each download to finish I opened up 2500 threads (I used the top 2500 companies). And SDN died for 3 hours.
After SDN came back up I fixed my script and parallelised it by year – so just 8 threads running. It took 10 hours to download all the data into some 180,000 XML files. Thankfully, we have lots of CPU power these days. So I wrote some scripts around that too.
First, files that are 409 bytes long don’t actually have any data in them. So we strip them out of the list of files to process as follows:
for a in `find -not -size 409c -print | sed '1d'| cut -c3-100`; do echo $a; done > ../filled
And then we strip the XML out, turn it into a flat file and append the filename that relates to it to each line.
for a in `cat ../filled`; do cat $a | sed '1,9d' | more | sed ':a;N;$!ba;s/<\/title>\n/,/g' | sed 's/<title>//' | sed ':a;N;$!ba;s/<\/link>\n/,/g' | sed 's/<link>//' | sed ':a;N;$!ba;s/<\/description>\n/,/g' | sed 's/<description>//' | sed ':a;N;$!ba;s/<\/pubDate>\n/,/g' | sed 's/<pubDate>//' | sed ':a;N;$!ba;s/<\/scn:rank>\n/,COMPANY/g' | sed 's/<scn:rank>//' | sed ':a;N;$!ba;s/<\/item>\n//g' | sed 's/<item>//' | sed ':a;N;$!ba;s/<\/rss>//g' | sed ':a;N;$!ba;s/<\/channel>//g' | sed '1d' | sed '$d' | sed "s/COMPANY/$a/"; done >> ../fillout.csv
This gives us a bunch of data that looks like this:
Jon Reed,https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.sdn.businesscard.sdnbusinesscard?u=glyawsx5bmi%3d,80,Tue, 10 Feb 2009 2:43:19,1,y08,P,jonerp.com
All we do then is convert the data and replace some years and development areas, and we’ve got a nice big CSV file with people by year, development area and company.
The rest is easy
The rest of our demo was easy – we uploaded the big CSV file into SAP’s cloud BI Service – http://bi.ondemand.com – and used SAP BusinessObjects Explorer to look at the data. We also used the new beta BUDA dashboarding service, which worked pretty well.
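Looking back at the download step, a gentler variant of that loop is worth sketching for anyone tempted to repeat it. This is only a sketch under assumptions: the base URL and parameter names are the ones from the feed address above, but the function names, the one-second pause and the wget retry and timeout settings are made up for illustration and are not what ran on the night. The point is that the whole URL is built inside quotes, so the & never reaches the shell, and that requests go out one at a time with a pause in between.

```shell
#!/usr/bin/env bash
# Sketch only. BASE_URL and the query parameters come from the post;
# the function names, the delay and the wget settings are assumptions.

BASE_URL='http://www.sdn.sap.com/irj/sdn/topcontributorsrss'

# Print the feed URL for one year / development area / company.
# The '&' characters sit inside the quoted printf format string, so the
# shell never interprets them as "run this command in the background".
build_feed_url() {
  local year="$1" area="$2" company="$3"
  printf '%s?periodid=%s&minimumPointsCount=20&areaIds=%s&organization=%s\n' \
    "$BASE_URL" "$year" "$area" "$company"
}

# Fetch one feed into a file named year,area,company, then pause so the
# server only ever sees one request at a time from this script.
fetch_feed() {
  local year="$1" area="$2" company="$3"
  wget --quiet --tries=2 --timeout=30 \
    -O "$year,$area,$company" \
    "$(build_feed_url "$year" "$area" "$company")"
  sleep "${DELAY_SECONDS:-1}"
}

# Walk every combination sequentially, reading the same three list files
# (../year, ../devel, ../companynames) that the original loop used.
fetch_all() {
  local year area company
  while read -r year; do
    while read -r area; do
      while read -r company; do
        fetch_feed "$year" "$area" "$company"
      done < ../companynames
    done < ../devel
  done < ../year
}
```

Nothing runs until `fetch_all` is called from the download directory. Calling `build_feed_url y10 g bluefinsolutions.com` on its own is a cheap way to check the URL before any request is sent.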
Conclusions
Well, we have done what we set out to achieve. We have 7 years of SCN data explorable by most of the metrics that Sarah was looking for. There are some things that were hard to do – especially scraping the master data from SCN business cards and that is a work in progress.
But what we’re hoping, and there are a number of us who share this vision, is that as the SCN team start to realise the value of students analysing the data, we will be able to break down the walls and make more detailed information available to people like Sarah who want to write PhD theses on the community.
Huge thanks to Marilyn and Craig for making it possible. To Kai and Mark and Chip and everyone from the SCN team who I inconvenienced. Sorry about that.
thank you for sharing the story of innovation weekend. Great to see how a bit of wget + sed magic can make a difference. I hope with you that the SCN Team provides better ways for extracting data out of SCN. And yes, the business card is a mess :-(.
I've talked to Sarah at SAPTechEd Berlin too and pushed her to add her case to the Innovation Weekend Business Cases. I'm glad that it worked out so well.
You, Thorsten and the rest of your team did an amazing job!
Best regards
Gregor
I'm sorry I didn't credit you already and I'm sorry for all the other people I have probably not credited either. Thanks because without you, there would have been no business case 8.
Regards,
John
I was particularly touched by John's evaluation of my drive and my frustrations: spot on. Moreover, I would like to underscore that SAP did not purposefully "torpedo" my research; some of the difficulties to date have arisen due to the different priorities and timelines that function in academic research versus *large* industry. However, after TechEd Berlin, I am more excited than ever about making SCN my thesis case. (<-- Geek Girl alert!)
So, "watch this space" for news about my research, and keep your fingers crossed for progress and juicy results! Thanks again,
~Sarah~
Is it possible to see the final result on BI.ondemand.com?
Congratulations guys
Regards,
Ivan
Should be easy to connect the 2 of you here on the exhibit floor/clubhouse of SAP TechEd. Let me know.
Moreover I wonder why a hack to bring down SCN easily is openly blogged on SCN.
But maybe you just had too little sleep to consider the consequences because of the 30 hour marathon of the great Innovation Weekend.
anton
Would love to hear some specific feedback on this because this is all publicly available information that we aggregated. No login to SCN is required.
Regards,
John
As to any controversy about showing how easy it is to get these data, if it's illegal in some countries it's probably sufficient to say so. Not that I'm a lawyer either.
Jim
anyway, I am neither the police nor a lawyer, I just feel slightly uncomfortable with this.
apart from that, have fun at TECHED10 LV.
Therefore, what John & team facilitated was the primary level request, and hopefully SAP will be able to help with the second request.
Thanks to all!
I hope that together we soon will find a solution! It will be my pleasure to deliver valuable results to this fine community.
~Sarah~
As for SAP officially supporting Sarah, I'm sure Sarah will attest to the many wonderful people she has met along the way and with a little bit of luck she will be overloaded with data here thanks to some meetings and discussions that took place in Berlin - she won't be able to publish all of the data she will hopefully be getting soon but she will be able to hopefully move forward very quickly.
As for the legality of it, none of this data affects data privacy, as the data does not fall under "personal data" in that sense, and John took advantage of something we added to SCN back in March - RSS Baby! RSS! - we could go down into all of the legal debates about it but the paragraphs I have from legal experts kind of trump everything for this particular set of data. What Sarah needs on the 2nd level is where things are tricky and taking so long to figure out (because we are not legal experts and keep having to ask questions and get answers), and part of the reason this first level of data is considered OK is the nature of how you register and the "display name" you choose to give when doing so (as I have been informed).
Sorry to spoil the fun on the "I hacked and brought SCN down" and "we stole SCN data" but didn't want you all to lose too much more sleep over all this 😉
What I think we all witnessed was incredible passion and intelligence (kudos John) being implemented real-time to help the community (and our management by extension) and the newest member of our SCN family, Sarah. Thanks Craig for the intro to another "Jersey Girl" and I hope we get to work together quickly.
The real point is that we seem to have moved Sarah on a few steps down her journey of getting all the real data she wants.
The rest of that data should be scrapable from the SCN business cards. Do I see a part 2 coming on?
I absolutely will agree that SAP is people-powered, and would like to borrow Marilyn's "family" language. I always enjoy myself AND learn something when interacting with SCN members, which helps me to love my job.
A great big "thank you" to everyone who has helped me - and who will help me; I am lucky to have had the "Otto Gold experience"! And now, back to the books...
Sarah is so nice and I really wanted to help her, but didn't know how to do what you did (and am sure I wouldn't make it alone). I am sure it was a great challenge and you made it!
I hope we will see some result from her side and will get something insightful about our community/scn which could help us with bettering this place:))
All the best,
Otto
To display the data, we used a prototype called Exploration Views (project codename BUDA) that leverages the SAP BusinessObjects Explorer engine, CVOM graphic libraries and BIOnDemand platform.
It should be available on the http://www.sdn.sap.com/irj/boc/innovation-center website the week of November 15th.
All the best!
Laurent