CPITracker is back up and running

As my fellow SAP Mentors Daniel Graversen and Eng Swee Yeoh kindly pointed out to me, the CPITracker Twitter feed has not been updating correctly for a while. They were right, of course. I looked into the problem and it turned out to be something that was painfully predictable…

As you may know, CPITracker tracks updates to the various underlying components of SAP Cloud Integration, such as Apache Camel, Java and Groovy. If this is the first time you've heard of CPITracker, here’s a more in-depth blog post.

In addition to those components, it also tracked (past tense) the Adapter Development Kit, the Script API JAR file and Cloud Connector. It did so by extracting their version numbers directly from the SAP Development Tools page.

Specifically, my Groovy script would fetch the HTML of the page, parse it using the Jsoup library and pull out the version numbers. Now, this technique – known as screen scraping – has a very obvious problem: You can only pull bits of data out of an HTML page by making certain assumptions about the structure of that page. Once that structure changes, which it does sooner or later, your code will break.
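To illustrate the fragility, here is a minimal sketch in Python using only the standard library. It is not the Groovy/Jsoup code that CPITracker used; the markup, the `version` CSS class and the version string are all invented for the example.

```python
from html.parser import HTMLParser


class VersionCellParser(HTMLParser):
    """Collects the text of the first <td class="version"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.version = None

    def handle_starttag(self, tag, attrs):
        # Structural assumption: the version lives in <td class="version">.
        if tag == "td" and ("class", "version") in attrs:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only the first non-empty text found inside the assumed cell.
        if self._in_cell and self.version is None and data.strip():
            self.version = data.strip()


def extract_version(html):
    """Return the scraped version string, or None if the assumed cell is absent."""
    parser = VersionCellParser()
    parser.feed(html)
    return parser.version


# Hypothetical page before and after a redesign; the data is still there,
# but it has moved to a place the scraper does not look.
old_page = '<table><tr><td>Cloud Connector</td><td class="version">2.12.1</td></tr></table>'
new_page = '<div class="tool"><span>Cloud Connector</span><span data-version="2.12.1"></span></div>'

print(extract_version(old_page))
print(extract_version(new_page))
```

The first call finds the version; the second returns `None`, because the redesigned page no longer matches the assumption baked into `handle_starttag`.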

And that’s exactly what happened here. I fixed the problem by removing the screen scraping code, which shouldn’t really have been in there in the first place. That means, however, that CPITracker no longer tracks those three version numbers.

So CPITracker is up and running again and the moral of the story is: Don’t screen scrape 🙂 And if you absolutely must do it, make sure to continuously check your assumptions.
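Checking assumptions can be as simple as validating whatever the scraper returns before using it. Here is a minimal Python sketch; the names `checked_version` and `ScrapeAssumptionError` are hypothetical, and the dotted-number version format is an assumption made for the example.

```python
import re

# Assumption: versions look like dotted numbers, e.g. "1.2" or "1.2.3".
VERSION_PATTERN = re.compile(r"\d+(?:\.\d+)+")


class ScrapeAssumptionError(Exception):
    """Raised when scraped data no longer matches what the scraper expects."""


def checked_version(component, raw):
    """Return the cleaned version string, or raise with an informative message."""
    if raw is None:
        raise ScrapeAssumptionError(
            f"{component}: version element not found; page structure may have changed"
        )
    value = raw.strip()
    if not VERSION_PATTERN.fullmatch(value):
        raise ScrapeAssumptionError(
            f"{component}: {value!r} does not look like a version number"
        )
    return value
```

Failing loudly like this does not make the scraper any more robust, but it turns a silent stream of wrong or missing updates into an error that gets noticed.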

  • Hi Morten,

    thanks for the update and the honest reflection on what happened. For exactly this reason I don't like to use HTML DOM parsers for screen scraping. If I really have to scrape, I prefer using a plain HTTP lib in combination with some regular expressions. In my experience this approach is often more robust (at least if one puts some time into thoughtful regex patterns). But as you said - no scraping is the safest option in the end. 😉

    • I had thought somewhat ahead, because I had a reasonably informative error message logged 😀 I just ended up dropping the scraping entirely, rather than updating the code. I would say that regexes are prone to the same problem; they need to be based on assumptions about the page that will stop holding at some point.

      • Maybe I was a little bit imprecise in my last comment, because I fully agree that at larger scale regexes are also error-prone. What I wanted to say is that, when written with thought, regexes will survive one or another DOM change, whereas a DOM parsing library usually breaks instantly if the DOM (or the nesting/levels inside the DOM) changes. But in the end it comes to the same result: scraping is usually evil.

        Thinking about the last statement… Isn’t the new SAP iRPA also some kind of scraping (or, better said, a GUI tool to build scraping bots)?

        • It's usually more sophisticated than that (luckily). If you, for instance, record clicking an SAPUI5 button, it will understand that you're clicking a button and even if the button is moved to a different location on the screen, the button click can be replayed. But I see your point about there being some conceptual overlaps between screen scraping and RPA.