BSP Programming: Crawling SDN

former_member181879 · ‎07-06-2004

In a previous weblog, a simple web crawler was built using the HTML Viewer that is integrated into SAPGUI. In this weblog, it is time to use the web crawler. First, the crawler is extended into a program that fetches the information about all weblogs from SDN. In subsequent steps, the numbers are tortured to tell us the “story of success”. We use the BSP Extension “graphics” to visualize the numbers.

The Phases of Crawling SDN

The basic crawler is complete, and was shown previously. Now it is time to build onto this work. Again, only parts of the code will be presented; the full source code is available.

For the SDN crawler, we will define a number of phases: LOGON, GET_MONTHS (this was the fastest way to get an overview of all weblogs!) and QUIT. Each phase will have one method to create all the URLs that are required, and one method that will be called with the contents of each URL.

For the LOGON phase, only a POST request with the authentication data is required. This step must be done to gather the different session cookies and to get an SSO2 (single sign-on) cookie. The returned content is not parsed further.

The GET_MONTHS phase is the more interesting part. We are interested in a complete list of all weblogs written, plus the basic data about each. However, not all of this information is available via RSS feeds. An alternative source was required. I saw that the monthly archives listed all weblogs since the start of SDN.

!https://weblogs.sdn.sap.com/weblogs/images/164/BP_CSDN_001.GIF|width=573,height=218,border=0!

Once the URL format for each month’s archive is known, it becomes very easy to build a list of URLs that are required to retrieve the information. With some reverse engineering, you quickly see that the first weblogs were written in May 2003.

<!~~code~~> METHOD get_months.

<!~~code~~> next_phase = 'QUIT'.

<!~~code~~>

<!~~code~~> DATA: url TYPE string,

<!~~code~~> date TYPE D VALUE '20030501'.

<!~~code~~> WHILE date < sy-datum.

<!~~code~~> CONCATENATE 'get__https://www.sdn.sap.com/irj/sdn/weblogs?blog=/weblogs/date/'

<!~~code~~> date(4) '/' date+4(2)

<!~~code~~> INTO url.

<!~~code~~> APPEND url TO urls.

<!~~code~~> date = date + 31. date+6(2) = '01'.

<!~~code~~> ENDWHILE.

<!~~code~~> ENDMETHOD.

<!~~code~~>

The get_months method is called only once to queue all the URLs that are required for this phase. However, the content will be delivered per page. The input parameter is one string that contains the complete body of the loaded page. Simple ABAP string operations (mostly SPLIT!) are used to extract the relevant weblog information from the HTML string.

<!~~code~~> METHOD get_months_content.

<!~~code~~> WHILE content CS '/pub/wlg/'.

<!~~code~~> ....

<!~~code~~> APPEND INITIAL LINE TO blogs ASSIGNING .

<!~~code~~> ....

<!~~code~~> SPLIT content AT '/pub/wlg/' INTO garbage content.

<!~~code~~> SPLIT content AT 'Permalink' INTO blog content.

<!~~code~~>

<!~~code~~> SPLIT blog AT '"' INTO -url blog.

<!~~code~~> CONCATENATE 'https://weblogs.sdn.sap.com/pub/wlg/' -url.

<!~~code~~>

<!~~code~~> SPLIT blog AT '-title blog.

<!~~code~~> SPLIT -title.

<!~~code~~>

<!~~code~~> SPLIT blog AT '

' INTO garbage blog.

<!~~code~~> SPLIT blog AT '-abstract blog.

<!~~code~~> ....

<!~~code~~> ENDWHILE.

<!~~code~~> ENDMETHOD.

<!~~code~~>
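To make the SPLIT technique easier to follow, here is a minimal, self-contained sketch of the same idea. It is not the crawler's actual code: the sample markup, the row type ty_blog and the field symbol name are all made up for illustration, and the real archive pages look different.

```abap
REPORT z_split_sketch.

* Hypothetical row type; the real crawler structure may differ.
TYPES: BEGIN OF ty_blog,
         url   TYPE string,
         title TYPE string,
       END OF ty_blog.

DATA: blogs   TYPE STANDARD TABLE OF ty_blog WITH DEFAULT KEY,
      content TYPE string,
      garbage TYPE string,
      blog    TYPE string.

FIELD-SYMBOLS <entry> TYPE ty_blog.

START-OF-SELECTION.

* Made-up sample markup, only to exercise the SPLIT logic.
  CONCATENATE '<A href="/pub/wlg/101">First title</A> Permalink'
              '<A href="/pub/wlg/102">Second title</A> Permalink'
         INTO content.

  WHILE content CS '/pub/wlg/'.
    APPEND INITIAL LINE TO blogs ASSIGNING <entry>.

    " Drop everything in front of the next weblog link ...
    SPLIT content AT '/pub/wlg/' INTO garbage content.
    " ... and cut out this weblog's fragment up to its Permalink marker.
    SPLIT content AT 'Permalink' INTO blog content.

    " The fragment starts with the weblog id, ended by the closing quote.
    SPLIT blog AT '"' INTO <entry>-url blog.
    CONCATENATE 'https://weblogs.sdn.sap.com/pub/wlg/' <entry>-url
           INTO <entry>-url.

    " In this sample the title sits between '>' and the next '<'.
    SPLIT blog AT '>' INTO garbage blog.
    SPLIT blog AT '<' INTO <entry>-title blog.
  ENDWHILE.

  LOOP AT blogs ASSIGNING <entry>.
    WRITE: / <entry>-url, <entry>-title.
  ENDLOOP.
```

Each SPLIT with two targets places the text before the first separator into the first variable and the complete rest into the second, which is why the loop consumes the page one weblog at a time.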

One interesting aspect is that the HTML received here does not match the actual HTML sent to the browser. The HTML source (as seen using “View Source” in the browser) is parsed into an HTML DOM (document object model), which is a tree-like representation of the HTML source. The web crawler uses the outerHTML command to convert the HTML DOM back into a string. As a result, some tag sequences that are used liberally in the source come back rendered differently by the outerHTML call.

The QUIT phase just dumps the complete internal table of weblogs into one BSP server side cookie.
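As a rough sketch of what such a dump could look like (this is not taken from the crawler source), the following report stores a small table with CL_BSP_SERVER_SIDE_COOKIE and reads it back. The cookie name, the 'NONE' placeholder values, the row type and the seven-day expiry are assumptions chosen for illustration.

```abap
REPORT z_cookie_sketch.

* Hypothetical row type; the real crawler structure may differ.
TYPES: BEGIN OF ty_blog,
         url   TYPE string,
         title TYPE string,
       END OF ty_blog,
       ty_blogs TYPE STANDARD TABLE OF ty_blog WITH DEFAULT KEY.

DATA: blogs TYPE ty_blogs,
      copy  TYPE ty_blogs,
      blog  TYPE ty_blog,
      n     TYPE i.

START-OF-SELECTION.

* One made-up entry so that there is something to store.
  blog-url   = 'https://weblogs.sdn.sap.com/pub/wlg/101'.
  blog-title = 'Sample title'.
  APPEND blog TO blogs.

* Write the complete internal table into one server side cookie.
* All name values below are assumed placeholders.
  cl_bsp_server_side_cookie=>set_server_cookie(
    name                  = 'SDN_BLOGS'
    application_name      = 'NONE'
    application_namespace = 'NONE'
    username              = 'NONE'
    session_id            = 'NONE'
    data_name             = 'BLOGS'
    data_value            = blogs
    expiry_date_rel       = 7 ).

* Reading uses the same keys; the data comes back via CHANGING.
  cl_bsp_server_side_cookie=>get_server_cookie(
    EXPORTING
      name                  = 'SDN_BLOGS'
      application_name      = 'NONE'
      application_namespace = 'NONE'
      username              = 'NONE'
      session_id            = 'NONE'
      data_name             = 'BLOGS'
    CHANGING
      data_value            = copy ).

  n = lines( copy ).
  WRITE: / 'Rows read back:', n.
```

Using fixed key values instead of the current user and session is what would let a separate BSP page pick up the data later, as long as it uses exactly the same keys.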

Writing the SDN Web Crawler

The basic web crawler already contains all the functionality to handle the request for one URL and to return the content of the document. Furthermore, it has a number of interesting methods that can be redefined: initialize, next and loaded_content. The SDN crawler is designed to inherit from the simple web crawler, so we only have to redefine these methods.

<!~~code~~> CLASS cl_sdn_crawler DEFINITION INHERITING FROM cl_html_crawler.

<!~~code~~> PUBLIC SECTION.

<!~~code~~> METHODS initialize REDEFINITION.

<!~~code~~> METHODS next REDEFINITION.

<!~~code~~> METHODS loaded_content REDEFINITION.

<!~~code~~> ....

<!~~code~~> ENDCLASS.

<!~~code~~>

The phases have been designed and the basic concept is that each phase will first supply a list of URLs that it is interested in, and will then be called with the content per URL.

Thus the initialize method will just start the first phase, and call the phase method (see the use of a dynamic method call) to fill the URL list.

<!~~code~~> METHOD initialize.

<!~~code~~> phase = 'LOGON'.

<!~~code~~> CALL METHOD me->(phase).

<!~~code~~> me->next( ).

<!~~code~~> ENDMETHOD.

<!~~code~~>

The next method is now structured very simply. As long as there are URLs in the list to fetch, remove one from the list and load it. Once the list is empty, switch to the next phase and call the phase method to fill the URL list again.

<!~~code~~> METHOD next.

<!~~code~~> IF LINES( urls ) IS INITIAL.

<!~~code~~> phase = next_phase.

<!~~code~~> CALL METHOD me->(phase).

<!~~code~~> ENDIF.

<!~~code~~>

<!~~code~~> IF LINES( urls ) IS INITIAL. RETURN. ENDIF.

<!~~code~~>

<!~~code~~> DATA: url TYPE STRING.

<!~~code~~> READ TABLE urls INDEX 1 INTO url.

<!~~code~~> DELETE urls INDEX 1.

<!~~code~~> me->load_url( url ).

<!~~code~~> ENDMETHOD.

<!~~code~~>

The final method is just to dispatch the incoming content to the correct phase method. Again a dynamic method call is used to call the correct phase handler for the content.

<!~~code~~> METHOD loaded_content.

<!~~code~~> DATA: handler TYPE STRING.

<!~~code~~> CONCATENATE phase '_CONTENT' INTO handler.

<!~~code~~> CALL METHOD me->(handler) EXPORTING content = content.

<!~~code~~> ENDMETHOD.

<!~~code~~>
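The interplay of the three methods can be tried out without the HTML Viewer. The following sketch is a simplified stand-in, not the real crawler: the class lcl_phases, the example URL and the single run method (which replaces the asynchronous load_url and loaded_content round trip with a direct call) are assumptions made for illustration.

```abap
REPORT z_phase_sketch.

CLASS lcl_phases DEFINITION.
  PUBLIC SECTION.
    METHODS run.
    METHODS logon.
    METHODS logon_content IMPORTING content TYPE string.
    METHODS quit.
  PRIVATE SECTION.
    DATA: phase      TYPE string,
          next_phase TYPE string,
          urls       TYPE STANDARD TABLE OF string WITH DEFAULT KEY.
ENDCLASS.

CLASS lcl_phases IMPLEMENTATION.
  METHOD run.
    DATA: url     TYPE string,
          handler TYPE string.

    " Start the first phase; the method name is taken from a variable.
    phase = 'LOGON'.
    CALL METHOD me->(phase).

    DO.
      " URL list exhausted: switch to the next phase and refill.
      IF urls IS INITIAL.
        phase = next_phase.
        CALL METHOD me->(phase).
      ENDIF.
      " The last phase queues nothing, which ends the run.
      IF urls IS INITIAL.
        EXIT.
      ENDIF.

      READ TABLE urls INDEX 1 INTO url.
      DELETE urls INDEX 1.

      " Stand-in for loading the page: pass the URL itself as content
      " to the handler named <PHASE>_CONTENT.
      CONCATENATE phase '_CONTENT' INTO handler.
      CALL METHOD me->(handler) EXPORTING content = url.
    ENDDO.
  ENDMETHOD.

  METHOD logon.
    next_phase = 'QUIT'.
    APPEND 'post__https://example.invalid/logon' TO urls.
  ENDMETHOD.

  METHOD logon_content.
    WRITE: / 'LOGON handled:', content.
  ENDMETHOD.

  METHOD quit.
    WRITE: / 'QUIT reached'.
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  DATA crawler TYPE REF TO lcl_phases.
  CREATE OBJECT crawler.
  crawler->run( ).
```

Because the method names are resolved at runtime, adding a phase only means adding two methods that follow the naming pattern; the dispatching code stays untouched. Note that the names in the variables must be in upper case.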

With these few lines of code on the simple web crawler, it’s now possible to crawl through SDN and gather some interesting information.

One important remark: when crawling another site, do not overload the site with too many requests! We already know that the simple web crawler has a large delay of about five seconds for each URL fetched. Furthermore, this complete SDN crawl will only access seventeen URLs. This is a very low load, which any web server should easily be able to handle. (SDN also has the option to cache all the old archives, as these do not change at all. This would reduce the load to a minimum.) This should be acceptable under any common sense rules. I could not easily recommend more than this.

Examining the Catch

The SDN crawler saved the final output as a BSP server side cookie. As the first test, a simple BSP page is used to see the output from the crawler.

<!~~code~~> <%@page language="abap"%>

<!~~code~~> <%@extension name="htmlb" prefix="htmlb"%>

<!~~code~~>

<!~~code~~> <% CL_BSP_SERVER_SIDE_COOKIE=>GET_SERVER_COOKIE( ... ). %>

<!~~code~~>

<!~~code~~> <htmlb:content design="design2003">

<!~~code~~> <htmlb:page>

<!~~code~~> <htmlb:form>

<!~~code~~> <htmlb:tableView id = "tv1"

<!~~code~~> table = "<%=blogs%>" />

<!~~code~~> </htmlb:form>

<!~~code~~> </htmlb:page>

<!~~code~~> </htmlb:content>

<!~~code~~>

The output is as expected: a very nice table full of interesting data.

!https://weblogs.sdn.sap.com/weblogs/images/164/BP_CSDN_002.GIF|width=565,height=110,border=0!

It is time to extract one or two interesting bits and pieces. We will limit the statistics to a few examples; otherwise, we just might have to do it every month!

Getting Perfect Output

As a first step, the statistical data was just computed and the tables displayed using the HTMLB tableView control. However, numbers pale in comparison to nice graphs. We definitely required something better!

For BSP there are two ways to get to graphics. The first is to use the very old HTMLB chart control; the other is to look at the BSP extension graphics. Unfortunately, in-depth information is not available in the online help system at the moment. Just send a short email to graphics@sap.com and request the full package! (A little bird whistles a song about the SDN download area in future.)

For this test application, I quickly looked at the example SBSPEXT_HTMLBchart.bsp, and 15 minutes later, it was complete. The chart control has a very simple interface where the X and Y values are just stored in a table. Thereafter the requested chart type is configured, and the title is set. See the example page for a small example.

The Growth of SDN

The first weblog on SDN was written on 27 May 2003, followed three days later by one from DJ Adams. That was a total of two for May 2003! However, after just over a year, we can see 355 weblogs written! So there has been tremendous growth in the last year.

!https://weblogs.sdn.sap.com/weblogs/images/164/BP_CSDN_003.GIF|width=503,height=202,border=0!

No doubt, we will see a steady increase in the number of weblogs that are published. This is a growing community. Of course, if all these weblogs flow from the pen of one author, it’s not helping much.

But the data tells us that 76 authors have written for SDN! However, how many did each author write?

!https://weblogs.sdn.sap.com/weblogs/images/164/BP_CSDN_004.GIF|width=504,height=193,border=0!

This diagram shows that most authors have only written a few (one to three) weblogs. So SDN has a large number of weblog authors, but at first glance many don’t seem to be very active.
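Counting weblogs per author is a natural fit for the COLLECT statement. The sketch below is only an illustration of that idea, not the code behind the diagram: the author names are invented sample data, and the structure ty_count is an assumption.

```abap
REPORT z_author_count_sketch.

* Hypothetical result row: one line per author with a counter.
TYPES: BEGIN OF ty_count,
         author TYPE c LENGTH 40,
         blogs  TYPE i,
       END OF ty_count.

DATA: counts  TYPE STANDARD TABLE OF ty_count WITH DEFAULT KEY,
      count   TYPE ty_count,
      authors TYPE STANDARD TABLE OF string WITH DEFAULT KEY,
      author  TYPE string.

START-OF-SELECTION.

* Invented sample data: one entry per weblog, holding its author.
  APPEND 'Author A' TO authors.
  APPEND 'Author B' TO authors.
  APPEND 'Author A' TO authors.

  LOOP AT authors INTO author.
    count-author = author.
    count-blogs  = 1.
    " COLLECT adds the numeric field to an existing line with the
    " same author, or inserts a new line if there is none yet.
    COLLECT count INTO counts.
  ENDLOOP.

  LOOP AT counts INTO count.
    WRITE: / count-author, count-blogs.
  ENDLOOP.
```

The resulting table has one line per author, which is already close to the shape of X and Y values that a chart needs.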

However, you have to remember that this is a growing community. To understand the statistics above better, you have to look at the “age” of the weblog authors. How much time has each author had to write? Let’s look at the start date for each weblog author (date of first publication). In this way, we can easily get a feeling of when people started to write for SDN.

!https://weblogs.sdn.sap.com/weblogs/images/164/BP_CSDN_005.GIF|width=507,height=196,border=0!