By Brian McKellar

BSP Programming: Writing a (Simple) Web Crawler

In a previous weblog, BSP Programming: RSS = HttpClient + XML + XSLT, ten lines of code using the HttpClient were sufficient to fetch an RSS feed via HTTP. Once I had mastered the art of programming the HttpClient, I had the idea of crawling through SDN weblogs to gather some statistics. As the information about all weblogs is not available via RSS feeds, I decided to fetch the HTML pages and parse them.
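As a reminder of how little code such a fetch needs, here is a minimal sketch of an HTTP GET with the HttpClient. This is my own illustration, not the code from the previous weblog; the URL is a placeholder and error handling is omitted:

    DATA: http_client TYPE REF TO if_http_client,
          content     TYPE string.

    " Create the client directly from a URL (placeholder URL, for illustration)
    cl_http_client=>create_by_url( EXPORTING url    = 'http://www.example.com/feed.xml'
                                   IMPORTING client = http_client ).

    " Send the request and wait for the response
    http_client->send( ).
    http_client->receive( ).

    " Extract the response body as a string
    content = http_client->response->get_cdata( ).
    http_client->close( ).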

Accessing SDN requires an HTTPS connection, which has a few fine points that must be kept in mind. This weblog shows the usual pitfalls that I also fell into, plus the workarounds. However, at one moment I was stuck, and decided to raise the stakes: instead of using the HttpClient, why not use a real browser (the HTML Viewer) integrated into the SAPGUI? This weblog will quickly touch on the pitfalls of making HttpClient connections, and then look at an interesting alternative.

Making an HTTPS Connection

Making an HTTP connection with the HttpClient is very easy. Making an HTTPS connection is just as easy, if you first remember to import the certificate of the SSL partner! For SSL connections, the two partners exchange their certificates. The outgoing connection will only be established if the partner certificate can be verified against a copy stored in the database.

SAP ships very few certificates as standard. Any other certificates required must be obtained from the partner directly. Alternatively, if the certificate is already available in your web browser, you can export it from there.

Once the certificate is available, it can be imported using transaction STRUST. See also the documentation on this topic. With this additional step done, HTTPS connections work exactly the same way as HTTP connections. There is only one small difference: traffic for HTTPS connections is not traced in the ICM, for security reasons (otherwise one could just have used HTTP).

Handling Redirects

The HttpClient will automatically handle all rc=302 (Redirect) responses. However, there is one case where special handling is required. It is possible for the server to set a cookie during the redirect phase, and these additional cookies must be kept in mind when following the redirect. This is not currently done by the HttpClient (although it is now under consideration).

For an example, see this trace (strongly edited!):

  GET <url> HTTP/1.1
  accept: */*
  host: <host>

  HTTP/1.1 302 Object moved
  Date: Wed, 23 Jun 2004 20:44:47 GMT
  Location: <redirect-url>
  Set-Cookie: <cookie>
  Content-Type: text/html

  GET <redirect-url> HTTP/1.1
  accept: */*
  host: <host>

What we see is that the first GET request is answered by the server with an rc=302 (Redirect) and a “Location” header is supplied. In addition, the server sets a cookie. However, in the default handling of the redirect, the next GET request (to the new location) does not contain the cookie.

Handling the redirects ourselves is very easy. The traffic then reduces to the usual send-receive cycles, and cookies are handled correctly. A small change was made to flag that redirects should not be followed automatically, and the rc=302 case was handled explicitly in code.

  http_client->propertytype_redirect = http_client->co_disabled.
  ...
  http_client->receive( ).
  http_client->response->get_status( IMPORTING code = rc ).

  IF rc = 302.
    location = http_client->response->get_header_field( 'Location' ).
    me->GET( url = location ).
    RETURN.
  ENDIF.


Note that the GET( ) method is part of the crawler development; it gives a higher-level interface to the HttpClient by packaging a number of HttpClient calls into one method.
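The weblog does not list the wrapper itself, so the following is only my own sketch of what such a GET( ) method could look like. The attribute me->http_client is an assumption (an if_http_client instance created during crawler setup):

    METHOD get.
      " url is the IMPORTING parameter of this method.
      DATA: rc       TYPE i,
            location TYPE string.

      " Point the already-open client at the requested URI
      cl_http_utility=>set_request_uri( request = me->http_client->request
                                        uri     = url ).
      me->http_client->send( ).
      me->http_client->receive( ).
      me->http_client->response->get_status( IMPORTING code = rc ).

      " Follow a redirect manually, so that cookies set during the
      " redirect phase are carried along into the next request.
      IF rc = 302.
        location = me->http_client->response->get_header_field( 'Location' ).
        me->get( url = location ).
        RETURN.
      ENDIF.
    ENDMETHOD.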

Special Situation: Headers in HTTP Outgoing Requests

This section can best be described by starting with a small trace of the traffic:

  POST /SAPPortal/common/CreateNewCookie.asp HTTP/1.1
  content-type: application/x-www-form-urlencoded
  content-length: 46
  user-agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
  host: <host>

  portalUserName=user&portalPassword=password

  HTTP/1.1 400 Host Required In Request
  Date: Thu, 24 Jun 2004 18:04:50 GMT
  Content-Type: text/html; charset=iso-8859-1
  Content-Length: 447

  <HTML><HEAD><TITLE>Host Header Required</TITLE></HEAD>
  <BODY>
    Description: Your browser did not send a "Host" HTTP header field and
    therefore the virtual host being requested could not be determined. To
    access this web site correctly, you will need to upgrade to a browser that
    supports the HTTP "Host" header field.
  </BODY>
  </HTML>


For performance reasons, the ICM (in the kernel) translates all HTTP headers to lower case. The HTTP headers are also sent out in their lower-case form, and not in the usual capitalized form. The interested reader is referred to the HTTP/1.1 spec (RFC 2616): “Each header field consists of a name followed by a colon (“:”) and the field value. Field names are case-insensitive.”

The above message seems to be caused by a case-sensitive string comparison somewhere along the way (where “Host” is not equal to “host”). Unfortunately, it is not clear who is returning the message. It could be any proxy, load balancer, dispatcher, server, or even some custom-written servlet. This made it difficult to find (and negotiate) a workaround for this problem.

Using a Real Browser: The HTML Viewer

I was always comparing my requests with those of a real browser, so why not use a real browser? There is a complete browser integrated into the SAPGUI (under Windows) that can be programmed from ABAP!

As a first step, I quickly read the online documentation, and looked at the test programs SAPHTML_DEMO1 and SAPHTML_EVENTS_DEMO. Then it was mostly cut-and-paste work to get the first version up and running.

Only small bits and pieces of the basic crawler will be discussed here. For a complete overview of the code, follow this link.

From the beginning, it was clear that any solution would have some JavaScript code in it (at least at the time of writing this text!). The problem is that cl_gui_html_viewer does not expose its JavaScript APIs directly; these methods are flagged as protected. It was therefore necessary to define a new class that inherits from the cl_gui_html_viewer class.

  CLASS cl_html_crawler DEFINITION INHERITING FROM cl_gui_html_viewer.

    PUBLIC SECTION.
      METHODS: load_url IMPORTING uri TYPE string.
      METHODS: on_navigate_complete FOR EVENT navigate_complete OF cl_gui_html_viewer.
      METHODS: on_sapevent FOR EVENT sapevent OF cl_gui_html_viewer
                 IMPORTING action postdata.
      ...
  ENDCLASS.


The HTML Viewer raises a number of events. The first interesting one is the navigate_complete event, which is fired after the document has been loaded. In addition, it is possible to “talk” from the browser to the SAPGUI (effectively back to ABAP on the server) using special SAP events inside the browser. These will cause the on_sapevent method to be triggered.

Most of the glue code is copied from the demo programs and documentation and is not listed here. We will look at three interesting aspects: the loading of a document in the HTML Viewer, catching the event that signals the document has been loaded, and extracting the document content.

For the load_url function, I wanted a very simple interface. However, the interface should still be powerful enough to distinguish between GET and POST methods, and to include both the URL to load and the form fields for the request. I decided on a simple string interface, where all the necessary data is passed as one string, separated by ‘__’ sequences. The format of the string is “GET|POST__url__ff1__ff2…__ffn”. This allowed me to call the load_url method quickly without complex programming.

For example:

  'GET__<url>'
  'POST__<url>__ff1=value1__ff2=value2'


For the GET sequences, the code is very simple: the HTML Viewer already contains a method show_url that effectively handles the GET completely. For a POST request, however, you require a complete HTML document with a form that can be posted. Therefore, for POST requests the load_url method builds a complete HTML document, loads it into the browser, and from there it is posted to the URL.

  METHOD load_url.

    DATA: url      TYPE char255,
          method   TYPE string,
          ffs      TYPE string,
          ff       TYPE string,
          ff_name  TYPE string,
          ff_value TYPE string.

    SPLIT uri AT '__' INTO method url ffs.

    IF method = 'GET'.
      me->show_url( url = url ).
      RETURN.
    ENDIF.

    DATA: html TYPE TABLE OF char255,
          line TYPE char255.
    APPEND `<html><body onload="document.forms[0].submit();">` TO html.
    CONCATENATE `<form method="post" action="` url `">` INTO line.
    APPEND line TO html.
    WHILE ffs IS NOT INITIAL.
      SPLIT ffs AT '__' INTO ff ffs.
      SPLIT ff  AT '='  INTO ff_name ff_value.
      CONCATENATE `<input type="hidden" name="` ff_name `" value="` ff_value `">` INTO line.
      APPEND line TO html.
    ENDWHILE.
    APPEND '</form></body></html>' TO html.

    me->load_data( IMPORTING assigned_url = line CHANGING data_table = html ).
    me->show_url( url = line ).

  ENDMETHOD.


Given the example POST sequence above, the following HTML document is created:

  <html><body onload="document.forms[0].submit();">
  <form method="post" action="<url>">
  <input type="hidden" name="ff1" value="value1">
  <input type="hidden" name="ff2" value="value2">
  </form></body></html>


In this document, the onload event is hooked, and once the document is loaded, it triggers a submit() call on the form. The form itself carries the action (target URL), and all the form fields are stored as hidden input fields in the form.

One line of source code is really worth highlighting:

  APPEND `<html><body onload="document.forms[0].submit();">` TO html.


ABAP is the only programming language that I know which supports two forms of quotes for creating strings. This allows us to write ABAP, HTML and JavaScript code in one line without any ‘escaping’ required.
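As a quick illustration of the two literal forms (my own example, not from the original code):

    DATA html TYPE string.

    " Single quotes: a character literal; embedded single quotes must be doubled.
    html = 'He said ''hello''.'.

    " Backquotes: a string literal; embedded double quotes need no escaping,
    " which is convenient for generated HTML and JavaScript.
    html = `<body onload="document.forms[0].submit();">`.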

Once the URL has been loaded, there is nothing to do but wait for the event signaling completion.

The problem is that there is no API function to read the content back from the browser. However, it is possible to fire an event from the browser to the SAPGUI. Each SAP event is implemented as an HTML form that is submitted against the very special URL “SAPEVENT:<action>”. The approach we want to follow is to use a JavaScript function that places the content of the document into a string inside a form, and submits this form (against the SAPGUI). The pseudo-code is roughly:

  <form method="post" action="SAPEVENT:SAVEDOCUMENT">
    <input type="hidden" name="_content" value="...document content...">
  </form>


It is not possible to write this code directly into the loaded document, as that would destroy it. Therefore, I just used JavaScript to create the form dynamically (using a createElement call), and to place the content into hidden fields. The content had to be split into a number of short strings, as the SAPGUI places all input fields into a table of type CHAR255. So the JavaScript function does the following steps: create a new form, create a number of short input fields to hold segments of the content, and then hook this form into the document.

  METHOD on_navigate_complete.

    DATA: js   TYPE TABLE OF string,
          line TYPE string.

    APPEND `function _Dump() {`                                          TO js.
    APPEND `  var _frm = document.createElement('form');`                TO js.
    APPEND `  _frm.setAttribute( 'id',     '_crawler' );`                TO js.
    APPEND `  _frm.setAttribute( 'name',   '_crawler' );`                TO js.
    APPEND `  _frm.setAttribute( 'method', 'POST' );`                    TO js.
    APPEND `  _frm.setAttribute( 'action', 'SAPEVENT:SAVEDOCUMENT' );`   TO js.
    APPEND `  var _str = document.body.outerHTML;`                       TO js.
    APPEND `  var _idx = 0;`                                             TO js.
    APPEND `  while(_idx < _str.length) {`                               TO js.
    APPEND `    var _if = document.createElement('input');`              TO js.
    APPEND `    _if.setAttribute( 'name',  '_content' );`                TO js.
    APPEND `    _if.setAttribute( 'type',  'hidden' );`                  TO js.
    APPEND `    _if.setAttribute( 'value', _str.substr(_idx,200) );`     TO js.
    APPEND `    _frm.appendChild( _if );`                                TO js.
    APPEND `    _idx += 200;`                                            TO js.
    APPEND `  }`                                                         TO js.
    APPEND `  document.body.appendChild(_frm);`                          TO js.
    APPEND `  document.all["_crawler"].submit();`                        TO js.
    APPEND `}`                                                           TO js.
    APPEND `window.setTimeout("_Dump();",5000);`                         TO js.

    me->set_script( script = js[] ).
    me->execute_script( ).

  ENDMETHOD.


Once the JavaScript function is injected into the browser, it is not immediately executed. From practical experience I saw that sometimes the browser was still busy loading images or executing JavaScript code. So a timer was set to execute the dump function only five seconds later. This also gave a few moments to see what was loaded, and to verify that the crawler was still on the correct track.

Note that the functions to load the JavaScript code into the browser are protected, and cannot be called on an instance of the class cl_gui_html_viewer from the outside. This is the main reason for the inheritance approach: so that we could actually get access to these two functions.

Five seconds later the form is submitted and the on_sapevent method is called. As input, a table is received that contains a number of rows, each of the form “_content=<html fragment>”. All the lines are concatenated together again into one string. The final string is massaged to remove some of the HTML escaping that was done on the data, and the _content sequences.

  METHOD on_sapevent.

    DATA: content TYPE string,
          line    LIKE LINE OF postdata.

    LOOP AT postdata INTO line.
      CONCATENATE content line INTO content.
    ENDLOOP.

    REPLACE ALL   OCCURRENCES OF '%3D'        IN content WITH '='.
    REPLACE ALL   OCCURRENCES OF '%3F'        IN content WITH '?'.
    REPLACE ALL   OCCURRENCES OF '&_content=' IN content WITH ''.
    REPLACE FIRST OCCURRENCE  OF '_content='  IN content WITH ''.

    me->loaded_content( content ).

  ENDMETHOD.

The other supporting code is not shown here, as it is mostly plumbing. The complete code can be found here.
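For orientation, the plumbing amounts to roughly the following. This is my own sketch, not the original code; the container name and variable names are assumptions:

    DATA: container TYPE REF TO cl_gui_custom_container,
          crawler   TYPE REF TO cl_html_crawler.

    " Create a container on the current dynpro and the crawler inside it
    CREATE OBJECT container
      EXPORTING container_name = 'HTML_CONTROL'.  " custom control on the screen
    CREATE OBJECT crawler
      EXPORTING parent = container.

    " Register the event handlers for the two events we react to
    " (the desired events must also be registered with the control
    " via set_registered_events, as shown in the demo programs)
    SET HANDLER crawler->on_navigate_complete FOR crawler.
    SET HANDLER crawler->on_sapevent          FOR crawler.

    " Start crawling
    crawler->load_url( uri = `GET__<url>` ).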

Final Words

Using the browser inside the SAPGUI was actually a rather interesting challenge, and we learned a lot about how the browser integration was done, and the possibilities that this enabled. It was now possible to write a web crawler using a true browser with all of its features and idiosyncrasies. In the next Weblog this simple web crawler will be used to build a small SDN crawler, and then extract some statistics from the Weblogs.
