Recently, a colleague of mine had to write an ABAP application with an import functionality for Excel files. I recommended him to use the abap2xlsx package which serves as a fully featured negotiator between the ABAP and the Excel world. In particular, abap2xlsx contains a reader class, ZCL_EXCEL_READER_2007, designed for importing a .xslx file into the ABAP data structures on which the package is based.

To my astonishment, the reader dumped with a memory overflow error for a sample Excel file.

  • Yes – with about 28’000 rows and 50 columns it was a large file!
  • And, yes, it is crazy to have a business process with files of these sizes! No discussion about that.

On the other hand, the file had a size of 6.5 MB, which was moderate compared to the heap memory limit of 2 GB which had been touched according to the short dump.

I was just curious how this memory consumption of a factor way larger than 100 from the original data could be explained.

/wp-content/uploads/2014/01/mem_alloc_failed_373528.png

So I wrote a little test report for analyzing the situation with the memory inspector.


report zz_test_abap2xlsx_reader.
parameters: p_file type string memory id fil.
at selection-screen on value-request for p_file.
   perform get_file(zz_file_io_forms) using p_file.
start-of-selection.
   perform start.
* ---
form start.
   data: lo_excel_reader type ref to zif_excel_reader,
         lo_excel type ref to zcl_excel.
   create object lo_excel_reader type zcl_excel_reader_2007.
   lo_excel = lo_excel_reader->load_file( p_file ).
   break-point.
endform.                    "start







Starting the report with one of those large Excel files and stepping through the code with the debugger, I arrive at the place where the worksheet is to be parsed. Everything looks normal. There are some 9 MB allocated for the XSTRING containing the file. At this point, this is the largest object in memory. No problem. The second largest object is the “table of X”, the result of the file read process (more or less a redundant copy of the first data – but negligible compared to the 4 GB we are talking about).

Peanuts!

/wp-content/uploads/2014/01/worksheet1_373429.png

At that point in the debugger, I am just before jumping into the method  get_ixml_from_zip_archive( ). This is a general-purpose method in the reader class, which loads an XML document contained in the zip archive; after loading, it will be parsed with the methods of the IF_IXML family, and a reference to the result XML DOM object  will be passed back to the caller.

Here, I am immediately before the call of the parse() method of the if_ixml_parser object. The memory consumption looks still more than modest:

/wp-content/uploads/2014/01/worksheet2_373499.png

Now, with a single step, I am triggering the parse() Method. Processing takes a few minutes until the control is given back to the debugger again. From a memory perspective, the result is amazing:

/wp-content/uploads/2014/01/workshhet3_373500.png

With this single method call (which is implemented “by kernel module” and thus cannot be analyzed further with ABAP means), the overall memory consumption  raises to more than 2 GB!

Even more amazing: The 2GB cannot be explained from the memory inspector’s individual object view :

/wp-content/uploads/2014/01/worksheet4_373512.png

Here, the top object only requires some 100 MB. This is the decompressed XML worksheet, as produced from the zip loader. From the rest of the 2GB, we see absolutely nothing.

So we have to face the fact that the iXML DOM parser in this case requires a factor 20 more memory than the raw document which is to be parsed.

Since at this point, nothing could be done (after all, iXML is a kernel component), I looked out for alternatives. My plan was: If we omitted the DOM parsing and instead are satisfied with gripping the element contents from the stream while we’re reading it, the memory footprint could be reduced considerably.

I could prove that this really is the case, by implementing an alternative parser class, ZCL_XLSX_PARSER, on the base of  the sXML family, in particular of the IF_SXML_READER implementation provided by this family. If you are happy with simply reading the raw data from the file (as we were in our particular case), omitting more functionality like macros, cell styles etc., the class ZCL_XLSX_PARSER will be fully sufficient.

At the same place as above, immediately before parsing, the memory in the debugger looks similar: On top, we see the decompressed XML file, allocating about 100 MB as above:

/wp-content/uploads/2014/01/worksheet5_373513.png

Now I execute the parse_cell_data( ) method.

/wp-content/uploads/2014/01/worksheet6_373526.png

For the same file that I tested above, the memory consumption only moderately increased through the parsing: It was 117,707,688 bytes before the call of parse_cell_data( ), and is 141,485,696 bytes afterwards. The difference – about 23 MB – is needed for storing the result (the deep structure ES_EXCEL above, containing internal tables with the cell data). All the figures can be explained by looking at the consumption of the ABAP data alone. We are still far away from the 4GB limit.


Also, the execution speed is better. With the sXML reader, the full reading process needs about 52 seconds. With the iXML reader, only the call of if_ixml_parser->parse() of the same worksheet took 340 seconds.

The main difference between IF_IXML_PARSER and IF_SXML_READER is that the latter is only a scanner. It doesn’t generate an internal image of the XML file, but only detects the beginning, the end, the attributes and the contents of XML elements during reading the stream. That’s it. With IF_SXML, I have to instrument the reader myself if I want to extract data from the reading process. Here is the implementation of the parse_cell_data( ) method, just to give you the idea. If you want to study the whole class, have a look into it here.


method parse_cell_data.
  data: ls_cell type ty_excel_cell,
        lv_number type decfloat34,
        lv_time type t,
        lv_date type d.
* Main parse loop
  while io_reader->node_type ne if_sxml_node=>co_nt_final.
    io_reader->next_node( ).
    case io_reader->name.
      when 'row'.  " A row
        check io_reader->node_type = if_sxml_node=>co_nt_element_open.
        add 1 to ls_cell-row.
        clear: ls_cell-col, ls_cell-ref, ls_cell-type.
      when 'c'.  " A column, child of a row
        case io_reader->node_type.
          when if_sxml_node=>co_nt_element_open.
            add 1 to ls_cell-col.
            ls_cell-type = get_cell_type( io_reader ).
          when if_sxml_node=>co_nt_element_close.
            insert ls_cell into table cs_excel-cells.
        endcase.
      when 'v'.  " A value - may be a reference to the string table
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          case ls_cell-type.
            when gc_cell_datatype-string.
              " Referenzindex für Stringtabelle
              ls_cell-ref = io_reader->value + 1. " ist 0-basiert in xlsx
            when gc_cell_datatype-number.
              lv_number = io_reader->value.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-date.
* Convert to the ABAP-conformal internal date representation
              lv_number = io_reader->value + 693595.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-time.
* Convert to the ABAP-conformal internal time representation
* Work with milliseconds, however, for further improvements
              lv_number = round( val = 86400 * io_reader->value dec = 3 ).
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
          endcase.
        endif.
      when 'is'.  " Inline Strings: Just append them to the stringtab
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          append io_reader->value to cs_excel-strings.
          ls_cell-ref  = sy-tabix.
          ls_cell-type = gc_cell_datatype-string.
        endif.
    endcase.
  endwhile.
endmethod.











By the way: The code has been written with support of unit tests. The above method, like the rest of the core methods, is covered to 100%. There are only a few unprocessed boundary cases in the unit tests: For example the case of a corrupt zip file (which cannot be unpacked properly). See the ABAP unit test section in the code repository for details.

Edit (6/30/2014)

In the meantime, I provided a class ZCL_EXCEL_READER_HUGE_FILE as part of the abap2xlsx package. See this blog for more details.

To report this post you need to login first.

17 Comments

You must be Logged on to comment or reply to a post.

  1. Kay Streubel

    Thanks for this excellent post! It’s always a pleasure to learn from your deep investigations.

    Will this be added to the abap2xlsx package? 😀

    (0) 
    1. Rüdiger Plantiko Post author

      Kay, thank you for your positive feedback!

      >Will this be added to the abap2xlsx package? 😀

      No five minutes after having published the blog, Ivan Femia suggested precisely this 🙂

      At first, I thought it will be much extra work to get all the xlsx features – and not only the raw data, as the current ZCL_XLSX_PARSER does. But on a second sight, it looks not too heavy.

      So let me make a commitment: Yes 🙂 If there is not already another sXML-based implementation of the Excel reader in that project (which I didn’t check), I am planning, within the next couple of weeks, to provide another, alternative implementation of ZIF_EXCEL_READER which could be used for large Excel files, and to check it in to the abap2xlsx project.

      Kind regards,

      Rüdiger

      (0) 
  2. Luís Pérez Grau

    Hi Rüdger,

    Thanks for you contribution, I’m stuck in the writing process, I had the same DUMP an looks like the XML generation consume too much memory (ZCL_EXCEL_WRITER_2007->CREATE_XL_SHEET if I’m not wrong), did you faced something like this?

    Cheers!

    Luis

    (0) 
    1. Rüdiger Plantiko Post author

      Hi Luis,

      in my use case, I only needed to read an Excel file: the rest of the job was to call some BAPI/APIs for updating allocation tables (some special documents in SAP retail).

      But you are right, the same problem can occur with the writer: As I just checked, the ZCL_EXCEL_WRITER_2007 also work with the iXML framework (with an IF_IXML_DOCUMENT instance).  In a similar fashion as I explained in the blog, one could make an alternative implementation of ZIF_EXCEL_WRITER, based on the class CL_SXML_STRING_WRITER.

      To go sure, I would start with a memory analysis before writing such an alternative implementation: Is it really the DOM memory consumption of the IF_IXML_DOCUMENT object, or is it already the ZCL_EXCEL instance itself which busts the memory?

      Regards,

      Rüdiger

      (0) 
      1. Luís Pérez Grau

        thanks for your feedback unfortunately I don’t have time to analyze this right know, I’m afraid I must skip abap2xlsx and build the file myself… I hope to have more time in the future to fix this, who knows…. 🙂

        (0) 
        1. Rüdiger Plantiko Post author

          I’m afraid I must skip abap2xlsx and build the file myself…


          Not necessarily – you could still use the ZCL_EXCEL class and the family of its related classes  to build the workbook – they provide a comfortable API for doing that – and could only provide an alternate implementation of the zif_excel_writer interface, that is of its single method


          methods WRITE_FILE

               importing

                 IO_EXCEL type ref to ZCL_EXCEL

               returning

                 value(EP_FILE) type XSTRING .


          But even if you decide not to do that, I would suggest to use the CL_SXML_STRING_WRITER class to build the XML (see the  ABAP Keyword Documentation for details), instead of working on string level (with statements like CONCATENATE and its friends).


          Regards,

          Rüdiger

          (0) 
          1. Luís Pérez Grau

            Sorry, I don’t understand you, If I have the problem because the generation of the file is performed in memory using the IXML class, I don’t see how can avoid this “only” changing the implementation class (ZCL_EXCEL_WRITER_2007 to ZCL_WHATEVER)

            One quick solution is to use the ZCL_EXCEL_WRITER_CSV but saves the file in UTF-8 format and the file is meant to be opened in excel to do pivot tables, CSV+UTF8+Excel = 😈

            (0) 
            1. Rüdiger Plantiko Post author

              Sorry, I don’t understand you, If I have the problem because the generation of the file is performed in memory using the IXML class, I don’t see how can avoid this “only” changing the implementation class (ZCL_EXCEL_WRITER_2007 to ZCL_WHATEVER)

              Answer: It’s only ZCL_EXCEL_WRITER_2007 and ZCL_EXCEL_READER_2007 which work with iXML. The rest of the ZCL_EXCEL family doesn’t use iXML and can hence be used (supposed it’s really iXML causing the memory problem – the chances are high for this being the case).

              You can build the Excel worksheet with the data model of the ZCL_EXCEL family (which is a data model based on internal tables). You can use methods like lo_worksheet->set_cell( ), ->set_visible( ), ->set_width( ) to build the worksheet. Altogether, the abap2xlsx classes provide a very comfortable API for building the worksheet. Only when it cames to saving (i.e. doing the output to a file), you’ll need to replace the implementation ZCL_EXCEL_WRITER_2007 of the method WRITE_FILE by an own implementation in an own class, avoiding iXML. Use that alternate implementation, based on CL_SXML_STRING_WRITER, to generate the Excel file from the internal data model of your worksheet in ZCL_EXCEL.

              I hope I could clarify.

              Regards,

              Rüdiger

              (0) 
                1. Merrill Schilling

                  Hi Luis

                  Were you successful with writing large xlsx files? I’m facing the same problem but I am not familiar with the concept of CL_SXML_STRING_WRITER, so I’m not able to to change from iXML to SXML like Rüdiger wrote.

                  Regards Merrill

                  (0) 
                  1. Luís Pérez Grau

                    Well, we didn’t have some much time to investigate so I used CSV as output format instead of xlsx, that reduced a lot the ammount of memory needed to create the file.

                    Cheers!

                    Luis

                    (0) 
              1. Merrill Schilling

                Hi Rüdiger

                I’m facing the same problem with the writer, but I am not familiar with the concept of CL_SXML_STRING_WRITER, so I’m not able to to change from iXMLto SXML like you wrote. Do you know, if that ist implemented somewhere? Or could you give me a path to follow?

                Regards Merrill

                (0) 
                1. Rüdiger Plantiko Post author

                  Hi Merrill,

                  you create an sXML string writer instance with CL_SXML_STRING_WRITER=>CREATE( ). This should be an instance attribute in your new general class ZCL_SXML_EXCEL_WRITER, implementing interface ZIF_EXCEL_WRITER. The Writer object offers the methods OPEN_ELEMENT, WRITE_ATTRIBUTE, WRITE_VALUE, and CLOSE_ELEMENT. These methods normally suffice to generate an XML document. The idea of the sXML writer is that it directly generates the output stream, which at the end can be retrieved as XSTRING via method GET_OUTPUT. So it doesn’t keep a complete image of the XML document in memory, you can’t search or inspect elements that you already have written into the stream. Now, the plan must be to go through the code of ZCL_EXCEL_WRITER_2007 and to rewrite its methods using the sXML writer instance, instead. So instead of “lo_document->create_simple_element( )”, you will write “go_writer->open_element”, for “lo_document->set_attribute” you will write go_writer->write_attribute( …)” and so on.

                  This will be quite a lot of work. But I believe the critical method which brings the memory to explosion in ZCL_EXCEL_WRITER_2007 is only CREATE_XL_SHEET. So maybe a clever mixture of iXML and sXML will make it: Methods like “CREATE_XL_STYLES” or “CREATE_XL_THEME” could be left in iXML mode, as they are now (since they don’t use much memory), and you write their output into the sXML writer. This way, only “CREATE_XL_SHEET” needs to be rewritten completely on sXML basis.

                  If you don’t look for a general writer class, look at your xl/sheet.xml for an Excel sheet with a smaller data amount, and try to reproduce this XML document structure from your ABAP data using the sXML methods.

                  Regards,

                  Rüdiger

                  (1) 

Leave a Reply