A Way of Reading Huge Excel Files
Recently, a colleague of mine had to write an ABAP application with an import functionality for Excel files. I recommended him to use the abap2xlsx package which serves as a fully featured negotiator between the ABAP and the Excel world. In particular, abap2xlsx contains a reader class, ZCL_EXCEL_READER_2007, designed for importing a .xslx file into the ABAP data structures on which the package is based.
To my astonishment, the reader dumped with a memory overflow error for a sample Excel file.
- Yes – with about 28’000 rows and 50 columns it was a large file!
- And, yes, it is crazy to have a business process with files of these sizes! No discussion about that.
On the other hand, the file had a size of 6.5 MB, which was moderate compared to the heap memory limit of 2 GB which had been touched according to the short dump.
I was just curious how this memory consumption of a factor way larger than 100 from the original data could be explained.
So I wrote a little test report for analyzing the situation with the memory inspector.
report zz_test_abap2xlsx_reader.
parameters: p_file type string memory id fil.
at selection-screen on value-request for p_file.
perform get_file(zz_file_io_forms) using p_file.
start-of-selection.
perform start.
* ---
form start.
data: lo_excel_reader type ref to zif_excel_reader,
lo_excel type ref to zcl_excel.
create object lo_excel_reader type zcl_excel_reader_2007.
lo_excel = lo_excel_reader->load_file( p_file ).
break-point.
endform. "start
Starting the report with one of those large Excel files and stepping through the code with the debugger, I arrive at the place where the worksheet is to be parsed. Everything looks normal. There are some 9 MB allocated for the XSTRING containing the file. At this point, this is the largest object in memory. No problem. The second largest object is the “table of X”, the result of the file read process (more or less a redundant copy of the first data – but negligible compared to the 4 GB we are talking about).
Peanuts!
At that point in the debugger, I am just before jumping into the method get_ixml_from_zip_archive( ). This is a general-purpose method in the reader class, which loads an XML document contained in the zip archive; after loading, it will be parsed with the methods of the IF_IXML family, and a reference to the result XML DOM object will be passed back to the caller.
Here, I am immediately before the call of the parse() method of the if_ixml_parser object. The memory consumption looks still more than modest:
Now, with a single step, I am triggering the parse() Method. Processing takes a few minutes until the control is given back to the debugger again. From a memory perspective, the result is amazing:
With this single method call (which is implemented “by kernel module” and thus cannot be analyzed further with ABAP means), the overall memory consumption raises to more than 2 GB!
Even more amazing: The 2GB cannot be explained from the memory inspector’s individual object view :
Here, the top object only requires some 100 MB. This is the decompressed XML worksheet, as produced from the zip loader. From the rest of the 2GB, we see absolutely nothing.
So we have to face the fact that the iXML DOM parser in this case requires a factor 20 more memory than the raw document which is to be parsed.
Since at this point, nothing could be done (after all, iXML is a kernel component), I looked out for alternatives. My plan was: If we omitted the DOM parsing and instead are satisfied with gripping the element contents from the stream while we’re reading it, the memory footprint could be reduced considerably.
I could prove that this really is the case, by implementing an alternative parser class, ZCL_XLSX_PARSER, on the base of the sXML family, in particular of the IF_SXML_READER implementation provided by this family. If you are happy with simply reading the raw data from the file (as we were in our particular case), omitting more functionality like macros, cell styles etc., the class ZCL_XLSX_PARSER will be fully sufficient.
At the same place as above, immediately before parsing, the memory in the debugger looks similar: On top, we see the decompressed XML file, allocating about 100 MB as above:
Now I execute the parse_cell_data( ) method.
For the same file that I tested above, the memory consumption only moderately increased through the parsing: It was 117,707,688 bytes before the call of parse_cell_data( ), and is 141,485,696 bytes afterwards. The difference – about 23 MB – is needed for storing the result (the deep structure ES_EXCEL above, containing internal tables with the cell data). All the figures can be explained by looking at the consumption of the ABAP data alone. We are still far away from the 4GB limit.
Also, the execution speed is better. With the sXML reader, the full reading process needs about 52 seconds. With the iXML reader, only the call of if_ixml_parser->parse() of the same worksheet took 340 seconds.
The main difference between IF_IXML_PARSER and IF_SXML_READER is that the latter is only a scanner. It doesn’t generate an internal image of the XML file, but only detects the beginning, the end, the attributes and the contents of XML elements during reading the stream. That’s it. With IF_SXML, I have to instrument the reader myself if I want to extract data from the reading process. Here is the implementation of the parse_cell_data( ) method, just to give you the idea. If you want to study the whole class, have a look into it here.
method parse_cell_data.
data: ls_cell type ty_excel_cell,
lv_number type decfloat34,
lv_time type t,
lv_date type d.
* Main parse loop
while io_reader->node_type ne if_sxml_node=>co_nt_final.
io_reader->next_node( ).
case io_reader->name.
when 'row'. " A row
check io_reader->node_type = if_sxml_node=>co_nt_element_open.
add 1 to ls_cell-row.
clear: ls_cell-col, ls_cell-ref, ls_cell-type.
when 'c'. " A column, child of a row
case io_reader->node_type.
when if_sxml_node=>co_nt_element_open.
add 1 to ls_cell-col.
ls_cell-type = get_cell_type( io_reader ).
when if_sxml_node=>co_nt_element_close.
insert ls_cell into table cs_excel-cells.
endcase.
when 'v'. " A value - may be a reference to the string table
if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
case ls_cell-type.
when gc_cell_datatype-string.
" Referenzindex für Stringtabelle
ls_cell-ref = io_reader->value + 1. " ist 0-basiert in xlsx
when gc_cell_datatype-number.
lv_number = io_reader->value.
append lv_number to cs_excel-numbers.
ls_cell-ref = sy-tabix.
when gc_cell_datatype-date.
* Convert to the ABAP-conformal internal date representation
lv_number = io_reader->value + 693595.
append lv_number to cs_excel-numbers.
ls_cell-ref = sy-tabix.
when gc_cell_datatype-time.
* Convert to the ABAP-conformal internal time representation
* Work with milliseconds, however, for further improvements
lv_number = round( val = 86400 * io_reader->value dec = 3 ).
append lv_number to cs_excel-numbers.
ls_cell-ref = sy-tabix.
endcase.
endif.
when 'is'. " Inline Strings: Just append them to the stringtab
if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
append io_reader->value to cs_excel-strings.
ls_cell-ref = sy-tabix.
ls_cell-type = gc_cell_datatype-string.
endif.
endcase.
endwhile.
endmethod.
By the way: The code has been written with support of unit tests. The above method, like the rest of the core methods, is covered to 100%. There are only a few unprocessed boundary cases in the unit tests: For example the case of a corrupt zip file (which cannot be unpacked properly). See the ABAP unit test section in the code repository for details.
Edit (6/30/2014)
In the meantime, I provided a class ZCL_EXCEL_READER_HUGE_FILE as part of the abap2xlsx package. See this blog for more details.
Thanks for this excellent post! It's always a pleasure to learn from your deep investigations.
Will this be added to the abap2xlsx package? 😀
Kay, thank you for your positive feedback!
>Will this be added to the abap2xlsx package? 😀
No five minutes after having published the blog, Ivan Femia suggested precisely this 🙂
At first, I thought it will be much extra work to get all the xlsx features - and not only the raw data, as the current ZCL_XLSX_PARSER does. But on a second sight, it looks not too heavy.
So let me make a commitment: Yes 🙂 If there is not already another sXML-based implementation of the Excel reader in that project (which I didn't check), I am planning, within the next couple of weeks, to provide another, alternative implementation of ZIF_EXCEL_READER which could be used for large Excel files, and to check it in to the abap2xlsx project.
Kind regards,
Rüdiger
Would be nice if you can post it to another blogentry when the enhancement in abap2xlsx project ist done.
Hi Rüdger,
Thanks for you contribution, I'm stuck in the writing process, I had the same DUMP an looks like the XML generation consume too much memory (ZCL_EXCEL_WRITER_2007->CREATE_XL_SHEET if I'm not wrong), did you faced something like this?
Cheers!
Luis
Hi Luis,
in my use case, I only needed to read an Excel file: the rest of the job was to call some BAPI/APIs for updating allocation tables (some special documents in SAP retail).
But you are right, the same problem can occur with the writer: As I just checked, the ZCL_EXCEL_WRITER_2007 also work with the iXML framework (with an IF_IXML_DOCUMENT instance). In a similar fashion as I explained in the blog, one could make an alternative implementation of ZIF_EXCEL_WRITER, based on the class CL_SXML_STRING_WRITER.
To go sure, I would start with a memory analysis before writing such an alternative implementation: Is it really the DOM memory consumption of the IF_IXML_DOCUMENT object, or is it already the ZCL_EXCEL instance itself which busts the memory?
Regards,
Rüdiger
thanks for your feedback unfortunately I don't have time to analyze this right know, I'm afraid I must skip abap2xlsx and build the file myself... I hope to have more time in the future to fix this, who knows.... 🙂
Not necessarily - you could still use the ZCL_EXCEL class and the family of its related classes to build the workbook - they provide a comfortable API for doing that - and could only provide an alternate implementation of the zif_excel_writer interface, that is of its single method
But even if you decide not to do that, I would suggest to use the CL_SXML_STRING_WRITER class to build the XML (see the ABAP Keyword Documentation for details), instead of working on string level (with statements like CONCATENATE and its friends).
Regards,
Rüdiger
Sorry, I don't understand you, If I have the problem because the generation of the file is performed in memory using the IXML class, I don't see how can avoid this "only" changing the implementation class (ZCL_EXCEL_WRITER_2007 to ZCL_WHATEVER)
One quick solution is to use the ZCL_EXCEL_WRITER_CSV but saves the file in UTF-8 format and the file is meant to be opened in excel to do pivot tables, CSV+UTF8+Excel = 😈
Answer: It's only ZCL_EXCEL_WRITER_2007 and ZCL_EXCEL_READER_2007 which work with iXML. The rest of the ZCL_EXCEL family doesn't use iXML and can hence be used (supposed it's really iXML causing the memory problem - the chances are high for this being the case).
You can build the Excel worksheet with the data model of the ZCL_EXCEL family (which is a data model based on internal tables). You can use methods like lo_worksheet->set_cell( ), ->set_visible( ), ->set_width( ) to build the worksheet. Altogether, the abap2xlsx classes provide a very comfortable API for building the worksheet. Only when it cames to saving (i.e. doing the output to a file), you'll need to replace the implementation ZCL_EXCEL_WRITER_2007 of the method WRITE_FILE by an own implementation in an own class, avoiding iXML. Use that alternate implementation, based on CL_SXML_STRING_WRITER, to generate the Excel file from the internal data model of your worksheet in ZCL_EXCEL.
I hope I could clarify.
Regards,
Rüdiger
Yup, It's pretty clear 🙂 thanks for you time, let's see what I can do with the time I have....
Thanks a lot!
Luis
Hi Luis
Were you successful with writing large xlsx files? I'm facing the same problem but I am not familiar with the concept of CL_SXML_STRING_WRITER, so I'm not able to to change from iXML to SXML like Rüdiger wrote.
Regards Merrill
Well, we didn't have some much time to investigate so I used CSV as output format instead of xlsx, that reduced a lot the ammount of memory needed to create the file.
Cheers!
Luis
Hi Rüdiger
I'm facing the same problem with the writer, but I am not familiar with the concept of CL_SXML_STRING_WRITER, so I'm not able to to change from iXMLto SXML like you wrote. Do you know, if that ist implemented somewhere? Or could you give me a path to follow?
Regards Merrill
Hi Merrill,
you create an sXML string writer instance with CL_SXML_STRING_WRITER=>CREATE( ). This should be an instance attribute in your new general class ZCL_SXML_EXCEL_WRITER, implementing interface ZIF_EXCEL_WRITER. The Writer object offers the methods OPEN_ELEMENT, WRITE_ATTRIBUTE, WRITE_VALUE, and CLOSE_ELEMENT. These methods normally suffice to generate an XML document. The idea of the sXML writer is that it directly generates the output stream, which at the end can be retrieved as XSTRING via method GET_OUTPUT. So it doesn't keep a complete image of the XML document in memory, you can't search or inspect elements that you already have written into the stream. Now, the plan must be to go through the code of ZCL_EXCEL_WRITER_2007 and to rewrite its methods using the sXML writer instance, instead. So instead of "lo_document->create_simple_element( )", you will write "go_writer->open_element", for "lo_document->set_attribute" you will write go_writer->write_attribute( ...)" and so on.
This will be quite a lot of work. But I believe the critical method which brings the memory to explosion in ZCL_EXCEL_WRITER_2007 is only CREATE_XL_SHEET. So maybe a clever mixture of iXML and sXML will make it: Methods like "CREATE_XL_STYLES" or "CREATE_XL_THEME" could be left in iXML mode, as they are now (since they don't use much memory), and you write their output into the sXML writer. This way, only "CREATE_XL_SHEET" needs to be rewritten completely on sXML basis.
If you don't look for a general writer class, look at your xl/sheet.xml for an Excel sheet with a smaller data amount, and try to reproduce this XML document structure from your ABAP data using the sXML methods.
Regards,
Rüdiger
That was a hell of an answer 🙂
Hi Rüdiger
Thank you very much for your answer. I think I need once a silent minute to do that and that's not the case in the moment.
Regards
Merrill
Great post Rudiger:)
THanks