Quick’n’dirty solution for parsing XLSX files on the go
the method I want to describe in this blog is not new and is probably known to ABAP gurus, however many of the beginners often ask questions regarding this topic, hence this blog has appeared. The described approach is not a comprehensive Excel parsing solution but rather a life ring when the customer urges for quick solution and the deadline passed “yesterday”.
Any of you I definitely faced such common task as Excel files processing, this task often is raised in different SAP environments like OData services, Webdynpro apps, FPM apps and so on. Very often everything you have is a XLSX file and/or an XSTRING data made from it, knowing nothing about the structure, and you need quickly parse the table contents for processing in your app. How would you proceed?
SAP provides many Excel-related programming tools, they are listed on this handy page:
why not to use them? There are couple of asterisks here:
- Almost all standard Excel reading tools are based on OLE and do not allow reading in batch or non-dialog mode
- All the rest are third-party and highly specific like PI libraries
One would say: why the heck we need another tool if we already have ABAP2XLSX?
ABAP2XLSX is certainly a great and powerful tool and I used it on my projects too but often there are difficulties with its installation. Some customers dogmatically refuse to install custom packages to system, some are bothered about license issues and some guys are open to third-parties but approval of installing the third-party into the ABAP system will take months in the big multinational dinosaurs like the one I’m working now.
And when the Excel parsing is required utterly, just now, because all the deadlines are over and it was expected “yesterday”, in such crucial situations I propose to clients the solution I will describe here, it cannot be really called a solution because it has many flaws, but it can do the basic job and transform Excel file into internal table in ABAP. I invite you to treat it like a proof-of-concept rather than solution, however it is fully functional and can be adapted to your needs easily (or not so easily).
History of the approaches
When I was analyzing possible solutions to the problem I considered many candidates but there were the reasons why nothing fitted my needs. I reviewed even the ancient time pieces (2009 was lon before the COVID) and some of them were quite cool, like this one from the long respected ABAP warrior Naimesh Patel Yes, the XML generation offered by Naimesh worked fast and has simplicity but the format itself is inferior by design. Spreadsheet ML despite the name is not a full-featured spreadsheet but just a subset of XML capable of representing tabular data with the numerous limitations.
There is also a marvelous piece of work done by Madhu Omer, and honestly I was very impressed by the work done, but for my task it seemed over-complicated and touching only XLSX generation part and I urgently sought a way to parse. Also, iXML Library she used in her tool has some performance drawbacks and is not adapted for big data.
Another good attempt to achieve the same result was done by Rajesh Rajgor. Unlike Madhu with her Simple Transformations he used direct construction of XML template through ABAP string templates. Unfortunately this approach is not very flexible and again it is just a Spreadsheet ML, not a full-blown XLSX spreadsheet, so his work is only a modern rethinking of Naimesh old converter.
What I found really interesting is an import/export solution by Trevor Zhang written in modern classes and a modern concise syntax. Modern problems require modern solutions, they say, hehe.
I wasn’t aware of cl_ehfnd_xlsx class that was introduced with ABAP AS 752 and it is still very little info about it in the Web, so it seems more like internal S4HANA stuff not intended for customers. Anyway, it is an option but yeah, it requires 752 release and our system was still on 750 so I was out of the luck again :((
Integral parts of the parser
Let me briefly outline the nuts and bolts of .XLSX parsing which one must know before starting the implementation:
- XLSX format in fact is not a single file, but a set of files which define how the Excel workbook will look like. This is a fundamental difference from old .XLS which was a binary non-extractable container and from primitive SpreadsheetML which defines worksheet in a single XML.
- The main decommissioning parts of XLSX file are
- sheet files (sheet1.xml, sheet2.xml, …), they contain markup for data placement on a sheet
- shared strings file (sharedStrings.xml), it contains a deduplicated array of values
- styles definition (styles.xml) which defines how the cells of worksheet will look like
- workbook file (workbook.xml) it set up the structure of workbook and worksheets in it
- many others…
For parsing task we are interested only in the first two: sheet1.xml and sharedStrings.xml, it is an absolute minimum required to successfully recreate the table in ABAP.
The sheet1.xml file describes data structure across the sheet, the core part of it is <sheetData>
<sheetData> <row r="1" x14ac:dyDescent="0.25" spans="1:107"> <c r="A1" t="s"> <v>21</v> </c> <c r="B1" t="s"> <v>22</v> </c> <c r="C1" t="s"> <v>23</v> </c> </row> </sheetData>
The shared data is nothing more than just an array of Excel sheet values stored in string format:
<sst count="309" uniqueCount="83"> <si> <t>MANDT</t> </si> <si> <t>CARRID</t> </si> <si> <t>CONNID</t> </si> <si> <t>COUNTRYFR</t> </si> </sst>
Let me share the steps I used for grabbing the table from XLSX file and turning it into ABAP itab:
- Extract the sheet structure (sheet1.xml) into XSTRING XML
- Extract the array of sheet values (sharedStrings.xml)
- Transform both XMLs into internal tables through ST transformations
- Construct the result internal table by mapping indices in sheet file against array of values
Enough words and let’s jump to the code.
Main parsing class xlsx_reader:
CLASS xlsx_reader DEFINITION. PUBLIC SECTION. METHODS: read IMPORTING file TYPE string first TYPE abap_bool ddic TYPE string EXPORTING tab TYPE REF TO data, extract_xml IMPORTING iv_xml_index TYPE i xstring TYPE xstring RETURNING VALUE(rv_xml_data) TYPE xstring. ENDCLASS. CLASS xlsx_reader IMPLEMENTATION. METHOD read. TYPES: BEGIN OF ty_row, value TYPE string, index TYPE abap_bool, END OF ty_row, BEGIN OF ty_worksheet, row_id TYPE i, row TYPE TABLE OF ty_row WITH EMPTY KEY, END OF ty_worksheet, BEGIN OF ty_si, t TYPE string, END OF ty_si. " Excel varaibles DATA: data TYPE TABLE OF ty_si, sheet TYPE TABLE OF ty_worksheet. " RTTS variables DATA: lo_struct TYPE REF TO cl_abap_structdescr, table TYPE abap_component_tab. FIELD-SYMBOLS: <table> TYPE STANDARD TABLE. TRY. " loading XLSX zip from file DATA(xstring_xlsx) = cl_openxml_helper=>load_local_file( file ). CATCH cx_openxml_not_found. ENDTRY. "Read the sheet XML DATA(xml_sheet) = extract_xml( EXPORTING xstring = xstring_xlsx iv_xml_index = 2 ). "Read the shared data XML DATA(xml_data) = extract_xml( EXPORTING xstring = xstring_xlsx iv_xml_index = 3 ). TRY. " transforming sheet structure into ABAP CALL TRANSFORMATION zsheet SOURCE XML xml_sheet RESULT root = sheet. " transforming shared data into ABAP CALL TRANSFORMATION zxlsx SOURCE XML xml_data RESULT root = data. CATCH cx_xslt_exception. CATCH cx_st_match_element. CATCH cx_st_ref_access. ENDTRY. DATA(header_line) = VALUE #( sheet[ 1 ]-row OPTIONAL ). IF first IS NOT INITIAL AND header_line IS NOT INITIAL. "building itab from first line table = VALUE #( BASE table FOR ls_key IN header_line ( name = data[ ls_key-value + 1 ]-t type = CAST #( cl_abap_datadescr=>describe_by_name( VALUE #( data[ ls_key-value + 1 ]-t OPTIONAL ) ) ) ) ). DELETE sheet INDEX 1. ELSE. "building itab of strings DELETE header_line WHERE value IS INITIAL. DO lines( header_line ) TIMES. APPEND VALUE #( name = 'field' && sy-index type = CAST #( cl_abap_typedescr=>describe_by_name( 'STRING' ) ) ) TO table. ENDDO. ENDIF. " creating structure from DDIC structure IF ddic IS NOT INITIAL. lo_struct ?= cl_abap_structdescr=>describe_by_name( ddic ). ELSEIF table IS NOT INITIAL. " create structure from previously constructed type handle TRY. lo_struct = cl_abap_structdescr=>create( table ). CATCH cx_sy_struct_creation . ENDTRY. ENDIF. " creating table from structure CHECK lo_struct IS BOUND. DATA(dyntable_type) = cl_abap_tabledescr=>create( p_line_type = lo_struct ). CREATE DATA tab TYPE HANDLE dyntable_type. ASSIGN tab->* TO <table>. * mapping structure and data LOOP AT sheet ASSIGNING FIELD-SYMBOL(<fs_row>). APPEND INITIAL LINE TO <table> ASSIGNING FIELD-SYMBOL(<line>). DELETE <fs_row>-row WHERE value IS INITIAL. LOOP AT <fs_row>-row ASSIGNING FIELD-SYMBOL(<fs_cell>). ASSIGN COMPONENT sy-tabix OF STRUCTURE <line> TO FIELD-SYMBOL(<fs_field>). CHECK sy-subrc = 0. <fs_field> = COND #( WHEN <fs_cell>-index = abap_false THEN <fs_cell>-value ELSE VALUE #( data[ <fs_cell>-value + 1 ]-t OPTIONAL ) ). ENDLOOP. ENDLOOP. ENDMETHOD. METHOD extract_xml. TRY. DATA(lo_package) = cl_xlsx_document=>load_document( iv_data = xstring ). DATA(lo_parts) = lo_package->get_parts( ). CHECK lo_parts IS BOUND AND lo_package IS BOUND. DATA(lv_uri) = lo_parts->get_part( 2 )->get_parts( )->get_part( iv_xml_index )->get_uri( )->get_uri( ). DATA(lo_xml_part) = lo_package->get_part_by_uri( cl_openxml_parturi=>create_from_partname( lv_uri ) ). rv_xml_data = lo_xml_part->get_data( ). CATCH cx_openxml_format cx_openxml_not_found. ENDTRY. ENDMETHOD. ENDCLASS.
Transformation zheet for sheet files
<?sap.transform simple?> <tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main"> <tt:root name="root"/> <tt:template name="main"> <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac= "http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3= "http://schemas.microsoft.com/office/spreadsheetml/2016/revision3"> <tt:skip count="4"/> <sheetData> <tt:loop name="row" ref="root"> <row> <tt:attribute name="r" value-ref="row_id"/> <tt:loop name="cells" ref="$row.ROW"> <c> <tt:cond><tt:attribute name="t" value-ref="index"/><tt:assign to-ref="index" val="C('X')"/></tt:cond> <tt:cond> <v> <tt:value ref="value"/> </v> </tt:cond> </c> </tt:loop> </row> </tt:loop> </sheetData> <tt:skip/> </worksheet> </tt:template> </tt:transform>
Transformation zxlsx for shared strings
<?sap.transform simple?> <tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main"> <tt:root name="ROOT"/> <tt:template name="main"> <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <tt:loop name="line" ref=".ROOT"> <si> <t> <tt:value ref="t"/> </t> </si> </tt:loop> </sst> </tt:template> </tt:transform>
Sample calling program:
START-OF-SELECTION. PARAMETERS: p_file TYPE string LOWER CASE DEFAULT `C:\table.xlsx`. SELECTION-SCREEN BEGIN OF BLOCK out WITH FRAME TITLE text-s01. SELECTION-SCREEN BEGIN OF LINE. SELECTION-SCREEN COMMENT 1(25) text-002. PARAMETERS: p_hdr TYPE xfeld MODIF ID hdr USER-COMMAND hdr. SELECTION-SCREEN COMMENT 30(25) text-001. PARAMETERS: p_ddic TYPE string MODIF ID dic. SELECTION-SCREEN END OF LINE. SELECTION-SCREEN END OF BLOCK out. AT SELECTION-SCREEN ON VALUE-REQUEST FOR p_file. p_file = cl_openxml_helper=>browse_local_file_open( iv_title = 'Select XLSX File' iv_filename = '' iv_extpattern = 'All files(*.*)|*.*' ). AT SELECTION-SCREEN OUTPUT. IF p_hdr = abap_true. DATA(imp) = 1. CLEAR: p_ddic. ELSE. imp = 0. ENDIF. LOOP AT SCREEN. CASE screen-group1. WHEN 'DIC'. SCREEN-input = COND #( WHEN imp = 1 THEN 0 ELSE 1 ). ENDCASE. MODIFY SCREEN. ENDLOOP. AT SELECTION-SCREEN. FIELD-SYMBOLS: <fs_out> TYPE ANY. IF sy-ucomm = 'ONLI'. DATA(reader) = NEW xlsx_reader( ). reader->read( EXPORTING file = p_file first = p_hdr ddic = p_ddic IMPORTING tab = DATA(tab) ). ASSIGN tab->* TO FIELD-SYMBOL(<table>). ENDIF.
Samples of usage
The idea of the program is that user has two options for parsing: either to parse into generic string table or to receive a fully-typed internal table in his hands. Type can be specified explicitly on the selection screen (DDIC structure input box) or the program can derive type from the first line of Excel table provided it is filled with data element names (Use 1st line as structure checkbox).
Sample input for 1st line-typed table MARC (subset of columns)
and the output to fully-typed itab
Sample input for explicitly specified DDIC structure KALC
pay attention that in spite of the “number stored as string” values we received fully-typed table in ABAP, and even FLTP values are stored correctly
And finally the most simple case with conversion into string table
A couple of notes about limitations of the current sample implementation:
- It parses only first sheet in this variant (easily curable by adding couple lines of code)
- It does not respect blank columns in an Excel sheet
- Dates are not recognized properly ’cause in sharedStrings.xml they are stored in Epoch format
- Decimal number values may throw a dump while parsing if user locale settings differs from settings of those who sent the file
Come down to, I didn’t set a task to make a comprehensive tool for all situations, but rather show to community how it can be done in a simple and standard way, you are free to adapt and tailor the class to your needs.
And now about the advantages my solution posses over other approaches:
- Pretty and concise, the class consists of only 100 lines of code
- Absolutely standard, based on CL_XLSX_DOCUMENT class which is available for almost all releases, except the most ancient ones, particularly ≥7.02
- Built on simple transformations in contrast to XSLT ones like in unified approach,for example. ST are significantly faster on big data, they are more intuitive for adaptation to customer needs and (the sweetest!) they are two-sided, and can be used both for serialization and deserialization.
- Performance. I tested my parser on big amounts of data and it works good enough (10 sec on 100Mb file), whilst, e.g. ABAP2XLSX parsing is based on CL_IXML class (iXML library) and does not perform well on big datasets.
Feel free to comment and share your considerations, dear community!
Thank you for this very interesting approach. I'm currently working on a project where we use the class cl_fdt_xl_spreadsheet which seems to work fine for the moment. But we are lucky to work on a very new version of ABAP.
I didn't know the class cl_ehfnd_xlsx.
Yes, on the latest ABAP versions you have more options to consider.
Thank you for the post, now we will wait for ABAP2XLSX fans to talk about reinventing wheel every month or so 🙂
Haha, I know they will do it 🙂
Well, this blog at least mentions ABAP2XLSX, so might get a free pass. 🙂
Honestly though, every time I see "Excel" or "XLSX" in an ABAP blog title I feel like we need to update a sign "It's been X days since last Excel blog". 🙂 A bit of a subject fatigue. And yet SAP still hasn't gotten a clue that maybe some better standard functionality is needed for this. Go figure.
Thanks a lot for this fantastic blog post!! 🙂
Can you please give me a hint on how and where I shall adjust it so it will respect blank columns?
Thanks for your feedback, Michal.
The complexity with empty columns lays in the following: there is no single representation of them in OpenXML. Consider this sample based on the extracted from SE16 VBAK table:
in the table resulted from transformation of sheet1.xml they are presented in different ways
this happens because they stored like this in the sheet.xml file within XLSX, some of them are closed <c> tags with no value <v> inside, and some of them are presented by index-coded values
This 6th index in the shared.xml unique values file is nothing more than emptiness
So they are treated correctly with index representation, but they are not respected when put explicitly. I decided not to overcomplexify my class with that, to show the approach uncluttered.
You can try to remove the line 90 DELETE <fs_row>–row WHERE value IS INITIAL., however it may show inaccuracies with tables 26+ columns where some columns has 2-chars indices (AA, AB, AC…). I didn’t properly tested wide tables scenario.
I appreciate your reply Pavel!
I tried that removal of line 90 but it did not help. Empty cells are ignored and data is shifted to the left.
I guess it would require a more complex change starting from those simple transformations to not skip empty cells.
Either way, it was a very nice exercise and I learned quite a lot from your blog post and from the links you have included in it. Once more - thanks!!
In the meantime, I found out that cl_fdt_xl_spreadsheet class is available at my customer system so I could build final solution based on this one.
Many thanks for your post. By the way, seems this solution only works for .xlsx, but not for .xls or .csv. So do you konw what to do with .xls or .csv files ?
Yes indeed, this solution is only for OpenXML-based XLSX files. the old XLS are binary and cannot be decyphered like this. For reading XLS you can use these FMs
Thanks for your kindly replies. But seems these FMs cannot be used in background processing.I will try to find another way.
yes. all of them use OLE so they do not fit for background. Your only choice is ABAP2XLSX