In ETL, using Excel files as source is very common as it is the most user friendly tool for data collection. In Data Migration, MS Excel files are very common in collecting data by business users for data objects like materials, articles, bank information etc., One of the key challenges with using Excel data source is the date datatype.The purpose of this document is to provide a solution to such an issue with “date” data type when performing ETL using Excel Data Sources.
When extracting data from Excel workbooks using BusinessObjects Data Services, the Microsoft OLEDB Access Database Engine is used by the jobserver. This add-in from Microsoft is also necessary for viewing data in MS Excel using the quick data viewer option of BODS Designer software.Hence the settings of the OLEDB installation is the driver to decide how the data is displayed on the front end and when extracted into BODS into staging database. When using Citrix installation for development using the Data Services Designer, one should note that the OLEDB settings for the citrix client might be different to that of the jobserver or the desktop on which the excel is viewed.
Fig1 below shows the dates displayed in “DD/MM/YYYY” format.
Note that the the separator is character “/” wen viewed using the MS Excel installation on the laptop. This is because the default settings of the laptop is set to English UK where the format is set to “DD/MM/YYYY”.One should also note that the data is stored as date in the Excel work book.
Fig2 below shows the same Excel workbook when opened using the Quick data Viewer of BODS installation in the CITRIX Client.
Note that in the Fig2 above the separator has changed from “/” to “.”. This is because the OLEDB settings on the citrix client is different to that of the laptop on which the Excel file was originally viewed.The difference in the settings has changed the data from “DD/MM/YYYY” to “DD.MM.YYYY” format.
The the data flow is executed to extract data from this excel workbook to a database table, the format of the data extracted and stored is governed by the Microsoft OLEDB Access Database Engine settings of the jobserver. Fig3 below shows the output of the extraction stored into a database table.
Note that the data stored into the database table is stored in “MM/DD/YYYY” format. The data extracted is altered from “DD/MM/YYYY” format to “MM/DD/YYYY” format and stored into the database table. The data type definition for all the columns in Fig 3 for the Excel Workbook and table ‘EXCEL_TEST’ is set to Varchar (255) as shown in Fig 3a and Fig 3b.
Note that in both the cases of source and target structures, the datatype definition is set to Varchar(255).
One can also set the datatype of the date columns to “datetime” or “Date” in the Excel Workbook definition. However, it should be noted that the format of date held in the file and the actual datatype definition is influenced by manual intervention. This is because excel workbooks compiled are predominantly by users to help facilitate the data loads as opposed to using automatically generated/compiled Excel Workbooks.This means that defining the datatype as “Date” or “Datetime” poses the risk of data corruption or error during data extraction due to the Microsoft OLEDB Engine on the jobserver not being able to interpret the datatype definition correctly or due to incorrect date formats.
It is hence advisable to extract data from Excel workbook into a staging table as Varchar or character datatype to ensure that all the data is first extracted so as to allow any cleansing rules to be applied in the Data Services job during subsequent stages of the ETL.
One solution to this problem is provided in this document. The solution recommended is to define the date columns as text in the Excel Workbook so that it can be exported as text into the database table after which a combination or date and string functions can be used to conver the text into “Date” datatype.
Note that the formatting of the columns should not be performed using the “Format Cells”option of Microsoft Excel but by using the “Text to Columns” option of Microsoft Office. This is described below
When formatting the date data in Excel using the “Format Cells” option of Microsoft Excel, the data format is not preserved but actually converted from a date to a string of numeric charaters. In Fig 4 below, it can be see that when the format of the cell is changed from “Date” to “Text”, the value changes from a date to a string of numbers.
The alternative and more reliable approach to resolving this formatting issue in Microsoft Excel is to utilise the “Text to Columns” option in Microsoft Excel. The steps involved are shown below.
Select the entire column of data with the date that needs to be formatted to text. Then click on “Text to Columns” in the “Data” tab of the ribbon in Microsoft Excel.
It should be noted that the Ribbon in Microsoft Excel is the term used to refer to the various buttons and tabs on top of the Microsoft Excel Window. This term has become prominent in Microsoft Excel Version 2007 and higher.This will open the “Convert Text to Columns Wizard” dialog box. This is a three step process as explained below.
Step 1: As shown in Fig 6, ensure that the option is set to “Delimited” in the dialog box.
As described in Fig 6 above, it should be noted that the “Text to Columns” option is used to split data held in one Excel Column into multiple columns either using a delimiter or fixed width option. However in our case, we are using it to type cast the single column data into “Text” datatype whilst preserving the format and data.
Step 2: The dialog box for step 2 will appear with “Tab” as the delimiter. Leave this unaltered. If on the other hand there is a different delimiter chosen, change it to “Tab”. As we are not intending to split the data in the column to multiple columns, we can perform this action by leaving the delimiter as “Tab”. Also note that if the delimiter is chosen as one of the characters that is used a separator in the dates, it is likely to split the single date column into three columns. Hence care should be taken in this step to ensure that the delimiter is not the separator in the date.
As shown in Fig 7 above, select the delimiter as “Tab” and click on Next.
Step 3: In this step, change the “Column data format” from “General” to “Text”. This is show in Fig 8 below.
As shown in the Fig 8 above, it should be noted that after changing the data format from General to Text, the new data format will appear in te column under the “Data Preview” section of the dialog box. Check to ensure that this is the case and the click on Finish.
Now the data in the cell is unaltered but stored in text format.
From the Fig 9 above, it can be seen that the data in Column C is left justified to signiify that is it stored as text whilst in Fig 4 and Fig 5 the same date was rght justified signifying that it is stored in “Date” format within the Excel workbook.
When the same dataflow in Data Services is now executed, the data stored in the output table is in the same format as the source.
As shown in Fig 10 above, both the target data in the database table and the source data in Excel workbook hold the date in “DD.MM.YYYY” format thus preventing any data corruption. This approach provides a reliable method to ensure that the extraction from the Excel Workbook for date information is not affected by the settings of the Microsoft OLEDB Access Database Engine settings of client/server installation of the BODS Environment.
As mentioned in the earlier part of this document, the approach described here is one of many possible methods to overcome the formatting issues faced with Excel workbook. I welcome all feedback to improve/enhance this document to tackle more scenarios for “Date” datatype and tackling this when using Excel workbooks as source in ETL.