Recently I have been working with BI to do some management reporting. One of my tasks is to take some text files with some fairly dirty data and import them into our BI system.
Our file contains a list of semi-colon separated fields with double quotes surrounding fields containing semi-colons as part of the data. The first thing that threw me is that BI calls the surrounding quote characters the escape character. In my world (being a Java programmer for as long as I can remember), the escape character is normally the back slash and is used to change the way the next character is interpreted. So \n becomes a new line, \t is a tab and \” is used to put a double quote into the middle of a double quoted string. This is also how the system that produces our source data handles the issue – if a double quote shows up in the middle of a double quoted string, it “escapes” it with a back slash – \”.
File split error. Hmmm, not good.
So after a quick search threw up no ideas, I fell back on the swiss army knife of the Linux user – Perl. Before long I put together a quick Perl one liner to pull out these escaped double quote characters.
perl -pi -e ‘s/
“//g’ perl -pi -e ‘s/
“(?!;)//g’ for f in ls *; do echo Processing file $f; perl -pi -e ‘s/
“(?!;)//g’ $f; done</pre>