How do I know whether the data has to be cleansed?

Hi All,

I am completely new to this and I have a basic question that came up while I was taking some classes on DS. How do I know what kinds of errors exist in the heterogeneous data sources that I am going to stage? Say a source has 1 million records. I know that I can't go through each and every record, and I don't think I can write SQL code to find the errors either, because I don't know what kinds of errors the data contains. Please let me know how the flow or process actually works.

  • Hi Raghu

    You need to define what is valid data, and use a validation transform to validate the data.

    For example, if a customer name can be at most 30 characters and cannot contain special characters, create validation logic for that. The records passing through a Validation transform are divided into "pass" and "fail" records. You can create such validation rules for all columns of a table in a single Validation transform.

  • Hi Raghu,

    As Debapriya Mandal said, you can use the Validation transform and use both of its outputs, "Pass" and "Fail". Attach a separate table to the Fail output so that all failed records are inserted into that table, where you can review and analyze the errors.
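
To make the idea in the replies above concrete, here is a minimal sketch in plain Python. It is not the Data Services Validation transform itself, only an illustration of the same pattern: define what valid data looks like, split records into "pass" and "fail", and keep the reason for each failure so the errors can be analyzed afterwards. The column name `customer_name`, the rule names, and the sample rows are all hypothetical, and the sketch assumes every value is a string.

```python
import re
from collections import Counter

# Hypothetical rules based on the customer-name example in the thread.
# Each rule maps a name to a function that returns True for a valid row.
RULES = {
    "name_max_30_chars": lambda row: len(row["customer_name"]) <= 30,
    "name_no_special_chars": lambda row: re.fullmatch(
        r"[A-Za-z0-9 ]*", row["customer_name"]
    ) is not None,
}


def validate(rows, rules):
    """Split rows into (passed, failed); failed rows record the rules they broke."""
    passed, failed = [], []
    for row in rows:
        broken = [name for name, check in rules.items() if not check(row)]
        if broken:
            # Analogous to writing the record to a separate "fail" table.
            failed.append({**row, "failed_rules": broken})
        else:
            passed.append(row)
    return passed, failed


def summarize_failures(failed):
    """Count how many rows broke each rule, to see which errors dominate."""
    return Counter(rule for row in failed for rule in row["failed_rules"])


# Made-up sample rows for illustration only.
rows = [
    {"customer_name": "Alice Smith"},
    {"customer_name": "Bob #1"},
    {"customer_name": "X" * 31},
    {"customer_name": "Bad@" + "Y" * 30},
]

passed, failed = validate(rows, RULES)
summary = summarize_failures(failed)
print(len(passed), len(failed))
print(dict(summary))
```

With a large source, a per-rule summary like this shows which kinds of errors are common without anyone reading individual records; the failed rows can then be inspected in detail for the rules that fail most often.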