How do I know the data has to be cleansed?
I am completely new to this and have a basic question. I have been taking some classes on DS and a question came up: how do I know what kinds of errors exist in the heterogeneous data sources that I am going to stage? Say a source has 1 million records. I know that I can't go through each and every record, and I don't think I can write SQL code either, because I don't know what kinds of errors it has. Please let me know how the flow or process actually works.
You need to define what valid data is, and then use a Validation transform to validate the data.
For example, if a customer name can be at most 30 characters and cannot contain special characters, create validation logic for that. The records passing through a Validation transform are divided into "pass" and "fail" records. You can create such validation rules for all columns of a table in a single Validation transform.
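To make the pass/fail idea concrete, here is a rough sketch in plain Python. It is not Data Services syntax, only an illustration of the same logic. The field name customer_name, the helper names, and the exact definition of "special characters" (anything other than letters, spaces, periods, apostrophes and hyphens) are assumptions made for the example.

```python
import re

MAX_NAME_LENGTH = 30
# Assumed rule: only letters, spaces, periods, apostrophes and hyphens are allowed.
ALLOWED_NAME = re.compile(r"[A-Za-z .'-]+")


def is_valid_customer_name(name):
    """Return True when the name satisfies the length and character rules."""
    if name is None:
        return False
    return len(name) <= MAX_NAME_LENGTH and ALLOWED_NAME.fullmatch(name) is not None


def split_pass_fail(records):
    """Route each record to the 'pass' or 'fail' list, like the two outputs of a validation step."""
    passed, failed = [], []
    for record in records:
        if is_valid_customer_name(record.get("customer_name")):
            passed.append(record)
        else:
            failed.append(record)
    return passed, failed


# Hypothetical sample rows.
sample = [
    {"id": 1, "customer_name": "Alice Smith"},
    {"id": 2, "customer_name": "B@d N#me"},   # contains special characters
    {"id": 3, "customer_name": "X" * 31},     # longer than 30 characters
]
passed, failed = split_pass_fail(sample)
```

With the sample rows above, record 1 ends up in passed and records 2 and 3 end up in failed. The point is that you never inspect the million rows yourself: you write down the rules once and let every row be checked against them.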
Thank you for answering my question. As I said, I am just learning from the videos, so look forward to more questions. 🙂
As Debapriya Mandal said, you can use the Validation transform and use both of its outputs, "Pass" and "Fail". Attach another table to the Fail output so that all the failed records are inserted into that table, where you can see and analyze the errors.
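Once the failed records sit in their own table, a simple grouping query shows which kinds of errors are most common. The sketch below uses Python with an in-memory SQLite database purely for illustration. The table name customer_fail and the failed_rule column are hypothetical, and the rows are made-up samples; the fail table in your own job may carry different columns.

```python
import sqlite3


def load_failures(conn, failures):
    """Create the (hypothetical) fail table and insert the rejected rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_fail ("
        "id INTEGER, customer_name TEXT, failed_rule TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer_fail (id, customer_name, failed_rule) VALUES (?, ?, ?)",
        failures,
    )
    conn.commit()


def count_failures_by_rule(conn):
    """Return (rule, count) pairs, most frequent rule first."""
    rows = conn.execute(
        "SELECT failed_rule, COUNT(*) FROM customer_fail "
        "GROUP BY failed_rule ORDER BY COUNT(*) DESC, failed_rule"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
# Made-up failed rows, each tagged with the rule it broke.
load_failures(conn, [
    (2, "B@d N#me", "special_chars"),
    (3, "X" * 31, "max_length_30"),
    (4, "J0hn!", "special_chars"),
])
summary = count_failures_by_rule(conn)
```

A summary like this tells you where to focus the cleansing effort, and the rows behind each rule show what the bad values actually look like.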