Functioning of gen_row_num_by_group function in BODS
I was using gen_row_num_by_group function provided by Data Services in my project and after experimenting on it I recorded my observations and thought of sharing with you all.
Suppose you get a requirement to find the rank. Then BODS provides a function called gen_row_num_by_group which allots ranking to incoming rows.
But be wary of one thing when you use this function and that is, in layman language, all your incoming records should be properly arranged and in technical terms the column for which you’ll apply ranking should be ordered by first.
Let’s understand it by an example.
Consider my below source and target,
Mapping of Source to target is as follows:
First time when job is executed then records of target table are:
Now before jumping to gen_row_num_by_group let’s have a look at gen_row_num function.
GEN_ROW_NUM: – As the name suggests Generate Row Number so it assigns number to every incoming row each time when you execute your dataflow residing inside your job. By default it starts to increment the incoming records by 1 but if you map it like,
Gen_row_num() + 1 then for every incoming row it’ll add this ‘1’.
In general, it can be
Let’s understand with an example,
I’ve added Result column and mapped it with gen_row_num() function as shown below:
So after executing the job my output looks like:
See as I mentioned it added a sequence number for every incoming record.
Now let’s understand the functionality of gen_row_num_by_group.
Gen_Row_Num_By_Group: – This means make a group of similar records and then apply gen_row_num functionality. ( Ha ha this definition is created by myself. :))
Now as I said it’ll make a group for a specified column and then will assign row number to them.
I’ve mapped my result column with gen_row_num_by_group as shown:
After executing below is my target table: –
So logically what it did?
For a group of similar incoming records it assigned row numbers.
Like for incoming records from 101 – 104 the column C2 had values ‘A’ so it assigned 1,2,3,4 to it.
Then from 105 the value of column C2 changed, i.e., it became AB, so it assigned it 1, record 106 also had the same value for C2 so it assigned it to number 2.
In a similar fashion it did for all incoming rows.
So ideally our function has assigned rank to the similar group but wait, check value of 109, it is having value ‘1’ but it should’ve value ‘5’ right as per the functionality of gen_row_num_by_group.
So it means I need to arrange my records first orderly.
So I apply order by as shown for the column which I want to assign the rank:
Again I execute my job and the output looks like:
But all in vain, again for C1 = 109 I’ve rank ‘1’ against it.
So this function,
Firstly applied gen_row_num_by_group functionality to all the incoming source records then applied order by and arranged the records in an order fashion.
See C2 which has value ‘A’ forms a group, ‘AB’ forms a group.
Similarly all similar records in C2 form a group.
By group I mean they are arranged orderly.
So to resolve this issue and to apply proper ranking to my incoming source I first need to arrange the incoming records in an ordered fashion and then apply the gen_row_num_by_group functionality.
So I modify my mapping as shown:
First order by,
Then gen_row_num_by_group functionality,
Finally after executing the job my target table looks like:
So you can see that all similar records of C2 are assigned a proper rank.
How it processed and assigned rank to the incoming data?
Firstly it applied order by functionality to all the incoming records and arranged them in a similar group and then it applied ranking to it.
Hope I’m able to explain the functionality of gen_row_num_by_group and the way it processes the incoming data.
Please let me know if I’ve missed anything.
This is great and very insightful. It helped me understand the function's way of processing the data.
You're most welcome Antriksh. 🙂
Thanks for the post.
I am wondering if there is any way I Can get the same rank for all same group. In your example, for all input A values I would like to give rank 1 and then for all B values I would give rank 2 and for all AB, I would give value 3 etc..
Welcome Sobhan! 🙂
Thanks for asking this question. I wrote another article after going through this question. Here is the link which might be helpful:
Thanks for the question as well! 🙂
gen_row_num_by_group is not currently in the built in functions in my DS designer. How can that function be added?
It's available in 4.2 version.
Hi, I want to split 4 million source dataset into 20000 records each and assign a unique Id to each set of rows. How can I achieve this. Today I have 4 million as source, tomorrow it could be more or less rows. Thanks in advance for your help!
For gen_row_in_group() the input must have a column to group by and it must be sorted.
In your example I would have a first query where I generate the ROW_ID with the mapping gen_row_num(). In the next query this column is used in an additional column MODULO with the mapping trunc(ROW_ID/20000). The output of this column is 0 for the first 20'000 rows, 1 for the next 20'000 etc. A perfect group_by column. And then a final query with the gen_row_num_per_group(MODULO) to assign a number from 0...19'999 for each group.
Thank you for the swift response.
I tried this solution. I used trunc + 1 as I wanted to get unique Id starting from 1, for eg first 20000 records will have batch id =1 ,the next 20000 will have 2 and so on.
I also tried ( gen_row_num() -1)/20000 + 1.
For both the solutions, batch id =1 gets assigned for only 10,000 records. The next 20,000 records gets batch id 2,3 and so on. Could you please help me understanding why the batch id 1 has an issue of getting assigned to just 10000 records? Thank you!
I do not have a BODS installation at hand to try for myself.
Having said that, I can think of only one option and that is the degree of parallelism set for the dataflow. Can you please rightclick the dataflow and set it to 1?
I still don't understand why that might have an impact as the gen_row_num() and gen_row_num_per_group() are singularization points, hence the DOP should be 1 for this sub segment but that also depends a bit on the entire dataflow. Anyway, worth a try.
Thank you, I still have the same issue just for the first batch-id assignment. The rest is fine and working well..
Nice blog, I also have one question.
I have for example 3 columns A, B, C and i want to generate the row numbers for column C grouping column A but the number should be generate like highest value from B column should get number 1 second highest number 2 and so on.
can you please tell in BODS how to achieve this