This article is for software developers who use random numbers in their code. It makes the point that there are different “types” of random numbers.
In machine learning and statistics, we use random numbers for a variety of tasks, such as initializing learning parameters or generating sample data. Although the numbers we get are “random”, the type of “randomness” sometimes matters for the application. Using the wrong type of random numbers can give wrong results and lead to wrong conclusions. The type of “randomness” is described by what is called a “probability distribution”. Two of the most commonly used distributions are the uniform and the normal. This article looks at the uniform distribution through a real-life use case and uses a visualization to show how the numbers are spread out.
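To make the difference between the two distributions concrete, here is a minimal sketch using only Python's standard library. The sample size, the bin count, and the normal distribution's mean and standard deviation are arbitrary choices made for illustration; they do not come from the use case discussed later.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N = 10_000
uniform_samples = [random.uniform(0.0, 1.0) for _ in range(N)]
normal_samples = [random.gauss(0.5, 0.15) for _ in range(N)]  # mean 0.5, std dev 0.15

def bin_counts(samples, bins=10):
    """Count samples falling in each of `bins` equal-width bins on [0, 1)."""
    counts = [0] * bins
    for x in samples:
        if 0.0 <= x < 1.0:  # the few normal samples outside the range are ignored
            counts[int(x * bins)] += 1
    return counts

uniform_counts = bin_counts(uniform_samples)
normal_counts = bin_counts(normal_samples)
print("uniform:", uniform_counts)
print("normal: ", normal_counts)
```

The uniform counts should come out roughly equal across all bins, while the normal counts pile up in the middle bins and thin out towards the edges.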
Let us consider a real use case.
Recently, someone needed a random sample of rows from a large table in a data set, and did the following to get it:
Get all rows, append a GUID column to them, order by this new GUID column, and take the top 500 rows. The intuition is that GUIDs are almost random numbers, so sorting by this column gives us randomly selected rows. Also, this Microsoft site seems to indicate that GUID comparison is similar to string comparison.
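The same idea can be sketched in Python. This is only an illustration of the technique, not the code that was actually used: the function name `sample_by_guid` and the stand-in `rows` list are hypothetical, and the sort here is a plain string comparison of the GUID text, which may differ in detail from how the database orders GUID values.

```python
import uuid

def sample_by_guid(rows, k):
    """Tag each row with a fresh random GUID, sort by the tag, keep the first k rows."""
    tagged = [(str(uuid.uuid4()), row) for row in rows]
    tagged.sort(key=lambda pair: pair[0])  # plain string comparison of the GUID text
    return [row for _, row in tagged[:k]]

rows = list(range(100_000))  # hypothetical stand-in for the table rows
sample = sample_by_guid(rows, 500)
print(len(sample), sample[:5])
```

Because every row gets a new GUID on each call, running the function again returns a different sample.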
The question is: how random are the GUIDs, really, and is there a way to verify the “randomness”? To answer this, we are going to plot a histogram and look at the frequency distribution.
Consider the following SQL query:
select top 5000 '0x' + SUBSTRING(convert(varchar(50), NEWID()), 1, 8) d from trade order by d
This gives 5000 GUID samples (the high-order bits of each) in hex form. We make sure that the output of this query goes to an ANSI text file so that we can read it in properly later.
In the Jupyter Notebook here, we have the Python code that simply reads this file and plots a histogram.
There! A nice, uniformly distributed set of 5000 GUIDs.
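For readers without the notebook at hand, the read-and-plot step might look roughly like the sketch below. It is not the notebook's code. The file name `guids.txt`, the helper names, and the choice of 20 bins are assumptions, and the file is generated here from Python's `uuid.uuid4()` as a stand-in for the real query output so that the example runs on its own.

```python
import os
import tempfile
import uuid

def read_hex_values(path):
    """Read one hex string such as 0x1A2B3C4D per line and return the integers."""
    values = []
    with open(path, encoding="ascii") as handle:  # the query output was saved as ANSI text
        for line in handle:
            text = line.strip()
            if text.lower().startswith("0x"):  # skips headers, separators, blank lines
                values.append(int(text, 16))
    return values

def bin_counts(values, bins=20, upper=2**32):
    """Count how many values fall into each of `bins` equal-width ranges of [0, upper)."""
    counts = [0] * bins
    for value in values:
        counts[value * bins // upper] += 1
    return counts

def plot_histogram(values, bins=20):
    """Draw the histogram; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt
    plt.hist(values, bins=bins, range=(0, 2**32), edgecolor="black")
    plt.xlabel("first 8 hex digits of the GUID, as a number")
    plt.ylabel("frequency")
    plt.show()

# Stand-in for the query output: 5000 random GUID prefixes written to a temp file.
path = os.path.join(tempfile.mkdtemp(), "guids.txt")  # hypothetical file name
with open(path, "w", encoding="ascii") as handle:
    for _ in range(5000):
        handle.write("0x" + str(uuid.uuid4())[:8].upper() + "\n")

values = read_hex_values(path)
print(bin_counts(values))
```

In a notebook, calling `plot_histogram(values)` draws the chart. With 5000 values and 20 bins, each printed count should be somewhere near 250 if the values are uniformly spread.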
So, the conclusion is that GUIDs are indeed random, and any reasonably large set of numbers (say, 100 or more) can be tested for randomness using a histogram.
Developers reading this who are interested can try studying the distribution of other seemingly random numbers, such as machine-generated bank PINs, OTPs (one-time passwords) for Internet transactions, multi-factor authentication codes, and so on.