Sampling in the Age of Big Data
This is the fifth blog in a series on applying critical thinking to information in order to make well-informed and sound decisions. The blogs, and the accompanying white paper, are the result of a collaboration between SAP and Dr. Sharon Bailin and Dr. Mark Battersby, published authors and professors of critical thinking.
Sampling is the backbone of all polling, marketing surveys, and health studies. The idea of sampling is to survey a number of people in a population, find out some information about the sample (e.g., voting intentions, purchasing plans), then use the data from that sample to infer similar information about the population. Surprisingly, typical national media polls sample only between 1,000 and 1,500 people, yet claim an accuracy of approximately +/-3 percentage points.[i] For example, if a sample shows a 52% approval rating for a candidate, then it is reasonable to infer that the approval rating in the population is somewhere between 49% and 55%.
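The +/-3-point figure follows from the standard margin-of-error formula for a proportion estimated from a simple random sample. A minimal sketch in Python (using the conventional 95% confidence z-value of 1.96, and the worst-case proportion p = 0.5) shows why samples of 1,000 to 1,500 land in roughly that range:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    n: sample size; p: assumed proportion (0.5 is the worst case);
    z: z-value for the desired confidence level (1.96 for 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Samples of 1,000-1,500 give margins near +/-3 points:
print(round(margin_of_error(1000) * 100, 1))  # ~3.1 points
print(round(margin_of_error(1500) * 100, 1))  # ~2.5 points
```

Note that the margin shrinks only with the square root of the sample size, which is why pollsters rarely pay for samples much larger than this.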
This kind of sample accuracy depends on the sample being produced by random sampling of the target population, meaning everyone in the target population has the same chance of being included in the sample. This is virtually impossible in actual polling. The inevitable result is selection bias, which has many causes: people don’t answer their phones, people refuse to answer, people don’t have landlines, people don’t speak the language of the pollsters, etc. In addition, there is the challenge of making sure that those sampled are in the target population (e.g., voters, potential clients). For example, usually only about half of those eligible to vote actually vote. A simple random sample (assuming that were possible) of the adult population would therefore be about 50% non-voters, i.e., half the sample would not be members of the target population (see Battersby’s book Is That a Fact? for a more comprehensive explanation).
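A small simulation makes the cost of selection bias concrete. In this illustrative sketch (the response rates are invented for the example), a population splits 50/50 on a question, but supporters are twice as likely to respond to the survey. The biased sample misses badly even at n = 1,000, far outside the +/-3-point margin the mathematics promises:

```python
import random

random.seed(0)

# Hypothetical population: exactly 50% support candidate A (1 = supporter).
population = [1] * 50_000 + [0] * 50_000

# An unbiased simple random sample lands close to the true 50%.
srs = random.sample(population, 1000)
print(sum(srs) / len(srs))  # close to 0.50

# Selection bias: supporters respond at 30%, non-supporters at 15%.
respondents = [x for x in population
               if random.random() < (0.30 if x else 0.15)]
sample = random.sample(respondents, 1000)
print(sum(sample) / len(sample))  # roughly 0.67 -- wildly off
```

The error here has nothing to do with sample size: drawing a million biased respondents instead of a thousand would reproduce the same distortion with even more misleading precision.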
This means that you cannot simply trust the mathematics of sampling when you infer from the sample to the population. There is always a question of just how generalizable your sample is to your target population.
The sample size in polling is paltry compared to the sample sizes in the age of “big data,” but big data is still a sample. Take, for example, consumer databases with millions of records of consumer spending. Enormous as they are, they are still just a particular sample of the spending habits of the population of American consumers. In fact, each is only a sample (at a particular time) of the population of consumers in the database. Sample size is clearly not an issue, but any inference from this data must address the issue of generalizability to the greater population and to the future behavior of the consumers in the database.
One might think that big data would eliminate the problems of sampling, even selection bias. Not necessarily. One of the first great efforts at “big data” sampling resulted in one of the most famous failures in polling history. In 1936, The Literary Digest sent surveys to 10 million people in an effort to predict the winner of that year’s presidential election. They received 2.3 million surveys back. On the basis of this poll, the Literary Digest predicted that Landon, a Republican, would beat the incumbent Roosevelt by a margin of 3 to 2. Wrong. Roosevelt received over 60% of the vote, not the 40% that the Literary Digest had predicted! The problem? Selection bias. The 10 million were selected from telephone lists, magazine subscription lists, etc., biasing the survey toward the better-off. Allowing people to choose whether to respond further biased the survey against the incumbent.[ii]
The solution? In the same year that the Literary Digest so badly miscalculated, George Gallup made his name by using a representative sample to correctly call the election. Gallup’s approach was to create a sample that shared demographic characteristics with the population, based on such obvious criteria as age, gender, and geography. While this approach was successful in 1936, in the 1948 election he badly miscalculated the results using exactly the same “representative” method for obtaining a sample.[iii] Intuitively plausible as the idea of representative sampling is, it is very tricky to determine the relevant factors that constitute “representativeness.”
A “Big Data” Example:
When a large accounting website decided to create a statistical picture of the US consumer, they faced the problem that their data, while enormous, was not randomly selected from the American population; neither was their sample a representative one, its users being typically wealthier, younger professionals. They tried to address the issue of representativeness by matching the distribution of characteristics like age, gender, location, and income level to that of the US population. This resulted in a much smaller sample, but one that was still in the millions! They were trying to take their manifestly non-representative, non-random sample and turn it into a representative sample, just as Gallup did in 1948.
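The matching idea described above is broadly similar to what statisticians call post-stratification: reweight the sample so its demographic mix matches the population’s. A toy sketch (all group names, counts, and dollar figures are invented for illustration) shows the mechanics on a single dimension, age:

```python
# Hypothetical sample that skews young, versus known population shares.
sample_counts = {"18-34": 600, "35-54": 300, "55+": 100}
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

n = sum(sample_counts.values())

# Weight for each group = population share / sample share.
weights = {g: population_shares[g] / (sample_counts[g] / n)
           for g in sample_counts}

# Average spending observed per group in the (biased) sample.
avg_spend = {"18-34": 120.0, "35-54": 180.0, "55+": 90.0}

# Naive estimate overweights the young; weighting pulls it back.
naive = sum(sample_counts[g] * avg_spend[g] for g in sample_counts) / n
weighted = sum(sample_counts[g] * weights[g] * avg_spend[g]
               for g in sample_counts) / n

print(round(naive, 2), round(weighted, 2))  # weighted 130.5 vs naive 135.0
```

The catch, as the next paragraph argues, is that weighting corrects only the characteristics you match on; any relevant factor left out of the weights remains an uncorrected source of bias.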
The problem is that we don’t really know all the relevant considerations needed to make a sample truly representative. Ethnicity comes to mind, but it is also unlikely that the lower-income users of the website are in any way representative of people in the lower income brackets generally. Of course, the data from the website does provide an excellent snapshot of its contemporary users and trend lines of their consumption patterns at a particular time. If a business thinks that this group is representative of its customers, then inferences to those customers’ habits can be reasonable (though there is still the problem of whether these trends will continue).
What’s to be done?
Realize the uncertainty and limits that such efforts involve. Big data is great, but like all samples, it will involve selection bias and be limited to a particular target population at a particular time. This is not a problem if that is kept in mind. But if the data is to be used to infer information about a population different from the one from which the data was collected (even after adjustment to make it “representative”), it needs to be used with much more caution.
As the polling guru Nate Silver reminds us in his book The Signal and the Noise, the best way to make reasonable predictions is to draw on every reasonable source of data, from a variety of different sources and perspectives, and put them together in a thoughtful and credible manner. In other words, even with big data and impressive analytical software, we still need to think critically in order to make reasonable inferences and judgments.
[i] Battersby, M. Is That a Fact?: A Field Guide to Statistical and Scientific Information, Revised Edition (Broadview Press, 2013), p. 36.
[ii] Ibid. pp. 39-40
[iii] Ibid. p. 41