Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Consequently, in data science, randomness in your data can negate the value of a predictive analytics model.
It is easy to be fooled by randomness. We often see randomness where there is none, and we often see patterns where there is only randomness. Here are 6 ways in which we can be fooled by randomness:
- We tend to pick out and focus on the “most interesting” results in our data, and ignore the uninteresting cases. For example, if you toss a coin 2000 times and you see a subsequence of 12 consecutive Heads in the sequence, then your attention is drawn to this interesting subsequence (and you might conclude that there is something unfair about the coin or the coin tossing), even though it is statistically reasonable for such a subsequence to appear somewhere in so long a sequence. This is selection bias, and it is also an example of “a posteriori” statistics (derived from observed facts, not from logical principles).
- We may unintentionally overlook the randomness in the data, especially in our rush to build predictive analytics models.
- Randomness sometimes appears to behave opposite to what our intuition would suggest. An example of this is the famous birthday paradox (in which the likelihood that two people in a crowd have the same birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because what matters is the number of pairs of people, not the number of people: 23 people form 253 distinct pairs. As you increase the sample size, it becomes less and less likely that every pair avoids a shared birthday (i.e., less and less likely that random data avoid a repeating pattern).
- Humans are good at seeing patterns and correlations in data, but we are much worse at remembering that correlation does not imply causation.
- The bigger the data set, the more likely you are to see an “unlikely” pattern!
- When asked to tell a “random” statistical distribution generated by a human apart from one generated by an algorithm, we tend to confuse “randomness” with the “appearance of randomness”. The human-generated distribution may appear to be more random, but in fact it is less random, since it has an unrealistically small statistical variance in its behavior.
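Two of the claims above are easy to check numerically. The following is a minimal Python sketch, not part of the original article: the function names are made up for illustration, the birthday calculation assumes 365 equally likely birthdays (no leap years), and the coin-toss estimate assumes a fair coin and uses a simple Monte Carlo simulation rather than an exact formula.

```python
import random


def birthday_collision_probability(n_people, days=365):
    """Probability that at least two of n_people share a birthday.

    Assumes every day is equally likely (an idealization).
    """
    p_all_distinct = 1.0
    for k in range(n_people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def longest_run_of_heads(tosses):
    """Length of the longest streak of consecutive True values (Heads)."""
    longest = current = 0
    for is_heads in tosses:
        if is_heads:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def estimate_run_probability(n_tosses=2000, run_length=12, trials=1000, seed=0):
    """Monte Carlo estimate of the chance that a fair coin tossed n_tosses
    times shows at least run_length consecutive Heads somewhere."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(trials):
        tosses = [rng.random() < 0.5 for _ in range(n_tosses)]
        if longest_run_of_heads(tosses) >= run_length:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"P(shared birthday, 23 people) = {birthday_collision_probability(23):.3f}")
    print(f"P(run of 12+ Heads in 2000 tosses) ~ {estimate_run_probability():.3f}")
```

Under these assumptions the birthday probability crosses 50% between 22 and 23 people, and the simulated chance of seeing 12 or more consecutive Heads somewhere in 2000 tosses comes out at roughly one in five. That is far too common to count as evidence of an unfair coin.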
We consider 3 examples of randomness in order to test our ability to recognize it…
(continue reading here … http://www.analyticbridge.com/profiles/blogs/7-traps-to-avoid-being-fooled-by-statistical-randomness)
Follow Kirk Borne on Twitter @KirkDBorne