Tag Archives: Statistics

The Goldilocks Principle in Predictive Modeling and Data Science

In the field of statistics, there has been a lot written about statistical fallacies, logical fallacies, and fallacious reasoning. The following big list of fallacies is one that I like to use in my own undergraduate data science courses, particularly in my Data Ethics class where I teach my students about “lying with statistics”:


Many of these fallacies are relevant to data science modeling, including this one: Circular Reasoning, where the reasoner “begins with what he or she is trying to end up with; sometimes called assuming the conclusion.”

A broken clock is truly an example of circular reasoning (as the dial is circular, and the clock represents a particular measurement in a repeating circular perspective): “Even a broken clock is right twice a day.”

(source: http://tvtropes.org/pmwiki/pmwiki.php/Main/StoppedClock)

In the following article, I use the broken clock analogy for circular reasoning in describing the importance of verification and validation in predictive analytics models: Are your predictive models like broken clocks? Here’s how to fix them.”  The article also discusses the importance of training vs. test data sets, the bias-variance tradeoff in data science modeling, underfitting vs. overfitting, and the Goldilocks Principle applied to data science.

(continue reading here) 

Follow Kirk Borne on Twitter @KirkDBorne

Learning from Data Big and Small — What’s the Shape of Your Data?

(A version of this article was originally published on BigDataRepublic.com in July 2013 — that site no longer exists.)

Does discovery depend on the scale of your experiment? In some cases, no! Whether Christopher Columbus sailed with 3 ships or 3000, he still would have found the New World, probably in the same amount of time. In this case, the existence of the Americas is independent of the scale of the exploration resources. Conversely, there are many more cases where the potential for discovery does scale with the size of your resources. If those resources are Big Data, then prepare to say “hello, world” to many more new worlds (and new discoveries). The good news for small-to-mid scale projects is that, even without Big Data, you can still be a Columbus.

Learning from small data has justifiably acquired a faithful following of advocates (see this and this).  Let us illustrate this with a common example: Time Series Analysis.

In a simple single-parameter data stream, you can extract characterizations from the time series: (a) the change since the last value (y2-y1); (b) a running mean (e.g., the average of the last 3 data points = [y1+y2+y3]/2); (c) the slope of the trend line (= velocity = dy/dt = [y2-y1]/[t2-t1]); (d) the rate of change of the trend line slope (= acceleration = the 2nd derivative of the data d2y/dt = {[y3-y2]/[t3-t2] – [y2-y1]/[t2-t1]} / [t3-t1] ); (e) the rate of change of acceleration (= jerk); and so on.

Stock market day traders watch 2nd derivatives more closely than the other time series characterizations, since that parameter can signal an inflection point in the data series. Inflection points (a change in the sign of the 2nd derivative) can thus be used as a predictor of an impending turn-around point (maximum = time to sell; or minimum = time to buy) in the time series.

These simple statistical metrics are therefore valuable and informative in some circumstances.  Somewhat more interesting characterizations include the shape of the variation: e.g., U, V, or W. These symbolic representations of temporal behaviors can be quite powerful for sequence mining, pattern discovery, transition detection, and trend analysis in time series data, as well as for the all-important dimensionality reduction and indexing of massive complex data streams.

If the time series stream of data is dense (in time), then you can do a spectral (frequency) analysis to measure the strength of patterns in the time series on all scales (high-frequency to low-frequency) — this is called Fourier Analysis. This analysis gives you a large number of characterization metrics (e.g., the frequency components and their amplitudes) for dense time series.  You can monitor these metrics and alert the end-user only when the power spectrum of the different frequency components changes significantly, even if the change is in only one component (e.g., its phase or amplitude) or if a new component appears (e.g., an hourly fluctuation in data that previously only showed daily fluctuation).

Finally, imagine massive parallel streams of data: Big Time Series Data. Now the fun begins! Such parallel streams may be Twitter timelines for hundreds of millions of users, or streaming data from hundreds (or thousands) of sensors in an airplane or manufacturing plant, or streaming transaction data from millions of retail shoppers or for a large financial firm. Monitoring massively parallel data streams in this way may be a perfect job for a distributed computing environment: Map-Reduce and Hadoop.

At each step (or within each incremental time range) of such massive data streams, you can create a data distribution histogram of the data values Y (or a histogram of trend line slopes dY, or of 2nd derivatives d2Y) across the full ensemble of parallel data streams. You can then estimate a variety of statistical metrics for the separate data distributions (i.e., one set of metrics each for Y, dY, d2Y, and others) as a function of time: mean, median, mode, variance, skew, kurtosis, presence of a long tail, mixture models, and more.  (Of course, if the data are textual, as in Twitter comments, then some form of numerical coding of the text will yield a goldmine of value – that’s a story for another article.)

Exploiting a variety of statistical metrics (data stream characterizations) such as these is where the exploration and discovery potential expands significantly. Similar to the small-data cases described earlier, the values of these characteristic statistical metrics on massive data streams become a model for the state of the system that you are monitoring. The model itself can be monitored and flagged for significant changes in these characteristic statistical features or for the appearance of new features in the data streams. As long as the massive parallel data streams continue to behave in predictable consistent patterns (which is called a “stationary state”), then there is no need to alert the end-user. However, when the stationarity of the data stream model changes (perhaps triggered by a change in any one of the state parameters that exceeds a pre-specified threshold), then a signal is raised and the end-user verifies whether a truly new behavior or event has been discovered. Land ahoy! All hands on deck!

The point of these examples is to demonstrate that discovery and learning from small data is still useful and valuable. As the data set becomes increasingly larger, it is then possible (and likely) that more intricate, subtle, and descriptive features within the data will be revealed. The discovery potential of bigger data thereby increases (perhaps exponentially). Additionally, the nature and diversity of the discoveries become richer, and maybe so will you!

Follow Kirk Borne on Twitter @KirkDBorne

Variety is the Spice of Life for Data Scientists

“Variety is the spice of life,” they say.  And variety is the spice of data also: adding rich texture and flavor to otherwise dull numbers. Variety ranks among the most exciting, interesting, and challenging aspects of big data.  Variety is one of the original “3 V’s of Big Data” and is frequently mentioned in Big Data discussions, which focus too much attention on Volume.

A short conversation with many “old school technologists” these days too often involves them making the declaration: We’ve always done big data.”  That statement really irks me… for lots of reasons.  I summarize in the following article some of those reasons:  “Today’s Big Data is Not Yesterday’s Big Data.” In a nutshell, those statements focus almost entirely on Volume, which is really missing the whole point of big data (in my humble opinion)… here comes the Internet of Things… hold onto your bits!

The greatest challenges and the most interesting aspects of big data appear in high-Velocity Big Data (requiring fast real-time analytics) and high-Variety Big Data (enabling the discovery of interesting patterns, trends, correlations, and features in high-dimensional spaces). Maybe because of my training as an astrophysicist, or maybe because scientific curiosity is a natural human characteristic, I love exploring features in multi-dimensional parameter spaces for interesting discoveries, and so should you!

Dimension reduction is a critical component of any solution dealing with high-variety (high-dimensional) data. Being able to sift through a mountain of data efficiently in order to find the key predictive, descriptive, and indicative features of the collection is a fundamental required data science capability for coping with Big Data.

Identifying the most interesting dimensions of the data is especially valuable when visualizing high-dimensional data. There is a “good news, bad news” perspective here. First, the bad news: the human capacity for seeing multiple dimensions is very limited: 3 or 4 dimensions are manageable; and 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms for speed and effectiveness. In this case, a “scene” refers to any small-n projection of a larger-N parameter space of variables.

In data visualization, a systematic ordered parameter sweep through an ensemble of small-n projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to see quickly any patterns or trends or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/n is large. Consequently, a data analytics approach that combines the best of both worlds (machine vision algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data.

One such approach is to use statistical and machine learning techniques to develop “interestingness metrics” for high-variety data sets.  As such algorithms are applied to the data (in parameter sweeps or grand tours), they can discover and then present to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data: “Numbers are powerful, especially in interesting combinations.”

The outcomes of such exploratory data analyses are even more enhanced when the analytics tool ranks the output models (e.g., the data’s “most interesting parameters”) in order of significance and explanatory power (i.e., their ability to “explain” the complex high-dimensional patterns in the data).  Soft10’s “automatic statistician” Dr. Mo is a fast predictive analytics software package for exploring complex high-dimensional (high-variety) data.  Dr. Mo’s proprietary modeling and analytics techniques have been applied across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more. Check out Dr. Mo at http://soft10ware.com/ and read more herehttp://soft10ware.com/big-data-complexity-requires-fast-modeling-technology/

Kirk Borne is a member of the Soft10, Inc. Board of Advisors.

Follow Kirk Borne on Twitter @KirkDBorne

6 Ways To Be Fooled by Randomness

Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Consequently, in data science, randomness in your data can negate the value of a predictive analytics model.

It is easy to be fooled by randomness. We often see randomness when there is none, and vice versa. Here are 6 ways in which we can be fooled by randomness:

  1. We often tend to pick out and focus on the “most interesting” results in our data, and ignore the uninteresting cases.  For example, if you toss a coin 2000 times, and you see a subsequence of 12 consecutive Heads in the sequence, then your attention is directed to this interesting subsequence (and you might conclude that there is something unfair about the coin or the coin tossing) even though it is statistically reasonable for such a subsequence to appear. This is selection bias, and it is also an example of “a posteriori” statistics (derived from observed facts, not from logical principles).
  2. We may unintentionally overlook the randomness in the data, especially in our rush to build predictive analytics models.
  3. Randomness sometimes appears to behave opposite to what our intuition would suggest. An example of this is the famous birthday paradox (in which the likelihood that two people in a crowd have the same birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because, as you increase the sample size, it becomes less and less likely to avoid the same birthday (i.e., less and less likely to avoid a repeating pattern in random data).
  4. Humans are good at seeing patterns and correlations in data, but humans are less good at remembering that correlation does not imply causation.
  5. The bigger the data set, the more likely you will see an “unlikely” pattern!
  6. When asked to pick the “random” statistical distribution that is generated by a human (versus a distribution generated by an algorithm), we tend to confuse “randomness” with the “appearance of randomness”. A distribution may appear to be more random, but in fact it is less random, since it has a statistically unrealistic small variance in behavior.

We consider 3 examples of randomness in order to test our ability to recognize it…

(continue reading herehttp://www.analyticbridge.com/profiles/blogs/7-traps-to-avoid-being-fooled-by-statistical-randomness)

Follow Kirk Borne on Twitter @KirkDBorne

Kurtosis: Four Momentous Uses of A Statistical Orphan in the Era of Big Data

We frequently see much use of and commentary on the mean, medians, and modes of statistical distributions, as well as lengthy discussions of variance and skew (including the now famous “long tail“). But, what about fat tails? Is that a taboo subject? Maybe it is! For example, in the widely respected book Numerical Recipes: The Art of Scientific Computing, the authors had the audacity to say “the skewness (or third moment) and the kurtosis (or fourth moment) should be used with caution or, better yet, not at all.” Those warnings notwithstanding, kurtosis is making a comeback. Not that it ever went away, but a recent search on Google Scholar found over 3000 articles mentioning kurtosis in the context of statistics within the first three months of 2014, and over 12,000 articles in 2013, though only about 4000 such articles were cited in the preceding three years combined. Many of those contributions focus on real-world uses of that particular characteristic of data distributions.

So, what is kurtosis and what applications can we find for it in the Big Data world of Data Science?

(continue reading herehttp://www.statisticsviews.com/details/feature/6047711/Kurtosis-Four-Momentous-Uses-for-the-Fourth-Moment-of-Statistical-Distributions.html)

Follow Kirk Borne on Twitter @KirkDBorne

Outlier Detection Gets a New Look – Surprise Discovery in Big Data

Novelty and surprise are two of the more exciting aspects of science – finding something totally new and unexpected can lead to a quick research paper, or it can make your career. As scientists, we all yearn to make a significant discovery. Petascale big data collections potentially offer a multitude of such opportunities. But how do we find that unexpected thing? These discoveries come under various names: interestingness, outlier, novelty, anomaly, surprise, or defect (depending on the application). Outlier? Anomaly? Defect? How did they get onto this list? Well, those features are often the unexpected, interesting, novel, and surprising aspects (patterns, points, trends, and/or associations) in the data collection. Outliers, anomalies, and defects might be insignificant statistical deviants, or else they could represent significant scientific discoveries.

(continue reading herehttp://stats.cwslive.wiley.com/details/feature/6597751/Outlier-Detection-Gets-a-Makeover—Surprise-Discovery-in-Scientific-Big-Data.html)

Follow Kirk Borne on Twitter @KirkDBorne

Numbers are Powerful, Especially in Combination

The phrase “Big Data” refers to a set of serious analytical challenges that arise when the data increase in quantity, real-time speed, and complexity.  The three V’s of big data (Volume, Velocity, and Variety) are now well known and well worn. Their familiarity and frequent association with “big data hype” may numb us to the important data challenges that they are meant to represent. These three characterizations have their counterparts in tools and technologies.  For example, Hadoop (Apache’s open source implementation of the MapReduce programming model) is the technology du jour for management and analysis of high-volume data.  The Hadoop Distributed File System (HDFS) is the file system for big data storage and access in Hadoop clusters.  Apache Spark is a computing framework (built on HDFS) for fast processing of high-velocity data.

But, what about high-variety data?  The storage and management challenges of such data are already addressed (see above), but the real challenge is in performing effective and efficient statistical modeling, data mining, and discovery across high-dimensional (complex) data sets.  Software tools like Soft10 Inc.‘s “automatic statistician” Dr. Mo are designed to address that specific challenge.

When considering complex (high-variety) data, it is important to note that even relatively small-volume data sets can pose huge challenges to modeling, mining, and analysis algorithms and tools. For example, consider a gigabyte data table with a billion entries. If those entries correspond to 500 million rows and 2 columns, then some relatively simple “textbook” techniques can be applied: e.g., correlation analysis, regression analysis, Naïve Bayes, etc. However, if those entries correspond to one million rows and 1000 columns, then the complexity of the data analysis explodes exponentially.

It is not hard to find data sets that are at least this complex, if not much worse.  For example, the human genome consists of 3 billion base pairs (of just four bases: A, C, G, T) – the number of possible sequences of length 3 billion that can be formed from just four items is 4 to the power of 3 billion (limited of course by various genetic constraints). Another example will be the astronomical database to be obtained in the 10-year survey of the sky by the Large Synoptic Survey Telescope (lsst.org) – the final source table will consist of approximately 20 trillion rows and over 200 columns of scientific information per source.  Analyses of all possible combinations of these scientific parameters (to discover new correlations, patterns, associations, clusters, etc.) would be prohibitive.

The combinatorial theorem in mathematics tells us that there are (2^N – 1) possible combinations of N things. For example, a statistical analysis of a data table with just 3 columns (A,B,C) would require 7 distinct analyses (statistical models) of the behavior of the data: A, B, C, A with B, B with C, A with C, and with all three taken at once.  A data table with 5 columns would require 31 distinct analyses; and a table with 25 columns would require over 33 million distinct analyses. My calculator tells me that the number of distinct combinations of 200 variables is greater than 10^60.  This extraordinarily rapid growth rate is called the “combinatorial explosion”.  While no software package could ever perform that many variations of high-dimensional data analysis, it is common to focus on joint combinations of fewer parameters.  Even pairs, triples, and similar small-number combinations can have significant correlation and covariance, consequently yielding important discoveries.

Therefore, in order to meet the challenge of big data complexity (high variety), fast modeling technology is needed.  Such tools provide big benefits to both statisticians and non-statisticians.  These benefits multiply favorably when the technology can automatically build and test a large number of models (different combinations of parameters) in parallel.  Furthermore, the power of the technology is even more enhanced when it ranks the output models and parameter selection in order of significance and correlation strength.  Soft10’s “automatic statistician” Dr. Mo does these things and more. Dr. Mo models complex high-dimensional data both row-wise and column-wise. Dr. Mo produces high-accuracy predictions.  Dr. Mo’s proprietary multi-model technology is a powerful tool for predictive modeling and analytics across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more. Check out Dr. Mo at http://soft10ware.com/ and read more herehttp://soft10ware.com/big-data-complexity-requires-fast-modeling-technology/

Kirk Borne is a member of the Soft10, Inc. Board of Advisors.

Follow Kirk Borne on Twitter @KirkDBorne