Learning from Data Big and Small — What’s the Shape of Your Data?

(A version of this article was originally published on BigDataRepublic.com in July 2013 — that site no longer exists.)

Does discovery depend on the scale of your experiment? In some cases, no! Whether Christopher Columbus sailed with 3 ships or 3000, he still would have found the New World, probably in the same amount of time. In this case, the discovery of the Americas was independent of the scale of the exploration resources. Conversely, there are many more cases where the potential for discovery does scale with the size of your resources. If those resources are Big Data, then prepare to say “hello, world” to many more new worlds (and new discoveries). The good news for small-to-mid scale projects is that, even without Big Data, you can still be a Columbus.

Learning from small data has justifiably acquired a faithful following of advocates. Let us illustrate this with a common example: Time Series Analysis.

In a simple single-parameter data stream, you can extract characterizations from the time series: (a) the change since the last value (y2-y1); (b) a running mean (e.g., the average of the last 3 data points = [y1+y2+y3]/3); (c) the slope of the trend line (= velocity = dy/dt = [y2-y1]/[t2-t1]); (d) the rate of change of the trend line slope (= acceleration = the 2nd derivative of the data, d2y/dt2 = {[y3-y2]/[t3-t2] - [y2-y1]/[t2-t1]} / ([t3-t1]/2), where the divisor is [t3-t1]/2 because the two slope estimates are centered at the midpoints of their intervals, half of [t3-t1] apart); (e) the rate of change of acceleration (= jerk = the 3rd derivative); and so on.
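To make these characterizations concrete, here is a minimal sketch (assuming NumPy, with a made-up toy series) that computes each of them with finite differences:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # hypothetical timestamps
y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])  # hypothetical values

delta = y[-1] - y[-2]                      # (a) change since the last value
running_mean = y[-3:].mean()               # (b) mean of the last 3 data points
velocity = np.gradient(y, t)               # (c) slope dy/dt at each point
acceleration = np.gradient(velocity, t)    # (d) 2nd derivative d2y/dt2
jerk = np.gradient(acceleration, t)        # (e) 3rd derivative

print(delta, running_mean, velocity[-1], acceleration[-1], jerk[-1])
```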

Stock market day traders watch the 2nd derivative more closely than the other time series characterizations, since it can signal an inflection point in the data series. An inflection point (a change in the sign of the 2nd derivative) can thus be used as a predictor of an impending turn-around point (a maximum = time to sell, or a minimum = time to buy) in the time series.
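A short sketch of that signal, building on the derivative code above: flag every index where the finite-difference 2nd derivative changes sign (the toy series here is illustrative, not a trading strategy):

```python
import numpy as np

def inflection_points(y, t):
    """Indices where the 2nd derivative of y(t) changes sign."""
    accel = np.gradient(np.gradient(y, t), t)
    return np.where(np.diff(np.sign(accel)) != 0)[0] + 1

t = np.linspace(0.0, 10.0, 200)
prices = np.sin(t)                    # toy "price" series
print(inflection_points(prices, t))   # sign flips near multiples of pi
```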

These simple statistical metrics are therefore valuable and informative in some circumstances.  Somewhat more interesting characterizations include the shape of the variation: e.g., U, V, or W. These symbolic representations of temporal behaviors can be quite powerful for sequence mining, pattern discovery, transition detection, and trend analysis in time series data, as well as for the all-important dimensionality reduction and indexing of massive complex data streams.
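The article does not name a specific algorithm for this, but one well-known family is SAX-style symbolic aggregate approximation. A simplified sketch, assuming NumPy and a hypothetical 4-letter alphabet: z-normalize a window, average it over a few segments, and map each segment mean to a letter, so a V-shaped dip becomes a short word like "cbbc":

```python
import numpy as np

def symbolize(window, n_segments=4, alphabet="abcd"):
    """Map a window of values to a short symbolic word (SAX-like sketch)."""
    z = (window - window.mean()) / (window.std() + 1e-12)   # z-normalize
    means = np.array([s.mean() for s in np.array_split(z, n_segments)])
    bins = np.linspace(-2.0, 2.0, len(alphabet) + 1)[1:-1]  # letter boundaries
    return "".join(alphabet[i] for i in np.digitize(means, bins))

t = np.linspace(0.0, 1.0, 100)
print(symbolize(np.abs(t - 0.5)))   # V-shaped dip -> "cbbc"
```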

If the time series stream of data is dense (in time), then you can do a spectral (frequency) analysis to measure the strength of patterns in the time series on all scales (high-frequency to low-frequency) — this is called Fourier Analysis. This analysis gives you a large number of characterization metrics (e.g., the frequency components and their amplitudes) for dense time series. You can monitor these metrics and alert the end-user only when the spectrum of frequency components changes significantly, even if the change is in only one component (e.g., its amplitude or phase) or if a new component appears (e.g., an hourly fluctuation in data that previously showed only daily fluctuation).
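As a minimal sketch of that monitoring loop (assuming NumPy, an evenly sampled series, and a hypothetical alert threshold): compute the power spectrum of the current window, compare it bin by bin against a baseline spectrum, and alert on any large change, such as a new hourly component appearing on top of a daily cycle:

```python
import numpy as np

def power_spectrum(y, dt):
    """One-sided power spectrum and its frequency grid (cycles per day)."""
    power = np.abs(np.fft.rfft(y - y.mean())) ** 2
    return np.fft.rfftfreq(len(y), d=dt), power

dt = 1.0 / 96.0                          # samples every 15 minutes, t in days
t = np.arange(0.0, 30.0, dt)             # a 30-day window
baseline = np.sin(2 * np.pi * 1 * t)     # daily cycle only (1 cycle/day)
current = baseline + 0.5 * np.sin(2 * np.pi * 24 * t)  # + new hourly component

freqs, p0 = power_spectrum(baseline, dt)
_, p1 = power_spectrum(current, dt)
alert = np.abs(p1 - p0) > 0.05 * p0.max()        # hypothetical threshold
print("changed frequencies (cycles/day):", freqs[alert])   # -> [24.]
```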

Finally, imagine massive parallel streams of data: Big Time Series Data. Now the fun begins! Such parallel streams may be Twitter timelines for hundreds of millions of users, or streaming data from hundreds (or thousands) of sensors in an airplane or manufacturing plant, or streaming transaction data from millions of retail shoppers or for a large financial firm. Monitoring massively parallel data streams in this way may be a perfect job for a distributed computing environment: Map-Reduce and Hadoop.
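Production systems would use Hadoop's Java APIs or a streaming framework, but the map-reduce pattern itself fits in a few lines. A hypothetical sketch in Python, with made-up sensor records: the mapper keys each reading by its time bucket, and the reducer collapses each bucket's values into a histogram:

```python
from collections import defaultdict
import numpy as np

records = [("sensor1", 0.2, 3.1), ("sensor2", 0.4, 2.9),
           ("sensor1", 1.1, 3.4), ("sensor2", 1.3, 3.6)]  # (id, time, value)

def mapper(record):
    sensor_id, timestamp, value = record
    yield int(timestamp), value                # key = 1-second time bucket

def reducer(bucket, values):
    counts, _ = np.histogram(values, bins=5, range=(0.0, 5.0))
    return bucket, counts

shuffled = defaultdict(list)                   # the "shuffle" phase
for record in records:
    for key, value in mapper(record):
        shuffled[key].append(value)

for bucket, values in sorted(shuffled.items()):
    print(reducer(bucket, values))
```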

At each step (or within each incremental time range) of such massive data streams, you can create a data distribution histogram of the data values Y (or a histogram of trend line slopes dY, or of 2nd derivatives d2Y) across the full ensemble of parallel data streams. You can then estimate a variety of statistical metrics for the separate data distributions (i.e., one set of metrics each for Y, dY, d2Y, and others) as a function of time: mean, median, mode, variance, skew, kurtosis, presence of a long tail, mixture models, and more.  (Of course, if the data are textual, as in Twitter comments, then some form of numerical coding of the text will yield a goldmine of value – that’s a story for another article.)
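A minimal sketch of one such time-step characterization (assuming NumPy and SciPy, with a synthetic long-tailed ensemble standing in for the real streams):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
snapshot = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)  # toy ensemble of Y

metrics = {
    "mean": snapshot.mean(),
    "median": np.median(snapshot),
    "variance": snapshot.var(),
    "skew": stats.skew(snapshot),          # > 0 flags the long right tail
    "kurtosis": stats.kurtosis(snapshot),
}
print(metrics)   # repeat per time step, and likewise for dY and d2Y snapshots
```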

Exploiting a variety of statistical metrics (data stream characterizations) such as these is where the exploration and discovery potential expands significantly. Similar to the small-data cases described earlier, the values of these characteristic statistical metrics on massive data streams become a model for the state of the system that you are monitoring. The model itself can be monitored and flagged for significant changes in these characteristic statistical features or for the appearance of new features in the data streams. As long as the massive parallel data streams continue to behave in predictable, consistent patterns (a condition called “stationarity”), there is no need to alert the end-user. However, when the stationarity of the data stream model breaks (perhaps triggered by a change in any one of the state parameters that exceeds a pre-specified threshold), a signal is raised and the end-user can verify whether a truly new behavior or event has been discovered. Land ahoy! All hands on deck!
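A minimal sketch of that alerting rule (the metric vector, threshold, and data here are all hypothetical): keep a baseline vector of the characteristic metrics and raise a flag when any one of them drifts past a pre-specified threshold:

```python
import numpy as np

def state_vector(snapshot):
    """Characteristic metrics summarizing one ensemble snapshot."""
    return np.array([snapshot.mean(), snapshot.std(), np.median(snapshot)])

def drifted(baseline, current, threshold=0.1):
    """True if any metric moved by more than `threshold` (absolute units)."""
    return bool(np.any(np.abs(current - baseline) > threshold))

rng = np.random.default_rng(0)
baseline = state_vector(rng.normal(0.0, 1.0, 10_000))   # stationary period
current = state_vector(rng.normal(0.5, 1.0, 10_000))    # mean has shifted

if drifted(baseline, current):
    print("Land ahoy! Possible new behavior -- alert the end-user.")
```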

The point of these examples is to demonstrate that discovery and learning from small data is still useful and valuable. As the data set becomes increasingly larger, it is then possible (and likely) that more intricate, subtle, and descriptive features within the data will be revealed. The discovery potential of bigger data thereby increases (perhaps exponentially). Additionally, the nature and diversity of the discoveries become richer, and maybe so will you!

Follow Kirk Borne on Twitter @KirkDBorne
