Tag Archives: Machine Learning

Learning from Data Big and Small — What’s the Shape of Your Data?

(A version of this article was originally published on BigDataRepublic.com in July 2013 — that site no longer exists.)

Does discovery depend on the scale of your experiment? In some cases, no! Whether Christopher Columbus sailed with 3 ships or 3000, he still would have found the New World, probably in the same amount of time. In this case, the existence of the Americas is independent of the scale of the exploration resources. Conversely, there are many more cases where the potential for discovery does scale with the size of your resources. If those resources are Big Data, then prepare to say “hello, world” to many more new worlds (and new discoveries). The good news for small-to-mid scale projects is that, even without Big Data, you can still be a Columbus.

Learning from small data has justifiably acquired a faithful following of advocates (see this and this).  Let us illustrate this with a common example: Time Series Analysis.

In a simple single-parameter data stream, you can extract characterizations from the time series: (a) the change since the last value (y2-y1); (b) a running mean (e.g., the average of the last 3 data points = [y1+y2+y3]/2); (c) the slope of the trend line (= velocity = dy/dt = [y2-y1]/[t2-t1]); (d) the rate of change of the trend line slope (= acceleration = the 2nd derivative of the data d2y/dt = {[y3-y2]/[t3-t2] – [y2-y1]/[t2-t1]} / [t3-t1] ); (e) the rate of change of acceleration (= jerk); and so on.

Stock market day traders watch 2nd derivatives more closely than the other time series characterizations, since that parameter can signal an inflection point in the data series. Inflection points (a change in the sign of the 2nd derivative) can thus be used as a predictor of an impending turn-around point (maximum = time to sell; or minimum = time to buy) in the time series.

These simple statistical metrics are therefore valuable and informative in some circumstances.  Somewhat more interesting characterizations include the shape of the variation: e.g., U, V, or W. These symbolic representations of temporal behaviors can be quite powerful for sequence mining, pattern discovery, transition detection, and trend analysis in time series data, as well as for the all-important dimensionality reduction and indexing of massive complex data streams.

If the time series stream of data is dense (in time), then you can do a spectral (frequency) analysis to measure the strength of patterns in the time series on all scales (high-frequency to low-frequency) — this is called Fourier Analysis. This analysis gives you a large number of characterization metrics (e.g., the frequency components and their amplitudes) for dense time series.  You can monitor these metrics and alert the end-user only when the power spectrum of the different frequency components changes significantly, even if the change is in only one component (e.g., its phase or amplitude) or if a new component appears (e.g., an hourly fluctuation in data that previously only showed daily fluctuation).

Finally, imagine massive parallel streams of data: Big Time Series Data. Now the fun begins! Such parallel streams may be Twitter timelines for hundreds of millions of users, or streaming data from hundreds (or thousands) of sensors in an airplane or manufacturing plant, or streaming transaction data from millions of retail shoppers or for a large financial firm. Monitoring massively parallel data streams in this way may be a perfect job for a distributed computing environment: Map-Reduce and Hadoop.

At each step (or within each incremental time range) of such massive data streams, you can create a data distribution histogram of the data values Y (or a histogram of trend line slopes dY, or of 2nd derivatives d2Y) across the full ensemble of parallel data streams. You can then estimate a variety of statistical metrics for the separate data distributions (i.e., one set of metrics each for Y, dY, d2Y, and others) as a function of time: mean, median, mode, variance, skew, kurtosis, presence of a long tail, mixture models, and more.  (Of course, if the data are textual, as in Twitter comments, then some form of numerical coding of the text will yield a goldmine of value – that’s a story for another article.)

Exploiting a variety of statistical metrics (data stream characterizations) such as these is where the exploration and discovery potential expands significantly. Similar to the small-data cases described earlier, the values of these characteristic statistical metrics on massive data streams become a model for the state of the system that you are monitoring. The model itself can be monitored and flagged for significant changes in these characteristic statistical features or for the appearance of new features in the data streams. As long as the massive parallel data streams continue to behave in predictable consistent patterns (which is called a “stationary state”), then there is no need to alert the end-user. However, when the stationarity of the data stream model changes (perhaps triggered by a change in any one of the state parameters that exceeds a pre-specified threshold), then a signal is raised and the end-user verifies whether a truly new behavior or event has been discovered. Land ahoy! All hands on deck!

The point of these examples is to demonstrate that discovery and learning from small data is still useful and valuable. As the data set becomes increasingly larger, it is then possible (and likely) that more intricate, subtle, and descriptive features within the data will be revealed. The discovery potential of bigger data thereby increases (perhaps exponentially). Additionally, the nature and diversity of the discoveries become richer, and maybe so will you!

Follow Kirk Borne on Twitter @KirkDBorne

Markov Models and Predictive Analytics with Cats

(A version of this article was originally published on BigDataRepublic.com in September 2013 — that site no longer exists.)

I have been teaching courses on data mining for over 10 years. One of my favorite lectures focuses on the use of Markov Models for predictive analytics. I enjoy giving this lecture because it always triggers interesting reactions from my students.  Since the lecture can be used to demonstrate advanced concepts (like Bayesian inference and probabilistic reasoning) as well as basic concepts (like conditional probability and statistical dependence), I use the lecture both in my graduate course and in my freshman class.  I start the lecture by telling the students that I will show them how to predict the future with a cat.

I begin my lecture with this question: how do you pronounce the word “cat”?  Before you stop reading this and ask “what does this have to do with data mining?” I will have to admit that my students also have a similar response, but they are my captive audience — they can’t go away — at least, they haven’t yet walked out on any of my lectures. 🙂  So, let us examine the cat question first, and then I will address the latent question “what does this have to do with predictive analytics or Big Data?”

Following the moments of confusion induced by my first question, I then bring the discussion back around to its data mining application with a second question: what are the phonemes (the perceptually distinct units of sound) that distinguish the word “cat” when you hear it spoken? We might think that there are 3 phonemes in “cat”: the “K”, “A”, and “T” sounds (phonemes P1, P2, and P3, respectively).  But, in fact, there are 4 phonemes — the 4th one (P4) being the momentarily brief “sound of silence” after the “T” sound.  That silent phoneme signals the end of the word, which is what clearly distinguishes “cat” from other words that begin with {P1,P2,P3} (e.g., catfish, catapult, Catalonia, catatonic).  This discussion reminds me of a riddle that might help to clarify my point: “Why can’t you die of hunger in the desert?” Answer: “Because of all the sand which is there” (which sounds a lot like “because of all the sandwiches there”, except for the very brief silence after the words “sand” and “which”).

Given a corpus of speeches (or speech fragments) for a specific person, you can build a comprehensive speech model that represents the words (specifically, the sequences of phonemes) that this person commonly uses.  The full distribution of conditional probabilities P[Pk|Pj] can be constructed, which then becomes the “model” for that person’s speech habits. [Note: the conditional probability expression P[Pk|Pj] refers to the transition probability that the phoneme Pk will occur immediately after the phoneme Pj has occurred.] The comprehensive speech model (i.e., the complete distribution of P[Pk,Pj] transition probabilities for a specific person) captures dialect, vernacular, peculiar pronunciations, utterances, and other recurring features in their speech (“ummm”, “you know”, “I mean”, etc.). These models are used in voice recognition software (and in our brains) both in verbal comprehension and in speaker recognition (i.e., unique identification = identifying a specific speaker, even if we can’t see them).  In this way, when you have a new voice sample, you can determine if it is consistent with the pattern of speech (the model) for a particular person.  Something similar to this was used to verify authenticity every time a new recording was released purporting to include the voice of Osama Bin Laden.

The important concepts in the “cat” example are the Markov Chain (which refers to the conditional sequence of data values) and the Markov Model (which is the model that is represented by the full set of conditional probabilities that characterize the Markov Chain).  A first-order Markov model is one in which the value of the next data point in the sequence is assumed to be statistically dependent only on the current data point.  In a second-order Markov model, the next data point is assumed to be dependent on the preceding two data points.  What is interesting and distinctive about Markov models is that most other statistical models rely on statistical independence of observed data points, whereas a Markov model (derived from and applied to Markov chains) absolutely and deliberately relies on the statistical dependence of the sequence of data points!

Therefore, we can apply Markov models in two complementary ways.  First, we can test whether an observed sequence of data values (e.g., the measured set of phoneme transitions within a speech) is consistent with a particular model (e.g., the set of transition probabilities for a known speaker).  Second, we can use the model as a predictor for what data value is likely to occur next in the sequence (i.e., use the model for predictive analytics).

In the era of Big Data, we can collect massive sequences of dependent data values (e.g., time series sequences of anything) for a large population of entities (e.g., customer purchase histories, web click logs, social events, human behaviors, speech patterns, weather reports, market quotes, device monitors, biosensors, video surveillance cameras, basketball play-by-play histories, etc.).  For each entity in the population, we can build a comprehensive Markov model.  If we do this for a full population of whatever it is that we are monitoring, we finally begin to fulfill one of the primary promises of Big Data: whole-population analytics.

From the historical training data that we have collected from all of our sources, we can construct and then use Markov Models to predict the future, including: tomorrow’s weather, or what products your customers are likely to buy, or the progression of an epidemic, or whether a cyber-attack is imminent, or whether LeBron James will pass the ball off in a 2-on-1 fast break.

Here is a simple predictive analytics example that uses a Markov model (i.e., the complete set of Markov chain transition probabilities) to predict the future.  Consider the following sequence of weather reports (a Markov chain) representing a series of 50 consecutive days (where S=sunny, R=rainy, and P=partly cloudy):

SSPPS PRRPP SSSPR RRRPS SSSPP PSSSS SPSSP PSPSS PRRPS SPRRR

This sequence has 3 possible states (S, R, and P).  We assume that tomorrow’s weather only depends on today’s weather — therefore, this represents a first-order Markov chain. We can then ask several questions. For example: (a) what is the most probable next state to follow after the end of this sequence? (b) What is the least likely next state to follow after the end of this sequence?  In order to answer such questions, we first calculate the full set of transition probabilities (the Markov model) from the above training data:

P(S|S) = 13/22

P(S|P) = 8/17

P(S|R) = 0

P(P|S) = 9/22

P(P|P) = 5/17

P(P|R) = 3/10

P(R|S) = 0

P(R|P) = 4/17

P(R|R) = 7/10

Therefore, we find: (a) the most probable next state after the end of this sequence is R (rainy day) since P(?|R) has the largest likelihood when ? is R; and (b) the least probable next state is S (sunny day) since P(?|R) has the smallest likelihood when ? is S.  Therefore, if the above sequence represents your weather for the past 50 days, then our first-order Markov model predicts that it will rain tomorrow, with 70% confidence.

In conclusion, we find that Markov modeling is powerful predictive analytics methodology, especially for Big Data CATS (Comprehensive Analysis of Time Series).

Follow Kirk Borne on Twitter @KirkDBorne

Where to get your Data Science Training or Apprenticeship

I am frequently asked for suggestions regarding academic institutions, professional organizations, or MOOCs that provide Data Science training.  The following list will be updated occasionally (LAST UPDATED: 2018 March 29) .

Also, be sure to check out The Definitive Q&A for Aspiring Data Scientists and the story of my journey from Astrophysics to Data Science. If the latter story interests you, then here are a couple of related interviews: “Data Mining at NASA to Teaching Data Science at GMU“, and “Interview with Leading Data Science Expert“.

Here are a few places to check out:

  1. The Booz Allen Field Guide to Data Science
  2. Do you have what it takes to be a Data Scientist? (get the Booz Allen Data Science Capability Handbook)
  3. http://www.thisismetis.com/explore-data-science-online-training (formerly exploredatascience.com at Booz-Allen)
  4. http://www.thisismetis.com/
  5. https://www.teamleada.com/
  6. MapR Academy (offering Free Hadoop, Spark, HBase, Drill, Hive training and certifications at MapR)
  7. Data Science Apprenticeship at DataScienceCentral.com
  8. (500+) Colleges and Universities with Data Science Degrees
  9. List of Machine Learning Certifications and Best Data Science Bootcamps
  10. NYC Data Science Academy
  11. NCSU Institute for Advanced Analytics
  12. Master of Science in Analytics at Bellarmine University
  13. http://www.districtdatalabs.com/ (District Data Labs)
  14. http://www.dataschool.io/
  15. http://www.persontyle.com/school/ 
  16. http://www.galvanize.it/education/#classes (formerly Zipfian Academy) includes http://www.galvanizeu.com/ (Data Science, Statistics, Machine Learning, Python)
  17. https://www.coursera.org/specialization/jhudatascience/1
  18. https://www.udacity.com/courses#!/data-science 
  19. https://www.udemy.com/courses/Business/Data-and-Analytics/
  20. http://insightdatascience.com/ 
  21. Data Science Master Classes (at Datafloq)
  22. http://datasciencemasters.org
  23. http://www.jigsawacademy.com/
  24. https://intellipaat.com/
  25. http://www.athenatechacademy.com/ (Hadoop training, and more)
  26. O’Reilly Media Learning Paths
  27. http://www.godatadriven.com/training.html
  28. Courses for Data Pros at Microsoft Virtual Academy
  29. 18 Resources to Learn Data Science Online (by Simplilearn)
  30. Learn Everything About Analytics (by AnalyticsVidhya)
  31. Data Science Masters Degree Programs

DataMiningTagCloudTagxedoWordCloud

Follow Kirk Borne on Twitter @KirkDBorne

When Big Data Gets Local, Small Data Gets Big

We often hear that small data deserves at least as much attention in our analyses as big data. While there may be as many interpretations of that statement as there are definitions of big data, there are at least two situations where “small data” applications are worth considering. I will label these “Type A” and “Type B” situations.

In “Type A” situations, small data refers to having a razor-sharp focus on your business objectives, not on the volume of your data. If you can achieve those business objectives (and “answer the mail”) with small subsets of your data mountain, then do it, at once, without delay!

In “Type B” situations, I believe that “small” can be interpreted to mean that we are relaxing at least one of the 3 V’s of big data: Velocity, Variety, or Volume:

  1. If we focus on a localized time window within high-velocity streaming data (in order to mine frequent patterns, find anomalies, trigger alerts, or perform temporal behavioral analytics), then that is deriving value from “small data.”
  2. If we limit our analysis to a localized set of features (parameters) in our complex high-variety data collection (in order to find dominant segments of the population, or classes/subclasses of behavior, or the most significant explanatory variables, or the most highly informative variables), then that is deriving value from “small data.”
  3. If we target our analysis on a tight localized subsample of entries in our high-volume data collection (in order to deliver one-to-one customer engagement, personalization, individual customer modeling, and high-precision target marketing, all of which still require use of the full complexity, variety, and high-dimensionality of the data), then that is deriving value from “small data.”

(continue reading here: https://www.mapr.com/blog/when-big-data-goes-local-small-data-gets-big-part-1)

Follow Kirk Borne on Twitter @KirkDBorne

Local Linear Embedding(Image source**: http://mdp-toolkit.sourceforge.net/examples/lle/lle.html)

**Zito, T., Wilbert, N., Wiskott, L., Berkes, P. (2009). Modular toolkit for Data Processing (MDP): a Python data processing frame work, Front. Neuroinform. (2008) 2:8. doi:10.3389/neuro.11.008.2008

New Directions for Big Data and Analytics in 2015

The world of big data and analytics is remarkably vibrant and marked by incredible innovation, and there are advancements on every front that will continue into 2015. These include increased data science education opportunities and training programs, in-memory analytics, cloud-based everything-as-a-service, innovations in mobile (business intelligence and visual analytics), broader applications of social media (for data generation, consumption and exploration), graph (linked data) analytics, embedded machine learning and analytics in devices and processes, digital marketing automation (in retail, financial services and more), automated discovery in sensor-fed data streams (including the internet of everything), gamification, crowdsourcing, personalized everything (medicine, education, customer experience and more) and smart everything (highways, cities, power grid, farms, supply chain, manufacturing and more).

Within this world of wonder, where will we wander with big data and analytics in 2015? I predict two directions for the coming year…

(continue reading herehttp://www.ibmbigdatahub.com/blog/new-directions-big-data-and-analytics-2015)

Follow Kirk Borne on Twitter @KirkDBorne

Feature Mining in Big Data

We love features in our data, lots of features, in the same way that we love features in our toys, mobile phones, cars, and other gadgets.  Good features in our big data collection empower us to build accurate predictive models, identify the most informative trends in our data, discover insightful patterns, and select the most descriptive parameters for data visualizations. Therefore, it is no surprise that feature mining is one aspect of data science that appeals to all data scientists. Feature mining includes: (1) feature generation (from combinations of existing attributes), (2) feature selection (for mining and knowledge discovery), and (3) feature extraction (for operational systems, decision support, and reuse in various analytics processes, dashboards, and pipelines).

Learn more about feature mining and feature selection for Big Data Analytics in these publications:

  1. Feature-Rich Toys and Data
  2. Interactive Visualization-enabled Feature Selection and Model Creation
  3. Feature Selection (available on the National Science Bowl blog site)
  4. Feature Selection Methods used with different Data Mining algorithms
  5. (and for heavy data science pundits) Computational Methods of Feature Selection

Follow Kirk Borne on Twitter @KirkDBorne

Machine Unlearning and The Value of Imperfect Models

Common wisdom states that “perfect is the enemy of good enough.” We can apply this wisdom to the machine learning models that we train and deploy for big data analytics. If we strive for perfection, then we may encounter several potential risks. It may be useful therefore to pay attention to a little bit of “machine unlearning.” For example:

Overfitting

By attempting to build a model that correctly follows every little nuance, deviation, and variation in our data set, we are consequently almost certainly fitting the natural variance in the data, which will never go away.  After building such a model, we may find that it has nearly 100% accuracy on the training data, but significantly lower accuracy on the test data set.  These test results are guaranteed proof that we have overfit our model. Of course, we don’t want a trivial model (an underfit model) either – to paraphrase Albert Einstein: “models should be as simple as possible, but no simpler.”

(continue reading herehttps://www.mapr.com/blog/machine-unlearning-value-imperfect-models)

Follow Kirk Borne on Twitter @KirkDBorne