Tag Archives: Machine Learning

Markov Models and Predictive Analytics with Cats

(A version of this article was originally published on BigDataRepublic.com in September 2013 — that site no longer exists.)

I have been teaching courses on data mining for over 10 years. One of my favorite lectures focuses on the use of Markov Models for predictive analytics. I enjoy giving this lecture because it always triggers interesting reactions from my students.  Since the lecture can be used to demonstrate advanced concepts (like Bayesian inference and probabilistic reasoning) as well as basic concepts (like conditional probability and statistical dependence), I use the lecture both in my graduate course and in my freshman class.  I start the lecture by telling the students that I will show them how to predict the future with a cat.

I begin my lecture with this question: how do you pronounce the word “cat”?  Before you stop reading this and ask “what does this have to do with data mining?” I will have to admit that my students also have a similar response, but they are my captive audience — they can’t go away — at least, they haven’t yet walked out on any of my lectures. 🙂  So, let us examine the cat question first, and then I will address the latent question “what does this have to do with predictive analytics or Big Data?”

Following the moments of confusion induced by my first question, I then bring the discussion back around to its data mining application with a second question: what are the phonemes (the perceptually distinct units of sound) that distinguish the word “cat” when you hear it spoken? We might think that there are 3 phonemes in “cat”: the “K”, “A”, and “T” sounds (phonemes P1, P2, and P3, respectively).  But, in fact, there are 4 phonemes — the 4th one (P4) being the momentarily brief “sound of silence” after the “T” sound.  That silent phoneme signals the end of the word, which is what clearly distinguishes “cat” from other words that begin with {P1,P2,P3} (e.g., catfish, catapult, Catalonia, catatonic).  This discussion reminds me of a riddle that might help to clarify my point: “Why can’t you die of hunger in the desert?” Answer: “Because of all the sand which is there” (which sounds a lot like “because of all the sandwiches there”, except for the very brief silence after the words “sand” and “which”).

Given a corpus of speeches (or speech fragments) for a specific person, you can build a comprehensive speech model that represents the words (specifically, the sequences of phonemes) that this person commonly uses.  The full distribution of conditional probabilities P[Pk|Pj] can be constructed, which then becomes the “model” for that person’s speech habits. [Note: the conditional probability expression P[Pk|Pj] refers to the transition probability that the phoneme Pk will occur immediately after the phoneme Pj has occurred.] The comprehensive speech model (i.e., the complete distribution of P[Pk|Pj] transition probabilities for a specific person) captures dialect, vernacular, peculiar pronunciations, utterances, and other recurring features in their speech (“ummm”, “you know”, “I mean”, etc.). These models are used in voice recognition software (and in our brains) both in verbal comprehension and in speaker recognition (i.e., uniquely identifying a specific speaker, even if we can’t see them).  In this way, when you have a new voice sample, you can determine if it is consistent with the pattern of speech (the model) for a particular person.  Something similar to this was used to verify authenticity every time a new recording was released purporting to include the voice of Osama Bin Laden.
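As a minimal sketch of how such a model could be estimated, the Python snippet below counts phoneme-to-phoneme transitions in a toy corpus and normalizes the counts into P[Pk|Pj]. The phoneme labels, the "_" marker for the end-of-word silence, and the function name estimate_transitions are illustrative assumptions, not taken from any real speech toolkit.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Estimate first-order transition probabilities P[next | current]."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each phoneme with the one that immediately follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {nxt: n / total for nxt, n in followers.items()}
    return model

# Toy corpus (hypothetical): each word is a phoneme list ending in the silent phoneme "_".
corpus = [
    ["K", "A", "T", "_"],                   # "cat"
    ["K", "A", "T", "F", "I", "SH", "_"],   # "catfish"
    ["K", "A", "T", "_"],                   # "cat"
]

model = estimate_transitions(corpus)
print(model["T"])  # probabilities of what follows the "T" sound
```

In this toy corpus, two of the three transitions out of "T" lead to the silent phoneme, so the estimated model treats "cat" as the more likely word once {P1,P2,P3} has been heard.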

The important concepts in the “cat” example are the Markov Chain (which refers to the conditional sequence of data values) and the Markov Model (which is the model that is represented by the full set of conditional probabilities that characterize the Markov Chain).  A first-order Markov model is one in which the value of the next data point in the sequence is assumed to be statistically dependent only on the current data point.  In a second-order Markov model, the next data point is assumed to be dependent on the preceding two data points.  What is interesting and distinctive about Markov models is that most other statistical models rely on statistical independence of observed data points, whereas a Markov model (derived from and applied to Markov chains) absolutely and deliberately relies on the statistical dependence of the sequence of data points!
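To make the difference between model orders concrete, here is a small sketch that estimates both a first-order and a second-order model from the same toy sequence. The sequence and the helper name estimate_markov are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def estimate_markov(sequence, order=1):
    """Estimate P[next | previous `order` states] from one sequence."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        context = tuple(sequence[i - order:i])  # the preceding `order` states
        counts[context][sequence[i]] += 1
    return {
        context: {state: n / sum(followers.values()) for state, n in followers.items()}
        for context, followers in counts.items()
    }

toy = list("ABABBABABB")  # hypothetical two-state sequence

first = estimate_markov(toy, order=1)
second = estimate_markov(toy, order=2)

print(first[("B",)])        # depends only on the current state
print(second[("B", "B")])   # depends on the preceding two states
```

In this toy data the first-order model only knows that A follows B 60% of the time, while the second-order model sees that A always follows the pair (B, B). The extra context costs more parameters, so it needs more training data to estimate reliably.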

Therefore, we can apply Markov models in two complementary ways.  First, we can test whether an observed sequence of data values (e.g., the measured set of phoneme transitions within a speech) is consistent with a particular model (e.g., the set of transition probabilities for a known speaker).  Second, we can use the model as a predictor for what data value is likely to occur next in the sequence (i.e., use the model for predictive analytics).
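Both uses can be sketched in a few lines of Python. The two speaker models below are made-up transition tables over two abstract states, and the small probability floor for unseen transitions is an assumption of this sketch rather than a standard recipe.

```python
import math

def log_likelihood(sequence, model, floor=1e-6):
    """Sum of log transition probabilities of `sequence` under `model`.

    Transitions that are missing or have zero probability are given a
    small floor (an assumption here) so that the logarithm stays finite.
    """
    total = 0.0
    for current, nxt in zip(sequence, sequence[1:]):
        p = model.get(current, {}).get(nxt, 0.0)
        total += math.log(max(p, floor))
    return total

def predict_next(state, model):
    """Return the most probable next state given the current state."""
    followers = model[state]
    return max(followers, key=followers.get)

# Hypothetical transition models for two known speakers.
speaker_1 = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
speaker_2 = {"a": {"a": 0.2, "b": 0.8}, "b": {"a": 0.7, "b": 0.3}}

sample = ["a", "a", "a", "b", "a", "a"]  # a new observed sequence

# Use 1: which model is the sample more consistent with?
scores = {
    "speaker_1": log_likelihood(sample, speaker_1),
    "speaker_2": log_likelihood(sample, speaker_2),
}
best = max(scores, key=scores.get)
print(best, scores)

# Use 2: what is likely to come next under the better-matching model?
print(predict_next(sample[-1], speaker_1))
```

Working in log probabilities avoids numerical underflow on long sequences. In practice the raw score would also be compared against a threshold or a background model before declaring a match.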

In the era of Big Data, we can collect massive sequences of dependent data values (e.g., time series sequences of anything) for a large population of entities (e.g., customer purchase histories, web click logs, social events, human behaviors, speech patterns, weather reports, market quotes, device monitors, biosensors, video surveillance cameras, basketball play-by-play histories, etc.).  For each entity in the population, we can build a comprehensive Markov model.  If we do this for a full population of whatever it is that we are monitoring, we finally begin to fulfill one of the primary promises of Big Data: whole-population analytics.

From the historical training data that we have collected from all of our sources, we can construct and then use Markov Models to predict the future, including: tomorrow’s weather, or what products your customers are likely to buy, or the progression of an epidemic, or whether a cyber-attack is imminent, or whether LeBron James will pass the ball off in a 2-on-1 fast break.

Here is a simple predictive analytics example that uses a Markov model (i.e., the complete set of Markov chain transition probabilities) to predict the future.  Consider the following sequence of weather reports (a Markov chain) representing a series of 50 consecutive days (where S=sunny, R=rainy, and P=partly cloudy):


This sequence has 3 possible states (S, R, and P).  We assume that tomorrow’s weather only depends on today’s weather — therefore, this represents a first-order Markov chain. We can then ask several questions. For example: (a) what is the most probable next state to follow after the end of this sequence? (b) What is the least likely next state to follow after the end of this sequence?  In order to answer such questions, we first calculate the full set of transition probabilities (the Markov model) from the above training data:

P(S|S) = 13/22    P(S|P) = 8/17    P(S|R) = 0

P(P|S) = 9/22     P(P|P) = 5/17    P(P|R) = 3/10

P(R|S) = 0        P(R|P) = 4/17    P(R|R) = 7/10

Because the sequence ends on a rainy day (R), the relevant transition probabilities are P(?|R). Therefore, we find: (a) the most probable next state is R (rainy day), since P(?|R) is largest when ? is R; and (b) the least probable next state is S (sunny day), since P(?|R) is smallest when ? is S.  Therefore, if the above sequence represents your weather for the past 50 days, then our first-order Markov model predicts that it will rain tomorrow, with an estimated probability of 70% (P(R|R) = 7/10).
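The same lookup can be written as a short Python sketch. The transition probabilities are the ones listed above; the variable names and the use of exact fractions are choices of this sketch, and the two-day-ahead step at the end is an added illustration that assumes the same first-order model keeps applying.

```python
from fractions import Fraction

# model[today][tomorrow] = P(tomorrow | today), copied from the values above.
model = {
    "S": {"S": Fraction(13, 22), "P": Fraction(9, 22), "R": Fraction(0)},
    "P": {"S": Fraction(8, 17), "P": Fraction(5, 17), "R": Fraction(4, 17)},
    "R": {"S": Fraction(0), "P": Fraction(3, 10), "R": Fraction(7, 10)},
}

# Sanity check: the probabilities out of each state must sum to 1.
for today, row in model.items():
    assert sum(row.values()) == 1

today = "R"  # the answer in the text uses P(?|R), so the last day is rainy

# Rank tomorrow's states from most to least probable.
ranked = sorted(model[today].items(), key=lambda item: item[1], reverse=True)
most_likely, least_likely = ranked[0], ranked[-1]
print("most likely:", most_likely[0], float(most_likely[1]))
print("least likely:", least_likely[0], float(least_likely[1]))

# Illustration only: the distribution two days ahead, obtained by summing
# over every possible state for tomorrow.
two_days = {
    later: sum(model[today][mid] * model[mid][later] for mid in model)
    for later in model
}
print({state: float(p) for state, p in two_days.items()})
```

Exact fractions keep the row sums checkable without floating-point tolerance, which is convenient for a model this small.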

In conclusion, we find that Markov modeling is a powerful predictive analytics methodology, especially for Big Data CATS (Comprehensive Analysis of Time Series).

Follow Kirk Borne on Twitter @KirkDBorne

Where to get your Data Science Training or Apprenticeship

I am frequently asked for suggestions regarding academic institutions, professional organizations, or MOOCs that provide Data Science training.  The following list will be updated occasionally (LAST UPDATED: 2016 August 16).

Also, be sure to check out The Definitive Q&A for Aspiring Data Scientists and the story of my journey from Astrophysics to Data Science. If the latter story interests you, then here are a couple of related interviews: “Data Mining at NASA to Teaching Data Science at GMU”, and “Interview with Leading Data Science Expert”.

Here are a few places to check out:

  1. The Booz Allen Field Guide to Data Science
  2. Do you have what it takes to be a Data Scientist? (get the Booz Allen Data Science Capability Handbook)
  3. http://www.thisismetis.com/explore-data-science-online-training (formerly exploredatascience.com at Booz-Allen)
  4. http://www.thisismetis.com/
  5. https://www.teamleada.com/
  6. MapR Academy (offering Free Hadoop, Spark, HBase, Drill, Hive training and certifications at MapR)
  7. Data Science Apprenticeship at DataScienceCentral.com
  8. (500+) Colleges and Universities with Data Science Degrees
  9. List of Machine Learning Certifications and Best Data Science Bootcamps
  10. NYC Data Science Academy
  11. NCSU Institute for Advanced Analytics
  12. Master of Science in Analytics at Bellarmine University
  13. http://www.districtdatalabs.com/ (District Data Labs)
  14. http://www.dataschool.io/
  15. http://www.persontyle.com/school/ 
  16. http://www.galvanize.it/education/#classes (formerly Zipfian Academy) includes http://www.galvanizeu.com/ (Data Science, Statistics, Machine Learning, Python)
  17. https://www.coursera.org/specialization/jhudatascience/1
  18. https://www.udacity.com/courses#!/data-science 
  19. https://www.udemy.com/courses/Business/Data-and-Analytics/
  20. http://insightdatascience.com/ 
  21. Data Science Master Classes (at Datafloq)
  22. http://datasciencemasters.org
  23. http://www.jigsawacademy.com/
  24. https://intellipaat.com/
  25. http://www.athenatechacademy.com/ (Hadoop training, and more)
  26. O’Reilly Media Learning Paths
  27. http://www.godatadriven.com/training.html
  28. Courses for Data Pros at Microsoft Virtual Academy
  29. 18 Resources to Learn Data Science Online (by Simplilearn)



When Big Data Gets Local, Small Data Gets Big

We often hear that small data deserves at least as much attention in our analyses as big data. While there may be as many interpretations of that statement as there are definitions of big data, there are at least two situations where “small data” applications are worth considering. I will label these “Type A” and “Type B” situations.

In “Type A” situations, small data refers to having a razor-sharp focus on your business objectives, not on the volume of your data. If you can achieve those business objectives (and “answer the mail”) with small subsets of your data mountain, then do it, at once, without delay!

In “Type B” situations, I believe that “small” can be interpreted to mean that we are relaxing at least one of the 3 V’s of big data: Velocity, Variety, or Volume:

  1. If we focus on a localized time window within high-velocity streaming data (in order to mine frequent patterns, find anomalies, trigger alerts, or perform temporal behavioral analytics), then that is deriving value from “small data.”
  2. If we limit our analysis to a localized set of features (parameters) in our complex high-variety data collection (in order to find dominant segments of the population, or classes/subclasses of behavior, or the most significant explanatory variables, or the most highly informative variables), then that is deriving value from “small data.”
  3. If we target our analysis on a tight localized subsample of entries in our high-volume data collection (in order to deliver one-to-one customer engagement, personalization, individual customer modeling, and high-precision target marketing, all of which still require use of the full complexity, variety, and high-dimensionality of the data), then that is deriving value from “small data.”

(continue reading here: https://www.mapr.com/blog/when-big-data-goes-local-small-data-gets-big-part-1)


Local Linear Embedding (Image source**: http://mdp-toolkit.sourceforge.net/examples/lle/lle.html)

**Zito, T., Wilbert, N., Wiskott, L., Berkes, P. (2009). Modular toolkit for Data Processing (MDP): a Python data processing framework, Front. Neuroinform. (2008) 2:8. doi:10.3389/neuro.11.008.2008

New Directions for Big Data and Analytics in 2015

The world of big data and analytics is remarkably vibrant and marked by incredible innovation, and there are advancements on every front that will continue into 2015. These include increased data science education opportunities and training programs, in-memory analytics, cloud-based everything-as-a-service, innovations in mobile (business intelligence and visual analytics), broader applications of social media (for data generation, consumption and exploration), graph (linked data) analytics, embedded machine learning and analytics in devices and processes, digital marketing automation (in retail, financial services and more), automated discovery in sensor-fed data streams (including the internet of everything), gamification, crowdsourcing, personalized everything (medicine, education, customer experience and more) and smart everything (highways, cities, power grid, farms, supply chain, manufacturing and more).

Within this world of wonder, where will we wander with big data and analytics in 2015? I predict two directions for the coming year…

(continue reading here: http://www.ibmbigdatahub.com/blog/new-directions-big-data-and-analytics-2015)


Feature Mining in Big Data

We love features in our data, lots of features, in the same way that we love features in our toys, mobile phones, cars, and other gadgets.  Good features in our big data collection empower us to build accurate predictive models, identify the most informative trends in our data, discover insightful patterns, and select the most descriptive parameters for data visualizations. Therefore, it is no surprise that feature mining is one aspect of data science that appeals to all data scientists. Feature mining includes: (1) feature generation (from combinations of existing attributes), (2) feature selection (for mining and knowledge discovery), and (3) feature extraction (for operational systems, decision support, and reuse in various analytics processes, dashboards, and pipelines).

Learn more about feature mining and feature selection for Big Data Analytics in these publications:

  1. Feature-Rich Toys and Data
  2. Interactive Visualization-enabled Feature Selection and Model Creation
  3. Feature Selection (available on the National Science Bowl blog site)
  4. Feature Selection Methods used with different Data Mining algorithms
  5. (and for heavy data science pundits) Computational Methods of Feature Selection


Machine Unlearning and The Value of Imperfect Models

Common wisdom states that “perfect is the enemy of good enough.” We can apply this wisdom to the machine learning models that we train and deploy for big data analytics. If we strive for perfection, then we may encounter several potential risks. It may be useful therefore to pay attention to a little bit of “machine unlearning.” For example:


By attempting to build a model that correctly follows every little nuance, deviation, and variation in our data set, we are almost certainly fitting the natural variance in the data, which will never go away.  After building such a model, we may find that it has nearly 100% accuracy on the training data, but significantly lower accuracy on the test data set.  Such a gap between training and test accuracy is a classic sign that we have overfit our model. Of course, we don’t want a trivial model (an underfit model) either. To paraphrase Albert Einstein: “models should be as simple as possible, but no simpler.”
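A toy illustration of that train/test gap, under heavy assumptions: the data below are synthetic, the "natural variance" is mimicked by deliberately flipping every 7th label, and the simple rule is given the true threshold rather than learning it. The "memorizer" is a 1-nearest-neighbor lookup that reproduces every training label, noise included.

```python
def noisy_label(x):
    """True rule is x >= 50; every 7th point is flipped to mimic noise."""
    label = 1 if x >= 50 else 0
    return 1 - label if x % 7 == 0 else label

data = [(x, noisy_label(x)) for x in range(100)]
train = data[0::2]  # even x values
test = data[1::2]   # odd x values

def memorizer(x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def simple_rule(x):
    """A deliberately simple model that ignores the noise."""
    return 1 if x >= 50 else 0

def accuracy(predict, points):
    """Fraction of points whose label the predictor gets right."""
    return sum(predict(x) == y for x, y in points) / len(points)

for name, predict in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, "train:", accuracy(predict, train), "test:", accuracy(predict, test))
```

On this synthetic data the memorizer scores 100% on the training set but only 70% on the test set, while the simple rule scores 84% and 86%. The imperfect model generalizes better because it does not chase the flipped labels.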

(continue reading here: https://www.mapr.com/blog/machine-unlearning-value-imperfect-models)
