(A version of this article was originally published on BigDataRepublic.com in September 2013 — that site no longer exists.)
I have been teaching courses on data mining for over 10 years. One of my favorite lectures focuses on the use of Markov Models for predictive analytics. I enjoy giving this lecture because it always triggers interesting reactions from my students. Since the lecture can be used to demonstrate advanced concepts (like Bayesian inference and probabilistic reasoning) as well as basic concepts (like conditional probability and statistical dependence), I use the lecture both in my graduate course and in my freshman class. I start the lecture by telling the students that I will show them how to predict the future with a cat.
I begin my lecture with this question: how do you pronounce the word “cat”? Before you stop reading this and ask “what does this have to do with data mining?” I will have to admit that my students also have a similar response, but they are my captive audience — they can’t go away — at least, they haven’t yet walked out on any of my lectures. 🙂 So, let us examine the cat question first, and then I will address the latent question “what does this have to do with predictive analytics or Big Data?”
Following the moments of confusion induced by my first question, I then bring the discussion back around to its data mining application with a second question: what are the phonemes (the perceptually distinct units of sound) that distinguish the word “cat” when you hear it spoken? We might think that there are 3 phonemes in “cat”: the “K”, “A”, and “T” sounds (phonemes P1, P2, and P3, respectively). But, in fact, there are 4 phonemes — the 4th one (P4) being the momentarily brief “sound of silence” after the “T” sound. That silent phoneme signals the end of the word, which is what clearly distinguishes “cat” from other words that begin with {P1,P2,P3} (e.g., catfish, catapult, Catalonia, catatonic). This discussion reminds me of a riddle that might help to clarify my point: “Why can’t you die of hunger in the desert?” Answer: “Because of all the sand which is there” (which sounds a lot like “because of all the sandwiches there”, except for the very brief silence after the words “sand” and “which”).
Given a corpus of speeches (or speech fragments) for a specific person, you can build a comprehensive speech model that represents the words (specifically, the sequences of phonemes) that this person commonly uses. The full distribution of conditional probabilities P[Pk|Pj] can be constructed, which then becomes the “model” for that person’s speech habits. [Note: the conditional probability expression P[Pk|Pj] refers to the transition probability that the phoneme Pk will occur immediately after the phoneme Pj has occurred.] The comprehensive speech model (i.e., the complete distribution of P[Pk,Pj] transition probabilities for a specific person) captures dialect, vernacular, peculiar pronunciations, utterances, and other recurring features in their speech (“ummm”, “you know”, “I mean”, etc.). These models are used in voice recognition software (and in our brains) both in verbal comprehension and in speaker recognition (i.e., unique identification = identifying a specific speaker, even if we can’t see them). In this way, when you have a new voice sample, you can determine if it is consistent with the pattern of speech (the model) for a particular person. Something similar to this was used to verify authenticity every time a new recording was released purporting to include the voice of Osama Bin Laden.
The important concepts in the “cat” example are the Markov Chain (which refers to the conditional sequence of data values) and the Markov Model (which is the model that is represented by the full set of conditional probabilities that characterize the Markov Chain). A first-order Markov model is one in which the value of the next data point in the sequence is assumed to be statistically dependent only on the current data point. In a second-order Markov model, the next data point is assumed to be dependent on the preceding two data points. What is interesting and distinctive about Markov models is that most other statistical models rely on statistical independence of observed data points, whereas a Markov model (derived from and applied to Markov chains) absolutely and deliberately relies on the statistical dependence of the sequence of data points!
Therefore, we can apply Markov models in two complementary ways. First, we can test whether an observed sequence of data values (e.g., the measured set of phoneme transitions within a speech) is consistent with a particular model (e.g., the set of transition probabilities for a known speaker). Second, we can use the model as a predictor for what data value is likely to occur next in the sequence (i.e., use the model for predictive analytics).
In the era of Big Data, we can collect massive sequences of dependent data values (e.g., time series sequences of anything) for a large population of entities (e.g., customer purchase histories, web click logs, social events, human behaviors, speech patterns, weather reports, market quotes, device monitors, biosensors, video surveillance cameras, basketball play-by-play histories, etc.). For each entity in the population, we can build a comprehensive Markov model. If we do this for a full population of whatever it is that we are monitoring, we finally begin to fulfill one of the primary promises of Big Data: whole-population analytics.
From the historical training data that we have collected from all of our sources, we can construct and then use Markov Models to predict the future, including: tomorrow’s weather, or what products your customers are likely to buy, or the progression of an epidemic, or whether a cyber-attack is imminent, or whether LeBron James will pass the ball off in a 2-on-1 fast break.
Here is a simple predictive analytics example that uses a Markov model (i.e., the complete set of Markov chain transition probabilities) to predict the future. Consider the following sequence of weather reports (a Markov chain) representing a series of 50 consecutive days (where S=sunny, R=rainy, and P=partly cloudy):
SSPPS PRRPP SSSPR RRRPS SSSPP PSSSS SPSSP PSPSS PRRPS SPRRR
This sequence has 3 possible states (S, R, and P). We assume that tomorrow’s weather only depends on today’s weather — therefore, this represents a first-order Markov chain. We can then ask several questions. For example: (a) what is the most probable next state to follow after the end of this sequence? (b) What is the least likely next state to follow after the end of this sequence? In order to answer such questions, we first calculate the full set of transition probabilities (the Markov model) from the above training data:
P(S|S) = 13/22
P(S|P) = 8/17
P(S|R) = 0
P(P|S) = 9/22
P(P|P) = 5/17
P(P|R) = 3/10
P(R|S) = 0
P(R|P) = 4/17
P(R|R) = 7/10
Therefore, we find: (a) the most probable next state after the end of this sequence is R (rainy day) since P(?|R) has the largest likelihood when ? is R; and (b) the least probable next state is S (sunny day) since P(?|R) has the smallest likelihood when ? is S. Therefore, if the above sequence represents your weather for the past 50 days, then our first-order Markov model predicts that it will rain tomorrow, with 70% confidence.
In conclusion, we find that Markov modeling is powerful predictive analytics methodology, especially for Big Data CATS (Comprehensive Analysis of Time Series).
Follow Kirk Borne on Twitter @KirkDBorne