Tag Archives: Data Mining

Data Science Blogs-R-Us

[UPDATED December 31, 2022]

I have written articles in many places. I will be collecting links to those sources here. The list is not complete and will be constantly evolving. There are some older blogs that I will be including in the list below as I remember them and find them. Also included are some interviews in which I provided detailed answers to a variety of questions.

In 2019, I was listed as the #1 Top Data Science Blogger to Follow on Twitter.

And then there’s this: not a blog, but a link to my 2013 TEDx talk, “Big Data, Small World.” (Many more videos of my talks are available online. That list will be compiled in another place soon.)

  1. Rocket-Powered Data Science (the website that you are now reading).
  2. https://medium.com/@kirk.borne
  3. https://www.the-yuan.com/search.html (Search for “Kirk Borne” blogs)
  4. https://www.datasciencecentral.com/author/kirkborne/
  5. https://medium.com/@relx/ai-adoption-in-2021-driven-by-many-external-factors-af5b848cee33
  6. https://muckrack.com/kirk-borne/articles
  7. https://www.govloop.com/author/kirkdborne/
  8. https://datamakespossible.westerndigital.com/tag/kirk-borne/
  9. https://www.linkedin.com/in/kirkdborne/detail/recent-activity/posts/
  10. https://www.linkedin.com/pulse/how-go-from-data-paradox-productivity-business-kirk-borne-ph-d-/
  11. https://blog.starburst.io/author/kirk-borne
  12. https://www.oreilly.com/people/kirk-borne/
  13. https://www.syntasa.com/blog/author/kirk-borne
  14. https://mapr.com/blog/author/kirk-borne/
  15. https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/bult.2013.1720390414
  16. https://www.thedatadreamer.com/insights/talk-the-walk-the-importance-of-fluency-in-data-storytelling/
  17. https://www.futureofbusinessandtech.com/business-ai/leveraging-artificial-intelligence-for-social-good/
  18. https://mindsdb.com/blog/predictions-at-the-speed-of-questions/?utm_source=kirk&utm_medium=blog&utm_campaign=wb
  19. https://blog.qlik.com/how-we-teach-the-leaders-of-tomorrow-to-be-curious-ask-questions-and-not-be-afraid-to-fail-fast-to-learn-fast
  20. https://www.boozallen.com/s/insight/blog/kirk-borne-on-building-data-science-models.html
  21. https://www.boozallen.com/s/insight/blog/the-power-of-data-science-and-ai-for-social-good.html
  22. https://odsc.com/blog/adapting-machine-learning-algorithms-to-novel-use-cases/
  23. https://www.kdnuggets.com/2019/01/data-scientist-dilemma-cold-start-machine-learning.html
  24. https://www.sas.com/en_us/insights/articles/analytics/data-scientist-data-literacy.html
  25. https://blogs.sas.com/content/sascom/2019/04/27/getting-practical-about-ai-with-kirk-borne/
  26. https://blogs.sas.com/content/sascom/2017/08/31/3-machine-learning-technologies-3-three-years/
  27. https://www.digitalistmag.com/future-of-work/2019/05/15/intelligent-enterprise-connecting-islands-of-innovation-06198471
  28. https://www.digitalistmag.com/cio-knowledge/2019/06/27/data-strategy-that-first-date-with-your-data-06199224
  29. https://blogs.oracle.com/author/kirk-borne
  30. https://blogs.thomsonreuters.com/answerson/doing-better-at-your-service-with-ai-as-a-service/
  31. https://www.aitimejournal.com/data-science-interview-with-kirk-borne-principal-data-scientist-booz-allen-hamilton
  32. https://insideanalysis.com/author/kirk-borne/
  33. http://researcher123.blogspot.com/2014/
  34. https://www.manthan.com/blogs/nrf-interview-with-kirk-borne-big-data-hype-the-worst-is-behind-us/
  35. https://www.thinkful.com/blog/meet-the-experts-dr-kirk-borne/
  36. https://itpeernetwork.intel.com/author/kirkborne/#gs.6zd0x8
  37. https://www.ibmbigdatahub.com/blog/author/kirk-borne
  38. https://www.laserfiche.com/ecmblog/3-questions-kirk-borne-about-big-data/

Data Science Training Opportunities

A few years ago, I generated a list of places to receive data science training. That list has become a bit stale. So, I have updated the list, adding some new opportunities, keeping many of the previous ones, and removing the obsolete ones.

Also, here is a thorough, informative, and interesting article that outlines the critical skills needed in order to be a good data scientist: https://www.toptal.com/data-science#hiring-guide

Here are 30 training opportunities that I encourage you to explore:

  1. The Booz Allen Field Guide to Data Science
  2. NYC Data Science Academy
  3. NVIDIA Deep Learning Institute
  4. Metis Data Science Training
  5. Leada’s online analytics labs
  6. Data Science Training by General Assembly
  7. Learn Data Science Online by DataCamp
  8. (600+) Colleges and Universities with Data Science Degrees
  9. Data Science Master’s Degree Programs
  10. Data Analytics, Machine Learning, & Statistics Courses at edX
  11. Data Science Certifications (by AnalyticsVidhya)
  12. Learn Everything About Analytics (by AnalyticsVidhya)
  13. Big Bang Data Science Solutions
  14. CommonLounge
  15. IntelliPaat Online Training
  16. DataQuest
  17. NCSU Institute for Advanced Analytics
  18. District Data Labs
  19. Data School
  20. Galvanize
  21. Coursera
  22. Udacity Nanodegree Program to Become a Data Scientist
  23. Udemy – Data & Analytics
  24. Insight Data Science Fellows Program
  25. The Open Source Data Science Masters
  26. Jigsaw Academy Post Graduate Program in Data Science & Machine Learning
  27. O’Reilly Media Learning Paths
  28. Data Engineering and Data Science Training by Go Data Driven
  29. 18 Resources to Learn Data Science Online (by Simplilearn)
  30. Top Online Data Science Courses to Learn Data Science

Follow Kirk Borne on Twitter @KirkDBorne


Meta-Learning For Better Machine Learning

In a related post we discussed the Cold Start Problem in Data Science — how do you start to build a model when you have either no training data or no clear choice of model parameters. An example of a cold start problem is k-Means Clustering, where the number of clusters k in the data set is not known in advance, and the locations of those clusters in feature space (i.e., the cluster means) are not known either. So, you start by assuming a value for k and making random assumptions about the cluster means, and then iterate until you find the optimal set of clusters, based upon some evaluation metric. See the related post for more details about the cold start challenge. See the attached graphic below for a simple demonstration of a k-Means Clustering application.
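
As a minimal sketch of that cold start (not the code behind the graphic), the loop below picks k random data points as the initial cluster means and then alternates assignment and update steps until nothing moves. The function name `kmeans`, the toy `points`, and the use of total squared distance as the evaluation metric are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: random initial means, then assign/update until stable."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # cold start: k random data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[nearest].append(p)
        # Update step: move each mean to the centroid of its cluster.
        new_means = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else means[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_means == means:             # converged: nothing moved
            break
        means = new_means
    # Evaluation metric: total squared distance of the points to their nearest mean.
    inertia = sum(min(math.dist(p, m) for m in means) ** 2 for p in points)
    return means, inertia

# Two small, well-separated groups of points (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
means, inertia = kmeans(points, k=2)
print(sorted(means), inertia)
```

On data this cleanly separated, the random starting means should not matter. On messier data, different seeds can lead to different final clusters, which is one reason to rerun the cold start several times.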

The above example (clustering) is taken from unsupervised machine learning (where there are no labels on the training data). There are also examples of cold start in supervised machine learning (where you do have class labels on the training data).

As an example of a cold start in supervised learning, we look at neural network models, where the weights on the edges that connect the various nodes in the network layers are not known initially. So, arbitrary initial values (typically small random numbers) are assigned to all of the edge weights (which could number in the hundreds or thousands). That’s the cold start. Following that, the weights can “learn” to get better through a technique known as backpropagation, which is applied through sequential iterations of the neural network learning process. In each iteration, a validation metric estimates the model error (i.e., the classification error on the validation or hold-out data set), and backpropagation then assigns some portion of that error to each of the edge weights. Each edge weight is adjusted accordingly using gradient descent (or some similar error-correction estimator) for the next model in the sequence. The next iteration of the neural network modeling process is then executed, applying the same steps as above, and the process continues until the validation metric converges to the optimal final model.
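
Here is a small sketch of that loop, assuming a toy XOR data set, one hidden layer, and a squared-error loss (all of these are illustrative choices). For brevity it tracks the error on the training data only; a real workflow would measure it on a hold-out set, as described above. The name `train_xor` is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(hidden=3, epochs=2000, rate=0.5, seed=1):
    rng = random.Random(seed)
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    # Cold start: small random weights; the last entry in each row is a bias.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    history = []
    for _ in range(epochs):
        grad_hidden = [[0.0] * 3 for _ in range(hidden)]
        grad_out = [0.0] * (hidden + 1)
        loss = 0.0
        for (x1, x2), target in data:
            inputs = (x1, x2, 1.0)
            # Forward pass through the hidden layer and the output node.
            h = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in w_hidden]
            h_b = h + [1.0]
            y = sigmoid(sum(w * v for w, v in zip(w_out, h_b)))
            loss += (y - target) ** 2 / len(data)
            # Backward pass: share the output error among the edge weights.
            # (Constant factors of the gradient are absorbed into `rate`.)
            delta_out = (y - target) * y * (1 - y)
            for j in range(hidden + 1):
                grad_out[j] += delta_out * h_b[j]
            for j in range(hidden):
                delta_h = delta_out * w_out[j] * h[j] * (1 - h[j])
                for i in range(3):
                    grad_hidden[j][i] += delta_h * inputs[i]
        history.append(loss)
        # Gradient descent step on every edge weight.
        w_out = [w - rate * g for w, g in zip(w_out, grad_out)]
        w_hidden = [[w - rate * g for w, g in zip(row, g_row)]
                    for row, g_row in zip(w_hidden, grad_hidden)]
    return history

history = train_xor()
print(history[0], history[-1])   # error at the cold start vs. after training
```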

What is missing in the above discussion is the deeper set of unknowns in the learning process. This is the meta-learning phase. We can elucidate this phase through our two examples above.

From the first example above, k-Means Clustering:

  • What is the value of k?
  • Which features in the data set are most effective in creating distinct clusters in the data (i.e., to create the segments that are the most compact internally, and relatively the most separated from each other)? There might be dozens or hundreds or thousands of attributes to choose from, and a vast number of combinations of those attributes in which to explore clustering in different dimensions of parameter space.
  • What distance metric should be used to estimate separation (or what similarity metric should be used to estimate similarity), since clustering is a distance-based algorithm? There are some common choices for distance and similarity metrics (e.g., cosine similarity, Euclidean distance, Manhattan distance, Mahalanobis distance, Lp-Norm, etc.), but that is just the tip of a vast iceberg — just take a look at the 750-page book Encyclopedia of Distances.
  • What evaluation metric should we use to determine if the clusters are “good enough” or optimal (i.e., the most compact set of clusters relative to the separation of the clusters)? There are several choices for such evaluation metrics: Dunn index, Davies-Bouldin index, C-index, and Silhouette analysis are just a few examples.

We need to decide on all of these parameterizations of the clustering model before the cold start iterations on the cluster means can begin.
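
To make one of those decisions concrete, the sketch below sweeps over candidate values of k and scores each clustering with a silhouette measure. It assumes Euclidean distance, a deterministic farthest-point start instead of a random one (so that the result is repeatable), and toy data. The helper names are hypothetical.

```python
import math

def farthest_point_init(points, k):
    """Deterministic start: each new mean is the point farthest from those chosen."""
    means = [points[0]]
    while len(means) < k:
        means.append(max(points, key=lambda p: min(math.dist(p, m) for m in means)))
    return means

def kmeans_labels(points, k, iterations=50):
    means = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, means[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette: values near 1 mean compact, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)   # singleton cluster: score 0 by convention
            continue
        a = sum(own) / len(own)                            # compactness
        b = min(sum(d) / len(d) for d in dists.values())   # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Three small blobs of four points each (toy data).
blobs = [(cx + dx, cy + dy)
         for cx, cy in [(0, 0), (10, 0), (5, 9)]
         for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
scores = {k: silhouette(blobs, kmeans_labels(blobs, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Swapping `math.dist` for another distance function, or `silhouette` for another index, changes the answer that this sweep returns, which is exactly why those choices belong to the meta-learning phase.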

From the second example above, Neural Network modeling, there are also many different preliminary tasks and parameterizations of the network that need to be decided and acted on before the cold start iterations on the edge weights can begin.

This now gets to the heart of meta-learning. It is focused on learning the right tasks to perform and on tuning the modeling hyper-parameters. These are the different tasks and “external” parameters that differentiate various instantiations of a specific model within a broader category of models — those tasks and external parameterizations must be explored before you start building, iterating, and validating a specific model’s “internal” parameters. For example:

  • You can cluster children’s toys in a toy store by color, or by shape, or by electronic vs. non-electronic, or by age-appropriateness, or by functionality, or cluster them by some combination of those features.
  • You can cluster (segment) your customers by the types of products they buy, or by their geographic location, or by their gender, or by their age, or by the day of week that they prefer to shop, or cluster them by some combination of those many different variables.
  • You can cluster medical drug treatments by the types of symptoms that they address, or by the medical diagnoses (outcomes) that they attempt to cure, or by their dosage amounts, or cluster them by the side-effects that are caused when different combinations of the drugs are used.

Deciding on the higher-level hyper-parameterizations of your clustering approach before you build the actual models is good data science and good business, no matter whether you are sorting toys, or discovering segments in your customer database, or prescribing different medications to medical patients.

Similar decisions must be made for the neural network example mentioned earlier as well as for numerous other machine learning modeling techniques. Meta-learning is important to make sure that you are aware of and attentive to the many choices of modeling tasks and parameterizations for the models that you are about to train. Meta-learning is also critical for demonstrating (proving) that you built the best (or optimal or most accurate) model, given the higher level characteristics (e.g., parameters, architecture, or input data sources) of the modeling effort:

  • What is the business case? What outcomes will be actionable?
  • What data do we have? Which combinations of data have we not explored yet?
  • What metric will demonstrate that we have achieved the globally optimal model (or approximately the global optimum), versus some locally good model that doesn’t generalize across a larger data set?

Genetic Algorithms (GAs) are an example of meta-learning. They are not machine learning algorithms in themselves, but GAs can be applied across ensembles of machine learning models and tasks, in order to find the optimal model (perhaps globally optimal model) across a collection of locally optimal solutions.
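
A rough sketch of the idea follows. The function `validation_score` is a hypothetical stand-in for “train a model with these two settings and measure its validation metric”; it is given several local peaks on purpose. The population size, mutation rate, and crossover rule are arbitrary illustrative choices.

```python
import math
import random

def validation_score(params):
    """Hypothetical stand-in for training a model with these settings and scoring it.
    It has several local peaks, so simple hill-climbing could get stuck."""
    x, y = params
    return math.sin(3 * x) * math.cos(2 * y) - 0.1 * ((x - 2) ** 2 + (y - 1) ** 2)

def genetic_search(generations=40, population_size=20, seed=0):
    rng = random.Random(seed)
    # Cold start: a population of random candidate settings.
    population = [(rng.uniform(0, 4), rng.uniform(0, 4)) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=validation_score, reverse=True)
        best_per_generation.append(validation_score(ranked[0]))
        parents = ranked[: population_size // 2]   # selection: keep the fitter half
        children = [ranked[0]]                     # elitism: the best always survives
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            child = (mother[0], father[1])         # crossover: mix the two parents
            if rng.random() < 0.3:                 # mutation: a small random nudge
                child = (child[0] + rng.gauss(0, 0.2), child[1] + rng.gauss(0, 0.2))
            children.append(child)
        population = children
    best = max(population, key=validation_score)
    return best, best_per_generation

best, history = genetic_search()
print(best, validation_score(best))
```

Because the best candidate is always carried forward, the best score never gets worse from one generation to the next, although nothing here guarantees that the global optimum is reached.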

Learn more about meta-learning from these resources:

Finally, in addition to the awesome 750-page book Encyclopedia of Distances, please check out some of these top-selling books on Data Science, AI, and Machine Learning:

Top Books in AI and Machine Learning


Disclosure statement: As an Amazon Associate I earn from qualifying purchases.

Data Scientist’s Dilemma – The Cold Start Problem

The ancient philosopher Confucius has been credited with saying “study your past to know your future.” This wisdom applies not only to life but to machine learning also. Specifically, the availability and application of labeled data (things past) for the labeling of previously unseen data (things future) is fundamental to supervised machine learning.

Without labels (diagnoses, classes, known outcomes) in past data, how do we make progress in labeling (explaining) future data? This would be a problem.

A related problem also arises in unsupervised machine learning. In these applications, there is no requirement or presumption regarding the existence of labeled training data — we are essentially parameterizing or characterizing the patterns in the data (e.g., the trends, correlations, segments, clusters, associations).

Many unsupervised learning models can converge more readily and be more valuable if we know in advance which parameterizations are best to choose. If we cannot know that (i.e., because it truly is unsupervised learning), then we would like to know at least that our final model is optimal (in some way) in explaining the data.

In both of these applications (supervised and unsupervised machine learning), if we don’t have these initial insights and validation metrics, then how does such model-building get started and get moving towards the optimal solution?

This challenge is known as the cold-start problem! The solution to the problem is easy (sort of): We make a guess — an initial guess! Usually, that would be a totally random guess.

That sounds so… so… random! How do we know whether it’s a good initial guess? How do we progress our model (parameterizations) from that random initial choice? How do we know that our progression is moving towards more accurate models? How? How? How?

This can be a real challenge. Of course nobody said the “cold start” problem would be easy. Anyone who has ever tried to start a very cold car on a frozen morning knows the pain of a cold start challenge. Nothing can be more frustrating on such a morning. But, nothing can be more exhilarating and uplifting on such a morning than that moment when the engine starts and the car begins moving forward with increasing performance.

The experiences for data scientists who face cold-start problems in machine learning can be very similar to those, especially the excitement when our models begin moving forward with increasing performance.

We will itemize several examples at the end. But before we do that, let’s address the objective function. That is the true key that unlocks performance in a cold-start challenge. That’s the magic ingredient in most of the examples that we will list.

The objective function (also known as cost function, or benefit function) provides an objective measure of model performance. It might be as simple as the percentage of class labels that the model got right (in a classification model), or the sum of the squares of the deviations of the points from the model curve (in a regression model), or the compactness of the clusters relative to their separation (in a clustering analysis).
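
The three objective functions just mentioned can be sketched in a few lines each. The function names and the toy inputs are assumptions, and the clustering measure shown is only one simple way to express compactness relative to separation.

```python
import math

def accuracy(predicted, actual):
    """Classification: fraction of class labels the model got right (maximize)."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def sum_squared_error(xs, ys, model):
    """Regression: sum of squared deviations of the points from the model curve (minimize)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

def compactness_ratio(clusters):
    """Clustering: mean distance of points to their own centroid, divided by
    the smallest distance between centroids (minimize)."""
    centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    spread = sum(math.dist(p, m) for cl, m in zip(clusters, centroids) for p in cl)
    spread /= sum(len(cl) for cl in clusters)
    separation = min(math.dist(a, b)
                     for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return spread / separation

acc = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])          # 3 of 4 correct
sse = sum_squared_error([0, 1, 2], [1, 3, 5], lambda x: 2 * x)      # each point is off by 1
ratio = compactness_ratio([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])   # tight, distant clusters
print(acc, sse, ratio)
```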

The value of the objective function is not only in its final value (i.e., giving us a quantitative overall model performance rating), but its great (perhaps greatest) value is realized in guiding our progression from the initial random model (cold-start zero point) to that final successful (hopefully, optimal) model. In those intermediate steps it serves as an evaluation (or validation) metric.

By measuring the evaluation metric at step zero (cold-start), then measuring it again after making adjustments to the model parameters, we learn whether our adjustments led to a better performing model or worse performance. We then know whether to continue making model parameter adjustments in the same direction or in the opposite direction. This is called gradient descent.

Gradient descent methods basically find the slope (i.e., the gradient) of the performance error curve as we progress from one model to the next. As we learned in grade school algebra class, we need two points to find the slope of a curve. Therefore, it is only after we have run and evaluated two models that we will have two performance points — the slope of the curve at the latest point then informs our next choice of model parameter adjustments: either (a) keep adjusting in the same direction as the previous step (if the performance error decreased) to continue descending the error curve; or (b) adjust in the opposite direction (if the performance error increased) to turn around and start descending the error curve.
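
The two-point idea can be written out directly. In this sketch the error curve is a made-up parabola with its minimum at w = 3, the slope is estimated from the last two models that were evaluated, and the next model steps against that slope. The names `error` and `two_point_descent` and the step size are assumptions. Practical gradient descent usually computes the gradient analytically rather than from two evaluations.

```python
import random

def error(w):
    """Toy performance error curve with its minimum at w = 3 (an assumption)."""
    return (w - 3.0) ** 2

def two_point_descent(rate=0.1, max_steps=1000, tolerance=1e-9, seed=0):
    rng = random.Random(seed)
    w_prev = rng.uniform(-10, 10)   # cold start: a random first model
    w = w_prev + 0.5                # a second model, after one arbitrary adjustment
    for _ in range(max_steps):
        if abs(w - w_prev) < tolerance:   # the two models coincide: stop
            break
        # Slope of the error curve, estimated from the last two evaluated models.
        slope = (error(w) - error(w_prev)) / (w - w_prev)
        # Step against the slope to keep descending the error curve.
        w_prev, w = w, w - rate * slope
    return w

w_best = two_point_descent()
print(w_best)   # close to 3, the bottom of the toy error curve
```

If the error went up on the last step, the estimated slope changes sign and the next step turns around, which matches cases (a) and (b) above.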

Note that hill-climbing is the opposite of gradient descent, but essentially the same thing. Instead of minimizing error (a cost function), hill-climbing focuses on maximizing accuracy (a benefit function). Again, we measure the slope of the performance curve from two models, then proceed in the direction of better-performing models. In both cases (hill-climbing and gradient descent), we hope to reach an optimal point (maximum accuracy or minimum error), and then declare that to be the best solution. And that is amazing and satisfying when we remember that we started (as a cold-start) with an initial random guess at the solution.

When our machine learning model has many parameters (which could be thousands for a deep neural network), the calculations are more complex (perhaps involving a multi-dimensional gradient calculation over arrays of values known as tensors). But the principle is the same: quantitatively discover at each step in the model-building progression which adjustments (size and direction) are needed in each one of the model parameters in order to progress towards the optimal value of the objective function (e.g., minimize errors, maximize accuracy, maximize goodness of fit, maximize precision, minimize false positives, etc.). In deep learning, as in typical neural network models, the method by which those adjustments to the model parameters are estimated (i.e., for each of the edge weights between the network nodes) is called backpropagation. That is still based on gradient descent.

One way to think about gradient descent, backpropagation, and perhaps all machine learning is this: “Machine Learning is the set of mathematical algorithms that learn from experience. Good judgment comes from experience. And experience comes from bad judgment.” In our case, the initial guess for our random cold-start model can be considered “bad judgment”, but then experience (i.e., the feedback from validation metrics, applied through methods such as gradient descent) brings “good judgment” (better models) into our model-building workflow.

Here are ten examples of cold-start problems in data science where the algorithms and techniques of machine learning produce the good judgment in model progression toward the optimal solution:

  • Clustering analysis (such as K-Means Clustering), where the initial cluster means and the number of clusters are not known in advance (and thus are chosen randomly initially), but the compactness of the clusters can be used to evaluate, iterate, and improve the set of clusters in a progression to the final optimum set of clusters (i.e., the most compact and best separated clusters).
  • Neural networks, where the initial weights on the network edges are assigned randomly (a cold-start), but backpropagation is used to iterate the model to the optimal network (with highest classification performance).
  • TensorFlow deep learning, which uses the same backpropagation technique of simpler neural networks, but the calculation of the weight adjustments is made across a very high-dimensional parameter space of deep network layers and edge weights using tensors.
  • Regression, which uses the sum of the squares of the deviations of the points from the model curve in order to find the best-fit curve. In linear regression, there is a closed-form solution (derivable from the linear least-squares technique). The solution for non-linear regression is not typically a closed-form set of mathematical equations, but the minimization of the sum of the squares of deviations still applies — gradient descent can be used in an iterative workflow to find the optimal curve. Note that K-Means Clustering is actually an example of piecewise regression.
  • Nonconvex optimization, where the objective function has many hills and valleys, so that gradient descent and hill-climbing will typically converge only to a local optimum, not to the global optimum. Techniques like genetic algorithms, particle swarm optimization (when the gradient cannot be calculated), and other evolutionary computing methods are used to generate lots of random (cold-start) models and then iterate each of them until you find the global optimum (or until you run out of time and resources, and then pick the best one that you could find). [See my graphic attached below that illustrates a sample use case for genetic algorithms. See also the NOTE below the graphic about Genetic Algorithms, which also applies to other evolutionary algorithms, indicating that these are not machine learning algorithms specifically, but they are actually meta-learning algorithms]
  • kNN (k-Nearest Neighbors), which is a supervised learning technique in which the data set itself becomes the model. In other words, the assignment of a new data point to a particular group (which may or may not have a class label or a particular meaning yet) is based simply upon finding which category (group) of existing data points is in the majority when you take a vote of the nearest neighbors to the new data point. The number of nearest neighbors that are to be examined is some number k, which can be initially arbitrary (a cold-start), but then it is adjusted to improve model performance.
  • Naive Bayes classification, which applies Bayes theorem to a large data set with class labels on the data items, but for which some combinations of attributes and features are not represented in the training data (i.e., a cold-start challenge). By assuming that the different attributes are mutually independent features of the data items (given the class label), one can estimate the posterior probability of each class label for a new data item with a feature vector (set of attributes) that is not found in the training data. Naive Bayes can be viewed as the simplest form of a Bayes Belief Network (BBN), and it is another example of where the data set becomes the model, where the frequency of occurrence of the different attributes individually can inform the expected frequency of occurrence of different combinations of the attributes.
  • Markov modeling (Belief Networks for Sequences) is an extension of BBN to sequences, which can include web logs, purchase patterns, gene sequences, speech samples, videos, stock prices, or any other temporal or spatial or parametric sequence.
  • Association rule mining, which searches for co-occurring associations that occur higher than expected from a random sampling of a data set. Association rule mining is yet another example where the data set becomes the model, where no prior knowledge of the associations is known (i.e., a cold-start challenge). This technique is also called Market Basket Analysis, which has been used for simple cold-start customer purchase recommendations, but it also has been used in such exotic use cases as tropical storm (hurricane) intensification prediction.
  • Social network (link) analysis, where the patterns in the network (e.g., centrality, reach, degrees of separation, density, cliques, etc.) encode knowledge about the network (e.g., most authoritative or influential nodes in the network), through the application of algorithms like PageRank, without any prior knowledge about those patterns (i.e., a cold-start).
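
To illustrate one item from this list, the kNN sketch below starts with an arbitrary set of candidate values for k and lets accuracy on a small validation set pick among them. The data are invented for the illustration: two square grids of points, plus one deliberately mislabeled training point that fools k = 1.

```python
import math
from collections import Counter

def knn_predict(train, point, k):
    """Vote among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    hits = sum(knn_predict(train, p, k) == label for p, label in validation)
    return hits / len(validation)

grid = [(i * 0.5, j * 0.5) for i in range(4) for j in range(4)]
train = [(p, "A") for p in grid]
train += [((x + 5, y + 5), "B") for x, y in grid]
train.append(((0.75, 0.75), "B"))            # one deliberately mislabeled point
validation = [((0.8, 0.8), "A"), ((0.2, 1.2), "A"), ((1.3, 0.1), "A"),
              ((5.8, 5.8), "B"), ((5.2, 6.2), "B"), ((6.3, 5.1), "B")]

candidates = [1, 3, 5, len(train)]           # from "trust one neighbor" to "ask everyone"
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=scores.get)     # ties go to the smaller k
print(scores, best_k)
```

With k = 1 the mislabeled point wins the vote for its closest validation neighbor, and with k equal to the whole training set every prediction collapses to the majority class. The intermediate values of k avoid both failure modes on this toy data.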

Finally, as a bonus, we mention a special case, Recommender Engines, where the cold-start problem is a subject of ongoing research. The research challenge is to find the optimal recommendation for a new customer or for a new product that has not been seen before. Check out these articles related to this challenge:

  1. The Cold Start Problem for Recommender Systems
  2. Tackling the Cold Start Problem in Recommender Systems
  3. Approaching the Cold Start Problem in Recommender Systems

We started this article mentioning Confucius and his wisdom. Here is another form of wisdom: the RapidMiner Wisdom conference (https://rapidminer.com/wisdom/). It is a wonderful conference, with many excellent tutorials, use cases, applications, and customer testimonials. I was honored to be the keynote speaker for their 2018 conference in New Orleans, where I spoke about “Clearing the Fog around Data Science and Machine Learning: The Usual Suspects in Some Unusual Places”. You can find my slide presentation here: KirkBorne-RMWisdom2018.pdf

NOTE: Genetic Algorithms (GAs) are an example of meta-learning. They are not machine learning algorithms in themselves, but GAs can be applied across ensembles of machine learning models and tasks, in order to find the optimal model (perhaps globally optimal model) across a collection of locally optimal solutions.

Discovering and understanding patterns in highly dimensional data

Dimensionality reduction is a critical component of any solution dealing with massive data collections. Being able to sift through a mountain of data efficiently in order to find the key descriptive, predictive, and explanatory features of the collection is a fundamental required capability for coping with the Big Data avalanche. Identifying the most interesting dimensions of data is especially valuable when visualizing high-dimensional (high-variety) big data.

There is a “good news, bad news” angle here. First, the bad news: the human capacity for seeing multiple dimensions is very limited: 3 or 4 dimensions are manageable; 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms for speed and effectiveness. In this case, a “scene” refers to any small-n projection of a larger-N parameter space of variables.

In data visualization, a systematic ordered parameter sweep through an ensemble of small-n projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to see quickly any patterns or trends or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/n is large.

Consequently, a data analytics approach that combines the best of both worlds (machine algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data. One such approach is to apply Computer Vision algorithms, which are designed to emulate human perception and cognitive abilities. Another approach is to generate “interestingness metrics” that signal to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data. A specific example of the latter is latent (hidden) variable discovery.
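
As a simple illustration of an interestingness metric, the sketch below scores every two-dimensional projection of a small four-feature data set and ranks the projections so that a human viewer could look at the most promising scenes first. The absolute Pearson correlation is used here only as an assumed, deliberately simple metric, and the data are synthetic.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_projections(columns):
    """Score every pair of columns (a 2-D 'scene'), most interesting first."""
    scored = [(abs(pearson(columns[i], columns[j])), (i, j))
              for i, j in combinations(range(len(columns)), 2)]
    return sorted(scored, reverse=True)

rows = range(30)
columns = [
    [float(i) for i in rows],                      # feature 0
    [float((7 * i) % 11) for i in rows],           # feature 1: unrelated sawtooth
    [2.0 * i + (3 * i) % 5 - 2 for i in rows],     # feature 2: tracks feature 0
    [float((5 * i) % 13) for i in rows],           # feature 3: unrelated sawtooth
]
ranking = rank_projections(columns)
top_score, top_pair = ranking[0]
print(top_pair, round(top_score, 3))
```

With N features there are N(N-1)/2 such pairs, so ranking them by a metric is one way to shorten the “grand tour” when N is large.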

Latent variables are not explicitly observed but are inferred from the observed features, specifically because they are the variables that deliver the all-important (but sometimes hidden) descriptive, predictive, and explanatory power of the data set. Latent variables can also be concepts that are implicitly represented by the data (e.g., the “sentiment” of the author of a social media posting).  

Because some latent variables are “observable” in the sense that they can be generated through a “yet to be discovered” mathematical combination of several of the measured variables, these are therefore an obvious example of dimension reduction for visual exploration of large high-dimensional data.

Latent (Hidden) Variable Models are used in statistics to represent variables that are not observed directly but are inferred from the variables that are observed. Latent variables are widely used in social science, psychology, economics, life sciences and machine learning. In machine learning, many problems involve the collection of high-dimensional multivariate observations and then hypothesizing a model that explains them. In such models, the role of the latent variables is to represent properties that have not been directly observed.

After inferring the existence of latent variables, the next challenge is to understand them. This can be achieved by exploring their relationship with the observed variables (e.g., using Bayesian methods). Several correlation measures and dimensionality reduction methods such as PCA can be used to measure those relationships. Since we don’t know in advance what relationships exist between the latent variables and the observed variables, more generalized nonparametric measures like the Maximal Information Coefficient (MIC) can be used.

MIC has become popular recently, to some extent because it provides a straightforward R-squared type of estimate to measure dependency among variables in a high-dimensional data set. Since we don’t know in advance what a latent variable actually represents, it is not possible to predict the type of relationship that it might possess with the observed variables. Consequently, a nonparametric approach makes sense in the case of large high-dimensional data, for which the interrelationships among the many variables are a mystery. Exploring variables that possess the largest values of MIC can help us to understand the type of relationships that the latent variables have with the existing variables, thereby achieving both dimension reduction and a parameter space in which to conduct visual exploration of high-dimensional data.
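
The sketch below shows why a nonparametric measure helps. It is not MIC itself: MIC searches over many grid resolutions, whereas `grid_dependence` (a hypothetical name) computes mutual information on a single fixed 4-by-4 grid as a simplified stand-in. Even so, it detects a purely non-linear dependence that the Pearson correlation misses.

```python
import math
from collections import Counter

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def grid_dependence(xs, ys, bins=4):
    """Mutual information on one fixed bins-by-bins grid, scaled to [0, 1].
    A simplified stand-in: MIC itself maximizes over many grid resolutions."""
    def bin_index(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    bx = [bin_index(x, x_lo, x_hi) for x in xs]
    by = [bin_index(y, y_lo, y_hi) for y in ys]
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = sum(c / n * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items())
    return mi / math.log(bins)

xs = [i / 20 - 1 for i in range(41)]   # evenly spaced from -1 to 1
ys = [x * x for x in xs]               # a perfect but non-linear dependence
linear = pearson(xs, ys)
nonlinear = grid_dependence(xs, ys)
print(linear, nonlinear)
```

The linear correlation is essentially zero because the parabola is symmetric, while the grid-based score is clearly positive.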

The techniques described here can help data end-users to discover and understand data patterns that may lead to interesting insights within their massive data collections.


Are Your Predictive Models like Broken Clocks?

A wise philosopher (or comedian) once said, “Even a broken clock is right twice a day.” That same statement might also apply to some predictive models. Since prediction is about the future (usually), then random chance (like broken clockwork) may allow our model to be right occasionally (just by accident). The important step in the data science process that aims to reduce the danger of this occurring is the all-important cross-validation phase (or model-testing phase, which uses an independent data set). This phase is devoted to validating that our model works accurately on previously unseen data that were not used in the model training (model-building) phase.

Another way of characterizing this phase can be found in the field of Systems Engineering: V&V (Verification and Validation). In the first phase (verification), we verify that the system was built correctly (according to a set of requirements and specifications). In the second phase (validation), we validate that we built the correct system (consistent with the operational needs that the end-user, customer, or client expects the system to satisfy). We sometimes say it this way: (1) in verification, we ask “Did we build the system right?”; and (2) in validation, we ask “Did we build the right system?”

Applying the V&V systems engineering principle to data science means that we see model-testing as a two-step process. First, we verify that the model is a logical consequence of the input data used to train the model. Second, we validate that the model remains useful, accurate, and robust when applied to previously unseen data. Any data scientist who participates in Kaggle competitions understands and “lives” this process. It is often the case that our first data science model will do a great job on the “seen” data set (i.e., verification by using a “broken clock” that is right on occasion), but the model then performs poorly on the “unseen” data set. A model that does well on both data sets is a winning model (maybe not in every Kaggle competition, but certainly in real-world usage).

Using the same data set both to validate a model and to train the model would be the data science equivalent of “circular reasoning”. This will often lead to “overfitting”, where the initial model is incorrectly trained to reproduce every variation, bump, wiggle, nuance, and noisy deviation in the training data set, thus falsely exaggerating the importance of those fluctuations. “Complexity” describes our world, but it shouldn’t describe our models.

The other extreme in model-building can be just as bad: underfitting (or bias) introduced by using too few explanatory variables to model the behaviors seen in our data set. I like to believe that Albert Einstein understood data science modeling very well when he said “Everything should be made as simple as possible, but not simpler.” Building an excessively complex model (with too many parameters that follow the noise fluctuations in our data) is like putting too much confidence in a broken clock (“it’s exactly right… some of the time!”). George Box warned us to have a little humility in the face of complex data (and a complex world): “All models are wrong, but some are useful.”

Therefore, when faced with highly complex (high-variety) big data, we are also faced with the question of how to choose the “right model”. We should apply the “Goldilocks principle”: choose a model that is neither “too good” (so complex that it fits the noise in the training data) nor “too bad” (so simple that it misses the signal), i.e., a model that works well enough on both the training data set and the test data set.
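As a minimal sketch of the Goldilocks principle, the Python example below fits polynomials of increasing degree to a small synthetic data set and compares the error on the training points with the error on held-out test points. The data (a noisy sine curve), the chosen degrees, and the helper name fit_and_score are illustrative assumptions, not taken from a real project.

```python
import numpy as np

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree; return (train_rmse, test_rmse)."""
    coeffs = np.polyfit(x_train, y_train, degree)

    def rmse(x, y):
        return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

    return rmse(x_train, y_train), rmse(x_test, y_test)

# Hypothetical data: a smooth signal plus random noise.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Hold out every other point as the "unseen" test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

# Degree 1 is likely too simple, degree 9 is likely too flexible.
results = {
    degree: fit_and_score(x_train, y_train, x_test, y_test, degree)
    for degree in (1, 3, 9)
}
for degree, (train_rmse, test_rmse) in results.items():
    print(f"degree {degree}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")

# Choose the model by its error on the unseen data, not on the training data.
best_degree = min(results, key=lambda d: results[d][1])
print("degree with the lowest test error:", best_degree)
```

By construction, the training error can only shrink (up to rounding) as the degree grows, so it cannot be used on its own to choose a model. The test error is what reveals whether the extra complexity is helping or merely following the noise.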

Figure: Model complexity versus error on the training and test data sets.

(Source for graphic: http://gerardnico.com/wiki/data_mining/bias_trade-off)

Outliers, Inliers, and Other Surprises that Fly from your Data

Data can fly beyond the bounds of our models and our expectations in surprising and interesting ways. When data fly in these ways, we often find new insights and new value about the people, products, and processes that our data sources are tracking. Here are 4 simple examples of surprises that can fly from our data:

(1) Outliers – when data points are several standard deviations from the mean of your data distribution, these are traditional data outliers. These may have at least 3 possible causes: (a) a data measurement problem (in the sensor); (b) a data processing problem (in the data pipeline); or (c) an amazing unexpected discovery about your data items. The first two causes are data quality issues that must be addressed and repaired. The third case (when your data fly outside the bounds of your expectations) is golden and worthy of deeper exploration.

(2) Inliers — sometimes your data have constraints (business rules) that are inviolable (e.g., Fraction of customers that are Male + Fraction of customers that are Female = 1). A simple business example would be: Profit = Revenue minus Costs. Suppose an analyst examines these 3 numbers (Profit, Revenue, Costs) for many different entries in his business database, and he finds a data entry that is near the mean of the distribution for each of those 3 numbers. It appears (at first glance) that this entry is perfectly normal (an inlier, not an outlier), but in fact it might violate the above business rule. In that case, there is definitely a problem with these numbers — they have “flown” outside the bounds of the business rule.

(3) Nonlinear correlations — fitting a curve y=F(x) through data for the purpose of estimating values of y for new values of x is called regression. This is also an example of Predictive Analytics (we can predict future values based upon a function that was learned from the historical training data). When using higher-order functions for F(x) (especially polynomial functions), we must remember that the curves often diverge (to extreme values) beyond the range of the known data points that were used to learn the function. Such an extrapolation of the regression curve could lead to predictive outcomes that make no sense, because they fly far beyond reasonable values of our data parameters.

(4) Uplift – when two events occur together more frequently than you would expect from random chance, their mutual dependence causes uplift. Statistical lift is simply measured by: P(X,Y)/[P(X)P(Y)]. The numerator P(X,Y) is the observed joint probability (frequency) of the two events X and Y co-occurring. The denominator is the probability that X and Y would co-occur purely at random, that is, if they were independent. If X and Y are completely independent events, then the numerator will equal the denominator – in that case (mutual independence), the lift equals 1 (i.e., no uplift). Conversely, if there is a higher than random co-occurrence of X and Y, then the statistical lift flies to values that are greater than 1, and that’s uplift! And that’s interesting. Cases with significant uplift can be marketing gold for your organization: in customer recommendation engines, in fraud detection, in targeted marketing campaigns, in community detection within social networks, or in mining electronic health records for adverse drug interactions and side effects.
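The four cases above can each be checked with a few lines of code. The Python sketch below is only illustrative: the function names, the 3-standard-deviation threshold, and all of the sample numbers are assumptions made for this example, not values from a real data set.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """(1) Indices of points more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.flatnonzero(np.abs(z) > threshold)

def rule_violations(profit, revenue, costs, tol=1e-9):
    """(2) Indices of entries that break the rule Profit = Revenue - Costs."""
    profit, revenue, costs = (
        np.asarray(a, dtype=float) for a in (profit, revenue, costs)
    )
    return np.flatnonzero(np.abs(profit - (revenue - costs)) > tol)

def polynomial_estimate(x, y, degree, x_new):
    """(3) Fit a polynomial of the given degree to (x, y) and evaluate it at x_new."""
    return float(np.polyval(np.polyfit(x, y, degree), x_new))

def lift(x, y):
    """(4) Statistical lift P(X,Y) / [P(X) P(Y)] for two boolean event indicators."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return float(np.mean(x & y) / (np.mean(x) * np.mean(y)))

# (1) One reading sits far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10, 11, 9, 50]
outlier_idx = zscore_outliers(readings)

# (2) Every number in the last row looks ordinary, yet 101 - 85 is not 21.
profit = [20, 25, 22, 21]
revenue = [100, 105, 102, 101]
costs = [80, 80, 80, 85]
bad_rows = rule_violations(profit, revenue, costs)

# (3) A degree-5 fit behaves inside the known range [0, 1] but not far outside it.
x_known = np.linspace(0.0, 1.0, 11)
y_known = np.sin(2 * np.pi * x_known)
inside = polynomial_estimate(x_known, y_known, 5, 0.25)
outside = polynomial_estimate(x_known, y_known, 5, 2.0)

# (4) Hypothetical purchase indicators for items X and Y across 8 customers.
bought_x = [1, 1, 1, 0, 0, 0, 1, 0]
bought_y = [1, 1, 0, 0, 0, 0, 1, 0]
lift_xy = lift(bought_x, bought_y)

print("outlier indices:", outlier_idx)
print("rows violating the business rule:", bad_rows)
print("fit inside the data range:", inside, "and outside it:", outside)
print("lift of X and Y:", lift_xy)
```

Note that the z-score test in case (1) would never flag the bad row in case (2): an inlier can only be caught by testing the business rule itself.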

These and other such instances of high-flying data are increasingly challenging to identify in the era of big data: high volume and high variety produce big computational challenges in searching for data that fly in interesting directions (especially in complex high-dimensional data sets). To achieve efficient and effective discovery in these cases, fast automatic statistical modeling can help. For this purpose, I recommend that you check out the analytics solutions from the fast automatic modeling folks at http://soft10ware.com/.

The Soft10 software package is trained to report automatically the most significant, informative and interesting dependencies in your data, no matter which way the data fly.

(Read the full blog, with more details for the 4 cases listed above, at: https://www.linkedin.com/pulse/when-data-fly-kirk-borne)

Reach Analytics Maturity through Fast Automatic Modeling

The late great baseball legend Yogi Berra was credited with saying this gem: “The future ain’t what it used to be.” In the context of big data analytics, I am now inclined to believe that Yogi was very insightful — his statement is an excellent description of Prescriptive Analytics.

Prescriptive Analytics goes beyond Descriptive and Predictive Analytics in the maturity framework of analytics. “Descriptive” analytics delivers hindsight (telling you what did happen, by generating reports from your databases), and “predictive” delivers foresight (telling you what will happen, through machine learning algorithms). Going one better, “prescriptive” delivers insight: discovering so much about your application domain (from your collection of big data and information resources, through data science and predictive models) that you are now able to take the actions (e.g., set the conditions and parameters) needed to achieve a prescribed (better, optimal, desired) outcome.

So, if predictive analytics can use historical training data sets to tell us what will happen in the future (e.g., which products a customer will buy; where and when your supply chain will need replenishing; which vehicles in your corporate fleet will need repairs; which machines in your manufacturing plant will need maintenance; or which servers in your data center will fail), then prescriptive analytics can alter that future (i.e., the future ain’t what it used to be).

When dealing with large high-variety data sets, with many features and measured attributes, it is often difficult to build accurate models that are generally useful under a variety of conditions and that capture all of the complexities of the response functions and explanatory variables within your business application. In such cases, fast automatic modeling tools are needed. These tools can help to identify the minimum viable feature set for accurate predictive and prescriptive modeling. For this purpose, I recommend that you check out the analytics solutions from the fast automatic modeling folks at http://soft10ware.com/.

The Soft10 software package is trained to observe quickly and report automatically the most significant, informative and explanatory dependencies in your data. Those capabilities are the “secret sauce” in insightful prescriptive analytics, and they coincide nicely with another insightful quote from Yogi Berra: “You can observe a lot by just watching.”

(Read the full blog at: https://www.linkedin.com/pulse/prescriptive-analytics-future-aint-what-used-kirk-borne)

Predictive versus Prescriptive Analytics

Predictive Analytics (given X, find Y) vs. Prescriptive Analytics (given Y, find X)
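The distinction in the caption above can be made concrete with a small sketch. It assumes a hypothetical history of advertising spend (X) and sales (Y) that happens to be exactly linear, and the function names predict and prescribe are invented for this illustration. The prescriptive step here is a simple search over candidate actions; real prescriptive analytics would usually involve a richer model and explicit constraints.

```python
import numpy as np

# Hypothetical history: ad spend (X) and the resulting sales (Y).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([12.0, 14.0, 16.0, 18.0, 20.0])

# Learn Y as a straight-line function of X from the historical data.
slope, intercept = np.polyfit(spend, sales, 1)

def predict(x):
    """Predictive: given X, find Y."""
    return slope * x + intercept

def prescribe(y_target, candidates):
    """Prescriptive: given a desired Y, find the candidate X whose predicted Y is closest."""
    candidates = np.asarray(candidates, dtype=float)
    best = np.argmin(np.abs(predict(candidates) - y_target))
    return float(candidates[best])

predicted_sales = float(predict(6.0))
recommended_spend = prescribe(25.0, np.linspace(0.0, 10.0, 101))

print("predicted sales for a spend of 6:", predicted_sales)
print("spend recommended to reach sales of 25:", recommended_spend)
```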

What Motivates a Data Scientist?

I recently had the pleasure of being interviewed by Manu Jeevan for his Big Data Made Simple blog.  He asked me several questions:

  • How did you get into data science?
  • What exactly is enterprise data science?
  • How does Booz Allen Hamilton use data science?
  • What skills should business executives have to communicate effectively with data scientists?
  • How is big data changing the world? (Please give us interesting examples)
  • What are your go-to tools for doing data science?
  • In your TedX talk “Big Data, Small World” you gave special attention to association discovery. Is there a specific reason for that?
  • The Data Scientist has been called the sexiest job of the 21st century. Do you agree?
  • What advice would you give to people aspiring for a long career in data science?

All of these questions were ultimately aimed at understanding the key underlying question: “What motivates you to work in data science?” The question about enterprise data science comes the closest to identifying what really motivates me: I am exceedingly fortunate every day to be given the opportunity to work with a fantastic team of data scientists at Booz Allen Hamilton, with the mandate to explore data as a corporate asset and to exploit data science as a core capability in order to achieve more profound discoveries, to make better (data-driven) decisions, and to propel new innovations across numerous domains, industries, agencies, and organizations. My Data Science Declaration also sums up these motivating factors for me.

You can see the full scope of my answers to the above questions here: http://bigdata-madesimple.com/interview-with-leading-data-science-expert-kirk-borne/.

Definitive Guide to Data Literacy For All – A Reading List

(UPDATE – April 8, 2021) See this excellent new book by Jordan Morrow: “Be Data Literate: The Data Literacy Skills Everyone Needs To Succeed”, available at amzn.to/3rdWfsn

One of the most important roles that we should be embracing right now is training the next-generation workforce in the art and science of data. Data Literacy is a fundamental literacy that should be imparted at the earliest levels of learning, and it should continue through all years of education. Education research has shown the value of using data in the classroom to teach any subject. So, I am not advocating the teaching of hard-core data science to children, but I definitely promote the use of data mining and data science applications in the teaching of other subjects (perhaps, in all subjects!). See my “Using Data in the Classroom Reading List” on this subject. See also the book “Data Literacy – A User’s Guide” and this book:

The Basics of Data Literacy, available at https://amzn.to/2IZ2BYY

And see this book:

Data Literacy: A User’s Guide, available at https://amzn.to/2uLPfKB

I encourage you to read a position paper that I wrote (along with a few astronomy colleagues) for the US National Academies of Science in 2009 that addressed the data science literacy requirements in astronomy. Though focused on the needs in astronomy workforce development for the coming decade, the paper also contains more general discussion of “data literacy for the masses” that is applicable to any and all disciplines, domains, and organizations: “Data Science For The Masses.”

Two “…For Dummies” books can also help to bring data literacy to a much larger audience (of students, business leaders, government agencies, educators, etc.). Those books are: “Data Science For Dummies” by Lillian Pierson, and “Data Mining For Dummies” by Meta Brown.

Finally, here is one more that I believe is an excellent data literacy companion: The Data Journalism Handbook.

Update (April 2016) – The following site has a wealth of information on the use of “Data in Education”: http://www.ands.org.au/working-with-data/publishing-and-reusing-data/data-in-education

(Read more here: http://www.datasciencecentral.com/profiles/blogs/dummies-for-data-science-a-reading-list)


Disclosure statement: As an Amazon Associate I earn from qualifying purchases.

