Tag Archives: Big Data

Data Science Training Opportunities

A few years ago, I generated a list of places to receive data science training. That list has become a bit stale. So, I have updated the list, adding some new opportunities, keeping many of the previous ones, and removing the obsolete ones. Here are 30 training opportunities that I encourage you to explore:

  1. The Booz Allen Field Guide to Data Science
  2. NVIDIA Deep Learning Institute
  3. Metis Data Science Training
  4. Leada’s online analytics labs
  5. Data Science Training by General Assembly
  6. Learn Data Science Online by DataCamp
  7. (600+) Colleges and Universities with Data Science Degrees
  8. Data Science Master’s Degree Programs
  9. Data Analytics, Machine Learning, & Statistics Courses at edX
  10. Data Science Certifications (by AnalyticsVidhya)
  11. Learn Everything About Analytics (by AnalyticsVidhya)
  12. NYC Data Science Academy
  13. Big Bang Data Science Solutions
  14. CommonLounge
  15. IntelliPaat Online Training
  16. DataQuest
  17. NCSU Institute for Advanced Analytics
  18. District Data Labs
  19. Data School
  20. Galvanize
  21. Coursera
  22. Udacity Nanodegree Program to Become a Data Scientist
  23. Udemy – Data & Analytics
  24. Insight Data Science Fellows Program
  25. The Open Source Data Science Masters
  26. Jigsaw Academy Post Graduate Program in Data Science & Machine Learning
  27. O’Reilly Media Learning Paths
  28. Data Engineering and Data Science Training by Go Data Driven
  29. 18 Resources to Learn Data Science Online (by Simplilearn)
  30. Top Online Data Science Courses to Learn Data Science

Follow Kirk Borne on Twitter @KirkDBorne

Field Guide to Data Science
Learn the what, why, and how of Data Science and Machine Learning here.

Bias-Busting with Diversity in Data

Diversity in data (high data variety) is one of the three defining characteristics of big data, along with high data volume and high data velocity. We discussed the power and value of high-variety data in a previous article: “The Five Important D’s of Big Data Variety”. We won’t repeat those lessons here; instead, we focus specifically on the bias-busting power of high-variety data, which was actually the last of the five D’s mentioned in the earlier article: Decreased model bias.

Here, we broaden our meaning of “bias” to go beyond model bias. Model bias has the technical statistical meaning of “underfitting”, which essentially means that there is more information and structure in the data than our model has captured. In the current context, we apply a broader definition of bias: lacking a neutral viewpoint, or having a viewpoint that is partial. We will call this natural bias, since the examples can be considered as “naturally occurring” without obvious intent. This article does not elaborate on personal bias (which might be intentional), though the cause for that kind of prejudice is essentially the same: not considering and taking into account the full knowledge and understanding of the person or entity that is the subject of the bias.

We wrote a longer complete version of this article here: “Busting Bias with More Data Variety” at the Western Digital DataMakesPossible.com blog site.

In that full version of this article, we go on to describe several examples of natural bias and then to present a recommended bias-busting remedy for those of us working in the realm of data science. We refer to that remedy as the CCDI data & analytics strategy: Collect, Curate, Differentiate, and Innovate.

Here is one of the four examples of natural bias that you will find in the longer, complete version of the article:

  • An example of natural bias comes from a famous cartoon. The cartoon shows three or more blind men (or blindfolded men) feeling an elephant. They each feel a different aspect of the elephant: the tail, a tusk, an ear, the body, a leg — and consequently they each offer a different interpretation of what they believe this thing is (which they cannot see). They say it might be a rope (the tail), or a spear (the tusk), or a large fan (the ear), or a wall (the body), or a tree trunk (the leg). Only after the blindfolds are removed (or an explanation is given) do they finally “see” the full truth of this large complex reality. It has many different features, facets, and characteristics. Focusing on only one of those features and insisting that this partial view describes the whole thing would be foolish. We have similar complex systems in our organizations, whether it is the human body (in healthcare), or our population of customers (in marketing), or the Earth (in climate science), or different components in a complex system (like a manufacturing facility), or our students (in a classroom), or whatever. Unless we break down the silos and start sharing our data (insights) about all the dimensions, viewpoints, and perspectives of our complex system, we will consequently be drawn into biased conclusions and actions, and thus miss the key insights that enable us to understand the wonderful complexity and diversity of the thing in its entirety. Integrating the many data sources enables us to arrive at the “single correct view” of the thing: the 360 view!
Collecting high-variety data from diverse sources, connecting the dots, and building the 360 view of our domain is not only the data silo-busting thing to do. It is also the bias-busting thing to do. High-variety data makes that possible, and there is no shortage of biases for high-variety data to bust, including cognitive bias, confirmation bias, salience bias, and sampling bias, just to name a few! …
Read the full story here… “Busting Bias with More Data Variety”

Data Scientist’s Dilemma – The Cold Start Problem

The ancient philosopher Confucius has been credited with saying “study your past to know your future.” This wisdom applies not only to life but to machine learning also. Specifically, the availability and application of labeled data (things past) for the labeling of previously unseen data (things future) is fundamental to supervised machine learning.

Without labels (diagnoses, classes, known outcomes) in past data, how do we make progress in labeling (explaining) future data? This would be a problem.

A related problem also arises in unsupervised machine learning. In these applications, there is no requirement or presumption regarding the existence of labeled training data — we are essentially parameterizing or characterizing the patterns in the data (e.g., the trends, correlations, segments, clusters, associations).

Many unsupervised learning models can converge more readily and be more valuable if we know in advance which parameterizations are best to choose. If we cannot know that (i.e., because it truly is unsupervised learning), then we would like to know at least that our final model is optimal (in some way) in explaining the data.

In both of these applications (supervised and unsupervised machine learning), if we don’t have these initial insights and validation metrics, then how does such model-building get started and get moving towards the optimal solution?

This challenge is known as the cold-start problem! The solution to the problem is easy (sort of): We make a guess — an initial guess! Usually, that would be a totally random guess.

That sounds so… so… random! How do we know whether it’s a good initial guess? How do we progress our model (parameterizations) from that random initial choice? How do we know that our progression is moving towards more accurate models? How? How? How?

This can be a real challenge. Of course nobody said the “cold start” problem would be easy. Anyone who has ever tried to start a very cold car on a frozen morning knows the pain of a cold start challenge. Nothing can be more frustrating on such a morning. But, nothing can be more exhilarating and uplifting on such a morning than that moment when the engine starts and the car begins moving forward with increasing performance.

The experiences for data scientists who face cold-start problems in machine learning can be very similar to those, especially the excitement when our models begin moving forward with increasing performance.

We will itemize several examples at the end. But before we do that, let’s address the objective function. That is the true key that unlocks performance in a cold-start challenge.  That’s the magic ingredient in most of the examples that we will list.

The objective function (also known as cost function, or benefit function) provides an objective measure of model performance. It might be as simple as the percentage of class labels that the model got right (in a classification model), or the sum of the squares of the deviations of the points from the model curve (in a regression model), or the compactness of the clusters relative to their separation (in a clustering analysis).
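
To make those three kinds of objective function concrete, here is a minimal Python sketch. The function names (accuracy, sum_squared_error, compactness_ratio) and the toy inputs are our own illustrative assumptions, not taken from any particular library, and the clustering measure is just one simple way (for one-dimensional data) to compare compactness against separation.

```python
def accuracy(true_labels, predicted_labels):
    """Benefit function: fraction of class labels the model got right."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def sum_squared_error(xs, ys, model):
    """Cost function: sum of squared deviations of the points from the model curve."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


def compactness_ratio(clusters):
    """Cost function for 1-D clusters: average distance of points to their own
    cluster mean, divided by the smallest gap between any two cluster means.
    Lower is better. Assumes at least two clusters with distinct means."""
    centroids = [sum(c) / len(c) for c in clusters]
    n_points = sum(len(c) for c in clusters)
    within = sum(abs(x - m) for c, m in zip(clusters, centroids) for x in c) / n_points
    separation = min(
        abs(a - b) for i, a in enumerate(centroids) for b in centroids[i + 1:]
    )
    return within / separation


# Toy usage with made-up values
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))          # 0.75
print(sum_squared_error([0, 1, 2], [1, 3, 5], lambda x: 2 * x))     # 3
print(compactness_ratio([[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]))
```

Each function turns "how good is this model?" into a single number, which is exactly what we need in order to compare one model against the next.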

The value of the objective function lies not only in its final value (i.e., giving us a quantitative overall model performance rating). Its great (perhaps greatest) value is realized in guiding our progression from the initial random model (cold-start zero point) to that final successful (hopefully, optimal) model. In those intermediate steps it serves as an evaluation (or validation) metric.

By measuring the evaluation metric at step zero (cold-start), then measuring it again after making adjustments to the model parameters, we learn whether our adjustments led to a better performing model or worse performance. We then know whether to continue making model parameter adjustments in the same direction or in the opposite direction. This is called gradient descent.

Gradient descent methods basically find the slope (i.e., the gradient) of the performance error curve as we progress from one model to the next. As we learned in grade school algebra class, we need two points to estimate a slope. Therefore, it is only after we have run and evaluated two models that we will have two performance points. The slope between those two points then informs our next choice of model parameter adjustments: either (a) keep adjusting in the same direction as the previous step (if the performance error decreased) to continue descending the error curve; or (b) adjust in the opposite direction (if the performance error increased) to turn around and start descending the error curve.
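
Here is a small sketch of that two-point idea for a model with a single parameter. The error curve, the starting guesses, and the learning_rate value are all hypothetical choices for illustration. Note that this version estimates the slope from the two most recent models (a finite-difference estimate), whereas most practical implementations compute the gradient analytically.

```python
def error(w):
    """Hypothetical performance error curve with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient_descent_two_point(error_fn, w0, w1, learning_rate=0.1, steps=60):
    """Descend the error curve using the slope between the two latest models."""
    w_prev, w_curr = w0, w1
    e_prev, e_curr = error_fn(w_prev), error_fn(w_curr)
    for _ in range(steps):
        if w_curr == w_prev:
            break  # no change between models, so no slope can be estimated
        # Slope of the error curve estimated from two performance points
        slope = (e_curr - e_prev) / (w_curr - w_prev)
        # Positive slope: step backward. Negative slope: step forward.
        w_next = w_curr - learning_rate * slope
        w_prev, e_prev = w_curr, e_curr
        w_curr, e_curr = w_next, error_fn(w_next)
    return w_curr, e_curr


# Cold start: two arbitrary initial guesses for the parameter
best_w, best_error = gradient_descent_two_point(error, w0=0.0, w1=0.5)
print(best_w, best_error)
```

The sign of the slope decides the direction of the next adjustment, and its size decides how big the adjustment is, so the steps naturally shrink as we approach the bottom of the error curve.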

Note that hill-climbing is the opposite of gradient descent, but essentially the same thing. Instead of minimizing error (a cost function), hill-climbing focuses on maximizing accuracy (a benefit function). Again, we measure the slope of the performance curve from two models, then proceed in the direction of better-performing models. In both cases (hill-climbing and gradient descent), we hope to reach an optimal point (maximum accuracy or minimum error), and then declare that to be the best solution. And that is amazing and satisfying when we remember that we started (as a cold-start) with an initial random guess at the solution.

When our machine learning model has many parameters (which could be thousands, or even millions, for a deep neural network), the calculations are more complex (perhaps involving a multi-dimensional gradient calculation over arrays of parameters known as tensors). But the principle is the same: quantitatively discover at each step in the model-building progression which adjustments (size and direction) are needed in each one of the model parameters in order to progress towards the optimal value of the objective function (e.g., minimize errors, maximize accuracy, maximize goodness of fit, maximize precision, minimize false positives, etc.). In deep learning, as in typical neural network models, the method by which those adjustments to the model parameters are estimated (i.e., for each of the edge weights between the network nodes) is called backpropagation. That is still based on gradient descent.
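
As a rough illustration of backpropagation, here is a sketch of about the smallest network we could think of: one input, one hidden tanh node, and one output, giving four parameters in total. The network shape, the toy data, the learning rate, and the number of steps are all assumptions made for this example. Real deep learning frameworks automate the same chain-rule bookkeeping across many more parameters.

```python
import math
import random


def forward(params, x):
    """Tiny network: hidden = tanh(w1*x + b1), output = w2*hidden + b2."""
    w1, b1, w2, b2 = params
    hidden = math.tanh(w1 * x + b1)
    return hidden, w2 * hidden + b2


def loss(params, data):
    """Objective function: half the sum of squared errors over the data."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)


def backprop(params, data):
    """Gradient of the loss for each parameter, via the chain rule."""
    w1, b1, w2, b2 = params
    grads = [0.0, 0.0, 0.0, 0.0]
    for x, t in data:
        hidden, y = forward(params, x)
        d_out = y - t                                # d(loss)/d(output)
        d_pre = d_out * w2 * (1.0 - hidden * hidden)  # pushed back through tanh
        grads[0] += d_pre * x      # w1
        grads[1] += d_pre          # b1
        grads[2] += d_out * hidden  # w2
        grads[3] += d_out          # b2
    return grads


def train(data, learning_rate=0.05, steps=500, seed=0):
    rng = random.Random(seed)
    params = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # cold start: random weights
    history = [loss(params, data)]
    for _ in range(steps):
        grads = backprop(params, data)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
        history.append(loss(params, data))
    return params, history


data = [(-1.0, -0.5), (0.0, 0.0), (1.0, 0.5)]  # made-up (input, target) pairs
params, history = train(data)
print(history[0], history[-1])  # loss at the cold start vs. after training
```

Every parameter gets its own adjustment (size and direction) at every step, all driven by the one objective function.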

One way to think about gradient descent, backpropagation, and perhaps all machine learning is this: “Machine Learning is the set of mathematical algorithms that learn from experience. Good judgment comes from experience. And experience comes from bad judgment.” In our case, the initial guess for our random cold-start model can be considered “bad judgment”, but then experience (i.e., the feedback from validation metrics during gradient descent) brings “good judgment” (better models) into our model-building workflow.

Here are ten examples of cold-start problems in data science where the algorithms and techniques of machine learning produce the good judgment in model progression toward the optimal solution:

  • Clustering analysis (such as K-Means Clustering), where the initial cluster means and the number of clusters are not known in advance (and thus are chosen randomly initially), but the compactness of the clusters can be used to evaluate, iterate, and improve the set of clusters in a progression to the final optimum set of clusters (i.e., the most compact and best separated clusters).
  • Neural networks, where the initial weights on the network edges are assigned randomly (a cold-start), but backpropagation is used to iterate the model to the optimal network (with highest classification performance).
  • TensorFlow deep learning, which uses the same backpropagation technique as simpler neural networks, but the calculation of the weight adjustments is made across a very high-dimensional parameter space of deep network layers and edge weights using tensors.
  • Regression, which uses the sum of the squares of the deviations of the points from the model curve in order to find the best-fit curve. In linear regression, there is a closed-form solution (derivable from the linear least-squares technique). The solution for non-linear regression is not typically a closed-form set of mathematical equations, but the minimization of the sum of the squares of deviations still applies — gradient descent can be used in an iterative workflow to find the optimal curve. Note that K-Means Clustering is actually an example of piecewise regression.
  • Nonconvex optimization, where the objective function has many hills and valleys, so that gradient descent and hill-climbing will typically converge only to a local optimum, not to the global optimum. Techniques like genetic algorithms, particle swarm optimization (when the gradient cannot be calculated), and other evolutionary computing methods are used to generate lots of random (cold-start) models and then iterate each of them until you find the global optimum (or until you run out of time and resources, and then pick the best one that you could find). [See my graphic attached below that illustrates a sample use case for genetic algorithms. See also the NOTE below the graphic about Genetic Algorithms, which also applies to other evolutionary algorithms, indicating that these are not machine learning algorithms specifically, but they are actually meta-learning algorithms]
  • kNN (k-Nearest Neighbors), which is a supervised learning technique in which the data set itself becomes the model. In other words, the assignment of a new data point to a particular group (which may or may not have a class label or a particular meaning yet) is based simply upon finding which category (group) of existing data points is in the majority when you take a vote of the nearest neighbors to the new data point. The number of nearest neighbors that are to be examined is some number k, which can be initially arbitrary (a cold-start), but then it is adjusted to improve model performance.
  • Naive Bayes classification, which applies Bayes’ theorem to a large data set with class labels on the data items, but for which some combinations of attributes and features are not represented in the training data (i.e., a cold-start challenge). By assuming that the different attributes are mutually independent features of the data items, one can estimate the posterior probability of each class label for a new data item with a feature vector (set of attributes) that is not found in the training data. This can be viewed as the simplest form of a Bayes Belief Network (BBN) and is another example of where the data set becomes the model, where the frequency of occurrence of the different attributes individually can inform the expected frequency of occurrence of different combinations of the attributes.
  • Markov modeling (Belief Networks for Sequences) is an extension of BBN to sequences, which can include web logs, purchase patterns, gene sequences, speech samples, videos, stock prices, or any other temporal or spatial or parametric sequence.
  • Association rule mining, which searches for co-occurring associations that occur higher than expected from a random sampling of a data set. Association rule mining is yet another example where the data set becomes the model, where no prior knowledge of the associations is known (i.e., a cold-start challenge). This technique is also called Market Basket Analysis, which has been used for simple cold-start customer purchase recommendations, but it also has been used in such exotic use cases as tropical storm (hurricane) intensification prediction.
  • Social network (link) analysis, where the patterns in the network (e.g., centrality, reach, degrees of separation, density, cliques, etc.) encode knowledge about the network (e.g., most authoritative or influential nodes in the network), through the application of algorithms like PageRank, without any prior knowledge about those patterns (i.e., a cold-start).
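
Taking the first item in the list above (K-Means Clustering) as an illustration, here is a minimal sketch of a cold start followed by iterative improvement. The toy points, the seed, and the choice of k = 2 are assumptions for the example, and this is a bare-bones version of the standard assign-then-update loop, not a production implementation.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iters=100):
    """Basic K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # cold start: random initial cluster means
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]  # keep the old mean if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved, so the clusters are stable
        centroids = new_centroids
    # Compactness objective: total squared distance to the nearest centroid
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, inertia = k_means(points, k=2)
print(sorted(centroids), inertia)
```

On data with many hills and valleys in the objective function, a single random start like this may settle on a local optimum, which is why it is common to repeat the procedure from several different cold starts and keep the most compact result.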

Finally, as a bonus, we mention a special case, Recommender Engines, where the cold-start problem is a subject of ongoing research. The research challenge is to find the optimal recommendation for a new customer or for a new product that has not been seen before. Check out these articles  related to this challenge:

  1. The Cold Start Problem for Recommender Systems
  2. Tackling the Cold Start Problem in Recommender Systems
  3. Approaching the Cold Start Problem in Recommender Systems

We started this article mentioning Confucius and his wisdom. Here is another form of wisdom: the RapidMiner Wisdom conference (https://rapidminer.com/wisdom/). It is a wonderful conference, with many excellent tutorials, use cases, applications, and customer testimonials. I was honored to be the keynote speaker for their 2018 conference in New Orleans, where I spoke about “Clearing the Fog around Data Science and Machine Learning: The Usual Suspects in Some Unusual Places”. You can find my slide presentation here: KirkBorne-RMWisdom2018.pdf

NOTE: Genetic Algorithms (GAs) are an example of meta-learning. They are not machine learning algorithms in themselves, but GAs can be applied across ensembles of machine learning models and tasks, in order to find the optimal model (perhaps globally optimal model) across a collection of locally optimal solutions.
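
To illustrate the nonconvex optimization example and the NOTE above, here is a toy sketch of a genetic algorithm searching a bumpy one-dimensional benefit function. The fitness function, population size, mutation scale, and the specific selection and crossover rules are all assumptions chosen for brevity. Many other GA designs exist, and in the meta-learning setting each "individual" would be a whole model or model configuration instead of a single number.

```python
import math
import random


def fitness(x):
    """Hypothetical nonconvex benefit function with several hills on [0, 10]."""
    return math.sin(x) + 0.5 * math.sin(3.0 * x) + 0.1 * x


def genetic_search(fitness_fn, lo, hi, pop_size=40, generations=60,
                   mutation_scale=0.3, seed=0):
    rng = random.Random(seed)
    # Cold start: a whole population of random guesses
    population = [rng.uniform(lo, hi) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness_fn, reverse=True)
        best_history.append(fitness_fn(population[0]))
        next_gen = population[:2]  # elitism: the two fittest survive unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: the fitter of a few random candidates breeds
            mom = max(rng.sample(population, 3), key=fitness_fn)
            dad = max(rng.sample(population, 3), key=fitness_fn)
            mix = rng.random()
            child = mix * mom + (1.0 - mix) * dad    # blend crossover
            child += rng.gauss(0.0, mutation_scale)  # random mutation
            next_gen.append(min(hi, max(lo, child)))  # keep within bounds
        population = next_gen
    best = max(population, key=fitness_fn)
    best_history.append(fitness_fn(best))
    return best, best_history


best_x, best_history = genetic_search(fitness, 0.0, 10.0)
print(best_x, best_history[-1])
```

No gradient is calculated anywhere. Because the fittest individuals always survive, the best fitness found can never get worse from one generation to the next, though there is still no guarantee that the final answer is the global optimum.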

Variety is the Secret Sauce for Big Discoveries in Big Data

When I was out for a walk recently, I heard a loud low-flying aircraft passing overhead. This was not unusual since we live in the flight path of planes landing at a major international airport about 10 miles from our home. In this case, I thought to myself that the sound seemed more directly overhead and lower than normal as well as being suggestive of a larger than average jet aircraft.

I realized that in my one simple thought, I had made three different inferences from a single stream of data. The data stream was the audible sound of the aircraft. The three inferences were about the altitude (lower than normal), the size (larger than average), and the flight path (more overhead). When I looked up, my tri-inference hypothesis was confirmed. The plane was a very large, low-flying jet for a major overnight shipping company. The slightly unusual flight path may have been associated with the fact that these planes are probably instructed to land on a different runway at the airport than the usual commercial passenger airlines’ flights – consequently, the altitude and location were slightly different from the slightly smaller commercial passenger airlines that pass overhead every day.

This situation caused me to reflect on how often we can jump to conclusions, infer a hypothesis, and (maybe without as much proof as in this case) we assume that our conclusion is true.

For the modern digital organization, the proof of any inference (that drives decisions) should be in the data! Rich and diverse data collections enable more accurate and trustworthy conclusions.

I frequently refer to the era of big data as “the end of demographics”. By that, I mean that we now have many more features, attributes, data sources, and insights into each entity in our domain: people, processes, and products. These multiple data sources enable a “360 view” of the entity, thus empowering a more personalized (even hyper-personalized) understanding of and response to the needs of that unique entity. In “big data language”, we are talking about one of the 3 V’s of big data: big data Variety!

High variety is one of the foundational key features of big data — we now measure many more features, characteristics, and dimensions of insight into nearly everything due to the plethora of data sources, sensors, and signals that we measure, monitor, and mine. Consequently, we no longer need to rely on a limited number of features and attributes when making decisions, taking actions, and generating inferences. We can make better, tailored, more personalized decisions and actions. Every entity is unique! That marks the end of demographics.

Here is another example: suppose that a person goes to their doctor to report problems with painful headaches. That is a single symptom (headache pain) — a single data source, a single signal, a single sensor. However, one could imagine a large number of possible inferences from that one single signal. The headaches could be caused by insufficient sleep (sleep apnea), high blood pressure, pregnancy, or a brain tumor. Obviously, each one of these diagnoses carries a seriously different course of action and treatment.

In “data science language”, what we are describing are different segments (clusters) in the hyperspace of symptoms and causes in which the many causes (clusters) are projected on top of one another (overlap one another) in the symptom space. The way that a data scientist resolves that degeneracy (another data science word) is to introduce more parameters (higher variety data) in order to “look at” those overlapping clusters from different angles and perspectives, thus resolving the different diagnosis clusters. High variety data enables the discovery of multiple clusters, and eventually identifies the correct cluster (correct diagnosis, in this case).
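
Here is a tiny numerical sketch of that idea. The patient records are entirely made up for illustration, and the separation score (gap between group means in units of their average spread) is just one simple choice among many.

```python
from statistics import mean, pstdev

# Hypothetical patient records: (headache severity 1-10, systolic blood pressure)
insufficient_sleep = [(6, 118), (7, 122), (8, 120), (7, 116)]
high_blood_pressure = [(7, 165), (6, 158), (8, 172), (7, 160)]


def separation(group_a, group_b):
    """Gap between two group means, in units of their average spread."""
    spread = (pstdev(group_a) + pstdev(group_b)) / 2.0
    return abs(mean(group_a) - mean(group_b)) / spread


def feature(records, index):
    return [record[index] for record in records]


# One signal only: the two diagnosis clusters sit on top of one another
by_severity = separation(feature(insufficient_sleep, 0),
                         feature(high_blood_pressure, 0))

# Add a second, different signal: the clusters pull apart
by_pressure = separation(feature(insufficient_sleep, 1),
                         feature(high_blood_pressure, 1))

print(by_severity, by_pressure)
```

In the symptom-only view the two groups are indistinguishable. Adding one more kind of measurement is what resolves them.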

Higher variety data means that we are adding data from other sensors, other signals, other sources, and of different types. Going back to our low-flying airplane example, this has the following application: I not only heard the aircraft (sound = audio data), but I also looked at it (sight = visual data) and I observed its flight path (dynamic change over time = time series data). The proof of my inference about the airplane was in the data! Additional data sources provided the variety of data signals that were needed in order to derive a correct conclusion.

Similarly, when you go to the doctor with that headache, the doctor will start asking about other symptoms (e.g., lack of appetite; or other pains) and may order other medical tests (blood pressure checks, or other lab results). Those additional data sources and sensors provide the variety of data signals that are needed in order to derive the correct diagnosis.

These examples (low-flying aircraft, and headache pain) are representative analogies of a large number of different use cases in every organization, every business, and every process. The more data you have, the better you are able to detect and discover interesting and important phenomena and events. However, the more variety of data you have, the better you are able to correctly diagnose, interpret, understand, gain insights from, and take appropriate action in response to those phenomena and events.

High-variety data is the fuel that powers these insights, because variety is definitely the secret sauce for bigger and better discovery from big data collections.


The Definitive Q&A Guide for Aspiring Data Scientists

I was asked five questions by Alex Woodie of Datanami for his article, “So You Want To Be A Data Scientist”. He used a few snippets from my full set of answers. The longer version of my answers provided additional advice. For aspiring data scientists of all ages, I provide in my article at MapR the full, unabridged version of my answers, which may help you even more to achieve your goal.  Here are Alex’s questions. (Note: I paraphrase the original questions in quotes below.)

1. “What is the number one piece of advice you give to aspiring data scientists?”

2. “What are the most important skills for an aspiring data scientist to acquire?”

3. “Is it better for a person to stay in school and enroll in a graduate program, or is it better to acquire the skills on-the-job?”

4. “For someone who stays in school, do you recommend that they enroll in a program tailored toward data science, or would they get the requisite skills in a ‘hard science’ program such as astrophysics (like you)?”

5. “Do you see advances in analytic packages replacing the need for some of the skills that data scientists have traditionally had, such as programming skills (Python, Java, etc.)?”

Find all of my answers at “The Definitive Q&A for Aspiring Data Scientists”.


Definitive Guides to Data Science and Analytics Things

The Definitive Guide to anything should be a helpful, informative road map to that topic, including visualizations, lessons learned, best practices, application areas, success stories, suggested reading, and more.  I don’t know if all such “definitive guides” can meet all of those qualifications, but here are some that do a good job:

  1. The Field Guide to Data Science (big data analytics by Booz Allen Hamilton)
  2. The Data Science Capability Handbook (big data analytics by Booz Allen Hamilton)
  3. The Definitive Guide to Becoming a Data Scientist (big data analytics)
  4. The Definitive Guide to Data Science – The Data Science Handbook (analytics)
  5. The Definitive Guide to doing Data Science for Social Good (big data analytics, data4good)
  6. The Definitive Q&A Guide for Aspiring Data Scientists (big data analytics, data science)
  7. The Definitive Guide to Data Literacy for all (analytics, data science)
  8. The Data Analytics Handbook Series (big data, data science, data literacy by Leada)
  9. The Big Analytics Book (big data, data science)
  10. The Definitive Guide to Big Data (analytics, data science)
  11. The Definitive Guide to the Data Lake (big data analytics by MapR)
  12. The Definitive Guide to Business Intelligence (big data, business analytics)
  13. The Definitive Guide to Natural Language Processing (text analytics, data science)
  14. A Gentle Guide to Machine Learning (analytics, data science)
  15. Building Machine Learning Systems with Python (a non-definitive guide) (data analytics)
  16. The Definitive Guide to Data Journalism (journalism analytics, data storytelling)
  17. The Definitive “Getting Started with Apache Spark” ebook (big data analytics by MapR)
  18. The Definitive Guide to Getting Started with Apache Spark (big data analytics, data science)
  19. The Definitive Guide to Hadoop (big data analytics)
  20. The Definitive Guide to the Internet of Things for Business (IoT, big data analytics)
  21. The Definitive Guide to Retail Analytics (customer analytics, digital marketing)
  22. The Definitive Guide to Personalization Maturity in Digital Marketing Analytics (by SYNTASA)
  23. The Definitive Guide to Nonprofit Analytics (business intelligence, data mining, big data)
  24. The Definitive Guide to Marketing Metrics & Analytics
  25. The Definitive Guide to Campaign Tagging in Google Analytics (marketing, SEO)
  26. The Definitive Guide to Channels in Google Analytics (SEO)
  27. A Definitive Roadmap to the Future of Analytics (marketing, machine learning)
  28. The Definitive Guide to Data-Driven Attribution (digital marketing, customer analytics)
  29. The Definitive Guide to Content Curation (content-based marketing, SEO analytics)
  30. The Definitive Guide to Collecting and Storing Social Profile Data (social big data analytics)
  31. The Definitive Guide to Data-Driven API Testing (analytics automation, analytics-as-a-service)
  32. The Definitive Guide to the World’s Biggest Data Breaches (visual analytics, privacy analytics)



Blogging My Way Through Data Science, Big Data, and Analytics

I frequently write blog posts on other sites.  You can find those articles here (updated March 21, 2016):

I also write “one-off” blog posts, such as these examples:


What Motivates a Data Scientist?

I recently had the pleasure of being interviewed by Manu Jeevan for his Big Data Made Simple blog.  He asked me several questions:

  • How did you get into data science?
  • What exactly is enterprise data science?
  • How does Booz Allen Hamilton use data science?
  • What skills should business executives have to effectively communicate with data scientists?
  • How is big data changing the world? (Please give us interesting examples)
  • What are your go-to tools for doing data science?
  • In your TEDx talk “Big Data, Small World”, you gave special attention to association discovery. Is there a specific reason for that?
  • The Data Scientist has been called the sexiest job of the 21st century. Do you agree?
  • What advice would you give to people aspiring for a long career in data science?

All of these questions were ultimately aimed at understanding the key underlying question: “What motivates you to work in data science?” The question about enterprise data science really comes the closest to identifying what really motivates me — that is, I am exceedingly fortunate every day to be given the opportunity to work with a fantastic team of data scientists at Booz Allen Hamilton, with the mandate to explore data as a corporate asset and to exploit data science as a core capability in order to achieve more profound discoveries, to make better (data-driven) decisions, and to propel new innovations across numerous domains, industries, agencies, and organizations. My Data Science Declaration also sums up these motivating factors for me.

You can see the full scope of my answers to the above questions here: http://bigdata-madesimple.com/interview-with-leading-data-science-expert-kirk-borne/.


Analytics Maturity Models

In the world of big data analytics, there are several emerging standards for measuring Analytics Capability Maturity within organizations. One of these has been presented in the TIBCO Analytics Maturity Journey – their six steps toward analytics maturity are: Measure, Diagnose, Predict and Optimize, Operationalize, Automate, and Transform. Another example is presented through the SAS Analytics Assessment, which evaluates business analytics readiness and capabilities in several areas. The B-eye Network Analytics Maturity Model mimics software engineering’s CMM (Capability Maturity Model) – their 6 levels of maturity are: Level 0 = Incomplete; Level 1 = Performed; Level 2 = Managed; Level 3 = Defined; Level 4 = Quantitatively Managed; and Level 5 = Optimizing.

The most “mature” standard in the field is probably the IDC Big Data and Analytics (BDA) MaturityScape Framework.  This BDA framework (measured across the five core dimensions of intent, data, technology, process, and people) consists of five stages of maturity, which essentially parallel the others mentioned above:  Ad hoc, Opportunistic, Repeatable, Managed, and Optimized.

All of these are excellent models for analytics maturity.  But, if you find these different models to be too theoretical or opaque or unattainable, then I suggest a more practical model for your business analytics progression from ground zero all of the way up to cognitive analytics:  from Descriptive and Diagnostic, to Predictive, to Prescriptive, and finally to Cognitive.

A specific example from the field of Marketing is SYNTASA’s PMI (Personalization Maturity Index). Personalization Capability Maturity parallels the Analytics Capability Maturity frameworks within the specific context of data-driven customer-centric one-to-one marketing and segmentation of one. Read more about this in the article “The Battle for Customer Personalization – Divisive Clustering is Good For You” and in much more detail within SYNTASA’s PMI white paper linked above.

(continue reading here:  https://www.mapr.com/blog/raising-standard-big-data-analytics-profession)



Drilling Through Data Silos with Apache Drill

Enterprise data collections are typically stored in silos belonging to different business divisions. Sometimes these silos belong to different projects within the same division. These silos may be further segmented by services/products and functions. Silos (which stifle data-sharing and innovation) are often identified as a primary impediment (both practically and culturally) to business progress and thus they may be the cause of numerous difficulties. For example, streamlining important business processes, ranging from compliance to data discovery, is rendered more challenging. But, breaking down the silos may not be so easy to accomplish. In fact, it is often infeasible due to ownership issues, governance practices, and regulatory concerns.

Big Data silos create additional complications including data duplication (and associated increased costs), complicated data replication solutions, high data latency, and data quality concerns, not to mention being an enabler of the real problematic situation where your data repositories could hold different versions of the truth. The silos also put a limit on business intelligence (discovery and actionable insights). As big data best practices rise above the hype and noise, we now know that actionable value is more easily extracted when multiple data sets can be integrated and viewed holistically.

Data analysts naturally want to break down silos to combine data from multiple data sources. Unfortunately, this can create its own bottleneck: a complex integration labyrinth—which is costly to maintain, rarely performs well, and can’t be guaranteed to provide consistent results.

In response, many companies have deployed Apache Hadoop to address the problem of segregated data. Hadoop enables multiple types of data to be directly processed in place, and it fully supports data integration from multiple sources across different data storage technologies.

Organizations that use Hadoop are finding additional benefits with Apache Drill, which is the open source version of Google’s Dremel system…

(continue reading here https://www.mapr.com/blog/drive-innovation-breaking-down-data-silos-apache-drill)
