Tag Archives: Machine Learning

Variety is the Secret Sauce for Big Discoveries in Big Data

When I was out for a walk recently, I heard a loud low-flying aircraft passing overhead. This was not unusual since we live in the flight path of planes landing at a major international airport about 10 miles from our home. In this case, I thought to myself that the sound seemed more directly overhead and lower than normal as well as being suggestive of a larger than average jet aircraft.

I realized that in my one simple thought, I had made three different inferences from a single stream of data. The data stream was the audible sound of the aircraft. The three inferences were about the altitude (lower than normal), the size (larger than average), and the flight path (more overhead). When I looked up, my tri-inference hypothesis was confirmed. The plane was a very large, low-flying jet for a major overnight shipping company. The slightly unusual flight path may have been associated with the fact that these planes are probably instructed to land on a different runway at the airport than the usual commercial passenger airlines’ flights – consequently, the altitude and location were slightly different from the slightly smaller commercial passenger airlines that pass overhead every day.

This situation caused me to reflect on how often we can jump to conclusions, infer a hypothesis, and (maybe without as much proof as in this case) we assume that our conclusion is true.

For the modern digital organization, the proof of any inference (that drives decisions) should be in the data! Rich and diverse data collections enable more accurate and trustworthy conclusions.

I frequently refer to the era of big data as “the end of demographics”. By that, I mean that we now have many more features, attributes, data sources, and insights into each entity in our domain: people, processes, and products. These multiple data sources enable a “360 view” of the entity, thus empowering a more personalized (even hyper-personalized) understanding of and response to the needs of that unique entity. In “big data language”, we are talking about one of the 3 V’s of big data: big data Variety!

High variety is one of the foundational key features of big data — we now measure many more features, characteristics, and dimensions of insight into nearly everything due to the plethora of data sources, sensors, and signals that we measure, monitor, and mine. Consequently, we no longer need to rely on a limited number of features and attributes when making decisions, taking actions, and generating inferences. We can make better, tailored, more personalized decisions and actions. Every entity is unique! That marks the end of demographics.

Here is another example: suppose that a person goes to their doctor to report problems with painful headaches. That is a single symptom (headache pain) — a single data source, a single signal, a single sensor. However, one could imagine a large number of possible inferences from that one single signal. The headaches could be caused by insufficient sleep (sleep apnea), high blood pressure, pregnancy, or a brain tumor. Obviously, each one of these diagnoses carries a seriously different course of action and treatment.

In “data science language”, what we are describing are different segments (clusters) in the hyperspace of symptoms and causes in which the many causes (clusters) are projected on top of one another (overlap one another) in the symptom space. The way that a data scientist resolves that degeneracy (another data science word) is to introduce more parameters (higher variety data) in order to “look at” those overlapping clusters from different angles and perspectives, thus resolving the different diagnosis clusters. High variety data enables the discovery of multiple clusters, and eventually identifies the correct cluster (correct diagnosis, in this case).

Higher variety data means that we are adding data from other sensors, other signals, other sources, and of different types. Going back to our low-flying airplane example, this has the following application: I not only heard the aircraft (sound = audio data), but I also looked at it (sight = visual data) and I observed its flight path (dynamic change over time = time series data). The proof of my inference about the airplane was in the data! Additional data sources provided the variety of data signals that were needed in order to derive a correct conclusion.

Similarly, when you go to the doctor with that headache, the doctor will start asking about other symptoms (e.g., lack of appetite; or other pains) and may order other medical tests (blood pressure checks, or other lab results). Those additional data sources and sensors provide the variety of data signals that are needed in order to derive the correct diagnosis.

These examples (low-flying aircraft, and headache pain) are representative analogies of a large number of different use cases in every organization, every business, and every process. The more data you have, the better you are able to detect and discover interesting and important phenomena and events. However, the more variety of data you have, the better you are able to correctly diagnose, interpret, understand, gain insights from, and take appropriate action in response to those phenomena and events.

High-variety data is the fuel that powers these insights, because variety is definitely the secret sauce for bigger and better discovery from big data collections.

Follow Kirk on Twitter at @KirkDBorne

Are Your Predictive Models like Broken Clocks?

A wise philosopher (or comedian) once said, “Even a broken clock is right twice a day.” That same statement might also apply to some predictive models. Since prediction is about the future (usually), then random chance (like broken clockwork) may allow our model to be right occasionally (just by accident). The important step in the data science process that aims to reduce the danger of this occurring is the all-important cross-validation phase (or model-testing phase, which uses an independent data set). This phase is devoted to validating that our model works accurately on previously unseen data that were not used in the model training (model-building) phase.

Another way of characterizing this phase can be found in the field of System Engineering: V&V (Verification and Validation). In the first phase (verification), we verify that the system was built correctly (according to a set of requirements and specifications). In the second phase (validation), we validate that we built the correct system (consistent with the operational needs that the end-user, customer, or client expects the system to satisfy). We sometimes say it this way: (1) in verification, we ask “Did we build the system right?”; and (2) in validation, we ask “Did we build the right system?”

Applying the V&V system engineering principle to data science means that we see model-testing as a two-step process. First, we verify that the model is a logical consequent of the input data used to train the model. Second, we validate that the model remains useful, accurate, and robust when applied to previously unseen data. Any data scientist who participates in Kaggle competitions understands and “lives” this process. It is often the case that our first data science model will do a great job on the “seen” data set (i.e., verification by using a “broken clock” that is right on occasion), but the model then performs poorly on the “unseen” data set. A model that does well on both data sets is a winning model (maybe not in every Kaggle competition, but certainly in real-world usage).

Using the same data set both to validate a model and to train the model would be the data science equivalent of “circular reasoning“. This will often lead to “overfitting“, where the initial model is incorrectly trained to reproduce every variation, bump, wiggle, nuance, and noisy deviation in the training data set, thus falsely exaggerating the importance of those fluctuations. “Complexity” describes our world, but it shouldn’t describe our models.

The other extreme in model-building can be just as bad: underfitting (or bias) introduced by using too few explanatory variables to model the behaviors seen in our data set. I like to believe that Albert Einstein understood data science modeling very well when he said “Everything should be made as simple as possible, but not simpler.” Building an excessively complex model (with too many parameters that follow the noise fluctuations in our data) is like putting too much confidence in a broken clock (“it’s exactly right… some of the time!”). George Box warned us to have a little humility in the face of complex data (and a complex world): “All models are wrong, but some are useful.”

Therefore, when faced with highly complex (high-variety) big data, we are also faced with how to choose the “right model”. We should apply the “Goldilocks principle” — choose a model that is not “too good” and not too bad (i.e., the model works well enough on the training data set and on the test data set).

Follow Kirk Borne on Twitter @KirkDBorne

model_complexity_error_training_test

 

(Source for graphic: http://gerardnico.com/wiki/data_mining/bias_trade-off)

Definitive Guides to Data Science and Analytics Things

The Definitive Guide to anything should be a helpful, informative road map to that topic, including visualizations, lessons learned, best practices, application areas, success stories, suggested reading, and more.  I don’t know if all such “definitive guides” can meet all of those qualifications, but here are some that do a good job:

  1. The Field Guide to Data Science (big data analytics by Booz Allen Hamilton)
  2. The Data Science Capability Handbook (big data analytics by Booz Allen Hamilton)
  3. The Definitive Guide to Becoming a Data Scientist (big data analytics)
  4. The Definitive Guide to Data Science – The Data Science Handbook (analytics)
  5. The Definitive Guide to doing Data Science for Social Good (big data analytics, data4good)
  6. The Definitive Q&A Guide for Aspiring Data Scientists (big data analytics, data science)
  7. The Definitive Guide to Data Literacy for all (analytics, data science)
  8. The Data Analytics Handbook Series (big data, data science, data literacy by Leada)
  9. The Big Analytics Book (big data, data science)
  10. The Definitive Guide to Big Data (analytics, data science)
  11. The Definitive Guide to the Data Lake (big data analytics by MapR)
  12. The Definitive Guide to Business Intelligence (big data, business analytics)
  13. The Definitive Guide to Natural Language Processing (text analytics, data science)
  14. A Gentle Guide to Machine Learning (analytics, data science)
  15. Building Machine Learning Systems with Python (a non-definitive guide) (data analytics)
  16. The Definitive Guide to Data Journalism (journalism analytics, data storytelling)
  17. The Definitive “Getting Started with Apache Spark” ebook (big data analytics by MapR)
  18. The Definitive Guide to Getting Started with Apache Spark (big data analytics, data science)
  19. The Definitive Guide to Hadoop (big data analytics)
  20. The Definitive Guide to the Internet of Things for Business (IoT, big data analytics)
  21. The Definitive Guide to Retail Analytics (customer analytics, digital marketing)
  22. The Definitive Guide to Personalization Maturity in Digital Marketing Analytics (by SYNTASA)
  23. The Definitive Guide to Nonprofit Analytics (business intelligence, data mining, big data)
  24. The Definitive Guide to Marketing Metrics & Analytics
  25. The Definitive Guide to Campaign Tagging in Google Analytics (marketing, SEO)
  26. The Definitive Guide to Channels in Google Analytics (SEO)
  27. A Definitive Roadmap to the Future of Analytics (marketing, machine learning)
  28. The Definitive Guide to Data-Driven Attribution (digital marketing, customer analytics)
  29. The Definitive Guide to Content Curation (content-based marketing, SEO analytics)
  30. The Definitive Guide to Collecting and Storing Social Profile Data (social big data analytics)
  31. The Definitive Guide to Data-Driven API Testing (analytics automation, analytics-as-a-service)
  32. The Definitive Guide to the World’s Biggest Data Breaches (visual analytics, privacy analytics)

Follow Kirk Borne on Twitter @KirkDBorne

4_book_image-6fd6043b69f0bb051f45055c9481cccc

Reach Analytics Maturity through Fast Automatic Modeling

The late great baseball legend Yogi Berra was credited with saying this gem: “The future ain’t what it used to be.” In the context of big data analytics, I am now inclined to believe that Yogi was very insightful — his statement is an excellent description of Prescriptive Analytics.

Prescriptive Analytics goes beyond Descriptive and Predictive Analytics in the maturity framework of analytics. “Descriptive” analytics delivers hindsight (telling you what did happen, by generating reports from your databases), and “predictive” delivers foresight (telling you what will happen, through machine learning algorithms). Going one better, “prescriptive” delivers insight: discovering so much about your application domain (from your collection of big data and information resources, through data science and predictive models) that you are now able to take the actions (e.g., set the conditions and parameters) needed to achieve a prescribed (better, optimal, desired) outcome.

So, if predictive analytics can use historical training data sets to tell us what will happen in the future (e.g., which products a customer will buy; where and when your supply chain will need replenishing; which vehicles in your corporate fleet will need repairs; which machines in your manufacturing plant will need maintenance; or which servers in your data center will fail), then prescriptive analytics can alter that future (i.e., the future ain’t what it used to be).

When dealing with large high-variety data sets, with many features and measured attributes, it is often difficult to build accurate models that are generally useful under a variety of conditions and that capture all of the complexities of the response functions and explanatory variables within your business application. In such cases, fast automatic modeling tools are needed. These tools can help to identify the minimum viable feature set for accurate predictive and prescriptive modeling. For this purpose, I recommend that you check out the analytics solutions from the fast automatic modeling folks at http://soft10ware.com/.

The Soft10 software package is trained to observe quickly and report automatically the most significant, informative and explanatory dependencies in your data. Those capabilities are the “secret sauce” in insightful prescriptive analytics, and they coincide nicely with another insightful quote from Yogi Berra: “You can observe a lot by just watching.”

(Read the full blog at: https://www.linkedin.com/pulse/prescriptive-analytics-future-aint-what-used-kirk-borne)

Predictive versus Prescriptive Analytics

Predictive Analytics (given X, find Y) vs. Prescriptive Analytics (given Y, find X)

Follow Kirk Borne on Twitter @KirkDBorne

Fraud Analytics: Fast Automatic Modeling for Customer Loyalty Programs

It doesn’t take a rocket scientist to understand the deep and dark connection between big money and big fraud. One need only look at black markets for drugs and other controlled and/or precious commodities. But what about cases where the commodity is soft, intangible, and practically virtual? I am talking about loyalty and rewards programs.

A study by Colloquy (in 2011) estimated that the loyalty and rewards programs in the U.S. alone had an estimated outstanding value of $48 billion US dollars. This is “outstanding” value because it doesn’t carry tangible benefit until the rewards or loyalty points are cashed in, redeemed, or otherwise exchanged for something that you can “take to the bank”. In anybody’s book, $48 billion is really big value — i.e., big money rewards for loyal customers, and a big target for criminals seeking to defraud the rightful beneficiaries of these rewards.

The risk vs. reward equation in loyalty programs now has huge numbers on both sides of that equation. There’s great value for customers. There’s great return on investment for businesses seeking loyal customers. And that’s great bait to lure criminals into the game.

In the modern digital marketplace, it is now possible to manipulate payment systems on a larger scale, thereby defrauding the business of thousands of dollars in rewards points. The scale of the fraud could match the scale of the entire loyalty program for some firms, which would therefore bankrupt their supply of rewards for their loyal and faithful customers. This is a really big problem waiting to happen unless something is done about it.

The something that can be done about it is to take advantage of the fast predictive modeling capabilities for fraud detection that are enabled by access to more data (big data), better technology (analytics tools), and more insightful predictive and prescriptive algorithms (data science).

Fraud analytics is no silver bullet. It won’t rid the world of fraudsters and other criminals. But at least fast automatic modeling will give firms better defenses, more timely alerts, and faster response capabilities. This is essential because, in the digital era, it is not only business that is moving at the speed of light, but so also are the business disruptors.

Some simple use cases for fraud analytics within the context of customer loyalty reward programs can be found in the article “Where There’s Big Money, There’s Big Fraud (Analytics)“.

Payment fraud reaches across a vast array of industries: insurance (of all kinds), underwriting, social programs, purchasing and procurement, and now loyalty and rewards programs. Be prepared. Check out the analytics solutions from the fast automatic modeling folks at http://soft10ware.com/.

Follow Kirk Borne on Twitter @KirkDBorne

 

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

Some of the key players in efforts that use open data for social good include: DataKind, Bayes ImpactBooz-Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.

We have discussed 6 V’s of Open Data at the DATA Act Forum in July 2015.  We have now added more. The following seven V’s represent characteristics and challenges of open data:

  1. Validity:  data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value:  new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety:  the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice:  your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary:  the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability:  the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data):  maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.

Here are some sources and meta-sources of open data:

We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.

The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at http://sunlightfoundation.com/opendataguidelines/. These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.

Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including the following 10 quality metrics for data on the web (which are related to our 7 V’s of open data that we described above):

  1. Statistics
  2. Availability
  3. Processability
  4. Accuracy
  5. Consistency
  6. Relevance
  7. Completeness
  8. Conformance
  9. Credibility
  10. Timeliness

Follow Kirk Borne on Twitter @KirkDBorne

 

Blogging My Way Through Data Science, Big Data, and Analytics

I frequently write blog posts on other sites.  You can find those articles here (updated March 21, 2016):

I also write “one-off” blog posts, such as these examples:

Follow Kirk Borne on Twitter @KirkDBorne

4 Reasons why an Accurate Analytics Model may not be Good Enough

Here are four reasons why the result of your analytics modeling might be correct (according to some accuracy metric), but it might not be the right answer:

  1. Your model may be underfit.

  2. Your model may be overfit.

  3. Your model may be biased.

  4. Your model may be suffering from the false positive paradox.

In data science, we are trained to keep searching even after we find a model that appears to be accurate. Data Scientists should continue searching for a better solution, for at least the four reasons listed above. Please note that I am not advocating “paralysis of analysis”, where never-ending searches for new and better solutions are just an excuse (or a behavior pattern) that prevents one from making a final decision. Good leaders know when an answer is “good enough”. We discussed this in a previous article: “Machine Unlearning – The Value of Imperfect Models”…

(For more discussion of the four cases listed above, continue reading herehttps://www.mapr.com/blog/4-reasons-look-further-accurate-answer-your-analytics-question)

Follow Kirk Borne on Twitter @KirkDBorne

The Definitive Q&A for Aspiring Data Scientists

I was recently asked five questions by Alex Woodie of Datanami for the article, “So You Want To Be A Data Scientist” that he was preparing. He used a few snippets from my full set of answers. The longer version of my answers provided additional advice. For aspiring data scientists of all ages, I provide here the full, unabridged version of my answers, which may help you even more to achieve your goal. (Note: I paraphrase Alex’s original questions in quotes below.)

1. “What is the number one piece of advice you give to aspiring data scientists?”

My number one piece of advice always is to follow your passions first. Know what you are good at and what you care about, and pursue that. So, you might be good at math, or programming, or data manipulation, or problem solving, or communications (data journalism), or whatever. You can do that flavor of data science within the context of any domain: scientific research, government, media communications, marketing, business, healthcare, finance, cybersecurity, law enforcement, manufacturing, transportation, or whatever. As a successful data scientist, your day can begin and end with you counting your blessings that you are living your dream by solving real-world problems with data. I saw a quote recently that summarizes this: “If you think your scarce data science skills could be better used elsewhere, be bold and make the move.” (Reference).

2. “What are the most important skills for an aspiring data scientist to acquire?”

There are many skills under the umbrella of data science, and we should not expect any one single person to be a master of them all. The best solution to the data science talent shortage is a team of data scientists. So I suggest…

(continue reading herehttps://www.mapr.com/blog/definitive-qa-aspiring-data-scientists)

Follow Kirk Borne on Twitter @KirkDBorne

These are a few of my favorite things… in Big Data and Data Science: A to Z

A while back, we made a list from A to Z of a few of our favorite things in big data and data science. We have made a lot of progress toward covering several of these topics. Here’s a handy list of the write-ups that I have completed so far:

AAssociation rule mining:  described in the article “Association Rule Mining – Not Your Typical Data Science Algorithm.”

C – Characterization:  described in the article “The Big C of Big Data: Top 8 Reasons that Characterization is ‘ROIght’ for Your Data.”

H – Hadoop (of course!):  described in the article “H is for Hadoop, along with a Huge Heap of Helpful Big Data Capabilities.” To learn more, check out the Executive’s Guide to Big Data and Apache Hadoop, available as a free download from MapR.

K – K-anything in data mining:  described in the article “The K’s of Data Mining – Great Things Come in Pairs.”

L – Local linear embedding (LLE):  is described in detail in the blog post series “When Big Data Goes Local, Small Data Gets Big – Part 1” and “Part 2

N – Novelty detection (also known as “Surprise Discovery”):  described in the articles “Outlier Detection Gets a Makeover – Surprise Discovery in Scientific Big Data” and “N is for Novelty Detection…” To learn more, check out the book Practical Machine Learning: A New Look at Anomaly Detection, available as a free download from MapR.

P – Profiling (specifically, data profiling):  described in the article “Data Profiling – Four Steps to Knowing Your Big Data.”

Q – Quantified and Tracked:  described in the article “Big Data is Everything, Quantified and Tracked: What this Means for You.”

R – Recommender engines:  described in two articles: “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore.” To learn more, check out the book Practical Machine Learning: Innovations in Recommendation, available as a free download from MapR.

S – SVM (Support Vector Machines):  described in the article “The Importance of Location in Real Estate, Weather, and Machine Learning.”

Z – Zero bias, Zero variance:  described in the article “Statistical Truisms in the Age of Big Data.”