Category Archives: Big Data

Variety is the Secret Sauce for Big Discoveries in Big Data

When I was out for a walk recently, I heard a loud low-flying aircraft passing overhead. This was not unusual since we live in the flight path of planes landing at a major international airport about 10 miles from our home. In this case, I thought to myself that the sound seemed more directly overhead and lower than normal as well as being suggestive of a larger than average jet aircraft.

I realized that in my one simple thought, I had made three different inferences from a single stream of data. The data stream was the audible sound of the aircraft. The three inferences were about the altitude (lower than normal), the size (larger than average), and the flight path (more overhead). When I looked up, my tri-inference hypothesis was confirmed. The plane was a very large, low-flying jet for a major overnight shipping company. The slightly unusual flight path may have been associated with the fact that these planes are probably instructed to land on a different runway at the airport than the usual commercial passenger airlines’ flights – consequently, the altitude and location were slightly different from the slightly smaller commercial passenger airlines that pass overhead every day.

This situation caused me to reflect on how often we jump to conclusions, infer a hypothesis, and (perhaps without as much proof as in this case) assume that our conclusion is true.

For the modern digital organization, the proof of any inference (that drives decisions) should be in the data! Rich and diverse data collections enable more accurate and trustworthy conclusions.

I frequently refer to the era of big data as “the end of demographics”. By that, I mean that we now have many more features, attributes, data sources, and insights into each entity in our domain: people, processes, and products. These multiple data sources enable a “360 view” of the entity, thus empowering a more personalized (even hyper-personalized) understanding of and response to the needs of that unique entity. In “big data language”, we are talking about one of the 3 V’s of big data: big data Variety!

High variety is one of the foundational features of big data — we now measure many more features, characteristics, and dimensions of insight into nearly everything, thanks to the plethora of data sources, sensors, and signals that we measure, monitor, and mine. Consequently, we no longer need to rely on a limited number of features and attributes when making decisions, taking actions, and generating inferences. We can make better, more tailored, more personalized decisions and take more appropriate actions. Every entity is unique! That marks the end of demographics.

Here is another example: suppose that a person goes to their doctor to report problems with painful headaches. That is a single symptom (headache pain) — a single data source, a single signal, a single sensor. However, one could imagine a large number of possible inferences from that one single signal. The headaches could be caused by insufficient sleep (sleep apnea), high blood pressure, pregnancy, or a brain tumor. Obviously, each one of these diagnoses carries a seriously different course of action and treatment.

In “data science language”, what we are describing are different segments (clusters) in the hyperspace of symptoms and causes, in which the many causes (clusters) are projected on top of one another (overlap one another) in the symptom space. The way that a data scientist resolves that degeneracy (another data science word) is to introduce more parameters (higher variety data) in order to “look at” those overlapping clusters from different angles and perspectives, thereby separating them into distinct clusters. High variety data enables the discovery of multiple clusters, and eventually identifies the correct cluster (the correct diagnosis, in this case).
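
To make the degeneracy idea concrete, here is a minimal sketch (not from the original article) using synthetic data and scikit-learn: two hidden “causes” that overlap completely along a single symptom dimension become trivially separable once a second, higher-variety feature is added. The feature names and values are illustrative assumptions only.

```python
# Two "diagnosis" clusters that overlap completely in one symptom dimension
# become separable once a second (higher-variety) feature is added.
# Synthetic data; the feature names and values are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# One signal only: "headache pain intensity" -- both hidden causes look the same.
pain_a = rng.normal(loc=6.0, scale=1.0, size=200)   # cause A (e.g., sleep-related)
pain_b = rng.normal(loc=6.0, scale=1.0, size=200)   # cause B (e.g., blood pressure)

# A second, higher-variety signal separates the two hidden causes.
bp_a = rng.normal(loc=115, scale=8, size=200)       # roughly normal blood pressure
bp_b = rng.normal(loc=150, scale=8, size=200)       # elevated blood pressure

true_labels = np.array([0] * 200 + [1] * 200)
X1 = np.concatenate([pain_a, pain_b]).reshape(-1, 1)             # 1 feature
X2 = np.column_stack([np.concatenate([pain_a, pain_b]),
                      np.concatenate([bp_a, bp_b])])             # 2 features

for name, X in [("pain only", X1), ("pain + blood pressure", X2)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(true_labels, labels):.2f}")
# Expect roughly 0.0 for "pain only" (degenerate, overlapping clusters) and
# close to 1.0 once the second feature resolves the degeneracy.
```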

Higher variety data means that we are adding data from other sensors, other signals, other sources, and of different types. Applied to our low-flying airplane example: I not only heard the aircraft (sound = audio data), but I also looked at it (sight = visual data) and I observed its flight path (dynamic change over time = time series data). The proof of my inference about the airplane was in the data! Additional data sources provided the variety of data signals that were needed in order to derive a correct conclusion.

Similarly, when you go to the doctor with that headache, the doctor will start asking about other symptoms (e.g., lack of appetite; or other pains) and may order other medical tests (blood pressure checks, or other lab results). Those additional data sources and sensors provide the variety of data signals that are needed in order to derive the correct diagnosis.

These examples (the low-flying aircraft and the headache pain) are representative analogies for a large number of use cases in every organization, every business, and every process. The more data you have, the better you are able to detect and discover interesting and important phenomena and events. However, the greater the variety of data you have, the better you are able to correctly diagnose, interpret, understand, gain insights from, and take appropriate action in response to those phenomena and events.

High-variety data is the fuel that powers these insights, because variety is definitely the secret sauce for bigger and better discovery from big data collections.

Follow Kirk on Twitter at @KirkDBorne

Data Makes Possible Many Things: Insights Discovery, Innovation, and Better Decisions

In the early days of the big data era (at the peak of the big data hype), we would often hear about the 3 V’s of big data (Volume, Variety, and Velocity). Then, people started adding more V’s, including Veracity and Value, plus many more! I was guilty of adding several more in my article “Top 10 Big Data Challenges – A Serious Look at 10 Big Data V’s”.

Through the years, I have decided on the following “4 V’s of Big Data” summary in my own presentations:

  • Volume = the most annoying V
  • Velocity = the most challenging V
  • Variety = the richest V for insights discovery, innovation, and better decisions
  • Value = the most important V

As Dez Blanchfield once said, “You don’t need a data scientist to tell you big data is valuable. You need one to show its value.”

A series of articles that drive home the all-important value of data is being published on the DataMakesPossible.com site. The site’s domain name says it all: Data Makes Possible!

What does data make possible? I occasionally discuss these things in articles that I write for the MapR blog series, where I have summarized the big data value proposition very simply in my own list of the 3 D2D’s of big data:

  • Data-to-Discovery (Class Discovery, Correlation Discovery, Novelty Discovery, Association Discovery)
  • Data-to-Decisions
  • Data-to-Dollars (or Data-to-Dividends)

The story is richer and much more impactful these days than those trivial 1-letter (V) or 3-letter (D2D) mnemonics would suggest. We are far past the peak of inflated expectations in the big data hype cycle, and we are even beyond the trough of disillusionment. We have truly entered the plateau of productivity from data in our organizations, though the analytics-driven culture is still going through growing pains.

So, what does data make possible? I document our progress toward deriving big value from big data in a series of articles that I am writing (with more to come) for the DataMakesPossible.com site. These articles include:

But don’t stop there! There are many more fabulous articles, insights, case studies, and impactful stories at DataMakesPossible.com — visit the site often, as there are new posts every week.

Let’s improve our world together through the insights and discoveries that large, comprehensive data collections can provide! That’s what data scientists love to do.

Follow Kirk Borne on Twitter @KirkDBorne

Machine Learning Making Big Moves in Marketing

Machine Learning is (or should be) a core component of any marketing program now, especially in digital marketing campaigns. The following insightful quote by Dan Olley (EVP of Product Development and CTO at Elsevier) sums up the urgency and criticality of the situation: “If CIOs invested in machine learning three years ago, they would have wasted their money. But if they wait another three years, they will never catch up.” This statement also applies to CMOs.

To illustrate and to motivate these emerging and growing developments in marketing, we list here some of the top Machine Learning trends that we see:

  1. Hyper-personalization (segment-of-one, context-driven marketing)
  2. Real-time sentiment analysis and response (social customer care)
  3. Behavioral analytics (predictive and prescriptive)
  4. Conversational chatbots (using NLG: Natural Language Generation)
  5. Agile analytics (DataOps)
  6. Influencer marketing (amplification of your message to specific audiences)
  7. Journey Sciences (using graph and linked data modeling)
  8. Context-based customer engagement through IoT (knowing the knowable via ubiquitous sensors)

You can read more details about each of these developments in my MapR blog.

And check out the many excellent resources and consulting services (in Big Data Analytics, Data Science, Machine Learning, and Machine Intelligence) at Booz Allen Hamilton, to help all of your data-driven campaigns make big moves and move forward more effectively.

Four Ways to Harness Big Data in the Energy Sector

The big change in every industry – energy included – is the sensoring of the world. We have put sensors on just about everything, and that’s definitely true in the energy sector, whether it’s in electricity, oil and gas, supply chain and manufacturing, or even customer interaction.

We call that big data, which people sometimes take to mean big volume, but which I like to think of as big value. Because with all this information, we can understand how our systems work – and put them to better use to create greater value for our business.

Prescriptive models

There is greater insight than ever before into how energy is distributed, both in the course of the day and across specific locations. At NASA, we tracked potential ‘killer asteroids’ – asteroids that have the potential to do serious damage to our planet.

In the energy industry, that increased insight will help avert the industry’s own disasters – its own ‘killer asteroids’. Both predictive and prescriptive models are particularly useful here, for predicting future outcomes and for prescribing different (better, optimal) outcomes: if enough data are collected on factors such as the environment, the devices in use, and contextual information such as weather patterns or energy usage, models can learn how to set operating parameters to prevent disaster.

On an energy grid, if a bad outcome is predicted, operating conditions can be adjusted to prevent machine failure, whether by reducing the operating temperature or speed, or by increasing the frequency of on/off cycles.
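
To illustrate the predict-then-prescribe pattern described above, here is a minimal, hypothetical sketch. The risk model, thresholds, and setpoints are illustrative assumptions, not a real grid control system; in practice the predictive step would be a model trained on historical sensor data.

```python
# Toy predict-then-prescribe loop for the grid example above.
from dataclasses import dataclass

@dataclass
class GridReading:
    temperature_c: float   # operating temperature of the asset
    speed_rpm: float       # operating speed
    load_pct: float        # current load as a percentage of rated capacity

def predict_failure_risk(r: GridReading) -> float:
    """Toy predictive model returning a 0-1 risk score (stand-in for a trained model)."""
    risk = max(0.0, (r.temperature_c - 80.0) / 40.0)   # hot assets are riskier
    risk += max(0.0, (r.load_pct - 90.0) / 20.0)       # overload adds risk
    return min(risk, 1.0)

def prescribe_action(r: GridReading) -> dict:
    """Toy prescriptive step: choose operating parameters that reduce predicted risk."""
    risk = predict_failure_risk(r)
    if risk < 0.5:
        return {"action": "none", "risk": risk}
    # Prescriptions mirror the article: reduce temperature/speed,
    # or increase the frequency of on/off cycles to shed load.
    return {
        "action": "derate",
        "risk": risk,
        "new_speed_rpm": r.speed_rpm * 0.85,
        "target_temperature_c": 75.0,
        "increase_on_off_cycling": r.load_pct > 95.0,
    }

print(prescribe_action(GridReading(temperature_c=92.0, speed_rpm=1800, load_pct=97.0)))
```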

Digital twins

An exciting proposition for the industrial sector is the digital twin, a full Computer-Aided Design (CAD) model of any device that can be run alongside its original to track behaviour and identify the cause of any failures. In the energy industry, this could be a digital replica of an offshore wind turbine, for example, monitoring power usage, production and efficiencies, and replaying any anomalies to identify their cause.
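
As a rough illustration of that idea, the sketch below compares live turbine telemetry against a simulated twin and flags large residuals as anomalies to replay and investigate. The power-curve model, the injected fault, and the 15% threshold are all illustrative assumptions.

```python
# Minimal digital-twin sketch: flag readings that fall well below the twin's prediction.
import numpy as np

def twin_expected_power(wind_speed_ms: np.ndarray) -> np.ndarray:
    """Very simplified twin: cubic power curve, capped at rated output."""
    rated_kw, rated_wind = 3000.0, 12.0
    return rated_kw * (np.clip(wind_speed_ms, 0, rated_wind) / rated_wind) ** 3

rng = np.random.default_rng(7)
wind = rng.uniform(4, 14, size=500)                        # measured wind speed
actual = twin_expected_power(wind) * rng.normal(1.0, 0.03, size=500)
actual[450:] *= 0.75                                       # injected efficiency fault

residual = (actual - twin_expected_power(wind)) / twin_expected_power(wind)
anomalies = np.where(residual < -0.15)[0]                  # >15% below the twin's prediction
print(f"flagged {anomalies.size} readings; first flagged index: {anomalies.min()}")
```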

Crowdsourcing

When I was at NASA, we worked with scientists worldwide to create a web portal called Zooniverse that presents large data collections to the public, enabling everyone to contribute to scientific discovery in datasets that are too large for the science teams to explore by themselves.

With the mountains of data we had, it might have taken years, even centuries, to look through it all. But if you put the data online and create an interesting enough challenge around it, people across the world will volunteer their time in the name of scientific research.

We had some incredible results from this crowdsourcing mission. A Dutch schoolteacher called Hanny van Arkel discovered a strange green object while she was looking through images, which turned out to be an entirely new type of astronomical object (and is now named after her).

If I were to hypothesize how you could similarly crowdsource data in the energy sector, perhaps you could have people look at data collected prior to blackouts. There could be a pattern to the outages, or volunteers could identify places where the power stays on during these blackouts – such as schools or grocery stores – which could be pinpointed on an app so people could use them when there are cuts.

Move beyond reporting

Energy industry executives should be interested in big data, because it allows them to look at all aspects of their business. I have conversations with publicly traded companies that report only on an annual or quarterly basis. I understand this is a regulatory requirement, but it seems to me that this is acting only in hindsight.

You can’t drive a vehicle by looking in the rear-view mirror. You have to understand what’s coming, and the only way you can do that is by taking in all the information that is available to you: from the child playing on the street beside the road, to the truck two cars ahead.

Big data can deal with hindsight – events that have happened in the past; oversight – events happening now; and foresight – events that will happen in the future. And prescriptive modelling gives insight into all of these events. At a time when the global energy industry is undergoing huge change, don’t we need that forward-looking view of the rich information embedded within our big data reservoirs?

Summary

Therefore, the energy industry can benefit from these four approaches (prescriptive modelling, digital twinning, crowdsourcing, and moving beyond reporting) as ways to manage the digital disruption and the flood of new data coming from sensors everywhere. Industry leaders used to encourage organizations to adopt certain products or methodologies to help their data analytics move at the speed of business. I believe that this is no longer sufficient. What you really need are solutions that can help your business move at the speed of your data!

———————–

NOTE: This post was originally published at https://www.linkedin.com/pulse/four-ways-harness-big-data-energy-sector-kirk-borne/

Smart Cities at the Nexus of Emerging Data Technologies and You

One of the most significant characteristics of the evolving digital age is the convergence of technologies. That includes information management (databases), data collection (big data), data storage (cloud), data applications (analytics), knowledge discovery (data science), algorithms (machine learning), transparency (open data), computation (distributed computing: e.g., Hadoop), sensors (internet of things: IoT), and API services (microservices, containerization). One more dimension in this pantheon of tech, which is the most important, is the human dimension. We see the human interaction with technology explicitly among the latest developments in digital marketing, behavioral analytics, user experience, customer experience, design thinking, cognitive computing, social analytics, and (last, but not least) citizen science.

Citizen Scientists are trained volunteers who work on authentic science projects with scientific researchers to answer real-world questions and to address real-world challenges (see discussion here). Citizen Science projects are popular in astronomy, medicine, botany, ecology, ocean science, meteorology, zoology, digital humanities, history, and much more. Check out (and participate in) the wonderful universe of projects at the Zooniverse (zooniverse.org) and at scistarter.com.

In the data science community, we have seen activities and projects that are similar to citizen science, in that volunteers step forward to use their skills and knowledge to solve real-world problems and to address real-world challenges. Examples include Kaggle.com machine learning competitions and the Data Science Bowl (sponsored each year since 2014 by Booz Allen Hamilton, and hosted by Kaggle). These “citizen science” projects are not just for citizens who are untrained in scientific disciplines; in fact, they are dominated by professional and/or deeply skilled data scientists who volunteer their time and talents to help solve hard data challenges.

The convergence of data technologies is leading to the development of numerous “smart paradigms”, including smart highways, smart farms, smart grid, and smart cities, just to name a few. By combining this technology convergence (data science, IoT, sensors, services, open data) with a difficult societal challenge (air quality in urban areas) in conjunction with community engagement (volunteer citizen scientists, whether professional or non-professional), the U.S. Environmental Protection Agency (EPA) has knitted the complex fabric of smart people, smart technologies, and smart problems into a significant open competition: the EPA Smart City Air Challenge.

The EPA Smart City Air Challenge launched on August 30, 2016. The challenge is open for about 8 weeks. This is an unusually important and rare project that sits at that nexus of IoT, Big Data Analytics, and Citizen Science. It calls upon clever design thinking at the intersection of sensor technologies, open data, data science, environment science, and social good.

Open data is fast becoming a standard for public institutions, encouraging partnerships between governments and their constituents. The EPA Smart City Air Challenge is a great positive step in that direction. By bringing together expertise across a variety of domains, we can hope to address and fix some of the hard social, environmental, energy, transportation, and sustainability challenges facing the current age. The challenge competition will bring forward best practices for managing big data at the community level. The challenge encourages communities to deploy hundreds of air quality sensors and to make their data public. The resulting data sets will help communities to understand real-time environmental quality, to identify the driving factors in air quality change (including geographic features, urban features, and human factors), to assess which changes will lead to better outcomes (social, mobile, transportation, energy use, education, human health, etc.), and to motivate those changes at the grass-roots local community level.

The EPA Smart City Air Challenge encourages local governments to form partnerships with sensor manufacturers, data management companies, citizen scientists, data scientists, and others. Together they’ll create strategies for collecting and using the data. EPA will award prizes of up to $40,000 to two communities based on their strategies, including their plans to share their data management methods so others can benefit. The prizes are intended to be seed money, so the partnerships are essential.

After receiving awards for their partnerships, strategies and designs, the two winning communities will have a year to start developing and implementing their solutions based on those winning designs. After that year, EPA will then evaluate the accomplishments and collaboration of the two winning communities. Based upon that evaluation, EPA may then award up to an additional $10,000 to each of the two winning communities.

The EPA Smart City Air Challenge is open until October 28, 2016. The competition is for developers and scientists, for data lovers and technology lovers, for startups and for established organizations, for society and for you. Join the competition now! For more information, visit the Smart City Air Challenge website at http://www.challenge.gov/challenge/smart-city-air-challenge/, or write to smartcityairchallenge@epa.gov.

Spread the word about EPA’s Smart City Air Challenge — big data, data science, and IoT for societal good in your communities!

Thanks to Ethan McMahon @mcmahoneth for his contributions to this article and to the EPA Smart City Air Challenge.

Follow Kirk Borne on Twitter @KirkDBorne

Discovering and understanding patterns in highly dimensional data

Dimensionality reduction is a critical component of any solution dealing with massive data collections. Being able to sift through a mountain of data efficiently in order to find the key descriptive, predictive, and explanatory features of the collection is a fundamental required capability for coping with the Big Data avalanche. Identifying the most interesting dimensions of data is especially valuable when visualizing high-dimensional (high-variety) big data.

There is a “good news, bad news” angle here. First, the bad news: the human capacity for seeing multiple dimensions is very limited: 3 or 4 dimensions are manageable; 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms for speed and effectiveness. In this case, a “scene” refers to any small-n projection of a larger-N parameter space of variables.

In data visualization, a systematic, ordered parameter sweep through an ensemble of small-n projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to quickly see any patterns, trends, or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/n is large.
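
A minimal sketch of that “grand tour” idea (using synthetic data and illustrative column names) is to sweep through every small-n projection of an N-dimensional table and render each one as a scene for a human to scan:

```python
# Ordered sweep over all 2-D projections (scenes) of an N-dimensional data set.
from itertools import combinations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 6                                                      # the "large-N" parameter space
df = pd.DataFrame(rng.normal(size=(300, N)), columns=[f"x{i}" for i in range(N)])
df["x5"] = 0.8 * df["x1"] - 0.5 * df["x3"] + rng.normal(scale=0.1, size=300)  # hidden relation

pairs = list(combinations(df.columns, 2))                  # C(6, 2) = 15 small-n scenes
fig, axes = plt.subplots(3, 5, figsize=(15, 9))
for ax, (a, b) in zip(axes.ravel(), pairs):
    ax.scatter(df[a], df[b], s=5)
    ax.set_title(f"{a} vs {b}", fontsize=8)
plt.tight_layout()
plt.savefig("grand_tour_scenes.png")                       # scan the scenes for patterns/anomalies
```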

Consequently, a data analytics approach that combines the best of both worlds (machine algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data. One such approach is to apply Computer Vision algorithms, which are designed to emulate human perception and cognitive abilities. Another approach is to generate “interestingness metrics” that signal to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data. A specific example of the latter is latent (hidden) variable discovery.

Latent variables are not explicitly observed but are inferred from the observed features, specifically because they are the variables that deliver the all-important (but sometimes hidden) descriptive, predictive, and explanatory power of the data set. Latent variables can also be concepts that are implicitly represented by the data (e.g., the “sentiment” of the author of a social media posting).  

Some latent variables are “observable” in the sense that they can be generated through a “yet to be discovered” mathematical combination of several of the measured variables; they are therefore an obvious vehicle for dimension reduction in the visual exploration of large high-dimensional data.

Latent (hidden) variable models are used in statistics to represent variables that are not directly observed but are inferred from the variables that are observed. Latent variables are widely used in social science, psychology, economics, the life sciences, and machine learning. In machine learning, many problems involve collecting high-dimensional multivariate observations and then hypothesizing a model that explains them. In such models, the role of the latent variables is to represent properties that have not been directly observed.

After inferring the existence of latent variables, the next challenge is to understand them. This can be achieved by exploring their relationship with the observed variables (e.g., using Bayesian methods). Several correlation measures and dimensionality reduction methods such as PCA can be used to measure those relationships. Since we don’t know in advance what relationships exist between the latent variables and the observed variables, more generalized nonparametric measures like the Maximal Information Coefficient (MIC) can be used.

MIC has become popular recently, to some extent because it provides a straightforward R-squared type of estimate for measuring dependency among variables in a high-dimensional data set. Since we don’t know in advance what a latent variable actually represents, it is not possible to predict the type of relationship that it might possess with the observed variables. Consequently, a nonparametric approach makes sense in the case of large high-dimensional data, for which the interrelationships among the many variables are a mystery. Exploring variables that possess the largest values of MIC can help us to understand the type of relationships that the latent variables have with the existing variables, thereby achieving both dimension reduction and a parameter space in which to conduct visual exploration of high-dimensional data.
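
A minimal sketch of this latent-variable workflow, using synthetic data, might derive candidate latent components with PCA and then score each component’s dependence on the observed variables with MIC. This assumes the third-party minepy package for the MIC estimate; the data and variable roles are illustrative only.

```python
# PCA for candidate latent components, then MIC for nonparametric dependence scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from minepy import MINE   # assumed third-party MIC implementation

rng = np.random.default_rng(1)
n = 500
hidden = rng.normal(size=n)                        # the true (unobserved) latent variable
observed = np.column_stack([
    hidden + rng.normal(scale=0.3, size=n),        # linearly related to the latent variable
    hidden ** 2 + rng.normal(scale=0.3, size=n),   # nonlinearly related (MIC can still catch this)
    rng.normal(size=n),                            # pure noise
])

# Step 1: dimension reduction -- candidate latent components.
components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(observed))

# Step 2: nonparametric dependence of each component on each observed variable.
mine = MINE(alpha=0.6, c=15)
for ci in range(components.shape[1]):
    for vi in range(observed.shape[1]):
        mine.compute_score(components[:, ci], observed[:, vi])
        print(f"component {ci} vs observed variable {vi}: MIC = {mine.mic():.2f}")
```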

The techniques described here can help data end-users to discover and understand data patterns that may lead to interesting insights within their massive data collections.

Follow Kirk Borne on Twitter @KirkDBorne

Why Today’s Big Data is Not Yesterday’s Big Data — Exponential and Combinatorial Growth

(The following article was first published in July of 2013 at analyticbridge.com. At least 3 of the links in the original article are now obsolete and/or broken. I re-post the article here with the correct links. A lot of things in the Big Data, Data Science, and IoT universe have changed dramatically since that first publication, but I did not edit the article accordingly, in order to preserve the original flavor and context. The central message is still worth repeating today.)

The on-going Big Data media hype stirs up a lot of passionate voices. There are naysayers (“it is nothing new”), doomsayers (“it will disrupt everything”), and soothsayers (e.g., Predictive Analytics experts). The naysayers are most bothersome, in my humble opinion. (Note: I am not talking about skeptics, whom we definitely and desperately need during any period of maximized hype!)

We frequently encounter statements of the “naysayer” variety that tell us that even the ancient Romans had big data.  Okay, I understand that such statements logically follow from one of the standard definitions of big data: data sets that are larger, more complex, and generated more rapidly than your current resources (computational, data management, analytic, and/or human) can handle — whose characteristics correspond to the 3 V’s of Big Data.  This definition of Big Data could be used to describe my first discoveries in a dictionary or my first encounters with an encyclopedia.  But those “data sets” are hardly “Big Data” — they are universally accessible, easily searchable, and completely “manageable” by their handlers. Therefore, they are SMALL DATA, and thus it is a myth to label them as “Big Data”. By contrast, we cannot ignore the overwhelming fact that in today’s real Big Data tsunami, each one of us generates insurmountable collections of data on our own. In addition, the correlations, associations, and links between each person’s digital footprint and all other persons’ digital footprints correspond to an exponential (actually, combinatorial) explosion in additional data products.
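
A quick back-of-the-envelope calculation shows why that growth is combinatorial rather than merely linear: the number of pairwise links among n digital footprints grows as C(n, 2) = n(n-1)/2, far faster than the number of footprints themselves. (The entity counts below are illustrative, not measured figures.)

```python
# Pairwise links among n entities grow combinatorially: C(n, 2) = n*(n-1)/2.
from math import comb

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} entities -> {comb(n, 2):>27,} pairwise links")
# 1,000 entities already yield ~500,000 links; a billion entities yield
# roughly 5e17 links -- and that is before counting higher-order associations.
```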

Nevertheless, despite all of these clear signs that today’s big data environment is something radically new, that doesn’t stop the naysayers. With the above standard definition of big data in their quiver, the naysayers are fond of shooting arrows through all of the discussions that would otherwise suggest that big data are changing society, business, science, media, government, retail, medicine, cyber-anything, etc. I believe that this naysayer type of conversation is unproductive, unhelpful, and unscientific. The volume, complexity, and speed of data today are vastly different from anything that we have ever previously experienced, and those facts will be even more emphatic next year, and even more so the following year, and so on. In every sector of life, business, and government, the data sets are becoming increasingly off-scale and exponentially unmanageable. The 2011 McKinsey report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” made this abundantly clear. When the Internet of Things and machine-to-machine applications really become established, then the big data V’s of today will seem like child’s play.

In an attempt to illustrate the enormity of scale of today’s (and tomorrow’s) big data, I have discussed the exponential explosion of data in my TEDx talk “Big Data, small world” (e.g., you can fast-forward to my comments on this topic starting at approximately the 9:00 minute mark in the video). You can also read more about this topic in the article “Big Data Growth – Compound Interest on Steroids”, where I have elaborated on the compound growth rate of big data — the numbers will blow your mind, and they should blow away the naysayers’ arguments. Read all about it at http://rocketdatascience.org/?p=204.

Follow Kirk Borne on Twitter @KirkDBorne


Definitive Guides to Data Science and Analytics Things

The Definitive Guide to anything should be a helpful, informative road map to that topic, including visualizations, lessons learned, best practices, application areas, success stories, suggested reading, and more.  I don’t know if all such “definitive guides” can meet all of those qualifications, but here are some that do a good job:

  1. The Field Guide to Data Science (big data analytics by Booz Allen Hamilton)
  2. The Data Science Capability Handbook (big data analytics by Booz Allen Hamilton)
  3. The Definitive Guide to Becoming a Data Scientist (big data analytics)
  4. The Definitive Guide to Data Science – The Data Science Handbook (analytics)
  5. The Definitive Guide to doing Data Science for Social Good (big data analytics, data4good)
  6. The Definitive Q&A Guide for Aspiring Data Scientists (big data analytics, data science)
  7. The Definitive Guide to Data Literacy for all (analytics, data science)
  8. The Data Analytics Handbook Series (big data, data science, data literacy by Leada)
  9. The Big Analytics Book (big data, data science)
  10. The Definitive Guide to Big Data (analytics, data science)
  11. The Definitive Guide to the Data Lake (big data analytics by MapR)
  12. The Definitive Guide to Business Intelligence (big data, business analytics)
  13. The Definitive Guide to Natural Language Processing (text analytics, data science)
  14. A Gentle Guide to Machine Learning (analytics, data science)
  15. Building Machine Learning Systems with Python (a non-definitive guide) (data analytics)
  16. The Definitive Guide to Data Journalism (journalism analytics, data storytelling)
  17. The Definitive “Getting Started with Apache Spark” ebook (big data analytics by MapR)
  18. The Definitive Guide to Getting Started with Apache Spark (big data analytics, data science)
  19. The Definitive Guide to Hadoop (big data analytics)
  20. The Definitive Guide to the Internet of Things for Business (IoT, big data analytics)
  21. The Definitive Guide to Retail Analytics (customer analytics, digital marketing)
  22. The Definitive Guide to Personalization Maturity in Digital Marketing Analytics (by SYNTASA)
  23. The Definitive Guide to Nonprofit Analytics (business intelligence, data mining, big data)
  24. The Definitive Guide to Marketing Metrics & Analytics
  25. The Definitive Guide to Campaign Tagging in Google Analytics (marketing, SEO)
  26. The Definitive Guide to Channels in Google Analytics (SEO)
  27. A Definitive Roadmap to the Future of Analytics (marketing, machine learning)
  28. The Definitive Guide to Data-Driven Attribution (digital marketing, customer analytics)
  29. The Definitive Guide to Content Curation (content-based marketing, SEO analytics)
  30. The Definitive Guide to Collecting and Storing Social Profile Data (social big data analytics)
  31. The Definitive Guide to Data-Driven API Testing (analytics automation, analytics-as-a-service)
  32. The Definitive Guide to the World’s Biggest Data Breaches (visual analytics, privacy analytics)

Follow Kirk Borne on Twitter @KirkDBorne


Fraud Analytics: Fast Automatic Modeling for Customer Loyalty Programs

It doesn’t take a rocket scientist to understand the deep and dark connection between big money and big fraud. One need only look at black markets for drugs and other controlled and/or precious commodities. But what about cases where the commodity is soft, intangible, and practically virtual? I am talking about loyalty and rewards programs.

A 2011 study by Colloquy estimated that the loyalty and rewards programs in the U.S. alone had an outstanding value of $48 billion. This is “outstanding” value because it doesn’t carry tangible benefit until the rewards or loyalty points are cashed in, redeemed, or otherwise exchanged for something that you can “take to the bank”. In anybody’s book, $48 billion is really big value — i.e., big money rewards for loyal customers, and a big target for criminals seeking to defraud the rightful beneficiaries of these rewards.

The risk vs. reward equation in loyalty programs now has huge numbers on both sides of that equation. There’s great value for customers. There’s great return on investment for businesses seeking loyal customers. And that’s great bait to lure criminals into the game.

In the modern digital marketplace, it is now possible to manipulate payment systems on a larger scale, thereby defrauding a business of thousands of dollars in rewards points. For some firms, the scale of the fraud could match the scale of the entire loyalty program, which could bankrupt their supply of rewards for their loyal and faithful customers. This is a really big problem waiting to happen unless something is done about it.

The something that can be done about it is to take advantage of the fast predictive modeling capabilities for fraud detection that are enabled by access to more data (big data), better technology (analytics tools), and more insightful predictive and prescriptive algorithms (data science).

Fraud analytics is no silver bullet. It won’t rid the world of fraudsters and other criminals. But at least fast automatic modeling will give firms better defenses, more timely alerts, and faster response capabilities. This is essential because, in the digital era, it is not only business that is moving at the speed of light, but so also are the business disruptors.
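
As a purely illustrative sketch of what “fast automatic modeling” can look like in this setting (not any vendor’s actual product), an off-the-shelf anomaly detector can score loyalty-point redemptions and flag the suspicious ones for review; the features, data, and thresholds below are assumptions.

```python
# Score loyalty-point redemptions with an off-the-shelf anomaly detector.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

# Each row: [points redeemed, redemptions in last 24h, account age in days]
normal = np.column_stack([
    rng.gamma(shape=2.0, scale=500, size=2000),    # typical redemption sizes
    rng.poisson(lam=1.0, size=2000),               # occasional redemptions
    rng.uniform(30, 3000, size=2000),              # established accounts
])
suspicious = np.array([
    [50_000, 12, 3],    # huge redemption, burst activity, brand-new account
    [40_000,  9, 1],
])

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)
candidates = np.vstack([normal[:5], suspicious])
for score, flag in zip(detector.decision_function(candidates), detector.predict(candidates)):
    print(f"score={score:+.3f}  {'FLAG for review' if flag == -1 else 'ok'}")
# The point is speed: a model like this can be retrained and redeployed quickly
# as new redemption data arrive, enabling the faster alerts described above.
```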

Some simple use cases for fraud analytics within the context of customer loyalty reward programs can be found in the article “Where There’s Big Money, There’s Big Fraud (Analytics)”.

Payment fraud reaches across a vast array of industries: insurance (of all kinds), underwriting, social programs, purchasing and procurement, and now loyalty and rewards programs. Be prepared. Check out the analytics solutions from the fast automatic modeling folks at http://soft10ware.com/.

Follow Kirk Borne on Twitter @KirkDBorne


Open Data: Big Benefits, 7 V’s, and Thousands of Repositories

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

Some of the key players in efforts that use open data for social good include: DataKind, Bayes Impact, Booz Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.

We discussed 6 V’s of Open Data at the DATA Act Forum in July 2015. We have since added one more. The following seven V’s represent characteristics and challenges of open data:

  1. Validity:  data quality, proper documentation, and data usefulness are always imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value:  new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety:  data types, formats, and schemas are as varied as the organizations that collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice:  your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary:  the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability:  the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data):  maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to the data, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more. (A minimal sketch of what such a provenance record might look like follows this list.)
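
Here is a minimal sketch of what a machine-readable provenance record for an open data set might capture. The field names and values are illustrative assumptions, not a formal standard (in practice you might map such fields onto W3C PROV terms).

```python
# Illustrative provenance record for an open data set.
import json

provenance_record = {
    "dataset_id": "city-air-quality-2016",                  # hypothetical identifier
    "owner": "Example City Environmental Office",
    "origin": "low-cost PM2.5 sensor network, 150 stations",
    "chain_of_custody": [
        {"held_by": "sensor vendor", "from": "2016-01-01", "to": "2016-01-07"},
        {"held_by": "city data team", "from": "2016-01-07", "to": None},
    ],
    "transformations": [
        {"step": "unit conversion to ug/m3", "software": "etl-pipeline", "version": "1.4.2"},
        {"step": "outlier removal (>3 sigma)", "software": "etl-pipeline", "version": "1.4.2"},
    ],
    "intended_uses": ["public dashboards", "research", "policy analysis"],
    "license": "CC-BY-4.0",
}

print(json.dumps(provenance_record, indent=2))
```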

Here are some sources and meta-sources of open data:

We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.

The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at http://sunlightfoundation.com/opendataguidelines/. These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.

Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including the following 10 quality metrics for data on the web (which are related to our 7 V’s of open data that we described above):

  1. Statistics
  2. Availability
  3. Processability
  4. Accuracy
  5. Consistency
  6. Relevance
  7. Completeness
  8. Conformance
  9. Credibility
  10. Timeliness

Follow Kirk Borne on Twitter @KirkDBorne