Tag Archives: Analytics

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

Some of the key players in efforts that use open data for social good include: DataKind, Bayes ImpactBooz-Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.

We have discussed 6 V’s of Open Data at the DATA Act Forum in July 2015.  We have now added more. The following seven V’s represent characteristics and challenges of open data:

  1. Validity:  data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value:  new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety:  the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice:  your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary:  the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability:  the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data):  maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.

Here are some sources and meta-sources of open data:

We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.

The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at http://sunlightfoundation.com/opendataguidelines/. These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.

Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including the following 10 quality metrics for data on the web (which are related to our 7 V’s of open data that we described above):

  1. Statistics
  2. Availability
  3. Processability
  4. Accuracy
  5. Consistency
  6. Relevance
  7. Completeness
  8. Conformance
  9. Credibility
  10. Timeliness

Follow Kirk Borne on Twitter @KirkDBorne

 

Blogging My Way Through Data Science, Big Data, and Analytics

I frequently write blog posts on other sites.  You can find those articles here (updated March 21, 2016):

I also write “one-off” blog posts, such as these examples:

Follow Kirk Borne on Twitter @KirkDBorne

What Motivates a Data Scientist?

I recently had the pleasure of being interviewed by Manu Jeevan for his Big Data Made Simple blog.  He asked me several questions:

  • How did you get into data science?
  • What exactly is enterprise data science?
  • How does Booz Allen Hamilton use data science?
  • What skills should business executives have to effectively to communicate with data scientists?
  • How is big data changing the world? (Please give us interesting examples)
  • What are your go-to tools for doing data science?
  • In your TedX talk Big Data, Small World you gave special attention to association discovery, is there a specific reason for that?
  • The Data Scientist has been called the sexiest job of the 21st century. Do you agree?
  • What advice would you give to people aspiring for a long career in data science?

All of these questions were ultimately aimed at understanding the key underlying question: “What motivates you to work in data science?” The question about enterprise data science really comes the closest to identifying what really motivates me — that is, I am exceedingly fortunate every day to be given the opportunity to work with a fantastic team of data scientists at Booz Allen Hamilton, with the mandate to explore data as a corporate asset and to exploit data science as a core capability in order to achieve more profound discoveries, to make better (data-driven) decisions, and to propel new innovations across numerous domains, industries, agencies, and organizations. My Data Science Declaration also sums up these motivating factors for me.

You can see the full scope of my answers to the above questions here: http://bigdata-madesimple.com/interview-with-leading-data-science-expert-kirk-borne/.

Follow Kirk Borne on Twitter @KirkDBorne

Just-in-Time Supply Chain Management with Data Analytics

A common phrase in SCM (Supply Chain Management) is Just-In-Time (JIT) inventory. JIT refers to a management strategy in which raw materials, products, or services are delivered to the right place, at the right time, as demand requires. This has always been an excellent business goal, but the power to excel at JIT inventory management is now improving dramatically with the increased use of data analytics across the supply chain.

In the article “Operational Analytics and Droning About Big Data“, we discussed two examples of JIT: (1) a just-in-time supply replenishment system for human bases on the Moon, and (2) the proposal by Amazon to use drones to deliver products to your front door “just in time”! The Internet of Things will almost certainly generate similar use cases and benefits.

Descriptive analytics (hindsight) tells you what has already happened in your supply chain. If there was a deficiency or problem somewhere, then you can react to that event. But, that is “old school” supply chain management. Modern analytics is predictive (foresight), allowing you to predict where the need will occur (in advance) so that you can proactively deliver products and services at the point of need, just in time.

The next advance in analytics is prescriptive (insight), which uses optimization techniques (from operations research) in combination with insights and knowledge of your business (systems, processes, and resources) in order to optimize your delivery systems, for the best possible outcome (greater sales, fewer losses, reduced inventory, etc.). Just-in-time supply chain management then becomes something more than a reality — it now becomes an enabler of increased efficiency and productivity.

Many more examples of use cases in the manufacturing and retail industries (and elsewhere) where just-in-time analytics is important (and what you can do about it) have been enumerated by the fast Automatic Modeling folks from Soft10, Inc. Check out their fast predictive analytics products at http://soft10ware.com/.

(Read more about these ideas at: https://www.linkedin.com/pulse/supply-chain-data-analytics-jit-legit-kirk-borne)

Follow Kirk Borne on Twitter @KirkDBorne

 

Analytics Maturity Models

In the world of big data analytics, there are several emerging standards for measuring Analytics Capability Maturity within organizations.  One of these has been presented in the TIBCO Analytics Maturity Journey – their six steps toward analytics maturity are:  Measure, Diagnose, Predict and Optimize, Operationalize, Automate, and Transform.  Another example is presented through the SAS Analytics Assessment, which evaluates business analytics readiness and capabilities in several areas.  The B-eye Network Analytics Maturity Model mimics software engineering’s CMM (Capability Maturity Model) – their 6 levels of maturity are:  Level 0 = Incomplete; Level 1 = Performed; Level 2 = Managed; Level 3 = Defined, Level 4 = Quantitatively Managed; and Level 5 = Optimizing.

The most “mature” standard in the field is probably the IDC Big Data and Analytics (BDA) MaturityScape Framework.  This BDA framework (measured across the five core dimensions of intent, data, technology, process, and people) consists of five stages of maturity, which essentially parallel the others mentioned above:  Ad hoc, Opportunistic, Repeatable, Managed, and Optimized.

All of these are excellent models for analytics maturity.  But, if you find these different models to be too theoretical or opaque or unattainable, then I suggest a more practical model for your business analytics progression from ground zero all of the way up to cognitive analytics:  from Descriptive and Diagnostic, to Predictive, to Prescriptive, and finally to Cognitive.

A specific example from the field of Marketing is SYNTASA‘s PMI (Personalization Maturity Index). Personalization Capability Maturity parallels the Analytics Capability Maturity frameworks within the specific context of data-driven customer-centric one-to-one marketing and segmentation of one. Read more about this in the article The Battle for Customer Personalization – Divisive Clustering is Good For Youand in much more detail within SYNTASA’s PMI white paper linked above.

(continue reading here:  https://www.mapr.com/blog/raising-standard-big-data-analytics-profession)

Follow Kirk Borne on Twitter @KirkDBorne

 

4 Reasons why an Accurate Analytics Model may not be Good Enough

Here are four reasons why the result of your analytics modeling might be correct (according to some accuracy metric), but it might not be the right answer:

  1. Your model may be underfit.

  2. Your model may be overfit.

  3. Your model may be biased.

  4. Your model may be suffering from the false positive paradox.

In data science, we are trained to keep searching even after we find a model that appears to be accurate. Data Scientists should continue searching for a better solution, for at least the four reasons listed above. Please note that I am not advocating “paralysis of analysis”, where never-ending searches for new and better solutions are just an excuse (or a behavior pattern) that prevents one from making a final decision. Good leaders know when an answer is “good enough”. We discussed this in a previous article: “Machine Unlearning – The Value of Imperfect Models”…

(For more discussion of the four cases listed above, continue reading herehttps://www.mapr.com/blog/4-reasons-look-further-accurate-answer-your-analytics-question)

Follow Kirk Borne on Twitter @KirkDBorne

Clear and Obvious Analytics for Clear and Present Dangers

Not every industry has found their clear and obvious applications of big data analytics. But the clear and present dangers of risk and fraud in financial transactions demand fast predictive modeling. Precisely because we live in the ubiquitous digital era, where most business (and non-business) transactions are rarely (if ever) in analog form and those transactions no longer move at the pace of humans (but at the speed of light), consequently the volume of digital signals as well as lurking dangers is enormous.

Digital signals (from sensors everywhere in our operational business systems) carry transactional information (what happened to what?), as well as metadata (descriptors) and analytics information (data-encoded knowledge and insights).  These analytics can be behavioral (providing insights into the interests, intentions, and preferences of the actors in a given transaction) as well as functional (providing insights into the actions or events associated with the transaction).

Behavioral analytics is developing into a major component of digital marketing, as firms seek to sell, cross-sell, and up-sell their products to the right customer at the right time.  Behavioral analytics is also critical in risk mitigation of all sorts: financial, cybersecurity, health (individual and population), supply chain, machine performance, and so on.

Here are 10 examples of where fast predictive analytics can play a vital role in most industries (with a focus on financial):

  1. Predict credit risk and fraud in real-time!
  2. Use Social Media for deeper understanding (likes and dislikes) of your customers.
  3. Personalize customer interactions in real-time, across multiple channels.
  4. Stop improper insurance payments before claims are paid!
  5. Spot insurance rate evasion tactics during the quote process – before you issue a policy!
  6. Predict High Health Risk versus Low Health Risk to better manage healthcare decision-making.
  7. Generate better predictive models of health, car, and home insurance eligibility fraud, underwriting fraud, and improper payments.
  8. Spot adversarial and anomalous behavior in cyber networks – stop the data breach or illegal funds transfer before it happens!
  9. Eliminate your Supply Chain hiccups – move the right products to the right locations in the right quantities – and at the right time!
  10. Make better business decisions regarding merchandising, demand forecasting, and pricing – don’t leave money on the table, or products in your warehouse.

Let us look a little more closely at the financial services industry…

One of the common conditions in traditional financial services (including home, health, and auto insurance) has been the “pay and chase” — i.e., you make the payment to the claimant, and then (after making the payment) you find out that the claim is fraudulent, thus beginning the chase to get your money back.

The new world of predictive modeling and advanced analytics allows for a new mantra in the financial and insurance industries: “Do Not Pay!” — i.e., you do not pay the claim until you have analyzed its likelihood for claim fraud, extraordinary financial risk, or payment anomalies (e.g., duplicate payments).

Predictive analytics modeling delivers a better financial risk posture for your organization than the “pay and chase”. With access to greater and more diverse data sources, it is now possible to develop better models of your customers’ credit risk regardless of the industry. This is certainly true in the financial services industry where there is so much data available: credit scores, credit history, court records, tax records, health records, insurance claims, and more. There is no excuse for not examining as much “public data” as you can in conjunction with other data sources that are available to you internally within your organization. Moderate outlays of your organization’s funds that are incurred in acquiring access to diverse external data sources should be offset by the savings accrued by “not sending your funds out the door” erroneously (either to intentional fraudsters or in unintentional duplicate claims).

An analytics-driven predictive model can predict fraud more efficiently (with fast automatic statistical software packages) and more effectively (with higher precision and higher recall: fewer false positives and false negatives) than traditional business processes. A good predictive analytics model should: (a) detect claims that “smell funny”, (b) prevent the “pay and chase” mode of operations, and (c) stop claims fraud abruptly by empowering a “do not pay” mode of operations. Predictive analytics modeling should aim to satisfy the following business requirements:

  • Detect and prevent both opportunistic and professional fraud throughout the claims process.
  • Detect underwriting fraud, to prevent premium leakage at the point of sale and renewal.
  • Spot rate evasion tactics during the quote process – before you issue a policy.

Many more examples of use cases in the financial services industry (and elsewhere) where fast predictive analytics is important (and what you can do about it) have been expertly enumerated by the fast statistical modeling folks from Soft10, Inc. Check out their fast analytics products (including the Instant Online Overbilling Claims Detector) at http://soft10ware.com/.

Follow Kirk Borne on Twitter @KirkDBorne

A Growth Hacker’s Journey Through the Recent History of Data Science

In 1998, I was attending a conference when an astronomer that I knew from across the country sought me out and asked if his group could send the data from their large astronomy experiment to NASA’s ADF (Astrophysics Data Facility, where I was working at the time). Their data set was two Terabytes in total. That seemed big (like the birth of “Astronomy Big Data”) to me, especially for 1998, but I didn’t know how big until I went back to work a few days later. When I mentioned this opportunity to the NASA facility senior managers, they looked at me like I was unaware of something really obvious and important. They were right! They “reminded” me that the data facility was the home for 15,000 NASA space science mission data sets, and the aggregate sum total data volume for all of those data sets combined(!) was less than one Terabyte! They couldn’t possibly accept one single experiment’s data that single-handedly eclipsed the total volume of all of the other 15,000 experiments’ data sets combined.

Well, this was embarrassing! What could we do? I was told that ADF could accept the data if I would write a research grant proposal and win some funds to pay for all of the new I.T. resources that would be required. “What kind of research proposal would pay for such a thing?” I asked myself. This led me to investigate a field of research that I had only briefly heard in conversation once or twice previously — Data Mining (= Machine Learning applied to large data sets). The more I read about this topic (now called Data Science), the more I became convinced that this is what I wanted to do for the rest of my research career. I was hooked. I was at the right place at the right time…

(continue reading herehttps://www.mapr.com/blog/growth-hackers-journey-right-place-right-time)

Follow Kirk Borne on Twitter @KirkDBorne

These are a few of my favorite things… in Big Data and Data Science: A to Z

A while back, we made a list from A to Z of a few of our favorite things in big data and data science. We have made a lot of progress toward covering several of these topics. Here’s a handy list of the write-ups that I have completed so far:

AAssociation rule mining:  described in the article “Association Rule Mining – Not Your Typical Data Science Algorithm.”

C – Characterization:  described in the article “The Big C of Big Data: Top 8 Reasons that Characterization is ‘ROIght’ for Your Data.”

H – Hadoop (of course!):  described in the article “H is for Hadoop, along with a Huge Heap of Helpful Big Data Capabilities.” To learn more, check out the Executive’s Guide to Big Data and Apache Hadoop, available as a free download from MapR.

K – K-anything in data mining:  described in the article “The K’s of Data Mining – Great Things Come in Pairs.”

L – Local linear embedding (LLE):  is described in detail in the blog post series “When Big Data Goes Local, Small Data Gets Big – Part 1” and “Part 2

N – Novelty detection (also known as “Surprise Discovery”):  described in the articles “Outlier Detection Gets a Makeover – Surprise Discovery in Scientific Big Data” and “N is for Novelty Detection…” To learn more, check out the book Practical Machine Learning: A New Look at Anomaly Detection, available as a free download from MapR.

P – Profiling (specifically, data profiling):  described in the article “Data Profiling – Four Steps to Knowing Your Big Data.”

Q – Quantified and Tracked:  described in the article “Big Data is Everything, Quantified and Tracked: What this Means for You.”

R – Recommender engines:  described in two articles: “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore.” To learn more, check out the book Practical Machine Learning: Innovations in Recommendation, available as a free download from MapR.

S – SVM (Support Vector Machines):  described in the article “The Importance of Location in Real Estate, Weather, and Machine Learning.”

Z – Zero bias, Zero variance:  described in the article “Statistical Truisms in the Age of Big Data.”

Learning from Data Big and Small — What’s the Shape of Your Data?

(A version of this article was originally published on BigDataRepublic.com in July 2013 — that site no longer exists.)

Does discovery depend on the scale of your experiment? In some cases, no! Whether Christopher Columbus sailed with 3 ships or 3000, he still would have found the New World, probably in the same amount of time. In this case, the existence of the Americas is independent of the scale of the exploration resources. Conversely, there are many more cases where the potential for discovery does scale with the size of your resources. If those resources are Big Data, then prepare to say “hello, world” to many more new worlds (and new discoveries). The good news for small-to-mid scale projects is that, even without Big Data, you can still be a Columbus.

Learning from small data has justifiably acquired a faithful following of advocates (see this and this).  Let us illustrate this with a common example: Time Series Analysis.

In a simple single-parameter data stream, you can extract characterizations from the time series: (a) the change since the last value (y2-y1); (b) a running mean (e.g., the average of the last 3 data points = [y1+y2+y3]/2); (c) the slope of the trend line (= velocity = dy/dt = [y2-y1]/[t2-t1]); (d) the rate of change of the trend line slope (= acceleration = the 2nd derivative of the data d2y/dt = {[y3-y2]/[t3-t2] – [y2-y1]/[t2-t1]} / [t3-t1] ); (e) the rate of change of acceleration (= jerk); and so on.

Stock market day traders watch 2nd derivatives more closely than the other time series characterizations, since that parameter can signal an inflection point in the data series. Inflection points (a change in the sign of the 2nd derivative) can thus be used as a predictor of an impending turn-around point (maximum = time to sell; or minimum = time to buy) in the time series.

These simple statistical metrics are therefore valuable and informative in some circumstances.  Somewhat more interesting characterizations include the shape of the variation: e.g., U, V, or W. These symbolic representations of temporal behaviors can be quite powerful for sequence mining, pattern discovery, transition detection, and trend analysis in time series data, as well as for the all-important dimensionality reduction and indexing of massive complex data streams.

If the time series stream of data is dense (in time), then you can do a spectral (frequency) analysis to measure the strength of patterns in the time series on all scales (high-frequency to low-frequency) — this is called Fourier Analysis. This analysis gives you a large number of characterization metrics (e.g., the frequency components and their amplitudes) for dense time series.  You can monitor these metrics and alert the end-user only when the power spectrum of the different frequency components changes significantly, even if the change is in only one component (e.g., its phase or amplitude) or if a new component appears (e.g., an hourly fluctuation in data that previously only showed daily fluctuation).

Finally, imagine massive parallel streams of data: Big Time Series Data. Now the fun begins! Such parallel streams may be Twitter timelines for hundreds of millions of users, or streaming data from hundreds (or thousands) of sensors in an airplane or manufacturing plant, or streaming transaction data from millions of retail shoppers or for a large financial firm. Monitoring massively parallel data streams in this way may be a perfect job for a distributed computing environment: Map-Reduce and Hadoop.

At each step (or within each incremental time range) of such massive data streams, you can create a data distribution histogram of the data values Y (or a histogram of trend line slopes dY, or of 2nd derivatives d2Y) across the full ensemble of parallel data streams. You can then estimate a variety of statistical metrics for the separate data distributions (i.e., one set of metrics each for Y, dY, d2Y, and others) as a function of time: mean, median, mode, variance, skew, kurtosis, presence of a long tail, mixture models, and more.  (Of course, if the data are textual, as in Twitter comments, then some form of numerical coding of the text will yield a goldmine of value – that’s a story for another article.)

Exploiting a variety of statistical metrics (data stream characterizations) such as these is where the exploration and discovery potential expands significantly. Similar to the small-data cases described earlier, the values of these characteristic statistical metrics on massive data streams become a model for the state of the system that you are monitoring. The model itself can be monitored and flagged for significant changes in these characteristic statistical features or for the appearance of new features in the data streams. As long as the massive parallel data streams continue to behave in predictable consistent patterns (which is called a “stationary state”), then there is no need to alert the end-user. However, when the stationarity of the data stream model changes (perhaps triggered by a change in any one of the state parameters that exceeds a pre-specified threshold), then a signal is raised and the end-user verifies whether a truly new behavior or event has been discovered. Land ahoy! All hands on deck!

The point of these examples is to demonstrate that discovery and learning from small data is still useful and valuable. As the data set becomes increasingly larger, it is then possible (and likely) that more intricate, subtle, and descriptive features within the data will be revealed. The discovery potential of bigger data thereby increases (perhaps exponentially). Additionally, the nature and diversity of the discoveries become richer, and maybe so will you!

Follow Kirk Borne on Twitter @KirkDBorne