Open Data: Big Benefits, 7 V’s, and Thousands of Repositories

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

Some of the key players in efforts that use open data for social good include: DataKind, Bayes ImpactBooz-Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.

We have discussed 6 V’s of Open Data at the DATA Act Forum in July 2015.  We have now added more. The following seven V’s represent characteristics and challenges of open data:

  1. Validity:  data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
  2. Value:  new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
  3. Variety:  the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
  4. Voice:  your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
  5. Vocabulary:  the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
  6. Vulnerability:  the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
  7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data):  maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.

Here are some sources and meta-sources of open data:

We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.

The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.

Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including the following 10 quality metrics for data on the web (which are related to our 7 V’s of open data that we described above):

  1. Statistics
  2. Availability
  3. Processability
  4. Accuracy
  5. Consistency
  6. Relevance
  7. Completeness
  8. Conformance
  9. Credibility
  10. Timeliness

Follow Kirk Borne on Twitter @KirkDBorne


Blogging My Way Through Data Science, Big Data, and Analytics

I frequently write blog posts on other sites.  You can find those articles here (updated March 21, 2016):

I also write “one-off” blog posts, such as these examples:

Follow Kirk Borne on Twitter @KirkDBorne

What Motivates a Data Scientist?

I recently had the pleasure of being interviewed by Manu Jeevan for his Big Data Made Simple blog.  He asked me several questions:

  • How did you get into data science?
  • What exactly is enterprise data science?
  • How does Booz Allen Hamilton use data science?
  • What skills should business executives have to effectively to communicate with data scientists?
  • How is big data changing the world? (Please give us interesting examples)
  • What are your go-to tools for doing data science?
  • In your TedX talk Big Data, Small World you gave special attention to association discovery, is there a specific reason for that?
  • The Data Scientist has been called the sexiest job of the 21st century. Do you agree?
  • What advice would you give to people aspiring for a long career in data science?

All of these questions were ultimately aimed at understanding the key underlying question: “What motivates you to work in data science?” The question about enterprise data science really comes the closest to identifying what really motivates me — that is, I am exceedingly fortunate every day to be given the opportunity to work with a fantastic team of data scientists at Booz Allen Hamilton, with the mandate to explore data as a corporate asset and to exploit data science as a core capability in order to achieve more profound discoveries, to make better (data-driven) decisions, and to propel new innovations across numerous domains, industries, agencies, and organizations. My Data Science Declaration also sums up these motivating factors for me.

You can see the full scope of my answers to the above questions here:

Follow Kirk Borne on Twitter @KirkDBorne

Just-in-Time Supply Chain Management with Data Analytics

A common phrase in SCM (Supply Chain Management) is Just-In-Time (JIT) inventory. JIT refers to a management strategy in which raw materials, products, or services are delivered to the right place, at the right time, as demand requires. This has always been an excellent business goal, but the power to excel at JIT inventory management is now improving dramatically with the increased use of data analytics across the supply chain.

In the article “Operational Analytics and Droning About Big Data“, we discussed two examples of JIT: (1) a just-in-time supply replenishment system for human bases on the Moon, and (2) the proposal by Amazon to use drones to deliver products to your front door “just in time”! The Internet of Things will almost certainly generate similar use cases and benefits.

Descriptive analytics (hindsight) tells you what has already happened in your supply chain. If there was a deficiency or problem somewhere, then you can react to that event. But, that is “old school” supply chain management. Modern analytics is predictive (foresight), allowing you to predict where the need will occur (in advance) so that you can proactively deliver products and services at the point of need, just in time.

The next advance in analytics is prescriptive (insight), which uses optimization techniques (from operations research) in combination with insights and knowledge of your business (systems, processes, and resources) in order to optimize your delivery systems, for the best possible outcome (greater sales, fewer losses, reduced inventory, etc.). Just-in-time supply chain management then becomes something more than a reality — it now becomes an enabler of increased efficiency and productivity.

Many more examples of use cases in the manufacturing and retail industries (and elsewhere) where just-in-time analytics is important (and what you can do about it) have been enumerated by the fast Automatic Modeling folks from Soft10, Inc. Check out their fast predictive analytics products at

(Read more about these ideas at:

Follow Kirk Borne on Twitter @KirkDBorne


Definitive Guide to Data Literacy For All – A Reading List

One of the most important roles that we should be embracing right now is training the next-generation workforce in the art and science of data. Data Literacy is a fundamental literacy that should be imparted at the earliest levels of learning, and it should continue through all years of education. Education research has shown the value of using data in the classroom to teach any subject — so, I am not advocating the teaching of hard-core data science to children, but I definitely promote the use of data mining and data science applications in the teaching of other subjects (perhaps, in all subjects!). See my “Using Data in the Classroom Reading List” here on this subject.

I encourage you to read a position paper that I wrote (along with a few astronomy colleagues) for the US National Academies of Science in 2009 that addressed the data science literacy requirements in astronomy. Though focused on the needs in astronomy workforce development for the coming decade, the paper also contains more general discussion of “data literacy for the masses” that is applicable to any and all disciplines, domains, and organizations: “Data Science For The Masses.”

Two new “…For Dummies” books can help in those situations, to bring data literacy to a much larger audience (of students, business leaders, government agencies, educators, etc.). Those new books are: “Data Science For Dummies” by Lillian Pierson, and “Data Mining for Dummies” by Meta Brown.  And here is one more that I believe is an excellent data literacy companion: The Data Journalism Handbook.

Update (April 2016) – The following site has a wealth of information on the use of “Data in Education”:

Data Mining For Dummies Data Journalism Handbook Data Science For Dummies

(Read more here:

Follow Kirk Borne on Twitter @KirkDBorne

Analytics Maturity Models

In the world of big data analytics, there are several emerging standards for measuring Analytics Capability Maturity within organizations.  One of these has been presented in the TIBCO Analytics Maturity Journey – their six steps toward analytics maturity are:  Measure, Diagnose, Predict and Optimize, Operationalize, Automate, and Transform.  Another example is presented through the SAS Analytics Assessment, which evaluates business analytics readiness and capabilities in several areas.  The B-eye Network Analytics Maturity Model mimics software engineering’s CMM (Capability Maturity Model) – their 6 levels of maturity are:  Level 0 = Incomplete; Level 1 = Performed; Level 2 = Managed; Level 3 = Defined, Level 4 = Quantitatively Managed; and Level 5 = Optimizing.

The most “mature” standard in the field is probably the IDC Big Data and Analytics (BDA) MaturityScape Framework.  This BDA framework (measured across the five core dimensions of intent, data, technology, process, and people) consists of five stages of maturity, which essentially parallel the others mentioned above:  Ad hoc, Opportunistic, Repeatable, Managed, and Optimized.

All of these are excellent models for analytics maturity.  But, if you find these different models to be too theoretical or opaque or unattainable, then I suggest a more practical model for your business analytics progression from ground zero all of the way up to cognitive analytics:  from Descriptive and Diagnostic, to Predictive, to Prescriptive, and finally to Cognitive.

A specific example from the field of Marketing is SYNTASA‘s PMI (Personalization Maturity Index). Personalization Capability Maturity parallels the Analytics Capability Maturity frameworks within the specific context of data-driven customer-centric one-to-one marketing and segmentation of one. Read more about this in the article The Battle for Customer Personalization – Divisive Clustering is Good For Youand in much more detail within SYNTASA’s PMI white paper linked above.

(continue reading here:

Follow Kirk Borne on Twitter @KirkDBorne


A Day in the Life of Confounding Factors and Explanatory Variables

Would we trust an insurance provider who sets motorbike insurance rates based on the sales of sour cream? Or would we schedule our space launches according to the number of doctoral degrees awarded in Sociology?

Probably all of us would agree that this kind of decision-making is unjustified. A specific decision like this appears to be only superficially supported by the evidence of correlations between those various factors, but is there more to the story? Does it go any deeper? What if there exists a hidden causal factor that induces the apparently spurious correlation?

For example, suppose the increase in space launches and the increase in doctoral degrees in Sociology were both related to an increase in government investments in research studies on the sociological impacts of establishing a permanent human colony on the Moon. This case reveals a hidden causal connection in an otherwise strange correlation. The explanatory variable (which is a hidden confounding factor) is the research investment, and the response variables are the space launches and doctoral degrees.

What about other cases? What about the evidence that sour cream sales correlate with motorbike accidents? In such cases, shouldn’t we all be pleased to see organizations making evidence-based data-driven objective decisions (especially in this brave new world of exploding data volumes and ubiquitous analytics)? No, I don’t think so!!

So, what kind of world is this?

Welcome to the world of explanatory variables and confounding factors!

Statistical literacy is needed now more than ever (to paraphrase H. G. Wells). This includes awareness of and adherence to common principles of statistical reasoning. For example…

(continue reading here

Follow Kirk Borne on Twitter @KirkDBorne

Drilling Through Data Silos with Apache Drill

Enterprise data collections are typically stored in silos belonging to different business divisions. Sometimes these silos belong to different projects within the same division. These silos may be further segmented by services/products and functions. Silos (which stifle data-sharing and innovation) are often identified as a primary impediment (both practically and culturally) to business progress and thus they may be the cause of numerous difficulties. For example, streamlining important business processes are rendered more challenging, ranging from compliance to data discovery. But, breaking down the silos may not be so easy to accomplish. In fact, it is often infeasible due to ownership issues, governance practices, and regulatory concerns.

Big Data silos create additional complications including data duplication (and associated increased costs), complicated data replication solutions, high data latency, and data quality concerns, not to mention being an enabler of the real problematic situation where your data repositories could hold different versions of the truth. The silos also put a limit on business intelligence (discovery and actionable insights). As big data best practices rise above the hype and noise, we now know that actionable value is more easily extracted when multiple data sets can be integrated and viewed holistically.

Data analysts naturally want to break down silos to combine data from multiple data sources. Unfortunately, this can create its own bottleneck: a complex integration labyrinth—which is costly to maintain, rarely performs well, and can’t be guaranteed to provide consistent results.

In response, many companies have deployed Apache Hadoop to address the problem of segregated data. Hadoop enables multiple types of data to be directly processed in place, and it fully supports data integration from multiple sources across different data storage technologies.

Organizations that use Hadoop are finding additional benefits with Apache Drill, which is the open source version of Google’s Dremel system…

(continue reading here

Follow Kirk Borne on Twitter @KirkDBorne

4 Reasons why an Accurate Analytics Model may not be Good Enough

Here are four reasons why the result of your analytics modeling might be correct (according to some accuracy metric), but it might not be the right answer:

  1. Your model may be underfit.

  2. Your model may be overfit.

  3. Your model may be biased.

  4. Your model may be suffering from the false positive paradox.

In data science, we are trained to keep searching even after we find a model that appears to be accurate. Data Scientists should continue searching for a better solution, for at least the four reasons listed above. Please note that I am not advocating “paralysis of analysis”, where never-ending searches for new and better solutions are just an excuse (or a behavior pattern) that prevents one from making a final decision. Good leaders know when an answer is “good enough”. We discussed this in a previous article: “Machine Unlearning – The Value of Imperfect Models”…

(For more discussion of the four cases listed above, continue reading here

Follow Kirk Borne on Twitter @KirkDBorne

The Definitive Q&A for Aspiring Data Scientists

I was recently asked five questions by Alex Woodie of Datanami for the article, “So You Want To Be A Data Scientist” that he was preparing. He used a few snippets from my full set of answers. The longer version of my answers provided additional advice. For aspiring data scientists of all ages, I provide here the full, unabridged version of my answers, which may help you even more to achieve your goal. (Note: I paraphrase Alex’s original questions in quotes below.)

1. “What is the number one piece of advice you give to aspiring data scientists?”

My number one piece of advice always is to follow your passions first. Know what you are good at and what you care about, and pursue that. So, you might be good at math, or programming, or data manipulation, or problem solving, or communications (data journalism), or whatever. You can do that flavor of data science within the context of any domain: scientific research, government, media communications, marketing, business, healthcare, finance, cybersecurity, law enforcement, manufacturing, transportation, or whatever. As a successful data scientist, your day can begin and end with you counting your blessings that you are living your dream by solving real-world problems with data. I saw a quote recently that summarizes this: “If you think your scarce data science skills could be better used elsewhere, be bold and make the move.” (Reference).

2. “What are the most important skills for an aspiring data scientist to acquire?”

There are many skills under the umbrella of data science, and we should not expect any one single person to be a master of them all. The best solution to the data science talent shortage is a team of data scientists. So I suggest…

(continue reading here

Follow Kirk Borne on Twitter @KirkDBorne