Open Data - Rocket-Powered Data Science

[Note: this post is a slightly modified version of an earlier post. Although written with light (perhaps weak) humor, this post was created to bring attention to the Benefits of Open Data for innovation, value creation, and digital transformation on Open Data Day.]

Data Innovation is a powerful strategic goal for data-intensive organizations, and one worth celebrating at Data Innovation Day events, whenever they may occur. Here are the top 10 conversations that you do not want to have on that day. Let the countdown begin…

10. CDO (Chief Data Officer) speaking to Data Innovation Day event manager who is trying to re-schedule the event for Father’s Day: “Don’t do that! It’s pronounced ‘Day-tuh’, not ‘Dadda’.”

9. CDO speaking at the company’s Data Innovation Day event regarding an acronym that was used to abbreviate his job title in the event program guide: “I am the company’s Big Data ‘As A Service’ guru, not the company’s Bigdata ‘As Software Service’ guru.” (Hint: that’s BigData-aaS, not Big-aSS)

8. Data Scientist speaking to Data Innovation Day session chairperson: “Why are all of these cowboys on stage with me? I said I was planning to give a LASSO demonstration.”

7. Any person speaking to you: “Our organization has always done big data.”

6. You speaking to any person: “Seriously? … The title of our Data Innovation Day Event is ‘Big Data is just Small Data, Only Bigger’.”

5. New cybersecurity administrator (fresh from college) sends this e-mail to company’s Data Scientists at 4:59pm on Friday: “The security holes in our data-sharing platform are now fixed. It will now automatically block all ports from accepting incoming data access requests between 5:00pm today and 9:00am Monday. Gotta go now. Have a nice weekend. From your new BFF.”

4. Data Scientist speaking to new Human Resources Department Analytics Specialist regarding the truckload of tree seedlings that she received as her end-of-year company bonus: “I said in my employment application that I like Decision Trees, not Deciduous Trees.”

3. Organizer for the huge Las Vegas Data Innovation Day Symposium speaking to the conference keynote speaker: “Oops, sorry. I blew your $100,000 speaker’s honorarium at the poker tables in the Grand Casino.”

2. Over-zealous cleaning crew speaking to the Data Center Manager as she arrives for work on the morning after the Data Innovation Day event that was held in the company’s shiny new Exascale Data Center: “We did a very thorough job cleaning your data center. And we won’t even charge you for the extra hours that we spent wiping the disk drives clean and deleting all the dirty data that you kept talking about yesterday.”

1. Announcement to University staff regarding the Data Innovation Day event: “Dan Ariely’s keynote talk ‘Big Data is Like Teenage Sex’ is being moved from basement room B002 in the Physics Department to the Campus Football Stadium due to overwhelming student interest.”

Follow Kirk Borne on Twitter @KirkDBorne

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

Some of the key players in efforts that use open data for social good include: DataKind, Bayes Impact, Booz Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.

We discussed 6 V’s of Open Data at the DATA Act Forum in July 2015, and we have since added one more. The following seven V’s represent characteristics and challenges of open data:

Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
Variety: the data types, formats, and schemas are as varied as the organizations that collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years, and this applies even to data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Obviously, theft of open data isn’t the problem! But data security can still be a major issue for open data, which are much more vulnerable to misuse, abuse, manipulation, or alteration.
proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses, their context, and more.
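
To make the provenance idea a little more concrete, here is a minimal sketch in Python of what such a record might look like. Everything in it is an assumption made for illustration: the class names (ProvenanceRecord, ProcessingStep), the field names, and the sample dataset are hypothetical, and real projects would more likely adopt an established vocabulary than roll their own. It simply shows ownership, origin, chain of custody, and versioned processing steps being kept together in one place.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    """One transformation applied to the data (hypothetical structure)."""
    description: str
    software: str
    software_version: str


@dataclass
class ProvenanceRecord:
    """Illustrative lineage record for one open data set (not a standard)."""
    dataset_name: str
    owner: str
    origin: str
    custody_chain: List[str] = field(default_factory=list)
    steps: List[ProcessingStep] = field(default_factory=list)
    known_uses: List[str] = field(default_factory=list)

    def transfer_custody(self, new_custodian: str) -> None:
        # Append rather than overwrite, so the full chain of custody is kept.
        self.custody_chain.append(new_custodian)

    def record_step(self, description: str, software: str, version: str) -> None:
        # Keep the software version with each step, as the text recommends.
        self.steps.append(ProcessingStep(description, software, version))

    def lineage(self) -> List[str]:
        # Human-readable summary of every processing step, in order applied.
        return [
            f"{s.description} ({s.software} {s.software_version})"
            for s in self.steps
        ]


# Hypothetical example data, invented purely for this sketch.
record = ProvenanceRecord(
    dataset_name="bus_ridership_2015",
    owner="City Transit Authority",
    origin="fare-card readers",
)
record.transfer_custody("Open Data Portal Team")
record.record_step("Removed rider identifiers", "anonymizer", "1.2.0")
print(record.lineage())
```

The append-only lists are the one deliberate design choice here: a provenance record is only useful if earlier custodians and earlier processing steps are never silently replaced.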

Here are some sources and meta-sources of open data [UPDATED October 2023]:

We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.

The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at http://sunlightfoundation.com/opendataguidelines/. These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.

Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including these quality metrics for data on the web (which are related to the 7 V’s of open data described above):

Availability
Processability
Accuracy
Consistency
Relevance
Completeness
Conformance
Credibility
Timeliness

Source: http://blog.import.io/post/free-access-to-data-a-non-boring-story-about-open-data

Data Reflections by Dr. Kirk Borne @KirkDBorne

Top 10 Conversations That You Do Not Want to Have on Data Innovation Day

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories