Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).
Some of the key players in efforts that use open data for social good include DataKind, Bayes Impact, Booz Allen Hamilton, Kaggle, Data Analysts for Social Good, and the Tableau Foundation. Check out this “Definitive Guide to do Data Science for Good.” Interested scientists should also check out the Data Science for Social Good Fellowship Program.
We discussed six V’s of Open Data at the DATA Act Forum in July 2015, and we have since added one more. The following seven V’s represent characteristics and challenges of open data:
- Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
- Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
- Variety: data types, formats, and schemas are as varied as the organizations that collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
- Voice: your open data becomes the voice of your organization to your stakeholders (including customers, clients, employees, sponsors, and the public).
- Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
- Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years, and this applies even to data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Obviously, theft of open data isn’t the problem! But data security can still be a major issue for open data, which are much more vulnerable to misuse, abuse, manipulation, or alteration.
- proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for their proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that have been made to the data, processing that has been applied to them (including which versions of processing software were used), the data’s uses, their context, and more.
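To make the provenance item a bit more concrete, here is a minimal sketch in Python of what a lineage record might look like. Everything in it is an assumption made for illustration: the class names `ProvenanceRecord` and `ProcessingStep`, the field names, and the sample values are all hypothetical, and a real repository would more likely adopt an established provenance standard than a hand-rolled structure like this one.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class ProcessingStep:
    """One transformation applied to the data set (hypothetical structure)."""
    description: str
    software: str
    software_version: str


@dataclass
class ProvenanceRecord:
    """Illustrative lineage record for a single open data set."""
    dataset: str
    owner: str
    origin: str
    chain_of_custody: List[str] = field(default_factory=list)
    processing: List[ProcessingStep] = field(default_factory=list)

    def add_step(self, description: str, software: str, software_version: str) -> None:
        # Append rather than overwrite, so the lineage stays a permanent record.
        self.processing.append(
            ProcessingStep(description, software, software_version)
        )

    def to_json(self) -> str:
        # asdict() converts the nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2)


# Made-up example values, not a real data set.
record = ProvenanceRecord(
    dataset="example-city-budget",
    owner="Example City Finance Office",
    origin="internal ledger export",
    chain_of_custody=["Finance Office", "Open Data Team"],
)
record.add_step("removed duplicate rows", "example-cleaner", "1.2.0")
print(record.to_json())
```

Publishing a record like this alongside the data lets a reuser see who owned the data, where they came from, and which software versions touched them.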
Here are some sources and meta-sources of open data [UPDATED October 2023]:
- https://www.kaggle.com/datasets
- https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data
- https://www.edureka.co/blog/25-best-free-datasets-machine-learning/
- https://dzone.com/articles/19-free-public-data-sets-for-your-data-science-pro
- Datasets to make your AI better
- Datasets for AI and Machine Learning Training
- http://data.gov
- https://github.com/caesar0301/awesome-public-datasets
- http://www.census.gov/data.html
- http://www.healthdata.gov/
- http://www.opendatanetwork.com/
- https://www.quandl.com/
- http://data.gov.uk/
- http://index.okfn.org/dataset/
- http://www.gapminder.org/data/
- http://aws.amazon.com/datasets
- http://www.google.com/publicdata/directory
- http://datacatalog.worldbank.org/
- https://www.kaggle.com/competitions
- http://www.kdnuggets.com/datasets/index.html
- http://www.crowdflower.com/data-for-everyone
- http://www.data-mania.com/blog/19-excellent-free-open-data-sources-for-doing-datascience/
- http://archive.ics.uci.edu/ml/
- https://kdd.ics.uci.edu/
- http://wiki.dbpedia.org/
- http://blog.visual.ly/data-sources/
- http://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/
- http://people.stern.nyu.edu/adamodar/New_Home_Page/data.html
- http://readwrite.com/2008/04/09/where_to_find_open_data_on_the
We have not even tried to list here the thousands of open data sources in specific disciplines, such as the sciences, including astronomy, medicine, climate, chemistry, materials science, and much more.
The Sunlight Foundation has published an impressively detailed list of 30+ Open Data Policy Guidelines at http://sunlightfoundation.com/opendataguidelines/. These guidelines cover the following topics (and more) with several real policy examples provided for each: (a) What data should be public? (b) How to make data public? (c) Creating permanent and lasting access to data. (d) Mandating the use of unique identifiers. (e) Creating public APIs for accessing information. (f) Creating processes to ensure data quality.
Related to open data initiatives, the W3C Working Group for “Data on the Web Best Practices” has published a Data Quality Vocabulary (to express the data’s quality), including these quality metrics for data on the web (which are related to our 7 V’s of open data that we described above):
- Availability
- Processability
- Accuracy
- Consistency
- Relevance
- Completeness
- Conformance
- Credibility
- Timeliness
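As a rough illustration of how two of these metrics could be measured on a small table, here is a sketch in Python. The function names `completeness` and `conformance`, their definitions as simple fractions, and the toy rows are our own assumptions for this example. They are not taken from the W3C vocabulary, which describes how to express quality information rather than how to compute it.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def completeness(rows: List[Row], fields: List[str]) -> float:
    """Fraction of expected values that are present and non-empty."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for f in fields if row.get(f) not in (None, "")
    )
    return filled / total


def conformance(rows: List[Row], field_name: str, allowed: Set[str]) -> float:
    """Fraction of rows whose value for field_name is in the allowed set."""
    if not rows:
        return 0.0
    ok = sum(1 for row in rows if row.get(field_name) in allowed)
    return ok / len(rows)


# Toy rows with made-up values: two missing counts, one non-standard code.
rows: List[Row] = [
    {"state": "VA", "count": "100"},
    {"state": "MD", "count": ""},
    {"state": "Virginia", "count": "100"},
    {"state": "DC", "count": None},
]

print(completeness(rows, ["state", "count"]))       # 6 of 8 values present
print(conformance(rows, "state", {"VA", "MD", "DC"}))  # 3 of 4 codes allowed
```

Simple scores like these could be published with a data set as part of its quality metadata, so that reusers know what to expect before they download anything.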
Follow Kirk Borne on Twitter @KirkDBorne