Author Archives: Kirk Borne

About Kirk Borne

Dr. Kirk D. Borne is a transdisciplinary Data Scientist and Professor of Astrophysics & Computational Science at George Mason University (since 2003). He conducts research, teaching, and doctoral student advising in the theory and practice of data science. He is also an active consultant to numerous organizations in data science and big data analytics across a wide variety of disciplines, domains, and sectors. He previously spent nearly 20 years supporting large scientific data systems for NASA astrophysics missions, including the Hubble Space Telescope. He was identified in 2013 as the Worldwide #1 Big Data influencer on Twitter at @KirkDBorne.

Where to get your Data Science Training or Apprenticeship

I am frequently asked for suggestions regarding academic institutions, professional organizations, or MOOCs that provide Data Science training. The following list will be updated occasionally (LAST UPDATED: 2018 March 29).

Also, be sure to check out The Definitive Q&A for Aspiring Data Scientists and the story of my journey from Astrophysics to Data Science. If the latter story interests you, then here are a couple of related interviews: “Data Mining at NASA to Teaching Data Science at GMU” and “Interview with Leading Data Science Expert”.

Here are a few places to check out:

  1. The Booz Allen Field Guide to Data Science
  2. Do you have what it takes to be a Data Scientist? (get the Booz Allen Data Science Capability Handbook)
  3. http://www.thisismetis.com/explore-data-science-online-training (formerly exploredatascience.com at Booz-Allen)
  4. http://www.thisismetis.com/
  5. https://www.teamleada.com/
  6. MapR Academy (offering Free Hadoop, Spark, HBase, Drill, Hive training and certifications at MapR)
  7. Data Science Apprenticeship at DataScienceCentral.com
  8. (500+) Colleges and Universities with Data Science Degrees
  9. List of Machine Learning Certifications and Best Data Science Bootcamps
  10. NYC Data Science Academy
  11. NCSU Institute for Advanced Analytics
  12. Master of Science in Analytics at Bellarmine University
  13. http://www.districtdatalabs.com/ (District Data Labs)
  14. http://www.dataschool.io/
  15. http://www.persontyle.com/school/ 
  16. http://www.galvanize.it/education/#classes (formerly Zipfian Academy) includes http://www.galvanizeu.com/ (Data Science, Statistics, Machine Learning, Python)
  17. https://www.coursera.org/specialization/jhudatascience/1
  18. https://www.udacity.com/courses#!/data-science 
  19. https://www.udemy.com/courses/Business/Data-and-Analytics/
  20. http://insightdatascience.com/ 
  21. Data Science Master Classes (at Datafloq)
  22. http://datasciencemasters.org
  23. http://www.jigsawacademy.com/
  24. https://intellipaat.com/
  25. http://www.athenatechacademy.com/ (Hadoop training, and more)
  26. O’Reilly Media Learning Paths
  27. http://www.godatadriven.com/training.html
  28. Courses for Data Pros at Microsoft Virtual Academy
  29. 18 Resources to Learn Data Science Online (by Simplilearn)
  30. Learn Everything About Analytics (by AnalyticsVidhya)
  31. Data Science Masters Degree Programs


Follow Kirk Borne on Twitter @KirkDBorne

Big Data Growth — Compound Interest on Steroids

(This article was originally published on BigDataRepublic.com in June 2013 — that site no longer exists.)

Could a simple math formula be responsible for all of modern civilization? An article in 2013 hypothesized that there is one, and the Formula for Compound Interest is it. The formula is actually quite straightforward, but the mathematical consequences are huge and potentially impossible to assimilate. Let us illustrate this with a simple example, and then we will see the consequences for the current Big Data revolution.

Assuming an annual period of compounding, if your principal (asset or debt) P grows at an annual rate R, then your net accumulation A after one year is P*(1+R). The accumulation A grows by an additional (1+R) factor for each additional year. Therefore, your accumulation after N years is equal to A = P*(1+R)^N.

The fact that the number of compounding periods N is in the exponent of the compound interest growth formula means two things: (1) the growth rate is exponential (by definition); and (2) because the growth rate is exponential, the total accumulation A after a modest number of compounding periods can easily dwarf the initial value P, particularly for values of R equal to several percent per annum (or greater).

Many people have experienced the power of this compound interest growth through their own personal long-term retirement contributions. If you make a one-time investment of $5000 at age 20 (with no other contributions for the rest of your working career), then an annual return rate R=8% will yield a balance of $160,000 at 65 years old (a net gain of over 3000%).  If you make more modest but systematic contributions (for example $400 each year), then the final value of your retirement fund would also be $160,000 (from a total personal investment of $18,000 over 45 years – a net gain of 800%). This compound interest growth is amazing and impressive. Most people can understand these numbers and can relate them to normal life experience.

But consider what happens if the annual rate R is not a few percent, but double-digit or triple-digit percent. For example: if R=100%, then a $1 investment each year starting at age 20 would produce a net accumulation of $1024 after 10 years (from just $10 total personal investment). The net accumulation after 45 years at age 65 (from a total personal investment of $45) would equal $35,000,000,000,000 – that is, thirty-five trillion dollars! In this case, the mathematical consequences are enormous and too mind-boggling to comprehend. It is off-the-charts and unbelievable, and yet it is a mathematical certainty – the number (1+R) in the compound interest formula when R=100% is 2, and 2^45 is a truly huge number.
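For readers who want to check these figures, here is a minimal sketch in plain Python (the helper names are my own, chosen for illustration) that evaluates the compound interest formula for the scenarios above.

```python
# A minimal sketch of the compound interest formula A = P * (1 + R)**N.
# The $400/year case assumes contributions are made at the end of each year;
# the article rounds the result to roughly $160,000.

def accumulate(principal, rate, periods):
    """Compound interest: accumulation of a single deposit after `periods` years."""
    return principal * (1 + rate) ** periods

# One-time $5000 at R = 8% for 45 years (age 20 to age 65):
print(round(accumulate(5000, 0.08, 45)))   # ~159,600 (the ~$160,000 quoted above)

# $400 contributed at the end of each year, at R = 8%, for 45 years:
total = sum(accumulate(400, 0.08, 45 - year) for year in range(1, 46))
print(round(total))                        # ~154,600 (total invested: $18,000)

# R = 100% (annual doubling): (1 + R) = 2, so the very first dollar alone
# doubles 45 times by age 65.
print(round(accumulate(1, 1.00, 45)))      # 2**45 = 35,184,372,088,832 (~$35 trillion)
```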

Finally, let us connect the original historical hypothesis to our current Big Data environment.  Some conservative estimates suggest that the world’s data volume doubles every year. That is a growth rate R=100%. Does that look familiar? Annual data-doubling corresponds to 2^10 times more data after every 10 years: from zettabytes now to geopbytes in a few decades (similar to investing $45 to get $35 trillion)!  The Big Data explosion is truly enormous growth on steroids! This is why Big Data is not simply “more data”, but it is something completely different, mind-boggling, and off-the-charts impossible to grasp. Nearly every government entity, corporate decision-maker, business strategist, marketing specialist, statistician, domain scientist, news service, digital publisher, and social media guru is talking “Big Data”. However, most of us involved in those conversations cannot begin to assimilate how the current growth in Big Data and a simple math formula will be responsible for radically transforming modern civilization all over again.

Therefore, don’t believe people when they say “We have always had Big Data!” That statement completely misses the point of today’s data revolution and trivializes the massive disruptive forces that are now transforming our digital world. Today’s big data is not yesterday’s big data!

Follow Kirk Borne on Twitter @KirkDBorne

Variety is the Spice of Life for Data Scientists

“Variety is the spice of life,” they say.  And variety is the spice of data also: adding rich texture and flavor to otherwise dull numbers. Variety ranks among the most exciting, interesting, and challenging aspects of big data.  Variety is one of the original “3 V’s of Big Data” and is frequently mentioned in Big Data discussions, which focus too much attention on Volume.

A short conversation with many “old school technologists” these days too often includes the declaration: “We’ve always done big data.”  That statement really irks me… for lots of reasons.  I summarize some of those reasons in the following article: “Today’s Big Data is Not Yesterday’s Big Data.” In a nutshell, such statements focus almost entirely on Volume, which really misses the whole point of big data (in my humble opinion)… here comes the Internet of Things… hold onto your bits!

The greatest challenges and the most interesting aspects of big data appear in high-Velocity Big Data (requiring fast real-time analytics) and high-Variety Big Data (enabling the discovery of interesting patterns, trends, correlations, and features in high-dimensional spaces). Maybe because of my training as an astrophysicist, or maybe because scientific curiosity is a natural human characteristic, I love exploring features in multi-dimensional parameter spaces for interesting discoveries, and so should you!

Dimension reduction is a critical component of any solution dealing with high-variety (high-dimensional) data. Being able to sift through a mountain of data efficiently in order to find the key predictive, descriptive, and indicative features of the collection is a fundamental required data science capability for coping with Big Data.

Identifying the most interesting dimensions of the data is especially valuable when visualizing high-dimensional data. There is a “good news, bad news” perspective here. First, the bad news: the human capacity for seeing multiple dimensions is very limited: 3 or 4 dimensions are manageable; and 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms for speed and effectiveness. In this case, a “scene” refers to any small-n projection of a larger-N parameter space of variables.

In data visualization, a systematic ordered parameter sweep through an ensemble of small-n projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to see quickly any patterns or trends or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/n is large. Consequently, a data analytics approach that combines the best of both worlds (machine vision algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data.

One such approach is to use statistical and machine learning techniques to develop “interestingness metrics” for high-variety data sets.  As such algorithms are applied to the data (in parameter sweeps or grand tours), they can discover and then present to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data: “Numbers are powerful, especially in interesting combinations.”
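As a rough illustration of the idea (not Soft10’s proprietary method), the following Python sketch performs a “grand tour” style sweep over every 2-dimensional projection of a synthetic high-dimensional data set and ranks the projections by a deliberately simple interestingness metric, the absolute Pearson correlation; in practice one would substitute richer metrics and real data.

```python
# Sketch: rank all 2-D "scenes" of a high-variety data set by interestingness.
import itertools
import numpy as np

rng = np.random.default_rng(42)

# Synthetic high-dimensional data: 20 mostly-noise features, with one strong
# relationship hidden between features 3 and 7.
n_samples, n_features = 1000, 20
X = rng.normal(size=(n_samples, n_features))
X[:, 7] = 0.8 * X[:, 3] + 0.2 * rng.normal(size=n_samples)

def interestingness(x, y):
    """Score a 2-D projection; |Pearson r| is a deliberately simple stand-in."""
    return abs(np.corrcoef(x, y)[0, 1])

scores = {
    (i, j): interestingness(X[:, i], X[:, j])
    for i, j in itertools.combinations(range(n_features), 2)
}

# Present the most interesting small-n scenes to the human viewer first.
for (i, j), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"features ({i}, {j}): interestingness = {score:.3f}")
```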

The outcomes of such exploratory data analyses are further enhanced when the analytics tool ranks the output models (e.g., the data’s “most interesting parameters”) in order of significance and explanatory power (i.e., their ability to “explain” the complex high-dimensional patterns in the data).  Soft10’s “automatic statistician” Dr. Mo is a fast predictive analytics software package for exploring complex high-dimensional (high-variety) data.  Dr. Mo’s proprietary modeling and analytics techniques have been applied across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more. Check out Dr. Mo at http://soft10ware.com/ and read more here: http://soft10ware.com/big-data-complexity-requires-fast-modeling-technology/

Kirk Borne is a member of the Soft10, Inc. Board of Advisors.

Follow Kirk Borne on Twitter @KirkDBorne

Standards in the Big Data Analytics Profession

A sign of maturity for most technologies and professions is the appearance of standards. Standards are used to enable, to promote, to measure, and perhaps to govern the use of that technology or the practice of that profession across a wide spectrum of communities. Standardization encourages independent application and comparative evaluation of a profession’s tools and practices.

Standards often apply to processes and codes of conduct, but standards also apply to digital content, including: (a) interoperable data exchange (such as GIS, CDF, or XML-based data standards); (b) data formats (such as ASCII or IEEE 754); (c) image formats (such as GIF or JPEG); (d) metadata coding standards (such as ICD-10 for the medical profession, or the Dublin Core for cultural, research, and information artifacts); and (e) standards for the sharing of models (such as PMML, the predictive model markup language, for data mining models).

Standards are ubiquitous.  This abundance causes some folks to quip: “The nice thing about standards is that there are so many of them.”  So, it should not be surprising to note that standards are now beginning to appear also in the worlds of big data and data science, providing evidence of the growing maturity of those professions…

(continue reading here: https://www.mapr.com/blog/raising-standard-big-data-analytics-profession)

Follow Kirk Borne on Twitter @KirkDBorne

My Data Science Declaration for 2015

Here it is… my Data Science Declaration for 2015 (posted to Twitter on January 14, 2015):

“Now is the time to begin thinking of Data Science as a profession not a job, as a corporate culture not a corporate agenda, as a strategy not a stratagem, as a core competency not a course, and as a way of doing things not a thing to do.”


Follow Kirk Borne on Twitter @KirkDBorne

Top 10 Conversations That You Don’t Want to Have on Data Privacy Day

On January 28, the world observes Data Privacy Day. Here are the top 10 conversations that you do not want to have on that day. Let the countdown begin….

10.  CDO (Chief Data Officer) speaking to Data Privacy Day event manager who is trying to re-schedule the event for Father’s Day: “Don’t do that! It’s pronounced ‘Day-tuh’, not ‘Dadda’.”

9.  CDO speaking at the company’s Data Privacy Day event regarding an acronym that was used to list his job title in the event program guide: “I am the company’s Big Data ‘As A Service’ guru, not the company’s Big Data ‘As Software Service’ guru.”  (Hint: that’s BigData-aaS, not BigData-aSS)

8.  Data Scientist speaking to Data Privacy Day session chairperson: “Why are all of these cows on stage with me? I said I was planning to give a LASSO demonstration.”

7.  Any person speaking to you: “Our organization has always done big data.”

6.  You speaking to any person: “Seriously? … The title of our Data Privacy Day Event is ‘Big Data is just Small Data, Only Bigger’.”

5.  New cybersecurity administrator (fresh from college) sends this e-mail to company’s Data Scientists at 4:59pm: “The security holes in our data-sharing platform are now fixed. It will now automatically block all ports from accepting incoming data access requests between 5:00pm and 9:00am the next day.  Gotta go now.  Have a nice evening.  From your new BFF.”

4.  Data Scientist to new HR Department Analytics Specialist regarding the truckload of tree seedlings that she received as her end-of-year company bonus:  “I said in my employment application that I like Decision Trees, not Deciduous Trees.”

3.  Organizer for the huge Las Vegas Data Privacy Day Symposium speaking to the conference keynote speaker: “Oops, sorry.  I blew your $100,000 speaker’s honorarium at the poker tables in the Grand Casino.”

2.  Over-zealous cleaning crew speaking to Data Center Manager arriving for work in the morning after Data Privacy Day event that was held in the company’s shiny new Exascale Data Center: “We did a very thorough job cleaning your data center. And we won’t even charge you for the extra hours that we spent wiping the dirty data from all of those disk drives that you kept talking about yesterday.”

1.  Announcement to University staff regarding the Data Privacy Day event:  “Dan Ariely’s keynote talk ‘Big Data is Like Teenage Sex’ is being moved from room B002 in the Physics Department to the Campus Football Stadium due to overwhelming student interest.”

Follow Kirk Borne on Twitter @KirkDBorne

When Big Data Gets Local, Small Data Gets Big

We often hear that small data deserves at least as much attention in our analyses as big data. While there may be as many interpretations of that statement as there are definitions of big data, there are at least two situations where “small data” applications are worth considering. I will label these “Type A” and “Type B” situations.

In “Type A” situations, small data refers to having a razor-sharp focus on your business objectives, not on the volume of your data. If you can achieve those business objectives (and “answer the mail”) with small subsets of your data mountain, then do it, at once, without delay!

In “Type B” situations, I believe that “small” can be interpreted to mean that we are relaxing at least one of the 3 V’s of big data: Velocity, Variety, or Volume:

  1. If we focus on a localized time window within high-velocity streaming data (in order to mine frequent patterns, find anomalies, trigger alerts, or perform temporal behavioral analytics), then that is deriving value from “small data” (see the sketch after this list).
  2. If we limit our analysis to a localized set of features (parameters) in our complex high-variety data collection (in order to find dominant segments of the population, or classes/subclasses of behavior, or the most significant explanatory variables, or the most highly informative variables), then that is deriving value from “small data.”
  3. If we target our analysis on a tight localized subsample of entries in our high-volume data collection (in order to deliver one-to-one customer engagement, personalization, individual customer modeling, and high-precision target marketing, all of which still require use of the full complexity, variety, and high-dimensionality of the data), then that is deriving value from “small data.”
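As a minimal sketch of the first case above (the window size, alert rule, and simulated stream are purely illustrative assumptions), the following Python snippet keeps only a localized window of a high-velocity stream and flags anomalies against that window’s recent behavior.

```python
# Sketch: "small data" view of a fast stream -- analyze only a recent window.
from collections import deque
import random
import statistics

window = deque(maxlen=100)          # the localized time window (the "small data")

def process(value, threshold=3.0):
    """Flag a value that sits far outside the recent window's behavior."""
    if len(window) >= 30:           # wait for a minimally stable baseline
        mean = statistics.mean(window)
        stdev = statistics.pstdev(window) or 1e-9
        if abs(value - mean) > threshold * stdev:
            print(f"alert: {value:.2f} deviates from recent window "
                  f"(mean {mean:.2f}, stdev {stdev:.2f})")
    window.append(value)

# Simulated stream: mostly well-behaved readings, with a spike injected at t=500.
random.seed(7)
for t in range(1000):
    reading = random.gauss(50, 5) + (40 if t == 500 else 0)
    process(reading)
```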

(continue reading here: https://www.mapr.com/blog/when-big-data-goes-local-small-data-gets-big-part-1)

Follow Kirk Borne on Twitter @KirkDBorne

Local Linear Embedding (Image source**: http://mdp-toolkit.sourceforge.net/examples/lle/lle.html)

**Zito, T., Wilbert, N., Wiskott, L., & Berkes, P. (2008). Modular toolkit for Data Processing (MDP): a Python data processing framework. Front. Neuroinform. 2:8. doi:10.3389/neuro.11.008.2008

New Directions for Big Data and Analytics in 2015

The world of big data and analytics is remarkably vibrant and marked by incredible innovation, and there are advancements on every front that will continue into 2015. These include increased data science education opportunities and training programs, in-memory analytics, cloud-based everything-as-a-service, innovations in mobile (business intelligence and visual analytics), broader applications of social media (for data generation, consumption and exploration), graph (linked data) analytics, embedded machine learning and analytics in devices and processes, digital marketing automation (in retail, financial services and more), automated discovery in sensor-fed data streams (including the internet of everything), gamification, crowdsourcing, personalized everything (medicine, education, customer experience and more) and smart everything (highways, cities, power grid, farms, supply chain, manufacturing and more).

Within this world of wonder, where will we wander with big data and analytics in 2015? I predict two directions for the coming year…

(continue reading here: http://www.ibmbigdatahub.com/blog/new-directions-big-data-and-analytics-2015)

Follow Kirk Borne on Twitter @KirkDBorne

Feature Mining in Big Data

We love features in our data, lots of features, in the same way that we love features in our toys, mobile phones, cars, and other gadgets.  Good features in our big data collection empower us to build accurate predictive models, identify the most informative trends in our data, discover insightful patterns, and select the most descriptive parameters for data visualizations. Therefore, it is no surprise that feature mining is one aspect of data science that appeals to all data scientists. Feature mining includes: (1) feature generation (from combinations of existing attributes), (2) feature selection (for mining and knowledge discovery), and (3) feature extraction (for operational systems, decision support, and reuse in various analytics processes, dashboards, and pipelines).
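As a minimal sketch of these three steps (using scikit-learn purely for illustration; the publications listed below describe other tools and methods), the following Python snippet generates candidate features from combinations of existing attributes, selects the most informative ones, and extracts a compact representation for reuse.

```python
# Sketch: feature generation -> feature selection -> feature extraction.
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# (1) Feature generation: new attributes from combinations of existing ones.
X_gen = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

# (2) Feature selection: keep the attributes most informative about the target.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X_gen, y)

# (3) Feature extraction: compress the selection into a compact representation
#     that can be reused in downstream models, dashboards, and pipelines.
X_ext = PCA(n_components=5).fit_transform(X_sel)

print(X.shape, X_gen.shape, X_sel.shape, X_ext.shape)
# (500, 20) (500, 210) (500, 10) (500, 5)
```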

Learn more about feature mining and feature selection for Big Data Analytics in these publications:

  1. Feature-Rich Toys and Data
  2. Interactive Visualization-enabled Feature Selection and Model Creation
  3. Feature Selection (available on the National Science Bowl blog site)
  4. Feature Selection Methods used with different Data Mining algorithms
  5. (and for heavy data science pundits) Computational Methods of Feature Selection

Follow Kirk Borne on Twitter @KirkDBorne

6 Ways To Be Fooled by Randomness

Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Consequently, in data science, randomness in your data can negate the value of a predictive analytics model.

It is easy to be fooled by randomness. We often see randomness when there is none, and vice versa. Here are 6 ways in which we can be fooled by randomness:

  1. We often tend to pick out and focus on the “most interesting” results in our data, and ignore the uninteresting cases.  For example, if you toss a coin 2000 times, and you see a subsequence of 12 consecutive Heads in the sequence, then your attention is directed to this interesting subsequence (and you might conclude that there is something unfair about the coin or the coin tossing) even though it is statistically reasonable for such a subsequence to appear. This is selection bias, and it is also an example of “a posteriori” statistics (derived from observed facts, not from logical principles). (A quick simulation of this case, and of the birthday paradox in item 3, appears after this list.)
  2. We may unintentionally overlook the randomness in the data, especially in our rush to build predictive analytics models.
  3. Randomness sometimes appears to behave opposite to what our intuition would suggest. An example of this is the famous birthday paradox (in which the likelihood that two people in a crowd have the same birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because, as you increase the sample size, it becomes less and less likely to avoid the same birthday (i.e., less and less likely to avoid a repeating pattern in random data).
  4. Humans are good at seeing patterns and correlations in data, but humans are less good at remembering that correlation does not imply causation.
  5. The bigger the data set, the more likely you will see an “unlikely” pattern!
  6. When asked to pick the “random” statistical distribution that was generated by a human (versus a distribution generated by an algorithm), we tend to confuse “randomness” with the “appearance of randomness”. A distribution may appear to be more random when in fact it is less random, because it has an unrealistically small variance in its behavior (human-generated “random” sequences, for example, tend to avoid the long runs that true randomness produces).
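Items 1 and 3 above are easy to check by simulation. Here is a quick Python sketch (the trial count and random seed are arbitrary choices) that estimates both probabilities.

```python
# Sketch: simulate the long-run-of-heads case (item 1) and the birthday paradox (item 3).
import random

random.seed(0)
TRIALS = 2_000

def has_long_run(n_tosses=2000, run_length=12):
    """Does a run of `run_length` consecutive Heads appear in `n_tosses` fair tosses?"""
    run = 0
    for _ in range(n_tosses):
        run = run + 1 if random.random() < 0.5 else 0
        if run >= run_length:
            return True
    return False

def shared_birthday(group_size=23):
    """Do at least two people in a group of `group_size` share a birthday?"""
    days = [random.randrange(365) for _ in range(group_size)]
    return len(set(days)) < group_size

p_run = sum(has_long_run() for _ in range(TRIALS)) / TRIALS
p_bday = sum(shared_birthday() for _ in range(TRIALS)) / TRIALS

print(f"P(12 Heads in a row somewhere in 2000 tosses) ~ {p_run:.2f}")   # roughly 0.2
print(f"P(shared birthday among 23 people)            ~ {p_bday:.2f}")  # roughly 0.5
```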

We consider 3 examples of randomness in order to test our ability to recognize it…

(continue reading here: http://www.analyticbridge.com/profiles/blogs/7-traps-to-avoid-being-fooled-by-statistical-randomness)

Follow Kirk Borne on Twitter @KirkDBorne