Tag Archives: Analytics

New Directions for Big Data and Analytics in 2015

The world of big data and analytics is remarkably vibrant and marked by incredible innovation, and there are advancements on every front that will continue into 2015. These include increased data science education opportunities and training programs, in-memory analytics, cloud-based everything-as-a-service, innovations in mobile (business intelligence and visual analytics), broader applications of social media (for data generation, consumption and exploration), graph (linked data) analytics, embedded machine learning and analytics in devices and processes, digital marketing automation (in retail, financial services and more), automated discovery in sensor-fed data streams (including the internet of everything), gamification, crowdsourcing, personalized everything (medicine, education, customer experience and more) and smart everything (highways, cities, power grid, farms, supply chain, manufacturing and more).

Within this world of wonder, where will we wander with big data and analytics in 2015? I predict two directions for the coming year…

(continue reading here: http://www.ibmbigdatahub.com/blog/new-directions-big-data-and-analytics-2015)

Follow Kirk Borne on Twitter @KirkDBorne

Feature Mining in Big Data

We love features in our data, lots of features, in the same way that we love features in our toys, mobile phones, cars, and other gadgets.  Good features in our big data collection empower us to build accurate predictive models, identify the most informative trends in our data, discover insightful patterns, and select the most descriptive parameters for data visualizations. Therefore, it is no surprise that feature mining is one aspect of data science that appeals to all data scientists. Feature mining includes: (1) feature generation (from combinations of existing attributes), (2) feature selection (for mining and knowledge discovery), and (3) feature extraction (for operational systems, decision support, and reuse in various analytics processes, dashboards, and pipelines).
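As a rough illustration of the first two steps, the sketch below generates a new feature from two existing attributes and then ranks all features by the absolute value of their Pearson correlation with a target. The toy data, the attribute names (width, height, noise, area, price), and the choice of correlation as the ranking score are all assumptions made for illustration; many other selection criteria are possible.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Feature selection: sort feature names by |correlation| with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: the target depends on the product of two attributes.
rng = random.Random(0)
width = [rng.uniform(1, 10) for _ in range(500)]
height = [rng.uniform(1, 10) for _ in range(500)]
noise = [rng.uniform(1, 10) for _ in range(500)]  # attribute unrelated to the target
price = [w * h + rng.gauss(0, 1) for w, h in zip(width, height)]

features = {"width": width, "height": height, "noise": noise}

# Feature generation: combine existing attributes into a new one.
features["area"] = [w * h for w, h in zip(width, height)]

# Feature selection: rank the original and generated features together.
ranking = rank_features(features, price)
for name, score in ranking:
    print(f"{name:>6}: |r| = {score:.3f}")
```

In this toy setup the generated feature should rank above either of the attributes it was built from, which is the usual motivation for feature generation.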

Learn more about feature mining and feature selection for Big Data Analytics in these publications:

  1. Feature-Rich Toys and Data
  2. Interactive Visualization-enabled Feature Selection and Model Creation
  3. Feature Selection (available on the National Science Bowl blog site)
  4. Feature Selection Methods used with different Data Mining algorithms
  5. (and for heavy data science pundits) Computational Methods of Feature Selection

6 Ways To Be Fooled by Randomness

Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Consequently, in data science, randomness in your data can negate the value of a predictive analytics model.

It is easy to be fooled by randomness. We often see randomness when there is none, and vice versa. Here are 6 ways in which we can be fooled by randomness:

  1. We often tend to pick out and focus on the “most interesting” results in our data, and ignore the uninteresting cases.  For example, if you toss a coin 2000 times, and you see a subsequence of 12 consecutive Heads in the sequence, then your attention is directed to this interesting subsequence (and you might conclude that there is something unfair about the coin or the coin tossing) even though it is statistically reasonable for such a subsequence to appear. This is selection bias, and it is also an example of “a posteriori” statistics (derived from observed facts, not from logical principles).
  2. We may unintentionally overlook the randomness in the data, especially in our rush to build predictive analytics models.
  3. Randomness sometimes appears to behave opposite to what our intuition would suggest. An example of this is the famous birthday paradox (in which the likelihood that at least two people in a crowd have the same birthday is approximately 50% when there are only 23 people in the group). This 50-50 break point occurs at such a small number because the number of pairs of people grows much faster than the group size (23 people form 253 distinct pairs), so it quickly becomes unlikely that every pair avoids sharing a birthday (i.e., less and less likely to avoid a repeating pattern in random data).
  4. Humans are good at seeing patterns and correlations in data, but humans are less good at remembering that correlation does not imply causation.
  5. The bigger the data set, the more likely you will see an “unlikely” pattern!
  6. When asked to pick the “random” statistical distribution that is generated by a human (versus a distribution generated by an algorithm), we tend to confuse “randomness” with the “appearance of randomness”. A distribution may appear to be more random, but in fact it is less random, since its variance is unrealistically small.
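Two of the claims in the list above (the run of 12 Heads in 2000 tosses, and the birthday paradox) can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a fair coin, independent tosses, and 365 equally likely birthdays; the function names are made up for this illustration.

```python
def birthday_match_probability(group_size, days=365):
    """Probability that at least two people in the group share a birthday."""
    prob_all_distinct = 1.0
    for k in range(group_size):
        prob_all_distinct *= (days - k) / days
    return 1.0 - prob_all_distinct


def prob_run_of_heads(n_tosses, run_length, p_heads=0.5):
    """Probability of at least one streak of run_length heads in n_tosses."""
    # state[k] = probability that the current streak of heads has length k
    # and that no streak of run_length heads has occurred so far.
    state = [0.0] * run_length
    state[0] = 1.0
    for _ in range(n_tosses):
        new_state = [0.0] * run_length
        for k, prob in enumerate(state):
            new_state[0] += prob * (1 - p_heads)    # tails resets the streak
            if k + 1 < run_length:
                new_state[k + 1] += prob * p_heads  # heads extends the streak
            # otherwise the streak is complete and this probability is dropped
        state = new_state
    return 1.0 - sum(state)


print(f"P(shared birthday, 23 people) = {birthday_match_probability(23):.3f}")
for n in (200, 2000, 20000):
    print(f"P(run of 12 heads in {n} tosses) = {prob_run_of_heads(n, 12):.3f}")
```

Under these assumptions, a run of 12 Heads turns up in roughly one out of five 2000-toss experiments, so seeing one is hardly evidence of an unfair coin. Increasing the number of tosses pushes that probability higher still, which is point 5 in action.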

We consider 3 examples of randomness in order to test our ability to recognize it…

(continue reading here: http://www.analyticbridge.com/profiles/blogs/7-traps-to-avoid-being-fooled-by-statistical-randomness)

Machine Unlearning and The Value of Imperfect Models

Common wisdom states that “perfect is the enemy of good enough.” We can apply this wisdom to the machine learning models that we train and deploy for big data analytics. If we strive for perfection, then we may encounter several potential risks. It may be useful therefore to pay attention to a little bit of “machine unlearning.” For example:

Overfitting

By attempting to build a model that correctly follows every little nuance, deviation, and variation in our data set, we are almost certainly fitting the natural variance (noise) in the data, which will never go away.  After building such a model, we may find that it has nearly 100% accuracy on the training data, but significantly lower accuracy on the test data set.  Such a gap between training and test accuracy is a strong sign that we have overfit our model. Of course, we don’t want a trivial model (an underfit model) either – to paraphrase Albert Einstein: “models should be as simple as possible, but no simpler.”
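The train-versus-test gap described above is easy to reproduce on synthetic data. Here is a minimal sketch, assuming a one-dimensional toy problem with 20% label noise: a 1-nearest-neighbor rule memorizes the training set (noise included), while a single fixed threshold stands in for the simpler model. The data generator, the noise level, and the hand-picked threshold are all assumptions for illustration.

```python
import random


def make_data(n, rng, label_noise=0.2):
    """Points x in [0, 1); true label is 1 when x > 0.5, flipped with probability label_noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < label_noise:
            label = 1 - label  # simulate natural variance in the labels
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Overfit model: copy the label of the nearest memorized training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def predict_threshold(x):
    """Simple model: one fixed threshold (chosen by hand here, not learned)."""
    return int(x > 0.5)


def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == label for x, label in data) / len(data)


rng = random.Random(42)
train = make_data(200, rng)
test = make_data(1000, rng)

nn_train = accuracy(lambda x: predict_1nn(train, x), train)
nn_test = accuracy(lambda x: predict_1nn(train, x), test)
simple_train = accuracy(predict_threshold, train)
simple_test = accuracy(predict_threshold, test)

print(f"1-NN      train = {nn_train:.2f}, test = {nn_test:.2f}")
print(f"threshold train = {simple_train:.2f}, test = {simple_test:.2f}")
```

The memorizing model scores perfectly on the data it has seen and noticeably worse on new data, while the "imperfect" threshold model gives up training accuracy and generalizes better.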

(continue reading here: https://www.mapr.com/blog/machine-unlearning-value-imperfect-models)

The Power of Three: Big Data, Hadoop, and Finance Analytics

Big data is a universal phenomenon. Every business sector and aspect of society is being touched by the expanding flood of information from sensors, social networks, and streaming data sources. The financial sector is riding this wave as well. We examine here some of the features and benefits of Hadoop (and its family of tools and services) that enable large-scale data processing in finance (and consequently in nearly every other sector).

Three of the greatest benefits of big data are discovery, improved decision support, and greater return on innovation. In the world of finance, these also represent critical business functions….

(continue reading here:  https://www.mapr.com/blog/potent-trio-big-data-hadoop-and-finance-analytics)

Numbers are Powerful, Especially in Combination

The phrase “Big Data” refers to a set of serious analytical challenges that arise when the data increase in quantity, real-time speed, and complexity.  The three V’s of big data (Volume, Velocity, and Variety) are now well known and well worn. Their familiarity and frequent association with “big data hype” may numb us to the important data challenges that they are meant to represent. These three characterizations have their counterparts in tools and technologies.  For example, Hadoop (Apache’s open source implementation of the MapReduce programming model) is the technology du jour for management and analysis of high-volume data.  The Hadoop Distributed File System (HDFS) is the file system for big data storage and access in Hadoop clusters.  Apache Spark is a computing framework (which can run on top of HDFS and other storage systems) for fast processing of high-velocity data.

But, what about high-variety data?  The storage and management challenges of such data are already addressed (see above), but the real challenge is in performing effective and efficient statistical modeling, data mining, and discovery across high-dimensional (complex) data sets.  Software tools like Soft10 Inc.’s “automatic statistician” Dr. Mo are designed to address that specific challenge.

When considering complex (high-variety) data, it is important to note that even relatively small-volume data sets can pose huge challenges to modeling, mining, and analysis algorithms and tools. For example, consider a gigabyte data table with a billion entries. If those entries correspond to 500 million rows and 2 columns, then some relatively simple “textbook” techniques can be applied: e.g., correlation analysis, regression analysis, Naïve Bayes, etc. However, if those entries correspond to one million rows and 1000 columns, then the complexity of the data analysis explodes exponentially.

It is not hard to find data sets that are at least this complex, if not much worse.  For example, the human genome consists of 3 billion base pairs (of just four bases: A, C, G, T) – the number of possible sequences of length 3 billion that can be formed from just four items is 4 to the power of 3 billion (limited of course by various genetic constraints). Another example will be the astronomical database to be obtained in the 10-year survey of the sky by the Large Synoptic Survey Telescope (lsst.org) – the final source table will consist of approximately 20 trillion rows and over 200 columns of scientific information per source.  Analyses of all possible combinations of these scientific parameters (to discover new correlations, patterns, associations, clusters, etc.) would be prohibitive.

Basic combinatorics tells us that there are (2^N – 1) possible non-empty combinations of N things. For example, a statistical analysis of a data table with just 3 columns (A,B,C) would require 7 distinct analyses (statistical models) of the behavior of the data: A, B, C, A with B, B with C, A with C, and all three taken at once.  A data table with 5 columns would require 31 distinct analyses; and a table with 25 columns would require over 33 million distinct analyses. My calculator tells me that the number of distinct combinations of 200 variables is greater than 10^60.  This extraordinarily rapid growth rate is called the “combinatorial explosion”.  Since no software package could ever perform that many variations of high-dimensional data analysis, it is common to focus on joint combinations of fewer parameters.  Even pairs, triples, and similar small-number combinations can have significant correlation and covariance, consequently yielding important discoveries.
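The counts quoted above can be reproduced directly. The sketch below, with function names invented for this illustration, enumerates the combinations for a small table, evaluates 2^N – 1 for the larger ones, and shows how much smaller the problem becomes when only singles, pairs, and triples of columns are considered.

```python
import math
from itertools import combinations


def count_nonempty_subsets(n):
    """Number of non-empty combinations of n columns: 2^n - 1."""
    return 2 ** n - 1


def enumerate_subsets(columns):
    """List every non-empty combination of the given column names."""
    return [
        combo
        for size in range(1, len(columns) + 1)
        for combo in combinations(columns, size)
    ]


def count_small_subsets(n, max_size):
    """Number of combinations that involve at most max_size of n columns."""
    return sum(math.comb(n, k) for k in range(1, max_size + 1))


print(enumerate_subsets(["A", "B", "C"]))  # the 7 analyses for a 3-column table
for n in (3, 5, 25, 200):
    print(f"{n:>3} columns: {count_nonempty_subsets(n):.3e} combinations")
print(f"200 columns, up to triples only: {count_small_subsets(200, 3)}")
```

Restricting the search to small combinations replaces an astronomically large number of candidate models with a count that a parallel modeling tool can realistically work through.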

Therefore, in order to meet the challenge of big data complexity (high variety), fast modeling technology is needed.  Such tools provide big benefits to both statisticians and non-statisticians.  These benefits multiply favorably when the technology can automatically build and test a large number of models (different combinations of parameters) in parallel.  Furthermore, the power of the technology is even more enhanced when it ranks the output models and parameter selection in order of significance and correlation strength.  Soft10’s “automatic statistician” Dr. Mo does these things and more. Dr. Mo models complex high-dimensional data both row-wise and column-wise. Dr. Mo produces high-accuracy predictions.  Dr. Mo’s proprietary multi-model technology is a powerful tool for predictive modeling and analytics across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more. Check out Dr. Mo at http://soft10ware.com/ and read more here: http://soft10ware.com/big-data-complexity-requires-fast-modeling-technology/

Kirk Borne is a member of the Soft10, Inc. Board of Advisors.

IBM Insight 2014 – Day 2: The “One Thing” – Watson Analytics

The highlight of Day 2 at IBM Insight 2014 was the presentation of numerous examples, new features, powerful capabilities, and strategic vision for Watson Analytics.  This was the “one thing” – (to borrow the phrase from the movie “City Slickers”) – the one thing that seems to matter the most, that will make the biggest impact, and that has captured the essence of big data and analytics technologies for the future, rapidly approaching world of data everywhere, sensors everywhere, and the Internet of Things.

(continue reading more about Watson Analytics here:  http://ibm.co/10zEl6S)

IBM Insight 2014 – Day 1 Soundbites: Carpe Datum

There are big data meetups, workshops, conferences, and symposia. And then… there is IBM Insight 2014! There’s only one word to describe this happenin’ event: “Wow!”

The content of the event is focused on IBM’s products, services, corporate strengths, and partnerships. But the theme and message are laser-focused on the light-speed transformation of business in 2014 that has been achieved through insights from big data and analytics. From the Day 1 opening laser light show and film clip that featured DataKind founder Jake Porway along with Sensemaking evangelist Jeff Jonas, to their spectacular, well-timed entrance into the packed 12,000-seat Mandalay Bay Arena, continuing into a vast array of workshops and hands-on labs, the first day of Insight 2014 has been like a rapid tour through multiple parallel “Alice in Wonderland” universes.

If you are not able to attend the event, you can watch at InsightGO. You can also watch participant interviews on TheCube from SiliconANGLE and Wikibon: http://siliconangle.tv/ibm-insight-2014/

Ideas and insights have filled the arena and convention center in every conversation. Attached below are some of the soundbites (harvested from presentations and conversations).

What are people talking about at IBM Insight 2014?

  • Analytics take big data from information to insights to innovation.
  • The new data-driven business is built around “Systems of Insight” that inform every decision, interaction, and process.
  • Systems of Insight involve more people, more places, and more data.
  • Big data analytics drive business integration, intelligence, and innovation.
  • Watson Analytics reinvents the analytics experience in the cloud — its brilliant human-computer interface gives a whole new meaning to “human factors engineering”.
  • Cognitive Analytics with Watson generates (in real-time) the questions that you should be asking your data, through natural language dialogue, guided discovery, and fully automated intelligence.
  • IBM has released a suite of new services for big data and analytics, including Watson Curator, DataWorks, DashDB, and Cloudant.
  • The new quest for business is personalized engagement that incorporates immersive user experiences: fusing the physical world with digital interactions of all kinds.
  • In the era of digital marketing and real-time customer analytics, battles are won or lost in minutes (or even seconds).
  • A paradox in digital marketing has emerged:  outward-facing customer-centric analytics (personalization, segment of one) have forced organizations into more inward focus on big data operations.  We believe that this paradox evaporates when we realize that the focus on operations is in response to the urgent need to focus on the customer, at the right time, with the right offer, at the right place, in the right context.  That’s the 360 view, and that’s cognitive analytics at its best!
  • Fast data (big data velocity) is fast becoming the number 1 challenge, source of innovation, and revenue-generator for business.  Big data volume is so “2012”, and big data variety is so “2013” (though I personally think that we have yet to see the real power and revolution in data-driven business discovery through high-variety data, particularly via fast complex streaming data emanating from multiple sensors, sources,  and signals).
  • The real “big data analytics” talent shortage is in finding folks who know both the analytics (data science) and the business.
  • The Chief Data Officer is an agent for business transformation and change in the big data era.
  • IBM Insight might just be the 2014 World Series of Big Data Analytics.
  • Perhaps the real insight at IBM Insight 2014 is that what you really need to do is “to dress for success” with the right T-shirt …:

(Caption: Kirk Borne, Cortnie Abercrombie, and Jake Porway sharing a moment)

  • Carpe datum!

Chief Data Officer as Business Change Agent

Deriving business value from, leveraging, protecting, and promoting an organization’s rapidly growing data assets are now coming under the corporate executive sponsorship of a new member of the executive suite – the CDO (Chief Data Officer).  This role should be considered as distinctly different from other similarly defined roles: (a) the CIO, whose responsibilities now revolve primarily around information technologies and information security; (b) the CDS (Chief Data Scientist), whose role is evolving, but should be primarily that of Chief Scientist, specifically related to Data Science, exploring new business models and discovering insights from the data resources; and (c) the CAO (Chief Analytics Officer), whose role is also evolving and who may be roughly equivalent to the CDS, though the CAO’s focus should be more on mapping the data science capabilities (championed by the CDS) and the data assets (sponsored by the CDO) onto the data-to-decisions, data-to-discovery, and data-to-insights goals of the line of business.

We also see a lot of overlap in this set of roles with those of the CMO (Chief Marketing Officer) and the Chief Innovation Strategy Officer.  We are not suggesting that each and every business will need all of these, but the organization should identify what its corporate strategy and business goals require, and then create the roles that will drive change in those directions.

In this evolving leadership landscape within the growing Big Data era, the CDO is definitely creating a lot of buzz.  Since Big Data and Analytics are now listed as the top drivers of innovation, revenue, and change within organizations, the CDO should be there to drive that change.  Here are two sources of case studies and information regarding the CDO:

(1) See the new IBM Chief Data Officer website at http://ibm.com/services/c-suite/cdo. Related to this effort, see also the Institute for Business Value within IBM’s Center for Applied Insights. For more, listen to Cortnie Abercrombie of IBM as she offers insights and recommendations for the CDO role in her online interviews.

(2) Download the Innovation Enterprise’s white paper “Rise of the Chief Data Officer – An Executive Whose Time Has Come”, by George Hill and Chris Towers.  I was fortunate to write the Foreword for this booklet.  Here is an excerpt from my Foreword:

Many now believe that Big Data has matured, moving beyond the peak of its initial hype and is moving ahead into its promised plateau of productivity. Data has come of age in the corporate boardroom as well.  The enormous potential for new wealth, new products, new customers, new insights, and new entrepreneurial business lines has caused a cataclysmic shift in the power of “information” in the corporate executive suite. The existing CIO’s role seems to have solidified in the past decade to that of “Chief Information Technology Officer,” with an emphasis primarily on technology and infrastructure.  The new CxO in the boardroom is the data person (the “data lover”). This may be the Chief Data Scientist (focused on the analytics objectives, opportunities, and obsessions that arise in this era of Big Data).  But, we also see the CDO (Chief Data Officer) coming into the inner circle of executive power.

The CDO is focused on the data – acquisition, governance, quality, management, integration, policies (including privacy, preservation, deduplication, curation), value creation, recruiting skilled data professionals, establishing a data-driven corporate culture, team-building around data-centric business objectives, and acquisition and oversight of corporate data technologies (not I.T. in the historical sense). The responsibilities are enormous, the requisite skills are CxO-worthy, the challenges are many, and the opportunities to create and define the role are very attractive.

(continue reading here http://ie.theinnovationenterprise.com/event_justify_your_rois/Rise-of-the-Chief-Data-Officer.pdf)

IBM Insight 2014 – The Big Data World Series (or something like that)

I am attending the IBM Insight 2014 conference this year, along with many(!) other big data analytics luminaries, including Jake Porway (@jakeporway, of DataKind), James Kobielus (@jameskobielus, the IBM Big Data Evangelist — I love that title!), Carla Gentry (@data_nerd, of Analytical Solution), Lillian Pierson (@BigDataGal, data journalist and author of the new book “Data Science for Dummies”), and thousands (maybe millions) more!

There are many new developments happening in 2014 across the big data universe that will be discussed at #IBMinsight.  These include cloud-based analytics, cognitive analytics, Big SQL, the Internet of Things, machine-to-machine analytics, real-time actionable insights from data, content-based customer experience management, and predictive everything (e.g., customer intelligence; manufacturing; biomedical conditions; etc.). This is the place to be right now in order to learn about these rapidly evolving advancements.

For the Twitter crowd, you can follow and jump into one of the Insight Tweet Chats from anywhere in the world.  For example, join the discussion on Sunday October 26 at 11:00pm (New York time):  https://www.crowdchat.net/IBMInsight

Even if you cannot be in Las Vegas(**) for this great event (perhaps “The 2014 World Series of Big Data”), you can still watch the presentations online and learn all about the latest Big Data Analytics and Data Science insights via “InsightGO”. The IBM InsightGO site describes it like this:

InsightGO is IBM’s interactive digital platform, streaming live broadcasts straight to your laptop or mobile device. It’s the next best thing to being in Vegas.

InsightGO is specially designed for both offsite and onsite attendees and features:

InsightGO will be hosted by writer, gamer and video star Veronica Belmont.

InsightGO registration is complimentary.

(**) What happens in Vegas stays in my Twitter timeline!
