Category Archives: Products

Numbers are Powerful, Especially in Combination

The phrase “Big Data” refers to a set of serious analytical challenges that arise when data increase in quantity, real-time speed, and complexity. The three V’s of big data (Volume, Velocity, and Variety) are now well known and well worn. Their familiarity and frequent association with “big data hype” may numb us to the important data challenges that they are meant to represent. Each of these three characterizations has its counterparts in tools and technologies. For example, Hadoop (Apache’s open source implementation of the MapReduce programming model) is the technology du jour for management and analysis of high-volume data. The Hadoop Distributed File System (HDFS) is the file system for big data storage and access in Hadoop clusters. Apache Spark is a computing framework (which can run on top of HDFS) for fast processing of high-velocity data.

But what about high-variety data? The storage and management challenges of such data are already addressed (see above), but the real challenge is in performing effective and efficient statistical modeling, data mining, and discovery across high-dimensional (complex) data sets. Software tools like Soft10, Inc.’s “automatic statistician” Dr. Mo are designed to address that specific challenge.

When considering complex (high-variety) data, it is important to note that even relatively small-volume data sets can pose huge challenges to modeling, mining, and analysis algorithms and tools. For example, consider a gigabyte data table with a billion entries. If those entries correspond to 500 million rows and 2 columns, then some relatively simple “textbook” techniques can be applied: e.g., correlation analysis, regression analysis, Naïve Bayes, etc. However, if those entries correspond to one million rows and 1000 columns, then the complexity of the data analysis explodes exponentially.

It is not hard to find data sets that are at least this complex, if not much worse. For example, the human genome consists of 3 billion base pairs (of just four bases: A, C, G, T) – the number of possible sequences of length 3 billion that can be formed from just four items is 4 to the power of 3 billion (limited of course by various genetic constraints). Another example will be the astronomical database to be obtained in the 10-year survey of the sky by the Large Synoptic Survey Telescope – the final source table will consist of approximately 20 trillion rows and over 200 columns of scientific information per source. Analyses of all possible combinations of these scientific parameters (to discover new correlations, patterns, associations, clusters, etc.) would be prohibitive.

Basic combinatorics tells us that there are (2^N – 1) possible non-empty combinations of N things. For example, a statistical analysis of a data table with just 3 columns (A,B,C) would require 7 distinct analyses (statistical models) of the behavior of the data: A, B, C, A with B, B with C, A with C, and all three taken at once. A data table with 5 columns would require 31 distinct analyses, and a table with 25 columns would require over 33 million distinct analyses. My calculator tells me that the number of distinct combinations of 200 variables is greater than 10^60. This extraordinarily rapid growth rate is called the “combinatorial explosion”. Since no software package could ever perform that many variations of high-dimensional data analysis, it is common to focus on joint combinations of fewer parameters. Even pairs, triples, and similar small-number combinations can have significant correlation and covariance, consequently yielding important discoveries.
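The counts quoted above are easy to check. The following is a minimal Python sketch; the function names are illustrative and do not come from any particular tool. It enumerates the 7 combinations for columns A, B, C and contrasts the full (2^N – 1) count with the far smaller number of pairs and triples.

```python
from itertools import combinations
from math import comb


def count_nonempty_subsets(n):
    """Number of non-empty combinations of n columns: 2^n - 1."""
    return 2**n - 1


def list_subsets(columns):
    """Enumerate every non-empty combination of the given columns."""
    return [
        subset
        for size in range(1, len(columns) + 1)
        for subset in combinations(columns, size)
    ]


# The three-column example from the text: 7 distinct analyses.
print(list_subsets(["A", "B", "C"]))

# Full combination count versus pairs and triples only.
for n in (3, 5, 25, 200):
    print(n, count_nonempty_subsets(n), comb(n, 2), comb(n, 3))
```

For 200 columns, restricting attention to pairs and triples reduces the work from more than 10^60 analyses to 19,900 pairs and 1,313,400 triples, which is why small-number combinations are the practical target.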

Therefore, to meet the challenge of big data complexity (high variety), fast modeling technology is needed. Such tools benefit both statisticians and non-statisticians. These benefits multiply when the technology can automatically build and test a large number of models (different combinations of parameters) in parallel. The technology becomes more powerful still when it ranks the output models and parameter selections in order of significance and correlation strength. Soft10’s “automatic statistician” Dr. Mo does these things and more. Dr. Mo models complex high-dimensional data both row-wise and column-wise. Dr. Mo produces high-accuracy predictions. Dr. Mo’s proprietary multi-model technology is a powerful tool for predictive modeling and analytics across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more.
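To make the idea of ranking parameter combinations by correlation strength concrete, here is a minimal Python sketch. It is not Dr. Mo’s method, which is proprietary and not described here. It simply scores every pair of columns in a small made-up table with the Pearson correlation coefficient and sorts the pairs by absolute strength. The table, the column names, and the function names are all assumptions made for illustration, and the loop runs serially, not in parallel.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_pairs(table):
    """Score every column pair and sort by |r|, strongest first.

    `table` maps a column name to its list of values.
    Returns a list of (r, column_a, column_b) tuples.
    """
    scored = [
        (pearson(table[a], table[b]), a, b)
        for a, b in combinations(table, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[0]), reverse=True)


# Hypothetical toy data: B is exactly 2 * A, C is loosely anti-correlated.
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 4, 1, 2],
}
ranking = rank_pairs(toy)
for r, a, b in ranking:
    print(f"{a} with {b}: r = {r:+.2f}")
```

The same pattern extends to triples or larger groups by swapping the pairwise score for a multi-variable model, at the cost of the rapidly growing number of combinations discussed above.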

Kirk Borne is a member of the Soft10, Inc. Board of Advisors.

Follow Kirk Borne on Twitter @KirkDBorne

IBM Insight 2014 – Day 2: The “One Thing” – Watson Analytics

The highlight of Day 2 at IBM Insight 2014 was the presentation of numerous examples, new features, powerful capabilities, and strategic vision for Watson Analytics. This was the “one thing” (to borrow the phrase from the movie “City Slickers”): the one thing that seems to matter the most, that will make the biggest impact, and that has captured the essence of big data and analytics technologies for the rapidly approaching future world of data everywhere, sensors everywhere, and the Internet of Things.


Apervi’s Conflux Gives a Big Boost to a Confluence of Big Data Workflows

Data-driven workflows are the lifeblood of big data professionals everywhere: data scientists, data analysts, and data engineers. We perform all types of data functions in these workflow processes: archive, discover, access, visualize, mine, manipulate, fuse, integrate, transform, feed models, learn models, validate models, deploy models, etc. It is a dizzying day’s work. We start manually in our workflow development, identifying what needs to happen at each stage of the process, what data are needed, when they are needed, where the data need to be staged, what the inputs and outputs are, and more. If we are really good, we can improve our efficiency in performing these workflows manually, but not substantially. A better path to success is to employ a workflow platform that is scalable (to larger data), extensible (to more tasks), more efficient (shorter time-to-solution), more effective (better solutions), adaptable (to different user skill levels and to different business requirements), comprehensive (providing a wide scope of functionality), and automated (to break the time barrier of manual workflow activities).


Visual Cues in Big Data for Analytics and Discovery

One of the most fun outcomes that you can achieve with your data is to discover new and interesting things. Sometimes, the most interesting thing is the detection of a novel, unexpected, surprising object, event, or behavior – i.e., the outlier, the thing that falls outside the bounds of your original expectations, the thing that signals something new about your data domain (a new class of behavior, an anomaly in the data processing pipeline, or an error in the data collection activity). The more quickly you can find the interesting features and characteristics within your data collection, the more likely you are to improve decision-making and responsiveness in your data-driven workflows.

Tapping into the natural human cognitive ability to see patterns quickly and to detect anomalies readily is powerful medicine for big data analytics headaches. That’s where data visualization shines most brightly in the big data firmament!
