I have written articles in many places, and I will collect links to those sources here. The list is not complete and will be constantly evolving; I will add some older blogs to the list below as I remember and find them. Also included are some interviews in which I provided detailed answers to a variety of questions.
50 years after the publication of Alvin Toffler’s landmark book “Future Shock”, a new book “After Shock” is here. This 540-page compendium collects over 100 modern-day perspectives on Future Shock from leading futurists, including deep assessments of Toffler’s formidably prescient predictions from half a century ago, along with a status check on the current exponential growth (“shock”) in all sectors of the world economy, from the unique vantage points of many different contributors. Contributors include David Brin, Po Bronson, Sanjiv Chopra, George Gilder, Newt Gingrich, Alan Kay, Ray Kurzweil, Jane McGonigal, Lord Martin Rees, Byron Reese, and many others. I am honored to be included among those luminary contributors. I present here a short excerpt from my contribution to the book.
“Shocking Amount of Data”
An excerpt from my chapter in the book:
“We are fully engulfed in the era of massive data collection. All those data represent the most critical and valuable strategic assets of modern organizations that are undergoing digital disruption and digital transformation. Advanced analytics tools and techniques drive insights discovery, innovation, new market opportunities, and value creation from the data. However, our enthusiasm for “big data” is tempered by the fact that this data flood also drives us to sensory input shock and awe.
“Among the countless amazing foresights that appeared in Alvin Toffler’s Future Shock was the concept of information overload. His discussion of the topic came long before the creation and proliferation of social networks, the World Wide Web, the Internet, enterprise databases, ubiquitous sensors, and digital data collection by all organizations—big and small, public and private, near and far. The clear and present human consequences and dangers of infoglut were succinctly called out as section headings in Chapter 16, “The Psychological Dimension,” of Toffler’s book, including these: “the overstimulated individual,” “bombardment of the senses,” and “decision stress.”
“While these ominous forecasts have now become a reality for our digitally drenched society, especially for the digital natives who have known no other experience, there is hope for a lifeline that we can grasp while swimming (or drowning) in that sea of data. And that hope emanates from the same foundation that is the basis of the information overload shock itself. That foundation is data, and the hope is AI – artificial intelligence. The promise of AI is entirely dependent on the flood of sensory input data that fuels the advanced algorithms that activate and stimulate the Actionable Insights (representing another definition of A.I.), which is what AI is really aimed at achieving.
“AI is a great producer of dichotomous reactions in society: hype versus hope, concern versus commercialization, fantasy versus forward-thinking, overstimulation versus overabundance of opportunity, bombardment of the senses versus bountiful insights into solving hard problems, and decision stress versus automated decision support. Could we imagine any technology that has more mutually contradictory psychological dimensions than AI? I cannot.
“AI takes its cue from data. It needs data – and not just small quantities, but large quantities of data, containing training examples of all sorts of behaviors, events, processes, things, and outcomes. Just as a human (or any cognitive creature) receives sensory inputs from its world through multiple channels (e.g., our five senses) and then produces an output—makes a decision, takes an action—in response to those inputs, an AI does exactly the same. The AI relies on mathematical algorithms that sort through streams of data to find the most relevant, informative, actionable, and pertinent patterns in the data. From those presently perceived patterns, the AI then produces an output (a decision and/or an action). For example, when a human detects and then recognizes a previously seen pattern, the person knows how to respond—what to decide and how to act. This is what an AI is being trained to do automatically and with minimal human intervention, through the data inputs that cue the algorithms.
“How AI helps with infoglut and thereby addresses the themes of Toffler’s writings (“information overload”, “overstimulation and bombardment of the senses”, and “decision stress”) is through…
You can continue reading my chapter, plus dozens more perspectives, in the full After Shock book here: https://amzn.to/2S01MC7
Definitions of terminology frequently seen and used in discussions of emerging digital technologies.
Additive Manufacturing: see 3D-Printing
AGI (Artificial General Intelligence): The intelligence of a machine that has the capacity to understand or learn any intellectual task that a human being can. It is a primary goal of some artificial intelligence research and a common topic in science fiction and future studies.
AI (Artificial Intelligence): Application of Machine Learning algorithms to robotics and machines (including bots), focused on taking actions based on sensory inputs (data). Examples: (1-3) All those applications shown in the definition of Machine Learning. (4) Credit Card Fraud Alerts. (5) Chatbots (Conversational AI). There is nothing “artificial” about the applications of AI, whose tangible benefits include Accelerated Intelligence, Actionable Intelligence (and Actionable Insights), Adaptable Intelligence, Amplified Intelligence, Applied Intelligence, Assisted Intelligence, and Augmented Intelligence.
Algorithm: A set of rules to follow to solve a problem or to decide on a particular action (e.g., the thermostat in your house, or your car engine alert light, or a disease diagnosis, or the compound interest growth formula, or assigning the final course grade for a student).
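The compound-interest example in that definition can be sketched in a few lines of code (the principal, rate, and term below are illustrative numbers, not from the text):

```python
def compound_interest(principal, annual_rate, years, compounds_per_year=1):
    """Return the future value of an investment under compound interest:
    an algorithm in the plainest sense -- a fixed rule applied to inputs."""
    return principal * (1 + annual_rate / compounds_per_year) ** (compounds_per_year * years)

# $1,000 at 5% annual interest, compounded yearly for 10 years
print(round(compound_interest(1000, 0.05, 10), 2))  # 1628.89
```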
Analytics: The products of Machine Learning and Data Science (such as predictive analytics, health analytics, cyber analytics).
AR (Augmented Reality): A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view. Examples: (1) Retail. (2) Disaster Response. (3) Machine maintenance. (4) Medical procedures. (5) Video games in your real world. (6) Clothes shopping & fitting (seeing the clothes on you without a dressing room). (7) Security (airports, shopping malls, entertainment & sport events).
Autonomous Vehicles: Self-driving vehicles (guided without a human), informed by data streaming from many sensors (cameras, radar, LIDAR), making decisions and taking actions based on computer vision algorithms (ML and AI models for recognizing people, things, traffic signs, …). Examples: Cars, Trucks, Taxis.
BI (Business Intelligence): Technologies, applications and practices for the collection, integration, analysis, and presentation of business information. The purpose of Business Intelligence is to support better business decision-making.
Big Data: An expression that refers to the current era in which nearly everything is now being quantified and tracked (i.e., data-fied). This leads to the collection of data and information on nearly full-population samples of things, instead of “representative subsamples”. There have been many descriptions of the characteristics of “Big Data”, but the three dominant attributes are Volume, Velocity, and Variety — the “3 V’s” concept was first introduced by Doug Laney in 2001. Read more in this article: “Why Today’s Big Data is Not Yesterday’s Big Data”. Some consider the 2011 McKinsey & Company research report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” as the trigger point when the world really started paying attention to the volume and variety of data that organizations everywhere are collecting — the report stated, “The United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
Blockchain: A system in which a permanent and verifiable record of transactions is maintained across several computers that are linked in a peer-to-peer network. It has many applications beyond its original uses for bitcoin and other cryptocurrencies. Blockchain is an example of Distributed Ledger Technology, in which independent computers (referred to as nodes) record, share, and synchronize transactions in their respective electronic ledgers (instead of keeping data centralized, as in a traditional ledger). Blockchain’s name refers to a chain (growing list) of records, called blocks, which are linked using cryptography and are used to record transactions between two parties efficiently and in a verifiable and permanent way. In simplest terms, a blockchain is a distributed database existing on multiple computers at the same time, growing as new sets of recordings, or ‘blocks’, are added to it. The database is not managed by any particular body; instead, everyone in the network gets a copy of the whole database. Old blocks are preserved forever and new blocks are added to the ledger irreversibly, making it practically impossible to manipulate the ledger by faking documents, transactions, or other information. All blocks are cryptographically protected, so everyone can access the information, but only a user who owns the appropriate cryptographic key can add a new record to a particular chain.
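The “chain of cryptographically linked blocks” idea can be sketched in a toy model (not a production ledger; the transactions and validation scheme below are made up for illustration):

```python
import hashlib
import json
import time

def block_hash(timestamp, data, previous_hash):
    """Hash a block's contents together with its predecessor's hash."""
    payload = json.dumps(
        {"timestamp": timestamp, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(payload).hexdigest()

def make_block(data, previous_hash):
    timestamp = time.time()
    return {
        "timestamp": timestamp,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(timestamp, data, previous_hash),
    }

def is_valid(chain):
    """A chain is valid when every block's stored hash matches its contents
    and every block points at the hash of its predecessor."""
    return all(
        b["hash"] == block_hash(b["timestamp"], b["data"], b["previous_hash"])
        and (i == 0 or b["previous_hash"] == chain[i - 1]["hash"])
        for i, b in enumerate(chain)
    )

# Build a tiny chain: a genesis block plus two transaction blocks.
chain = [make_block("genesis", "0" * 64)]
for tx in ["Alice pays Bob 5", "Bob pays Carol 2"]:
    chain.append(make_block(tx, chain[-1]["hash"]))

print(is_valid(chain))                   # True
chain[1]["data"] = "Alice pays Bob 500"  # tamper with history
print(is_valid(chain))                   # False: the stored hash no longer matches
```

Because each block’s hash covers the previous block’s hash, faking one record invalidates every block after it, which is exactly why the ledger is so hard to manipulate.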
Chatbots (see also Virtual Assistants): These typically are text-based user interfaces (often customer-facing for organizations) that are designed and programmed to reply only to a certain set of questions or statements. If a customer asks a question outside that learned set of responses, the chatbot will fail. Chatbots cannot hold long, continuing human interactions. Traditionally they are text-based, but audio and pictures can also be used for interaction. They provide an FAQ (Frequently Asked Questions) type of interaction, and they cannot process language inputs in general.
Cloud: The cloud is a metaphor for a global network of remote servers that operates transparently to the user as a single computing ecosystem, commonly associated with Internet-based computing.
Cloud Computing: The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server, local mainframe, or a personal computer.
Computer Vision: An interdisciplinary scientific field that focuses on how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do, including pattern detection, pattern recognition, pattern interpretation, and pattern classification.
Data Mining: Application of Machine Learning algorithms to large data collections, focused on pattern discovery and knowledge discovery in data. Pattern discovery includes clusters (class discovery), correlation (and trend) discovery, link (association) discovery, and anomaly detection (outlier detection, surprise discovery).
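Correlation (and trend) discovery, one of the pattern-discovery tasks listed above, can be illustrated with a Pearson correlation computed from scratch (the temperature and sales figures are invented for illustration):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient: +1 = perfect positive trend,
    0 = no linear relationship, -1 = perfect negative trend."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temps = [20, 22, 25, 27, 30]          # daily high temperature (made-up data)
sales = [110, 120, 135, 148, 160]     # ice-cream sales (made-up data)
print(round(pearson(temps, sales), 3))  # 0.998: a strong positive correlation
```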
Data Science: Application of scientific method to discovery from data (including Statistics, Machine Learning, data visualization, exploratory data analysis, experimentation, and more).
Digital Transformation: Refers to the novel use of digital technology to solve traditional problems. Beyond the efficiency gains of automation, these digital solutions enable new types of innovation and creativity, rather than simply enhancing and supporting traditional methods.
Digital Twins: A phrase used to describe a computerized (or digital) version of a real physical asset and/or process. The physical asset carries one or more sensors that collect data representing real-time information about it. By bridging the physical and the virtual worlds, data is transmitted seamlessly, allowing the virtual entity to exist simultaneously with the physical entity. Digital Twins are virtual replicas of physical devices that data scientists and IT professionals can use to run simulations before actual devices are built and deployed, and also while those devices are in operation. They are used in manufacturing, in large-scale systems (such as maritime vessels, wind farms, and space probes), and in other complex systems, and they represent a strong merging and optimization of numerous digital technologies such as IoT (IIoT), AI, Machine Learning, and Big Data Analytics.
Drone (UAV, UAS): An unmanned aerial vehicle (UAV) or uncrewed aerial vehicle (commonly known as a Drone) is an aircraft without a human pilot on board. UAVs are a component of an unmanned aircraft system (UAS), which includes a UAV, a ground-based controller, and a system of communications between the two.
Dynamic Data-driven Application (Autonomous) Systems (DDDAS): A paradigm in which the computation and instrumentation aspects of an application system are dynamically integrated in a feed-back control loop, such that instrumentation data can be dynamically incorporated into the executing model of the application, and in reverse the executing model can control the instrumentation. Such approaches can enable more accurate and faster modeling and analysis of the characteristics and behaviors of a system and can exploit data in intelligent ways to convert them to new capabilities, including decision support systems with the accuracy of full scale modeling, efficient data collection, management, and data mining. See http://dddas.org/.
Edge Computing (and Edge Analytics): A distributed computing paradigm which brings computation to the data, closer to the location where it is needed, to improve response times in autonomous systems and to save bandwidth. Edge Analytics specifically refers to an approach to data collection and analysis in which an automated analytical computation is performed on data at a sensor, network switch or other device instead of waiting for the data to be sent back to a centralized data store. This is important in applications where the result of the analytic computation is needed as fast as possible (at the point of data collection), such as in autonomous vehicles or in digital manufacturing.
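The core idea of edge analytics, computing at the sensor instead of shipping every raw sample to a central store, can be sketched as follows (the batch size and temperature readings are illustrative assumptions):

```python
import statistics

def edge_summary(readings, batch_size=100):
    """Aggregate raw sensor readings on-device, so only compact summary
    records travel upstream instead of every raw sample."""
    summaries = []
    for start in range(0, len(readings), batch_size):
        batch = readings[start:start + batch_size]
        summaries.append({
            "n": len(batch),
            "mean": statistics.mean(batch),
            "max": max(batch),
        })
    return summaries

# 1,000 raw temperature samples reduce to 10 small summary records.
raw = [20.0 + (i % 7) * 0.1 for i in range(1000)]
print(len(edge_summary(raw)))  # 10
```

The bandwidth saving is the point: the device transmits 10 records instead of 1,000, and the summaries are available immediately at the point of collection.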
Industry 4.0: A reference to a new phase in the Industrial Revolution that focuses heavily on interconnectivity, automation, Machine Learning, and real-time data. Industry 4.0 is also sometimes referred to as IIoT (Industrial Internet of Things) or Smart Manufacturing, because it joins physical production and operations with smart digital technology, Machine Learning, and Big Data to create a more holistic and better connected ecosystem for companies that focus on manufacturing and supply chain management.
IoT (Internet of Things) and IIoT (Industrial IoT): Sensors embedded on devices and within things everywhere, measuring properties of things, and sharing that data over the Internet (over fast 5G), to fuel ML models and AI applications (including AR and VR) and to inform actions (robotics, autonomous vehicles, etc.). Examples: (1) Wearable health devices (Fitbit). (2) Connected cars. (3) Connected products. (4) Precision farming. (5) Industry 4.0
Knowledge Graphs (see also Linked Data): Knowledge graphs encode knowledge arranged in a network of nodes (entities) and links (edges) rather than tables of rows and columns. The graph can be used to link units of data (the nodes, including concepts and content), with a link (the edge) that explicitly specifies what type of relationship connects the nodes.
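The node-and-edge structure described above can be illustrated with plain (subject, predicate, object) triples (the example facts simply echo topics from this page):

```python
# A tiny knowledge graph as a set of (subject, predicate, object) triples:
# nodes are the subjects and objects; the predicate labels the edge between them.
triples = {
    ("Alvin Toffler", "wrote", "Future Shock"),
    ("Future Shock", "discusses", "information overload"),
    ("After Shock", "responds_to", "Future Shock"),
}

def neighbors(node):
    """All (edge label, target node) pairs linked from a given node."""
    return sorted((p, o) for s, p, o in triples if s == node)

print(neighbors("Future Shock"))  # [('discusses', 'information overload')]
```

Unlike a row-and-column table, every link here explicitly names the type of relationship it represents, which is what makes semantic queries over the graph possible.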
Linked Data (see also Knowledge Graphs): A data structure in which data items are interlinked with other data items that enables the entire data set to be more useful through semantic queries. The easiest and most powerful standard designed for Linked Data is RDF (Resource Description Framework).
Machine Learning (ML): Mathematical algorithms that learn from experience (i.e., pattern detection and pattern recognition in data). Examples: (1) Digit detection algorithm (used in automated Zip Code readers at the Post Office). (2) Email Spam detection algorithm (used for Spam filtering). (3) Cancer detection algorithm (used in medical imaging diagnosis).
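A minimal sketch of “pattern recognition in data”: a one-nearest-neighbor classifier for toy spam filtering (the feature vectors and labels below are made up, and real spam filters use far richer features):

```python
import math

# Tiny training set of (feature vector, label) pairs. The invented features
# count occurrences of the words "free" and "meeting" in an email.
training = [
    ([3, 0], "spam"), ([4, 1], "spam"),
    ([0, 2], "ham"),  ([1, 3], "ham"),
]

def classify(features):
    """1-nearest-neighbor: label a new email by its closest training example."""
    return min(training, key=lambda ex: math.dist(ex[0], features))[1]

print(classify([2, 0]))  # spam: closest to the spam examples
print(classify([0, 3]))  # ham
```

The algorithm never sees a rule like “many ‘free’s means spam”; it learns the pattern entirely from the labeled examples, which is the essence of ML.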
MR (Mixed Reality): Sometimes referred to as hybrid reality, MR is the merging of real and virtual worlds to produce new environments and visualizations where physical and digital objects co-exist and interact in real time. It means placing new imagery within a real space in such a way that the new imagery is able to interact, to an extent, with what is real in the physical world we know. The key characteristic of MR is that the synthetic content and the real-world content are able to react to each other in real time.
NLP (Natural Language Processing), NLG (NL Generation), NLU (NL Understanding): NLP a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLG is a software process that transforms structured data into human-language content. It can be used to produce long form content for organizations to automate custom reports, as well as produce custom content for a web or mobile application, or produce the words that will be spoken by a Virtual (Voice-based) Assistant. NLU is a subtopic of Natural Language Processing in Artificial Intelligence that deals with algorithms that have language comprehension (understanding the meaning of the words, both their content and their context).
Quantum Computing: The area of study focused on developing computer technology based on the principles of quantum theory and quantum phenomena (such as superposition of states and entanglement). Qubits are the fundamental units of quantum computing — they are somewhat analogous to bits in a classical computer.
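The superposition of qubit states mentioned above can be given some intuition with a toy classical simulation of a single qubit (this only simulates the probabilities; it is not actual quantum computation):

```python
import math
import random

# A single qubit as two amplitudes (alpha, beta) with |alpha|^2 + |beta|^2 = 1.
alpha, beta = 1 / math.sqrt(2), 1 / math.sqrt(2)   # equal superposition
probabilities = {"0": abs(alpha) ** 2, "1": abs(beta) ** 2}

# "Measuring" collapses the superposition to a definite classical bit.
def measure():
    return "0" if random.random() < probabilities["0"] else "1"

counts = {"0": 0, "1": 0}
for _ in range(10_000):
    counts[measure()] += 1
print(counts)  # roughly 5,000 each: the qubit is "both" until measured
```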
Robotics: A branch of AI concerned with creating devices that can move and react to sensory input (data). Examples: (1) Automated manufacturing assembly line. (2) Roomba (vacuums your house). (3) Warehouse / Logistics. (4) Prosthetics.
Statistics: The practice or science of collecting and analyzing numerical data, especially for the purpose of inferring proportions in a whole population from measurements of those properties within a representative subsample.
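Inferring a population proportion from a representative subsample, as the definition describes, can be sketched as follows (the z-value 1.96 gives an approximate 95% confidence interval; the survey numbers are invented):

```python
import math

def proportion_estimate(successes, sample_size, z=1.96):
    """Point estimate and approximate 95% confidence interval for a
    population proportion, from a representative subsample."""
    p = successes / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, (p - margin, p + margin)

# 130 of 400 sampled voters favor a proposal: estimate the population share.
p, (lo, hi) = proportion_estimate(130, 400)
print(f"{p:.3f} ({lo:.3f}, {hi:.3f})")  # 0.325 (0.279, 0.371)
```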
UAV (Unmanned Aerial Vehicle) and UAS (Unmanned Aircraft System): see Drones.
Virtual Assistants (see also Chatbots): A sophisticated voice-based interface in an interactive platform for user and customer interactions. Virtual assistants understand not only the language but also the meaning of what the user is saying. They can learn from their conversation instances, which can lead to unpredictability in their behavior. Consequently, they can sustain extended, adaptable human interaction. They can also be set to perform more complicated tasks, such as order-taking and task fulfillment.
VR (Virtual Reality): Computer-generated simulation of a three-dimensional environment that can be interacted with in a seemingly real or physical way by a person using special electronic equipment, such as a helmet with a screen inside or gloves fitted with sensors. Examples: (1) Games. (2) Family adventures. (3) Training & Education. (4) Design. (5) Big Data Exploration.
XAI (eXplainable AI, Trusted AI): Artificial intelligence that is programmed to describe (explain) its purpose, rationale and decision-making process in a way that can be understood by the average person. This includes the specific criteria the program uses to arrive at a decision.
XPU: One of the many specialized CPUs for specific applications (similar to an ASIC), which may be real-time, data-intensive, data-specific, or at the edge (see Edge Analytics). For more information, refer to the article “Sensor Analytics at Micro Scale on the xPU“.
3D-Printing … moving on to 4D-printing: Additive Manufacturing — the action or process of making a physical object from a three-dimensional digital model, typically by laying down many thin layers of a material in succession. The terms “additive manufacturing” and “3D printing” both refer to creating an object by sequentially adding build material in successive cross-sections, one stacked upon another.
5G: Fifth-generation wireless, the latest iteration of cellular technology, engineered to greatly increase the speed and responsiveness of wireless networks. 5G will also enable a sharp increase in the amount of data transmitted over wireless systems due to more available bandwidth. Example applications: (1) High-definition and 3D video. (2) Gbit/sec Internet. (3) Broadband network access nearly everywhere. (4) IoT.
We often think of analytics on large scales, particularly in the context of large data sets (“Big Data”). However, there is a growing analytics sector that is focused on the smallest scale. That is the scale of digital sensors — driving us into the new era of sensor analytics.
Small scale (i.e., micro scale) is nothing new in the digital realm. After all, the digital world came into existence as a direct consequence of microelectronics and microcircuits. We used to say in the early years of astronomy big data (which is my background) that the transistor-based logic microcircuitry that comprises our data storage devices (which store massive streams of data) is essentially the same microcircuitry found inside our sensors (which collect those data). The latter includes, in particular, the sensors inside digital cameras, consisting of megapixels and even gigapixels. Consequently, it should be no surprise that the two digital data functions (sensing and storing) are intimately connected, and that we are therefore drowning in oceans of data.
But in our rush to crown data “big”, we sometimes forget the micro-scale component of the story. Not any longer: there is a growing movement in the microchip world in new and interesting directions.
I am not only talking about the evolutions of the CPU (central processing unit) that we have seen for years: the GPU (graphics processing unit) and the FPGA (field-programmable gate array). We are now witnessing the design, development, and deployment of more interesting application-specific integrated circuits (ASICs), one of which is the TPU (tensor processing unit), which is specifically designed for AI (artificial intelligence) applications. The TPU (as its name suggests) can perform tensor calculations on the chip, in the same way that earlier generations of integrated circuits were designed to perform scalar operations (in the CPU) and vector and/or parallel streaming operations (in the GPU).
Speeding up these calculations is precisely the goal of these new chips. One that I heard discussed in the context of cybersecurity applications is the BPU (Behavior Processing Unit), designed to detect BOI (behaviors of interest). Whereas the TPU might be detecting persons of interest (POI) or objects of interest in an image, the BPU looks for patterns in time series (sequence data) that are indicative of interesting (and/or anomalous) behaviors.
The BOI detector (the BPU) would definitely be an amplifier for cybersecurity operations, in which the volumes of data streaming through our networks and routers are so massive that we never actually capture (and store) all of those data. So we need to detect anomalous patterns in real time, before a damaging cyber incident occurs!
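The kind of real-time behavioral anomaly detection described here can be sketched with a simple rolling z-score detector (the window size, threshold, and traffic numbers are illustrative assumptions, not an actual BPU implementation):

```python
import statistics

def detect_anomalies(stream, window=20, threshold=3.0):
    """Flag points that deviate strongly from the recent rolling window --
    a toy stand-in for behavior-of-interest detection on streaming data."""
    anomalies = []
    for i in range(window, len(stream)):
        recent = stream[i - window:i]
        mean, stdev = statistics.mean(recent), statistics.pstdev(recent)
        if stdev > 0 and abs(stream[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Steady network traffic with one sudden spike at index 30.
traffic = [100 + (i % 3) for i in range(50)]
traffic[30] = 500
print(detect_anomalies(traffic))  # [30]
```

Because the detector only ever holds the most recent window in memory, it fits the scenario above: flag the anomaly as the stream goes by, without ever storing the full data flood.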
Through the years, I have decided on the following “4 V’s of Big Data” summary in my own presentations:
Volume = the most annoying V
Velocity = the most challenging V
Variety = the richest V for insights discovery, innovation, and better decisions
Value = the most important V
As Dez Blanchfield once said, “You don’t need a data scientist to tell you big data is valuable. You need one to show its value.”
A series of articles that drive home the all-important value of data is being published on the DataMakesPossible.com site. The site’s domain name says it all: Data Makes Possible!
What does data make possible? I occasionally discuss these things in articles that I write for the MapR blog series, where I have summarized the big data value proposition very simply in my own list of the 3 D2D’s of big data:
Data-to-Discovery (Class Discovery, Correlation Discovery, Novelty Discovery, Association Discovery)
So, what does data make possible? I document our progress toward deriving big value from big data in a series of articles that I am writing (with more to come) for the DataMakesPossible.com site. These articles include:
One of the most significant characteristics of the evolving digital age is the convergence of technologies. That includes information management (databases), data collection (big data), data storage (cloud), data applications (analytics), knowledge discovery (data science), algorithms (machine learning), transparency (open data), computation (distributed computing: e.g., Hadoop), sensors (internet of things: IoT), and API services (microservices, containerization). One more dimension in this pantheon of tech, which is the most important, is the human dimension. We see the human interaction with technology explicitly among the latest developments in digital marketing, behavioral analytics, user experience, customer experience, design thinking, cognitive computing, social analytics, and (last, but not least) citizen science.
Citizen Scientists are trained volunteers who work on authentic science projects with scientific researchers to answer real-world questions and to address real-world challenges. Citizen Science projects are popular in astronomy, medicine, botany, ecology, ocean science, meteorology, zoology, digital humanities, history, and much more. Check out (and participate in) the wonderful universe of projects at the Zooniverse (zooniverse.org) and at scistarter.com.
In the data science community, we have seen activities and projects that are similar to citizen science in that volunteers step forward to use their skills and knowledge to solve real-world problems and to address real-world challenges. Examples of this include Kaggle.com machine learning competitions and the Data Science Bowl (sponsored each year since 2014 by Booz Allen Hamilton, and hosted by Kaggle). These “citizen science” projects are not just for citizens who are untrained in scientific disciplines, but they are dominated by professional and/or deeply skilled data scientists, who volunteer their time and talents to help solve hard data challenges.
The convergence of data technologies is leading to the development of numerous “smart paradigms”, including smart highways, smart farms, smart grid, and smart cities, just to name a few. By combining this technology convergence (data science, IoT, sensors, services, open data) with a difficult societal challenge (air quality in urban areas) in conjunction with community engagement (volunteer citizen scientists, whether professional or non-professional), the U.S. Environmental Protection Agency (EPA) has knitted the complex fabric of smart people, smart technologies, and smart problems into a significant open competition: the EPA Smart City Air Challenge.
The EPA Smart City Air Challenge launched on August 30, 2016. The challenge is open for about 8 weeks. This is an unusually important and rare project that sits at that nexus of IoT, Big Data Analytics, and Citizen Science. It calls upon clever design thinking at the intersection of sensor technologies, open data, data science, environment science, and social good.
Open data is fast becoming a standard for public institutions, encouraging partnerships between governments and their constituents. The EPA Smart City Air Challenge is a great positive step in that direction. By bringing together expertise across a variety of domains, we can hope to address and fix some of the hard social, environmental, energy, transportation, and sustainability challenges facing the current age. The challenge competition will bring forward best practices for managing big data at the community level. The challenge encourages communities to deploy hundreds of air quality sensors and to make their data public. The resulting data sets will help communities understand real-time environmental quality, identify the driving factors in air quality change (including geographic features, urban features, and human factors), assess which changes will lead to better outcomes (social, mobile, transportation, energy use, education, human health, etc.), and motivate those changes at the grassroots local community level.
The EPA Smart City Air Challenge encourages local governments to form partnerships with sensor manufacturers, data management companies, citizen scientists, data scientists, and others. Together they’ll create strategies for collecting and using the data. EPA will award prizes of up to $40,000 to two communities based on their strategies, including their plans to share their data management methods so others can benefit. The prizes are intended to be seed money, so the partnerships are essential.
After receiving awards for their partnerships, strategies and designs, the two winning communities will have a year to start developing and implementing their solutions based on those winning designs. After that year, EPA will then evaluate the accomplishments and collaboration of the two winning communities. Based upon that evaluation, EPA may then award up to an additional $10,000 to each of the two winning communities.
The EPA Smart City Air Challenge is open until October 28, 2016. The competition is for developers and scientists, for data lovers and technology lovers, for startups and for established organizations, for society and for you. Join the competition now! For more information, visit the Smart City Air Challenge website at http://www.challenge.gov/challenge/smart-city-air-challenge/, or write to email@example.com.
Spread the word about EPA’s Smart City Air Challenge — big data, data science, and IoT for societal good in your communities!
Thanks to Ethan McMahon @mcmahoneth for his contributions to this article and to the EPA Smart City Air Challenge.
(The following article was first published in July of 2013 at analyticbridge.com. At least 3 of the links in the original article are now obsolete and/or broken. I re-post the article here with the correct links. A lot of things in the Big Data, Data Science, and IoT universe have changed dramatically since that first publication, but I did not edit the article accordingly, in order to preserve the original flavor and context. The central message is still worth repeating today.)
The on-going Big Data media hype stirs up a lot of passionate voices. There are naysayers (“it is nothing new”), doomsayers (“it will disrupt everything”), and soothsayers (e.g., Predictive Analytics experts). The naysayers are most bothersome, in my humble opinion. (Note: I am not talking about skeptics, whom we definitely and desperately need during any period of maximized hype!)
We frequently encounter statements of the “naysayer” variety that tell us that even the ancient Romans had big data. Okay, I understand that such statements logically follow from one of the standard definitions of big data: data sets that are larger, more complex, and generated more rapidly than your current resources (computational, data management, analytic, and/or human) can handle — whose characteristics correspond to the 3 V’s of Big Data. This definition of Big Data could be used to describe my first discoveries in a dictionary or my first encounters with an encyclopedia. But those “data sets” are hardly “Big Data” — they are universally accessible, easily searchable, and completely “manageable” by their handlers. Therefore, they are SMALL DATA, and thus it is a myth to label them as “Big Data”. By contrast, we cannot ignore the overwhelming fact that in today’s real Big Data tsunami, each one of us generates insurmountable collections of data on our own. In addition, the correlations, associations, and links between each person’s digital footprint and all other persons’ digital footprints correspond to an exponential (actually, combinatorial) explosion in additional data products.
Nevertheless, despite all of these clear signs that today’s big data environment is something radically new, that doesn’t stop the naysayers. With the above standard definition of big data in their quiver, the naysayers are fond of shooting arrows through all of the discussions that would otherwise suggest that big data are changing society, business, science, media, government, retail, medicine, cyber-anything, etc. I believe that this naysayer type of conversation is unproductive, unhelpful, and unscientific. The volume, complexity, and speed of data today are vastly different from anything that we have ever previously experienced, and those facts will be even more emphatic next year, even more so the following year, and so on. In every sector of life, business, and government, the data sets are becoming increasingly off-scale and exponentially unmanageable. The 2011 McKinsey report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” made this abundantly clear. When the Internet of Things and machine-to-machine applications really become established, then the big data V’s of today will seem like child’s play.
In an attempt to illustrate the enormity of scale of today’s (and tomorrow’s) big data, I have discussed the exponential explosion of data in my TEDx talk “Big Data, small world“ (e.g., you can fast-forward to my comments on this topic starting at approximately the 9:00 mark in the video). You can also read more about this topic in the article “Big Data Growth – Compound Interest on Steroids“, where I have elaborated on the compound growth rate of big data — the numbers will blow your mind, and they should blow away the naysayers’ arguments. Read all about it at http://rocketdatascience.org/?p=204.
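The compound-interest analogy is simple arithmetic. Here is a minimal sketch, with a hypothetical starting volume and growth rate (the actual figures in the cited article may differ), showing how quickly compounding runs away:

```python
# Illustrative only: compound growth of data volume, assuming a
# hypothetical starting point of 4 zettabytes and a hypothetical
# 40% annual growth rate.
def projected_volume(initial_zb, annual_rate, years):
    """Project data volume after `years` of compound growth."""
    return initial_zb * (1 + annual_rate) ** years

start = 4.0   # zettabytes (assumed, for illustration)
rate = 0.40   # 40% per year (assumed, for illustration)
for y in (1, 5, 10, 20):
    print(f"after {y:2d} years: {projected_volume(start, rate, y):12.1f} ZB")
```

At an assumed 40% per year, the volume grows nearly 29-fold in a decade — that is the “steroids” part of compound interest.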
The Definitive Guide to anything should be a helpful, informative road map to that topic, including visualizations, lessons learned, best practices, application areas, success stories, suggested reading, and more. I don’t know if all such “definitive guides” can meet all of those qualifications, but here are some that do a good job:
A common phrase in SCM (Supply Chain Management) is Just-In-Time (JIT) inventory. JIT refers to a management strategy in which raw materials, products, or services are delivered to the right place, at the right time, as demand requires. This has always been an excellent business goal, but the power to excel at JIT inventory management is now improving dramatically with the increased use of data analytics across the supply chain.
In the article “Operational Analytics and Droning About Big Data“, we discussed two examples of JIT: (1) a just-in-time supply replenishment system for human bases on the Moon, and (2) the proposal by Amazon to use drones to deliver products to your front door “just in time”! The Internet of Things will almost certainly generate similar use cases and benefits.
Descriptive analytics (hindsight) tells you what has already happened in your supply chain. If there was a deficiency or problem somewhere, then you can react to that event. But that is “old school” supply chain management. Modern analytics is predictive (foresight), allowing you to predict where the need will occur (in advance) so that you can proactively deliver products and services at the point of need, just in time.
The next advance in analytics is prescriptive (insight), which uses optimization techniques (from operations research) in combination with insights and knowledge of your business (systems, processes, and resources) in order to optimize your delivery systems, for the best possible outcome (greater sales, fewer losses, reduced inventory, etc.). Just-in-time supply chain management then becomes something more than a reality — it now becomes an enabler of increased efficiency and productivity.
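As one concrete (and deliberately simplified) illustration of prescriptive analytics in inventory management, consider the classic newsvendor model from operations research: it prescribes the order quantity that optimally balances the cost of understocking against the cost of overstocking. All of the numbers below are hypothetical.

```python
# A minimal sketch of prescriptive inventory analytics: the classic
# newsvendor model. The optimal order quantity sits at the
# "critical ratio" quantile of the demand distribution, here
# approximated by the empirical distribution of past demand.
def newsvendor_quantity(demand_history, understock_cost, overstock_cost):
    """Prescribe an order quantity from historical demand and unit costs."""
    critical_ratio = understock_cost / (understock_cost + overstock_cost)
    ordered = sorted(demand_history)
    index = min(int(critical_ratio * len(ordered)), len(ordered) - 1)
    return ordered[index]

# Hypothetical daily demand observations:
history = [80, 95, 100, 110, 120, 130, 90, 105, 115, 125]
# Missing a sale costs $5; holding an unsold unit costs $1:
q = newsvendor_quantity(history, understock_cost=5.0, overstock_cost=1.0)
print("prescribed order quantity:", q)
```

Because understocking is five times as costly as overstocking here, the model prescribes ordering near the top of the historical demand range — a simple example of optimization turning a forecast into an action.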
Many more examples of use cases in the manufacturing and retail industries (and elsewhere) where just-in-time analytics is important (and what you can do about it) have been enumerated by the folks at Soft10, Inc., makers of fast automatic modeling software. Check out their fast predictive analytics products at http://soft10ware.com/.
“Variety is the spice of life,” they say. And variety is the spice of data also, adding rich texture and flavor to otherwise dull numbers. Variety ranks among the most exciting, interesting, and challenging aspects of big data. Variety is one of the original “3 V’s of Big Data” and is frequently mentioned in Big Data discussions, though those discussions tend to focus too much attention on Volume.
Conversations with “old school” technologists these days too often include the declaration: “We’ve always done big data.” That statement really irks me… for lots of reasons. I summarize some of those reasons in the following article: “Today’s Big Data is Not Yesterday’s Big Data.” In a nutshell, such statements focus almost entirely on Volume, which misses the whole point of big data (in my humble opinion)… here comes the Internet of Things… hold onto your bits!
The greatest challenges and the most interesting aspects of big data appear in high-Velocity Big Data (requiring fast real-time analytics) and high-Variety Big Data (enabling the discovery of interesting patterns, trends, correlations, and features in high-dimensional spaces). Maybe because of my training as an astrophysicist, or maybe because scientific curiosity is a natural human characteristic, I love exploring features in multi-dimensional parameter spaces for interesting discoveries, and so should you!
Dimension reduction is a critical component of any solution dealing with high-variety (high-dimensional) data. Being able to sift through a mountain of data efficiently in order to find the key predictive, descriptive, and indicative features of the collection is a fundamental required data science capability for coping with Big Data.
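To make the idea of “sifting” a high-dimensional collection tangible, here is a deliberately simple, pure-Python sketch of one basic dimension-reduction tactic: rank features by variance and keep only the top k. Real pipelines would reach for PCA or model-based feature selection; this toy version (with a hypothetical helper name and made-up data) just illustrates the principle of reducing many columns to the most informative few.

```python
# A minimal sketch of variance-based feature selection, one of the
# simplest dimension-reduction tactics: low-variance features carry
# little information, so keep only the k highest-variance ones.
def top_k_features_by_variance(rows, k):
    """rows: list of equal-length numeric feature vectors.
    Returns the indices of the k highest-variance features."""
    n = len(rows)
    dims = len(rows[0])
    variances = []
    for j in range(dims):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return sorted(range(dims), key=lambda j: variances[j], reverse=True)[:k]

# Hypothetical data: feature 1 varies a lot, feature 2 not at all.
data = [
    [1.0, 100.0, 0.5],
    [1.1,  50.0, 0.5],
    [0.9, 150.0, 0.5],
]
print(top_k_features_by_variance(data, 2))
```

In practice you would also normalize the features first (raw variance is scale-dependent), but the sifting idea is the same.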
Identifying the most interesting dimensions of the data is especially valuable when visualizing high-dimensional data. There is a “good news, bad news” perspective here. First, the bad news: the human capacity for seeing multiple dimensions is very limited: 3 or 4 dimensions are manageable; 5 or 6 dimensions are possible; but more dimensions are difficult-to-impossible to assimilate. Now for the good news: the human cognitive ability to detect patterns, anomalies, changes, or other “features” in a large complex “scene” surpasses most computer algorithms in speed and effectiveness. In this case, a “scene” refers to any small-n projection of a larger-N parameter space of variables.
In data visualization, a systematic ordered parameter sweep through an ensemble of small-n projections (scenes) is often referred to as a “grand tour”, which allows a human viewer of the visualization sequence to quickly spot any patterns, trends, or anomalies in the large-N parameter space. Even such “grand tours” can miss salient (explanatory) features of the data, especially when the ratio N/n is large. Consequently, a data analytics approach that combines the best of both worlds (machine vision algorithms and human perception) will enable efficient and effective exploration of large high-dimensional data.
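Here is a toy sketch of the projection step behind the grand tour idea. A true grand tour interpolates smoothly between projection planes; this simplified version (all names and data are hypothetical) just samples random 2-D projections of an N-dimensional point cloud, producing a sequence of small-n “scenes” for a viewer to scan.

```python
# A toy version of the "grand tour": project an N-dimensional point
# cloud onto a sequence of random 2-D planes (scenes). A real grand
# tour moves smoothly between planes; here we simply sample them.
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction in dim-dimensional space."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project_2d(points, rng):
    """Project N-dim points onto a plane spanned by two random axes."""
    dim = len(points[0])
    u = random_unit_vector(dim, rng)
    w = random_unit_vector(dim, rng)
    return [(sum(p[i] * u[i] for i in range(dim)),
             sum(p[i] * w[i] for i in range(dim))) for p in points]

rng = random.Random(42)
cloud = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(100)]  # N = 6
scenes = [project_2d(cloud, rng) for _ in range(5)]  # five 2-D scenes
```

Each scene would then be plotted for human inspection, or (as discussed next) scored algorithmically so only the most interesting scenes are shown.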
One such approach is to use statistical and machine learning techniques to develop “interestingness metrics” for high-variety data sets. As such algorithms are applied to the data (in parameter sweeps or grand tours), they can discover and then present to the data end-user the most interesting and informative features (or combinations of features) in high-dimensional data: “Numbers are powerful, especially in interesting combinations.”
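As one very simple illustration of an “interestingness metric” (offered as a generic example, not the method of any particular product), we can score every pair of features by the absolute value of its Pearson correlation and surface the strongest pairs for the analyst to inspect first. The function names and data below are hypothetical.

```python
# A minimal "interestingness metric" sketch: rank feature pairs by
# the magnitude of their Pearson correlation, so the most strongly
# related combinations are presented to the analyst first.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def most_interesting_pairs(columns):
    """columns: dict of name -> values. Returns pairs sorted by |r|."""
    names = list(columns)
    scored = [((a, b), abs(pearson(columns[a], columns[b])))
              for i, a in enumerate(names) for b in names[i + 1:]]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Hypothetical columns: "a" and "b" are perfectly correlated.
columns = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [5, 1, 4]}
ranking = most_interesting_pairs(columns)
```

Richer interestingness metrics go well beyond linear correlation (mutual information, outlier scores, cluster separation), but the ranking pattern is the same.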
The outcomes of such exploratory data analyses are further enhanced when the analytics tool ranks the output models (e.g., the data’s “most interesting parameters”) in order of significance and explanatory power (i.e., their ability to “explain” the complex high-dimensional patterns in the data). Soft10’s “automatic statistician” Dr. Mo is a fast predictive analytics software package for exploring complex high-dimensional (high-variety) data. Dr. Mo’s proprietary modeling and analytics techniques have been applied across many application domains, including medicine and health, finance, customer analytics, target marketing, nonprofits, membership services, and more. Check out Dr. Mo at http://soft10ware.com/ and read more here: http://soft10ware.com/big-data-complexity-requires-fast-modeling-technology/