Category Archives: Machine Learning

How We Teach The Leaders of Tomorrow To Be Curious, Ask Questions and Not Be Afraid To Fail Fast To Learn Fast

I recently enjoyed recording a podcast with Joe DosSantos (Chief Data Officer at Qlik). This was one in a series of #DataBrilliant podcasts by Qlik, which you can also access here (Apple Podcasts) and here (Spotify). I summarize below some of the topics that Joe and I discussed in the podcast. Be sure to listen to the full recording of our lively conversation, which covered Data Literacy, Data Strategy, Data Leadership, and more.

The Age of Hype Cycles

The data age has been marked by numerous “hype cycles.” First, we heard how Big Data, Data Science, Machine Learning (ML) and Advanced Analytics would have the honor to be the technologies that would cure cancer, end world hunger and solve the world’s biggest challenges. Then came third-generation Artificial Intelligence (AI), Blockchain and soon Quantum Computing, with each one seeking that honor.

From all this hope and hype, one constant has always been there: a focus on value creation from data. As a scientist, I have always recommended a scientific approach: State your problem first, be curious (ask questions), collect facts to address those questions (acquire data), investigate, analyze, ask more questions, include a sensible serving of skepticism, and (above all else) aim to fail fast in order to learn fast. As I discussed with Joe DosSantos when I spoke with him for the latest episode of Data Brilliant, you don’t need to be a data scientist to follow these principles. These apply to everyone, in all organizations and walks of life, in every sector.

One characteristic of science that is especially true in data science and implicit in ML is the concept of continuous learning and refining our understanding. We build models to test our understanding, but these models are not “one and done.” They are part of a cycle of learning. In ML, the learning cycle is sometimes called backpropagation, where the errors (inaccurate predictions) of our models are fed back into adjusting the model’s input parameters in a way that aims to improve the output accuracy. A more colloquial expression for this is: good judgment comes from experience, and experience comes from bad judgment.

Data Literacy For All

I know that for some, the term data and some of the other terminology I’ve mentioned already can be scary. But they shouldn’t be. We are all surrounded by – and creating – masses of data every single day. As Joe and I talked about, one of the first hurdles in data literacy is getting people to recognize that everything is data. What you see with your eyes? That’s data. What you hear with your ears? Data. The words that come out of your mouth that other people hear? That’s all data. Images, text, documents, audio, video and all the apps on your phone, all the things you search for on the internet? Yet again, that’s data.

Every single day, everyone around the world is using data and the principles I mention above, many without realizing it. So, now we need to bring this value to our businesses.

How To Build A Successful Enterprise Data Strategy

In my chat with Joe, we talked about many data concepts in the context of enterprise digital transformation. As always, but especially during the race toward digital transformation that has been accelerated by the 2020 pandemic, a successful enterprise data strategy that leads to business value creation can benefit from first addressing these six key questions:

(1) What mission objective and outcomes are you aiming to achieve?

(2) What is the business problem, expressed in data terminology? Specifically, is it a detection problem (fraud or emergent behavior), a discovery problem (new customers or new opportunities), a prediction problem (what will happen) or an optimization problem (how to improve outcomes)?

(3) Do you have the talent (key people representing diverse perspectives), tools (data technologies) and techniques (AI and ML knowledge) to make it happen?

(4) What data do you have to fuel the algorithms, the training and the modeling processes?

(5) Is your organizational culture ready for this (for data-informed decisions; an experimentation mindset; continuous learning; fail fast to learn fast; with principled AI and data governance)?

(6) What operational technology environment do you have to deploy the implementation (cloud or on-premise platform)?

Data Leadership

As Joe and I discussed, your ultimate business goal is to build a data-fueled enterprise that delivers business value from data. Therefore, ask questions, be purposeful (goal-oriented and mission-focused), be reasonable in your expectations and remain reasonably skeptical – because as famous statistician, George Box, once said “all models are wrong, but some are useful.”

Now, listen to the full podcast here.

Key Strategies and Senior Executives’ Perspectives on AI Adoption in 2020

Artificial intelligence (AI) has become one of the most significant emerging technologies of the past few years. Some market estimates anticipate that AI will contribute 16 trillion dollars to the global GDP (gross domestic product) by 2030. While there has been accelerating interest in implementing AI as a technology, there has been concurrent growth in interest in implementing successful AI strategies. Some key elements of such strategies that have emerged include explainable AI, trusted AI, AI ethics, operationalizing AI, scaling sustainable AI operations, workforce development (training), and how to speed up all of this development.

The 2020 year of the pandemic has forced organizations to speed up their digital transformation and advanced technology adoption plans, essentially compressing several years of anticipated developments into several months. These accelerated developments cover a wide scope, including: technology-enabled remote work solutions, technology-enhanced health and safety programs, AI-powered implementations of “all of the above”, and sharpened focus and attention on their workforce: future of work in the age of AI, AI-assisted human work and process enhancements, and training initiatives (including data literacy and AI).

In the recent 2020 RELX Emerging Tech Study, results were presented from a survey of over 1000 U.S. senior executives across eight industries: agriculture, banking, exhibitions, government, healthcare, insurance, legal, and science/medical. The survey, carried out by RELX (a global provider of information-based analytics and decision tools for professional and business customers), focused on the state of interest, investment, and implementations in AI tech during the pandemic period. But it was not just a snapshot on the state of AI in 2020. The survey also had forward-looking questions, as well as historical comparisons and trends from results of similar “state of AI in the enterprise” surveys in 2018 and 2019.

Some of the most remarkable findings in the 3-year trending data include these changes from 2018 to 2020:

1) The percentage of senior executives who stated that AI technologies are being utilized in their business dramatically increased from 48% to 81%.

2) The percentage who are concerned about other countries being more advanced than the U.S. in AI technology and implementation increased from 70% to 82%.

3) The percentage who believe that government programs should assist in AI workforce development increased from 45% to 59%.

4) There was a slight increase (though still a minority) in those who believe that the government should leave the promotion of AI technologies to the private sector, growing from 30% to 36%.

5) There is a continued (and growing) strong belief that U.S. companies should invest in the future AI workforce through educational initiatives such as university partnerships, increasing from 92% to 95%.

6) The percentage who are offering training opportunities in AI technologies to their workforce increased significantly from 46% to 75%.

Some of the interesting findings specifically in the 2020 survey results include:

1) Across all sectors, a strong majority (86%) of survey respondents believe that ethical considerations are a strategic priority in the design and implementation of their AI systems, ranging from 77% in government and 80% in the legal sectors, up to 93% in banking and 94% in the insurance sectors, with healthcare and science/medical near the middle of that range.

2) The majority (68%) of respondents stated that they increased their investment in AI technologies, 48% of which invested in new AI technologies. The sectors with the greatest increases in investment were insurance, banking, and agriculture, followed closely by healthcare and science/medical.

3) 82% stated that AI technologies are most likely to be used to increase efficiencies and worker productivities.

4) Only 26% of respondents reported that AI technologies are being used to replace human labor.

Regarding the last point, this minority view regarding AI’s potential negative impact on employment is consistent with a 2019 worldwide survey of 19,000 employers, which found that 87% plan to increase or maintain the size of their staff as a result of automation and AI, and that just 9% of companies across the globe and 4% in the U.S. anticipate cutting jobs. Another report stated, “Rather, than being replaced, humans will be redeployed into higher-order jobs requiring more cognitive skills.”

The broader takeaways and insights drawn from the 2020 RELX Emerging Tech survey are these:

(a) COVID-19 drove increased AI tech investment and adoption.

(b) The use of AI has increased across all sectors that were polled.

(c) Ethical AI is viewed as both a priority and a competitive advantage.

(d) International competition remains a concern for U.S. organizations.

(e) AI workforce training and development is a major component of AI strategy, though AI implementations consistently outpace training initiatives.

Vijay Raghavan, Executive Vice President and Chief Technology Officer for the Risk & Business Analytics division of RELX, has summarized the survey results very well in the following statements:

  • “Businesses’ response to COVID-19 has confirmed the view of US business leaders that artificial intelligence has the power to create smarter, more agile and profitable businesses.”
  • “Businesses face more complex challenges every day and AI technologies have become a mission-critical resource in adapting to, if not overcoming, these types of unforeseen obstacles and staying resilient.”
  • “Companies that do not dedicate the necessary resources to training existing employees on new AI technologies risk leaving growth opportunities on the table and using biased or otherwise flawed systems to make and enforce major decisions.”

Therefore, when we consider AI strategy, a global perspective is no less important than local organization-specific mission objectives. Understanding your competition, the marketplace, and both the current expectations and future needs of your stakeholders (customers, employees, citizens, and shareholders) is vital. Furthermore, non-technology considerations should be incorporated alongside technology implementation and operationalization requirements. Performance metrics and goals associated with AI governance, ethics, talent, and training must be on the same balance sheet as AI tools, techniques, and technologies. How to bring all of these pieces together in a successful AI strategy has become clearer with the results and insights revealed in the 2020 RELX Emerging Tech survey.

You can find and read the full RELX survey report here: http://bit.ly/relx-emerging-tech-kb

Key Finding in Emerging Tech Survey
Key Findings in RELX 2020 Emerging Tech Survey

Data Science Blogs-R-Us

I have written articles in many places. I will be collecting links to those sources here. The list is not complete and will be constantly evolving. There are some older blogs that I will be including in the list below as I remember them and find them. Also included are some interviews in which I provided detailed answers to a variety of questions.

In 2019, I was listed as the #1 Top Data Science Blogger to Follow on Twitter.

And then there’s this — not a blog, but a link to my 2013 TedX talk: “Big Data, Small World.” (Many more videos of my talks are available online. That list will be compiled in another place soon.)

  1. Rocket-Powered Data Science (the website that you are now reading).
  2. https://www.govloop.com/author/kirkdborne/
  3. https://datamakespossible.westerndigital.com/tag/kirk-borne/
  4. https://www.linkedin.com/in/kirkdborne/detail/recent-activity/posts/
  5. https://blog.qlik.com/how-we-teach-the-leaders-of-tomorrow-to-be-curious-ask-questions-and-not-be-afraid-to-fail-fast-to-learn-fast
  6. https://www.boozallen.com/s/insight/blog/kirk-borne-on-building-data-science-models.html
  7. https://odsc.com/blog/adapting-machine-learning-algorithms-to-novel-use-cases/
  8. https://www.sas.com/en_us/insights/articles/analytics/data-scientist-data-literacy.html
  9. https://blogs.sas.com/content/sascom/2019/04/27/getting-practical-about-ai-with-kirk-borne/
  10. https://blogs.sas.com/content/sascom/2017/08/31/3-machine-learning-technologies-3-three-years/
  11. https://www.digitalistmag.com/future-of-work/2019/05/15/intelligent-enterprise-connecting-islands-of-innovation-06198471
  12. https://www.digitalistmag.com/cio-knowledge/2019/06/27/data-strategy-that-first-date-with-your-data-06199224
  13. https://blogs.oracle.com/author/kirk-borne
  14. https://blogs.thomsonreuters.com/answerson/doing-better-at-your-service-with-ai-as-a-service/
  15. https://www.aitimejournal.com/data-science-interview-with-kirk-borne-principal-data-scientist-booz-allen-hamilton
  16. https://www.oreilly.com/people/kirk-borne/
  17. https://www.syntasa.com/blog/author/kirk-borne
  18. https://mapr.com/blog/author/kirk-borne/
  19. https://www.analyticbridge.datasciencecentral.com/profiles/blog/list?user=1zo63k80n1dij
  20. https://www.statisticsviews.com/view/searchResults.html?q=%22Kirk+Borne%22&submitSearch=Search
  21. https://muckrack.com/kirk-borne/articles
  22. https://www.kdnuggets.com/2019/01/data-scientist-dilemma-cold-start-machine-learning.html
  23. https://www.manthan.com/blogs/nrf-interview-with-kirk-borne-big-data-hype-the-worst-is-behind-us/
  24. https://www.thinkful.com/blog/meet-the-experts-dr-kirk-borne/
  25. https://itpeernetwork.intel.com/author/kirkborne/#gs.6zd0x8
  26. https://www.ibmbigdatahub.com/blog/author/kirk-borne
  27. https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/bult.2013.1720390414
  28. https://www.laserfiche.com/ecmblog/3-questions-kirk-borne-about-big-data/

The Power of Graph Databases, Linked Data, and Graph Algorithms

In 2019, I was asked to write the Foreword for the book Graph Algorithms: Practical Examples in Apache Spark and Neo4j, by Mark Needham and Amy E. Hodler. I wrote an extensive piece on the power of graph databases, linked data, graph algorithms, and various significant graph analytics applications. In their wisdom, the editors of the book decided that I wrote “too much”. So, they correctly shortened my contribution by about half in the final published version of my Foreword for the book.

The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j).

Graph Algorithms book

Now, for the first time, the full unabridged (and unedited) version of my initial contribution as the Foreword for the book is published here. (You should still get the book because it is a fantastic 250-page masterpiece for data scientists!) Any omissions, errors, or viewpoints in the piece below are entirely my own. I publish this in its original form in order to capture the essence of my point of view on the power of graph analytics.

As you read this, just remember the most important message: the natural data structure of the world is not rows and columns, but a graph. And this: perhaps the most powerful node in a graph model for real-world use cases might be “context”. How does one express “context” in a data model? Ahh, that’s the topic for another article. But this might help you get there: https://twitter.com/KirkDBorne/status/1232094501537275904

“All the World’s a Graph”

What do marketing attribution analysis, anti-money laundering, customer journey modeling, safety incident causal factor analysis, literature-based discovery, fraud network analysis, Internet search, the map app on your mobile phone, the spread of infectious diseases, and the theatrical performance of a William Shakespeare play all have in common? No, it is not something as trivial as the fact that all these phrases contain nouns and action verbs! What they have in common is that all these phrases proved that Shakespeare was right when he declared, “All the world’s a graph!”

Okay, the Bard of Avon did not actually say “Graph” in that sentence, but he did say “Stage” in the sentence. However, notice that all the examples mentioned above involve entities and the relationships between them, including both direct and indirect (transitive) relationships — a graph! Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. The relationships between the nodes are the edges in the graph. Therefore, isn’t the very essence of a Shakespearean play the live action portrayal of entities (the nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could have said “Graph” in his famous declaration.

What makes graph algorithms and graph databases so interesting and powerful isn’t the simple relationship between two entities: A is related to B. After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram). What makes graphs so remarkably different and important are directional relationships and transitive relationships. In directional relationships, A may cause B, but not the opposite. In transitive relationships, A can be directly related to B and B can be directly related to C, while A is not directly related to C, so that consequently A is transitively related to C.

Because of transitivity relationships, particularly when they are numerous and diverse with many possible relationship (network) patterns and many possible degrees of separation between the entities, the graph model enables discovery of relationships between two entities that otherwise may seem wholly disconnected, unrelated, and difficult (if not impossible) to discover in a relational database. Hence, the graph model can be applied productively and effectively in numerous network analysis use cases.

Consider this marketing attribution use case: person A sees the marketing campaign, person A talks about it on their social media account, person B is connected to person A and sees the comment, and subsequently person B buys the product. From the marketing campaign manager’s perspective, the standard relational model would fail to identify the attribution, since B did not see the campaign and A did not respond to the campaign. The campaign looks like a failure. But it is not a failure — its actual success (and positive ROI) is discovered by the graph analytics algorithm through the transitive relationship between the marketing campaign and the final customer purchase, through an intermediary (entity-in-the-middle)!

Next, consider the anti-money laundering (AML) use case: person A and person C are under suspicion for illicit trafficking. Any interaction between the two (e.g., a financial transaction in a financial database) would be flagged by the authorities, and the interactions would come under great scrutiny. However, if A and C never transact any business together, but instead conduct their financial dealings through a safe, respected, unflagged financial authority B, then who would ever notice the transaction? Well, the graph analytics algorithm would notice! The graph engine would discover that there was a transitive relationship between A and C through the intermediary B (the entity-in-the-middle).

Similar descriptions of the power of graph can be given for the other use cases mentioned in the opening paragraph above, all of which are examples of network analysis through graph algorithms. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations). Because of their great interest and power, we highlight two more of those use cases: Internet search and Literature-Based Discovery (LBD).

In Internet search a hyperlinked network (graph-based) algorithm is used by a major search engine to find the central authoritative node across the entire Internet for any given set of search words. The directionality of the edge is most important in this use case since the authoritative node in the network is the one that many other nodes point toward.

LBD is a knowledge network (graph-based) application in which significant discoveries are enabled across the knowledgebase of thousands (and even millions) of research journal articles — the discovery of “hidden knowledge” is only made through the connection between two published research results that may have a large number of degrees of separation (transitive relationships) between them. LBD is being applied to cancer research studies, where the massive semantic medical knowledgebase of symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term results, and long-term consequences may be “hiding” previously unknown cures or beneficial treatments of the most impenetrable cases. The knowledge is already in the network, if only we were to connect the dots to discover it.

The book Graph Algorithms: Practical Examples in Apache Spark and Neo4j is aimed at broadening our knowledge and capabilities around these types of graph analyses, including algorithms, concepts, and practical machine learning applications of the algorithms. From basic concepts to fundamental algorithms, to processing platforms and practical use cases, the authors have compiled an instructive and illustrative guide to the wonderful world of graphs.

Chapter 1 provides a beautiful introduction to graphs, graph analytics algorithms, network science, and graph analytics use cases. In the discussion of power-law distributions, we see again another way that graphs differ from more familiar statistical analyses that assume a normal distribution of properties in random populations. Prepare yourself for some unexpected insights when you realize that power-law distributions are incredibly common in the natural world — graph analytics is a great tool for exploring those scale-free structures and their long tails. By the way, I always love a discussion that mentions the Pareto distribution.

Chapter 2 steps up our graph immersion by introducing us to the many different types of graphs that represent the rich variety of informative relationships that can exist between nodes, including directed and undirected, cyclic and acyclic, trees, and more. If you have always wondered what a DAG was, now you have no more excuses for not knowing. It’s all here. The chapter ends with a quick summary of things to come in greater detail in future chapters, by defining the three major categories of graph algorithms: pathfinding, centrality, and community detection.

Chapter 3 focuses on the graph processing platforms that are mentioned in the subtitle to the book: Apache Spark and Neo4j. In the Apache Spark section, you will find information about the Spark Graph Project, GraphFrames, and Cypher (the graph query language). In the Neo4j section, you will learn about its APOC library: Awesome Procedures On Cypher. Brief instructions on installing these graph engines are included, to prepare you for the use cases and sample applications that are provided later in the book.

Chapters 4, 5, and 6 then dive into the three major graph algorithm categories mentioned earlier. For example, the map app on your mobile phone employs a version of the pathfinding algorithm. Root cause analysis, customer journey modeling, and the spread of infectious diseases are other example applications of pathfinding. Internet search and influencer detection in social networks are example applications of the centrality algorithm. Fraud network analysis, AML, and LBD are example applications of community detection.

Marketing attribution is a use case that may productively incorporate applications of all three graph analytics algorithm categories, depending on the specific question being asked: (1) how did the marketing message flow from source to final action? (pathfinding); (2) was there a dominant influencer who initiated the most ROI from the marketing campaign? (centrality); or (3) is there a community (a set of common personas) that are most responsive to the marketing campaign? (community detection).

Let’s not forget one more example application — a well-conceived theatrical masterpiece will almost certainly be an instantiation of community detection (co-conspirators, love triangles, and all that). That masterpiece will undoubtedly include a major villain or a central hero (representing centrality). Such a masterpiece is probably also a saga (the story of a journey), containing intrigues, strategies, and plots that move ingeniously, methodically, and economically (in three acts or less) toward some climactic ending (thus representing pathfinding).

In Chapter 7, we find many code samples for example applications of graph algorithms, thus rendering all the above knowledge real and useful to the reader. In this manner, the book becomes an instructional tutorial as well as a guide on the side. Putting graph algorithms into practice through these examples is one of the most brilliant contributions of this book — giving you the capability to do it for yourself and to start reaping the rewards of exploring the most natural data structure to describe the world: not rows and columns, but a graph! You will be able to connect the dots that aren’t connected in traditional data structures, build a knowledge graph, explore the graph for insights, and exploit it for value. Let’s put this another way: your graph-powered team will be able to increase the value of your organization’s data assets in ways that others may not have ever imagined. Your team will become graph heroes.

Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. The training data and feature sets that feed machine learning algorithms can now be immensely enriched with tags, labels, annotations, and metadata that were inferred and/or provided naturally through the transformation of your repository of data into a graph of data. Any node and its relationship to a particular node becomes a type of contextual metadata for that particular note. All of that “metadata” (which is simply “other data about your data”) enables rich discovery of shortest paths, central nodes, and communities.

Graph modeling of your data set thus enables more efficient and effective feature extraction and selection (also described in Chapter 8), as the graph exposes the most important, influential, representative, and explanatory attributes to be included in machine learning models that aim to predict a particular target outcome variable as accurately as possible.

When considering the power of graph, we should keep in mind that perhaps the most powerful node in a graph model for real-world use cases might be “context”, including the contextual metadata that we already mentioned. Context may include time, location, related events, nearby entities, and more. Incorporating context into the graph (as nodes and as edges) can thus yield impressive predictive analytics and prescriptive analytics capabilities.

When all these pieces and capabilities are brought together, the graph analytics engine is thereby capable of exploring deep relationships between events, actions, people, and other things across both spatial and temporal (as well as other contextual) dimensions. Consequently, a graph algorithm-powered analysis tool may be called a Spatial-Temporal Analytics Graph Engine (STAGE!). Therefore, if Shakespeare was alive today, maybe he would agree with that logic and would still say “All the world’s a STAGE.” In any case, he would want to read this book to learn how to enrich his stories with deeper insights into the world and with more interesting relationships.

Glossary of Digital Terminology for Career Relevance

Career Relevance - keeping you digitally literate
Career Relevance

Definitions of terminology frequently seen and used in discussions of emerging digital technologies.

Additive Manufacturing: see 3D-Printing

AGI (Artificial General Intelligence): The intelligence of a machine that has the capacity to understand or learn any intellectual task that a human being can. It is a primary goal of some artificial intelligence research and a common topic in science fiction and future studies.

AI (Artificial Intelligence): Application of Machine Learning algorithms to robotics and machines (including bots), focused on taking actions based on sensory inputs (data). Examples: (1-3) All those applications shown in the definition of Machine Learning. (4) Credit Card Fraud Alerts. (5) Chatbots (Conversational AI). There is nothing “artificial” about the applications of AI, whose tangible benefits include Accelerated Intelligence, Actionable Intelligence (and Actionable Insights), Adaptable Intelligence, Amplified Intelligence, Applied Intelligence, Assisted Intelligence, and Augmented Intelligence.

Algorithm: A set of rules to follow to solve a problem or to decide on a particular action (e.g., the thermostat in your house, or your car engine alert light, or a disease diagnosis, or the compound interest growth formula, or assigning the final course grade for a student).

Analytics: The products of Machine Learning and Data Science (such as predictive analytics, health analytics, cyber analytics).

AR (Augmented Reality): A technology that superimposes a computer-generated image on a user’s view of the real world, thus providing a composite view. Examples: (1) Retail. (2) Disaster Response. (3) Machine maintenance. (4) Medical procedures. (5) Video games in your real world. (6) Clothes shopping & fitting (seeing the clothes on you without a dressing room). (7) Security (airports, shopping malls, entertainment & sport events).

Autonomous Vehicles: Self-driving (guided without a human), informed by data streaming from many sensors (cameras, radar, LIDAR), and makes decisions and actions based on computer vision algorithms (ML and AI models for people, things, traffic signs,…). Examples: Cars, Trucks, Taxis

BI (Business Intelligence): Technologies, applications and practices for the collection, integration, analysis, and presentation of business information. The purpose of Business Intelligence is to support better business decision-making.

Big Data: An expression that refers to the current era in which nearly everything is now being quantified and tracked (i.e., data-fied). This leads to the collection of data and information on nearly full-population samples of things, instead of “representative subsamples”. There have been many descriptions of the characteristics of “Big Data”, but the three dominant attributes are Volume, Velocity, and Variety — the “3 V’s” concept was first introduced by Doug Laney in 2001 here. Read more in this article: “Why Today’s Big Data is Not Yesterday’s Big Data“. Some consider the 2011 McKinsey & Company research report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” as the trigger point when the world really started paying attention to the the volume and variety of data that organizations everywhere are collecting — the report stated, “The United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

Blockchain: A system in which a permanent and verifiable record of transactions is maintained across several computers that are linked in a peer-to-peer network. It has many applications beyond its original uses for bitcoin and other cryptocurrencies. Blockchain in an example of Distributed Ledger Technology, in which independent computers (referred to as nodes) record, share and synchronize transactions in their respective electronic ledgers (instead of keeping data centralized as in a traditional ledger). Blockchain’s name refers to a chain (growing list) of records, called blocks, which are linked using cryptography, and are used to record transactions between two parties efficiently and in a verifiable and permanent way. In simplest terms, Blockchain is a distributed database existing on multiple computers at the same time. It grows as new sets of recordings, or ‘blocks’, are added to it, forming a chain. The database is not managed by any particular body; instead, everyone in the network gets a copy of the whole database. Old blocks are preserved forever and new blocks are added to the ledger irreversibly, making it impossible to manipulate by faking documents, transactions and other information. All blocks are encrypted in a special way, so everyone can have access to all the information but only a user who owns a special cryptographic key is able to add a new record to a particular chain.

Chatbots (see also Virtual Assistants): These typically are text-based user interfaces (often customer-facing for organizations) that are designed and programmed to reply to only a certain set of questions or statements. If the question asked is other than the learned set of responses by the customer, the chatbot will fail. Chatbots cannot hold long, continuing human interaction. Traditionally they are text-based but audio and pictures can also be used for interaction. They provide more like an FAQ (Frequently Asked Questions) type of an interaction. They cannot process language inputs generally.

Cloud: The cloud is a metaphor for a global network of remote servers that operates transparently to the user as a single computing ecosystem, commonly associated with Internet-based computing.

Cloud Computing: The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server, local mainframe, or a personal computer.

Computer Vision: An interdisciplinary scientific field that focuses on how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do, including pattern detection, pattern recognition, pattern interpretation, and pattern classification.

Data Mining: Application of Machine Learning algorithms to large data collections, focused on pattern discovery and knowledge discovery in data. Pattern discovery includes clusters (class discovery), correlation (and trend) discovery, link (association) discovery, and anomaly detection (outlier detection, surprise discovery).

Data Science: Application of scientific method to discovery from data (including Statistics, Machine Learning, data visualization, exploratory data analysis, experimentation, and more).

Digital Transformation: Refers to the novel use of digital technology to solve traditional problems. These digital solutions enable — other than efficiency via automation — new types of innovation and creativity, rather than simply enhance and support traditional methods.

Digital Twins: A phrase used to describe a computerized (or digital) version of a real physical asset and/or process. The digital twin contains one or more sensors that collects data to represent real-time information about the physical asset. By bridging the physical and the virtual world, data is transmitted seamlessly allowing the virtual entity to exist simultaneously with the physical entity. Digital Twins are used in manufacturing, large-scale systems (such as maritime vessels, wind farms, space probes) and other complex systems. Digital Twins are virtual replicas of physical devices that data scientists and IT pros can use to run simulations before actual devices are built and deployed, and also while those devices are in operation. They represent a strong merging and optimization of numerous digital technologies such as IoT (IIoT), AI, Machine Learning, and Big Data Analytics.

Drone (UAV, UAS): An unmanned aerial vehicle (UAV) or uncrewed aerial vehicle (commonly known as a Drone) is an aircraft without a human pilot on board. UAVs are a component of an unmanned aircraft system (UAS); which include a UAV, a ground-based controller, and a system of communications between the two.

Dynamic Data-driven Application (Autonomous) Systems (DDDAS): A paradigm in which the computation and instrumentation aspects of an application system are dynamically integrated in a feed-back control loop, such that instrumentation data can be dynamically incorporated into the executing model of the application, and in reverse the executing model can control the instrumentation. Such approaches can enable more accurate and faster modeling and analysis of the characteristics and behaviors of a system and can exploit data in intelligent ways to convert them to new capabilities, including decision support systems with the accuracy of full scale modeling, efficient data collection, management, and data mining. See http://dddas.org/.

Edge Computing (and Edge Analytics): A distributed computing paradigm which brings computation to the data, closer to the location where it is needed, to improve response times in autonomous systems and to save bandwidth. Edge Analytics specifically refers to an approach to data collection and analysis in which an automated analytical computation is performed on data at a sensor, network switch or other device instead of waiting for the data to be sent back to a centralized data store. This is important in applications where the result of the analytic computation is needed as fast as possible (at the point of data collection), such as in autonomous vehicles or in digital manufacturing.

Industry 4.0: A reference to a new phase in the Industrial Revolution that focuses heavily on interconnectivity, automation, Machine Learning, and real-time data. Industry 4.0 is also sometimes referred to as IIoT (Industrial Internet of Things) or Smart Manufacturing, because it joins physical production and operations with smart digital technology, Machine Learning, and Big Data to create a more holistic and better connected ecosystem for companies that focus on manufacturing and supply chain management.

IoT (Internet of Things) and IIoT (Industrial IoT): Sensors embedded on devices and within things everywhere, measuring properties of things, and sharing that data over the Internet (over fast 5G), to fuel ML models and AI applications (including AR and VR) and to inform actions (robotics, autonomous vehicles, etc.). Examples:  (1) Wearable health devices (Fitbit). (2) Connected cars. (3) Connected products. (4) Precision farming. (5) Industry 4.0

Knowledge Graphs (see also Linked Data): Knowledge graphs encode knowledge arranged in a network of nodes (entities) and links (edges) rather than tables of rows and columns. The graph can be used to link units of data (the nodes, including concepts and content), with a link (the edge) that explicitly specifies what type of relationship connects the nodes.

Linked Data (see also Knowledge Graphs): A data structure in which data items are interlinked with other data items that enables the entire data set to be more useful through semantic queries. The easiest and most powerful standard designed for Linked Data is RDF (Resource Description Framework).

Machine Learning (ML): Mathematical algorithms that learn from experience (i.e., pattern detection and pattern recognition in data). Examples:  (1) Digit detection algorithm (used in automated Zip Code readers at Post Office. (2) Email Spam detection algorithm (used for Spam filtering). (3) Cancer detection algorithm (used in medical imaging diagnosis). 

MR (Mixed Reality): Sometimes referred to as hybrid reality, is the merging of real and virtual worlds to produce new environments and visualizations where physical and digital objects co-exist and interact in real time. It means placing new imagery within a real space in such a way that the new imagery is able to interact, to an extent, with what is real in the physical world we know. The key characteristic of MR is that the synthetic content and the real-world content are able to react to each other in real time.

NLP (Natural Language Processing), NLG (NL Generation), NLU (NL Understanding): NLP a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLG is a software process that transforms structured data into human-language content. It can be used to produce long form content for organizations to automate custom reports, as well as produce custom content for a web or mobile application, or produce the words that will be spoken by a Virtual (Voice-based) Assistant. NLU is a subtopic of Natural Language Processing in Artificial Intelligence that deals with algorithms that have language comprehension (understanding the meaning of the words, both their content and their context).

Quantum Computing: The area of study focused on developing computer technology based on the principles of quantum theory and quantum phenomena (such as superposition of states and entanglement). Qubits are the fundamental units of quantum computing — they are somewhat analogous to bits in a classical computer.

Robotics: A branch of AI concerned with creating devices that can move and react to sensory input (data). Examples: (1) Automated manufacturing assembly line. (2) Roomba (vacuums your house). (3) Warehouse / Logistics. (4) Prosthetics.

Statistics: the practice or science of collecting and analyzing numerical data, especially for the purpose of inferring proportions in a whole population from measurements of those properties within a representative subsample.

UAV (Unmanned Aerial Vehicle) and UAS (Unmanned Aircraft System): see Drones.

Virtual Assistants (see also Chatbots): A sophisticated voice-based interface in an interactive platform for user and customer interactions. Virtual assistants understand not only the language but also the meaning of what the user is saying. They can learn from their conversation instances, which can lead to an unpredictability in their behavior. Consequently, they can have extended adaptable human interaction. They can be set to perform slightly complicated tasks as well, such as order-taking and task fulfillment.

VR (Virtual Reality): Computer-generated simulation of a three-dimensional environment that can be interacted with in a seemingly real or physical way by a person using special electronic equipment, such as a helmet with a screen inside or gloves fitted with sensors. Examples: (1) Games. (2) Family adventures. (3) Training & Education. (4) Design. (5) Big Data Exploration.

XAI (eXplainable AI, Trusted AI): Artificial intelligence that is programmed to describe (explain) its purpose, rationale and decision-making process in a way that can be understood by the average person. This includes the specific criteria the program uses to arrive at a decision.

XPU: One of the many specialized CPUs for specific applications (similar to an ASIC), which may be real-time, data-intensive, data-specific, or at the edge (see Edge Analytics). For more information, refer to the article “Sensor Analytics at Micro Scale on the xPU“.

3D-Printing … moving on to 4D-printing: Additive Manufacturing — the action or process of making a physical object from a three-dimensional digital model, typically by laying down many thin layers of a material in succession. The terms “additive manufacturing” and “3D printing” both refer to creating an object by sequentially adding build material in successive cross-sections, one stacked upon another.

5G: Fifth-generation wireless, the latest iteration of cellular technology, engineered to greatly increase the speed and responsiveness of wireless networks. 5G will also enable a sharp increase in the amount of data transmitted over wireless systems due to more available bandwidth. Example applications: (1) High-definition and 3D video. (2) Gbit/sec Internet. (3) Broadband network access nearly everywhere. (4) IoT.

Glossaries of Data Science Terminology

Here is a compilation of glossaries of terminology used in data science, big data analytics, machine learning, AI, and related fields:

Data Science Glossary

A tag cloud of data science and machine learning terminology

Data Science Training Opportunities

A few years ago, I generated a list of places to receive data science training. That list has become a bit stale. So, I have updated the list, adding some new opportunities, keeping many of the previous ones, and removing the obsolete ones.

Also, here is a thorough, informative, and interesting article that outlines the critical skills needed in order to be a good data scientist: https://www.toptal.com/data-science#hiring-guide

Here are 30 training opportunities that I encourage you to explore:

  1. The Booz Allen Field Guide to Data Science
  2. NYC Data Science Academy
  3. NVIDIA Deep Learning Institute
  4. Metis Data Science Training
  5. Leada’s online analytics labs
  6. Data Science Training by General Assembly
  7. Learn Data Science Online by DataCamp
  8. (600+) Colleges and Universities with Data Science Degrees
  9. Data Science Master’s Degree Programs
  10. Data Analytics, Machine Learning, & Statistics Courses at edX
  11. Data Science Certifications (by AnalyticsVidhya)
  12. Learn Everything About Analytics (by AnalyticsVidhya)
  13. Big Bang Data Science Solutions
  14. CommonLounge
  15. IntelliPaat Online Training
  16. DataQuest
  17. NCSU Institute for Advanced Analytics
  18. District Data Labs
  19. Data School
  20. Galvanize
  21. Coursera
  22. Udacity Nanodegree Program to Become a Data Scientist
  23. Udemy – Data & Analytics
  24. Insight Data Science Fellows Program
  25. The Open Source Data Science Masters
  26. Jigsaw Academy Post Graduate Program in Data Science & Machine Learning
  27. O’Reilly Media Learning Paths
  28. Data Engineering and Data Science Training by Go Data Driven
  29. 18 Resources to Learn Data Science Online (by Simplilearn)
  30. Top Online Data Science Courses to Learn Data Science

Follow Kirk Borne on Twitter @KirkDBorne

Field Guide to Data Science
Learn the what, why, and how of Data Science and Machine Learning here.

Bias-Busting with Diversity in Data

Diversity in data is one of the three defining characteristics of big data — high data variety — along with high data volume and high velocity. We discussed the power and value of high-variety data in a previous article: “The Five Important D’s of Big Data Variety” We won’t repeat those lessons here, but we focus specifically on the bias-busting power of high-variety data, which was actually the last of the five D’s mentioned in the earlier article: Decreased model bias.

Here, we broaden our meaning of “bias” to go beyond model bias, which has the technical statistical meaning of “underfitting”, which essentially means that there is more information and structure in the data than our model has captured. In the current context, we apply a broader definition of bias: lacking a neutral viewpoint, or having a viewpoint that is partial. We will call this natural bias, since the examples can be considered as “naturally occurring” without obvious intent. This article does not elaborate on personal bias (which might be intentional), though the cause for that kind of prejudice is essentially the same: not considering and taking into account the full knowledge and understanding of the person or entity that is the subject of the bias.

We wrote a longer complete version of this article here: “Busting Bias with More Data Variety” at the Western Digital DataMakesPossible.com blog site.

In that full version of this article, we go on to describe several examples of natural bias and then to present a recommended bias-busting remedy for those of us working in the realm of data science. We refer to that remedy as the CCDI data & analytics strategy: Collect, Curate, Differentiate, and Innovate.

Here is one of the four examples of natural bias that you will find in the longer, complete version of the article:

  • An example of natural bias comes from a famous cartoon. The cartoon shows three or more blind men (or blindfolded men) feeling an elephant. They each feel a different aspect of the elephant: the tail, a tusk, an ear, the body, a leg — and consequently they each offer a different interpretation of what they believe this thing is (which they cannot see). They say it might be a rope (the tail), or a spear (the tusk), or a large fan (the ear), or a wall (the body), or a tree trunk (the leg). Only after the blindfolds are removed (or an explanation is given) do they finally “see” the full truth of this large complex reality. It has many different features, facets, and characteristics. Focusing on only one of those features and insisting that this partial view describes the whole thing would be foolish. We have similar complex systems in our organizations, whether it is the human body (in healthcare), or our population of customers (in marketing), or the Earth (in climate science), or different components in a complex system (like a manufacturing facility), or our students (in a classroom), or whatever. Unless we break down the silos and start sharing our data (insights) about all the dimensions, viewpoints, and perspectives of our complex system, we will consequently be drawn into biased conclusions and actions, and thus miss the key insights that enable us to understand the wonderful complexity and diversity of the thing in its entirety. Integrating the many data sources enables us to arrive at the “single correct view” of the thing: the 360 view!
Collecting high-variety data from diverse sources, connecting the dots, and building the 360 view of our domain is not only the data silo-busting thing to do. It is also the bias-busting thing to do. High-variety data makes that possible, and there is no shortage of biases for high-variety data to bust, including cognitive bias, confirmation bias, salience bias, and sampling bias, just to name a few! …
Read the full story here… “Busting Bias with More Data Variety

Sensor Analytics on Big Data at Micro Scale

We often think of analytics on large scales, particularly in the context of large data sets (“Big Data”). However, there is a growing analytics sector that is focused on the smallest scale. That is the scale of digital sensors — driving us into the new era of sensor analytics.

Small scale (i.e., micro scale) is nothing new in the digital realm. After all, the digital world came into existence as a direct consequence of microelectronics and microcircuits. We used to say in the early years of astronomy big data (which is my background) that the same transistor-based logic microcircuitry that comprises our data storage devices (which are storing massive streams of data) is essentially the same transistor-based logic microcircuitry inside our sensors (which are collecting that data). The latter includes, particularly, the sensors inside digital cameras, consisting of megapixels and even gigapixels. Consequently, there should be no surprise that the two digital data functions (sensing and storing) are intimately connected and that we are therefore drowning in oceans of data.

But, in our rush to crown data “big”, we sometimes may have forgotten that micro-scale component to the story. But not any longer. There is growing movement in the microchip world in new and interesting directions.

I am not only talking about evolutions of the CPU (central processing unit) that we have seen for years: the GPU (graphics processing unit) and the FPGA (field programmable gate array).  We are now witnessing the design, development, and deployment of more interesting application-specific integrated circuits (ASICs), one of which is the TPU (tensor processing unit) which is specifically designed for AI (artificial intelligence) applications. The TPU (as its name suggests) can perform tensor calculations on the chip, in the same way that earlier generation integrated circuits were designed to perform scalar operations (in the CPU) and to perform vector and/or parallel streaming operations (in the GPU).

Speeding the calculations is precisely the goal of these new chips. One that I heard discussed in the context of cybersecurity applications is the BPU (Behavior Processing Unit), designed to detect BOI (behaviors of interest). Whereas the TPU might be detecting persons of interest (POI) or objects of interest in an image, the BOI is looking at patterns in the time series (sequence data) that are indicative of interesting (and/or anomalous) behavioral patterns.

The BOI detector (the BPU) would definitely represent an amplifier to cybersecurity operations, in which the massive volumes of data streaming through our networks and routers are so huge that we never actually capture (and store) all of that data. So we need to detect the anomalous pattern in real-time before a damaging cyber incident occurs!

You can continue reading the long version of this article and learn more about the growing class of new analytics ASIC processors in my article “Sensor Analytics at Micro Scale on the xPU” at the Western Digital DataMakesPossible.com blog site.

Learn more about Machine Learning for Edge Devices at Western Digital here: https://blog.westerndigital.com/machine-learning-edge-devices/

Finally, see what’s cooking in Western Digital’s new Machine Learning Accelerator here: https://blog.westerndigital.com/machine-learning-accelerator-embedded-world-2019/

Brain CPU Chip for AI Acceleration

Brief Guide to xPU for AI Accelerators

Source for graphic: https://www.sigarch.org/a-brief-guide-of-xpu-for-ai-accelerators/

Meta-Learning For Better Machine Learning

In a related post we discussed the Cold Start Problem in Data Science — how do you start to build a model when you have either no training data or no clear choice of model parameters. An example of a cold start problem is k-Means Clustering, where the number of clusters k in the data set is not known in advance, and the locations of those clusters in feature space (i.e., the cluster means) are not known either. So, you start by assuming a value for k and making random assumptions about the cluster means, and then iterate until you find the optimal set of clusters, based upon some evaluation metric. See the related post for more details about the cold start challenge. See the attached graphic below for a simple demonstration of a k-Means Clustering application.

The above example (clustering) is taken from unsupervised machine learning (where there are no labels on the training data). There are also examples of cold start in supervised machine learning (where you do have class labels on the training data).

As an example of a cold start in supervised learning, we look at neural network models, where the weights on the edges that connect the various nodes in the network layers are not known initially. So, random values (e.g., all weights = 1) are assigned to all of the edge weights (which could number in the hundreds or thousands) — that’s the cold start. Following that, the weights can “learn” to get better through a technique known as backpropagation, which is applied through sequential iterations of the neural network learning process. A validation metric estimates the error in each model iteration in the sequence (i.e., the classification error on the validation or hold-out data set), then applies the backpropagation technique to assign some portion of the error to each of the edge weights. Each edge weight is adjusted accordingly using gradient descent (or some similar error correction rate estimator) for the next model in the sequence. The next iteration of the neural network modeling process is executed, applying the same steps as above, and the process continues until the validation metric converges to the optimal final model.

What is missing in the above discussion is the deeper set of unknowns in the learning process. This is the meta-learning phase. We can elucidate this phase through our two examples above.

From the first example above, k-Means Clustering:

  • What is the value of k?
  • Which features in the data set are most effective in creating distinct clusters in the data (i.e., to create the segments that are the most compact internally, and relatively the most separated from each other)? There might be dozens or hundreds or thousands of attributes to choose from, and a vast number of combinations of those attributes in which to explore clustering in different dimensions of parameter space.
  • What distance metric should be used to estimate separation (or what similarity metric should be used to estimate similarity), since clustering is a distance-based algorithm? There are some common choices for distance and similarity metrics (e.g., cosine similarity, Euclidean distance, Manhattan distance, Mahalanobis distance, Lp-Norm, etc.), but that is just the tip of a vast iceberg — just take a look at the 750-page book Encyclopedia of Distances.
  • What evaluation metric should we use to determine if the clusters are “good enough” or optimal (i.e., the most compact set of clusters relative to the separation of the clusters)? There are several choices for such evaluation metrics: Dunn index, Davies-Bouldin index, C-index, and Silhouette analysis are just a few examples.

We need to decide on all of these parameterizations of the clustering model before the cold start interations on the cluster means can begin.

From the second example above, Neural Network modeling, there are also many different preliminary tasks and parameterizations of the network that need to be decided and acted on before the cold start iterations on the edge weights can begin:

This now gets to the heart of meta-learning. It is focused on learning the right tasks to perform and on tuning the modeling hyper-parameters. These are the different tasks and “external” parameters that differentiate various instantiations of a specific model within a broader category of models — those tasks and external parameterizations must be explored before you start building, iterating, and validating a specific model’s “internal” parameters. For example:

  • You can cluster children’s toys in a toy store by color, or by shape, or by electronic vs. non-electronic, or by age-appropriateness, or by functionality, or cluster them by some combination of those features.
  • You can cluster (segment) your customers by the types of products they buy, or by their geographic location, or by their gender, or by their age, or by the day of week that they prefer to shop, or cluster them by some combination of those many different variables.
  • You can cluster medical drug treatments by the types of symptoms that they address, or by the medical diagnoses (outcomes) that they attempt to cure, or by their dosage amounts, or cluster them by the side-effects that are caused when different combinations of the drugs are used.

Deciding on the higher-level hyper-parameterizations of your clustering approach before you build the actual models is good data science and good business, no matter whether you are sorting toys, or discovering segments in your customer database, or prescribing different medications to medical patients.

Similar decisions must be made for the neural network example mentioned earlier as well as for numerous other machine learning modeling techniques. Meta-learning is important to make sure that you are aware of and attentive to the many choices of modeling tasks and parameterizations for the models that you are about to train. Meta-learning is also critical for demonstrating (proving) that you built the best (or optimal or most accurate) model, given the higher level characteristics (e.g., parameters, architecture, or input data sources) of the modeling effort:

  • What is the business case? What outcomes will be actionable?
  • What data do we have? Which combinations of data have we not explored yet?
  • What metric will demonstrate that we have achieved the globally optimal model (or approximately the global optimum), versus some locally good model that doesn’t generalize across a larger data set?

Genetic Algorithms (GAs) are an example of meta-learning. They are not machine learning algorithms in themselves, but GAs can be applied across ensembles of machine learning models and tasks, in order to find the optimal model (perhaps globally optimal model) across a collection of locally optimal solutions.

Learn more about meta-learning from these resources:

Finally, in addition to the awesome 750-page book Encyclopedia of Distances“, please check out some of these top-selling books on Data Science, AI, and Machine Learning:

Top Books in AI and Machine Learning

Top Books in AI and Machine Learning


Disclosure statement: As an Amazon Associate I earn from qualifying purchases.