Monthly Archives: March 2020

The Power of Graph Databases, Linked Data, and Graph Algorithms

In 2019, I was asked to write the Foreword for the book Graph Algorithms: Practical Examples in Apache Spark and Neo4j, by Mark Needham and Amy E. Hodler. I wrote an extensive piece on the power of graph databases, linked data, graph algorithms, and various significant graph analytics applications. In their wisdom, the editors of the book decided that I wrote “too much”. So, they correctly shortened my contribution by about half in the final published version of my Foreword for the book.

The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j).

Graph Algorithms book

Now, for the first time, the full unabridged (and unedited) version of my initial contribution as the Foreword for the book is published here. (You should still get the book because it is a fantastic 250-page masterpiece for data scientists!) Any omissions, errors, or viewpoints in the piece below are entirely my own. I publish this in its original form in order to capture the essence of my point of view on the power of graph analytics.

As you read this, just remember the most important message: the natural data structure of the world is not rows and columns, but a graph. And this: perhaps the most powerful node in a graph model for real-world use cases might be “context”. How does one express “context” in a data model? Ahh, that’s the topic for another article. But this might help you get there: https://twitter.com/KirkDBorne/status/1232094501537275904

“All the World’s a Graph”

What do marketing attribution analysis, anti-money laundering, customer journey modeling, safety incident causal factor analysis, literature-based discovery, fraud network analysis, Internet search, the map app on your mobile phone, the spread of infectious diseases, and the theatrical performance of a William Shakespeare play all have in common? No, it is not something as trivial as the fact that all these phrases contain nouns and action verbs! What they have in common is that all these phrases proved that Shakespeare was right when he declared, “All the world’s a graph!”

Okay, the Bard of Avon did not actually say “Graph” in that sentence, but he did say “Stage” in the sentence. However, notice that all the examples mentioned above involve entities and the relationships between them, including both direct and indirect (transitive) relationships — a graph! Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. The relationships between the nodes are the edges in the graph. Therefore, isn’t the very essence of a Shakespearean play the live action portrayal of entities (the nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could have said “Graph” in his famous declaration.

What makes graph algorithms and graph databases so interesting and powerful isn’t the simple relationship between two entities: A is related to B. After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram). What makes graphs so remarkably different and important are directional relationships and transitive relationships. In directional relationships, A may cause B, but not the opposite. In transitive relationships, A can be directly related to B and B can be directly related to C, while A is not directly related to C, so that consequently A is transitively related to C.

Because of transitivity relationships, particularly when they are numerous and diverse with many possible relationship (network) patterns and many possible degrees of separation between the entities, the graph model enables discovery of relationships between two entities that otherwise may seem wholly disconnected, unrelated, and difficult (if not impossible) to discover in a relational database. Hence, the graph model can be applied productively and effectively in numerous network analysis use cases.

Consider this marketing attribution use case: person A sees the marketing campaign, person A talks about it on their social media account, person B is connected to person A and sees the comment, and subsequently person B buys the product. From the marketing campaign manager’s perspective, the standard relational model would fail to identify the attribution, since B did not see the campaign and A did not respond to the campaign. The campaign looks like a failure. But it is not a failure — its actual success (and positive ROI) is discovered by the graph analytics algorithm through the transitive relationship between the marketing campaign and the final customer purchase, through an intermediary (entity-in-the-middle)!

Next, consider the anti-money laundering (AML) use case: person A and person C are under suspicion for illicit trafficking. Any interaction between the two (e.g., a financial transaction in a financial database) would be flagged by the authorities, and the interactions would come under great scrutiny. However, if A and C never transact any business together, but instead conduct their financial dealings through a safe, respected, unflagged financial authority B, then who would ever notice the transaction? Well, the graph analytics algorithm would notice! The graph engine would discover that there was a transitive relationship between A and C through the intermediary B (the entity-in-the-middle).

Similar descriptions of the power of graph can be given for the other use cases mentioned in the opening paragraph above, all of which are examples of network analysis through graph algorithms. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations). Because of their great interest and power, we highlight two more of those use cases: Internet search and Literature-Based Discovery (LBD).

In Internet search a hyperlinked network (graph-based) algorithm is used by a major search engine to find the central authoritative node across the entire Internet for any given set of search words. The directionality of the edge is most important in this use case since the authoritative node in the network is the one that many other nodes point toward.

LBD is a knowledge network (graph-based) application in which significant discoveries are enabled across the knowledgebase of thousands (and even millions) of research journal articles — the discovery of “hidden knowledge” is only made through the connection between two published research results that may have a large number of degrees of separation (transitive relationships) between them. LBD is being applied to cancer research studies, where the massive semantic medical knowledgebase of symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term results, and long-term consequences may be “hiding” previously unknown cures or beneficial treatments of the most impenetrable cases. The knowledge is already in the network, if only we were to connect the dots to discover it.

The book Graph Algorithms: Practical Examples in Apache Spark and Neo4j is aimed at broadening our knowledge and capabilities around these types of graph analyses, including algorithms, concepts, and practical machine learning applications of the algorithms. From basic concepts to fundamental algorithms, to processing platforms and practical use cases, the authors have compiled an instructive and illustrative guide to the wonderful world of graphs.

Chapter 1 provides a beautiful introduction to graphs, graph analytics algorithms, network science, and graph analytics use cases. In the discussion of power-law distributions, we see again another way that graphs differ from more familiar statistical analyses that assume a normal distribution of properties in random populations. Prepare yourself for some unexpected insights when you realize that power-law distributions are incredibly common in the natural world — graph analytics is a great tool for exploring those scale-free structures and their long tails. By the way, I always love a discussion that mentions the Pareto distribution.

Chapter 2 steps up our graph immersion by introducing us to the many different types of graphs that represent the rich variety of informative relationships that can exist between nodes, including directed and undirected, cyclic and acyclic, trees, and more. If you have always wondered what a DAG was, now you have no more excuses for not knowing. It’s all here. The chapter ends with a quick summary of things to come in greater detail in future chapters, by defining the three major categories of graph algorithms: pathfinding, centrality, and community detection.

Chapter 3 focuses on the graph processing platforms that are mentioned in the subtitle to the book: Apache Spark and Neo4j. In the Apache Spark section, you will find information about the Spark Graph Project, GraphFrames, and Cypher (the graph query language). In the Neo4j section, you will learn about its APOC library: Awesome Procedures On Cypher. Brief instructions on installing these graph engines are included, to prepare you for the use cases and sample applications that are provided later in the book.

Chapters 4, 5, and 6 then dive into the three major graph algorithm categories mentioned earlier. For example, the map app on your mobile phone employs a version of the pathfinding algorithm. Root cause analysis, customer journey modeling, and the spread of infectious diseases are other example applications of pathfinding. Internet search and influencer detection in social networks are example applications of the centrality algorithm. Fraud network analysis, AML, and LBD are example applications of community detection.

Marketing attribution is a use case that may productively incorporate applications of all three graph analytics algorithm categories, depending on the specific question being asked: (1) how did the marketing message flow from source to final action? (pathfinding); (2) was there a dominant influencer who initiated the most ROI from the marketing campaign? (centrality); or (3) is there a community (a set of common personas) that are most responsive to the marketing campaign? (community detection).

Let’s not forget one more example application — a well-conceived theatrical masterpiece will almost certainly be an instantiation of community detection (co-conspirators, love triangles, and all that). That masterpiece will undoubtedly include a major villain or a central hero (representing centrality). Such a masterpiece is probably also a saga (the story of a journey), containing intrigues, strategies, and plots that move ingeniously, methodically, and economically (in three acts or less) toward some climactic ending (thus representing pathfinding).

In Chapter 7, we find many code samples for example applications of graph algorithms, thus rendering all the above knowledge real and useful to the reader. In this manner, the book becomes an instructional tutorial as well as a guide on the side. Putting graph algorithms into practice through these examples is one of the most brilliant contributions of this book — giving you the capability to do it for yourself and to start reaping the rewards of exploring the most natural data structure to describe the world: not rows and columns, but a graph! You will be able to connect the dots that aren’t connected in traditional data structures, build a knowledge graph, explore the graph for insights, and exploit it for value. Let’s put this another way: your graph-powered team will be able to increase the value of your organization’s data assets in ways that others may not have ever imagined. Your team will become graph heroes.

Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. The training data and feature sets that feed machine learning algorithms can now be immensely enriched with tags, labels, annotations, and metadata that were inferred and/or provided naturally through the transformation of your repository of data into a graph of data. Any node and its relationship to a particular node becomes a type of contextual metadata for that particular note. All of that “metadata” (which is simply “other data about your data”) enables rich discovery of shortest paths, central nodes, and communities.

Graph modeling of your data set thus enables more efficient and effective feature extraction and selection (also described in Chapter 8), as the graph exposes the most important, influential, representative, and explanatory attributes to be included in machine learning models that aim to predict a particular target outcome variable as accurately as possible.

When considering the power of graph, we should keep in mind that perhaps the most powerful node in a graph model for real-world use cases might be “context”, including the contextual metadata that we already mentioned. Context may include time, location, related events, nearby entities, and more. Incorporating context into the graph (as nodes and as edges) can thus yield impressive predictive analytics and prescriptive analytics capabilities.

When all these pieces and capabilities are brought together, the graph analytics engine is thereby capable of exploring deep relationships between events, actions, people, and other things across both spatial and temporal (as well as other contextual) dimensions. Consequently, a graph algorithm-powered analysis tool may be called a Spatial-Temporal Analytics Graph Engine (STAGE!). Therefore, if Shakespeare was alive today, maybe he would agree with that logic and would still say “All the world’s a STAGE.” In any case, he would want to read this book to learn how to enrich his stories with deeper insights into the world and with more interesting relationships.

Shocking Amount of Data

50 years after the publication of Alvin Toffler’s landmark book “Future Shock“, a new book “After Shock” is here. This 540-page compendium collects over 100 modern-day perspectives on After Shock from leading futurists, including deep assessments of Toffler’s formidably prescient predictions from half a century ago, along with a status check on the current exponential growth (“shock”) in all sectors of the world economy, from the unique vantage points of many different contributors. Contributors include David Brin, Po Bronson, Sanjiv Chopra, George Gilder, Newt Gingrich, Alan Kay, Ray Kurzweil, Jane McGonigal, Lord Martin Rees, Byron Reese, and many others. I am honored to be included among those luminary contributors. I present here a short excerpt from my contribution to the book.

After Shock, book published in 2020
After Shock, edited by John Schroeter

“Shocking Amount of Data”

An excerpt from my chapter in the book:

“We are fully engulfed in the era of massive data collection. All those data represent the most critical and valuable strategic assets of modern organizations that are undergoing digital disruption and digital transformation. Advanced analytics tools and techniques drive insights discovery, innovation, new market opportunities, and value creation from the data. However, our enthusiasm for “big data” is tempered by the fact that this data flood also drives us to sensory input shock and awe.

“Among the countless amazing foresights that appeared in Alvin Toffler’s Future Shock was the concept of information overload. His discussion of the topic came long before the creation and proliferation of social networks, the World Wide Web, the Internet, enterprise databases, ubiquitous sensors, and digital data collection by all organizations—big and small, public and private, near and far. The clear and present human consequences and dangers of infoglut were succinctly called out as section headings in Chapter 16 “The Psychological Dimension” of Toffler’s book, including these: “the overstimulated individual,” “bombardment of the senses” and “decision stress.”

“While these ominous forecasts have now become a reality for our digitally drenched society, especially for the digital natives who have known no other experience, there is hope for a lifeline that we can grasp while swimming (or drowning) in that sea of data. And that hope emanates from the same foundation that is the basis of the information overload shock itself. That foundation is data, and the hope is AI – artificial intelligence. The promise of AI is entirely dependent on the flood of sensory input data that fuels the advanced algorithms that activate and stimulate the Actionable Insights (representing another definition of A.I.), which is what AI is really aimed at achieving.

“AI is a great producer of dichotomous reactions in society: hype versus hope, concern versus commercialization, fantasy versus forward-thinking, overstimulation versus overabundance of opportunity, bombardment of the senses versus bountiful insights into solving hard problems, and decision stress versus automated decision support. Could we imagine any technology that has more mutually contradictory psychological dimensions than AI? I cannot.

“AI takes its cue from data. It needs data – and not just small quantities, but large quantities of data, containing training examples of all sorts of behaviors, events, processes, things, and outcomes. Just as a human (or any cognitive creature) receives sensory inputs from its world through multiple channels (e.g., our five senses) and then produces an output—make a decision, take an action—in response to those inputs, similarly that is exactly what an AI does. The AI relies on mathematical algorithms that sort through streams of data, to find the most relevant, informative, actionable, and pertinent patterns in the data. From those presently perceived patterns, AI then produces an output (decision and/or action). For example, when a human detects and then recognizes a previously seen pattern, the person knows how to respond—what to decide and how to act. This is what an AI is being trained to do automatically and with minimal human intervention through the data inputs that cue the algorithms.

“How AI helps with infoglut and thereby addresses the themes of Toffler’s writings (“information overload”, “overstimulation and bombardment of the senses”, and “decision stress”) is through…

…”

You can continue reading my chapter, plus dozens more perspectives, in the full After Shock book here: https://amzn.to/2S01MC7

Top 10 Conversations That You Do Not Want to Have on Data Innovation Day

[Note: this post is a slightly modified version of an earlier post. Although written with light (perhaps weak) humor, this post was created to bring attention to the Benefits of Open Data for innovation, value creation, and digital transformation on Open Data Day.]

Data Innovation is a powerful strategic goal for data-intensive organizations, especially to be celebrated through Data Innovation Day events, whenever they may occur. Here are the top 10 conversations that you do not want to have on that day. Let the countdown begin….

10.  CDO (Chief Data Officer) speaking to Data Innovation Day event manager who is trying to re-schedule the event for Father’s Day: “Don’t do that! It’s pronounced ‘Day-tuh’, not ‘Dadda’.”

9.  CDO speaking at the company’s Data Innovation Day event regarding an acronym that was used to abbreviate his job title in the event program guide: “I am the company’s Big Data ‘As A Service’ guru, not the company’s Bigdata ‘As Software Service’ guru.”  (Hint: that’s BigData-aaS, not Big-aSS)

8.  Data Scientist speaking to Data Innovation Day session chairperson: “Why are all of these cowboys on stage with me? I said I was planning to give a LASSO demonstration.”

​7.  Any person speaking to you: “Our organization has always done big data.”

6.  You speaking to any person: “Seriously? … The title of our Data Innovation Day Event is ‘Big Data is just Small Data, Only Bigger’.

5.  New cybersecurity administrator (fresh from college) sends this e-mail to company’s Data Scientists at 4:59pm on Friday: “The security holes in our data-sharing platform are now fixed. It will now automatically block all ports from accepting incoming data access requests between 5:00pm today and 9:00am Monday.  Gotta go now.  Have a nice weekend.  From your new BFF.”

4.  Data Scientist speaking to new Human Resources Department Analytics ​Specialist regarding the truckload of tree seedlings that she received as her end-of-year company bonus:  “I said in my employment application that I like Decision Trees, not Deciduous Trees.”

3.  Organizer for the huge Las Vegas Data Innovation Day Symposium speaking to the conference keynote speaker: “Oops, sorry. I blew your $100,000 speaker’s honorarium at the poker tables in the Grand Casino.”

2.  Over-zealous cleaning crew speaking to the Data Center Manager as she is arriving for work in the morning after Data Innovation Day event that was held in the company’s shiny new Exascale Data Center: “We did a very thorough job cleaning your data center. And we won’t even charge you for the extra hours that we spent wiping the disk drives clean and deleting all the dirty data that you kept talking about yesterday.”

1.  Announcement to University staff regarding the Data Innovation Day event:  “Dan Ariely’s keynote talk ‘Big Data is Like Teenage Sex‘ is being moved from basement room B002 in the Physics Department to the Campus Football Stadium due to overwhelming student interest.”

Follow Kirk Borne on Twitter @KirkDBorne