Linked Data - Rocket-Powered Data Science

In 2019, I was asked to write the Foreword for the book “Graph Algorithms: Practical Examples in Apache Spark and Neo4j”, by Mark Needham and Amy E. Hodler. I wrote an extensive piece on the power of graph databases, linked data, graph algorithms, and various significant graph analytics applications. In their wisdom, the editors of the book decided that I wrote “too much”. So, they correctly shortened my contribution by about half in the final published version of my Foreword for the book.

The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j).

Now, for the first time, the full unabridged (and unedited) version of my initial contribution as the Foreword for the book is published here. (You should still get the book because it is a fantastic 250-page masterpiece for data scientists!) Any omissions, errors, or viewpoints in the piece below are entirely my own. I publish this in its original form in order to capture the essence of my point of view on the power of graph analytics.

As you read this, just remember the most important message: the natural data structure of the world is not rows and columns, but a graph. And this: perhaps the most powerful node in a graph model for real-world use cases might be “context”. How does one express “context” in a data model? Ahh, that’s the topic for another article. But this might help you get there: https://twitter.com/KirkDBorne/status/1232094501537275904

“All the World’s a Graph”

What do marketing attribution analysis, anti-money laundering, customer journey modeling, safety incident causal factor analysis, literature-based discovery, fraud network analysis, Internet search, the map app on your mobile phone, the spread of infectious diseases, and the theatrical performance of a William Shakespeare play all have in common? No, it is not something as trivial as the fact that all these phrases contain nouns and action verbs! What they have in common is that all these phrases prove that Shakespeare was right when he declared, “All the world’s a graph!”

Okay, the Bard of Avon did not actually say “Graph” in that sentence, but he did say “Stage” in the sentence. However, notice that all the examples mentioned above involve entities and the relationships between them, including both direct and indirect (transitive) relationships — a graph! Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. The relationships between the nodes are the edges in the graph. Therefore, isn’t the very essence of a Shakespearean play the live action portrayal of entities (the nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could have said “Graph” in his famous declaration.

What makes graph algorithms and graph databases so interesting and powerful isn’t the simple relationship between two entities: A is related to B. After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram). What makes graphs so remarkably different and important are directional relationships and transitive relationships. In directional relationships, A may cause B, but not the opposite. In transitive relationships, A can be directly related to B and B can be directly related to C, while A is not directly related to C, so that consequently A is transitively related to C.
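To make the distinction between direct, directional, and transitive relationships concrete, here is a minimal Python sketch. The three-node graph and the helper names (build_adjacency, reachable_from) are my own illustrative assumptions, not code from the book.

```python
from collections import deque

# Hypothetical directed edges: A -> B and B -> C, with no direct A -> C edge.
edges = [("A", "B"), ("B", "C")]

def build_adjacency(edge_list):
    """Map each node to the set of nodes it points to directly."""
    adjacency = {}
    for source, target in edge_list:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())
    return adjacency

def reachable_from(adjacency, start):
    """Return every node reachable from start by following one or more directed edges."""
    seen = set()
    queue = deque(adjacency[start])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(adjacency[node])
    return seen

graph = build_adjacency(edges)
direct = graph["A"]                                # relationships stored as edges
transitive = reachable_from(graph, "A") - direct   # relationships implied by paths

print(direct)                              # {'B'}
print(transitive)                          # {'C'}
print("A" in reachable_from(graph, "C"))   # False: direction matters
```

A is directly related only to B, yet the traversal surfaces C as a transitive relationship, while nothing leads from C back to A.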

Because of transitive relationships, particularly when they are numerous and diverse with many possible relationship (network) patterns and many possible degrees of separation between the entities, the graph model enables discovery of relationships between two entities that otherwise may seem wholly disconnected, unrelated, and difficult (if not impossible) to discover in a relational database. Hence, the graph model can be applied productively and effectively in numerous network analysis use cases.

Consider this marketing attribution use case: person A sees the marketing campaign, person A talks about it on their social media account, person B is connected to person A and sees the comment, and subsequently person B buys the product. From the marketing campaign manager’s perspective, the standard relational model would fail to identify the attribution, since B did not see the campaign and A did not respond to the campaign. The campaign looks like a failure. But it is not a failure — its actual success (and positive ROI) is discovered by the graph analytics algorithm through the transitive relationship between the marketing campaign and the final customer purchase, through an intermediary (entity-in-the-middle)!

Next, consider the anti-money laundering (AML) use case: person A and person C are under suspicion for illicit trafficking. Any interaction between the two (e.g., a financial transaction in a financial database) would be flagged by the authorities, and the interactions would come under great scrutiny. However, if A and C never transact any business together, but instead conduct their financial dealings through a safe, respected, unflagged financial authority B, then who would ever notice the transaction? Well, the graph analytics algorithm would notice! The graph engine would discover that there was a transitive relationship between A and C through the intermediary B (the entity-in-the-middle).
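The entity-in-the-middle idea can be sketched in a few lines of Python. The transaction records and the function names below are hypothetical, and a real AML system would of course weigh amounts, timing, and much more; this only shows the two-hop pattern itself.

```python
# Hypothetical transaction records as (sender, receiver) pairs; names are illustrative only.
transactions = [
    ("A", "B"),
    ("B", "C"),
    ("A", "D"),
    ("E", "C"),
]

def transact_directly(transaction_list, first, second):
    """True if the two entities ever transact with each other, in either direction."""
    return any({s, r} == {first, second} for s, r in transaction_list)

def find_intermediaries(transaction_list, source, target):
    """Return entities that receive from source and also send to target."""
    receives_from_source = {r for s, r in transaction_list if s == source}
    sends_to_target = {s for s, r in transaction_list if r == target}
    return receives_from_source & sends_to_target

print(transact_directly(transactions, "A", "C"))    # False
print(find_intermediaries(transactions, "A", "C"))  # {'B'}
```

No record links A and C directly, so a row-by-row check finds nothing, but intersecting A's receivers with C's senders exposes B.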

Similar descriptions of the power of graph can be given for the other use cases mentioned in the opening paragraph above, all of which are examples of network analysis through graph algorithms. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations). Because of their great interest and power, we highlight two more of those use cases: Internet search and Literature-Based Discovery (LBD).

In Internet search a hyperlinked network (graph-based) algorithm is used by a major search engine to find the central authoritative node across the entire Internet for any given set of search words. The directionality of the edge is most important in this use case since the authoritative node in the network is the one that many other nodes point toward.
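As a rough illustration of why edge direction matters here, the sketch below runs a simplified PageRank-style power iteration over a toy web. Choosing PageRank is my assumption (the text above does not name the algorithm), and the page names, damping value, and iteration count are illustrative only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict of page -> list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank along its outgoing links in equal parts.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical toy web: three pages all point toward "hub".
toy_links = {
    "page1": ["hub"],
    "page2": ["hub"],
    "page3": ["hub", "page1"],
    "hub": ["page1", "page2"],
}

scores = pagerank(toy_links)
most_authoritative = max(scores, key=scores.get)
print(most_authoritative)  # hub
```

The node that many others point toward accumulates the highest score, which is the sense of "authoritative" described above.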

LBD is a knowledge network (graph-based) application in which significant discoveries are enabled across the knowledgebase of thousands (and even millions) of research journal articles — the discovery of “hidden knowledge” is only made through the connection between two published research results that may have a large number of degrees of separation (transitive relationships) between them. LBD is being applied to cancer research studies, where the massive semantic medical knowledgebase of symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term results, and long-term consequences may be “hiding” previously unknown cures or beneficial treatments of the most impenetrable cases. The knowledge is already in the network, if only we were to connect the dots to discover it.
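One simple way to picture "connecting the dots" is to look for terms that are two hops away from a starting term but never linked to it directly. The sketch below does that on a tiny, entirely made-up set of co-occurrence links; the term names and functions are hypothetical, and real LBD systems work over far richer semantic networks.

```python
from collections import defaultdict

# Hypothetical co-occurrence links extracted from article abstracts (illustrative terms only).
article_links = [
    ("treatment_X", "mechanism_M"),
    ("treatment_X", "mechanism_N"),
    ("mechanism_M", "disease_Y"),
    ("mechanism_N", "disease_Y"),
    ("mechanism_N", "disease_Z"),
]

def build_undirected(links):
    """Map each term to the set of terms it co-occurs with."""
    neighbors = defaultdict(set)
    for left, right in links:
        neighbors[left].add(right)
        neighbors[right].add(left)
    return neighbors

def hidden_connections(neighbors, start):
    """Return {candidate: bridging terms} for terms two hops from start with no direct link."""
    candidates = defaultdict(set)
    for bridge in neighbors[start]:
        for candidate in neighbors[bridge]:
            if candidate != start and candidate not in neighbors[start]:
                candidates[candidate].add(bridge)
    return dict(candidates)

knowledge = build_undirected(article_links)
found = hidden_connections(knowledge, "treatment_X")

# Rank candidates by how many independent bridges support them.
for term, bridges in sorted(found.items(), key=lambda item: -len(item[1])):
    print(term, sorted(bridges))
```

No single article links treatment_X to either disease, yet the graph suggests both as candidates, with disease_Y supported by two separate bridging terms.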

The book Graph Algorithms: Practical Examples in Apache Spark and Neo4j is aimed at broadening our knowledge and capabilities around these types of graph analyses, including algorithms, concepts, and practical machine learning applications of the algorithms. From basic concepts to fundamental algorithms, to processing platforms and practical use cases, the authors have compiled an instructive and illustrative guide to the wonderful world of graphs.

Chapter 1 provides a beautiful introduction to graphs, graph analytics algorithms, network science, and graph analytics use cases. In the discussion of power-law distributions, we see again another way that graphs differ from more familiar statistical analyses that assume a normal distribution of properties in random populations. Prepare yourself for some unexpected insights when you realize that power-law distributions are incredibly common in the natural world — graph analytics is a great tool for exploring those scale-free structures and their long tails. By the way, I always love a discussion that mentions the Pareto distribution.

Chapter 2 steps up our graph immersion by introducing us to the many different types of graphs that represent the rich variety of informative relationships that can exist between nodes, including directed and undirected, cyclic and acyclic, trees, and more. If you have always wondered what a DAG was, now you have no more excuses for not knowing. It’s all here. The chapter ends with a quick summary of things to come in greater detail in future chapters, by defining the three major categories of graph algorithms: pathfinding, centrality, and community detection.
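For readers who want a quick feel for what makes a graph a DAG (a directed acyclic graph), here is a small sketch using Python's standard graphlib module (Python 3.9 or later). The two toy graphs and the is_dag helper are my own illustrative assumptions.

```python
from graphlib import TopologicalSorter, CycleError

def is_dag(dependencies):
    """Return True if the directed graph (node -> set of predecessor nodes) has no cycle."""
    try:
        # Consuming static_order() raises CycleError if any cycle exists.
        list(TopologicalSorter(dependencies).static_order())
        return True
    except CycleError:
        return False

# Hypothetical graphs, written as node -> set of nodes that must come before it.
acyclic = {"B": {"A"}, "C": {"B"}}            # A -> B -> C
cyclic = {"B": {"A"}, "C": {"B"}, "A": {"C"}}  # A -> B -> C -> A

print(is_dag(acyclic))   # True
print(is_dag(cyclic))    # False
print(list(TopologicalSorter(acyclic).static_order()))  # ['A', 'B', 'C']
```

A directed graph is a DAG exactly when its nodes can be put into such an order, with every edge pointing forward.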

Chapter 3 focuses on the graph processing platforms that are mentioned in the subtitle to the book: Apache Spark and Neo4j. In the Apache Spark section, you will find information about the Spark Graph Project, GraphFrames, and Cypher (the graph query language). In the Neo4j section, you will learn about its APOC library: Awesome Procedures On Cypher. Brief instructions on installing these graph engines are included, to prepare you for the use cases and sample applications that are provided later in the book.

Chapters 4, 5, and 6 then dive into the three major graph algorithm categories mentioned earlier. For example, the map app on your mobile phone employs a version of the pathfinding algorithm. Root cause analysis, customer journey modeling, and the spread of infectious diseases are other example applications of pathfinding. Internet search and influencer detection in social networks are example applications of the centrality algorithm. Fraud network analysis, AML, and LBD are example applications of community detection.

Marketing attribution is a use case that may productively incorporate applications of all three graph analytics algorithm categories, depending on the specific question being asked: (1) how did the marketing message flow from source to final action? (pathfinding); (2) was there a dominant influencer who initiated the most ROI from the marketing campaign? (centrality); or (3) is there a community (a set of common personas) that are most responsive to the marketing campaign? (community detection).
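The three questions above can each be sketched with a few lines of plain Python on a toy network. Everything here is an assumption made for illustration: the names, the connections, and the choice of the simplest possible representative for each category (breadth-first search for pathfinding, degree counts for centrality, and connected components as a stand-in for community detection).

```python
from collections import deque

# Hypothetical, undirected toy network of who is connected to whom.
connections = [
    ("campaign", "ann"),
    ("ann", "bob"),
    ("ann", "cara"),
    ("bob", "purchase"),
    ("dev", "eli"),
]

def build_graph(pairs):
    graph = {}
    for left, right in pairs:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph

def shortest_path(graph, start, goal):
    """Pathfinding: breadth-first search for a fewest-hops path, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

def degree_centrality(graph):
    """Centrality: number of direct connections per node (the simplest centrality measure)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def connected_components(graph):
    """Community detection stand-in: groups of nodes that can all reach one another."""
    unvisited = set(graph)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            for neighbor in graph[queue.popleft()]:
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

network = build_graph(connections)
path = shortest_path(network, "campaign", "purchase")
degrees = degree_centrality(network)
top_influencer = max(degrees, key=degrees.get)
communities = connected_components(network)

print(path)              # ['campaign', 'ann', 'bob', 'purchase']
print(top_influencer)    # ann
print(len(communities))  # 2
```

The path answers question (1), the highest-degree node answers question (2), and the groups answer question (3). Production-grade community detection and centrality algorithms are considerably more sophisticated than these stand-ins.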

Let’s not forget one more example application — a well-conceived theatrical masterpiece will almost certainly be an instantiation of community detection (co-conspirators, love triangles, and all that). That masterpiece will undoubtedly include a major villain or a central hero (representing centrality). Such a masterpiece is probably also a saga (the story of a journey), containing intrigues, strategies, and plots that move ingeniously, methodically, and economically (in three acts or less) toward some climactic ending (thus representing pathfinding).

In Chapter 7, we find many code samples for example applications of graph algorithms, thus rendering all the above knowledge real and useful to the reader. In this manner, the book becomes an instructional tutorial as well as a guide on the side. Putting graph algorithms into practice through these examples is one of the most brilliant contributions of this book — giving you the capability to do it for yourself and to start reaping the rewards of exploring the most natural data structure to describe the world: not rows and columns, but a graph! You will be able to connect the dots that aren’t connected in traditional data structures, build a knowledge graph, explore the graph for insights, and exploit it for value. Let’s put this another way: your graph-powered team will be able to increase the value of your organization’s data assets in ways that others may not have ever imagined. Your team will become graph heroes.

Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. The training data and feature sets that feed machine learning algorithms can now be immensely enriched with tags, labels, annotations, and metadata that were inferred and/or provided naturally through the transformation of your repository of data into a graph of data. Any node and its relationship to a particular node becomes a type of contextual metadata for that particular node. All of that “metadata” (which is simply “other data about your data”) enables rich discovery of shortest paths, central nodes, and communities.
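To show what "contextual metadata" from a graph might look like as machine learning input, here is a hedged sketch that turns each node's neighborhood into a small feature row. The relationships, the fraud labels, and the three feature names are all hypothetical choices of mine, not features prescribed by the book.

```python
# Hypothetical toy data: undirected relationships plus one known label per node.
relationships = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4")]
known_fraud = {"u1": False, "u2": True, "u3": False, "u4": False}

def neighbor_map(pairs):
    """Map each node to the set of nodes it is directly related to."""
    neighbors = {}
    for left, right in pairs:
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)
    return neighbors

def graph_features(neighbors, labels):
    """Build one feature row per node from its position in the graph."""
    rows = {}
    for node, adjacent in neighbors.items():
        # Nodes exactly two hops away: neighbors of neighbors, minus the node and its own neighbors.
        two_hop = set()
        for other in adjacent:
            two_hop |= neighbors[other]
        two_hop -= adjacent | {node}
        flagged = sum(1 for other in adjacent if labels.get(other, False))
        rows[node] = {
            "degree": len(adjacent),
            "two_hop_count": len(two_hop),
            "flagged_neighbor_share": flagged / len(adjacent),
        }
    return rows

features = graph_features(neighbor_map(relationships), known_fraud)
print(features["u1"])  # {'degree': 2, 'two_hop_count': 1, 'flagged_neighbor_share': 0.5}
```

Each row could be appended to a node's ordinary attributes before model training, so that the model sees not just the entity but also its context.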

Graph modeling of your data set thus enables more efficient and effective feature extraction and selection (also described in Chapter 8), as the graph exposes the most important, influential, representative, and explanatory attributes to be included in machine learning models that aim to predict a particular target outcome variable as accurately as possible.

When considering the power of graph, we should keep in mind that perhaps the most powerful node in a graph model for real-world use cases might be “context”, including the contextual metadata that we already mentioned. Context may include time, location, related events, nearby entities, and more. Incorporating context into the graph (as nodes and as edges) can thus yield impressive predictive analytics and prescriptive analytics capabilities.

When all these pieces and capabilities are brought together, the graph analytics engine is thereby capable of exploring deep relationships between events, actions, people, and other things across both spatial and temporal (as well as other contextual) dimensions. Consequently, a graph algorithm-powered analysis tool may be called a Spatial-Temporal Analytics Graph Engine (STAGE!). Therefore, if Shakespeare were alive today, maybe he would agree with that logic and would still say “All the world’s a STAGE.” In any case, he would want to read this book to learn how to enrich his stories with deeper insights into the world and with more interesting relationships.

Data Reflections by Dr. Kirk Borne @KirkDBorne

The Power of Graph Databases, Linked Data, and Graph Algorithms
