Monthly Archives: March 2020

The Power of Graph Databases, Linked Data, and Graph Algorithms

“All the World’s a Graph”

What do marketing attribution analysis, anti-money laundering, customer journey modeling, safety incident causal factor analysis, literature-based discovery, fraud network analysis, Internet search, the map app on your mobile phone, the spread of infectious diseases, and the theatrical performance of a William Shakespeare play all have in common? No, it is not something as trivial as the fact that all these phrases contain nouns and action verbs! What they have in common is that all of them prove that Shakespeare was right when he declared, “All the world’s a graph!”

Okay, the Bard of Avon did not actually say “Graph” in that sentence, but he did say “Stage” in the sentence. However, notice that all the examples mentioned above involve entities and the relationships between them, including both direct and indirect (transitive) relationships — a graph! Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. The relationships between the nodes are the edges in the graph. Therefore, isn’t the very essence of a Shakespearean play the live action portrayal of entities (the nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could have said “Graph” in his famous declaration.

What makes graph algorithms and graph databases so interesting and powerful isn’t the simple relationship between two entities: A is related to B. After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram). What makes graphs so remarkably different and important are directional relationships and transitive relationships. In directional relationships, A may cause B, but not the opposite. In transitive relationships, A can be directly related to B and B can be directly related to C, while A is not directly related to C, so that consequently A is transitively related to C.
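To make the directional and transitive ideas concrete, here is a minimal Python sketch. It assumes a toy edge list of my own invention (A points to B, B points to C) and a hypothetical helper named `reachable_from`; it is an illustration, not code from any particular graph engine.

```python
from collections import deque

def reachable_from(edges, start):
    """Return {node: hops} for every node reachable from `start`
    by following directed edges (breadth-first search)."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    del hops[start]  # report only the other nodes
    return hops

# Hypothetical toy graph: A -> B and B -> C, with no direct A -> C edge.
edges = [("A", "B"), ("B", "C")]

print(reachable_from(edges, "A"))  # {'B': 1, 'C': 2}
print(reachable_from(edges, "C"))  # {}
```

A reaches C in two hops (the transitive relationship), while C reaches nothing (the directional part): the edges only run one way.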

Because of transitive relationships, particularly when they are numerous and diverse with many possible relationship (network) patterns and many possible degrees of separation between the entities, the graph model enables discovery of relationships between two entities that otherwise may seem wholly disconnected, unrelated, and difficult (if not impossible) to discover in a relational database. Hence, the graph model can be applied productively and effectively in numerous network analysis use cases.

Consider this marketing attribution use case: person A sees the marketing campaign, person A talks about it on their social media account, person B is connected to person A and sees the comment, and subsequently person B buys the product. From the marketing campaign manager’s perspective, the standard relational model would fail to identify the attribution, since B did not see the campaign and A did not respond to the campaign. The campaign looks like a failure. But it is not a failure — its actual success (and positive ROI) is discovered by the graph analytics algorithm through the transitive relationship between the marketing campaign and the final customer purchase, through an intermediary (entity-in-the-middle)!
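The attribution chain described above can be sketched as a path search. The node names (`campaign`, `person_A`, `social_post`, `person_B`, `purchase`) and the `shortest_path` helper are hypothetical stand-ins for whatever touchpoints a real marketing graph would record.

```python
from collections import deque

def shortest_path(adjacency, source, target):
    """Return the shortest directed path from source to target as a list
    of nodes, or None if the target cannot be reached."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back through the parents
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical touchpoints; each edge reads "led to".
touchpoints = {
    "campaign": ["person_A"],      # A sees the campaign
    "person_A": ["social_post"],   # A comments on social media
    "social_post": ["person_B"],   # B sees the comment
    "person_B": ["purchase"],      # B buys the product
}

print(shortest_path(touchpoints, "campaign", "purchase"))
# ['campaign', 'person_A', 'social_post', 'person_B', 'purchase']
```

No single edge links the campaign to the purchase, yet the path connecting them is there for the finding.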

Next, consider the anti-money laundering (AML) use case: person A and person C are under suspicion for illicit trafficking. Any interaction between the two (e.g., a financial transaction in a financial database) would be flagged by the authorities, and the interactions would come under great scrutiny. However, if A and C never transact any business together, but instead conduct their financial dealings through a safe, respected, unflagged financial authority B, then who would ever notice the transaction? Well, the graph analytics algorithm would notice! The graph engine would discover that there was a transitive relationship between A and C through the intermediary B (the entity-in-the-middle).
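As a rough illustration of that entity-in-the-middle check, the sketch below treats transactions as undirected links and looks for counterparties shared by two parties of interest. The transaction list and the `find_intermediaries` helper are invented for this example; a real AML system would of course weigh amounts, timing, and much more.

```python
def find_intermediaries(transactions, party_x, party_y):
    """Return the sorted list of parties that transacted with both
    party_x and party_y (one-hop intermediaries)."""
    counterparties = {}
    for payer, payee in transactions:
        counterparties.setdefault(payer, set()).add(payee)
        counterparties.setdefault(payee, set()).add(payer)

    shared = counterparties.get(party_x, set()) & counterparties.get(party_y, set())
    return sorted(shared - {party_x, party_y})

# Hypothetical ledger: A and C never transact directly.
transactions = [("A", "B"), ("B", "C"), ("A", "D"), ("E", "C")]

print(find_intermediaries(transactions, "A", "C"))  # ['B']
```

Only B touches both A and C, so B is the intermediary worth a closer look.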

Similar descriptions of the power of graph can be given for the other use cases mentioned in the opening paragraph above, all of which are examples of network analysis through graph algorithms. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations). Because of their great interest and power, we highlight two more of those use cases: Internet search and Literature-Based Discovery (LBD).

In Internet search a hyperlinked network (graph-based) algorithm is used by a major search engine to find the central authoritative node across the entire Internet for any given set of search words. The directionality of the edge is most important in this use case since the authoritative node in the network is the one that many other nodes point toward.
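The post does not name the algorithm, so take the following as an assumption on my part: a small PageRank-style power iteration, which is one well-known way to score nodes by the links pointing at them. The toy link graph, the `pagerank` function name, and the damping value of 0.85 are all illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes of a directed graph by a PageRank-style power iteration.
    `links` maps each page to the list of pages it points to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share ("teleportation").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical web: three pages all point toward "hub".
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # hub
```

Reversing the direction of the edges would change the answer, which is why directionality matters so much in this use case.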

LBD is a knowledge network (graph-based) application in which significant discoveries are enabled across the knowledgebase of thousands (and even millions) of research journal articles — the discovery of “hidden knowledge” is only made through the connection between two published research results that may have a large number of degrees of separation (transitive relationships) between them. LBD is being applied to cancer research studies, where the massive semantic medical knowledgebase of symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term results, and long-term consequences may be “hiding” previously unknown cures or beneficial treatments of the most impenetrable cases. The knowledge is already in the network, if only we were to connect the dots to discover it.

The book Graph Algorithms: Practical Examples in Apache Spark and Neo4j is aimed at broadening our knowledge and capabilities around these types of graph analyses, including algorithms, concepts, and practical machine learning applications of the algorithms. From basic concepts to fundamental algorithms, to processing platforms and practical use cases, the authors have compiled an instructive and illustrative guide to the wonderful world of graphs.

Chapter 1 provides a beautiful introduction to graphs, graph analytics algorithms, network science, and graph analytics use cases. In the discussion of power-law distributions, we see again another way that graphs differ from more familiar statistical analyses that assume a normal distribution of properties in random populations. Prepare yourself for some unexpected insights when you realize that power-law distributions are incredibly common in the natural world — graph analytics is a great tool for exploring those scale-free structures and their long tails. By the way, I always love a discussion that mentions the Pareto distribution.

Chapter 2 steps up our graph immersion by introducing us to the many different types of graphs that represent the rich variety of informative relationships that can exist between nodes, including directed and undirected, cyclic and acyclic, trees, and more. If you have always wondered what a DAG was, now you have no more excuses for not knowing. It’s all here. The chapter ends with a quick summary of things to come in greater detail in future chapters, by defining the three major categories of graph algorithms: pathfinding, centrality, and community detection.
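For anyone still wondering about DAGs (directed acyclic graphs), here is a short sketch of one standard test for them: Kahn's topological sort, which succeeds only when a directed graph has no cycles. The `topological_order` name and the sample edges are mine, not the book's.

```python
from collections import deque

def topological_order(edges):
    """Return a topological order of the nodes if the directed graph is
    acyclic (a DAG); return None if it contains a cycle."""
    adjacency = {}
    indegree = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        indegree[target] = indegree.get(target, 0) + 1
        indegree.setdefault(source, 0)

    # Start from nodes that nothing points to.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adjacency.get(node, []):
            indegree[neighbor] -= 1
            if indegree[neighbor] == 0:
                queue.append(neighbor)

    # Nodes stuck on a cycle never reach in-degree zero.
    return order if len(order) == len(indegree) else None

print(topological_order([("A", "B"), ("B", "C"), ("A", "C")]))  # ['A', 'B', 'C']
print(topological_order([("A", "B"), ("B", "A")]))              # None
```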

Chapter 3 focuses on the graph processing platforms that are mentioned in the subtitle to the book: Apache Spark and Neo4j. In the Apache Spark section, you will find information about the Spark Graph Project, GraphFrames, and Cypher (the graph query language). In the Neo4j section, you will learn about its APOC library: Awesome Procedures On Cypher. Brief instructions on installing these graph engines are included, to prepare you for the use cases and sample applications that are provided later in the book.

Chapters 4, 5, and 6 then dive into the three major graph algorithm categories mentioned earlier. For example, the map app on your mobile phone employs a version of the pathfinding algorithm. Root cause analysis, customer journey modeling, and the spread of infectious diseases are other example applications of pathfinding. Internet search and influencer detection in social networks are example applications of the centrality algorithm. Fraud network analysis, AML, and LBD are example applications of community detection.
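Pathfinding and centrality are easy to picture, so here is a sketch for the third category. It uses connected components, about the simplest notion of a "community" there is, implemented with a union-find structure. The account names are hypothetical, and real fraud or AML work would typically call for richer community detection methods than this.

```python
def connected_components(edges):
    """Group the nodes of an undirected graph into connected components
    using union-find; returns a sorted list of sorted node lists."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical accounts linked by shared transactions.
links = [("acct1", "acct2"), ("acct2", "acct3"), ("acct8", "acct9")]
print(connected_components(links))
# [['acct1', 'acct2', 'acct3'], ['acct8', 'acct9']]
```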

Marketing attribution is a use case that may productively incorporate applications of all three graph analytics algorithm categories, depending on the specific question being asked: (1) how did the marketing message flow from source to final action? (pathfinding); (2) was there a dominant influencer who initiated the most ROI from the marketing campaign? (centrality); or (3) is there a community (a set of common personas) that are most responsive to the marketing campaign? (community detection).

Let’s not forget one more example application — a well-conceived theatrical masterpiece will almost certainly be an instantiation of community detection (co-conspirators, love triangles, and all that). That masterpiece will undoubtedly include a major villain or a central hero (representing centrality). Such a masterpiece is probably also a saga (the story of a journey), containing intrigues, strategies, and plots that move ingeniously, methodically, and economically (in three acts or less) toward some climactic ending (thus representing pathfinding).

In Chapter 7, we find many code samples for example applications of graph algorithms, thus rendering all the above knowledge real and useful to the reader. In this manner, the book becomes an instructional tutorial as well as a guide on the side. Putting graph algorithms into practice through these examples is one of the most brilliant contributions of this book — giving you the capability to do it for yourself and to start reaping the rewards of exploring the most natural data structure to describe the world: not rows and columns, but a graph! You will be able to connect the dots that aren’t connected in traditional data structures, build a knowledge graph, explore the graph for insights, and exploit it for value. Let’s put this another way: your graph-powered team will be able to increase the value of your organization’s data assets in ways that others may not have ever imagined. Your team will become graph heroes.

Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. The training data and feature sets that feed machine learning algorithms can now be immensely enriched with tags, labels, annotations, and metadata that were inferred and/or provided naturally through the transformation of your repository of data into a graph of data. Any node and its relationship to a particular node becomes a type of contextual metadata for that particular node. All of that “metadata” (which is simply “other data about your data”) enables rich discovery of shortest paths, central nodes, and communities.

Graph modeling of your data set thus enables more efficient and effective feature extraction and selection (also described in Chapter 8), as the graph exposes the most important, influential, representative, and explanatory attributes to be included in machine learning models that aim to predict a particular target outcome variable as accurately as possible.
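As a small, hedged illustration of graph-derived features, the sketch below turns an edge list into per-node in-degree and out-degree counts that could sit alongside other columns in a training table. The `degree_features` helper and the sample edges are hypothetical, and degree counts are only the most basic of the features a graph can supply.

```python
def degree_features(edges):
    """Return {node: {"in_degree": int, "out_degree": int}} computed
    from a list of directed (source, target) edges."""
    features = {}
    for source, target in edges:
        features.setdefault(source, {"in_degree": 0, "out_degree": 0})
        features.setdefault(target, {"in_degree": 0, "out_degree": 0})
        features[source]["out_degree"] += 1
        features[target]["in_degree"] += 1
    return features

# Hypothetical "who influenced whom" edges.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat")]
for node, row in sorted(degree_features(edges).items()):
    print(node, row)
# ann {'in_degree': 0, 'out_degree': 2}
# bob {'in_degree': 1, 'out_degree': 1}
# cat {'in_degree': 2, 'out_degree': 0}
```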

When considering the power of graph, we should keep in mind that perhaps the most powerful node in a graph model for real-world use cases might be “context”, including the contextual metadata that we already mentioned. Context may include time, location, related events, nearby entities, and more. Incorporating context into the graph (as nodes and as edges) can thus yield impressive predictive analytics and prescriptive analytics capabilities.

When all these pieces and capabilities are brought together, the graph analytics engine is thereby capable of exploring deep relationships between events, actions, people, and other things across both spatial and temporal (as well as other contextual) dimensions. Consequently, a graph algorithm-powered analysis tool may be called a Spatial-Temporal Analytics Graph Engine (STAGE!). Therefore, if Shakespeare were alive today, maybe he would agree with that logic and would still say “All the world’s a STAGE.” In any case, he would want to read this book to learn how to enrich his stories with deeper insights into the world and with more interesting relationships.

Shocking Amount of Data


Fifty years after the publication of Alvin Toffler’s landmark book “Future Shock”, a new book “After Shock” is here. This 540-page compendium collects over 100 modern-day perspectives on After Shock from leading futurists, including deep assessments of Toffler’s formidably prescient predictions from half a century ago, along with a status check on the current exponential growth (“shock”) in all sectors of the world economy, from the unique vantage points of many different contributors. Contributors include David Brin, Po Bronson, Sanjiv Chopra, George Gilder, Newt Gingrich, Alan Kay, Ray Kurzweil, Jane McGonigal, Lord Martin Rees, Byron Reese, and many others. I am honored to be included among those luminary contributors. I present here a short excerpt from my contribution to the book.

After Shock (published in 2020), edited by John Schroeter

“Shocking Amount of Data”

An excerpt from my chapter in the book:

“We are fully engulfed in the era of massive data collection. All those data represent the most critical and valuable strategic assets of modern organizations that are undergoing digital disruption and digital transformation. Advanced analytics tools and techniques drive insights discovery, innovation, new market opportunities, and value creation from the data. However, our enthusiasm for “big data” is tempered by the fact that this data flood also drives us to sensory input shock and awe.

“Among the countless amazing foresights that appeared in Alvin Toffler’s Future Shock was the concept of information overload. His discussion of the topic came long before the creation and proliferation of social networks, the World Wide Web, the Internet, enterprise databases, ubiquitous sensors, and digital data collection by all organizations—big and small, public and private, near and far. The clear and present human consequences and dangers of infoglut were succinctly called out as section headings in Chapter 16 “The Psychological Dimension” of Toffler’s book, including these: “the overstimulated individual,” “bombardment of the senses” and “decision stress.”

“While these ominous forecasts have now become a reality for our digitally drenched society, especially for the digital natives who have known no other experience, there is hope for a lifeline that we can grasp while swimming (or drowning) in that sea of data. And that hope emanates from the same foundation that is the basis of the information overload shock itself. That foundation is data, and the hope is AI – artificial intelligence. The promise of AI is entirely dependent on the flood of sensory input data that fuels the advanced algorithms that activate and stimulate the Actionable Insights (representing another definition of A.I.), which is what AI is really aimed at achieving.

“AI is a great producer of dichotomous reactions in society: hype versus hope, concern versus commercialization, fantasy versus forward-thinking, overstimulation versus overabundance of opportunity, bombardment of the senses versus bountiful insights into solving hard problems, and decision stress versus automated decision support. Could we imagine any technology that has more mutually contradictory psychological dimensions than AI? I cannot.

“AI takes its cue from data. It needs data – and not just small quantities, but large quantities of data, containing training examples of all sorts of behaviors, events, processes, things, and outcomes. Just as a human (or any cognitive creature) receives sensory inputs from its world through multiple channels (e.g., our five senses) and then produces an output—make a decision, take an action—in response to those inputs, similarly that is exactly what an AI does. The AI relies on mathematical algorithms that sort through streams of data, to find the most relevant, informative, actionable, and pertinent patterns in the data. From those presently perceived patterns, AI then produces an output (decision and/or action). For example, when a human detects and then recognizes a previously seen pattern, the person knows how to respond—what to decide and how to act. This is what an AI is being trained to do automatically and with minimal human intervention through the data inputs that cue the algorithms.

“How AI helps with infoglut and thereby addresses the themes of Toffler’s writings (“information overload”, “overstimulation and bombardment of the senses”, and “decision stress”) is through…

…”

You can continue reading my chapter, plus dozens more perspectives, in the full After Shock book here: https://amzn.to/2S01MC7

Top 10 Conversations That You Do Not Want to Have on Data Innovation Day

Rocket-Powered Data Science

Data Reflections by Dr. Kirk Borne @KirkDBorne
