Category Archives: Big Data

SAP Datasphere Powers Business at the Speed of Data

We live in a data-rich, insights-rich, and content-rich world. Data collections are the ones and zeroes that encode the actionable insights (patterns, trends, relationships) that we seek to extract from our data through machine learning and data science. The insights are used to produce informative content for stakeholders (decision-makers, business users, and clients). Content includes reports, documents, articles, presentations, visualizations, video, and audio representations of the insights and knowledge that have been extracted from data.

We could further refine our opening statement to say that our business users are too often in a state of being data-rich, but insights-poor, and content-hungry. With all the data in and around the enterprise, users would say that they have a lot of information but need more insights to assist them in producing better and more informative content. This is where we dispel an old “big data” notion (heard a decade ago) that was expressed like this: “we need our data to run at the speed of business.” Instead, what we really need is for our business to run at the speed of data. It is a major digital transformation challenge for businesses to keep up with data flows coming from a multitude of diverse sources, in different formats, at different cadences, on different dimensions of the enterprise, captured “safely” in different business silos. Transforming data to actionable insights and informative content needs some help!

AI is now helping in all these steps – not simply because it is “Artificial” intelligence, but primarily because AI is a tool for assisted, amplified, and augmented intelligence (the “new AI”) and because AI powers accelerated and automated intelligence, in order to deliver actionable intelligence. It appears that it’s AI everywhere all the time.

AI can help business users extract and produce (i.e., generate) informative content from insights. Plus, AI can also help find key insights encoded in data. And AI can help users find the appropriate data that they need from across the enterprise. In the language of Information Retrieval, AI delivers both high Recall (“did I get all the data that I need?”) and high Precision (“did I get only the data that I need?).

Discover the essential data – that’s AI.

Extract the essential insights from the data – that’s AI.

Produce essential content from the insights – that’s AI.

Well, okay, we can slap the “AI” label on everything, but what good is that? How are we helped when we board the AI hype train? In fact, by putting a single label like AI on all the steps of a data-driven business process, we have effectively not only blurred the process, but we have also blurred the particular characteristics that make each step separately distinct, uniquely critical, and ultimately dependent on specialized, specific technologies and business domain expertise at each step. This is where SAP Datasphere (the next generation of SAP Data Warehouse Cloud) comes in.

The new SAP Datasphere comprehensive data service provides powerful, seamless, and scalable access across the enterprise, across business departments, and across business silos to the specific mission-critical business data collection(s) that are needed for each unique business use case: access to external insights from data on the marketplace and competitors, access to internal insights from data on business processes and enterprise resources, and access to insights on customer-facing business products and services at the intersection of internal and external data sources.

So, if your business users don’t have access to the right data in the right context at the right time for the right business questions, then the whole business data workflow breaks down. SAP Datasphere has arrived to address those pain points, by enabling discovery, access, and integration of the heterogeneous data distributed across the enterprise. And so begins the process of insight discovery and content creation that meets the most significant, timely, and case-specific needs of decision-makers, business knowledge workers, and other stakeholders.

The release of SAP Datasphere was launched and announced globally on March 8, 2023. Live online presentations, demos, and customer testimonials were complemented with new content posted at sap.com/datasphere. Here are just 10 of the many key features of Datasphere that were covered during the launch day announcements:

Datasphere works with the SAP Analytics Cloud and runs on the existing SAP BTP (Business Technology Platform), with all the essential features: security, access control, high availability.
Datasphere accesses and integrates both SAP and non-SAP data sources into end-users’ data flows, including on-prem data warehouses, cloud data warehouses and lakehouses, relational databases, virtual data products, in-memory data, and applications that generate data (such as external API data loads).
Datasphere manages and integrates structured, semi-structured, and unstructured data types.
Datasphere empowers data democratization, by providing all business users with self-service data access, including virtual data products that can be stored, re-used, and shared.
Datasphere is a data discovery tool with essential functionalities: recommendations, data marketplace, and business content (i.e., incorporates the business context of the data and data products that are being recommended and delivered).
Datasphere goes beyond the “big three” data usage end-user requirements (ease of discovery, access, and delivery) to include data orchestration (data ops and data transformations) and business data contextualization (semantics, metadata, catalog services).
Datasphere is an enhanced data warehousing service that includes business semantics (through both analytic and relational models) and a knowledge graph (linking business content with business context).
Datasphere provides full-spectrum data governance: metadata management, data catalogs, data privacy, data quality, and data lineage (provenance) tracking.
Datasphere provides all the outgoing data orchestration functions and incoming data ingestion functions, including replication, federation, real-time stream processing, and application integration.
Datasphere is not just for data managers. It thrives with data consumers, who are doing planning, analytics, data science, and developing intelligent data applications – by providing those users with an end-to-end view of their data landscape in a trusted, secure, and actionable data environment.

Source: https://news.sap.com/2023/03/sap-datasphere-business-data-fabric/

SAP also announced key partners that further enhance Datasphere as a powerful business data fabric. These partners are:

Collibra – providing data governance and discovery (metadata, catalogs) across the entire data landscape.
Confluent – providing access and discovery across real-time event data and streaming data. This emphatically addresses the “data in motion” challenge of enabling “business to run at the speed of data.”
Databricks – providing the complete business context across the evolved Data Warehouse Cloud – the new Data Lakehouse platform.
DataRobot – provides the AI, machine learning (ML), and AutoML capabilities that address the augmented intelligence requirements described at the beginning of this article.

I will finish with three quotes. The first is SAP customer testimonial from Mr. David Johnston, the Chief Information Officer at Messer Americas (leading provider of industrial and medical gases for over 120 years):

“The [Datasphere] business data fabric architecture enables us to bring SAP and non-SAP data together in the seamless and self-service way we’ve been envisioning. SAP Datasphere provided us with a solution to build a harmonized layer, or business data fabric, across SAP and non-SAP, cloud or on-premise data sources, making the best use of our existing investments [in both SAP and non-SAP data services].”

The second quote is from independent analyst Tony Baer:

“SAP’s goal is not simply pairing a data transformation factory with a data warehouse, but instead delivering a service that preserves the context of source data. As you would guess, maintaining context relies on metadata. The challenge is that when you use existing tools for replicating, moving and transforming data, the metadata typically does not usually go along with it. … SAP’s applications are a rich treasure store for business data and the process semantics that go with them. So, it’s logical that SAP has expanded on the business semantic layer of its Data Warehouse Cloud to deliver a data fabric that surfaces the metadata in business terms.”

The third quote is from Juergen Mueller, SAP Chief Technology Officer and member of the Executive Board of SAP SE:

“With SAP customers generating 87% of total global commerce, SAP data is among a company’s most valuable business assets and is contained in the most important functions of an organization, from manufacturing to supply chains, finance, human resources and more. We want to help our customers take the next step to easily and confidently integrate SAP data with non-SAP data from third-party applications and platforms, unlocking entirely new insights and knowledge to bring digital transformation to another level.”

Read, take a tour, try the free tier, deep dive, and learn more about SAP Datasphere here:

https://news.sap.com/2023/03/sap-datasphere-power-of-business-data/

Isn’t it now the time to accelerate your digital transformation efforts and get your business moving at the speed of data?

——————

This article has also been published at https://www.linkedin.com/pulse/sap-datasphere-powers-business-speed-data-kirk-borne-ph-d-/

Follow me on Twitter at @KirkDBorne

My top learning and pondering moments at Splunk .conf22

Leave a reply

I recently attended the Splunk .conf22 conference. While the event was live in-person in Las Vegas, I attended virtually from my home office. Consequently I missed the incredible in-person experience of the brilliant speakers on the main stage, the technodazzle of 100’s of exhibitors’ offerings in the exhibit arena, and the smooth hip hop sounds from the special guest entertainer — guess who?

What I missed in-person was more than compensated for by the incredible online presentations by Splunk leaders, developers, and customers. If you have ever attended a major expo at one of the major Vegas hotels, you know that there is a lot of walking between different sessions — literally, miles of walking per day. That’s good for you, but it often means that you don’t attend all of the sessions that you would like because of the requisite rushing from venue to venue. None of that was necessary on the Splunk .conf22 virtual conference platform. I was able to see a lot, learn a lot, be impressed a lot, and ponder a lot about all of the wonderful features, functionalities, and future plans for the Splunk platform.

One of the first major attractions for me to attend this event is found in the primary descriptor of the Splunk Platform — it is appropriately called the Splunk Observability Cloud, which includes an impressive suite of Observability and Monitoring products and services. I have written and spoken frequently and passionately about Observability in the past couple of years. For example, I wrote this in 2021:

“Observability emerged as one of the hottest and (for me) most exciting developments of the year. Do not confuse observability with monitoring (specifically, with IT monitoring). The key difference is this: monitoring is what you do, and observability is why you do it. Observability is a business strategy: what you monitor, why you monitor it, what you intend to learn from it, how it will be used, and how it will contribute to business objectives and mission success. But the power, value, and imperative of observability does not stop there. Observability meets AI – it is part of the complete AIOps package: ‘keeping an eye on the AI.’ Observability delivers actionable insights, context-enriched data sets, early warning alert generation, root cause visibility, active performance monitoring, predictive and prescriptive incident management, real-time operational deviation detection (6-Sigma never had it so good!), tight coupling of cyber-physical systems, digital twinning of almost anything in the enterprise, and more. And the goodness doesn’t stop there.”
Continue reading my thoughts on Observability at http://rocketdatascience.org/?p=1589

The dominant references everywhere to Observability was just the start of awesome brain food offered at Splunk’s .conf22 event. Here is a list of my top moments, learnings, and musings from this year’s Splunk .conf:

Observability for Unified Security with AI (Artificial Intelligence) and Machine Learning on the Splunk platform empowers enterprises to operationalize data for use-case-specific functionality across shared datasets. (Reference)
The latest updates to the Splunk platform address the complexities of multi-cloud and hybrid environments, enabling cybersecurity and network big data functions (e.g., log analytics and anomaly detection) across distributed data sources and diverse enterprise IT infrastructure resources. (Reference)
Splunk Enterprise 9.0 is here, now! Explore and test-drive it (with a free trial) here.
The new Splunk Enterprise 9.0 release enables DevSecOps users to gain more insights from Observability data with Federated Search, with the ability to correlate ops with security alerts, and with Edge Management, all in one platform. (Reference)
Security information and event management (SIEM) on the Splunk platform is enhanced with end-to-end visibility and platform extensibility, with machine learning and automation (AIOps), with risk-based alerting, and with Federated Search (i.e., Observability on-demand). (Reference)
Customer success story: As a customer-obsessed bank with ultra-rapid growth, Nubank turned to Splunk to optimize data flows, analytics applications, customer support functions, and insights-obsessed IT monitoring. (Reference)
The key characteristics of the Splunk Observability Cloud are Resilience, Security, Scalability, and EXTENSIBILITY. The latter specifically refers to the ease in which developers can extend Splunk’s capabilities to other apps, applying their AIOps and DevSecOps best practices and principles! Developers can start here.
The Splunk Observability Cloud has many functions for data-intensive IT, Security, and Network operations, including Anomaly Detection Service, Federated Search, Synthetic Monitoring, Incident Intelligence, and much more. Synthetic monitoring is essentially digital twinning of your network and IT environment, providing insights through simulated risks, attacks, and anomalies via predictive and prescriptive modeling. [Reference]
Splunk Observability Cloud’s Federated Search capability activates search and analytics regardless of where your data lives — on-site, in the cloud, or from a third party. (Reference)
The new release of the Splunk Data Manager provides a simple, modern, automated experience of data ingest for Splunk Cloud admins, which reduces the time it takes to configure data collection (from hours/days to minutes). (Reference)
Splunk works on data, data, data, but the focus is always on customer, customer, customer — because delivering best outcomes for customers is job #1. Explore Splunk’s amazing Partner ecosystem (Partnerverse) and the impressive catalog of partners’ solutions here.
Splunk .conf22 Invites Organizations to Unlock Innovation With Data.

In summary, here is my list of key words and topics that illustrate the diverse capabilities and value-packed features of the Splunk Observability Cloud Platform that I learned about at the .conf22 event:

– Anomaly Detection Assistant
– Risk-based Alerting (powered by AI and Machine Learning scoring algorithms)
– Federated Search (Observability on-demand)
– End-to-End Visibility
– Platform Extensibility
– Massive(!) Scalability of the Splunk Observability Cloud (to billions of transactions per day)
– Insights-obsessed Monitoring (“We don’t need more information. We need more insights.”)
– APIs in Action (to Turn Data into Doing™)
– Splunk Incident Intelligence
– Synthetic Monitoring (Digital Twin of Network/IT infrastructure)
– Splunk Data Manager
– The Splunk Partner Universe (Partnerverse)

My closing thought — Cybersecurity is basically Data Analytics: detection, prediction, prescription, and optimizing for unpredictability. This is what Splunk lives for!

Follow me on LinkedIn here and on Twitter at @KirkDBorne.

Disclaimer: I was compensated as an independent freelance media influencer for my participation at the conference and for this article. The opinions expressed here are entirely my own and do not represent those of Splunk or of any Splunk partners. Any misrepresentations of the products and services mentioned in my statements are entirely my own responsibility. Nothing here should be construed as an offer to sell or as financial advice of any kind. My comments are entirely of a technical nature, focused on the technical capabilities of the items mentioned in the article.

Splunk Cloud Platform Dashboard.
Source: https://www.splunk.com/en_us/products/splunk-cloud-platform.html

Data Insights for Everyone — The Semantic Layer to the Rescue

Leave a reply

What is a semantic layer? That’s a good question, but let’s first explain semantics. The way that I explained it to my data science students years ago was like this. In the early days of web search engines, those engines were primarily keyword search engines. If you knew the right keywords to search and if the content providers also used the same keywords on their website, then you could type the words into your favorite search engine and find the content you needed. So, I asked my students what results they would expect from such a search engine if I typed the following words into the search box: “How many cows are there in Texas?” My students were smart. They realized that the search results would probably not provide an answer to my question, but the results would simply list websites that included my words on the page or in the metadata tags: “Texas”, “Cows”, “How”, etc. Then, I explained to my students that a semantic-enabled search engine (with a semantic meta-layer, including ontologies and similar semantic tools) would be able to interpret my question’s meaning and then map that meaning to websites that can answer the question.

This was a good opening for my students to the wonderful world of semantics. I brought them deeper into the world by pointing out how much more effective and efficient the data professionals’ life would be if our data repositories had a similar semantic meta-layer. We would be able to go far beyond searching for correctly spelled column headings in databases or specific keywords in data documentation, to find the data we needed (assuming we even knew the correct labels, metatags, and keywords used by the dataset creators). We could search for data with common business terminology, regardless of the specific choice or spelling of the data descriptors in the dataset. Even more than that, we could easily start discovering and integrating, on-the-fly, data from totally different datasets that used different descriptors. For example, if I am searching for customer sales numbers, different datasets may label that “sales”, or “revenue”, or “customer_sales”, or “Cust_sales”, or any number of other such unique identifiers. What a nightmare that would be! But what a dream the semantic layer becomes!

When I was teaching those students so many years ago, the semantic layer itself was just a dream. Now it is a reality. We can now achieve the benefits, efficiencies, and data superhero powers that we previously could only imagine. But wait! There’s more.

Perhaps the greatest achievement of the semantic layer is to provide different data professionals with easy access to the data needed for their specific roles and tasks. The semantic layer is the representation of data that helps different business end-users discover and access the right data efficiently, effectively, and effortlessly using common business terms. The data scientists need to find the right data as inputs for their models — they also need a place to write-back the outputs of their models to the data repository for other users to access. The BI (business intelligence) analysts need to find the right data for their visualization packages, business questions, and decision support tools — they also need the outputs from the data scientists’ models, such as forecasts, alerts, classifications, and more. The semantic layer achieves this by mapping heterogeneously labeled data into familiar business terms, providing a unified, consolidated view of data across the enterprise.

The semantic layer delivers data insights discovery and usability across the whole enterprise, with each business user empowered to use the terminology and tools that are specific to their role. How data are stored, labeled, and meta-tagged in the data cloud is no longer a bottleneck to discovery and access. The decision-makers and data science modelers can fluidly share inputs and outputs with one another, to inform their role-specific tasks and improve their effectiveness. The semantic layer takes the user-specific results out of being a “one-off” solution on that user’s laptop to becoming an enterprise analytics accelerant, enabling business answer discovery at the speed of business questions.

Insights discovery for everyone is achieved. The semantic layer becomes the arbiter (multi-lingual data translator) for insights discovery between and among all business users of data, within the tools that they are already using. The data science team may be focused on feature importance metrics, feature engineering, predictive modeling, model explainability, and model monitoring. The BI team may be focused on KPIs, forecasts, trends, and decision-support insights. The data science team needs to know and to use that data which the BI team considers to be most important. The BI team needs to know and to use which trends, patterns, segments, and anomalies are being found in those data by the data science team. Sharing and integrating such important data streams has never been such a dream.

The semantic layer bridges the gaps between the data cloud, the decision-makers, and the data science modelers. The key results from the data science modelers can be written back to the semantic layer, to be sent directly to consumers of those results in the executive suite and on the BI team. Data scientists can focus on their tools; the BI users and executives can focus on their tools; and the data engineers can focus on their tools. The enterprise data science, analytics, and BI functions have never been so enterprisey. (Is “enterprisey” a word? I don’t know, but I’m sure you get my semantic meaning.)

That’s empowering. That’s data democratization. That’s insights democratization. That’s data fluency/literacy-building across the enterprise. That’s enterprise-wide agile curiosity, question-asking, hypothesizing, testing/experimenting, and continuous learning. That’s data insights for everyone.

Are you ready to learn more how you can bring these advantages to your organization? Be sure to watch the AtScale webinar “How to Bridge Data Science and Business Intelligence” where I join a panel in a multi-industry discussion on how a semantic layer can help organizations make smarter data-driven decisions at scale. There will be several speakers, including me. I will be speaking about “Model Monitoring in the Enterprise — Filling the Gaps”, specifically focused on “Filling the Communication Gaps Between BI and Data Science Teams With a Semantic Data Layer.”

Are You Content with Your Organization’s Content Strategy?

Leave a reply

In this post, we will examine ways that your organization can separate useful content into separate categories that amplify your own staff’s performance. Before we start, I have a few questions for you.

What attributes of your organization’s strategies can you attribute to successful outcomes? How long do you deliberate before taking specific deliberate actions? Do you converse with your employees about decisions that might be the converse of what they would expect? Is a process modification that saves a minute in someone’s workday considered too minute for consideration? Do you present your employees with a present for their innovative ideas? Do you perfect your plans in anticipation of perfect outcomes? Or do you project foregone conclusions on a project before it is completed?

If you have good answers to these questions, that is awesome! I would not contest any of your answers since this is not a contest. In fact, this is actually something quite different. Before you combine all these questions in a heap and thresh them in a combine, and before you buffet me with a buffet of skeptical remarks, stick with me and let me explain. Do not close the door on me when I am so close to giving you an explanation.

What you have just experienced is a plethora of heteronyms. Heteronyms are words that are spelled identically but have different meanings when pronounced differently. If you include the title of this blog, you were just presented with 13 examples of heteronyms in the preceding paragraphs. Can you find them all?

Seriously now, what do these word games have to do with content strategy? I would say that they have a great deal to do with it. Specifically, in the modern era of massive data collections and exploding content repositories, we can no longer simply rely on keyword searches to be sufficient. In the case of a heteronym, a keyword search would return both uses of the word, even though their meanings are quite different. In “information retrieval” language, we would say that we have high RECALL, but low PRECISION. In other words, we can find most occurrences of the word (recall), but not all the results correspond to the meaning of our search (precision). That is no longer good enough when the volume is so high.

The key to success is to start enhancing and augmenting content management systems (CMS) with additional features: semantic content and context. This is accomplished through tags, annotations, and metadata (TAM). TAM management, like content management, begins with business strategy.

Strategic content management focusses on business outcomes, business process improvement, efficiency (precision – i.e., “did I find only the content that I need without a lot of noise?”), and effectiveness (recall – i.e., “did I find all the content that I need?”). Just because people can request a needle in the haystack, it is not a good thing to deliver the whole haystack that contains that needle. Clearly, such a content delivery system is not good for business productivity. So, there must be a strategy regarding who, what, when, where, why, and how is the organization’s content to be indexed, stored, accessed, delivered, used, and documented. The content strategy should emulate a digital library strategy. Labeling, indexing, ease of discovery, and ease of access are essential if end-users are to find and benefit from the collection.

My favorite approach to TAM creation and to modern data management in general is AI and machine learning (ML). That is, use AI and machine learning techniques on digital content (databases, documents, images, videos, press releases, forms, web content, social network posts, etc.) to infer topics, trends, sentiment, context, content, named entity identification, numerical content extraction (including the units on those numbers), and negations. Do not forget the negations. A document that states “this form should not be used for XYZ” is exactly the opposite of a document that states “this form must be used for XYZ”. Similarly, a social media post that states “Yes. I believe that this product is good” is quite different from a post that states “Yeah, sure. I believe that this product is good. LOL.”

Contextual TAM enhances a CMS with knowledge-driven search and retrieval, not just keyword-driven. Contextual TAM includes semantic TAM, taxonomic indexing, and even usage-based tags (digital breadcrumbs of the users of specific pieces of content, including the key words and phrases that people used to describe the content in their own reports). Adding these to your organization’s content makes the CMS semantically searchable and usable. That’s far more laser-focused (high-precision) than keyword search.

One type of implementation of a content strategy that is specific to data collections are data catalogs. Data catalogs are very useful and important. They become even more useful and valuable if they include granular search capabilities. For example, the end-user may only need the piece of the dataset that has the content that their task requires, versus being delivered the full dataset. Tagging and annotating those subcomponents and subsets (i.e., granules) of the data collection for fast search, access, and retrieval is also important for efficient orchestration and delivery of the data that fuels AI, automation, and machine learning operations.

One way to describe this is “smart content” for intelligent digital business operations. Smart content includes labeled (tagged, annotated) metadata (TAM). These labels include content, context, uses, sources, and characterizations (patterns, features) associated with the whole content and with individual content granules. Labels can be learned through machine learning, or applied by human experts, or proposed by non-experts when those labels represent cognitive human-discovered patterns and features in the data. Labels can be learned and applied in existing CMS, in massive streaming data, and in sensor data (collected in devices at the “edge”).

Some specific tools and techniques that can be applied to CMS to generate smart content include these:

Natural language understanding and natural language generation
Topic modeling (including topic drift and topic emergence detection)
Sentiment detection (including emotion detection)
AI-generated and ML-inferred content and context
Entity identification and extraction
Numerical quantity extraction
Automated structured (searchable) database generation from textual (unstructured) document collections (for example: Textual ETL).

Consequently, smart content thrives at the convergence of AI and content. Labels are curated and stored with the content, thus enabling curation, cataloguing (indexing), search, delivery, orchestration, and use of content and data in AI applications, including knowledge-driven decision-making and autonomous operations. Techniques that both enable (contribute to) and benefit from smart content are content discovery, machine learning, knowledge graphs, semantic linked data, semantic data integration, knowledge discovery, and knowledge management. Smart content thus meets the needs for digital business operations and autonomous (AI and intelligent automation) activities, which must devour streams of content and data – not just any content, but smart content – the right (semantically identified) content delivered at the right time in the right context.

The four tactical steps in a smart content strategy include:

Characterize and contextualize the patterns, events, and entities in the content collection with semantic (contextual) tags, annotation, and metadata (TAM).
Collect, curate, and catalog (i.e., index) each TAM component to make it searchable, accessible, and reusable.
Deliver the right content at the right time in the right context to the decision agent.
Decide and act on the delivered insights and knowledge.

Remember, do not be content with your current content management strategy. But discover and deliver the perfect smart content that perfects your digital business outcomes. Smart content strategy can save end-users countless minutes in a typical workday, and that type of business process improvement certainly is not too minute for consideration.

Top 10 Data Innovation Trends During 2020

Leave a reply

The year 2020 was remarkably different in many ways from previous years. In at least one way, it was not different, and that was in the continued development of innovations that are inspired by data. This steady march of data-driven innovation has been a consistent characteristic of each year for at least the past decade. These data-fueled innovations come in the form of new algorithms, new technologies, new applications, new concepts, and even some “old things made new again”.

I provide below my perspective on what was interesting, innovative, and influential in my watch list of the Top 10 data innovation trends during 2020.

1) Automated Narrative Text Generation tools became incredibly good in 2020, being able to create scary good “deep fake” articles. The GPT-3 (Generative Pretrained Transformer, 3rd generation text autocomplete) algorithm made headlines since it demonstrated that it can start with a very thin amount of input (a short topic description, or a question), from which it can then generate several paragraphs of narrative that are very hard (perhaps impossible) to distinguish from human-generated text. However, it is far from perfect, since it certainly does not have reasoning skills, and it also loses its “train of thought” after several paragraphs (e.g., by making contradictory statements at different places in the narrative, even though the statements are nicely formed sentences).

2) MLOps became the expected norm in machine learning and data science projects. MLOps takes the modeling, algorithms, and data wrangling out of the experimental “one off” phase and moves the best models into deployment and sustained operational phase. MLOps “done right” addresses sustainable model operations, explainability, trust, versioning, reproducibility, training updates, and governance (i.e., the monitoring of very important operational ML characteristics: data drift, concept drift, and model security).

3) Concept drift by COVID – as mentioned above, concept drift is being addressed in machine learning and data science projects by MLOps, but concept drift is so much bigger than MLOps. Specifically, it feels to many of us like a decade of business transformation was compressed into the one year 2020. How and why businesses make decisions, customers make decisions, and anybody else makes decisions became conceptually and contextually different in 2020. Customer purchase patterns, supply chain, inventory, and logistics represent just a few domains where we saw new and emergent behaviors, responses, and outcomes represented in our data and in our predictive models. The old models were not able to predict very well based on the previous year’s data since the previous year seemed like 100 years ago in “data years”. Another example was in new data-driven cybersecurity practices introduced by the COVID pandemic, including behavior biometrics (or biometric analytics), which were driven strongly by the global “work from home” transition, where many insecurities in networks, data-sharing, and collaboration / communication tools were exposed. Behavior biometrics may possibly become the essential cyber requirement for unique user identification, finally putting weak passwords out of commission. Data and network access controls have similar user-based permissions when working from home as when working behind the firewall at your place of business, but the security checks and usage tracking can be more verifiable and certified with biometric analytics. This is critical in our massively data-sharing world and enterprises.

4) AIOps increasingly became a focus in AI strategy conversations. While it is similar to MLOps, AIOps is less focused on the ML algorithms and more focused on automation and AI applications in the enterprise IT environment – i.e., focused on operationalizing AI, including data orchestration, the AI platform, AI outcomes monitoring, and cybersecurity requirements. AIOps appears in discussions related to ITIM (IT infrastructure monitoring), SIEM (security information and event management), APM (application performance monitoring), UEBA (user and entity behavior analytics), DevSecOps, Anomaly Detection, Rout Cause Analysis, Alert Generation, and related enterprise IT applications.

5) The emergence of Edge-to-Cloud architectures clearly began pushing Industry 4.0 forward (with some folks now starting to envision what Industry 5.0 will look like). The Edge-to-Cloud architectures are responding to the growth of IoT sensors and devices everywhere, whose deployments are boosted by 5G capabilities that are now helping to significantly reduce data-to-action latency. In some cases, the analytics and intelligence must be computed and acted upon at the edge (Edge Computing, at the point of data collection), as in autonomous vehicles. In other cases, the analytics and insights may have more significant computation requirements and less strict latency requirements, thus allowing the data to be moved to larger computational resources in the cloud. The almost forgotten “orphan” in these architectures, Fog Computing (living between edge and cloud), is now moving to a more significant status in data and analytics architecture design.

6) Federated Machine Learning (FML) is another “orphan” concept (formerly called Distributed Data Mining a decade ago) that found new life in modeling requirements, algorithms, and applications in 2020. To some extent, the pandemic contributed to this because FML enforces data privacy by essentially removing data-sharing as a requirement for model-building across multiple datasets, multiple organizations, and multiple applications. FML model training is done incrementally and locally on the local dataset, with the meta-parameters of the local models then being shared with a centralized model-inference engine (which does not see any of the private data). The centralized ML engine then builds a global model, which is communicated back to the local nodes. Multiple iterations in parameter-updating and hyperparameter-tuning can occur between local nodes and the central inference engine, until satisfactory model performance is achieved. All through these training stages, data privacy is preserved, while allowing for the generation of globally useful, distributable, and accurate models.

7) Deep learning (DL) may not be “the one algorithm to dominate all others” after all. There was some research published earlier in 2020 that found that traditional, less complex algorithms can be nearly as good or better than deep learning on some tasks. This could be yet another demonstration of the “no free lunch theorem”, which basically states that there is no single universal algorithm that is the best for all problems. Consequently, the results of the new DL research may not be so surprising, but they certainly prompt us with necessary reminders that sometimes simple is better than complexity, and that the old saying is often still true: “perfect is the enemy of good enough.”

8) RPA (Robotic Process Automation) and intelligent automation were not new in 2020, but the surge in their use and in the number of providers was remarkable. While RPA is more rule-based (informed by business process mining, to automate work tasks that have very little variation), intelligent automation is more data-driven, adaptable, and self-learning in real-time. RPA mimics human actions, by repetition of routine tasks based on a set of rules. Intelligent automation simulates human intelligence, which responds and adapts to emergent patterns in new data, and which is capable of learning to automate non-routine tasks. Keep an eye on the intelligent automation space for new and exciting developments to come in the near future around hyperautomation and enterprise intelligence, such as the emergence of learning business systems that learn and adapt their processes based on signals in enterprise data across numerous business functions: finance, marketing, HR, customer service, production, operations, sales, and management.

9) The Rise of Data Literacy initiatives, imperatives, instructional programs, and institutional awareness in 2020 was one of the two most exciting things that I witnessed during the year. (The other one of the two is next on my list.) I have said for nearly 20 years that data literacy must become a key component of education at all levels and an aptitude of nearly all employees in all organizations. The world is data, revolves around data, produces and consumes massive quantities of data, and drives innovative emerging technologies that are inspired by, informed by, and fueled by data: augmented reality (AR), virtual reality (VR), autonomous vehicles, computer vision, digital twins, drones, robotics, AI, IoT, hyperautomation, virtual assistants, conversational AI, chatbots, natural language understanding and generation (NLU, NLG), automatic language translation, 4D-printing, cyber resilience, and more. Data literacy is essential for future of work, future innovation, work from home, and everyone that touches digital information. Studies have shown that organizations that are not adopting data literacy programs are not only falling behind, but they may stay behind, their competition. Get on board with data literacy! Now!

10) Observability emerged as one of the hottest and (for me) most exciting developments of the year. Do not confuse observability with monitoring (specifically, with IT monitoring). The key difference is this: monitoring is what you do, and observability is why you do it. Observability is a business strategy: what you monitor, why you monitor it, what you intend to learn from it, how it will be used, and how it will contribute to business objectives and mission success. But the power, value, and imperative of observability does not stop there. Observability meets AI – it is part of the complete AIOps package: “keeping an eye on the AI.” Observability delivers actionable insights, context-enriched data sets, early warning alert generation, root cause visibility, active performance monitoring, predictive and prescriptive incident management, real-time operational deviation detection (6-Sigma never had it so good!), tight coupling of cyber-physical systems, digital twinning of almost anything in the enterprise, and more. And the goodness doesn’t stop there. The emergence of standards, like OpenTelemetry, can unify all aspects of your enterprise observability strategy: process instrumentation, sensing, metrics specification, context generation, data collection, data export, and data analysis of business process performance and behavior monitoring in the cloud. This plethora of benefits is a real game-changer for open-source self-service intelligent data-driven business process monitoring (BPM) and application performance monitoring (APM), feedback, and improvement. As mentioned above, monitoring is “what you are doing”, and observability is “why you are doing it.” If your organization is not having “the talk” about observability, now is the time to start – to understand why and how to produce business value through observability into the multitude of data-rich digital business applications and processes all across the modern enterprise. Don’t drown in those deep seas of data. Instead, develop an Observability Strategy to help your organization ride the waves of data, to help your business innovation and transformation practices move at the speed of data.

In summary, my top 10 data innovation trends from 2020 are:

GPT-3
MLOps
Concept Drift by COVID
AIOps
Edge-to-Cloud and Fog Computing
Federated Machine Learning
Deep Learning meets the “no free lunch theorem”
RPA and Intelligent Automation
Rise of Data Literacy
Observability

If I were to choose what was hottest trend in 2020, it would not be a single item in this top 10 list. The hottest trend would be a hybrid (convergence) of several of these items. That hybrid would include: Observability, coupled with Edge and the ever-rising ubiquitous IoT (sensors on everything), boosted by 5G and cloud technologies, fueling ever-improving ML and DL algorithms, all of which are enabling “just-in-time” intelligence and intelligent automation (for data-driven decisions and action, at the point of data collection), deployed with a data-literate workforce, in a sustainable and trusted MLOps environment, where algorithms, data, and applications work harmoniously and are governed and secured by AIOps.

If we learned anything from the year 2020, it should be that trendy technologies do not comprise a menu of digital transformation solutions to choose from, but there really is only one combined solution, which is the hybrid convergence of data innovation technologies. From my perspective, that was the single most significant data innovation trend of the year 2020.

Analytics Insights and Careers at the Speed of Data

Leave a reply

*How to make smarter data-driven decisions at scale*: http://bit.ly/3rS3ZQW

The determination of winners and losers in the data analytics space is a much more dynamic proposition than it ever has been. One CIO said it this way, “If CIOs invested in machine learning three years ago, they would have wasted their money. But if they wait another three years, they will never catch up.” Well, that statement was made five years ago! A lot has changed in those five years, and so has the data landscape.

The dynamic changes of the business requirements and value propositions around data analytics have been increasingly intense in depth (in the number of applications in each business unit) and in breadth (in the enterprise-wide scope of applications in all business units in all sectors). But more significant has been the acceleration in the number of dynamic, real-time data sources and corresponding dynamic, real-time analytics applications.

We no longer should worry about “managing data at the speed of business,” but worry more about “managing business at the speed of data.”

One of the primary drivers for the phenomenal growth in dynamic real-time data analytics today and in the coming decade is the Internet of Things (IoT) and its sibling the Industrial IoT (IIoT). With its vast assortment of sensors and streams of data that yield digital insights in situ in almost any situation, the IoT / IIoT market has a projected market valuation of $1.5 trillion by 2030. The accompanying technology Edge Computing, through which those streaming digital insights are extracted and then served to end-users, has a projected valuation of $800 billion by 2028.

With dynamic real-time insights, this “Internet of Everything” can then become the “Internet of Intelligent Things”, or as I like to say, “The Internet used to be a thing. Now things are the Internet.” The vast scope of this digital transformation in dynamic business insights discovery from entities, events, and behaviors is on a scale that is almost incomprehensible. Traditional business analytics approaches (on laptops, in the cloud, or with static datasets) will not keep up with this growing tidal wave of dynamic data.

Another dimension to this story, of course, is the Future of Work discussion, including creation of new job titles and roles, and the demise of older job titles and roles. One group has declared, “IoT companies will dominate the 2020s: Prepare your resume!” This article quotes an older market projection (from 2019), which estimated “the global industrial IoT market could reach $14.2 trillion by 2030.”

In dynamic data-driven applications, automation of the essential processes (in this case, data triage, insights discovery, and analytics delivery) can give a power boost to ride that tidal wave of fast-moving data streams. One can prepare for and improve skill readiness for these new business and career opportunities in several ways:

Focus on the automation of business processes: e.g., artificial intelligence, robotics, robotic process automation, intelligent process automation, chatbots.
Focus on the technologies and engineering components: e.g., sensors, monitoring, cloud-to-edge, microservices, serverless, insights-as-a-service APIs, IFTTT (IF-This-Then-That) architectures.
Focus on the data science: e.g., machine learning, statistics, computer vision, natural language understanding, coding, forecasting, predictive analytics, prescriptive analytics, anomaly detection, emergent behavior discovery, model explainability, trust, ethics, model monitoring (for data drift and concept drift) in dynamic environments (MLOps, ModelOps, AIOps).
Focus on specific data types: e.g., time series, video, audio, images, streaming text (such as social media or online chat channels), network logs, supply chain tracking (e.g., RFID), inventory monitoring (SKU / UPC tracking).
Focus on the strategies that aim these tools, talents, and technologies at reaching business mission and goals: e.g., data strategy, analytics strategy, observability strategy (i.e., why and where are we deploying the data-streaming sensors, and what outcomes should they achieve?).

Insights discovery from ubiquitous data collection (via the tens of billions of connected devices that will be measuring, monitoring, and tracking nearly everything internally in our business environment and contextually in the broader market and global community) is ultimately about value creation and business outcomes. Embedding real-time dynamic analytics at the edge, at the point of data collection, or at the moment of need will dynamically (and positively) change the slope of your business or career trajectory. Dynamic sense-making, insights discovery, next-best-action response, and value creation is essential when data is being acquired at an enormous rate. Only then can one hope to realize the promised trillion-dollar market value of the Internet of Everything.

For more advice, check out this upcoming webinar panel discussion, sponsored by AtScale, with data and analytics leaders from Wayfair, Cardinal Health, bol.com, and Slickdeals: “How to make smarter data-driven decisions at scale.” Each panelist will share an overview of their data & analytics journey, and how they are building a self-service, data-driven culture that scales. Join us on Wednesday, March 31, 2021 (11:00am PT | 2:00pm ET). Save your spot here: http://bit.ly/3rS3ZQW. I hope that you find this event useful. And I hope to see you there!

Please follow me on LinkedIn and follow me on Twitter at @KirkDBorne.

How We Teach The Leaders of Tomorrow To Be Curious, Ask Questions and Not Be Afraid To Fail Fast To Learn Fast

Leave a reply

I recently enjoyed recording a podcast with Joe DosSantos (Chief Data Officer at Qlik). This was one in a series of #DataBrilliant podcasts by Qlik, which you can also access here (Apple Podcasts) and here (Spotify). I summarize below some of the topics that Joe and I discussed in the podcast. Be sure to listen to the full recording of our lively conversation, which covered Data Literacy, Data Strategy, Data Leadership, and more.

The Age of Hype Cycles

The data age has been marked by numerous “hype cycles.” First, we heard how Big Data, Data Science, Machine Learning (ML) and Advanced Analytics would have the honor to be the technologies that would cure cancer, end world hunger and solve the world’s biggest challenges. Then came third-generation Artificial Intelligence (AI), Blockchain and soon Quantum Computing, with each one seeking that honor.

From all this hope and hype, one constant has always been there: a focus on value creation from data. As a scientist, I have always recommended a scientific approach: State your problem first, be curious (ask questions), collect facts to address those questions (acquire data), investigate, analyze, ask more questions, include a sensible serving of skepticism, and (above all else) aim to fail fast in order to learn fast. As I discussed with Joe DosSantos when I spoke with him for the latest episode of Data Brilliant, you don’t need to be a data scientist to follow these principles. These apply to everyone, in all organizations and walks of life, in every sector.

One characteristic of science that is especially true in data science and implicit in ML is the concept of continuous learning and refining our understanding. We build models to test our understanding, but these models are not “one and done.” They are part of a cycle of learning. In ML, the learning cycle is sometimes called backpropagation, where the errors (inaccurate predictions) of our models are fed back into adjusting the model’s input parameters in a way that aims to improve the output accuracy. A more colloquial expression for this is: good judgment comes from experience, and experience comes from bad judgment.

Data Literacy For All

I know that for some, the term data and some of the other terminology I’ve mentioned already can be scary. But they shouldn’t be. We are all surrounded by – and creating – masses of data every single day. As Joe and I talked about, one of the first hurdles in data literacy is getting people to recognize that everything is data. What you see with your eyes? That’s data. What you hear with your ears? Data. The words that come out of your mouth that other people hear? That’s all data. Images, text, documents, audio, video and all the apps on your phone, all the things you search for on the internet? Yet again, that’s data.

Every single day, everyone around the world is using data and the principles I mention above, many without realizing it. So, now we need to bring this value to our businesses.

How To Build A Successful Enterprise Data Strategy

In my chat with Joe, we talked about many data concepts in the context of enterprise digital transformation. As always, but especially during the race toward digital transformation that has been accelerated by the 2020 pandemic, a successful enterprise data strategy that leads to business value creation can benefit from first addressing these six key questions:

(1) What mission objective and outcomes are you aiming to achieve?

(2) What is the business problem, expressed in data terminology? Specifically, is it a detection problem (fraud or emergent behavior), a discovery problem (new customers or new opportunities), a prediction problem (what will happen) or an optimization problem (how to improve outcomes)?

(3) Do you have the talent (key people representing diverse perspectives), tools (data technologies) and techniques (AI and ML knowledge) to make it happen?

(4) What data do you have to fuel the algorithms, the training and the modeling processes?

(5) Is your organizational culture ready for this (for data-informed decisions; an experimentation mindset; continuous learning; fail fast to learn fast; with principled AI and data governance)?

(6) What operational technology environment do you have to deploy the implementation (cloud or on-premise platform)?

Data Leadership

As Joe and I discussed, your ultimate business goal is to build a data-fueled enterprise that delivers business value from data. Therefore, ask questions, be purposeful (goal-oriented and mission-focused), be reasonable in your expectations and remain reasonably skeptical – because as famous statistician, George Box, once said “all models are wrong, but some are useful.”

Now, listen to the full podcast here.

Data Science Blogs-R-Us

Leave a reply

[UPDATED December 31, 2022]

I have written articles in many places. I will be collecting links to those sources here. The list is not complete and will be constantly evolving. There are some older blogs that I will be including in the list below as I remember them and find them. Also included are some interviews in which I provided detailed answers to a variety of questions.

In 2019, I was listed as the #1 Top Data Science Blogger to Follow on Twitter.

And then there’s this — not a blog, but a link to my 2013 TedX talk: “Big Data, Small World.” (Many more videos of my talks are available online. That list will be compiled in another place soon.)

The Power of Graph Databases, Linked Data, and Graph Algorithms

Leave a reply

In 2019, I was asked to write the Foreword for the book “Graph Algorithms: Practical Examples in Apache Spark and Neo4j“, by Mark Needham and Amy E. Hodler. I wrote an extensive piece on the power of graph databases, linked data, graph algorithms, and various significant graph analytics applications. In their wisdom, the editors of the book decided that I wrote “too much”. So, they correctly shortened my contribution by about half in the final published version of my Foreword for the book.

The book is awesome, an absolute must-have reference volume, and it is free (for now, downloadable from Neo4j).

Now, for the first time, the full unabridged (and unedited) version of my initial contribution as the Foreword for the book is published here. (You should still get the book because it is a fantastic 250-page masterpiece for data scientists!) Any omissions, errors, or viewpoints in the piece below are entirely my own. I publish this in its original form in order to capture the essence of my point of view on the power of graph analytics.

As you read this, just remember the most important message: the natural data structure of the world is not rows and columns, but a graph. And this: perhaps the most powerful node in a graph model for real-world use cases might be “context”. How does one express “context” in a data model? Ahh, that’s the topic for another article. But this might help you get there: https://twitter.com/KirkDBorne/status/1232094501537275904

“All the World’s a Graph”

What do marketing attribution analysis, anti-money laundering, customer journey modeling, safety incident causal factor analysis, literature-based discovery, fraud network analysis, Internet search, the map app on your mobile phone, the spread of infectious diseases, and the theatrical performance of a William Shakespeare play all have in common? No, it is not something as trivial as the fact that all these phrases contain nouns and action verbs! What they have in common is that all these phrases proved that Shakespeare was right when he declared, “All the world’s a graph!”

Okay, the Bard of Avon did not actually say “Graph” in that sentence, but he did say “Stage” in the sentence. However, notice that all the examples mentioned above involve entities and the relationships between them, including both direct and indirect (transitive) relationships — a graph! Entities are the nodes in the graph — these can be people, events, objects, concepts, or places. The relationships between the nodes are the edges in the graph. Therefore, isn’t the very essence of a Shakespearean play the live action portrayal of entities (the nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could have said “Graph” in his famous declaration.

What makes graph algorithms and graph databases so interesting and powerful isn’t the simple relationship between two entities: A is related to B. After all, the standard relational model of databases instantiated these types of relationships in its very foundation decades ago: the ERD (Entity-Relationship Diagram). What makes graphs so remarkably different and important are directional relationships and transitive relationships. In directional relationships, A may cause B, but not the opposite. In transitive relationships, A can be directly related to B and B can be directly related to C, while A is not directly related to C, so that consequently A is transitively related to C.

Because of transitivity relationships, particularly when they are numerous and diverse with many possible relationship (network) patterns and many possible degrees of separation between the entities, the graph model enables discovery of relationships between two entities that otherwise may seem wholly disconnected, unrelated, and difficult (if not impossible) to discover in a relational database. Hence, the graph model can be applied productively and effectively in numerous network analysis use cases.

Consider this marketing attribution use case: person A sees the marketing campaign, person A talks about it on their social media account, person B is connected to person A and sees the comment, and subsequently person B buys the product. From the marketing campaign manager’s perspective, the standard relational model would fail to identify the attribution, since B did not see the campaign and A did not respond to the campaign. The campaign looks like a failure. But it is not a failure — its actual success (and positive ROI) is discovered by the graph analytics algorithm through the transitive relationship between the marketing campaign and the final customer purchase, through an intermediary (entity-in-the-middle)!

Next, consider the anti-money laundering (AML) use case: person A and person C are under suspicion for illicit trafficking. Any interaction between the two (e.g., a financial transaction in a financial database) would be flagged by the authorities, and the interactions would come under great scrutiny. However, if A and C never transact any business together, but instead conduct their financial dealings through a safe, respected, unflagged financial authority B, then who would ever notice the transaction? Well, the graph analytics algorithm would notice! The graph engine would discover that there was a transitive relationship between A and C through the intermediary B (the entity-in-the-middle).

Similar descriptions of the power of graph can be given for the other use cases mentioned in the opening paragraph above, all of which are examples of network analysis through graph algorithms. Each of those cases deeply involves entities (people, objects, events, actions, concepts, and places) and their relationships (touch points, both causal and simple associations). Because of their great interest and power, we highlight two more of those use cases: Internet search and Literature-Based Discovery (LBD).

In Internet search a hyperlinked network (graph-based) algorithm is used by a major search engine to find the central authoritative node across the entire Internet for any given set of search words. The directionality of the edge is most important in this use case since the authoritative node in the network is the one that many other nodes point toward.

LBD is a knowledge network (graph-based) application in which significant discoveries are enabled across the knowledgebase of thousands (and even millions) of research journal articles — the discovery of “hidden knowledge” is only made through the connection between two published research results that may have a large number of degrees of separation (transitive relationships) between them. LBD is being applied to cancer research studies, where the massive semantic medical knowledgebase of symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term results, and long-term consequences may be “hiding” previously unknown cures or beneficial treatments of the most impenetrable cases. The knowledge is already in the network, if only we were to connect the dots to discover it.

The book Graph Algorithms: Practical Examples in Apache Spark and Neo4j is aimed at broadening our knowledge and capabilities around these types of graph analyses, including algorithms, concepts, and practical machine learning applications of the algorithms. From basic concepts to fundamental algorithms, to processing platforms and practical use cases, the authors have compiled an instructive and illustrative guide to the wonderful world of graphs.

Chapter 1 provides a beautiful introduction to graphs, graph analytics algorithms, network science, and graph analytics use cases. In the discussion of power-law distributions, we see again another way that graphs differ from more familiar statistical analyses that assume a normal distribution of properties in random populations. Prepare yourself for some unexpected insights when you realize that power-law distributions are incredibly common in the natural world — graph analytics is a great tool for exploring those scale-free structures and their long tails. By the way, I always love a discussion that mentions the Pareto distribution.

Chapter 2 steps up our graph immersion by introducing us to the many different types of graphs that represent the rich variety of informative relationships that can exist between nodes, including directed and undirected, cyclic and acyclic, trees, and more. If you have always wondered what a DAG was, now you have no more excuses for not knowing. It’s all here. The chapter ends with a quick summary of things to come in greater detail in future chapters, by defining the three major categories of graph algorithms: pathfinding, centrality, and community detection.

Chapter 3 focuses on the graph processing platforms that are mentioned in the subtitle to the book: Apache Spark and Neo4j. In the Apache Spark section, you will find information about the Spark Graph Project, GraphFrames, and Cypher (the graph query language). In the Neo4j section, you will learn about its APOC library: Awesome Procedures On Cypher. Brief instructions on installing these graph engines are included, to prepare you for the use cases and sample applications that are provided later in the book.

Chapters 4, 5, and 6 then dive into the three major graph algorithm categories mentioned earlier. For example, the map app on your mobile phone employs a version of the pathfinding algorithm. Root cause analysis, customer journey modeling, and the spread of infectious diseases are other example applications of pathfinding. Internet search and influencer detection in social networks are example applications of the centrality algorithm. Fraud network analysis, AML, and LBD are example applications of community detection.

Marketing attribution is a use case that may productively incorporate applications of all three graph analytics algorithm categories, depending on the specific question being asked: (1) how did the marketing message flow from source to final action? (pathfinding); (2) was there a dominant influencer who initiated the most ROI from the marketing campaign? (centrality); or (3) is there a community (a set of common personas) that are most responsive to the marketing campaign? (community detection).

Let’s not forget one more example application — a well-conceived theatrical masterpiece will almost certainly be an instantiation of community detection (co-conspirators, love triangles, and all that). That masterpiece will undoubtedly include a major villain or a central hero (representing centrality). Such a masterpiece is probably also a saga (the story of a journey), containing intrigues, strategies, and plots that move ingeniously, methodically, and economically (in three acts or less) toward some climactic ending (thus representing pathfinding).

In Chapter 7, we find many code samples for example applications of graph algorithms, thus rendering all the above knowledge real and useful to the reader. In this manner, the book becomes an instructional tutorial as well as a guide on the side. Putting graph algorithms into practice through these examples is one of the most brilliant contributions of this book — giving you the capability to do it for yourself and to start reaping the rewards of exploring the most natural data structure to describe the world: not rows and columns, but a graph! You will be able to connect the dots that aren’t connected in traditional data structures, build a knowledge graph, explore the graph for insights, and exploit it for value. Let’s put this another way: your graph-powered team will be able to increase the value of your organization’s data assets in ways that others may not have ever imagined. Your team will become graph heroes.

Finally, in Chapter 8, the connection between graph algorithms and machine learning that was implicit throughout the book now becomes explicit. The training data and feature sets that feed machine learning algorithms can now be immensely enriched with tags, labels, annotations, and metadata that were inferred and/or provided naturally through the transformation of your repository of data into a graph of data. Any node and its relationship to a particular node becomes a type of contextual metadata for that particular note. All of that “metadata” (which is simply “other data about your data”) enables rich discovery of shortest paths, central nodes, and communities.

Graph modeling of your data set thus enables more efficient and effective feature extraction and selection (also described in Chapter 8), as the graph exposes the most important, influential, representative, and explanatory attributes to be included in machine learning models that aim to predict a particular target outcome variable as accurately as possible.

When considering the power of graph, we should keep in mind that perhaps the most powerful node in a graph model for real-world use cases might be “context”, including the contextual metadata that we already mentioned. Context may include time, location, related events, nearby entities, and more. Incorporating context into the graph (as nodes and as edges) can thus yield impressive predictive analytics and prescriptive analytics capabilities.

When all these pieces and capabilities are brought together, the graph analytics engine is thereby capable of exploring deep relationships between events, actions, people, and other things across both spatial and temporal (as well as other contextual) dimensions. Consequently, a graph algorithm-powered analysis tool may be called a Spatial-Temporal Analytics Graph Engine (STAGE!). Therefore, if Shakespeare was alive today, maybe he would agree with that logic and would still say “All the world’s a STAGE.” In any case, he would want to read this book to learn how to enrich his stories with deeper insights into the world and with more interesting relationships.

Shocking Amount of Data

Leave a reply

50 years after the publication of Alvin Toffler’s landmark book “Future Shock“, a new book “After Shock” is here. This 540-page compendium collects over 100 modern-day perspectives on After Shock from leading futurists, including deep assessments of Toffler’s formidably prescient predictions from half a century ago, along with a status check on the current exponential growth (“shock”) in all sectors of the world economy, from the unique vantage points of many different contributors. Contributors include David Brin, Po Bronson, Sanjiv Chopra, George Gilder, Newt Gingrich, Alan Kay, Ray Kurzweil, Jane McGonigal, Lord Martin Rees, Byron Reese, and many others. I am honored to be included among those luminary contributors. I present here a short excerpt from my contribution to the book.

After Shock, book published in 2020 — After Shock, edited by John Schroeter

“Shocking Amount of Data”

An excerpt from my chapter in the book:

“We are fully engulfed in the era of massive data collection. All those data represent the most critical and valuable strategic assets of modern organizations that are undergoing digital disruption and digital transformation. Advanced analytics tools and techniques drive insights discovery, innovation, new market opportunities, and value creation from the data. However, our enthusiasm for “big data” is tempered by the fact that this data flood also drives us to sensory input shock and awe.

“Among the countless amazing foresights that appeared in Alvin Toffler’s Future Shock was the concept of information overload. His discussion of the topic came long before the creation and proliferation of social networks, the World Wide Web, the Internet, enterprise databases, ubiquitous sensors, and digital data collection by all organizations—big and small, public and private, near and far. The clear and present human consequences and dangers of infoglut were succinctly called out as section headings in Chapter 16 “The Psychological Dimension” of Toffler’s book, including these: “the overstimulated individual,” “bombardment of the senses” and “decision stress.”

“While these ominous forecasts have now become a reality for our digitally drenched society, especially for the digital natives who have known no other experience, there is hope for a lifeline that we can grasp while swimming (or drowning) in that sea of data. And that hope emanates from the same foundation that is the basis of the information overload shock itself. That foundation is data, and the hope is AI – artificial intelligence. The promise of AI is entirely dependent on the flood of sensory input data that fuels the advanced algorithms that activate and stimulate the Actionable Insights (representing another definition of A.I.), which is what AI is really aimed at achieving.

“AI is a great producer of dichotomous reactions in society: hype versus hope, concern versus commercialization, fantasy versus forward-thinking, overstimulation versus overabundance of opportunity, bombardment of the senses versus bountiful insights into solving hard problems, and decision stress versus automated decision support. Could we imagine any technology that has more mutually contradictory psychological dimensions than AI? I cannot.

“AI takes its cue from data. It needs data – and not just small quantities, but large quantities of data, containing training examples of all sorts of behaviors, events, processes, things, and outcomes. Just as a human (or any cognitive creature) receives sensory inputs from its world through multiple channels (e.g., our five senses) and then produces an output—make a decision, take an action—in response to those inputs, similarly that is exactly what an AI does. The AI relies on mathematical algorithms that sort through streams of data, to find the most relevant, informative, actionable, and pertinent patterns in the data. From those presently perceived patterns, AI then produces an output (decision and/or action). For example, when a human detects and then recognizes a previously seen pattern, the person knows how to respond—what to decide and how to act. This is what an AI is being trained to do automatically and with minimal human intervention through the data inputs that cue the algorithms.

“How AI helps with infoglut and thereby addresses the themes of Toffler’s writings (“information overload”, “overstimulation and bombardment of the senses”, and “decision stress”) is through…

…”

You can continue reading my chapter, plus dozens more perspectives, in the full After Shock book here: https://amzn.to/2S01MC7

Rocket-Powered Data Science

Data Reflections by Dr. Kirk Borne @KirkDBorne

Category Archives: Big Data

SAP Datasphere Powers Business at the Speed of Data

My top learning and pondering moments at Splunk .conf22

Data Insights for Everyone — The Semantic Layer to the Rescue

Are You Content with Your Organization’s Content Strategy?

Top 10 Data Innovation Trends During 2020

Analytics Insights and Careers at the Speed of Data

How We Teach The Leaders of Tomorrow To Be Curious, Ask Questions and Not Be Afraid To Fail Fast To Learn Fast

Data Science Blogs-R-Us

The Power of Graph Databases, Linked Data, and Graph Algorithms

“All the World’s a Graph”

Shocking Amount of Data