
The ROI on Connected Data: The Overlooked Value of Context for Business Insights [+ Airbnb Case Study]

Your data is inherently valuable, but until you connect it, that value is largely hidden.

Those data relationships give your applications an integrated view that powers real-time, higher-order insights traditional technology cannot deliver.

Learn why you need data context for business insights in this series on the ROI of connected data


In this series, we’ll examine how investments in connected data return dividends for your bottom line – and beyond. Last week, we explored how increasing data’s connectedness increases its business value.

This week, we’ll take a closer look at how connected data gives you contextual insights for essential business use cases.

Connected Data Offers Business Context


The biggest benefit of connected data is the ability to provide an integrated view of the data to your analytic and operational applications, thereby gaining and growing intelligence downstream.

The connections can be made available to applications or business users to make operational decisions. You also obtain context that allows you to better refine the information you’re collecting and the recommendations you’re producing.

Marketing may determine the best time to send an email to customers who previously purchased winter coats and dynamically display photos in their preferred colors. The more understanding you have of the relationships between data, the better and more refined your system is downstream.

Business Use Cases of Connected Data


Connected data applies to a variety of contexts.

In addition to refining the output of your recommendation engines, you can better understand the flow of money to detect fraud and money laundering (see below), and assess the risk of a network outage across computer networks.


A connected dataset for a fraud detection use case


Connected data also helps you see when and how relationships change over time. For example, you can determine when a customer moves and change the applicable data (such as mailing address) so that customer data doesn’t become obsolete.

Connected data is most powerful when it provides operational, real-time insights and not just after-the-fact analytics. Real-time insights allow business users and applications to make business decisions and act in real time. Thus, recommendation engines leverage data from the current user session – and from historical data – to deliver highly relevant suggestions.

Using a connected-data view, IT organizations proactively mitigate network issues that would otherwise cause an outage, and anti-fraud teams put an end to potentially malicious activity before it results in a substantial loss.

Case Study: Airbnb


With over 3500 employees located across 20 global offices, Airbnb is growing exponentially.

As a result of employee growth, they have experienced an explosion in both the volume and variety of internal data resources, such as tables, dashboards, reports, Superset charts, Tableau workbooks, knowledge posts and more.

As Airbnb grows, so do the problems around the volume, complexity and obscurity of data. Information and people become siloed, which creates inefficiencies: employees end up navigating tribal knowledge instead of having clear, easy access to relevant data.

In order for this ocean of data resources to be of any use at all, the Airbnb team would need to help employees navigate the varying quality, complexity, relevance and trustworthiness of the data. In fact, lack of trust in the data was a constant problem because employees were afraid of accidentally using outdated or incorrect information. Instead, employees would create their own additional data resources, further adding to the problem of myopic, isolated datasets.

To address these challenges, the Airbnb team created the Dataportal, a self-service system providing transparency to their complex and often-obscure data landscape. This search-and-discovery tool democratizes data and empowers Airbnb employees to easily find or discover data and feel confident about its trustworthiness and relevance.

When creating the Dataportal, the Airbnb team realized their ecosystem was best represented as a graph of connected data. Nodes were the various data resources: tables, dashboards, reports, users, teams, etc. Relationships were the already-present connections in how people used the data: consumption, production, association, team affinity, etc.

Using a graph data model, the relationships became just as pertinent as the nodes. Knowing who produced or consumed a data resource can be just as valuable as the resource itself. Connected data thus provides the necessary linkages between silos of data components and provides an understanding of the overall data landscape.

Given their connected data model, it was both logical and performant to use a graph database to store the data. Using Apache Hive as their master data store, Airbnb exports the data using Python to create a weighted PageRank of the graph data before pushing it into Neo4j where it’s synced with Elasticsearch for simple search and data discovery.
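To make that flow concrete, below is a minimal sketch of this kind of export pipeline (not Airbnb’s actual code). It assumes the resource relationships have already been exported from Hive to a CSV edge list, computes a weighted PageRank with NetworkX and writes the scores into Neo4j with the official Python driver; the file name, node label and property names are placeholders.

# Sketch only: the Hive export is assumed to be a CSV of (source, target, weight) rows.
import csv

import networkx as nx
from neo4j import GraphDatabase

def load_graph(edges_csv):
    # Build a weighted, directed graph from the exported edge list.
    graph = nx.DiGraph()
    with open(edges_csv) as f:
        for source, target, weight in csv.reader(f):
            graph.add_edge(source, target, weight=float(weight))
    return graph

def push_ranked_nodes(graph, uri="bolt://localhost:7687", auth=("neo4j", "secret")):
    # Compute weighted PageRank and store each resource with its score in Neo4j.
    ranks = nx.pagerank(graph, weight="weight")
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for resource_id, score in ranks.items():
            session.run(
                "MERGE (r:Resource {id: $id}) SET r.pagerank = $score",
                id=resource_id, score=score,
            )
    driver.close()

if __name__ == "__main__":
    push_ranked_nodes(load_graph("resource_edges.csv"))

In the pipeline the Airbnb team describes, the ranked data is then synced with Elasticsearch for simple search and discovery.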

Conclusion


As you can see, once you surface the connections in your data, the use cases are endless.

The insights that these connections enable allow your organization to remain nimble in a changing business world and overcome the challenges of digital transformation. In the end, having a connected-data view of your enterprise is a future-proof solution to unknown future business requirements.

Next week, we’ll explore how to harness connected data using graph database technology in conjunction with your existing data platforms and analytics tools.


Get more value from the connections in your data:
Click below to get your copy of The Return on Connected Data and learn how to create a sustainable competitive advantage with graph technology.


Read the White Paper


Catch up with the rest of the ROI on connected data blog series:



Using Neo4j to Investigate Connections in the U.S. Healthcare System

Editor’s Note: This presentation was given by Yaqi Shi at GraphConnect San Francisco in 2016.

Presentation Summary


In this presentation, Yaqi Shi describes the process of developing Health-Graph, a model of how some of the key elements in the healthcare system are connected. With a background in the Chinese healthcare system, Yaqi was bewildered by the U.S. system’s complexity, which inspired her to make this model.

Her process of developing a data model using open data provides a basic introduction to graph data modeling, as do her examples of addressing the challenges of data integration.

She concludes with a demo of some of the queries that her model affords, for example, showing the systematic flow of money between lobbyists and legislators, and showing who prescribes frequently prescribed medications, and how often.

Full Presentation: Using Neo4j to Investigate Connections in the U.S. Healthcare System


This post is about using Neo4j to show how some of the key elements of the healthcare system in the United States are connected.



Introduction


I am currently studying health informatics at the University of San Francisco; I also have a bachelor’s degree in medicine from China. I was very fortunate to have an internship with Neo4j and am now working on the Health-Graph project, which is the basis of my presentation.

Before we start, I’d like to mention that all the code from this post is housed on GitHub. Feel free to grab any information you think would be helpful for you. I also documented the project development in detail on the Neo4j blog, which may also be of interest.

Why Make a Model of the U.S. Healthcare System?


Why did I choose to do a project on healthcare?

When I first started my master’s program at USF, I was surprised by how differently the healthcare system in the United States works in comparison with how it works in China. For example, there are a lot of different stakeholders and different regulations. I was overwhelmed by the differences.

The image below shows a few of the stakeholders in the United States, such as providers, patients, pharmaceutical companies, medical device vendors, payers, and policymakers, and the list can go on.

I also realized that each stakeholder generates a lot of data every day. I wanted not only to understand how the healthcare industry works, but I also wanted to acquire some interesting insights from the data available out there and to be able to integrate the data.

The complications of the U.S. healthcare system.


I encountered a few challenges. First, there are many different roles in the healthcare system. Second, I found that how they relate to each other is complicated. And third, that there is a lot of data. In April 2016, I was introduced to Neo4j, and you can imagine how excited I was because Neo4j allows me to visualize the relationships between different healthcare stakeholders and handles data integration.

Developing a Data Model


Next came a lot of research on how the elements of the healthcare system relate to each other. Below is the first version of Health-Graph model.

Learn how Yaqi Shi used Neo4j for data integration and modeling to investigate the U.S. healthcare system.


I’d like to quickly walk you through this initial model. It starts from the Provider node, where a prescription is written for a generic drug. A drug firm will brand a drug, and a drug firm will probably also sign some disclosure with a lobbying firm.

The disclosure would have different kinds of issues around healthcare or Medicare. Those issues are lobbied by lobbyists. A lobbyist will also make contributions to different committees, and a committee will fund a legislator.

Also, there will be information on the legislator, such as which state they represent, which party they belong to and which body they are elected to.

So this is the data model of the Health-Graph, and this is the information you can query. Next, I looked for data sources.

The Data Integration Process


This is the list of the data that I integrated into my Health-Graph.

Data integration challenges for the health graph project.


This list shows the sources and the different types of data. If you have any experience working with open data, you probably know that it’s not the cleanest data ever.

In fact, one of the biggest challenges I encountered was integrating the data. Because it comes from different sources, it usually doesn’t have unique identifiers or a foreign key to connect it. As illustrated in the image below, the challenge is to find the grey connections.

Data integration list for the health graph project.


I found two solutions. The first one is to use public APIs.

First Solution: Using Public APIs


Data integration solution one, RxNorm APIs.


In the image above, what I’ve been trying to do is to connect the node Prescription with the node Drug. I stored the generic name under the genericName nodes. You have probably heard of RxNorm, which is part of the National Institutes of Health (NIH) National Library of Medicine and provides a unique identifier for clinical drugs.

So I send the generic name in both prescription and drug to RxNorm. That gives me a unique identifier: rxcui. I also stored the rxcui as a property under the Prescription node.

I also created an intermediate node here called GenericDrug, which has exactly the same generic name, and stored the rxcui there as well. At this point I can connect those three nodes based on the rxcui and the genericName.
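A rough sketch of this lookup step is shown below. The RxNav endpoint, labels and property names are illustrative assumptions; check the current NLM RxNav documentation and the actual Health-Graph schema before reusing them.

# Sketch only: look up an rxcui for each generic name and store it on the Prescription node.
import requests
from neo4j import GraphDatabase

RXNAV_URL = "https://rxnav.nlm.nih.gov/REST/rxcui.json"

def lookup_rxcui(generic_name):
    # Ask RxNorm (via the public RxNav REST API) for the identifier of a drug name.
    response = requests.get(RXNAV_URL, params={"name": generic_name}, timeout=10)
    response.raise_for_status()
    ids = response.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

def tag_prescriptions(uri="bolt://localhost:7687", auth=("neo4j", "secret")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        names = [r["name"] for r in session.run(
            "MATCH (p:Prescription) RETURN DISTINCT p.genericName AS name")]
        for name in names:
            rxcui = lookup_rxcui(name)
            if rxcui:
                session.run(
                    "MATCH (p:Prescription {genericName: $name}) SET p.rxcui = $rxcui",
                    name=name, rxcui=rxcui,
                )
    driver.close()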

Second Solution: FuzzyWuzzy


The second solution is FuzzyWuzzy string matching.

What I’m trying to do here is to connect the node Client with the node DrugFirm based on the assumption that if the company’s name looks similar to the client’s, I will create a relationship. The library that I use in Python is called FuzzyWuzzy, which returns the matching score between two strings.

The image below shows an example of two strings being compared. The second string is a substring of the first one.

Data integration solution two, FuzzyWuzzy.


Calling the partial_ratio function returns a score of 100, and calling the original ratio function returns a score of 65.

Obviously, the two strings are not identical even though they refer to the same company, so I decided to choose a cut-off value of 70, because most company names that score above 70 look similar. So I create a relationship between the two nodes if the score is above 70.
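The sketch below illustrates this matching step. The node labels, property names and the SIMILAR_TO relationship type are placeholders rather than the exact Health-Graph schema.

# Sketch only: link Client and DrugFirm nodes whose names fuzzily match.
from fuzzywuzzy import fuzz
from neo4j import GraphDatabase

CUTOFF = 70  # scores above this are treated as the same company

def link_clients_to_drug_firms(uri="bolt://localhost:7687", auth=("neo4j", "secret")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        clients = [r["name"] for r in session.run("MATCH (c:Client) RETURN c.name AS name")]
        firms = [r["name"] for r in session.run("MATCH (d:DrugFirm) RETURN d.name AS name")]
        for client in clients:
            for firm in firms:
                # partial_ratio scores a substring match (e.g. "Acme" vs "Acme Inc") as 100
                score = fuzz.partial_ratio(client.lower(), firm.lower())
                if score >= CUTOFF:
                    session.run(
                        "MATCH (c:Client {name: $client}), (d:DrugFirm {name: $firm}) "
                        "MERGE (c)-[:SIMILAR_TO {score: $score}]->(d)",
                        client=client, firm=firm, score=score,
                    )
    driver.close()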

Demo: Some Health-Graph Queries


Now let’s look at a demo of the Health-Graph. Watch the video below for the full demo.



That’s the demo for the health graph and I am sure that if you have the data that you are interested in, you probably will run into the same issues that I did, such as how to integrate your data and how to do the data modeling. But this presentation is just to give you an idea of how you can use Neo4j to handle any industry you’re interested in.


Inspired by Yaqi’s talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Register for GraphConnect


The Neo4j JDBC Driver 3.3.1 Release Is Here [+ Examples]

Learn about the brand-new Neo4j JDBC Driver (v3.3.1) and what's new in this release (+ examples)

Our team at LARUS has been quite busy since the last JDBC driver release. Today, we’re happy to announce the 3.3.1 release of the Neo4j-JDBC driver.

The release has been upgraded to work with recent Neo4j 3.3.x versions and Bolt driver 1.4.6. (Work on Neo4j 3.4.x and drivers 1.6.x is in progress.)

Neo4j-JDBC Driver Improvements and Upgrades


We worked on a number of improvements:
    • Added the Bolt+routing protocol to let the driver work with a cluster and route transactions to available cluster members.
    • Added support for in-memory databases for testing and embedded use cases.
    • Added a debug feature to better support the development phase or inspect how the driver works when used by third-party tools.
    • Added support for TrustStrategy so that you can now configure how the driver determines if it can trust the encryption certificates provided by the Neo4j instance it is connected to.
    • Implemented the DataSource interface so that you can now register the driver with a naming service based on the Java Naming and Directory Interface (JNDI) API and get a connection via JNDI lookups.
    • PLEASE NOTE: We’ve deprecated the usage of , as the parameter separator in favour of & to be compliant with the URL parameter syntax. Please update your connection URLs, because in future releases we’ll support only &. (In the future, we want to use , for parameters that can have a list of values.)

Updated Documentation + Matlab Example


The documentation has been updated to explain how to use the new features and now includes a Matlab example.

A Matlab example of the Neo4j JDBC driver


Open connection:

conn = database('','neo4j','test','org.neo4j.jdbc.BoltNeo4jDriver', ...
                'jdbc:neo4j:bolt://localhost:7687')

Fetch Total Node Count:

curs = exec(conn,'MATCH (n) RETURN count(*)')
curs = fetch(curs);
curs.Data
close(curs)

Output:

ans = '102671'

Besides Matlab, Neo4j-JDBC can of course be used with many other tools. Here is a short list:
    • Squirrel SQL
    • Eclipse / BIRT
    • Jasper Reports
    • RapidMiner Studio
    • Pentaho Kettle
    • Streamsets

API / Interface Work for JDBC Compatibility


We implemented the DataSource interface so that you can now register the driver with a naming service based on the Java Naming and Directory Interface (JNDI) API and get a connection via JNDI lookups. This should help a lot when you need a server managed connection to Neo4j in a JEE environment.

We also added implementations for several methods in Driver, Connection, Statement, ResultSet that were not there previously.

This helps you use the Neo4j-JDBC driver with MyBatis and other frameworks, like Spring JDBC.

Introducing New Support for Causal Clustering


It’s not always easy to adapt the brand-new Neo4j features and protocols to an old-fashioned interface such as the Java Database Connectivity (JDBC). This is because the capabilities of Neo4j Clusters and the Neo4j Java Bolt driver are evolving very rapidly.

Our latest task at LARUS was to make the Neo4j-JDBC driver interact with a Neo4j Causal Cluster providing all the client-side clustering features supported by the Bolt driver:
    • The possibility to route reads and writes to the server with the correct role
    • Defining a routing context
    • Managing bookmarks for causal consistency
    • Supporting multiple bootstrap servers
We’re very happy to present what we’ve been able to achieve!

Bolt+Routing Protocol


If you’re connecting to a Neo4j Causal Cluster and you want to manage routing strategies the JDBC URL must have this format:

jdbc:neo4j:bolt+routing://host1:port1,host2:port2,…,
     hostN:portN/?username=neo4j,password=xxxx

You might have noticed we introduced the new protocol jdbc:neo4j:bolt+routing which indeed allows you to create a routing driver.

The list of [host:port] pairs in the URL corresponds to the list of servers that are participating as Core instances in the Neo4j cluster. If your preferred tool doesn’t support this format, you can fall back to the dedicated parameter routing:servers, as in the following example:

jdbc:neo4j:bolt+routing://host1:port1?username=neo4j,password=xxxx,
     routing:region=EU&country=Italy&routing:servers=host2:port2;…;hostN:portN

In that case, the address in the URL must be that of a Core server and the alternative servers must be ; separated (instead of ,).

Routing Context


A routing driver with a routing context is available with a Neo4j Causal Cluster of version 3.2 or above. In such a setup, you can include a preferred routing context via the routing:policy parameter.

jdbc:neo4j:bolt+routing://host1:port1,host2:port2,…,
     hostN:portN?username=neo4j,password=xxxx,routing:policy=EU

While for custom routing strategies you can use the generic routing: parameter:

jdbc:neo4j:bolt+routing://host1:port1,host2:port2,…,
     hostN:portN?username=neo4j,password=xxxx,routing:region=EU&country=Italy

Access Mode (READ, WRITE)


Transactions can be executed in either read or write mode (see access mode), which is a really useful feature to support in JDBC too. The user can start a transaction in read or write mode via the Connection#setReadOnly method.

Note: Beware not to invoke that method while a transaction is currently open. If you do, the driver will raise an SQLException.

By using this method, when accessing the Neo4j Causal Cluster, write operations will be forwarded to Core instances while read operations will be managed by all cluster instances (depending on routing configuration).

You can find an example after the next paragraph.

Bookmarks


When working with a Causal Cluster, causal chaining is carried out by passing bookmarks between transactions in a session (see “causal chaining” in the Neo4j docs).

The JDBC driver allows you to read bookmarks by calling the following method:

connection.getClientInfo(BoltRoutingNeo4jDriver.BOOKMARK);

Of course, you can set the bookmark by calling the corresponding method:

connection.setClientInfo(BoltRoutingNeo4jDriver.BOOKMARK, "my bookmark");

Bolt+Routing with Bookmark Example


Run query:

String connectionUrl =
       "jdbc:neo4j:bolt+routing://localhost:17681,localhost:17682," +
       "localhost:17683,localhost:17684,localhost:17685,localhost:17686," +
       "localhost:17687?noSsl&routing:policy=EU";

try  (Connection connection = 
      DriverManager.getConnection(connectionUrl, "neo4j", password)) {
    connection.setAutoCommit(false);

    // Access to CORE instances, as the connection is opened by 
    // default in write mode (connection.setReadOnly(false))
    try (Statement statement = connection.createStatement()) {
        statement.execute("CREATE (:Person {name: 'Anna'})");
    }

    // closing transaction before changing access mode
    connection.commit();

    // printing the transaction bookmark
    String bookmark = connection.getClientInfo(
                          BoltRoutingNeo4jDriver.BOOKMARK);
    System.out.println(bookmark);

    // Switching to read-only mode to access all cluster instances
    connection.setReadOnly(true);
    try (Statement statement = connection.createStatement()) {
        try (ResultSet resultSet = statement.executeQuery(
         "MATCH (p:Person {name:'Anna'}) RETURN count(*) AS total")) {
            if (resultSet.next()) {
                Long total = resultSet.getLong("total");
                assertEquals(1, total);
            }
        }
    }
    connection.commit();
}

Thanks to the bookmark, we of course expect the total number of Person nodes returned to be 1 (given an initially empty database), even though we switch from a Core node – where we perform the CREATE operation – to some other instance in the cluster, where we perform the MATCH operation.

Usage


We really hope you enjoyed our work, and we’d love to hear from you, not just about issues, but also how you use the JDBC driver in your projects or which tools you use that we haven’t mentioned.

If you want to use the Neo4j-JDBC driver in your application, you can depend on org.neo4j:neo4j-jdbc:3.3.1 in your build setup, while for use with standalone tools it’s best to grab the release from GitHub.

Enjoy!
Lorenzo


Want to learn more about how relational databases compare to their graph counterparts? Get The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.

Get the Ebook


The ROI on Connected Data: Extract More Value from Existing Data Using Graphs [+ Telia Case Study]

From LinkedIn to Facebook to Google, top companies are driving their businesses using graphs. And use cases span industries as well.

What those new to graph technology may not understand is that creating a graph does not require starting from scratch. Existing data stores, such as data lakes and relational databases, represent an onramp to graph analytics on your existing data.

Learn how connected data helps you get increased value out of your big data with graph technology.


In this series, we’ll examine how investments in connected data return dividends for your bottom line – and beyond. In previous weeks, we’ve explored how increasing data’s connectedness correspondingly increases its business value and the overlooked value of context for business insights.

In this final post, we’ll explain how graph technology helps you get more value from current application data and big data alike, including a business case study for broadband provider Telia.

Connected Data: A Review of the Basics


How does data become connected? Don’t worry; you don’t have to boil the ocean. Connecting data is easy and approachable. In fact, you don’t even have to throw out what you’re already doing.

A graph database makes it easy to express and persist relationships across many types of data elements (people, processes and things). A graph database is a highly scalable transactional and analytic database that stores data relationships as first-class entities.

Think of a graph data model as a mind map composed of two elements: A node and a relationship. Each node represents an entity, and each relationship represents how two nodes are associated.

Nodes and relationships that connect your data.


By assembling the simple abstractions of nodes and relationships into connected structures (imagine connecting two circles with an arrow or line), a graph database enables you to build sophisticated models that map closely to a problem domain.

Graph databases are simple and agile due to their schema-optional nature. Because the structural overhead of traditional database schema is eliminated, graphs are easily changed or updated.

You aren’t required to have a similar structure for every node. You can put data into a graph format simply by reading in a table, and if you want to add another column or attribute to that data element, you simply add the attribute.
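As a minimal illustration of that flexibility (the Customer/Product example and credentials below are invented, not from the article), adding a node, a relationship or a brand-new property requires no schema migration:

# Sketch only: create connected data and add a new attribute later, with no schema change.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Two nodes and a relationship: no table definition required up front.
    session.run(
        "MERGE (c:Customer {name: $name}) "
        "MERGE (p:Product {sku: $sku}) "
        "MERGE (c)-[:PURCHASED {on: $on}]->(p)",
        name="Alice", sku="COAT-42", on="2018-01-15",
    )
    # Adding a new attribute later is just another SET; other Customer nodes are unaffected.
    session.run(
        "MATCH (c:Customer {name: $name}) SET c.preferredColor = $color",
        name="Alice", color="blue",
    )

driver.close()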

How Connected Data Delivers More Value from Existing Datasets


Graph technology and connected data are already being used today. In fact, companies like LinkedIn and Google recognized early on the need to use the relationships in their data.

Fortunately, graph databases have become more accessible. Mainstream, out-of-the-box graph technology makes it easy for any company to apply the concept of connected data to more problems.

Regardless of where you are today in your big data initiative, a graph database helps you reach the next level of maturity. If you’re just starting out, a graph database addresses the challenge of migrating all of your data (or your metadata) into one connected place.

There are no rigid data typing requirements or fixed structure for how the graph must be defined. Property types follow from the values you store, and additional nodes can be added at any time.

You also get more value out of your big data technology by using graphs to transfer knowledge of what the organization has done across other departments. Graph database technology keeps relationship information always at hand, even as that information changes over time. The graph therefore becomes a longstanding system of record that delivers value to different departments.

For example, it enables the customer service department to approve or deny a product exchange based on the customer’s purchase, return and exchange history.

By connecting data, graph databases also enable you to leverage all of the data you’ve dumped into your data lake. You simply port your tables or CSV data into the graph database, and the database makes nodes out of the rows in the tables.

Connected data using existing data.


By looking at how the tables join together and building relationships from that information, you create a graph that links discrete data, such as data from SAP to Oracle to Salesforce to Marketo to your bespoke shopping cart application, and from there you reveal the connections that a data lake doesn’t expose.
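The sketch below shows one common way to do this with Cypher’s LOAD CSV, run here through the Python driver. The customer/order tables, file names and the PLACED relationship type are invented for illustration; the CSV files are assumed to sit in Neo4j’s import directory.

# Sketch only: turn exported table rows into nodes and foreign keys into relationships.
from neo4j import GraphDatabase

LOAD_CUSTOMERS = """
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {id: row.customer_id})
SET c.name = row.name
"""

LOAD_ORDERS = """
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MERGE (o:Order {id: row.order_id})
SET o.total = toFloat(row.total)
WITH o, row
MATCH (c:Customer {id: row.customer_id})
MERGE (c)-[:PLACED]->(o)
"""

def load_tables(uri="bolt://localhost:7687", auth=("neo4j", "secret")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        session.run(LOAD_CUSTOMERS)   # rows become Customer nodes
        session.run(LOAD_ORDERS)      # the customer_id foreign key becomes a PLACED relationship
    driver.close()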

Identifying connections in your data is only helpful if you know what to do with them. Graphs help you with that, too.

Today’s applications – whether they be recommendation engines or anti-money laundering systems – are powered by graphs. The graph database operationalizes the data by making it possible for the app to traverse the data. Once you have this ability, the data becomes actionable.

Telia case study on connected data.

Case Study: Telia


Telia is the incumbent telecommunications carrier in the Nordics, serving the mobile, broadband, consumer and enterprise markets. In order to stay competitive, the Telia team decided to reinvigorate the area with the least innovation: their consumer broadband business.

Taking a closer look, the Telia team recognized their roughly 1 million installed Wi-Fi routers as underutilized assets. Their goal was twofold: help consumers simplify their lives and help them entertain better within their homes.

The solution: The Telia Zone. After tracking all Wi-Fi connected devices within their customers’ homes and recording when a device leaves or enters (all collected with prior consent), the Telia team provides this connected dataset and a number of APIs to a partner-driven ecosystem of apps and services.

The result is a smart home platform that allows consumers to pick and choose apps that meet the two goals of simplification and home entertainment.

Telia Zone apps do everything from reminding consumers when they forget to lock the door after they leave for the day to letting them know when their kids arrive home after school (and turning on the lights when they get there).

Another example is the Coplay app which automatically generates a Spotify playlist based on which party guests are connected to a home’s Wi-Fi. If a guest leaves, their songs are dynamically removed from the playlist, and when they return, so do their favorite songs.

By tracking patterns of connected data around when users leave and arrive home, the Telia Zone also offers highly accurate and relevant suggestions to other apps regarding when users will most likely be at home – information that is both elusive and salient to every home delivery service.

Powering it all on the backend is the Neo4j Graph Platform. The Telia team chose Neo4j because their smart home dataset was already a graph of connected data.

Also, with Neo4j’s Causal Clustering architecture, the Telia team was able to horizontally scale their operations without always knowing what given traffic loads might be. Furthermore, the graph data model allows them the flexibility to keep evolving the Telia Zone platform without breaking existing components.

Conclusion


There is tremendous untapped business value in the connections hidden in your data.

Graph analytics bring these connections to light, resulting in lightning fast queries no matter how much data you have, serving numerous use cases across industries.

This concludes our series on the ROI of connected data. We hope these blogs have inspired you to explore how to make use of connected data to grow your bottom line and beyond.


Find out how your data can bring you more ROI: Click below to get your copy of The Return on Connected Data and learn how Neo4j can drive results for you.

Read the White Paper


Catch up with the rest of the Return on Connected Data blog series:


Connecting the Dots in Early Drug Discovery at Novartis

Editor’s Note: This presentation was given by Stephan Reiling at GraphConnect San Francisco in October 2016.

Presentation Summary


In this talk, Senior Scientist Stephan Reiling describes some of the ways that a major pharmaceutical company conducts the search for medical compounds, and the significant role that graph database technology plays in that effort.

The underlying problem is how to construct a system of scalable biological knowledge.

This means not just connecting vast amounts of heterogeneous data, but enabling researchers to construct a query for a particular kind of triangular relationship: The nodes are chemical compounds, specific biological entities, and diseases described in the research literature, and the system has to use the uncertainty in key links as part of the query.

A specific way graph technology furthers this effort is that it enables the system to capture the strength of the relationship between terms in a medical research text by encoding it in the properties of a graph connecting these terms.

This in turn provides a foundation for later queries that link the literature to observed chemical or biological data. These results can also be tested by knocking out some of the links to see how the results differ.

Full Presentation: Connecting the Dots in Early Drug Discovery


This blog post is about how we have combined lots of heterogeneous data and integrated it into one big knowledge graph that we are using to help us discover cures for diseases.



A Graph for Biomedical Research


The Novartis Institutes for BioMedical Research is the research arm of Novartis, a large pharmaceutical company. Our research is focused on identifying the next generation of medicines to cure disease.

We identify medicines using biomedical research. This enterprise has become a big data merging exercise where you generate lots of data, you analyze even more data, and all of this at the very end is distilled into a small pill or injection or in some treatment.

This is a project that we’ve been working on for almost three years, and we are now getting the first results and they look very promising. One topic is how we have combined lots of heterogeneous data and integrated that into one big graph that we are now using for querying.

Some of the data we are putting into this graph is coming from text mining, and we are doing this a little bit differently in terms of what we extract from the text and how we do the pattern detection. Further down, I have some examples of how this can actually be used.

Why We Built the Graph


Let me start with why we are doing this in the first place.

Today, it’s about images used to capture complex biology. In the past, when you wanted to understand biology and you were interested in proteins, you would isolate a protein, you would purify it, you would put it into a little test tube, and you would characterize it.

If you wanted to identify a compound or something that acts on it, you would put the compound into the same test tube and see what it does to the activity of that enzyme or some other protein. This is very reductionist, but it allows you to get very precise measurements if you want to, and we have decades worth of data along those lines.

For a long time that was the workhorse for drug discovery. This initial compound goes into a multi-year effort that at the very end results in a drug that comes onto the market. And, as we found out, biology cannot be reduced to a single protein in a test tube. It’s much more complex.

Over the last five to eight years or so, all of the pharmaceutical companies and research institutes across the world are now starting to try to capture the complexity of this biology. One way to do it is to run high content screens.

High-Content Screens


The image on the left below is a snapshot of a cell culture; this is a crop of a much larger image. We run large-scale screening assays on these kinds of things. We have big automated microscopes with lab automation, and we can easily generate terabytes and terabytes of images along those lines.

Why did we build a graph

What you see in white is basically the cell body, the red circles are the nucleus, and green are the nuclei of cells that do not contain a certain protein. And let’s say the goal is to increase the number of the cells that express this protein, which is colored white here.

So let’s say you add some treatment to it; it could be a compound, it could be a small RNA, it could be anything. Then you see that not only do the numbers change, but I hope you can see that there is more going on. The shape of the cells is changing. Now you add a different compound to it, and something else happens. We call this a phenotype.

We’re now running very large-scale screening efforts where we use phenotypes to understand the underlying biology, and also to better identify the compounds that will go into this long process to get a drug in the end.

This process of generating data is only making our job harder. Lab automation is progressing, and we can generate more and more of this data. There are now movies coming up where we take live cell imaging and we follow the cells over time. There is 3D. We take 3D slides and so on. There’s more and more of this data coming our way.

The Basic Problem: Scalable Biological Knowledge


Here is the basic problem statement: We have decades’ worth of this reductionist data. We have these large, annotated compound libraries. We know what they did in the past, but now things are changing, and we’re getting more and more of this imaging data and these phenotypic assays.

When we try to analyze this, it becomes much more apparent that we need to have a way to scale biological knowledge, to have a system that allows us to store biological knowledge, and then run queries against it so that we can use this store to analyze this data. I still don’t know exactly what “biological knowledge” means, but we’re going to make a system for it.

Scalable biological knowledge, why build a graph.

In the image above, the key triangle connects a compound, a gene and a phenotype. The way we think about it is that for successful drug discovery, you need to be able to navigate this triangle.

Using the data that we accumulated historically, we are very strong on one edge of the triangle, between the compound and the gene, but not so much on the other edges. We’re trying to fill this knowledge gap.

Text Mining for Chemicals, Diseases and Proteins


I mentioned that one of the sources of information that we are using is text mining. In the slide below is some scientific text, and there are tools available that will then identify entities of interest in this text.

How did we build the graph

Here we are identifying compounds, genes, diseases and processes and so on. And some of these tools cheat you a little bit in that they tell you only this is a compound, but they don’t tell you what the compound is composed of.

One of the things we have been working on is to identify exactly what compound or gene it really is. We then basically re-engineer the chemical structure from what we identify in the text.

We also try to identify relationships between the entities in the text and come up with statements: “The compound inhibits this target.” And so on.

The Richness of the PubMed Library


The corpus that we were using initially is PubMed from the National Library of Medicine (see below). It’s an incredible resource. Since 1946, they have been collecting articles from about 5,600 scientific journals, and they make abstracts of these articles freely available. It’s a really incredible resource.

But what is sometimes not appreciated is that when an article or an abstract gets entered into this library, it is tagged by human experts. They call these MeSH terms: the medical subject headings.

National Institute of Health PubMed

In the image above, the text on the left gives you an impression of how many tags you actually get, shown on the right. And the nice thing about these tags is how they are organized and that they’re coming from human experts.

Sometimes tags describe the article in a way that you would not be able to get from the text because of the human curation that is happening.

I mentioned that the MeSH terms are organized, and below is a little bit about how they are organized. They’re organized in trees. The diagram shows five of them, which are the ones we’re interested in. In total there are 16 trees, which include things like geography or occupation, which for our purposes are not of much interest.

Structure of Mesh Term Hierarchy

There are about a quarter million tags that are used in these trees. And I have to make one disclosure here: In reality they don’t look quite that nice. The overall organization is a tree.

There are some connections between the branches and so the actual layout of this doesn’t quite look that nice but, in general, it’s organized as a tree.

Organization tree for the graph

This slide shows a little bit how deep and how rich this is. This is sized by the number of articles that are annotated with a specific tag. And you cannot see the tags.

Deep rich organizational tree.

As you zoom further in, you can tell, just from the tags, that the article is going to cover a specific subtype of a receptor. So we’re using very rich and detailed information.

But the disadvantage is that while the text mining gives us entities and relationships between those entities, here we can only tell that a tag is associated with the abstract. We would like to combine the two by putting the PubMed information into our graph. So how do we do it?

Constructing Relationships in a Graph from Relationships in a Text


We’re exploiting the fact that we’re looking across all of these 25 million abstracts. Here’s a simple example using four articles. And if you do this and you look across these articles, you will see that some of these entities – like Compound A and Gene 1 – occur together quite frequently.

We’re using association rule mining to establish the probability of co-occurrence for these entities. So we’re not using the verbs of the sentence; we’re only using these entities. We can collect the lift, as it is known, and we scale it from zero to one.

If it’s above a certain threshold, we then say there is a relationship between the entities (in this example Compound A and Gene 1), and we put it into the graph. So here’s a way that we can use the statistical analysis of relationships in a text to generate relationships in a graph.

Association rule mining concurrences.

The lift is the association strength, and we actually store that as the uncertainty of the association. The reason is that later on, we’re going to do a lot of things where we do graph traversals, and we’re going to use this as a distance measure in the graph.

The associations will give us a confidence measure for the association, and we just take the inverse. A high association strength has a low uncertainty, and that’s the way that we’re entering this into the graph.
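Below is a toy sketch of this association step, not the Novartis implementation. The normalization of lift to the zero-to-one range and the threshold value are assumptions made for illustration; the talk only states that a scaled lift above a cut-off becomes an edge whose uncertainty is the inverse of the association strength.

# Sketch only: mine pairwise co-occurrence of entities across abstracts and derive edges.
from collections import Counter
from itertools import combinations

def mine_associations(abstract_entities, threshold=0.3):
    # abstract_entities: list of sets of entity ids found in each abstract.
    n = len(abstract_entities)
    single = Counter()
    pair = Counter()
    for entities in abstract_entities:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))

    # lift(a, b) = P(a and b) / (P(a) * P(b))
    lifts = {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }
    max_lift = max(lifts.values())

    edges = []
    for (a, b), lift in lifts.items():
        strength = lift / max_lift        # crude scaling to [0, 1] (assumption)
        if strength >= threshold:
            uncertainty = 1.0 - strength  # strong association, low uncertainty
            edges.append((a, b, uncertainty))
    return edges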

Putting This Metric to Work


What can you do with this?

An earlier graph showed the triangle that we’re trying to navigate, so now I can just do a triangle query, where I say, “I want to find a triangle between a compound, a gene and a phenotype or a disease, and I want to take the sum of the distances of the three edges, and use that sum to rank these triangles. And I want to find the triangle that has the lowest uncertainty at the top.”
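A query along those lines might look like the hedged Cypher sketch below, run through the Python driver. The Compound, Gene and Disease labels, the ASSOCIATED_WITH relationship type and the uncertainty property are assumptions about the schema, not taken from the talk.

# Sketch only: rank compound-gene-disease triangles by total uncertainty.
from neo4j import GraphDatabase

TRIANGLE_QUERY = """
MATCH (c:Compound)-[r1:ASSOCIATED_WITH]-(g:Gene),
      (g)-[r2:ASSOCIATED_WITH]-(d:Disease),
      (d)-[r3:ASSOCIATED_WITH]-(c)
RETURN c.name AS compound, g.name AS gene, d.name AS disease,
       r1.uncertainty + r2.uncertainty + r3.uncertainty AS total_uncertainty
ORDER BY total_uncertainty ASC
LIMIT 10
"""

def top_triangles(uri="bolt://localhost:7687", auth=("neo4j", "secret")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        rows = [record.data() for record in session.run(TRIANGLE_QUERY)]
    driver.close()
    return rows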

What can you do with this.

When I do this (shown on the left), one of the first triangles that comes up shows tafamidis as the compound, amyloid neuropathies as the disease and transthyretin as the enzyme. How can we validate whether this is common knowledge or not?

We go to Wikipedia and we check. The text in the left box is taken from the Wikipedia page about tafamidis. One of the first sentences on the Wikipedia page states that tafamidis is the drug that is used to treat this amyloid neuropathy, and it’s caused by a transthyretin-related hereditary amyloidosis.

This is the validation, or one way of validating this, that with this approach, with this association, we have captured this.

We can do this again, and the second one that comes up (shown on the right) is another triangle and again the text is from the Wikipedia page. It talks about Canavan disease, and in this case, the compound is not the drug treatment, but it’s actually the accumulation of this compound that causes the disease.

The point I’m trying to make here is that we are learning more details about this relationship. In the one case, the compound played the role of the drug, and in this case the compound plays the role as the causative agent. The important thing is that we get this overall picture and that’s what we care about.

So we go to Wikipedia, and we check. Why wouldn’t we just load Wikipedia?

Why not just load Wikipedia

On the left above is the Wikipedia page for this compound that was at the top of the second triangle on the previous slide. This is the entry for N-Acetylaspartic acid, and as far as Wikipedia pages go, it’s a rather short page. And really the only thing that’s on this page is the relationship of this compound with Canavan disease and the enzyme on the previous slide.

Now using our graph, I can do a query for this compound, asking to rank order the diseases that are associated with it.

The results are shown in the box on the right: these are the top five diseases and the first one is Canavan disease. And since this is data from the PubMed literature, I can now ask for the supporting evidence for this association.

Shown in the lower box is one article that you can pull up, and directly in the title it says that this compound plays a role in Pelizaeus-Merzbacher disease, which I didn’t know.


When we analyze cellular assays, we are actually not that interested in diseases. However, I can basically take the same query that I had before and I just exchange the disease element with a cellular component, for example, and I get the results on the left below.

Graph query analysis

And if you remember, Canavan disease is a neural disease, so the components shown here are all related to the central nervous system.

On the right, I’m doing the same thing. I just replace cellular component with biological process, and here I get the list of biological processes that are associated with this compound. And for the analysis of cellular assays, these cellular processes are really something that we’re interested in. All of this comes from these mixtures.

So this gives us a very broad annotation or knowledge of what is in PubMed. This was the text mining I was talking about in which we are integrating heterogeneous data sets.

Other Data Sources in the Graph Database


The slide below shows what we have put into this graph database so far.

Other data sources integrated

The green ones, the top three, I covered earlier in my section on text mining. There are ten more sources.

The selection was done so that they should complement each other. We have toxicogenomics database in there. We have protein-protein interactions. We have systems biology and pathways, proteins and gene annotations.

All of this, we integrate and match up in terms of the identifiers and the objects that are in there. And we have about 30 million nodes that are now in this database. We’re also putting the articles in there. And the majority of these nodes are these articles.

For us, what is especially important is that we have about two million compounds. That, for us, is a really important number. But we’re not going to use this to identify new genes or anything like that. For us, this is all about relationships.

Relationship nodes in the graph.

We have about 91 different relationships and about 400 million total relationships. And the 91 different relationships, that’s just the richness of the biology and the underlying data that we’re getting.

Below you can see briefly what these relationships look like. Here are two examples: a protein phosphorylates a certain protein, and the compound affects the expression of this protein. We’re trying to be pretty broad here.

The Overall Build Process


Here is a little bit about the technical side.

Below is the infrastructure we have in place where the PubMed abstracts first go into the Mongo database (MongoDB), and that is really what is driving the text analysis.

The main reason we are using the PostgreSQL database in the middle is the existing data warehouses at Novartis, where work has already been done to pre-summarize data, internal and external. We can just do ETL to get this into the Postgres database.

And that is, for us, very fortunate. Many years went into these upstream data warehouses as you can imagine.

Graph build process.

To get data into Neo4j we use the CSV batch importer. At this point, we’re still figuring out exactly how we do the text mining and so on, so doing this staging through CSV files has worked very well for us.
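As a rough illustration of that staging step, the sketch below writes node and relationship CSV files using the header conventions documented for the Neo4j import tool (:ID, :LABEL, :START_ID, :END_ID, :TYPE). The fields and file contents are invented, and the exact import command varies between Neo4j versions, so check the import-tool documentation for your release.

# Sketch only: stage CSV files in the format expected by the Neo4j batch importer.
import csv

def write_node_file(path, compounds):
    # compounds: iterable of (compound_id, name) pairs pulled from the staging database.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["compoundId:ID", "name", ":LABEL"])
        for compound_id, name in compounds:
            writer.writerow([compound_id, name, "Compound"])

def write_relationship_file(path, associations):
    # associations: iterable of (start_id, end_id, uncertainty) triples.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([":START_ID", ":END_ID", "uncertainty:float", ":TYPE"])
        for start_id, end_id, uncertainty in associations:
            writer.writerow([start_id, end_id, uncertainty, "ASSOCIATED_WITH"])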

Using the Neo4j Database: An Example


What do we really use this for?

Here is one example: I talked about running these imaging assays and in a lot of cases, we know what we’re looking for, we want to see this phenotype, and we don’t want to see other phenotypes. So we can categorize compounds and categorize them as active or inactive.

Analysis of compound activities.

If we now want to analyze this data, we can do what we had done in the past, but now in graph form: we can look for other target annotations that we have for these compounds, which we call “target enrichment.”

Graph data target enrichment.

We have been doing this based on relational databases, and it’s hit or miss. But now we can go into the graph and run queries where we are saying, “What nodes can we reach for each active or each inactive compound within a certain distance?”

Graph run queries.

And this is where the uncertainty comes back. We’re trying to identify nodes that can be reached from the actives, but not from the inactives, or are much closer to the actives than they are to the inactives.

Identify nodes in the graph.

And the idea is that this is something that the actives have in common. It could be a gene, it could be a biological process, it could be a pathway that differentiates the actives from the inactives.

The slide below shows a little bit of the technicality here.

We run these queries and get rows of information. Every row gives, for a node and a compound, the distance between them, and we then pivot this into a distance matrix for every compound. We have an indicator column and the distances to these nodes, and if a node cannot be reached, we just set the distance to a really high value.

Earlier I mentioned we’re doing this for hundreds of thousands of compounds. This was an example given to me by a biologist and it has 34 compounds, so that makes it easy to display here.

Querying numerous compounds.

We then run a partitioning. We try to identify a cut that separates the actives from the inactives. In the example here on the right, we can find a cut that splits 12 of the 14 actives on one side, and all of the inactives on the other side.

The way these decision trees or partitioning methods work, they will also find other cuts that work almost as well. These are called surrogate splits, and we use those as well.
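The sketch below illustrates the pivot-and-partition step with pandas and scikit-learn. It is only an approximation of the approach described: scikit-learn’s decision trees do not report surrogate splits, so this version returns only the features used by the primary cuts.

# Sketch only: pivot (compound, node, distance) rows and find separating graph nodes.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

UNREACHABLE = 1.0e6  # the "really high value" used when a node cannot be reached

def pivot_distances(rows):
    # rows: (compound, node, distance) tuples returned by the graph traversal queries.
    frame = pd.DataFrame(rows, columns=["compound", "node", "distance"])
    matrix = frame.pivot_table(index="compound", columns="node", values="distance")
    return matrix.fillna(UNREACHABLE)

def find_separating_nodes(distance_matrix, is_active, max_depth=2):
    # Fit a shallow tree to find graph nodes whose distance separates actives from inactives.
    tree = DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(distance_matrix.values, np.asarray(is_active))
    used = tree.tree_.feature[tree.tree_.feature >= 0]  # feature indices used in splits
    return [distance_matrix.columns[i] for i in used]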

In this example we found two of these internal nodes that we’re looking for, as shown on the next slide.

Internal nodes in the graph.

These are now just the active compounds, and the blue are the compounds, the red are the nodes that are in this network, and in this case these are all genes. The yellow highlight shows the two central nodes that seem to determine if the compound shows the desired phenotype or not.

In this example, it turned out to be true.

The other reason why I’m showing this is that the relationships are colored by their data source.

The green links, which are all the connections that go from the compounds to this initial layer of nodes, are our internal data. The grey links, which are all the rest, come from the association rules.

Being able to see this is something we wouldn’t be able to do without integrating all of these data sources, and also doing the analysis from the text mining.

But is that really true? In this case, we wanted to find out what is going on with these association rules and whether we could have arrived at the same result without them.

Text mining associational rules graph

The objective here is to find the connections in the graph on the left above when the gray association rules are ignored. Every relationship is annotated with the data source that it’s coming from, so I can run the same query without the text mining.

The blue box, within which are just six nodes, is the focus. On the right side of the slide are the results of the new query, just focusing on the nodes in the blue box.

The six nodes with yellow circles are the ones that are in the blue box on the left. You can find a connection between them, but the connection looks much more complicated. If you sum up these uncertainties, it’s a much weaker statement.

So the role that the association results play is that they provide shortcuts across all of these underlying data sources. You can now take these as a shortcut to identify something, and then you can use this to drill down, to better understand what is going on.

This is where these abstracts come in, because most of this is coming from the literature via the database.

Drilling Down into the Association Rules


Once you have these association rules, you would then also like to know, what is the supporting evidence for them?

Test hypothesis graph

Above is the same query that I ran before, and all of these purple spheres are the articles that constitute the evidence for this association. Since we have them in the database, you can click on them, and you’ll see what is in that article.

The one that is shown at the bottom right of the slide talks about the same relationship that we identified, and it also talks about a compound that should behave similarly to the ones that we tested.

So not only did we find a hypothesis in this graph, but we also now have a way to test the hypothesis and see if the compound behaves the way that we would predict from this. We have not done this step yet.

Conclusion: A Reality Check on Biological Knowledge


I would like to do a little bit of reality check and circle back to the question of biological knowledge.

Biological knowledge reality check

Above is the result I showed earlier, and it is similar to the very first time we got this to work. At that point, we were also trying to identify connections between things.

So, the very first time we did this, we took the results and we went to a biologist and asked, “What do you think? Does that make sense?” And the feedback that we got was, “I knew that,” with a little bit of disappointment in there because the biologist wanted us to find a really novel insight he didn’t know about, something cool and earth shattering.

But that’s not how this is going to work in most cases.

You can find out how your compound is going to behave and get novel insights into why they do what they do, but it’s not always going to be a smash hit of earth shattering results.

But as I said about one of the first slides, when I described trying to capture this biological knowledge, it is very fuzzy and I don’t know exactly what it means.

But here we have something that is non-trivial to deduce from the data, and it’s mostly about relations between biological entities. And if you take this to a biologist and the biologist says, “I know that,” then that is biological knowledge.

That’s what we’re trying to capture in our graph database by combining the literature and these data sources.

Where This Research Is Going


Here is a list of things that we’re trying to do with this in the short-term.

We’re trying to do a better job with text mining. We have now started doing an analysis of full texts, not just abstracts, mostly for purposes of data mining internal documents.

We are also looking at this concept of uncertainty, which started with just picking a threshold to use, but there are a couple of things about it that we should be a little more rational about.

Where is this graph analysis going

What we don’t have at this point is a way of automatically updating the database with new additions to the library.

Every day there are new articles that are published and put into the library, and we don’t have an automatic process for getting them. And there’s always more data. What is not in there on purpose is a lot of the genomics data: genes, gene expression data, and so on.

Once we put that in there, it’s going to at least double the size of the database.


Inspired by Stephan’s talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Register for GraphConnect


Visualizing Healthy Lifestyles: 5-Minute Interview with Alicia Powers, SVP at NYCEDC

“The future of graph technology is already here. It’s everywhere, because we can model anything in a way that’s more close to how it is in real life,” said Alicia Powers, Senior Vice President at New York City Economic Development Corporation (NYCEDC).

For Alicia, it’s all about the visualization. While working as a researcher at the New York City Economic Development Corporation, she took on a side project to build a recommendation engine to see the different connections in eating and how they relate to health. By seeing the data patterns through the lens of Neo4j, she was able to make deeper connections to the people represented in her data.

In this week’s five-minute interview (conducted at GraphConnect New York) we discuss Alicia’s project, in which she seeks to understand food and eating by building a recommendation engine using Neo4j.



Talk to us about how you guys use Neo4j at New York City Economic Development Corporation.


Alicia Powers: Well, I came to Neo4j to work on a hobby project. I was really interested in understanding food and trying to build a recommendation engine. And I wanted to use Neo4j to help me visualize data and its connection to health. So that’s how I came to it. And I’ve been working on this project, my side project, for two years now.

Why did you choose Neo4j?


Powers: So what made Neo4j stand out to me was the ability to really see connections between different aspects of eating. So not only do you have a person, you have when they’re eating, how they’re eating, how much they’re eating. There are all these different points of data that you can use to make a recommendation engine.

And if you’re using, say, a SQL database or even a document database, it’s really hard to start to see the patterns visually. And I’m a visual learner. I love pictures. And Neo4j presents the data in the way that people actually see the data, experience the data, live the data.
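As a rough sketch of the kind of model Alicia describes – a person, what they ate, when, where and how much – here is some illustrative Cypher. The labels, relationship types and properties (Person, Meal, Food, ATE, INCLUDES) are assumptions made for this example, not her actual schema.

// Illustrative only: one person and one logged breakfast.
CREATE (p:Person {name: 'Ana', age: 34})
CREATE (m:Meal {type: 'breakfast', location: 'home', time: datetime('2018-06-01T08:00')})
CREATE (f:Food {name: 'oatmeal', calories: 150})
CREATE (p)-[:ATE {portion: 1.5}]->(m)
CREATE (m)-[:INCLUDES]->(f);

// A simple recommendation: breakfast foods popular with people of a similar age.
MATCH (me:Person {name: 'Ana'})
MATCH (other:Person)-[:ATE]->(:Meal {type: 'breakfast'})-[:INCLUDES]->(food:Food)
WHERE other <> me AND abs(other.age - me.age) <= 5
RETURN food.name AS suggestion, count(*) AS popularity
ORDER BY popularity DESC
LIMIT 5;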

What were some of the most interesting or surprising results you’ve had while using Neo4j?


Powers: Neo4j actually makes the data very personal to me because I can see the person, and I can see the characteristics. And I know their age, and I know their weight, and I know their height. I know what they eat for breakfast. I know where they eat breakfast.

So I really seem to understand the people in my data differently than just a tabulation of the average calories they consume. And so I feel like it’s brought me a lot closer to the project and a lot closer to the people in the data.

If you could start over with Neo4j, taking everything you know now, what would you do differently?


Powers: So if I were going to start the project over from the beginning, knowing everything I know now, I actually probably would rely on help from Neo4j more than I did at the beginning.

I kind of just dove in and worked, reading different things. There’s a Slack channel, and a community – there are lots of people you can talk to, and they’re extremely friendly. So I would’ve relied on that more probably early on in the process.

Learn how Alicia Powers, Senior Vice President at New York City Economic Development Corporation, uses Neo4j to build a recommendation engine.

What do you think the future of graph technology looks like?


Powers: The future of graph technology is already here. It’s everywhere, because we can model anything in a way that’s more close to how it is in real life. Neural networks, the connections that help people connect to one another in an actual social network – all of these things are graphs. Everything can be a graph.

And so when I come to GraphConnect, it’s really clear by the presentations that graph databases can be used for any single domain that there is. And I think, as far as the future, it’s just a matter of more organizations, more companies, using them and using them well.

Anything else you want to add or say?


Powers: I really can’t stress enough how the graph community is so kind and nice. Sometimes in the tech community, people are not as friendly or open. And so this particular conference – this is my second conference – I’ve found people to be extremely open and very passionate and excited about graphs. Which, I am too, so it kind of works out.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Level up your recommendation engine:
Learn why a recommender system built on graph technology is more powerful and efficient with this white paper, Powering Recommendations with Graph Databases – get your copy today.


Read the White Paper

The post Visualizing Healthy Lifestyles: 5-Minute Interview with Alicia Powers, SVP at NYCEDC appeared first on Neo4j Graph Database Platform.

The Next Generation of Service Assurance: Your Network Is a Graph

Service complexity is exploding.

Communication Service Providers (CSPs) need a complete view of their network and its myriad interdependencies to drive real-time decisions and predict the impact of changes on the user experience. A native graph approach makes sense of complex networks and drives innovation, bringing new services from prototype to production.

The next generation of service assurance, learn how your network is a graph.

In this series, we examine challenges in optimizing network services and how graph database technology overcomes them. Last week we covered trends driving major advances in service assurance.

This week, we’ll explain how graphs are ideal for modeling networks and show how relational database management systems (RDBMS) can’t keep pace with the interconnectedness and scale of today’s networks.

The Network Is a Graph


Telecommunications and enterprise networks are hyper-connected ecosystems of components, services and behaviors. And yet critical information is often siloed – from physical items (such as end devices, routers, servers and applications) to activities (like calls, media and data) and customer information (rights and subscriptions).

A graph data model reveals cross-domain dependencies with a single unified view of the infrastructure and topology. Breaking down silos is at the heart of next-generation service assurance (NGSA) solutions.

Discover how a graph data model solves service assurance challenges.
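To make that unified view concrete, here is a minimal Cypher sketch of a network modeled as a graph. The labels and relationship types (Router, Server, Service, Customer, DEPENDS_ON, CONNECTED_TO, SUBSCRIBES_TO) are illustrative assumptions, not a prescribed NGSA schema.

// One slice of topology spanning the usual silos: devices, services and customers.
CREATE (r:Router {id: 'edge-01', site: 'LON-2'})
CREATE (s:Server {id: 'app-07'})
CREATE (svc:Service {name: 'IPTV'})
CREATE (c:Customer {id: 'C-1001'})
CREATE (svc)-[:DEPENDS_ON]->(s)
CREATE (s)-[:CONNECTED_TO]->(r)
CREATE (c)-[:SUBSCRIBES_TO]->(svc);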

Advantages of a Native-Graph Approach


Neo4j, the world’s most relied-upon graph platform, naturally captures relationships between data and therefore easily models network and service complexity. This emphasis on connections sets it apart from relational databases (RDBMS), which require careful schema development and complex JOINs to map multiple levels of relationships – and which become inflexible when new data is added.

Neo4j is specifically designed to store and process such multi-dimensional associations.

Learn the difference between a graph database and traditional data storage methods.

Service assurance requires performance and predictability, such as monitoring real-time end-user experience to drive automated responses. Because of its native graph storage, Neo4j thrives when querying such complexity at scale, easily outperforming RDBMS and NoSQL data stores.

To deal with this type of processing, RDBMS approaches often resort to batch processing or pre-computed schemes, which cannot deliver the immediate results these use cases demand.

Advantages for IT and Network Operators Using Neo4j


Network administrators, system administrators and application operators are able to readily recognize and understand the actual impact of maintenance activities and avoid surprising users downstream. They rapidly diagnose failures by correlating and tracing user complaints back to the application and infrastructure or cloud service where they are hosted.
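Continuing the illustrative model sketched earlier in this post, a query along these lines – with the same assumed labels and relationship types – traces a complaining customer back down to the infrastructure their service depends on, which is the shape of query behind this kind of diagnosis:

// Which infrastructure components sit beneath the service this customer uses?
MATCH path = (c:Customer {id: 'C-1001'})-[:SUBSCRIBES_TO]->(:Service)
             -[:DEPENDS_ON|CONNECTED_TO*1..4]->(component)
RETURN c.id AS customer,
       [n IN nodes(path) | coalesce(n.id, n.name)] AS dependencyChain,
       labels(component) AS suspectComponent;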

Advantages for Telecom Operators Using Neo4j


Engineers and operators may create a singular view of operations across multiple networks at once, including cell towers, fiber lines, cable, customers and consumer subscribers or content providers. They improve customer experience by minimizing the impact of system maintenance or outages, such as being able to re-route services during an unexpected interruption, or identifying and preemptively upgrading vulnerable servers based on their maintenance history and availability.

Conclusion


As you can see, your network is a graph. Optimizing network services requires a real-time holistic view of network operations that only a graph database can provide.

Next week, we’ll explore how to rapidly build unique service assurance solutions – and how top telecom providers use Neo4j to differentiate their offerings.

Innovate and scale:
Find out how top CSPs use connected data to capitalize on new market opportunities in this white paper, Optimize Network Services: Advanced Service Assurance with Neo4j. Click below to get your free copy.


Read the White Paper


Graph Databases for Beginners: Why Connected Data Matters

“It’s not what you know, it’s who you know.”

Sound familiar?

Depressingly, we all know it’s true: The boss’s son has a better shot at that corner office than you do. That’s because in business – and in life – relationships matter a whole lot more than individual skills or competencies.

Here’s the kicker: Why don’t we think about our data in the same way?

Even as the volume of discrete, individual data points increases (and will continue to do so), the real value – the real bottom-line, business-defining ROI – comes from the connections between the data that’s collected.

Got 2 million Facebook likes? Great. Got 2000 committed shoppers to your ecommerce site? Super. But are you connecting who likes your Facebook page with who’s most likely to shop next? Are you drawing the relationships (or even able to draw the relationships) between Facebook promotions and purchased items from your most loyal shoppers?

2k, 2M or 2B – the number doesn’t matter if you’re not making the connection.

We live in an ever-more-connected world, and these critical data relationships (a.k.a. connected data), will only increase in years to come. If your team is going to succeed, you’ve got to leverage those connections for all they’re worth – but you’ll need the right technology.

With so many systems built on relational databases or aggregate NoSQL stores, you may not know of a third option that outperforms them both: graph databases.

Learn the critical business value of data relationships and why they matter more than discrete data


In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. Last week, we tackled why graph technology is the future.

This week, we’ll discuss why data relationships matter – and how that realization affects your next choice of database.

The Irony of Relational Databases


Relational databases (RDBMS) were originally designed to codify paper forms and tabular structures, and they still do this exceedingly well. (There’s a reason they continue to dominate the database market for that use case.)

Ironically, however, relational databases aren’t effective at handling data relationships, especially when those relationships are added or adjusted on an ad hoc basis.

The greatest weakness of relational databases is that their model is too inflexible. Your business needs are constantly changing and evolving, but the schema of a relational database can’t efficiently keep up with those dynamic and uncertain variables.

To compensate, your development team can try to leave certain columns empty (tech lingo: nullable), but this approach requires more code to handle the greater number of exceptions in your data.

But the relational data model isn’t the only challenge: Performance matters too. Even worse, as your data multiplies in complexity and diversity (and it always will), your relational database becomes burdened with large JOIN tables which disrupt performance and hinder further development.

JOINs aren’t too bad if you’re only making two to three hops across tables, but once you start to make multiple hops (don’t think four, but fourteen or forty), then your RDBMS is doomed. In fact, the results may never fully calculate.

Unfortunately, your end-users can’t wait for never. They’re probably going to click or swipe away after more than two seconds (a lot less than never). Your database needs to meet – or exceed – their expectations. For connected data, an RDBMS isn’t up to the task.

Consider the example relational database model below.


An example relational database (RDBMS) data model

An example relational database model where some queries are inefficient-yet-doable (e.g., “What items did a customer buy?”) and other queries are prohibitively slow (e.g., “Which customers bought this product?”).

In order to discover what products a customer bought, your developers would need to write queries with several JOINs, which significantly slow the performance of the application.

Furthermore, asking a reciprocal question like, “Which customers bought this product?” or “Which customers buying this product also bought that product?” becomes prohibitively expensive. Yet, questions like these are essential if you want to build a proper recommendation engine for your transactional application.
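For contrast, here is roughly what both directions of that question look like in Cypher against a simple, assumed (Customer)-[:BOUGHT]->(Product) model. Both queries walk the same relationship, so the reciprocal question costs no more than the original one.

// What did this customer buy?
MATCH (c:Customer {customerId: 123})-[:BOUGHT]->(p:Product)
RETURN p.name;

// Which customers bought this product, and what else did they buy?
MATCH (p:Product {sku: 'ABC-1'})<-[:BOUGHT]-(c:Customer)-[:BOUGHT]->(other:Product)
WHERE other <> p
RETURN c.name AS customer, collect(DISTINCT other.name) AS alsoBought;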

At a certain point, your business needs will entirely outgrow this current database schema. The problem, however, is that migrating your data to a new RDBMS schema becomes incredibly effort-intensive.

Now imagine your business 10, 20 or 50 years from now. How often do you imagine your data model will need to evolve in order to match the changing needs of your business? (Hint: A lot.)

Why NoSQL Databases Don’t Fix the Problem Either


NoSQL (or Not only SQL) databases store sets of disconnected documents, values and columns, which in some ways gives them a performance advantage over relational databases. However, their disconnected construction makes it harder to harness connected data properly.

Some developers add data relationships to NoSQL databases by embedding aggregate identifying information inside the field of another aggregate (tech lingo: they use foreign keys). But joining aggregates at the application level later becomes just as prohibitively expensive as in a relational database (i.e., there’s no free lunch).

These foreign keys have another weak point too: they only “point” in one direction, making reciprocal queries too time-consuming to run. If you can’t imagine a scenario where you’d ever want to ask a reciprocal question, then you’re not thinking big enough.

Developers usually work around this reciprocal-query problem by inserting backward-pointing relationships or by exporting the dataset to an external compute structure, like Hadoop, and computing the result with brute force. Either way, the results are slow and latent.

Graph Technology Puts Data Relationships at the Center


When you want a cohesive picture of your big data, including the connections between elements, you need a graph database. In contrast to relational and NoSQL databases, graph databases store data relationships as relationships. This explicit storage of relationship data means fewer disconnects between your evolving schema and your actual database.

In fact, the flexibility of a graph data model allows you to add new nodes and relationships without compromising your existing network or expensively migrating your data. All of your original data (and its original relationships) remain intact.

With data relationships at their center, graph databases are incredibly efficient when it comes to query speeds, even for deep and complex queries. In their book Neo4j in Action, Partner and Vukotic describe an experiment comparing a relational database and a graph database (Neo4j!).

Their experiment used a basic social network to find friends-of-friends connections to a depth of five degrees. Their dataset included 1,000,000 people each with approximately 50 friends. The results of their experiment are listed in the table below.

A performance experiment run between relational databases (RDBMS) and the Neo4j graph database

A performance experiment run between relational databases (RDBMS) and Neo4j shows that graph databases handle data relationships with high efficiency.

At the friends-of-friends level (depth two), both the relational database and graph database performed adequately. However, as the depth of connectedness increased, the performance of the graph database quickly outstripped that of the relational database. It turns out data relationships are vitally important.
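For reference, the depth-five query from that experiment looks roughly like this in Cypher (the label, relationship type and property name are assumptions); the equivalent SQL needs a self-join for every additional level of depth.

// How many distinct people are within five FRIEND hops of Alice?
MATCH (p:Person {name: 'Alice'})-[:FRIEND*1..5]-(acquaintance:Person)
WHERE acquaintance <> p
RETURN count(DISTINCT acquaintance) AS peopleWithinFiveHops;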

This comparison isn’t to say NoSQL stores or relational databases don’t have a role to play (they certainly do), but they fall short when it comes to connected data relationships. Graph technology, however, is extremely effective at handling connected data.

And for mission-critical insights and nimble business agility, connected data matters.


Want to dive deeper into the world of graph database technology? Click below to get your free copy of the O’Reilly Graph Databases book and learn how to apply graph thinking to your biggest connected data challenges.

Get My Copy of the Book



Catch up with the rest of the Graph Databases for Beginners series:

Neo4j Launches Commercial Kubernetes Application on Google Cloud Platform Marketplace

On behalf of the Neo4j team, I am happy to announce that today we are introducing the availability of the Neo4j Graph Platform within a commercial Kubernetes application to all users of the Google Cloud Platform Marketplace.

This new offering provides customers with the ability to easily deploy Neo4j’s native graph database capabilities for Kubernetes directly into their GKE-hosted Kubernetes cluster.

Learn about Neo4j's new commercial Kubernetes application on the Google Cloud Platform Marketplace.

The Neo4j Kubernetes application will be “Bring Your Own License” (BYOL). If you have a valid Neo4j Enterprise Edition license (including startup program licenses), the Neo4j application will be available to you.

Commercial Kubernetes applications can be deployed on-premise or even on other public clouds through the Google Cloud Platform Marketplace.

What This Means for Kubernetes Users


We’ve seen the Kubernetes user base growing substantially, and this application makes it easy for that community to launch Neo4j and take advantage of graph technology alongside any other workload they may use with Kubernetes.

Kubernetes customers are already building some of these same applications. Using Neo4j on Kubernetes, a user can combine the graph capabilities of Neo4j with an existing application – for example, an application that generates recommendations by looking at the behavior of similar buyers, or a 360-degree customer view that uses a knowledge graph to spot trends and opportunities.
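As a hedged sketch of the second example – a 360-degree customer view over a knowledge graph – a query along these lines could run against the Neo4j instance in the cluster; the Customer label, id property and surrounding model are assumptions for illustration only.

// Summarize everything directly connected to one customer, grouped by relationship and node type.
MATCH (c:Customer {id: 42})-[r]-(related)
RETURN type(r) AS relationship, labels(related) AS connectedTo, count(*) AS howMany
ORDER BY howMany DESC;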

GCP Marketplace + Neo4j


GCP Marketplace is based on a multi-cloud and hybrid-first philosophy, focused on giving Google Cloud partners and enterprise customers flexibility without lock-in. It also helps customers innovate by easily adopting new technologies from ISV partners, such as commercial Kubernetes applications, and allows companies to oversee the full lifecycle of a solution, from discovery through management.

As the ecosystem leader in graph databases, Neo4j has supported containerization technology, including Docker, for years. With this announcement, Kubernetes customers can now easily pair Neo4j with existing applications already running on their Kubernetes cluster or install other Kubernetes marketplace applications alongside Neo4j.


Ready to get started with your one-click Neo4j deployment to Google Kubernetes Engine?

Let’s Get Started

Don’t Choose One Database… Choose Them All!

Editor’s Note: This presentation was given by Dave da Silva at GraphConnect Europe in May 2017.

Presentation Summary


Capgemini, a global leader in consulting, technology and outsourcing services, believes that using multiple databases improves your ability to unlock the business benefits of data.

While one database might be good for certain tasks, not all databases are the right solution in every case.

The data scientists at Capgemini believe you should start with one database that partially meets all your analysis needs, such as SQL, then add in other databases such as graph or NoSQL to clear up any bottlenecks that may arise.

Later in this post, you’ll see examples of using multiple databases to solve a business problem. We’ll then explore the pros and cons of using multiple databases.

Full Presentation: Don’t Choose One Database… Choose Them All!


Capgemini’s Dave da Silva introduces the idea of using multiple databases to fulfill all of your data analysis needs.



Capgemini is a huge global company. For this blog, however, we’re going to explore the work we do within the U.K. data science team – a team of around 50 U.K.-based data scientists.

We work very closely with our colleagues overseas – especially in Northern Europe, in North America and in India – but we’re very focused on the U.K. clients specifically. We get all shapes and sizes, but our clients tend to be quite large organizations.

Check out a depiction of Capgemini's database challenge.

A client comes to us and says, “We have lots of data; we’ve been gathering it for many years now. The data sits in our nice relational databases, but we wanted to use that data to gain insight, and to reap the business benefit. Come and help us do that.”

Those are the sort of questions we get. They’re not well-formed analytical problems. They’re more questions about discovery or conducting experiments. And that poses an interesting challenge for us. It’s definitely interesting in that we’re not mandated to take a certain approach or find a certain answer, but it is certainly challenging.

It’s very hard to design a solution when you have an amorphous problem or poorly defined problem. That’s really driven us down to this approach to say, “We can’t select the best database from the beginning because we’re probably going to be wrong. But actually, even if we do have a very well-defined problem, there probably isn’t one database that meets that analytical need perfectly.”

You might find one or two that meet the analytical need quite well. In some cases, it might just be one database. But more often than not, it’s quite a wide problem and you want to be able to implement the best technologies to help to address the challenge.

Discover how and why you should use multiple databases to suit your needs.

The above image provides a quick overview of some of the big database solutions out there. We’re all familiar with those. We’ve then got Hadoop, which has emerged over the last decade as a way to start looking at really large data sets. Of course, you’ve got graph databases.

In this image, we’ve also got the appearance of more in-memory databases and in-memory analytical approaches. Again, they are good for very fast ad hoc querying, but they present challenges as well. And then the other big group to highlight above is Lucene, which powers complex free-text search.

These are just five data points in that whole landscape. There are hundreds more, however, and there are many hybrids sitting between these broad categories. So this image gives you a bit of a snapshot into what these technologies look like when you’re a data scientist, and when you’re looking out there at what helps address the analytical problems clients pose to us.

Learn about challenges of using the wrong database for the wrong use case.

Strengths and Limitations of Using Different Solutions


What are the strengths and limitations of these different solutions?

SQL has been around for years, and will be around for years. It’s very good and powerful, and handles fairly complex queries. However, these queries grow and grow and grow, and as a result get slower and slower, especially as you start to do multiple joins across large databases.

When trying to do analysis and follow a train of thought through the data, it really breaks that up if you’re having to write these long queries and wait many minutes or hours to execute and get a solution.

We’ve all been in the situation where we’ve fired a query off at 5 o’clock in the evening, come in the next day, and it’s crashed overnight or it’s not quite what we’ve been looking for. So then we lose another day of productivity and get an increasingly pissed off client at that point.

We then go on to our free-text queries. With SQL databases you can use regexes to do some nice fuzzy logic around trying to find those features within the unstructured data fields. But it often proves quite challenging and time-consuming to write those queries and get your regexes working perfectly – and they’re quite expensive queries to run on the database.

There are better solutions out there, such as Lucene-based indexes like Elasticsearch. Then we go down to the in-memory databases. These are lightning fast, but as the name suggests you need a lot of memory. As your data set gets bigger and bigger, if it won’t fit into memory any more, you’ve got to be clever about how you select data to put into memory, or you need to just go out and buy more hardware. There are obvious implications to that, and so in-memory databases are not a silver-bullet solution to your database woes.

Finally, you’ve got a whole range of other NoSQL solutions out there – the Hadoops of this world – and a whole range of derivatives that are very good if you’re finding specific key values. But again, these solutions have their limitations and are often limited to batch-type queries rather than the real-time querying we all aspire to do.

Discover why you might not want to use multiple databases.

Why Not Just Use Multiple Databases?


Our solution is very straightforward and it’s slightly pointing out the blindingly obvious.

Rather than trying to find the perfect database that does all of the things you require of it – it’s cheap, it’s performant, it’s reliable, it has the community behind it, and all those good things – we quickly came to the conclusion the perfect database doesn’t really exist for every analytical problem out there. But rather, each database exists for very specific analytical problems or specific use cases.

But when you’re trying to achieve a range of things with a single client or multiple clients, you’re setting yourself up to fail if you try to find the perfect database. You end up with lengthy discussions with colleagues, with the community and in meetups with clients about this database versus that database, the pros and cons, and you end up disagreeing. It doesn’t really go anywhere.

So we said, “Well, let’s just embrace that sort of ethos, and go for as many databases as needed to resolve that problem.”

We’re saying, “Well, actually, if you’ve got a problem of trying to traverse data in a graph traversal, then get a graph database to do it. Don’t try and hack your SQL database to do it or optimize your SQL database to do something it’s not really designed to do. Just embrace the technology that is designed to do that.”

If your problem also needs fuzzy text search on top of that, again, you may try and hack Cypher or hack SQL to make it do that for you – and it will do it – but it will be creaking along by then. At that point, it’s time to bring in a database that is designed and optimized for that kind of search.

Databases are not data science – they’re components of it. So on top of those individual databases, the next layer up is our database APIs.

So again, how are you connecting with those databases? How are you writing data to them and reading data from them? Most of the common databases have multiple APIs so it’s not a limiting factor here. You’re not going to be caught in a situation where you can’t actually access your database.

Then we’ve got a step above that again. So we’ve now got the data science technologies, the typical languages – your Pythons, your R, Spark and various other data science languages – you might bring to bear in this particular problem. You’ve also got your graphical user interfaces – your SASs, SPSS, etc. They all sit on top of this and you’re using those APIs that connect to all the different databases. You might find you have one or more technologies that connect to one or more databases.

Ultimately, you’ve got your databases containing either your whole data set duplicated or components replicated across multiple data stores. But you’re probably going to need a master version of the truth. And that’s particularly pertinent in very regulated industries. We’ll hear from clients, “Tell me what the authoritative version of this data is.”

So, of course, you may well have a master data source sitting at the bottom – which one would depend on your use case. We’re specifically not saying what that master data store should be, because it would depend on your combination of other databases and the nature of that store. Personally, I’ve always found that a good old relational database generally fits that purpose, but again, it depends on your use case.

Examples of Using Multiple Databases


Let’s move on to a business example: your typical car insurance fraud example.

Learn about multiple databases via a car insurance fraud example.

Now we’re going to bring together multiple database technologies to produce a table or a vector that you might feed into more advanced analytics.

Let’s start with an SQL database, everyone’s favorite.

SQL databases are great for joins – bringing in other fields of data and joining them to your original source data. For our fraud example, we might start with someone’s credit score or how many previous applications they’ve made – just a simple integer. Their income might also be a characteristic we want to look at when we’re trying to assess their propensity to commit fraud or have problems with their car insurance. These typical joins are great for an SQL database.

But then you want to do multiple traversals. For example, who is a friend of a friend of a friend, who shares the same address or same historic addresses. Or maybe certain individuals have been connected to the same vehicle.

These are queries that would require two or three or four inner joins on a SQL database, and this is a very cheap query to do in a graph database. So that’s where you really start to bring in the graph technologies alongside your SQL technologies.

A typical field you might wish to generate from your car insurance fraud detection [or related] example is the links to bad applicants. So, who have we seen in the past who’s committed fraud or has been a problem for the insurance company in the past? Are they within two or three steps of our current applicants?

Discover why fraud detection could require multiple databases.
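A hedged Cypher sketch of that “links to bad applicants” feature might look like the following; the Applicant label, the LIVES_AT, USES_PHONE and REGISTERED_TO relationships and the flaggedFraud property are all assumptions made for this example.

// Is the current applicant within a few hops of anyone previously flagged for fraud,
// via shared addresses, phone numbers or vehicles?
MATCH path = (a:Applicant {id: $applicantId})
             -[:LIVES_AT|USES_PHONE|REGISTERED_TO*1..4]-(other:Applicant)
WHERE other <> a AND other.flaggedFraud = true
RETURN other.id AS knownBadApplicant, min(length(path)) AS hops
ORDER BY hops;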

Let’s review the NoSQL database.

This is where you might start to bring in things like telemetry data. A lot of insurance companies in the U.K. are now starting to encourage their users to fit these little dash-top boxes to their vehicles. It registers things like how aggressive you are when you accelerate or brake, for example, generating loads of machine data worth analyzing and incorporating into your fraud example.

It’s those types of queries against that lovely machine-generated but very vast data set where a NoSQL database really comes into its own. We just called it behavior normalcy in the graphic below. So we want to know: are they an aggressive driver? Are they a passive driver? We’re adding this to our vector to use for our detailed analysis later.

Discover how NoSQL can help in complex queries like fraud detection.

And finally, we arrive at text data. We’re big fans of the team at Elasticsearch – they have a great database. We’re doing many searches against free-text data: looking at chat logs with insurance companies, or maybe comments applicants have provided within their application, where they’ve tried to describe an accident they’ve had in the past or explain away a speeding ticket.

When trying to analyze that data, to bring it into your application, again, use a database that’s really configured to do that sort of thing and it will do it fast. Because remember, what we’re trying to do here is for the end users. They want to be able to go onto the company’s website, tap in all their details and get a response instantly. They can’t afford to wait 24 hours. You’ve already lost that customer by that point.

Learn about text search and multiple databases.

Bringing together just these four technologies in this particular example, you’re now able to generate an input vector, or feature vector, which you might then feed into machine learning – a clustering approach, say, or other advanced analytics – to try to assess the propensity of this applicant to cause you problems as an insurer.

Bring together all the facets of fraud detection and database technologies.

Pros and Cons of Multiple Databases


Let’s talk about the benefits and costs of using multiple databases, as a sort of penultimate slide.

There are pros and cons to this approach. I’m not claiming that you should immediately embrace having multiple databases, because there is a cost to doing that. But before we go onto the cost, let’s talk quickly about the advantages.

So starting at the bottom of the graphic below, we have productivity. This may seem like an obvious one, but if you can maintain momentum as you’re doing analysis – as you’re trying to feed results out to users, to customers, etc., where they’re getting web performance, where they click, look for a result and it appears within milliseconds or seconds – that’s really what we’re aspiring toward.

You simply won’t achieve that in the majority of applications with a single database, because it’s optimized for one purpose and you’re trying to use it for multiple purposes.

Learn about the benefits and costs of using multiple databases.

Next, let’s think about insights potential.

What that means from our perspective is there are things you simply can’t do in a relational database, or in a graph database, that you can do in others. You might get to a point in your analysis, or in your products, where you say, “We simply can’t add that feature because our underlying database isn’t capable of doing it. It’s a complete dead-end.”

You can spend hours and hours trying to hack it to make it do that for you, but ultimately some things just don’t work. The example we hear all the time for graph is four or five or six hops from your first node. In a large relational database, you’ll simply run out of memory or time, and you’ll have a whole avenue of insights there you don’t have access to.

Pragmatically, as a data scientist, you say, “I could spend three or four days trying to make this regex work reliably and quickly, or I could not do it and take a bit of a shortcut with an alternative approach that will take me a couple of hours.” We’ve all got to be pragmatic when we’re looking at these solutions, especially when you’re working on a deadline for clients.

So actually, by giving our data scientists a whole range of tools to draw upon, you’ve suddenly greatly increased their ability to find these insights within the time available.

The final big benefit is governance.

I know it’s probably the most boring one, but this is the real killer. If you’re trying to productionize your systems, governance will be the thing – the red line – that prevents that from happening. Having multiple databases is a very good way of segmenting the queries your users are carrying out, and potentially of segmenting your data as well.

So you might say, “Well, we’ve got this whole range of BI users over here who just need to access these sorts of databases to generate their dashboards.” So let’s put all that in a relational database, give them access to that, connect Tableau to that (or whatever tool you might be using, Spotfire, etc.), and it’s really very separate and distinct from the rest of your data at that point.

Then you might say, “We’ve then got the data science users who want access to everything.” Fine, do that for them, but make sure it’s a very specific and limited group of users who have superuser power. So by having separate databases, it’s very clear-cut where access is to those different data resources. That makes your governance model very black and white, which is a huge benefit once you’re trying to productionize these systems and reassure security you’re not going to cause a data breach for your organization.

There are a lot of positives, but we’ve got to talk about negatives and give a reasoned argument to this.

So, of course, increased IT spend is a negative. Not all of these databases are free. Even with Neo4j, the full enterprise versions are paid. So potentially, as you start to bring in multiple enterprise databases, you have to pay for multiple licenses. Depending on how the databases and applications are licensed, this may scale quite nicely or it may be horrible to scale.

Again, it depends on which ones you use. Of course, databases don’t tend to play well when you have multiple databases on the same server. You might need additional hardware to service these different databases. So there is a cost to doing it in terms of your hardware and software.

Another negative is integration complexity. You’re looking at a tangled web of databases with multiple connections going to multiple applications, all of which you’re trying to secure and maintain access to for your users. It can become a bit of a nightmare, so you’ve got to bear in mind there’ll be an increased cost of building the system and an increased cost of maintaining it as well, in terms of integrations.

Finally, diverse skill sets are needed.

If you have an SQL database, say an Oracle database, then you need an Oracle DBA and a few people who know how to use it. It’s a very specific, very bounded skill set. But as soon as you start to bring in three, four or five databases of different flavors – and potentially very different flavors if you’re bringing in things like Hadoop, for example – the picture changes.

Suddenly, the range of skills you call upon for your team is increasing as well. You’ve got the interfaces between them, so you need people who are able to configure Kerberos across all of these different databases. It’s a huge challenge in terms of manning that team.

Summarizing the use of multiple databases.

The key point here is that we found, within the data science team at Capgemini, that these multiple databases give us a greater ability to unlock business insights. And that’s really what you’re trying to do here.

You’re not trying to create a playpen for data scientists. You’re trying to create a system that allows you to get business benefit for your clients. As a bonus, though, you’re going to have happier data scientists who do now have a bigger playpen. You’ll also have happier end users who get faster, more performant systems and they do their jobs far more easily without having to wait for queries to execute overnight.

That said, there is a cost. You may find a multiple database approach actually fails for smaller projects. If you’ve only got a team of two developers and you’ve got a couple of weeks to put together a discovery activity, standing up five different database types is probably going to be prohibitively expensive. Or it’s going to require skill sets that don’t exist within your two-developer team.

However, when you get to larger engagements and larger projects, and certainly as you move towards live solutions, that cost-benefit balance starts to shift. The advantages of multiple databases start to outweigh the costs of implementing them.

I’d recommend working iteratively, starting small. You might say, “Well, we can use a relational database that meets 90 percent of our needs. Let’s start with that.” Prove some business benefit and prove where performance bottlenecks are.

If you are doing multiple inner joins, and you’re finding that’s the query that’s killing your database, that really then gives you the justification to move on and say, “We need to bring in a graph database alongside that to take the weight off the relational database at that point.”

Iterate in this manner, adding one database at a time and testing to see if the system meets your performance needs.


Inspired by Dave’s talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Get My Ticket

Metadata for Real People: 5-Minute Interview with Pieter Visser, Solutions Architect at the University of Washington

With metadata, everything’s connected. That’s what makes it interesting, said Pieter Visser, Solutions Architect at the University of Washington (at the time of the interview).

At the University of Washington, the IT team serving the business tried more than one tool to provide end users with a way to find out about all the data at their disposal. Those tools failed at connecting all their metadata and handling the ever-changing schema of the university’s data. To serve end users, the team built its own metadata tool using Neo4j.

In this week’s five-minute interview (conducted at GraphConnect New York) we discuss what inspired Pieter and his team to build this tool, as well as the many uses they are finding for metadata.

Check out this 5-minute interview with Pieter Visser on graph databases and metadata.

Talk to us about how you guys use Neo4j at the University of Washington.


Pieter Visser: I work for the University of Washington’s IT department. We’re a unit that’s specific to the business side of the university, not the student side. There are a lot of people using Neo4j for research, but that’s not what we do.

We use Neo4j in a couple different ways, but the main way we’re using it is as a metadata repository that stitches together information for the enterprise data warehouse and our BI tools.

With metadata, by definition, everything’s different yet everything’s connected. That’s what makes it interesting. It’s how a table is connected to a term, how a column is connected to a report, or how it’s being used.

When we tried to do that on a relational database, connecting those relationships was almost impossible. You think you have it defined, and all of a sudden someone says, now we want to add a cube to this with this many dimensions, and I want to connect that dimension to a different term. Relational databases can’t do that.

What made you choose Neo4j?


Visser: We tried a couple of different tools. We purchased a cloud-based tool, and we purchased some other tools, and none of them really worked right for us. I feel like it’s the Swiss Army knife kind of thing. You get a Swiss Army knife, but if you ever try to use it as a screwdriver, you just want a real screwdriver or a real hammer. We decided we wanted to make our own metadata tool.

Neo4j gave us the ability to basically connect any node to any other node and then show that visually. It gives us the context of our metadata.
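As a minimal, assumed illustration of what connecting a table to a term to a report looks like in Cypher (the labels and relationship types here are hypothetical, not the university’s actual model):

// A column belongs to a table, is described by a business term, and feeds a report.
CREATE (t:Table {name: 'student_enrollment'})
CREATE (col:Column {name: 'enrollment_date'})
CREATE (term:Term {name: 'Enrollment Date'})
CREATE (rpt:Report {name: 'Quarterly Enrollment'})
CREATE (t)-[:HAS_COLUMN]->(col)
CREATE (col)-[:DESCRIBED_BY]->(term)
CREATE (rpt)-[:USES]->(col);

// Where is this business term actually used?
MATCH (term:Term {name: 'Enrollment Date'})<-[:DESCRIBED_BY]-(col:Column)<-[:USES]-(rpt:Report)
RETURN term.name AS term, col.name AS columnName, rpt.name AS report;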

What else made Neo4j stand out?


Visser: Our main goal was to make it easy for the end user. Metadata tools, by definition, are usually written for metadata managers. They’re really just not easy to use. While metadata managers love them, end users say, “I don’t understand this at all.” With Neo4j, end users can quickly visualize and get context for metadata. It is fantastic.

Can you talk to me about some of the most interesting or surprising results you’ve had while using Neo4j?


Visser: I think what is interesting now is, even outside of that project, to just look at that data, and see what we can do with the data that we’ve collected. We mix it with other data.

For example, we mix it with security information and create a semantic discretionary access control (DAC) layer. Or we mix it with organizational structure and do report recommendations based on org structure. So it’s a lot more than just the metadata as you start mixing in all the kind of data sources, and it tells a different story based on the same data.

If you could start over with Neo4j, taking everything you know now, what would you do differently?


Visser: I think I would change our data model. We added versioning right at the beginning. That complicated things significantly for us because, all of a sudden, you can no longer just traverse your nodes. You can’t simply say, “Go from this table to that table.” You have to say, “Go from this version of this table to that version of that table.” And that complicated all our Cypher queries tremendously.

That’s kind of the bread and butter of Cypher, right? Just to quickly say, “Show me the shortest path to that thing.”

If you have versions, you say, “From this active version to that active version.” And if I want to trace the lineage of something, I can no longer just traverse my trail. I have to traverse the trail and then figure out which version of this trail to use to go to the next one.

That’s not a Neo4j thing. It’s just that the data model that we designed complicated things and probably went against the best way to use Neo4j.
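To illustrate the complication Pieter describes, here are two hedged Cypher sketches – one against an unversioned model and one against a versioned model. The labels, the FEEDS and HAS_VERSION relationships and the isActive flag are assumptions made purely for this example.

// Without versioning: lineage is a plain variable-length traversal.
MATCH path = (src:Table {name: 'staging_orders'})-[:FEEDS*]->(dest:Table {name: 'orders_mart'})
RETURN path;

// With versioning: every single hop must also pick the right (active) version,
// and that extra step repeats for each additional level of lineage.
MATCH path = (src:Table {name: 'staging_orders'})-[:HAS_VERSION]->(:Version {isActive: true})
             -[:FEEDS]->(:Version {isActive: true})<-[:HAS_VERSION]-(dest:Table {name: 'orders_mart'})
RETURN path;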

What do you think the future of graph technology looks like for metadata?


Visser: It’s about the UI. I know we saw some tools for UI this morning, but I don’t really think it’s enough yet. I feel there will have to be much better UI tools that allow the end user to do an analysis.

Today we saw a textual Cypher query, something that converts text to Cypher. The user still has to understand the context, and they have to understand what they’re asking if they use it that way.

I would like to see a much more visual way to query. And I don’t mean by graph, going from this node to that node. Imagine you had Cypher snippets on the left and you say, “Well, I’m going to drag these snippets and connect them and get a brand-new result from that.” And then you say, “This snippet is everything I want to exclude. And this one is what I want to use to boost my results.”

Those are the kind of tools that will be much more user-friendly for my end users.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Want to learn more on how relational databases compare to their graph counterparts? Get The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.

Get the Ebook

The Next Generation of Service Assurance: Differentiating Your Solution

Successful companies are embracing next-generation service assurance that leverages a comprehensive, real-time view of services and infrastructure with an eye on end-user experiences, new service creation and predictive modeling.

But to compete in today’s market, communication service providers need a flexible foundation for their next-generation service assurance solutions. They need agility and speed to innovate and the capacity to find connections at scale in large and growing datasets.

Differentiate your solution to achieve next-generation service assurance.

In the first post of this series, we examined challenges in optimizing network services. Last week, we described how leading companies overcome those challenges using graph database technology.

In this final post, we’ll explain how graph technology empowers you to rapidly build next-generation service assurance solutions, including a glimpse into how firms like Zenoss, Cisco, Orange and Telenor use Neo4j.

Quickly Build Unique Service Assurance Solutions with Neo4j


Neo4j is a highly scalable, native graph platform that delivers real-time insights into data relationships. Incredibly fast writes of dynamic topology data and lightning-fast traversals mean you can give your customers the ability to make decisions at the pace of their business.

Neo4j naturally stores, manages and analyzes data within the context of connections. With the flexibility provided by Neo4j, and its schema-less model, you continually improve your network solutions of all types by accommodating new data sources and formats – without a rewrite of your data model.

Built-in high availability features ensure network and subscriber data is always available to your mission-critical service assurance solution. Data is integrated into a Neo4j cluster and then modeled and queried based on its connections, creating a foundation for crafting advanced capabilities that power your solutions for next-generation service assurance.

Differentiate Your Solution


Graph technology offers ways to set your next-generation service assurance solution apart:

    • Augment and enhance existing applications to better leverage data relationships and integrate siloed information. Illustrate your ability to make sense of complex networks.
    • Capitalize on new market opportunities by creating products and services that uncover hidden patterns and insights.
    • Provide performance at scale with native-graph technology designed to query highly connected data and improve response times from minutes to milliseconds when compared to relational databases.
    • Offer the assurance of partnering with the graph industry market leader, with over 15 years delivering hundreds of deployments and 24×7 production applications. Leverage enterprise capabilities that include:
        • High-performance caching
        • Enterprise lock manager
        • Clustering
        • Hot backups

Go to Market Faster

A flexible, powerful graph platform like Neo4j enables you to build your solution faster:

    • Prototype faster with data models that reflect real-world business models, as opposed to relational databases that require time-consuming programming and JOINing to relate data.
    • Complete rapid proof-of-concept projects by taking advantage of our proven methodologies to ensure you get the most out of integrating Neo4j. We provide expertise, training and standard or customized workshops to ensure partner success.
    • Quickly execute when it is time to take concepts into production with the simplicity of storing all data elements and relationships within a native graph database, as opposed to the complexity and overhead of implementing solutions layered on top of a relational store.
    • Quickly iterate and expand your solution with our highly flexible data model that enables you to easily add, remove or change data elements and sources more efficiently – without changing the database schema, impacting performance or requiring downtime.

What Companies Are Doing with Neo4j


Zenoss

Discover how Zenoss uses Neo4j.

Zenoss is a leader in hybrid IT monitoring and analytics software, providing complete visibility for cloud, virtual and physical IT environments.

Neo4j is used by the Zenoss Service Impact solution, which maintains dependencies between IT services in real time, enabling more precise root cause analysis and minimizing downtime.

Zenoss monitors 1.2 million devices and 17 billion data points a day, and more than 60 million data points every five minutes.

Cisco

Discover how Cisco uses Neo4j.

Since the company’s inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies, and today the company sells hardware, software, networking and communications technology services.

Cisco uses Neo4j for content management, master data management and as a solution and OEM partner. According to Cisco’s Peter Walker, graph queries using Neo4j beat relational database management systems (RDBMS) hands down: “We tried doing our work with RDBMS, but found that the queries were just going too slowly.”

Orange

Discover how Orange uses Neo4j.

Orange is one of the world’s leading telecommunications operators, with over 269 million customers, including 208 million mobile customers and 19 million fixed broadband customers. Orange is also a leading provider of global IT and telecommunication services to multinational companies.

Orange uses Neo4j for security insights and overall infrastructure monitoring. Orange’s Nicolas Rouyer put it this way: “We use Neo4j to find security issues in our information systems and to give us a fresh perspective on IT and a bird’s-eye view of all its components.”

Telenor

See how Telenor uses Neo4j.

Telenor Norway is the leading supplier of the country’s telecommunications and data services. Using Neo4j, they provide businesses and residential customers with a self-service portal that brings together information about corporate structures, subscription information, price plans and owner/payer/user data, billing accounts and any discount agreements.

Telenor’s Sebastian Verheughe cites Neo4j’s performance and flexible schema: “Neo4j’s high-performance engine provides flexibility of data representation along with features that go beyond traditional relational databases.”

We’re Here to Help


We work closely with customers and partners to help you quickly integrate and even extend Neo4j so you are able to create distinctive and sustainable solutions.

You’ll also lower your overall costs by embedding Neo4j advanced graph capabilities for less than the cost of implementing and maintaining traditional relational database solutions. And you’re encouraged to take advantage of our flexible pricing models so you may align your cost of goods to your business model.

We are 100% dedicated to graph-based solutions and helping our partner ecosystem thrive. We look forward to collaborating on innovative solutions. Please contact your account representative or reach out at oempartners@neo4j.com for more information on how we can work together.

Conclusion


Neo4j is an ideal foundation for next-generation service assurance solutions.

Unlike other technologies, Neo4j is designed from the ground up to store and retrieve data and its connections. Relationships are first-class entities in a native graph database – making them easier to query and analyze.

Neo4j’s versatile property graph model makes it easier for organizations to evolve solutions as data types and sources change.

Neo4j’s native graph processing engine supports high-performance graph queries on large user datasets to enable real-time decision making.

The built-in, high-availability features of Neo4j ensure your user data is always available to your mission-critical next-generation service assurance solution.

This concludes our series on the next generation of service assurance. We hope these blogs have inspired you to explore Neo4j as the foundation for your applications.

Innovate and scale:
Find out how top CSPs use connected data to capitalize on new market opportunities in this white paper, Optimize Network Services: Advanced Service Assurance with Neo4j. Click below to get your free copy.


Read the White Paper


Catch up with the rest of the service assurance blog series:

Graph Databases for Beginners: The Basics of Data Modeling

For six-ish months of my life, I was a database developer.

Starting out, the first thing I learned was data modeling. Our team was using a relational database (RDBMS), specifically MySQL (we later switched to Postgres). Like a lot of backend developers at the time, we didn’t intentionally choose to use an RDBMS; it was just the default (that’s no longer the case).

Of course, that meant my lessons in data modeling would follow the relational data model – and not to spoil the ending – but it sucked.

This isn’t to say that the RDBMS model is always bad (it isn’t) or that it always sucks (it doesn’t). But when it’s used as the one-size-fits-all data model for every project and application under the sun, well, there’s going to be a lot of mismatch.

The good news: The relational model doesn’t have to be your default.

Other data models exist, and they are awesome. Today, we’re going to take a closer look at one in particular – the graph data model – and walk you through a better first-time data modeling experience than I originally had.

Learn the basics of data modeling in addition to why it matters (like a lot) which data model you choose


In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve covered why graph technology is the future and why connected data matters.

This week, we’ll discuss the basics of data modeling for graph technology.

(Psst! If you’re already a data modeling vet, check out this article on how to deploy your seasoned skills to a graph database model.)

What Is Data Modeling Exactly?


Data is like water. It’s probably useless if you don’t put it in a helpful container. The shape, size and functionality of that container depends on your intended use, but in general, a container is necessary.

The same goes with data. When it comes to creating a new application or data solution, you need to provide a structure for that data. That structuring process is known as data modeling.

Often reserved solely for senior database administrators (DBAs) or principal developers, data modeling is sometimes presented as an esoteric art unknowable to mere mortals. You may worship the expert data modeler from afar.

While some data modeling scenarios really are best left up to the experts, it doesn’t have to be difficult by default. In fact, data modeling is as much a business concern as a technological one. So if you don’t know a single line of code, you’re in luck.

Anyone can do basic data modeling, and with the advent of graph database technology, matching your data to a coherent model is easier than ever.

A Brief Overview of the Data Modeling Process


Data modeling is an abstraction process. You start with your business and user needs (i.e., what you want your application to do). Then, in the modeling process you map those needs into a structure for storing and organizing your data. Sounds simple, right?

With traditional database management systems, modeling is far from simple.

After whiteboarding your initial ideas, a relational database requires you to create a logical model and then force that structure into a tabular, physical model. By the time you have a working database, it looks nothing like your original whiteboard sketch (making it difficult to tell whether it’s meeting user needs).

On the other hand, modeling your data for graph technology couldn’t be simpler. Imagine what your whiteboard structure looks like. Probably a collection of circles and boxes connected by arrows and lines, right?

Here’s the kicker: That model you drew is already a graph. Creating a graph database from there is just a matter of running a few lines of code.
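To make that concrete, here’s a minimal Cypher sketch – the labels, property and relationship type are illustrative assumptions, not a prescribed schema – that turns two whiteboard circles and an arrow into stored data:

    // Two whiteboard circles and an arrow, expressed directly as a graph
    CREATE (dc:DataCenter {name: 'DC-East'})
    CREATE (app:Application {name: 'CRM'})
    CREATE (dc)-[:HOSTS]->(app)

Run against a Neo4j instance, those few lines already give you a queryable graph that mirrors the sketch.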

Relational vs. Graph Data Modeling: A Match-Up


Let’s dive into an example.

In this data center management domain (pictured below), several data centers support a few applications using infrastructure like virtual machines and load balancers.

An Entity-Relationship (E-R) diagram of a data center domain

A sample model of a data center management domain in its initial “whiteboard” form.

We want to create an application that manages and communicates with this data center infrastructure, so we need to create a data model that includes all relevant elements.

Now, for our match-up.

The Relational Data Model

If we were working with a relational database, the business leaders, subject-matter experts and system architects would convene and create a data model similar to the image above that shows the entities of this domain, how they interrelate and any rules applicable to the domain. It would require a lot of back and forth as well as a lot of what-if thinking trying to plan for every possible exception or rule-breaking (i.e., model-breaking) instance.

It would be a long meeting.

From there, a senior DBA would create a logical model from this initial whiteboard sketch before mapping it into the tables and relations you can see below.



The relational database version of our initial “whiteboard” data model. Several JOIN tables have been added just so different tables can communicate with one another.

In the diagram above, we’ve had to add a lot of complexity into the system to make it fit the relational model. First, everywhere you see the annotation FK (tech lingo: foreign key) is another point of added complexity. And if you’re not a seasoned sysadmin, I’ll let you in on what you should think when you hear “complexity”: shit will break more often.

On top of all this, new tables have crept into the diagram such as AppDatabase and UserApp. These new tables are known as JOIN tables. (“JOIN” is written in all caps by industry convention, but it’s also a great visual aid to think of JOIN tables as shouting at you. They’re shouting because they’re difficult to work with.)

I hate to be the bearer of bad news, but JOIN tables significantly slow down the speed of a query (and imagine how many queries will be running through your 24×7, mission-critical enterprise application. Yeah, lots.). Unfortunately, they’re also a necessary evil in the relational data model.

The Graph Data Model

Now let’s look at how we would build the same application with a graph data modeling approach. At the beginning, our work is identical – decision makers convene to produce a basic whiteboard sketch of the data model (pictured again below for reference). But there’s a key difference to this meeting: They get out early and enjoy a few hours of extra jet ski acrobatics (like anyone would).

Why’s that? Because with the graph data model, they didn’t have to plan for every possible expansion, exception or fire hazard that might affect the database. Today’s meeting was just a starting point, and if something comes up later the model is adaptable. No sweat.

An Entity-Relationship (E-R) diagram of a data center domain

Our sample model of a data center management domain (again).

Refreshed from their jet ski session, our data modelers return to the next step in the process. After the initial whiteboarding process, everything looks different. Instead of altering the initial whiteboard model into tables and JOINs, they enrich the whiteboard model according to their business and user needs.

That’s right: The data model gets better, not worse.

After enrichment, here’s what the newly enriched data model looks like after adding labels, attributes and relationships:


A graph data model example of a data center management domain

Our enriched graph data model with added labels, attributes and relationships.

As you can see, the enriched data model isn’t that much different than the initial whiteboard sketch, except that it’s, you know, more helpful. In fact, this data model is now ready to load into a graph database (such as Neo4j!), because with graph technology what you sketch on the whiteboard is what you store in the database.
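To illustrate – again with hypothetical labels and relationship types rather than the exact schema pictured above – loading a slice of that enriched model and asking an impact-analysis question in Cypher might look like this, with no JOIN tables in sight:

    // A hypothetical slice of the enriched data center model
    CREATE (app:Application {name: 'Customer Portal'})
    CREATE (vm:VirtualMachine {name: 'vm-101'})
    CREATE (srv:Server {name: 'rack7-blade3'})
    CREATE (app)-[:RUNS_ON]->(vm), (vm)-[:HOSTED_BY]->(srv);

    // Impact analysis: which applications are affected if this server fails?
    MATCH (:Server {name: 'rack7-blade3'})<-[:HOSTED_BY]-(:VirtualMachine)<-[:RUNS_ON]-(app:Application)
    RETURN app.name;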

Bottom line: The only thing standing between you and your completed data model is an EXPO marker and a blank whiteboard.

Why Data Modeling Isn’t a One-Off Activity (No Matter What Database You Use)


It’s easy to dismiss the major differences in data modeling between relational and graph databases. After all, data modeling is just an activity you have to complete once at the beginning of your application development – right? Wrong.

Let’s go back to story time (yay!): RDBMS data modeling was rough for a liberal arts grad, but then it got worse (boo!).

While we were still in the whiteboarding and brainstorming phase, changes were easy to make to our data model. Of course figuring out which relationships had to be one-to-one and which ones had to be one-to-many wasn’t always easy, but executing those changes was a breeze. After the whiteboarding phase, not so much.

Once we’d plugged and chugged our whiteboard model into Postgres, changes were a lot more difficult. Schema migration is literally no one’s favorite database activity (probably that guy just skipped to the comments). And once the database was live and in production, my answer to any proposed changes: fuggedaboutit.

Of course, the suits didn’t forget about it. They still needed changes, because the user needs were constantly changing. And business requirements changed too, constantly. That’s because, life alert: change happens.

Why would anyone assume that change wouldn’t happen to their data model? Wouldn’t it just be better to use a data model that accepted change as a fact and prepared for it, instead of digging in its heels and bracing for the inevitable?

Conclusion: Change You Can Believe In


Systems change, and in today’s development world, they change often. In fact, your application or solution might (read: will) change significantly even in mid-development. Over the lifetime of your application, your data model constantly shifts and evolves to meet changing business and user needs.

Relational databases – with their rigid schemas and complex modeling process – aren’t a good fit for rapid change. What you need is a data model that doesn’t sacrifice performance and that supports ongoing evolution while maintaining the integrity of your data.

Now that you know the basics of data modeling, the choice is clear.

If you’re creating an application with a well-understood, minimally changing data model, stick with the tried-and-true relational database. Seriously, just stick with what already works.

But maybe your path is leading you somewhere else. Maybe you’re creating something new. Maybe you’re trail-blazing into uncharted territory. Maybe you can’t plan for a database with all the right answers, because you don’t even know the questions users are going to ask it.

If this describes your next project, then you need a data model that’s agile. You need a data model that evolves alongside development (without breaking down or lagging behind). You need a graph data model.

The future is uncertain (you can count on that). Choose a data model that matches that reality.


Want to dive deeper into the world of graph database technology? Click below to get your free copy of the O’Reilly Graph Databases book and learn how to apply graph thinking to your biggest connected data challenges.

Get My Book



Catch up with the rest of the Graph Databases for Beginners series:

Telia Zone: Scaling Neo4j to Millions of Homes with Kubernetes

Discover how Telia Zone uses Neo4j for their connected home platform.
Editor’s Note: This presentation was given by Rickard Damm and Lars Ericsson at GraphConnect Europe in May 2017.

Presentation Summary


Telia Zone is built on Telia’s home router, which is used in approximately 1.5 million homes in Sweden. Telia uses Neo4j with Kubernetes to connect to all of these homes, hosting causal clusters in Kubernetes to graph all the different actions that take place in and through the routers.

Telia is now expanding the use cases possible with the router, and is using the graph to explore what those use cases might be, so the company can continue to build out capabilities for the router beyond basic Internet connectivity.

Some of these additional use cases could include connecting Sonos speakers to Spotify for multiple users within a home and texting parents when their children arrive home. All of these actions create instances that are tracked within the graph database to ensure things like not overloading servers and identifying which users are in certain zones at the same time. These use cases are enabled by Telia’s APIs.

On the router side, more than 1.5 billion requests per day can take place. With causal clustering with Neo4j, Telia is doing a phased rollout rather than pushing this to all homes at once.

Full Presentation: Scaling Neo4j to Millions of Homes with Kubernetes


Rickard Damm: The following is how Telia’s Telia Zone router is integrated with Neo4j, with Kubernetes to scale to the million-plus homes in Sweden that use Telia. I’m the head of the product for Telia Zone.



About Telia


Telia is the leading incumbent carrier in the Nordic market. We are in the telecom space and are a quad player, meaning we have both mobile and fixed networks, serving both enterprise and consumer customers. One of the areas where there’s been the least innovation on our side of the business is the broadband side.

On the broadband side, everybody generally understands what the box does: it’s a router, it does Wi-Fi, so it terminates your fiber connection and gives you a Wi-Fi signal.

In Sweden, we have a lot of fiber, but we also have ADSL. We have roughly one million of these routers distributed in Sweden, so this blog is focused specifically on Sweden; however, we’ll also be deploying Telia Zone outside the Swedish market.

So, we have a million of these. I saw them as an underutilized asset. We have a really great customer relationship when it comes to TV service, so roughly 60 or 70 percent of our users also subscribe to our TV service. That’s one way to interact. But hardly anyone has been innovating and doing things with the router in many, many years. We started thinking about it and decided to try and see what capabilities are there.

Expanding Telia Zone’s Use Cases


Roughly half of our install base subscribers are age 55 or older. If we want to make something smart for the home – a smart home platform, which the Telia Zone is – we have to abstract quite a few levels above Neo4j. It has to be a bit simpler.

We unpacked our router and thought, “What is inside of here? Could we add something? Yes, I think we can.” So, we are adding stuff to your router. We are going to do two things for you: We are going to help you simplify your life, and we are going to help you entertain yourself together with the others in your home. Those are our two core value propositions.

How do we do that? Well, we do that by introducing services or building services, allowing other companies to build services for you based on a context. We are adding the context of the home – when you enter the home, when you leave the home, etc.

We packaged it as the Telia Zone and are deploying it to everybody in Sweden who has a Telia router. It’s neither opt in nor opt out. This is something that is going to be rolled out to everybody – roughly one million homes. The idea here is that we want people to be so proud of being a Telia subscriber that they put a sticker on their mailbox: “This is a Telia Zone home.”

Instead of explaining this as a technology – because at the end of the day, it’s a new technology that we have introduced – let’s look at it in terms of the use cases.

Essentially, your broadband connection is getting a new life. And your connected home – the Telia Zone – is the centerpiece of your connected home. But we are not trying to scare away users by trying to push a connected home platform onto them. Instead, we are inviting them to just start experimenting with our service.

For instance, you can receive a message when the kids come home – this is a really good feature. Without an app running, without anything running on the phone, you can get an SMS when the kids come home. I have three kids, and this is a magical feature.

How about the lights turning on when you come home? How about the music changing when you get home? Or when you leave the home, and you forgot to lock the door, how about we give you a message there and remind you that you forgot to lock the door? Those are things the Telia Zone does.

We built a playlist generator together with Spotify (Coplay is the app). This is the Sonos moment, the 9:00 p.m. Sonos moment where – if you have a Sonos system and you have a party with 10 people – the phone controlling the Sonos playlist goes around the table, and people start changing songs to their favorites.

How about if we generate that playlist automatically based on who is connected to the Wi-Fi? Coplay generates a playlist in Spotify, which dynamically removes you if you leave the house. If you come back, it adds your songs again. That’s what it does. One last thing to note about customer experimentation: We also made an IFTTT (If This Then That) channel. Many of you likely know what IFTTT is. For the 55-years-plus consumer, most people don’t know what IFTTT is, but we’re trying to introduce this a little bit as a way to experiment with your connected home.

Check out Telia Zone's connected home playlist app.

The Telia Zone is something that lives inside our partners, so we don’t have a Telia Zone app. We don’t have our own experience. Instead, we are integrating inside partner experiences.

One solution we have out there is Glue Lock, a retrofit smart lock that turns the knob on the lock, so to speak. This solution works with reminders. We also have smart home integrations, like the Nest thermostat running geothermal heating, and a few others. We have curated them to see and calibrate what customers are interested in.

Now we’re arriving at why we use a graph database. We actually don’t know yet what the killer use cases are for the Telia Zone. We realized that we probably have to build it as a much more open platform than a closed platform.

So we have an API today. The API is found on premiumzone.com. This is the English-speaking website where you can, as a developer, go in and look at our APIs. We have four APIs.

Learn about Telia Zone's API features.

Working with Telia Zone’s APIs


So what can you do with our APIs?

We’re telling our developer community that they can see when somebody comes home and when somebody leaves the home – that is, when a registered device comes into or leaves the zone, which we deliver via WebHooks. The API also lets you identify devices as clients within a zone, and then you can get all the other clients that are in the zone.

Those are the basic APIs. They may not sound like much if you are not building apps, but we have about 150 developers in our system today.

The last thing is that you can authenticate clients in a zone.

So the last one, Spotify, is super interesting. In this use case, Spotify is run by the mother of the family, and it’s the children sharing the account with their parents. And perhaps there are battles over this one account. What happens when the kid starts playing? You get a message on your phone that somebody is playing on the same account.

If you are using Telia Zone home accounts or authentication, you’re able to start proposing things to users that you could never do before. You can see the MAC addresses and the registered devices, automatically generate new accounts for your users, and then later on start filling in the required information.

In the background, using our API, you get access to unique identifiers for every single device connected to the service, generate an account for each of them automatically, and then populate that information afterwards.

How and Why We Built on Neo4j


Lars Ericsson: Let’s get into the technology on how this actually works.

So first, just a small rundown on how the infrastructure is built up and how we are hosting Neo. Then we’ll go back into how we are scaling Neo4j and what we use Neo4j for.

Basically, we run everything on Google Cloud Platform. We’ve created a micro-service architecture out of Node.js apps. We’re using the Neo driver for Node and we’re hosting a causal cluster within Kubernetes. We also have some other services.

This is a simplified version of our architecture.

Check out Telia Zone's API architecture with Neo4j.

It is really easy for us to scale within Kubernetes, and it is also really easy for us to upgrade. I actually upgraded to a Neo4j 3.2 just before we went up on stage, in development at least.

So if we have a user connecting, what is actually happening on the backend?

Well you, of course, connect to the router. The router sends a message to our backend, to our Node applications, where we then distribute the data out to our storage solutions. What you see in the graphic above is that we are actually running more storage solutions than Neo4j. They each have different use cases.

We’re using Cloud SQL to hold a state of all our routers out there. And then, we’re building the graph database on what’s actually important to us, and what’s important to our applications to make good decisions.

What happens after we push that into our storage solutions is that we also notify you, as a third party, with a WebHook saying, “This client of yours actually has connected to this zone.” (Router is the same thing as a zone in our topology, at least for now).

So what happens then? What if you want to ask what clients are in a specific zone? Which of my users are in this specific zone? As a third party, you make the request to our API. We go to Cloud SQL to fetch the current state of that zone, but then we go down to Neo4j.

What we want to deliver back to you is who the users are. Not only the MAC addresses that we use internally in our system; we want to deliver something more valuable back to you. So we want to deliver the actual usernames, tokens or some identifier for your users. We store all of that information in Neo4j.
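As a rough sketch of what that lookup could look like in Cypher – the labels, properties and values here are assumptions for illustration, not Telia’s actual schema:

    // Which devices are in this zone, and which third-party user do they map to?
    MATCH (z:Zone {id: 'zone-123'})<-[:CONNECTED_TO]-(d:Device)-[:REGISTERED_AS]->(u:AppUser {app: 'coplay'})
    RETURN d.mac AS device, u.externalId AS userToken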

Before we actually know that one of these users is a specific user in your system, first you need to register that user with us. This is where we really need the hardware to do this, because how we identify users is first by the MAC address and then make an API request from within a zone.

That request goes through the router, and then we look at your IP address and know what MAC address it is.

Before you have done this last step, we won’t actually expose any devices to you – this is a super important privacy feature, of course, for all devices out there. You need to activate a device first and have something activating that device before we will provide information back to you.

So scaling this, Sweden is a pretty small country, but it’s still a lot of routers. What we have seen so far is that each Wi-Fi device generates around 100 connects or disconnects, changing the state on your Wi-Fi 100 times a day. That is 100 requests per day, per device. In each home we have seen so far, on average, you have around 11 Wi-Fi connected devices. So this adds up to around 1,100 requests per day, per home. Still not a lot of load, but as we scale this up to over a million routers, it starts to add up.

All of this also requires write operations, and that is why we are moving over to causal clustering.

This only accounts for the traffic that actually sends the update up to our system. The bigger load that we actually need to build for is what all the third parties are interested in. So just on the router side, over 1.5 billion requests per day.

We really like Kubernetes and we really like Neo4j. Neo4j’s causal clustering is working out really well for us.

As we scale this, we are doing a phased rollout rather than pushing this to all homes at once, so our hosting can adapt to how many users we have. We don’t want to over-provision our servers to handle every request that might come; we want servers that handle the requests that are actually coming in.

See how Telia Zone works with Kubernetes.

What Is Kubernetes and How Does It Work with Neo4j?


For those who don’t know about Kubernetes, it’s open source software that comes from Google that basically lets you create a cluster to run Docker containers in. Kubernetes does a lot, and it fits really well with Neo4j.

We can divide clusters into development and production. But how does that fit with Neo4j?

Well, we do node selection. We have servers in our clusters, and they are not all the same. We need specific servers or specific hardware where we want to run Neo4j. And within Kubernetes, we target specific servers that we want to run Neo4j on. Why is that important?

Well, since we do this dynamically, we want to be able to make sure that all future instances of Neo4j also end up on these types of nodes, not only the ones that we’ve manually installed.

Learn how compatible Kubernetes and Neo4j are.

Another feature that works really well for us, together with causal clustering, is StatefulSets. What StatefulSets in Kubernetes do is guarantee a certain order in which your nodes start. As we want to scale up, we guarantee that some nodes – and some instances of Neo4j – are already there before we start scaling. We just set up the initial cluster and are then able to scale directly.

Autoscaling is super important for us as well. As the load increases, we scale up to more replicas of Neo4j. What happens is, at one point or another, you grow out of your cluster. The resources in the cluster aren’t enough anymore. So then, we actually also autoscale servers.

So why are we using Neo4j for this?

Well, it’s actually already a graph. And also, the scalability is super important to us. Being able to scale horizontally without actually knowing what the load might be tomorrow is super important to us. We have a data model that is constantly changing.

Discover how Kubernetes works in the Telia Zone.

How Telia Zone Works with the Neo4j Graph


So just a quick look at the graph database.

Everything centers around the zone. And to the zone, we tie relationships with devices. The devices at your home will have a strong connection to the zone at your home. Then you run apps in that zone, so apps get a relationship.

Say you run Spotify at home or you let Spotify run for anyone in your home. Well then, Spotify gets a relationship to both your device and your zone. Of course, we have different zones and other zones also run that same app. Devices might even go between zones.

Maybe you go visit a friend, you get a relationship to that zone as well, and that’s how we build up that graph.

Also, we have multiple apps. Apps are related to other devices, and the graph keeps on growing.

What do we need this for then?

We want to be able to ask questions like, “What other devices are running this app?” Or, “What kind of relation did this device have to this other zone?” Or, maybe we even want to look at predictions.
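Questions like those map naturally onto Cypher. As an illustration only – the labels and relationship types below are assumptions, not the exact production model – the first two questions might look roughly like this:

    // What other devices are running this app?
    MATCH (:App {name: 'Spotify'})<-[:RUNS]-(d:Device)
    RETURN d.mac;

    // What relation does this device have to this other zone?
    MATCH p = (:Device {mac: 'aa:bb:cc:dd:ee:ff'})-[*..2]-(:Zone {id: 'zone-456'})
    RETURN p;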

Check out Telia Zone's Neo4j graph visualization.

Doing business intelligence on this, looking at when you’re most likely to adopt a new application for something is super valuable data to us, and this is why we use Neo4j.

Check out Neo4j and Telia Zone.



Looking Ahead


Rickard Damm: To wrap up, I would like to peek a little bit into the future.

The Telia Zone is a consumer offering. By definition, we’re a consumer brand first and foremost, as Telia is in Sweden. On the consumer side, we want to expand and try to find new services that are very relevant. (That’s one part of the predictions.) The other part is that we’re also externalizing the intelligence we gather as a B2B offering. That is the next thing that is going to happen down the road.

This is just one graph of a subset of households where we have connect and disconnect events – so people leaving, people coming. People are very predictable. We’ve seen that, with just a very simple prediction algorithm, we can quite accurately predict when people are, for instance, arriving, leaving or at home.

To give a glimpse of a simple use case, take the example of a food delivery app – a home shopping app for groceries. You just select your items, and then you get a suggested date for when your order can be delivered.

Say I have a bag here that’s to be delivered for 500 crowns, like €50, next Monday. That’s usually what the user experiences look like on these apps. With the premium zone technology or the Telia Zone brand, we can enrich this quite a lot to become super granular and increase the customer value proposition even more.

In this case, we would have a suggestion for a delivery time – say, Monday at 5:30 in the afternoon – that we could propose to the delivery company or the payment company. We don’t have to give them the entire graph. We don’t have to give them the entire dataset. We just suggest one time when we believe that this person will be at home. It’s, of course, up to the user – the user has to consent.

When we show this to companies like the postal service and other delivery services, they go absolutely wild about these ideas, because they have never had anything like this before. It’s still on the customer’s terms – you still have to accept – but the accuracy will be much higher, and customer satisfaction becomes a delightful, surprising experience.

We can also deliver very interesting insights. For instance, we made a run on what was the Christmas present of the year. We had a few thousand zones rolled out over Christmas, so I asked the team, “Which new devices were activated after 3 o’clock?”

So, 3 o’clock is usually when people start opening their presents in Sweden. We cross-referenced the vendors of those MAC addresses, and we saw that, for instance, iPhones made up 68.8 percent of all the new devices. So you can do super interesting analysis on the dataset that we generate – analysis that is totally unique and that nobody in our business has thought about before.
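As a hedged sketch of what that Christmas analysis could look like in Cypher – assuming hypothetical firstSeen and vendor properties on each device, with the vendor derived from the MAC address prefix:

    // New devices first seen after 15:00 on Christmas Eve, grouped by hardware vendor
    MATCH (d:Device)-[:CONNECTED_TO]->(:Zone)
    WHERE d.firstSeen >= '2016-12-24T15:00:00'
    RETURN d.vendor AS vendor, count(DISTINCT d) AS newDevices
    ORDER BY newDevices DESC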

The underlying technology, and all the API descriptions, are found under the premiumzone brand at premiumzone.com. We’re positioning this as the technology layer, and we are expanding outside of our footprint sooner rather than later, for sure. We have inbound requests from quite a few other operators as well. I would love for this to become some sort of standard.


Inspired by Rickard and Lars’ talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Get My Ticket

Why 20 Women Gathering in the Swedish Mountains Might Just Change the World

Pink Programming camp sponsored in part by Neo4j, a camp for female data scientists.
As a Swedish Generation Xer, I was brought up in the spirit that men and women were created equal, and that gender differences were a social construct that was out of fashion. In Senior High, my technology teacher warned us girls that we should be prepared for difficult times as female engineers; I judged him as an oldie who was hopelessly behind the times.

At technical university, I calculated that across specializations, 25% of the students were women. In my mind, this would translate to a work environment that was 25% women. Not bad at all. And by the unstoppable force of development, that number would naturally increase until there was an approximate gender balance in my field.

Once out of university, I quickly realized this was by no means the case. I hadn’t paid attention to the fact that a slow trickle of young people into existing hierarchies makes any change slow as molasses, and that those in power were Baby Boomers with a whole different frame of reference when it came to gender roles.

As a female engineer, I was the definite odd one. As an opinionated female engineer, I stuck out like a sore thumb.

Pink Programming camp sponsored in part by Neo4j, a camp for female data scientists.

What’s the Real Challenge?


Fast forward to 2018: I have learned that progress is even slower than I thought.

Female enrollment in computer science is still only around 5%. In technical universities, the total number is closer to 30%, but the overall female enrollment declined last year. And among those who graduate, as many as 50% are likely to leave the profession.

This is at a time when the industry is screaming for skilled people, and the future for a developer is brighter than ever. Why do women pass on this opportunity?

After 20 years in IT, I have concluded there are no objective reasons for those numbers. I have not yet seen a challenge that was too great simply because I was a woman. On the contrary, I see ample opportunities to grow at a pace that is right for me, and into areas that I find interesting.

What is challenging?

    • The lack of female role models
    • The lack of female colleagues
    • Being a parent and a professional (unrelated to being a woman)

Focus on Solutions…


When I heard of Pink Programming I got curious.

They are an organisation with the mission to create an inspiring environment for all women regardless of age and background. They promote female role models, create a network for like-minded women and generate curiosity while sharing the joy of programming and helping women to take the first step. They do it in the form of coding camps and programming Sundays with exclusively female instructors and attendees.


Pink Programming created programming camps for women and transgender people.

It took me some time to get beyond the questions, “Whatever would be the reason of hanging out with only women?” and “There are so many nice men; why exclude them?” I bet there is a ton of research about being in a minority situation I ought to have read in order to understand the dynamics I have experienced, but I haven’t yet. However, the feeling of joy and calm confidence when I first experienced a room full of women with a shared interest in programming is a fact.

My immediate thought was: “My company should be there as well!” I made formal contact with Pink Programming and started a conversation with my managers. The pros of a partnership were compelling:

First: Expanding our company networks into networks that have women in them will help us hire more female engineers. Think about it: How can anybody expect to find women to hire if they are only looking in places where there aren’t any?

Second: Spread the knowledge of Neo4j the company, that it’s the most awesome place I’ve ever worked at, and of Neo4j the product, that we all feel is a piece of engineering art that will change the future of data analysis.

Third: Help encourage women to pursue an IT education, stay in the program, and eventually stay in the profession. Some may think this is far-fetched. But hey, if we get a woman to choose an engineering education today, she will be ready and eager to start work in five years. Long time? Perhaps for somebody more short-sighted, but not for us. We are building a stable, long-lasting company that is a good citizen in whatever societies we’re a part of.

Make It Happen


When my colleague Louise and I presented the sponsoring proposal to our CEO Emil Eifrem, he concluded with, “I hope you understand that I am all for this?” I felt a surge of “Wow! THIS is the guy I am working for!”

The sponsorship was a fact. We hosted our first Pink Programming Sunday in May, and participated as trainers in the first ever Pink Programming Data Science Camp this summer.

Pink Programming data science camp, hiking in Sweden.

Twenty women gathered for five days in a cabin in the Swedish mountains. There was hiking, yoga, running and tabata. There was crazy swimming in the ice cold lake, Bollywood dancing, delicious vegan meals cooked together and conversations about everything and anything.


Pink Programming yoga session in Sweden.

Connections were created and friends were made.


Pink Programming, cooking vegan meals together.

And there was data science! During five workshops, participants learned about relational databases and SQL, about graph databases in general, and about Neo4j and Cypher in particular, in addition to DNA analysis, search engines and statistical models on music data.

Making friends at the Pink Programming camp in Sweden.

From the very first moments, women of all ages, backgrounds and nationalities formed a warm and inclusive group. I found myself an instant trusted member of the leadership team. No positioning or pecking orders; all support and encouragement.

As for female role models: There is now a big fat checkmark in that column.

Look Into the Future


When asked, participants reported that they enjoyed just hanging out with like-minded women in an easy-going and friendly atmosphere in the beautiful Swedish mountains. Some attested to feeling a little unsure about signing up for a week with people they didn’t know beforehand, but that they were happy they did. Several said they had gained insight and an increased self-confidence in programming and data science. All said they wanted to continue exploring/pursuing this field.

Next year, Neo4j will be the main sponsor of a Pink Programming camp. I can’t wait.

Dining al fresco, Pink Programming data science camp in Sweden.


New to graph technology?

Grab yourself a free copy of the Graph Databases for Beginners ebook and get an easy-to-understand guide to the basics of graph database technology – no previous background required.


Get My Copy

Casinos on Graphs: 5-Minute Interview with Joe Stefaniak, CEO of IntelligentTag

Check out this 5-minute interview with Joe Stefaniak of IntelligentTag.
“Especially when it comes to the financial sector or gaming, people need answers in real time,” said Joe Stefaniak, CEO of IntelligentTag.

From slots and games to guest services to financial systems, casinos are chock full of data sources. Whether you’re troubleshooting a slot machine or running multiple casinos on a Friday night, you need a real-time view into all of that connected data. A traditional RDBMS approach to big data is just too slow; you need a graph database.

In this week’s five-minute interview (conducted at GraphConnect New York), we discuss how IntelligentTag’s Symmetry, built using Neo4j, surprises and delights customers by providing live answers from disparate data sources, as well as the importance of AI to the future of the industry.

Check out this 5-minute interview with Joe Stefaniak of IntelligentTag.

Talk to us about how you guys use Neo4j at IntelligentTag.


Joe Stefaniak: Our focus at this time is on a product called Symmetry3. Symmetry is being positioned as a total property management system for casinos, a casino in a box.

What we’re looking to do is integrate all of the disparate data sources within a casino – of which there are quite a few – and bring them into a normalized layer to create a holistic view of how people are moving about the property, understand property loyalty and understand how to plan and optimize the casino to ensure that each guest gets the best possible experience. That normalized layer uses Neo4j.

What made you choose Neo4j?


Stefaniak: Our background is in the financial services sector, so we were doing a lot of data governance work. And with Symmetry, we continue to do data governance so that whether the user is an executive or a slot tech, they see only the data they should have access to.

In developing Symmetry, we found that the amount of data being generated through data lakes was so large the performance of our traditional approaches like relational database management systems (RDBMS) – especially doing a recursive analysis – was incredibly slow.

So we applied that linking of disparate data sources, which is prevalent in many different industries today, to the gaming industry, and we realized that the Neo4j graph database not only allowed us to quickly integrate the disparate data sources but also to run the analytics performantly.

Especially when it comes to the financial sector or gaming, people need answers in real time. If a guest complains that they didn’t get credit for the amount they just played, managers need the real story right then and there.

Can you talk to me about some of the most surprising results you’ve had while using Neo4j to create Symmetry?


Stefaniak: From an end-user perspective, we’ve seen aha moments. Our users say, “Hey, I had this data but I did not know it could be integrated so easily and from a revenue perspective so inexpensively, without having teams of people come in and consult, define, implement, deliver and distribute the end application.”

We’re able to reach everyone from, in this case, a slot technician on the floor who’s responsible for the health of the slot machine in real time to the executives. In the casino industry, what we have today is a model where the executives are managing multiple properties.

Executives are able to see not only one property, but all the properties they are managing. With the ability to federate Neo4j through Symmetry, we’re able to point to multiple Neo4j instances and roll that all up across a multi-property view to create a holistic view.

And, again, that’s an aha moment for these executives. They say, “Wow, I didn’t know this was doable.” And to have it done so quickly is a great value-add.

If you could start over with Neo4j, taking everything you know now, what would you do differently?


Stefaniak: Honestly, I don’t know if I would change anything. I was a very big proponent of open source. What I worked on in my prior years, and I’ve been in the industry for about 35, was typically proprietary stacks. And in today’s world, those stacks are just not being taught in schools or they’re closed systems, where they can’t integrate with an operational ecosystem. You have to be able to do that.

So when we started IntelligentTag, I think one of the things we did first, and we did right, was to look for a community player. One that supports the community. One that is bringing up the next generation of data scientists, analysts, programmers, experimentalists, whatever you want to call them, because they are the next generation. They do think differently. They think openly.

From a technical perspective, it comes down to the cost to market and the ability to integrate with a community – whether that’s Neo4j itself or everything the partners and community offer, such as APOC (Awesome Procedures on Cypher), the graph algorithms library, or open D3-based engines that integrate with a graph.

The ability to scale and to leverage the community allows us smaller companies to look like a big company.

What do you think the future of graph technology looks like in your industry or sector?


Stefaniak: Well, I think not only in my gaming sector, but I think in many sectors, the future is in the machine learning, reasoning, and artificial intelligence space. We have people who think in their own semantics, and they have their own roles, and being able to apply those semantics to their data is going to be the future, period.

Gaming and finance are going to be directed by compliance rules and regulations, especially gaming. Frankly, it’s a relatively unregulated industry today, but it will be a $100 billion market this year alone. That’s a lot of money, and when you have that kind of money, you have to have some type of regulation.

So I think you’re going to see in the near future, again, the machine learning, the artificial intelligence, and the governance and compliance element as well.

Anything else you want to add?


Stefaniak: We’ve been very happy with Neo4j and our partnership with Neo4j. They’ve been there from day one from a support perspective, all the way to helping us to penetrate verticals. It’s been a great partnership thus far, and I look forward to many years to come.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com


Take a closer look into the powerhouse behind the analysis of real-world networks: graph algorithms. Read this white paper – Optimized Graph Algorithms in Neo4j – and learn how to harness graph algorithms to tackle your toughest connected data challenge.

Get My Free Copy

Financial Risk Reporting: The Connected Nature of Financial Risk

Discover the connected nature of financial risk reporting and how graph technology can help.
As governmental regulations tighten, today’s banks must have a thorough and systematic understanding of risk calculations and their associated data lineage – including where underlying data originates and how it flows through enterprise systems.

Forward-looking banks are uniting data silos into an information foundation for building innovative applications. These solutions provide extreme visibility and deep analytical insights that improve compliance efforts and day-to-day decision making.

In this series, we’ll describe how connected data and graph database technologies are transforming risk reporting in modern banks to help them meet the stringent demands of risk reporting compliance.

This week, we’ll discuss risk reporting standards including the Basel Committee’s BCBS 239 and the multifaceted data challenges of complying with this regulation.

The Connected Nature of Financial Risk


The lack of timely risk data was a major contributor to the global financial crisis of 2008 as the collapse of Lehman Brothers sent shockwaves through the banking world.

Without standards for properly aggregating risk in their financial positions, banks were unable to quickly assess the dependency of their various holdings on Lehman stock and assets.

Armed with an understanding of risk data lineage – a visibility of data connections all the way back to authoritative data sources – financial houses could have limited their exposure. Such visibility requires financial data standards and modern software that understands the connectedness of modern investment instruments.

The Emergence of Risk Reporting Standards


Since the 2008 market meltdown, regulators have established standards for recording and tracing financial transactions and for aggregating risk data. The new standards are designed to uncover risk dependencies and adjust capital ratio requirements accordingly.

To create consistency in recording financial contract details, the International Organization for Standardization (ISO) released ISO 17442, the Legal Entity Identifier (LEI) initiative. LEI codes clearly identify parties in transactions, thereby laying a sturdy foundation for deep visibility into financial risk data.

Another crucial data-focused initiative is BCBS 239, the Basel Committee’s 14 principles to be used by banks when aggregating financial risk data.

These new standards collectively enable banks to assess risk, trace data lineage, and understand dependencies on other systems, investments and financial houses.

BCBS 239 Regulatory Principles


The 14 BCBS 239 principles are organized into four data management categories, as shown below.

Check out BCBS regulatory principles for financial risk reporting.

Governance and Infrastructure

To build their risk reporting systems, banks must utilize data governance and integrated data taxonomies, as well as group-wide metadata including consistent identifiers for entities, counterparties, customers and accounts. The banks must also maintain data systems that handle the requirements of normal operations as well as the high demands and specific requirements of crisis situations.

Risk Data Aggregation

Banks must be able to generate accurate, consistent and reliable risk data while maintaining full visibility back to authoritative data sources. Any aggregations and transformations must adjust for data latency, so all calculations are based on data values from the same point in time. And the datasets must be able to satisfy a full spectrum of requests made by managers and regulators.

Risk Reporting

Management and regulatory reports must represent risk in a precise and auditable manner, and reconcile to the complexity of the bank’s risk model and operations. They must present risk information in a clear, concise and easily understood manner that facilitates fast, informed decisions. The reports must be distributed regularly to managers and regulators and also be available for on-demand and ad hoc requests.

Supervisory Review

Bank risk supervisors are required to review their institution’s ongoing compliance with BCBS 239 principles. They must have access to the tools required to address any deficiencies they discover in their investigations. Supervisors are also required to cooperate with other regulators and other supervisors in their compliance investigations and implementation of remedial actions.

Key Data Challenges of BCBS 239 Regulations


Building models for risk reporting requires tackling some serious data management issues.

Data Lineage

Accurate and reliable risk reports require a clear understanding of data lineage. Reporting entities must be able to prove how each number in a report is generated, including its calculation details and source data. Each data item and transformation must be attributed to an owner, a steward, and be profiled with a quality and latency status. Most importantly, data must be traced backwards until its lineage ends with an authoritative source.
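In a graph, that backward trace is a single traversal. Here’s a minimal Cypher sketch, assuming a hypothetical lineage model in which report items point back through DERIVED_FROM relationships to data sources flagged as authoritative:

    // Trace a reported figure back to its authoritative source(s)
    MATCH path = (r:ReportItem {id: 'tier1-capital-ratio'})-[:DERIVED_FROM*]->(s:DataSource)
    WHERE s.authoritative = true
    RETURN [n IN nodes(path) | n.name] AS lineage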

Data Silos

Lines of business often grow their own systems to meet specific business needs or to create independent trading desks for faster decision making. This produces discrete data silos that make tracing data lineage a very difficult task. You must be able to trace data movement through those silos and systems all the way back to their original sources. While data warehouses can assemble information from discrete silos, they do little or nothing to trace data lineage, and can even make the process more complex.

Terminology Differences

Business groups often use their own terminology and algorithms, even within the same organization. For example, the notional value of a derivative contract can mean different things to different people. What is a derivative? Exactly what asset classes are included? Is the data original, copied or calculated? Is the data derived from internal or external sources? Are those sources authoritative?

Legal Entity Identifiers

With the introduction of the Legal Entity Identifier (LEI) standard and MiFID 2 (the EU’s Markets in Financial Instruments Directive), entities and counterparties in contracts are required to use standard identifiers to describe transactions. While this helps address the accountability of the parties, developers must still relate old entity identifiers to the new identifiers on all historical information.

Data Consistency and Latency

The data appearing in regulatory reports must be accurate and consistent as of a specific time. To achieve such temporal consistency, all transactions must be timestamped. Reports must use those timestamps to assemble accurate snapshots of risk data at any point in time.

Achieving consistency across data silos presents even harder challenges. Risk reports typically pull data from multiple sources, each with its own refresh schedule. To avoid data latency issues and achieve consistency, risk reports must adjust for temporal differences in each silo.
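One way to express that point-in-time view in Cypher – a sketch only, assuming hypothetical validFrom/validTo timestamps on each position – is to filter every fact against a single asOf value:

    // Aggregate exposure as of one consistent point in time
    WITH '2018-06-30T17:00:00Z' AS asOf
    MATCH (b:Book)-[:HOLDS]->(p:Position)
    WHERE p.validFrom <= asOf AND (p.validTo IS NULL OR p.validTo > asOf)
    RETURN b.name AS book, sum(p.exposure) AS totalExposure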

Conclusion


The risk-reporting mandates of BCBS 239 place new demands on data architectures at banks and financial houses worldwide. The need for fast access to real-time data lineage and financial risk information has given organizations solid justification for revisiting the old, relational reporting systems they’ve struggled with for years.

Using a native graph database like Neo4j, banks can provide regulators with all the information that is required by regulations like BCBS 239. At the same time, their connected data serves as the foundation for innovative, real-world risk reporting solutions.

In the coming weeks, we’ll take a closer look at a federated approach to BCBS 239 compliance and why it requires choosing the right graph database technology. We’ll explore how early adopters are utilizing the clarity and flexibility of graph modeling to create an enterprise platform for visualizing, analyzing, reporting and governing financial risk.


Comply and innovate:
Find out how financial services firms use connected data to comply and get a competitive edge in this white paper, The Connected Data Revolution in Financial Risk Reporting: Connections in Financial Data Are Redefining Risk and Compliance Practices. Click below to get your free copy.


Read the White Paper


Graph Databases for Beginners: Data Modeling Pitfalls to Avoid

Learn how to avoid these common (but fatal) data modeling pitfalls when working with graph technology
With the advent of graph database technology, data modeling has become accessible to the masses.

Mapping business needs into a well-defined structure for data storage and organization has made a sortie du temple (of sorts) from the realm of the well-educated few to the province of the proletariat. No longer the sole domain of senior DBAs and principal developers, anyone with a basic understanding of graphs can complete a rudimentary data model – from the CEO to the intern.

(This doesn’t mean we don’t still need expert data modelers. If you’re a data modeling vet, here’s your more advanced introduction to graph data modeling.)

Yet, with greater ease and accessibility comes an equal likelihood that data modeling might go wrong. And if your data model is weak, your entire application will be too.

Learn how to avoid these common (but fatal) data modeling pitfalls when working with graph technology


In this Graph Databases for Beginners blog series, I’ll take you through the basics of graph technology assuming you have little (or no) background in the space. In past weeks, we’ve covered why graph technology is the future, why connected data matters and how graph databases make data modeling easier than ever, especially for the uninitiated.

This week, we’ll discuss how to avoid the most common (and fatal) mistakes when completing your graph data model.

Example Data Model: Fraud Detection in Email Communications


Graph databases are highly expressive when it comes to data modeling for complex problems. But expressivity isn’t a guarantee that you’ll get your data model right on the first try. Even graph database experts make mistakes, and beginners are bound to make even more.

Let’s dive into an example data model to witness the most common mistakes (and their consequences) so you don’t have to learn from the same errors in your own data model.

In this example, we’ll examine a fraud detection application that analyzes users’ email communications. This particular application is looking for rogue behavior and suspicious emailing patterns that might indicate illegal or unethical behavior.

We’re particularly looking for patterns from past wrongdoers, such as frequently using blind-copying (BCC) and using aliases to conduct fake “conversations” that mimic legitimate interactions. In order to catch this sort of unscrupulous behavior, we’ll need a graph data model that captures all the relevant elements and activities.

For our first attempt at the data model, we’ll map some users, their activities and their known aliases, including a relationship describing Alice as one of Bob’s known aliases. The result (below) is a star-shaped graph with Bob in the center.


Data modeling mistake for an email fraud detection solution

Our first data model attempting to map Bob’s suspicious email activity with Alice as a known alias. However, this data model isn’t robust enough to detect wrongful behavior.

At first blush, this initial data modeling attempt looks like an accurate representation of Bob’s email activity; after all, we easily see that Bob (whose known alias is Alice) emailed Charlie while BCC’ing Edward and CC’ing Davina. But we can’t see the most important part of all: the email itself.

A beginning data modeler might try to remedy the situation by adding properties to the EMAILED relationship, representing the email’s attributes as properties. However, that’s not a long-term solution. Even with properties attached to each EMAILED relationship, we wouldn’t be able to correlate connections between EMAILED, CC and BCC relationships – and those correlating relationships are exactly what we need for our fraud detection solution.

This is the perfect example of a common data modeling mistake. In everyday English, it’s easy and convenient to shorten the phrase “Bob sent an email to Charlie” to “Bob emailed Charlie.” This shortcut made us focus on the verb “emailed” rather than the email as an object itself. As a result, our incomplete model keeps us from the insights we’re looking for.

The Fix: A Stronger Fraud Detection Data Model


To fix our weak model, we need to add nodes to our graph model that represent each of the emails exchanged. Then, we need to add new relationships to track who wrote the email and to whom it was sent, CC’ed and BCC’ed.

The result is another star-shaped graph, but this time the email is at the center, allowing us to efficiently track its relationship to Bob and possibly some suspicious behavior.


The corrected fraud detection email data model

Our second attempt at a fraud detection data model. This iteration allows us to more easily trace the relationships of who is sending and receiving each email message.
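In Cypher, that email-centric model might be created along these lines – a sketch with illustrative properties, not code taken from the example itself:

    // The email becomes a node in its own right, with explicit relationships
    CREATE (bob:User {username: 'Bob'})
    CREATE (charlie:User {username: 'Charlie'})
    CREATE (davina:User {username: 'Davina'})
    CREATE (edward:User {username: 'Edward'})
    CREATE (email:Email {id: 1, content: 'Quarterly numbers attached...'})
    CREATE (bob)-[:SENT]->(email),
           (email)-[:TO]->(charlie),
           (email)-[:CC]->(davina),
           (email)-[:BCC]->(edward)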

Of course we aren’t interested in tracking just one email but many, each with its own web of interactions to explore. Over time, our email server logs more interactions, giving us something like the fraud detection graph below.


Fraud detection data model of email server logs

A graph data model showing many emails over time and their various relationships, including the sender and the direct, CC and BCC receivers.


The Next Step: Tracking Email Replies


At this point, our data model is more robust, but it isn’t complete.

We see who sent and received emails, and we see the content of the emails themselves. Nevertheless, we can’t track any replies or forwards of our given email communications. In the case of fraud or cybersecurity, we need to know if critical business information has been leaked or compromised.

To complete this upgrade, beginners might be tempted to simply add FORWARDED and REPLIED_TO relationships to our graph data model, like in the example below.


A graph data model mistake for email reply-to addresses

Our updated data model with FORWARDED and REPLIED_TO relationships in addition to the original TO relationship.

This approach, however, quickly proves inadequate. Much in the same way the EMAILED relationship didn’t give us the proper information, simply adding FORWARDED or REPLIED_TO relationships doesn’t give us the insights we’re really looking for.

To build a better data model, we need to consider the fundamentals of this particular domain. A reply to an email is both a new email and a reply to the original. The two roles of a reply are represented by attaching two labels – Email and Reply – to the appropriate node.

We then use the same TO, CC and BCC relationships to map whether the reply was sent to the original sender, all recipients or a subset of recipients. We also reference the original email with a REPLY_TO relationship.

The resulting graph data model is shown below.


A sophisticated email fraud detection graph data model


Not only do we see who replied to Bob’s original email, but we also track replies-to-replies and replies-to-replies-to-replies, and so on to an arbitrary depth. If we’re trying to track a suspicious number of replies to known aliases, the above graph data model makes this extremely simple.
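A hedged Cypher sketch of both steps follows, reusing the SENT assumption from above; the ids, content and the ALIAS_OF relationship type are illustrative placeholders rather than names taken from this article.

// A reply carries both labels and points back at the message it answers.
MATCH (alice:User {username: 'Alice'}), (bob:User {username: 'Bob'}),
      (original:Email {id: '1'})
CREATE (reply:Email:Reply {id: '2', content: 'placeholder reply text'}),
       (alice)-[:SENT]->(reply),
       (reply)-[:REPLY_TO]->(original),
       (reply)-[:TO]->(bob);

// Follow the whole reply chain to any depth, and flag replies sent by
// known aliases of the original sender (ALIAS_OF is an assumed type).
MATCH (original:Email {id: '1'})<-[:SENT]-(sender:User)
MATCH (original)<-[:REPLY_TO*1..]-(reply:Reply)<-[:SENT]-(replier:User)
WHERE (replier)-[:ALIAS_OF]->(sender) OR (sender)-[:ALIAS_OF]->(replier)
RETURN reply.id, replier.username;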

Homework: Data Modeling for Email Forwards


Equally important to tracking email replies is tracking email forwards, especially when it comes to leaked business information.

As a graph data modeling acolyte, your homework assignment is to document how you would model the forwarded email data, tracking the relationships with senders, direct recipients, CC’ed recipients, BCC’ed recipients and the original email.

Check your work on pages 61 and 62 of the O’Reilly Graph Databases book available here.

Conclusion


Data modeling has been made much easier with the advent of graph technology. However, while it’s simpler than ever to translate your whiteboard model into a physical one, you need to ensure your data model is designed effectively for your particular use case.

There are no absolute rights or wrongs with graph data modeling, but you should avoid the pitfalls mentioned above in order to glean the most valuable insights from your data.


Ready to sharpen your understanding of graph technology? Click below to get your free copy of the O’Reilly Graph Databases book and learn how to harness the power of connected data.

Download My Free Copy



Catch up with the rest of the Graph Databases for Beginners series:

Democratizing Data Discovery at Airbnb

Learn how Airbnb democratized their data discovery with a graph database.
Editor’s Note: This presentation was given by John Bodley and Chris Williams at GraphConnect Europe in May 2017.

Presentation Summary


Airbnb, the online marketplace and hospitality service for people to lease or rent short-term lodging, generates many data points, which leads to logjams when users attempt to find the right data. Challenges managing all the data points have led the data team to search for solutions to “democratize the data,” helping employees with data exploration and discovery.

To address this challenge, Airbnb has developed the Dataportal, an internal data tool that helps with data discovery and decision-making and that runs on Neo4j. It’s designed to capture the company’s collective tribal knowledge.

As data accumulates, so do the challenges around the volume and complexity of the data. One example of where this data accumulates is in Airbnb’s Hive data warehouse. Airbnb has more than 200,000 tables in Hive spread across multiple clusters.

Each day the data starts off in Hive. Airbnb’s data engineers use Airflow to push it to Python. The data is eventually pushed to Neo4j by the Neo4j driver. The graph database is live, and every day they push updates from Hive into the graph database.

Why did Airbnb choose Neo4j? There are multiple reasons. Neo4j captures the relevancy of relationships between people and data resources, helping guide people to the data they need and want. On a technical level, it integrates well with Python and Elasticsearch.

Airbnb’s Dataportal UI is designed to help users, the ultimate holders of tribal knowledge, find the resources they need quickly.

Full Presentation: Democratizing Data at Airbnb


What we will be talking about today is how Airbnb uses Neo4j’s graph database to manage the many data points that accumulate in our Hive data warehouse.



What Is the Dataportal?


John Bodley: Airbnb is an online marketplace that connects people to unique travel experiences. We both work in an internal data tools team where our job is to help ensure that Airbnb makes data-informed business decisions.

The Dataportal is an internal data tool that we’re developing to help with data discovery and decision-making at Airbnb. We are going to describe how we modelled and engineered this solution, centered around Neo4j.

Addressing the Problem of Tribal Knowledge


The problem that the Dataportal project attempts to address is the proliferation of tribal knowledge. Relying on tribal knowledge often stifles productivity. As Airbnb grows, so do the challenges around the volume, the complexity and the obscurity of data. In a large and complex organization with a sea of data resources, users often struggle to find the right data.

We run an employee survey and consistently score really poorly on the question, “The information I need to do my job is easy to find.”

Data is often siloed, inaccessible and lacks context. I’m a recovering data scientist who wants to democratize data and provide context wherever possible.

Taming the Firehose of Hive


We have over 200,000 tables in our Hive data warehouse. It is spread across multiple clusters. When I joined Airbnb last year, it wasn’t evident how you could find the right table. We built a prototype, leveraging previous insights, giving users the ability to search for metadata. We quickly realized that we were somewhat myopic in our thinking and decided to include resources beyond just data tables.

Data Resources Beyond the Data Warehouse


We have over 10,000 Superset charts and dashboards. Superset is an open source data analytics platform. We have in excess of 6,000 experiments and metrics. We have over 6,000 Tableau workbooks and charts, and over 1,500 knowledge posts from Knowledge Repo, our open source knowledge-sharing platform that data scientists use to share their results, as well as a litany of other data types.

But most importantly, there are over 3,500 employees at Airbnb. I can’t stress enough how valuable people are as a data resource. Surfacing who may be the point of contact for a resource is just as pertinent as the resource itself. To further complicate matters, we’re dispersed geographically, with over 20 offices worldwide.

The mandate of the Dataportal is quite simply to democratize data and to empower Airbnb employees to be data informed by aiding with data exploration, discovery and trust.

At a very high level, we want everyone to be able to search for data. The question is, how to frame our data in a meaningful way for searching. We have to be cognizant of ranking relevance as well. It should be fairly evident what we actually feed into our search indices, which is all these data resources and their associated metatypes.

The Relevancy of Relationships: Bringing People and Data Together


Thinking about our data in this way, we were missing something extremely important: relationships.

Our ecosystem is a graph, the data resources are nodes and the connectivity is all relationships. The relationships provide the necessary linkages between our siloed data components and the ability to understand the entire data ecosystem, all the way from logging to consumption.

Relationships are extremely pertinent for us. Knowing who created or consumed a resource (as shown below) is just as valuable as the resource itself. Where should we gather information from a plethora of disjointed tools? It would be really great if we could provide additional context.

Check out this graphic of how Airbnb defines their relevancy of data relationships with their employees.

Let’s walk through a high-level example, shown below. Using event logs, we discover a user consumes a Tableau chart, which lacks context. Piecing things together, we discover that the chart is from a Tableau workbook. The directionless edge is somewhat ambiguous, but we prefer the many-to-one direction from both a flow and a relevancy perspective. Digging a little further, both these resources were created by another user. Now we find an indirect relationship between these users.

We then discover that the workbook was derived from some aggregated table that wasn’t in Hive, thus exposing the underlying data to the user. Then we parse the Hive logs and determine that this table is actually derived from another table, which provides us with the underlying data. And finally, both these tables are associated with the same Hive schema, which may provide additional context with regards to the nature of the data.

How Airbnb's Dataportal graph search platform first took shape.
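Pieced together, the walk-through above amounts to a single graph pattern. A rough Cypher sketch is below; every label, relationship type and property name here is an illustrative guess, not Airbnb’s actual Dataportal schema.

MATCH (viewer:User)-[:CONSUMED]->(chart:TableauChart)-[:PART_OF]->(wb:TableauWorkbook),
      (wb)<-[:CREATED]-(author:User),
      (wb)-[:DERIVED_FROM]->(agg:Table)-[:DERIVED_FROM]->(source:Table),
      (agg)-[:PART_OF]->(schema:HiveSchema)<-[:PART_OF]-(source)
RETURN viewer.name, author.name, chart.name, wb.name, agg.name, source.name, schema.name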

We leverage all these data sources to build a graph comprising the nodes and relationships, and this resides in Hive. We pull from a number of different sources; Hive is actually our persistent data store, where the table schema mimics Neo4j. We have a notion of labels, properties and an ID.

We pull from over six databases via scrapes that land in Hive. We also call a number of APIs, including Google and Slack, as well as some logging frameworks. That all goes into an Airflow Directed Acyclic Graph (DAG). (Airflow is an open source workflow tool that was also developed at Airbnb.) This workflow runs every day, and the graph is left to soak to prevent what we call “graph flickering.”

See the data resources Airbnb leverages to build a graph in Hive.

Dealing with “Graph Flickering”


Let me explain what I mean by graph flickering. Our graph is somewhat time-agnostic. It represents the most recent snapshot of the ecosystem. The issue is certain types of relationships are sporadic in nature, and that’s causing the graph to flicker. We resolve this by introducing the notion of relational state.

We have two sorts of relationships: persistent and transient.

Persistent relationships (see below) represent a snapshot in time of the system; they are the result of a DB scrape. In this example, the creator relationship will persist forever.

Check out how persistent relationships represent a snapshot in time.

Transient relationships, on the other hand, represent events that are somewhat sporadic in nature. In this example, the consumed relationship would only exist on certain days, which would cause the graph to flicker.

To solve this, we simply expand the time period from one to a trailing 28-day window, which acts as a smoothing function. This ensures the graph doesn’t flicker, but also enables us to capture only recent, and thus relevant, consumption information into our graph.

See how transient relationships are sporadic in nature.
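One way such a smoothing window might be maintained in Cypher is sketched below. The relationship type, property names and parameter shapes are assumptions, and the temporal functions require Neo4j 3.4 or later.

// Upsert consumption that was aggregated upstream over the trailing 28 days.
UNWIND $consumption AS row
MATCH (u:User {id: row.userId}), (r:Entity {id: row.resourceId})  // lookup keys are assumptions
MERGE (u)-[c:CONSUMED]->(r)
SET c.lastConsumed = date(row.lastConsumed),
    c.count28d = row.count;

// Remove consumption edges that have fallen out of the 28-day window.
MATCH ()-[c:CONSUMED]->()
WHERE c.lastConsumed < date() - duration({days: 28})
DELETE c;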

How Airbnb Uses Neo4j with Python and Elasticsearch


Let’s touch upon how our data ends up in Neo4j and downstream resources.

Shown below is a very simplified view of our data path which, in itself, is a graph. Given that relationships have parity with nodes, it’s pertinent that we also discuss the conduit that connects these systems.

Every day, the data starts off in Hive. We use Airflow to push it to Python. In Python, the graph is represented as a NetworkX object, and from this we compute a weighted PageRank on the graph, which helps improve search ranking. The data is then pushed to Neo4j by the Neo4j driver.

We have to be cognizant of how we do a merge here. The graph database is live, and every day we push updates from Hive into the graph database. That’s a merge, and it is something we have to be quite cautious of.
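As a rough illustration of the kind of idempotent upsert this implies, here is a minimal Cypher sketch; the labels, keys and parameter shapes are assumptions, not Airbnb’s actual load job.

// Upsert nodes scraped from Hive.
UNWIND $tables AS row
MERGE (t:Entity:Table {id: row.id})
SET t += row.props;

// Upsert the persistent CREATED relationships.
UNWIND $created AS row
MATCH (u:Entity:User {id: row.userId}), (t:Entity:Table {id: row.tableId})
MERGE (u)-[:CREATED]->(t);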

From here, the flow forks into two directions. The nodes get pushed into Elasticsearch via a GraphAware plugin, which is based on transaction hooks. From there, Elasticsearch serves as our search engine. Finally, we use Flask, a lightweight Python web framework that we also use for other data tools, to serve the web app. Results from Elasticsearch queries are fetched by the web server.

Additionally, results from Neo4j queries pertaining to connectivity are fetched by the web server using that same Neo4j driver.

Learn how Airbnb democratized their data discovery with a graph database.

Why did we choose Neo4j as our graph database?

There are four main reasons. First, our data represents a graph, so it felt logical to use a graph database to store the data. Second, it’s nimble. We wanted a really fast, performant system. Third, it’s popular; it’s the world’s number one graph database. The community edition is free, which is really super helpful for exploring and prototyping. And finally, it integrates well with Python and Elasticsearch, existing technologies we wanted to leverage.

Learn why Airbnb chose Neo4j's graph database.

There’s a lovely symbiotic relationship between Elasticsearch and Neo4j, courtesy of some GraphAware plugins. The Neo4j plugin, which asynchronously replicates data from Neo4j to Elasticsearch. That means we actually don’t need to actively manage our Elasticsearch cluster. All our data persists. We use Neo4j as the source of truth.

The second plugin lives in Elasticsearch and allows Elasticsearch to consult the Neo4j database during a search. This allows us to enrich search rankings by leveraging the graph topology. For example, we could sort by recently created, which is a property on the relationship, or by most consumed, where we have to explore the topology of the graph.

This is how we represent our data model. We defined a node label hierarchy as follows.

Check out Airbnb's node label hierarchy.

This hierarchy enables us to organize data in both Neo4j and Hive. The top-level :Entity label represents a base abstract node type, which I’ll explain later.

Let’s walk through a few examples here. Our schema was created in such a way that the nodes are globally unique in our database, by combining the set of labels and the locally scoped ID property.

First, we have a user who’s keyed by their LDAP username, then a table that’s keyed by the table name and finally a Tableau chart that’s keyed by the corresponding DB instance inside the Tableau database.

User name nodes examples from Airbnb.

Graph queries are heavily leveraged in the user interface (UI), and they need to be incredibly fast. We can match nodes efficiently by defining per-label indices on the ID property, and we leverage them for fast access. Here, we’re just explicitly forcing the use of the index because we’re using multiple labels.

Match queries using multiple labels.
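A minimal sketch of that indexing and index hint, using the Neo4j 3.x-era syntax; the table name is a made-up placeholder.

// One index per label on the locally scoped id property.
CREATE INDEX ON :Table(id);
CREATE INDEX ON :User(id);

// Because nodes carry multiple labels, the hint forces the planner onto the
// per-label index.
MATCH (t:Entity:Table)
USING INDEX t:Table(id)
WHERE t.id = 'core_data.listings'
RETURN t;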

Ideally, we’d love to have a more abstract representation of the graph, moving from local to global uniqueness. To achieve that, we leverage another GraphAware plugin, UUID. This plugin assigns a global UUID on newly created entities that cannot be mutated in any way. This gives us global uniqueness. We can talk about entities in the graph by using just this one unique UUID property in addition to the entity label.

This also helps keep our queries simple and fast, which is especially relevant when we do bulk loads. Every day we do a bulk load of data, and we need that to be really performant.

Here’s this same sort of example as before. Now we’ve simplified this, so we can just purely match any entity using this UUID property, and it’s global.

See match queries.

We have a RESTful API. In the first example, you can match a node based on its labels and ID, which is useful if you have a slug-style URL. In the second, you can match a node based purely on its UUID. The third shows how we’d get a CREATED relationship by leveraging the two UUIDs involved. The front end uses these APIs, as covered in the next section.

Check out match node labels and IDs.
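The corresponding Cypher for those three lookups might look like this; the parameter names are assumptions, and uuid is the property added by the GraphAware UUID plugin.

// 1. Match by labels plus the locally scoped id (slug-style URLs).
MATCH (t:Entity:Table {id: $id}) RETURN t;

// 2. Match purely by the globally unique uuid property.
MATCH (e:Entity {uuid: $uuid}) RETURN e;

// 3. Fetch the CREATED relationship between two entities addressed by UUID.
MATCH (a:Entity {uuid: $creatorUuid})-[r:CREATED]->(b:Entity {uuid: $resourceUuid})
RETURN r;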

Designing the Front-end of the Dataportal


Chris Williams: I’m going to describe how we enable Airbnb employees to harness the power of our data resource graph through the web application.

The backends of data tools are often so complex that the design of the front-end is an afterthought. This should never be the case, and in fact, the complexity and data density of these tools makes intentional design even more critical.

One of our project goals is to help build trust in data. As users encounter painful or buggy interactions, these can chip away at their trust in your tool. On the other hand, a delightful data product can build trust and confidence. Therefore, with the Dataportal, we decided to embrace a product mindset from the start and ensure a thoughtful user interface and experience.

As a first step, we interviewed users across the company to assess needs and pain points around data resources and tribal knowledge. From these interviews, three overall user personas emerged. I want to point out that they span data literacy levels and many different use cases.

The first of these personas is Daphne Data. She is a technical data power user, the epitome of a tribal knowledge holder. She’s in the trenches tracing data lineage, but she also spends a lot of time explaining and pointing others to these resources.

Second, we have Manager Mel. Perhaps she’s less data literate, but she still needs to keep tabs on her team’s resources, share them with others, and stay up to date with other teams that she interacts with. Finally, we have Nathan New. He may be new to Airbnb, working with a new team, or new to data. In any case, he has no clue what’s going on and quickly needs to get ramped up.

Airbnb's Dataportal user personalities.

With these personas in mind, we built up the front end of the Dataportal to support data exploration, discovery and trust through a variety of product features. At a high level, these broadly include search, more in-depth resource detail and metadata exploration, and user-centric, team-centric and company-centric data.

We do not really allow free-form exploration of our graph as the Neo4j UI does. The Dataportal offers a highly curated view of the graph, which attempts to provide utility while maintaining guardrails, where necessary, for less data-literate employees.

Designing the Dataportal for exploration, discovery and trust.

The Dataportal is primarily a data resource search engine. Clearly, it has to have killer search functionality. We tried to embrace a clean and minimalistic design. This aesthetic allows us to maintain clarity despite all the data content, which adds a lot of complexity on its own.

We also tried to make the app feel really fast and snappy. Slow interactions generally disincentivize exploration.

At the top of the screen (see below) are search filters that are somewhat analogous to Google. Rather than images, news and videos, we have things like data resources, charts, groups, teams and people.

Data discovery and contextual search.

The search cards have a hierarchy of information. The overall goal is to provide enough context for users to quickly gauge the relevancy of results. We show things like the name and the type, highlight search terms, and surface the owner of the resource, when it was last updated, the number of views and so on. We also try to show the top consumers of any given result set, which is just another way to surface relationships and provide more context.

Continuing with this flow, from a search result, users typically want to explore a resource in greater detail. For this, we have content pages. Here is an example of a Hive table content page.

A Hive table content page.

At the top of the page, we have a description linked to the external resource and social features, such as favoriting and pinning, so users can pin a resource to their team page. Below that, we have metadata about the data resource, including who created it, when it was last updated, who consumes it, and so on.

The relationships between nodes provide context. This context isn’t available in any of our other siloed data tools. It’s something that makes the Dataportal unique, tying the entire ecosystem together.

Another way to surface graph relationships is through related content, so we show direct connections to this resource. For a data table, this could be something like the charts or dashboards that directly pull from the data table.

We also have a lot of links to promote exploration. You can see who created this resource and find out what other resources that they work on.

The screen below highlights some of the features we built out specifically for exploring data tables. You can explore column details and value distributions for any table. Additionally, tracing data lineage is important, so we allow users to explore both the parent tables and the child tables of any given table.

We’re also really excited about being able to enrich and edit metadata on the fly, we add table descriptions and column contents. And these are pushed directly to our Hive metastore.

Hive metastore for Airbnb's Dataportal.

The screen below highlights our Knowledge Repo, which is where data scientists share analyses, along with dashboards and visualizations. We typically iframe these data tools, which generates a log that our graph picks up; that activity trickles back into the graph and affects both PageRank and the number of views.

Airbnb Knowledge Repo analyses.

Helping Users, the Ultimate Holders of Tribal Knowledge


Users are the ultimate holders of tribal knowledge, so we created a dedicated user page, shown below, to reflect that.

On the left is basic contact information. On the right are resources the user frequently accesses, created or favorited, as well as the groups to which they belong. To help build trust in data, we wanted to be transparent about data usage: you can look at the resources any person views, including what your manager views, and so on.

Along the lines of data transparency, we also made a conscious choice to keep former employees in the graph.

Take George, the handsome intern that all the ladies talk about: he created a lot of data resources and favorited things. If I want to find a cool dashboard he made last summer whose name I’ve forgotten, keeping him in the graph is really relevant.

An example of data transparency with former employees tribal knowledge.

Another source of tribal knowledge is found within an organization’s teams. Teams have tables they query regularly, dashboards they look at and go-to metric definitions. We found that team members spend a lot of time telling people about the same resources, and they wanted a way to quickly point people to these items.

For that, we created group pages. The group overview below shows who’s in a particular team.

Group pages of tribal knowledge in Airbnb's Dataportal.

To enable curating content, we decided to borrow some ideas from Pinterest, so you can pin any content to a page. If a team doesn’t have any content that’s been curated, there’s a Popular tab. Rather than displaying an empty page, we can leverage our graph to inspect what resources the people on a given team use on a regular basis and provide context that way.

We leverage thumbnails for maximum context. We’ve gathered about 15,000 thumbnails from Tableau, Knowledge Repo and Superset, our internal data tool; they’re generated from a combination of APIs and headless browser screenshots.

The screen below highlights the pinning and editing flows. On the left, similar to Pinterest, you can pin an item to a team page. On the right, you can customize and rearrange the resources on the team page.

Team page pinning of editing flows.

Finally, we have company metric data.

We found that people on a team typically keep a tight pulse on relevant information for their team. A lot of times, as the company grows larger, they’ll feel more and more disconnected from company-level, high-level metrics. For that, we created a high-level Airbnb dashboard where they can explore up-to-date company-level data.

Airbnb dashboard company level data.

Front-End Technology Stack


Our front-end technology stack is similar to what many teams use at Airbnb.

We leverage modern JavaScript, ES6. We use the Node Package Manager (NPM) to manage package dependencies and build the application. We use an open source package from Facebook called React for generating the Document Object Model (DOM) and the UI. We use Redux, which is an application state tool. We use a cool open source package from Khan Academy called Aphrodite, which essentially allows you to write Cascading Style Sheets (CSS) in JavaScript. We use ESLint to enforce the Airbnb JavaScript style guide, which is also open source, and Enzyme, Mocha and Chai for testing.

Airbnb's Dataportal technology stack.

Challenges in Building the Dataportal


We faced a number of challenges in building the Dataportal.

It is an umbrella data tool that brings together all of our siloed data tools and generates a picture of the overall ecosystem. The problem with this is that any umbrella data tool is vulnerable to changes in the upstream dependencies. This can include things on the backend like schema changes, which could break our graph generation, or URL changes, which would break the front-end.

Additionally, data-dense design, creating a UI that’s simple and still functional for people across a large number of data literacy levels, is challenging. To complicate this, most internal design patterns aren’t built for data-rich applications. We had to do a lot of improvising and creation of our own components.

We have a non-trivial Git-like merging of the graph that happens when we scrape everything from Hive and then push that to production in Neo4j.

The data ecosystem is quite complex, and for less data literate people, this can be confusing. We’ve used the idea of proxy nodes, in some cases, to abstract some of those complexities. For example, we have lots of data tables, which are often replicated across different clusters. Non-technical users could be confused by this, so we actually accurately model it on the backend, and then expose a simplified proxy node on the front end.

Airbnb's Dataportal challenges.
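One plausible shape for the proxy nodes mentioned above is sketched below; the labels, relationship type and table name are purely illustrative, not Airbnb’s schema.

// A single proxy node fronts the per-cluster replicas of the same table;
// the UI links to the proxy while lineage stays accurate underneath.
MATCH (p:TableProxy {name: 'core_data.listings'})-[:HAS_REPLICA]->(t:Table)
RETURN p.name AS table, collect(t.cluster) AS clusters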

Future Directions for Airbnb and the Graph Database


We’re considering a number of future directions.

The first is a network analysis that finds obsolete nodes. In our case, this could be things like data tables that haven’t been queried for a long time and are costing us thousands of dollars each month. It could also be critical paths between resources.
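A hedged sketch of what such a sweep could look like follows, assuming a CONSUMED relationship with a lastConsumed date property and a 90-day threshold; all of these names and values are assumptions.

// Tables with no recorded consumption in the last 90 days.
MATCH (t:Table)
OPTIONAL MATCH (t)<-[c:CONSUMED]-(:User)
WITH t, max(c.lastConsumed) AS lastConsumed
WHERE lastConsumed IS NULL OR lastConsumed < date() - duration({days: 90})
RETURN t.id AS table, lastConsumed
ORDER BY lastConsumed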

One idea that we’re exploring is a more active curation of data resources. If you search for something and you get five dashboards with the same name, it’s often hard, if you lack context, to tell which one is relevant to you. We have passive mechanisms like PageRank and surfacing metadata that would, hopefully, surface more relevant results. We are thinking about more active forms of certification that we could use to boost results in search ranking.

We’re also excited about moving from active exploration to delivering more relevant updates and content suggestions through alerts and recommendations. For example, “Your dashboard is broken,” “This table you created hasn’t been queried for several months and is costing us X amount,” or “This group that you follow just added a lot of new content.”

And then, finally, what feature set would be complete without gamification?

We’re thinking about providing fun ways to give content producers a sense of value by telling them, for example, “You have the most viewed dashboard this month.”

The future of Airbnb's Dataportal.


Inspired by Dave’s talk? Click below to register for GraphConnect 2018 on September 20-21 in Times Square, New York City – and connect with leading graph experts from around the globe.

Get My Ticket

Taking the First Step: How to Lead a Local Neo4j GraphDB Meetup

Discover more about leading a local Neo4j GraphDB meetup with Michael McKenzie.
I discovered Neo4j and became instantly captivated. Having found something that made data make sense and was fun, approachable and easier to learn, I read articles and white papers and watched videos online… but it wasn’t enough.

I needed to connect with people, face-to-face. I wanted to listen and learn from peers, share ideas, collaborate and just be with others who shared the same feeling.

Having recently moved to Washington, D.C., I worked remotely from home and didn’t have any friends. I reached out to Karin Wolok, who works with the Neo4j community, looking to find anyone nearby I could meet and talk graphy things with. (She’s too polite to admit it, but I definitely pestered her.)

Karin recommended something different – that I take over the D.C.-area meetup group.

Discover more about leading a local Neo4j GraphDB meetup with Michael McKenzie.

There had been a group in the area. As life goes, people get busy, things change and new leadership was needed to keep the momentum going. I accepted, and – let me be straight with you – I had absolutely no idea what I was doing.

But I was excited, full of energy, and just wanted to meet others with a similar interest. Luckily, I wasn’t doing it entirely alone, as the previous meetup leadership was willing to assist with helping me plan the next event.

We rebooted the old meetup group with a new name, new meetup group page and a new GRAPHTITUDE (too cheesy for you? I like it! Ha). The first meetup would be a simple introduction to graph technology, a Neo4j presentation and a short workshop.

Here was the plan:

  1. Pizza and drinks would be served.
  2. We would have a brief introduction to the new group and those in attendance.
  3. I would give the Intro to Graphs and Neo4j presentations.
  4. A local Neo4j engineer would provide a short workshop, walking through sample Cypher queries to get started.
  5. We would have a brief discussion, clean up and go home.

So… how did it go?


A massive thunderstorm rolled through the area in the early afternoon bringing threats of flooding to certain areas. Unfortunately, some last-minute, work-related issues prevented the Neo4j engineer from attending. The elevator at the venue required a key fob. About 10 people in all showed up (around 30 or so had RSVPed). And I gave the only presentation of the evening after realizing I was trying to reinvent the wheel instead of using the resources at my disposal.

And… IT WAS AWESOME!

The meetup was great. As people arrived, I began asking them questions about what they did, where they were from, how they were using Neo4j and anything else that came up. I had always figured the meetup would be rigid and organized. When I gave the presentation, however, I opted to allow a dialog and questions as we progressed through the slides – and it worked!

Our group began to talk and ask questions. We even had a 5-10 minute discussion and debate about whether a person should be modeled as a node itself, or a property of a chair node. (Spoiler: It depends on what you’re trying to answer).

Just like that, the first meetup was done. I’m really excited for the next one.

What did I learn?


  1. Have fun! Don’t put too much pressure to make everything perfect. Even with the best planning things can and do go wrong. All you can do is roll with the punches and laugh a little.
  2. Meetups function better as a team-building exercise than as a superhero mission. You can do everything yourself if you want, but it is more fun and less stressful when you let others help you.
  3. Don’t reinvent the wheel. Use the resources you have at your disposal. Realizing what your limitations are makes you wiser and better.
  4. Organizing and attending meetups are fun! It is natural to congregate and be around others that share in your interests and enthusiasm. It is scary to take the lead and make things happen, but the feeling of having fun with others trumps that by a long shot.
  5. Lastly, to paraphrase Karin, do meetups that interest you. Whether it be the topic of discussion (World Cup, algorithms, fraud detection, etc.), or the structure of the meetup (lightning talks, brainstorming, hack-a-thons, etc.), people will want to go to meetups that you yourself are interested in. People want to share in your excitement and also tend to not share cool ideas if they think no one else is interested.
And here is another first: This is my first-ever blog post about my first-ever meetup. No matter how it turns out, I am pretty proud that I did it.

MATCH (:Person)-[:ENJOYS]->(Neo4j:Graphdatabase)


Want to take your Neo4j skills up a notch? Take our online training class, Neo4j in Production, and learn how to scale the world’s leading graph database to unprecedented levels.

Take the Class