Meltdown & Spectre: Current Results from Neo4j Performance Testing

January 26, 2018, 2:25 am

Learn how the Meltdown and Spectre security vulnerabilities affect Neo4j graph database performance

As mentioned in our earlier blog post about Meltdown and Spectre, we have been running tests to discover the impact on Neo4j of the patches and workarounds for the Meltdown and Spectre vulnerabilities.

The Neo4j engineering team has now completed a series of tests – in various environments and under a wide range of workloads – and we’d like to share the results.

Server Types

To try to understand the performance impact across different Neo4j deployment configurations, we have tested in three different environments, as follows:

AWS Ubuntu instances
  Instance type: M3 Large
  Meltdown patch tested with: Linux, 4.4.0-1047-aws
Dedicated hardware, low-end servers
  Instance type: 1 x 4-core Xeon Skylake-DT – 64GB RAM – SSD
  Meltdown patch tested with: Linux, 4.4.0-109
Dedicated hardware, medium-level servers
  Instance type: 1 x 22-core Xeon Gold 6152 (22C/44T) @ 2.1GHz – 512GB RAM – SSD
  Meltdown patch tested with: Linux, 4.4.0-109

Workloads

We tested with several performance suites, each generating a distinctly different set of loads:

Internal performance testing workloads based on the LDBC SNB Interactive benchmark.

Store sizes varying between 1-200GB
Tested with up to 64 concurrent clients

Realistic read and write workloads against real (customer-provided) database stores

Targeted micro-benchmarks, designed to stress the major internal components of Neo4j

Very large data imports, using datasets ranging from 100 million entities to 100 billion entities

Neo4j Versions

We ran the workloads against the latest patch release of all the supported versions of Neo4j:

3.0.12
3.1.7
3.2.9
3.3.2

Results

The combination of server types, workloads and Neo4j versions creates a large matrix of tested scenarios. For every scenario we captured results from both before and after applying the relevant Meltdown patch.

Our analysis of the results shows only negligible performance impacts. Any change is within the range of variance we normally observe. These charts show results for two of the scenarios in the context of normal variation:

Neo4j LDBC performance benchmark testing for Meltdown and Spectre

The results suggest that Neo4j users using similar servers should not experience any significant performance degradation. However, we are continuing our testing and will report back if we find scenarios where there is a measurable impact.

For advice about your specific deployment, please contact Neo4j Support.

Unrelated to Meltdown, the second chart shows some substantial improvements in Neo4j performance from versions 3.1 → 3.2 and from 3.2 → 3.3. This is as expected and reflects ongoing efforts to improve the product across multiple dimensions.

Future Work

We expect further OS and firmware patches to become available over the coming weeks and months. In particular, we are waiting for stable patches from Intel. We’ll continue our testing to evaluate these patches as they become available.

The post Meltdown & Spectre: Current Results from Neo4j Performance Testing appeared first on Neo4j Graph Database Platform.

Retail & Neo4j: Pricing & Revenue Management

January 30, 2018, 2:37 am

It’s never been easier for customers to comparison shop.

In a matter of minutes, customers can compare prices for a specific product across a dozen stores — and all from the comfort of home. They can even compare prices and purchase from a competitor while shopping at a different retailer’s brick-and-mortar storefront.

To compete on prices and optimize profitability, retailers need to deliver competitive prices in real time. In order to keep up, retailers will need a secret weapon: graph technology.

Learn how Neo4j enables you to adjust complex retail pricing and revenue management in real time

In this series on Neo4j and retail, we’ll break down the various challenges facing modern retailers and how those challenges are being overcome using graph technology. In our previous posts, we’ve covered personalized promotions and product recommendation engines, customer experience personalization, ecommerce delivery service routing and supply chain visibility.

This week, we’ll discuss retail pricing and revenue management.

How Neo4j Powers Revenue Management for Retailers

Competitive pricing is based on a variety of factors, such as inventory, location, season, consumer demand, and more. These factors are very fluid and change quickly.

For example, suppose a hotel plans its pricing around a basketball championship series that could go to seven games. The cities hosting the games will have low inventory and should be priced accordingly. But if the championship is over in five games, there will be more inventory on the dates of what would have been the last two games, and retail pricing should change to match.

What’s more, each retail location may have a different price based on the market. The more retailers are able to understand their micro-markets and optimize product pricing to match availability, the more options there are to improve margins and sales in the right proportion. However, a relational database can’t keep up with these data points, and poor performance makes it impossible to deliver real-time pricing updates across multiple locations.

A graph database can help retailers address revenue management while delivering the scale and performance necessary for a real-time pricing engine. The interdependencies between the many variables can be represented as a graph, which gives retailers an effective way to determine and efficiently calculate prices even as dependencies change rapidly.

Marriott International Neo4j case study

Case Study: Marriott International

Marriott International – one of the world’s largest hospitality companies – needed a new pricing engine to drive both revenue and competitive differentiation.

The older pricing engine was being stretched by complex pricing rules that resulted in long pricing update cycles despite spending on massive amounts of hardware and tuning the legacy application. Marriott needed to provide a global system for on-property managers to review pricing recommendations and maintain pricing strategies for a 365-day horizon.

The hospitality company’s previous revenue management system was a manual-based mainframe green screen system. Prices were changed relatively infrequently due to the complexity of doing so. Publishing performance was slow — new prices didn’t show up for minutes or hours — and users avoided the system. But as the number of properties increased, so too did the number of pricing strategies and the complexity of those strategies.

The system used a highly normalized data model consisting of about ten levels of data in a relational database management system (RDBMS) with foreign key constructs. The company decided to build an application aware of the data model and relationships of the data.

However, some rate programs required over 30,000 lines of SQL queries to process, which took too long. With a business requirement to publish in less than 60 seconds, the company decided a rewrite was necessary.

The team decided to try Neo4j to achieve scale and performance for its 4,500 independent graphs with related data. In just eight weeks they built a prototype using Neo4j.

Although the prototype wasn’t functionally complete, the team built a projection model that could process its most complex property in less than 34 seconds, compared to 240 seconds (four minutes). The prototype also demonstrated that it could process properties concurrently rather than serially.

In five months, the company deployed the global system based on Neo4j for its 4,500 properties. As a result, the company found a 10-fold increase in publishing volumes, a 96% reduction in average publishing times, and a 50% reduction in server capacity and infrastructure costs.

Conclusion

Revenue management and real-time pricing are no longer calculations that can be made on the back of an envelope – nor in a far-off corporate headquarters. Rather, in order to remain competitive, retail prices must change rapidly as complex factors change within local markets.

These complex calculations are too interconnected for a relational database to handle at scale – or to publish in a narrow time window. That’s why retailers need a revenue management solution based on graph technology: every pricing factor within a connected data set can be calculated quickly and then published immediately.

In the coming weeks, we’ll take a closer look at other ways retailers are using graph technology to create a sustainable competitive advantage, including network management and IT operations.

It’s time to up your retail game:
Witness how today’s leading retailers are using Neo4j to overcome today’s toughest industry challenges with this white paper, Driving Innovation in Retail with Graph Technology. Click below to get your free copy.

Read the White Paper

Catch up with the rest of the retail and Neo4j blog series:

Personalized Promotion & Product Recommendations

Customer Experience Personalization

Ecommerce Delivery Service Routing

Supply Chain Visibility & Management

The post Retail & Neo4j: Pricing & Revenue Management appeared first on Neo4j Graph Database Platform.

Getting Started with Data Analysis using Neo4j [Community Post]

February 1, 2018, 1:36 am

[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Why Are We Doing This?

Data analysis is the process of dissecting, structuring and understanding data. In a nutshell, we want to find meaning in our data. In this tutorial, we aim to analyze a dataset from issuu.com. The goal is to find answers to a variety of simple and complex questions.

There are a plethora of tools, techniques and methods available to pursue data analysis. We will use Neo4j – a graph database – to represent and visualize the data. It uses a query language called Cypher which allows us to build queries and find all the answers that we seek. By the end of this tutorial, we will be able to import a JSON dataset to Neo4j and comfortably perform queries on our data.

Getting Set Up

The first thing we need to do is to download and install Neo4j. We will use Neo4j Desktop which provides a user-friendly UI to visualise the graph and run queries.

The next step is to acquire and understand the dataset!

1. The Dataset

The dataset that we will analyze comes from Issuu – an online repository for magazines, catalogs, newspapers and other publications. They published the Issuu Research Dataset with a treasure of data about documents and visitors. The dataset is completely anonymised and provides an insight into the usage of the website.

The data is available in the JSON format. It can be downloaded and accessed from this GitHub repository. There are two flavors of this file:

A small version – issuu_sample.json (This version of the dataset has 4 entries).
A large version – issuu_cw2.json (This version of the dataset has 10,000 entries).

Both these datasets have been slightly modified for the purpose of this tutorial. To summarize the modification, all JSON entries are now stored in an array and referenced by a key called items.
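The modification described above can be sketched in Python. This is only an illustration, not the script that was used to prepare the files: it assumes the unmodified dataset stores one JSON object per line, and the function name wrap_entries and the two sample entries are hypothetical.

```python
import json


def wrap_entries(lines):
    """Parse one JSON object per non-empty line and store them under "items"."""
    entries = [json.loads(line) for line in lines if line.strip()]
    return {"items": entries}


# Made-up entries in the assumed one-object-per-line layout.
raw_lines = [
    '{"visitor_uuid": "v1", "env_doc_id": "d1", "event_type": "read"}',
    '',
    '{"visitor_uuid": "v2", "event_type": "impression"}',
]

wrapped = wrap_entries(raw_lines)
print(json.dumps(wrapped, indent=2))
```

With the entries under a single items key, the whole file is one JSON object, which is the shape the import step later in this tutorial relies on.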

The dataset is vast and the detailed specification is available here. However, we are interested in the following attributes:

Attribute	Purpose
`env_doc_id`	Uniquely identify each document
`visitor_uuid`	Uniquely identify each visitor
`visitor_country`	Two-letter code to identify visitor’s country
`event_type`	Type of action accomplished by visitor on the document

2. Understanding the Graph

Now that we have selected our dataset and cherry-picked the necessary attributes, the next step is to formulate the data as a graph. To create the graph in Neo4j, we need to identify the following elements:

Nodes
Relationships
Properties

From our dataset, we can identify two nodes with the following properties:

Node	Properties
`Document`	`doc_uuid`
`Visitor`	`visitor_uuid`, `visitor_country`

Note: uuid stands for “Universally Unique IDentifier.”

Tip: A node label can be thought of as a class in object-oriented programming, and each node as an instance of that class.

What about the relationships? We can create one relationship between the document and visitor nodes.

Relationship	Properties
`Visitor` viewed `Document`	`type`

The relationship viewed is generic in nature. The type property indicates the specific type of event that was accomplished.

As an example, if we consider the visitor Thomas and the document A Diary of Jane, then the relationship can be illustrated as: Thomas viewed A Diary of Jane. However, the type of viewership could be any one of the following:

impression
click
read
download
share
pageread
pagereadtime
continuation_load

For the purpose of this exercise, we will use a single property on the relationship called type. The relationship can now be illustrated as: visitor Thomas viewed (and specifically downloaded) document A Diary of Jane.
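For readers who think in object-oriented terms, the model above can be written down as plain Python classes. This is just an analogy, not how Neo4j stores data. The property names follow the import query used later in this tutorial, and the sample values (including the country code) are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_uuid: str


@dataclass(frozen=True)
class Visitor:
    visitor_uuid: str
    visitor_country: str


@dataclass(frozen=True)
class Viewed:
    """The relationship: a visitor viewed a document, with a specific event type."""
    visitor: Visitor
    document: Document
    type: str  # one of the event types listed above, e.g. "download"


# Made-up example values.
thomas = Visitor(visitor_uuid="thomas", visitor_country="GB")
diary = Document(doc_uuid="a-diary-of-jane")
event = Viewed(visitor=thomas, document=diary, type="download")

print(f"{event.visitor.visitor_uuid} viewed ({event.type}) {event.document.doc_uuid}")
```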

3. Creating Constraints & Indexes

A constraint is a mechanism to control and ensure data integrity in Neo4j. Constraints can be created on either nodes or relationships. There are four types of constraints in Neo4j:

Unique node property constraints – to ensure that the graph contains only a single node with a specific label and property value.
Node property existence constraints – to ensure that a certain property of a specific label exists within all nodes in the graph.
Relationship property existence constraints – to ensure that a certain property exists within all relationships of a specific structure.
Node Keys – to ensure that, for a given label and set of properties, there exists all those properties for that label and that the combination of property values is unique. Essentially, a combination of existence and uniqueness.

We will create unique node property constraints for our graph as follows. On the Document label for the property doc_uuid:

CREATE CONSTRAINT ON (d:Document) ASSERT d.doc_uuid IS UNIQUE

On the Visitor label for the property visitor_uuid:

CREATE CONSTRAINT ON (v:Visitor) ASSERT v.visitor_uuid IS UNIQUE

The main goal of this exercise is to query the graph to derive insights about the dataset. One way to improve retrieval efficiency is by using indexes. The idea behind an index here is the same as in relational and NoSQL databases.

In Neo4j, an index can be created on a single property of a label. These are known as single-property indexes. An index can also be created on multiple properties of a label, and these are known as composite indexes.

It is important to understand that, by creating a unique node property constraint on a property, Neo4j will also create a single-property index on that property. Hence, in our situation, the indexes will be created on the doc_uuid property for the Document label and on the visitor_uuid property for the Visitor label.
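One intuition for why a uniqueness constraint comes with an index: to reject a duplicate, the database needs a fast lookup from property value to node, and that lookup is what an index provides. The toy Python class below illustrates the idea only. It is not Neo4j's implementation, and the class name UniqueIndex is hypothetical.

```python
class UniqueIndex:
    """Toy stand-in for a unique constraint backed by an index."""

    def __init__(self):
        self._by_value = {}  # property value -> node

    def insert(self, value, node):
        # The same lookup that enforces uniqueness also serves fast retrieval.
        if value in self._by_value:
            raise ValueError(f"duplicate value: {value!r}")
        self._by_value[value] = node

    def lookup(self, value):
        return self._by_value.get(value)


index = UniqueIndex()
index.insert("d1", {"doc_uuid": "d1"})

try:
    index.insert("d1", {"doc_uuid": "d1"})
    rejected = False
except ValueError:
    rejected = True

print(rejected)            # True
print(index.lookup("d1"))  # {'doc_uuid': 'd1'}
```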

Now that we have created the constraints, we can view existing indexes in Neo4j using the query:

CALL db.indexes

It should return:

╒═══════════════════════════════╤══════╤════════════════════╕
│description                    │state │type                │
╞═══════════════════════════════╪══════╪════════════════════╡
│INDEX ON :Document(doc_uuid)   │online│node_unique_property│
├───────────────────────────────┼──────┼────────────────────┤
│INDEX ON :Visitor(visitor_uuid)│online│node_unique_property│
└───────────────────────────────┴──────┴────────────────────┘

Tip: More information about constraints and their effect on indexes is elaborated in the Neo4j documentation on constraints.

4. Importing the JSON Dataset

Let’s now get our hands dirty!

Assuming that Neo4j is started (with an appropriate Database Location selected), we should first see an empty graph. This means that there are no nodes or relationships. Our goal is to populate the graph with the data from the JSON dataset by defining its skeleton (nodes and relationships).

To accomplish this, we will use the concept of user-defined procedures in Neo4j. These procedures are specific functionalities which can be re-used to manipulate the graph. We will specifically use a procedure from the APOC library which is a collection of 200+ commonly used procedures.

On Neo4j Desktop, APOC can be installed with the click of a single button.

To do this, open your project and click on the Manage button. Next, click on the Plugins tab. Under this tab, you will see the APOC heading accompanied with its version and a description. Click on the Install and Restart button. Once APOC is successfully installed, you should see a label with the message ✓ Installed.

Here is a screenshot of a successful APOC installation:

Tip: More information about user-defined procedures and the APOC library (installation, usage, examples) is elaborated in this Neo4j blog article: APOC: An Introduction to User-Defined Procedures and APOC.

We are interested in a specific procedure – apoc.load.json – which will allow us to load data from a JSON document. This procedure will return a single map if the result is a JSON object, or a stream of maps if the result is an array.

The following code snippet illustrates the use of the apoc.load.json procedure with our large dataset – issuu_cw2.json:

WITH "/path/to/issuu_cw2.json" AS url
CALL apoc.load.json(url)
YIELD value

Now, value contains our data which we need to utilize to create the nodes and relationships. However, we first need to parse this bulk JSON data.

To do this, we can use the UNWIND keyword in Neo4j. In a nutshell, UNWIND will expand a list into a sequence of rows. (More information about UNWIND is available in the Neo4j documentation on Cypher clauses.) At this stage, we are required to possess an understanding of the structure of the JSON data. As mentioned in step 1, we can access the entries using the key items.

This is illustrated as:

UNWIND value.items AS item

At this stage, each row gives us a single entry via item, and we can read that entry’s attributes through it.

In Neo4j, the WITH clause passes the results of one part of a query on to the next part, where they can be filtered with WHERE. (More information about WITH is available in the Neo4j documentation on Cypher clauses).

From our data, we are concerned about documents from the Issuu “Reader” software. We uniquely identify these documents using the env_doc_id attribute. However, it is to be noted that not all documents have the env_doc_id attribute. Thus, we are required to explicitly select those entries which possess the attribute. The following code snippet illustrates this:

WITH item
WHERE NOT item.env_doc_id IS NULL

Notice how we access the value using item.env_doc_id. This style of retrieving the value makes working with JSON data on Neo4j a smooth experience.
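To make the row-by-row behavior concrete, here is a small Python analogy of the UNWIND and WHERE steps. It is a sketch of the semantics only, run on made-up data standing in for the map that apoc.load.json yields.

```python
# Made-up stand-in for the map returned by apoc.load.json.
value = {
    "items": [
        {"visitor_uuid": "v1", "env_doc_id": "d1", "event_type": "read"},
        {"visitor_uuid": "v2", "event_type": "impression"},  # no env_doc_id
        {"visitor_uuid": "v3", "env_doc_id": "d2", "event_type": "click"},
    ]
}

# UNWIND value.items AS item: one row per list element.
rows = list(value["items"])

# WHERE NOT item.env_doc_id IS NULL: keep only rows that have the attribute.
# A missing key reads as None here, much like a missing map key reads as null in Cypher.
kept = [item for item in rows if item.get("env_doc_id") is not None]

print([item["env_doc_id"] for item in kept])
```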

Now that we have access to the values of each entry, it is time to create the nodes and relationships. This is accomplished using the MERGE keyword in Neo4j. It is crucial to know the difference between CREATE and MERGE.

For example, if we execute the following statements sequentially (provided that the constraints were not created):

CREATE (d:Document {doc_uuid:1}) RETURN (d)
CREATE (d:Document {doc_uuid:1}) RETURN (d)

This will result in two nodes being created. A way to control this is by using MERGE, which will create the node only if it does not exist. This can be illustrated as follows:

MERGE (d:Document {doc_uuid:1}) RETURN (d)
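The difference can be mimicked with a toy in-memory graph in Python. This is an analogy for the behavior described above, not a model of how Neo4j executes these clauses; the helper names create and merge are hypothetical.

```python
nodes = []  # toy graph: a list of (label, properties) pairs


def create(label, props):
    """Like CREATE: always adds a new node."""
    node = (label, dict(props))
    nodes.append(node)
    return node


def merge(label, props):
    """Like MERGE: reuse a node with this label and these property values, else create it."""
    for node in nodes:
        node_label, node_props = node
        if node_label == label and all(node_props.get(k) == v for k, v in props.items()):
            return node
    return create(label, props)


create("Document", {"doc_uuid": 1})
create("Document", {"doc_uuid": 1})
after_create = len(nodes)  # 2

nodes.clear()
merge("Document", {"doc_uuid": 1})
merge("Document", {"doc_uuid": 1})
after_merge = len(nodes)   # 1

print(after_create, after_merge)
```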

This same principle can be applied to relationships as well. The final Cypher query consisting of all the above components will look like this:

WITH "https://raw.githubusercontent.com/.../issuu_cw2.json" AS url
CALL apoc.load.json(url)
YIELD value
UNWIND value.items AS item
WITH item
WHERE NOT item.env_doc_id IS NULL
MERGE (document:Document
       {doc_uuid:item.env_doc_id})
MERGE (visitor:Visitor
       {visitor_uuid:item.visitor_uuid})
      ON CREATE SET visitor.visitor_country
                    = item.visitor_country
MERGE (visitor)-[:VIEWED{type:item.event_type}]->(document)

If we run this Cypher query verbatim on Neo4j, the output should be (similar to):

Added 2293 labels, created 2293 nodes, set 5749 properties,
created 2170 relationships, statement executed in 15523 ms.
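The reported counts are well below the 10,000 entries in the file. Two parts of the query contribute to that: entries without env_doc_id are skipped, and MERGE reuses nodes and relationships that already exist. The Python sketch below walks through the same three MERGE steps on a handful of made-up entries. It illustrates the logic only and is not expected to reproduce the numbers above.

```python
# Made-up entries shaped like the dataset's items.
items = [
    {"visitor_uuid": "v1", "visitor_country": "US", "env_doc_id": "d1", "event_type": "read"},
    {"visitor_uuid": "v1", "visitor_country": "US", "env_doc_id": "d1", "event_type": "read"},
    {"visitor_uuid": "v1", "visitor_country": "US", "env_doc_id": "d1", "event_type": "click"},
    {"visitor_uuid": "v2", "visitor_country": "BR", "env_doc_id": "d1", "event_type": "read"},
    {"visitor_uuid": "v3", "visitor_country": "MX", "event_type": "impression"},
]

documents = {}   # doc_uuid -> node properties
visitors = {}    # visitor_uuid -> node properties
viewed = set()   # (visitor_uuid, event_type, doc_uuid) triples

for item in items:
    doc_id = item.get("env_doc_id")
    if doc_id is None:
        continue  # WHERE NOT item.env_doc_id IS NULL

    # MERGE (document:Document {doc_uuid: ...})
    documents.setdefault(doc_id, {"doc_uuid": doc_id})

    # MERGE (visitor:Visitor {visitor_uuid: ...}) ON CREATE SET visitor.visitor_country = ...
    visitor_id = item["visitor_uuid"]
    if visitor_id not in visitors:
        visitors[visitor_id] = {
            "visitor_uuid": visitor_id,
            "visitor_country": item["visitor_country"],
        }

    # MERGE (visitor)-[:VIEWED {type: ...}]->(document)
    viewed.add((visitor_id, item["event_type"], doc_id))

print(len(documents) + len(visitors), "nodes,", len(viewed), "relationships")
```

Five entries produce three nodes and three relationships here: one entry is filtered out, and the repeated read by v1 maps onto a relationship that already exists.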

To check whether the graph was populated successfully, we can run the Cypher query…

MATCH (n) RETURN (n) LIMIT 200

…which will display only the first 200 nodes returned. The output can be visualized as follows:

Learn how to use the Neo4j graph database for data analysis with this step-by-step tutorial

5. Let’s Query

Finally, we can derive insights from our data.

We need to ask our graph questions. These questions need to be translated to Cypher queries which will return the appropriate results. Let’s start by answering some basic and advanced questions about the dataset.

Note: For the purposes of this tutorial, we will only display the top 10 results for queries with a large number of rows. This is achieved by using the LIMIT 10 clause.

Query 1. Find the number of visitors from each country and display them in the descending order of count.

MATCH (v:Visitor)
RETURN v.visitor_country AS Country, count(v) AS Count
ORDER BY count(v) DESC
LIMIT 10

Result:

╒═══════╤═════╕
│Country│Count│
╞═══════╪═════╡
│US     │312  │
├───────┼─────┤
│BR     │143  │
├───────┼─────┤
│MX     │135  │
├───────┼─────┤
│PE     │47   │
├───────┼─────┤
│CA     │46   │
├───────┼─────┤
│ES     │43   │
├───────┼─────┤
│GB     │36   │
├───────┼─────┤
│AR     │35   │
├───────┼─────┤
│FR     │34   │
├───────┼─────┤
│CO     │32   │
└───────┴─────┘

This query simply performs an internal group by operation where visitor nodes are grouped based on the visitor_country property. The count is computed using the count() aggregate function. We sort the results in the descending order using the ORDER BY <column> DESC clause in Neo4j.
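For comparison, the same group, count, sort and limit steps look like this in Python. The visitor list is made up and the sketch only mirrors the logic of the query; it does not touch the database.

```python
from collections import Counter

# Made-up visitor nodes; only the grouping key matters here.
visitors = [
    {"visitor_uuid": "v1", "visitor_country": "US"},
    {"visitor_uuid": "v2", "visitor_country": "BR"},
    {"visitor_uuid": "v3", "visitor_country": "US"},
    {"visitor_uuid": "v4", "visitor_country": "MX"},
    {"visitor_uuid": "v5", "visitor_country": "BR"},
    {"visitor_uuid": "v6", "visitor_country": "US"},
]

# Group by country and count (the implicit group by in the RETURN clause).
counts = Counter(v["visitor_country"] for v in visitors)

# ORDER BY count DESC LIMIT 10
top = counts.most_common(10)
print(top)
```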

Query 2. For a given document, find the number of visitors from each country.

MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
WHERE d.doc_uuid='140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9'
RETURN v.visitor_country AS Country, count(v.visitor_country) AS Count
ORDER BY count(v.visitor_country) DESC

Result:

╒═══════╤═════╕
│Country│Count│
╞═══════╪═════╡
│GY     │15   │
├───────┼─────┤
│CA     │12   │
├───────┼─────┤
│US     │11   │
├───────┼─────┤
│CW     │1    │
├───────┼─────┤
│BB     │1    │
└───────┴─────┘

This query is very similar to the one above. Here, we perform an internal group by operation to group the visitor nodes based on the visitor_country property. However, this query differs from the previous one in the sense that we want to filter the counts for a particular document UUID.

In order to achieve this filtration, we need to utilise the relationship within the graph. Hence, we first MATCH, filter using the WHERE clause and then return the desired values.

Tip: The relationship given here…

MATCH (d:Document)<-[:VIEWED]-(v:Visitor)

...can also be written as…

MATCH (v:Visitor)-[:VIEWED]->(d:Document)

Query 3. Find the number of occurrences for each type of viewership activity.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN r.type AS Type, count(d.doc_uuid) AS Count
ORDER BY Count ASC

Result:

╒════════════╤═════╕
│Type        │Count│
╞════════════╪═════╡
│click       │1    │
├────────────┼─────┤
│read        │62   │
├────────────┼─────┤
│pageread    │369  │
├────────────┼─────┤
│pagereadtime│779  │
├────────────┼─────┤
│impression  │959  │
└────────────┴─────┘

This query also performs an internal group by operation on the relationship property type. An interesting aspect of this query is the ORDER BY Count ASC. Previously, we followed the style of using ORDER BY count(d.doc_uuid) ASC. However, once we add a column name such as Count, we can use that in subsequent parts of the query.

Hence, ORDER BY count(d.doc_uuid) ASC can also be written as ORDER BY Count ASC.

Query 4. Find the visitors for each document and display the top three in the descending order of number of visitors.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN DISTINCT d.doc_uuid AS DocUUID, collect(DISTINCT v.visitor_uuid) AS Visitors, count(DISTINCT v.visitor_uuid) AS Count
ORDER BY Count DESC
LIMIT 3

Result:

╒══════════════════════════════╤══════════════════════════════╤═════╕
│DocUUID                       │Visitors                      │Count│
╞══════════════════════════════╪══════════════════════════════╪═════╡
│140224101516-e5c074c3404177518│[4f4bd7a35b20bd1f, 78e1a8af51d│26   │
│bab9d7a65fb578e               │44194, 4d49271019c7ed96, 6b2d3│     │
│                              │cca6c1f8595, 19f5285fef7c1f00,│     │
│                              │ 3819fc022d225057, f102d9d4fc4│     │
│                              │bacdc, e5d957682bc8273b, abefb│     │
│                              │3fe7784f8d3, 6170372b90397fb3,│     │
│                              │ 797846998c5624ca, 43dd7a8b2fa│     │
│                              │fe059, 3ec465aa8f36302b, d6b90│     │
│                              │f07f29781e0, 7bd813cddec2f1b7,│     │
│                              │ 3db0cb8f357dcc71, e1bfcb29e0f│     │
│                              │3664a, 6d87bcdc5fa5865a, b0ba1│     │
│                              │42cdbf01b11, 0930437b533a0031,│     │
│                              │ e3392e4a18d3370e, ee14da6b126│     │
│                              │3a51e, 502ddaaa898e57c4, 6fd04│     │
│                              │0328d2ad46f, 23f8a503291a948d,│     │
│                              │ 923f25aa749f67f6]            │     │
├──────────────────────────────┼──────────────────────────────┼─────┤
│140228202800-6ef39a241f35301a9│[2f21ee71e0c6a2ce, 55ac6c3ce63│25   │
│a42cd0ed21e5fb0               │25228, e8fa4a9e63248deb, b2a24│     │
│                              │f14bb5c9ea3, d2d6e7d1a25ee0b0,│     │
│                              │ 6229cca3564cb1d1, 13ca53a93b1│     │
│                              │594bf, 47d2608ec1f9127b, 4e2a7│     │
│                              │5f30e6b4ce7, 43b59d36985d8223,│     │
│                              │ 355361a351094143, 51fd872df55│     │
│                              │686a5, 2f63e0cca690da91, febc7│     │
│                              │86c33113a8e, 52873ed85700e41f,│     │
│                              │ ca8079a4aaff28cb, 17db86d2605│     │
│                              │43ddd, b3ded380cc8fdd24, b6169│     │
│                              │f1bebbbe3ad, 458999cbf4307f34,│     │
│                              │ 280bd96790ade2d4, 32563acf872│     │
│                              │f5449, fabc9339a406616d, 36a12│     │
│                              │501ee94d15c, 6d3b99b2041af286]│     │
├──────────────────────────────┼──────────────────────────────┼─────┤
│140228101942-d4c9bd33cc299cc53│[b1cdbeca3a556b72, a8cf3c4f144│24   │
│d584ca1a4bf15d9               │9cc5d, 1435542d699350d9, 06d46│     │
│                              │5bfb51b0736, 2d41536695cc4814,│     │
│                              │ a96854d21780c1f9, d1c98b02398│     │
│                              │e9677, 78deb8ffdb03d406, 6c661│     │
│                              │964d1d13c61, 0d47795fb1ddba9d,│     │
│                              │ 667283570b5cedfe, 5b2baf03296│     │
│                              │63564, 08c069dc405cad2e, 6823c│     │
│                              │573efad29f6, 9b2cb60327cb7736,│     │
│                              │ 0e8ddc2d2a60e14f, f5986e1cb02│     │
│                              │378e4, fa3810e505f4f792, d5ed3│     │
│                              │cfc4a454fe9, ba76461cdd66d337,│     │
│                              │ ee42ba15ed8618eb, 688eb0dcd6a│     │
│                              │d8c86, 67c698e88b4fbdcc, c97c3│     │
│                              │83d774deae0]                  │     │
└──────────────────────────────┴──────────────────────────────┴─────┘

This query utilizes the collect() aggregate function which groups multiple records into a list. An important consideration made here is the use of the DISTINCT operator to ensure that duplicate values are omitted from the output. Finally, we display the top three using the LIMIT 3 clause.
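The effect of collect(DISTINCT ...) and count(DISTINCT ...) can be mirrored in Python with one set per document. The pairs below are made up; the sketch shows why a visitor with several relationships to the same document is still listed and counted once.

```python
from collections import defaultdict

# Made-up (doc_uuid, visitor_uuid) pairs, one per VIEWED relationship.
views = [
    ("d1", "v1"), ("d1", "v1"), ("d1", "v2"),
    ("d2", "v1"),
    ("d3", "v1"), ("d3", "v2"), ("d3", "v3"),
]

# collect(DISTINCT v.visitor_uuid): a set drops the repeated ("d1", "v1") pair.
visitors_by_doc = defaultdict(set)
for doc_uuid, visitor_uuid in views:
    visitors_by_doc[doc_uuid].add(visitor_uuid)

# ORDER BY Count DESC LIMIT 3
ranked = sorted(visitors_by_doc.items(), key=lambda kv: len(kv[1]), reverse=True)
top = [(doc, sorted(vs), len(vs)) for doc, vs in ranked[:3]]
print(top)
```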

Query 5. For a given document, find recommendations of other documents like it.

Example 1: document UUID = 130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document)
WHERE d1<>d
      AND
      d.doc_uuid='130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8'
RETURN d1 AS Recommendations, count(*) AS Views,
sum(
CASE r1.type
  WHEN "impression" THEN 1
  WHEN "pageread" THEN 1.5
  WHEN "pagereadtime" THEN 1.5
  WHEN "read" THEN 2
  WHEN "click" THEN 0.5
  ELSE 0
END
) as Score
ORDER BY Score DESC

Result:

╒══════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞══════════════════════════════╪═════╪═════╡
│{doc_uuid: 130810070956-4f21f4│12   │16   │
│22b9c8a4ffd5f62fdadf1dbee8}   │     │     │
└──────────────────────────────┴─────┴─────┘

Example 2: document UUID = 120831070849-697c56ab376445eaadd13dbb8b6d34d0

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document)
WHERE d1<>d
      AND
      d.doc_uuid='120831070849-697c56ab376445eaadd13dbb8b6d34d0'
RETURN d1 AS Recommendations, count(*) AS Views,
sum(
CASE r1.type
  WHEN "impression" THEN 1
  WHEN "pageread" THEN 1.5
  WHEN "pagereadtime" THEN 1.5
  WHEN "read" THEN 2
  WHEN "click" THEN 0.5
  ELSE 0
END
) as Score
ORDER BY Score DESC

Result:

╒══════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞══════════════════════════════╪═════╪═════╡
│{doc_uuid: 130701025930-558b15│6    │6    │
│0c485fc8928ff65b88a6f4503d}   │     │     │
├──────────────────────────────┼─────┼─────┤
│{doc_uuid: 120507012613-7006da│6    │6    │
│2bc335425b93d347d2063dc373}   │     │     │
└──────────────────────────────┴─────┴─────┘

This query aims to find documents similar to a given document by assigning a score based on the type of viewership activity.

Activity	Score
`impression`	1
`pageread`	1.5
`pagereadtime`	1.5
`read`	2
`click`	0.5

First, we perform a MATCH operation to capture the 1st degree and 2nd degree viewership of a visitor node along with the document nodes and relationships. We ensure that the two document nodes are not the same by using the <> operator and also specify the initial document UUID for which we would like to find related documents.

Next, we simply return the recommended document UUIDs, their overall viewership counts and their score. To calculate the score, we utilize the CASE expression which is then tallied using the sum() aggregate function.

The CASE expression has the following syntax:

CASE <test>
 WHEN <value> THEN <result> //if value matches, then return result
 [WHEN ...] //repeat until all values are handled
 [ELSE <default>] //else return a default result
END

Finally, we sort the results in the descending order of score using the ORDER BY Score DESC clause!
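To check the scoring logic by hand, the same calculation can be sketched in Python on a few made-up relationships. The weights come from the table above; the function name recommend and the sample data are hypothetical, and the nested loop only imitates how the pattern produces one match per pair of VIEWED relationships.

```python
from collections import defaultdict

# Weights from the activity table above; any other type scores 0 (the ELSE branch).
WEIGHTS = {"impression": 1, "pageread": 1.5, "pagereadtime": 1.5, "read": 2, "click": 0.5}

# Made-up VIEWED relationships: (visitor_uuid, doc_uuid, type).
viewed = [
    ("v1", "d1", "read"),
    ("v1", "d2", "read"),
    ("v1", "d2", "pageread"),
    ("v2", "d1", "impression"),
    ("v2", "d2", "impression"),
    ("v2", "d3", "download"),
    ("v3", "d3", "read"),
]


def recommend(doc_uuid):
    """Score the other documents viewed by visitors of doc_uuid."""
    views = defaultdict(int)
    score = defaultdict(float)
    for visitor, doc, _ in viewed:
        if doc != doc_uuid:
            continue  # r must point at the given document
        for other_visitor, other_doc, other_type in viewed:
            # r1: same visitor, a different document (d1 <> d)
            if other_visitor == visitor and other_doc != doc_uuid:
                views[other_doc] += 1                            # count(*)
                score[other_doc] += WEIGHTS.get(other_type, 0)   # sum(CASE ...)
    rows = [(doc, views[doc], score[doc]) for doc in views]
    return sorted(rows, key=lambda row: row[2], reverse=True)   # ORDER BY Score DESC


print(recommend("d1"))
```

In this sample, d2 is recommended for d1 with three views and a score of 4.5 (2 + 1.5 + 1), while d3 appears with a score of 0 because download has no weight.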

Summary

In this tutorial, we saw an example of performing data analysis using Neo4j. We examined the Issuu Research Dataset and elaborated on its structure, format and fields.

Next, we formulated the model/schema of our desired graph by choosing appropriate nodes, relationships and properties. Further, we created constraints and indexes in Neo4j to ensure uniqueness and improve the performance of querying.

After this, we discussed how to import the raw JSON dataset, parse it and populate our graph by following the previously determined schema. Lastly, we saw some sample Cypher queries which helped us derive insight from our vast dataset.

What's Next?

The possibilities are endless! If you enjoyed this tutorial, then you can try to derive data analysis insights using another dataset of your choice.

While we chose to construct a rather simple graph, you can make it much more complex and detailed. In addition to this, you can also explore and experiment with the various APOC user-defined procedures on the graph.

Want to learn more about what you can do with graph databases? Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of connected data.

Download My Free Copy

The post Getting Started with Data Analysis using Neo4j [Community Post] appeared first on Neo4j Graph Database Platform.

Retail & Neo4j: Network & IT Management for Retailers

February 12, 2018, 3:53 am

In order to re-invent the value chain from linear to circular and highly connected, retailers need to modernize their IT infrastructure rapidly and cost-effectively.

In addition, web-based retailers must find a way to handle scale and sophistication to remain competitive. After all, Amazon – their biggest competitor – already handles both with ease.

Fortunately, graph technology helps today’s ecommerce and retail professionals overcome these IT management challenges.

Learn how Neo4j enables retail IT organizations to efficiently address network and IT management

In this series on Neo4j and retail, we’ve broken down the various challenges facing modern retailers and how those challenges are being overcome using graph technology. In our previous posts, we’ve covered personalized promotions and product recommendation engines, customer experience personalization, ecommerce delivery service routing, supply chain visibility and pricing management.

This week, we’ll wrap up our discussion with network and IT operations for retailers.

How Neo4j Simplifies Network & IT Management for Retailers

Retailers often have complex networks, with a growing share of components in the cloud (or in multiple clouds) alongside on-premises data centers. In most traditional configuration management databases (CMDBs), it can be difficult to represent every IT asset and understand how the assets are interconnected.

Consider, for example, a physical server that’s running multiple virtual machines (VMs). The VMs may be hosting containers that are running different processes and connected to different subnets. A graph database can be used to see how all these components interconnect.
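
The server, VM, container and subnet example above could be modeled roughly as follows. This is only a sketch: every label, relationship type and property value below is hypothetical and chosen for illustration, not taken from a real CMDB schema. Run the two statements one after the other.

```cypher
// Statement 1: build a tiny, made-up inventory.
CREATE (srv:Server {name: "rack1-srv01"})
CREATE (vm1:VM {name: "vm-web"}), (vm2:VM {name: "vm-db"})
CREATE (c1:Container {name: "checkout"}), (c2:Container {name: "inventory"})
CREATE (n1:Subnet {cidr: "10.0.1.0/24"}), (n2:Subnet {cidr: "10.0.2.0/24"})
CREATE (srv)-[:HOSTS]->(vm1), (srv)-[:HOSTS]->(vm2)
CREATE (vm1)-[:RUNS]->(c1), (vm2)-[:RUNS]->(c2)
CREATE (c1)-[:CONNECTED_TO]->(n1), (c2)-[:CONNECTED_TO]->(n2);

// Statement 2: which containers and subnets depend on one physical server?
MATCH (s:Server {name: "rack1-srv01"})-[:HOSTS]->(:VM)-[:RUNS]->(c:Container)-[:CONNECTED_TO]->(n:Subnet)
RETURN s.name AS server, c.name AS container, n.cidr AS subnet;
```

The point of the sketch is that the dependency chain is written as a single path pattern rather than as a series of joins.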

System administrators can also use a graph database to maintain a map of all the different network assets. This map can be used to better secure the network and to detect vulnerabilities or to limit the spread of an intrusion.

Penetration testers and security admins use Neo4j when they use BloodHound, an open source pen-testing tool. BloodHound tracks Active Directory permissions and is used by both blue and red teams to uncover problems in Active Directory security as well as potential attacks. Because of Neo4j’s index-free adjacency, users get predictable query response times regardless of the size of the database.
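
As a rough illustration of the kind of question such a tool asks of the graph, the sketch below looks for the shortest chain of permissions from an ordinary user to a privileged group. The data is a toy example, and the labels and relationship types are assumptions made for this sketch rather than the tool's actual schema.

```cypher
// Statement 1: toy permission graph (all names are hypothetical).
CREATE (u:User {name: "alice"}), (g1:Group {name: "Helpdesk"}),
       (c:Computer {name: "ws-042"}), (adm:User {name: "bob-admin"}),
       (g2:Group {name: "Domain Admins"})
CREATE (u)-[:MEMBER_OF]->(g1), (g1)-[:ADMIN_TO]->(c),
       (c)-[:HAS_SESSION]->(adm), (adm)-[:MEMBER_OF]->(g2);

// Statement 2: shortest directed path, capped at six hops, from the user to the group.
MATCH (start:User {name: "alice"}), (target:Group {name: "Domain Admins"})
MATCH p = shortestPath((start)-[*..6]->(target))
RETURN [n IN nodes(p) | n.name] AS attackPath, length(p) AS hops;
```

On this toy data the path runs alice, Helpdesk, ws-042, bob-admin, Domain Admins, which is four hops. A defender would read the same result as a list of permissions to tighten.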

Conclusion

While not always foremost in the minds of retailers, network and IT management are essential to long-term, strategic success in an Amazon-dominated industry. Forward-thinking retail IT organizations can use Neo4j to map and monitor network assets to offer seamless scale and sophistication to end-shoppers. In addition, Neo4j helps retail cybersecurity professionals keep the bad guys out of the network and away from disrupting the shopping experience – or stealing shoppers’ critical data.

Either way, the status quo technology – relational databases – is no longer enough to tackle retailers’ challenges within network and IT operations.

This concludes our blog series taking a closer look at graph technology use cases in the retail industry. To catch up with any of the posts you missed, click the links below, or download our white paper to cover them all.

The post Retail & Neo4j: Network & IT Management for Retailers appeared first on Neo4j Graph Database Platform.

Network Science: The Hidden Field behind Machine Learning, Economics and Genetics That You’ve (Probably) Never Heard of – An Interview with Dr. Aaron Clauset [Part 1]

February 13, 2018, 2:24 am

I recently had the opportunity to combine work and pleasure and meet with Dr. Aaron Clauset, an expert on network science, data science and complex systems. In 2016, Clauset won the Erdős-Rényi Prize in Network Science, but you might be more familiar with his earlier research into power laws, link prediction and modularity.

Dr. Clauset directs the research group that developed the ICON dataset reference (if you’re looking for network data to test, bookmark this now) and has recently published research that sheds light on possible misconceptions about network structures. When a last-minute business trip to Denver came up, I made the trip up to Boulder, where Clauset is an assistant professor of computer science at the University of Colorado Boulder.

Read this interview on the hidden field of network science and how it's revolutionizing everything

Dr. Aaron Clauset is an Assistant Professor of Computer Science at the University of Colorado Boulder and in the BioFrontiers Institute. He’s also part of the external faculty at the Santa Fe Institute (for complexity studies).

Between lunch and Clauset’s next class, we chatted about his group’s recent research and the general direction of network science, and I left with a superposition of both disillusionment and excitement. The Clauset Lab has been working to expand the diversity and rigor of studying complex systems and in doing so, they may be dismantling some cherished beliefs that date back to the 90s. (I should have known it wouldn’t be simple; we’re talking about complex systems after all.)

This matters to the Neo4j graph community because anyone analyzing networks – especially if they are looking for global attributes – needs to understand the underlying dynamics and structure. Below is a summary of our discussion.

What kind of work is your team focused on?

Clauset: My research group at CU Boulder currently includes five Ph.D. students, along with a few master’s students and several undergraduates. Our research focuses both on developing novel computational methods for understanding messy, complicated datasets and on applying these methods to solve real scientific problems, mainly in biological and social settings.

In the group, everyone is involved in research in some way. For example, the ICON website (index of complex networks) was built by a pair of undergrads to teach themselves networks concepts and explore tools.

Networks are one of our key areas of work. Networks are really just a representation, a tool to understand complex systems. We represent how a social system works by thinking about interactions between pairs of people. By analyzing the structure of this representation, we can then answer questions about how the system works or how individuals behave within it. In this sense, network science is a set of technical tools that can be applied to nearly any domain.

Networks also act as a bridge for understanding how microscopic interactions and dynamics can lead to global or macroscopic regularities. They can bridge between the micro and the macro because they represent exactly which things are interacting with each other. It used to be common to assume that everything interacts with everything, and we know that’s not true: not all pairs of people interact with each other and, in genetics, not all pairs of genes do either.

Taken from, “Hierarchical structure and the prediction of missing links in networks“

An extremely important effort in network science is figuring out how the structure of a network shapes the dynamics of the whole system. We’ve learned in the last 15 years that for many complex systems, the network is incredibly important in shaping both what happens to individuals within the network and how the whole system evolves.

My group’s work focuses on characterizing the structure of these networks so that we can better understand how structure ultimately shapes function.

Are there commonalities across different types of networks?

Clauset: In the late 1990s and early 2000s, a lot of energy in driving network science came from physicists, who brought new mathematical tools, models and lots of new data. One idea they popularized was the hypothesis that “universal” patterns occurred in networks of all different kinds – social, biological, technological, information and even economic networks – and these were driven by a small number of fundamental processes.

This kind of idea was pretty normal in one part of physics. For instance, there’s a universal mathematical model of how a magnet works that makes remarkably accurate predictions about real magnets of all different kinds.

The dream for networks was to show that the same could be done for them: that all different kinds of networks could be explained by a small set of basic mathematical principles or processes, or that they fell into a small number of general structural categories. It’s a pretty powerful idea, and it inspired both a lot of really good, cross-disciplinary work, as well as a number of highly provocative claims.

The validity of some of the boldest claims has been difficult to evaluate empirically because it required using a large and diverse set of real-world networks to test empirical “universality” of the pattern. Assembling such a dataset is part of what led us to put together the Index of Complex Networks, what we call the ICON index.

Although we’re still expanding it, my group has already begun revisiting many of the early claims about universal patterns in networks, including the idea that “all networks are scale-free,” or that only social networks have a high triangle density, or that networks cluster into “superfamilies” based on the pattern of their local structure. Surprisingly, many claims about the structure of networks have been repeated again and again in the literature, but they haven’t been carefully scrutinized with empirical data.

It turns out that many of these universal patterns fall apart when you can look across a huge variety of networks. Kansuke Ikehara’s recent paper [Characterizing the structural diversity of complex networks across domains] asks a simple question: If I label a large number of networks with where they came from (for example a transportation/road network, an online/social network, or a metabolic/biological network) can you use machine learning to discover what features distinguish these classes of networks?

The structural diversity of complex networks

If there are a few “families” of network structures, then no algorithm should be able to learn to distinguish different networks within a family. Instead, what we found was that pretty much every class of network was easily distinguishable from every other class.

Social networks cluster together in one part of the feature space, biological networks are generally well separated from those, etc., and this is true for every class of network we looked at. The clear take-home message is that there’s a lot more diversity in network structures than we thought 20 years ago, and therefore a lot more work to be done to understand where this diversity comes from.

Ikehara’s research revealed the hidden structural diversity of networks and suggests that there may be fewer universal patterns than once thought. At the same time, some clusters of networks are closer to each other in terms of their structure.

For instance, we found that water distribution networks exhibit similar structural signatures to fungal mycelia networks, which suggests that they may be shaped by similar underlying processes or optimization problems. In this way, machine learning can help us identify such structural commonalities, and therefore help us figure out, in a data-driven way, where we are most likely to find a common mechanistic explanation.

How is network science evolving?

Clauset: In many ways network science today is diversifying and expanding. This expansion is enabling a great deal of specialization, but there’s a trade-off. People can now take network methods and apply them in really specific questions about really specific systems.

This is enormously productive and an exciting achievement for network science. But, the growth of disciplinary work around networks also means there’s relatively less work that crosses disciplinary boundaries. Without shared spaces where people from different domains get together to talk about their breakthroughs, people working on one type of problem are less likely to get exposed to potentially remarkable ideas in a different area.

Sure, a lot of ideas about economics won’t apply to biological networks, but some will, and if the economists and biologists never talk to each other, we’ll never know. If there’s no common ground, there will be a lot of reinvention, and it can take years for methods in one domain to cross over to another.

This is why I think it’s so important to study and get together to discuss networks in general. This kind of cross-disciplinary fervor is another thing physicists and computer scientists helped get going about 20 years ago; it was mainly physicists and computer scientists broadcasting “we can do sociology, and politics, and ecology too.”

That attitude certainly annoyed some people, especially the sociologists who’d been doing networks for 80 years already, but it also generated enormous and broad interest in networks across essentially all the sciences. Now, the different disciplinary areas of network science are growing so quickly that in some ways the middle – the crossroads where ideas can jump between fields – is effectively shrinking.

How can network science encourage more cross-domain collaboration?

The NetSci conference for interdisciplinary network science

Clauset: Having an actual event that serves as a crossroads between domains where people can present and interact is essential. In many ways the International Conference on Network Science is trying to do that, but it struggles to pull researchers out of their domains and into the middle, since different disciplines have different overarching questions. I think so long as some domain experts from different fields come to the crossroads to talk and interact, good ideas will eventually spread.

Continuing this interdisciplinary effort will be a key part of continuing the advancement of network science. But, not all efforts need to be interdisciplinary. In fact, disciplines are essential to help focus our collective attention.

I’m not sure what the right balance is between disciplinary and interdisciplinary work, but for me, interdisciplinary ideas are the most exciting. If work on these isn’t funded and supported at decent levels, we surely won’t address many of the most important ideas in society because they’re the ones that span different disciplines.

For instance, cybersecurity is not just a technical issue because humans have a terrible track record of writing bug-free software. Real security requires legal components, social components, ethical components, economic components, and probably more to develop a lasting solution.

In fact, if you choose any problem that impacts a decent portion of the population, then it’s surely an interdisciplinary problem that will require an interdisciplinary approach to understand and solve.

Conclusion

As you can see, we had a great discussion on how some of the preconceived ideas around networks are changing. Next week – in the second part of this series – I’ll summarize our deeper dive into some of the advancements and emerging themes in network science.

Take a closer look into the powerhouse behind the analysis of real-world networks: graph algorithms. Read this white paper – Optimized Graph Algorithms in Neo4j – and learn how to harness graph algorithms to tackle your toughest connected data challenge.

Get My Free Copy

The post Network Science: The Hidden Field behind Machine Learning, Economics and Genetics That You’ve (Probably) Never Heard of – An Interview with Dr. Aaron Clauset [Part 1] appeared first on Neo4j Graph Database Platform.

Network Science: The Hidden Field behind Machine Learning, Economics and Genetics That You’ve (Probably) Never Heard of – An Interview with Dr. Aaron Clauset [Part 2]

February 22, 2018, 1:54 am

Last week, in part one of my interview with Dr. Aaron Clauset, we reviewed how network science was evolving and how it’s dismantling preconceived notions about networks in general. Clauset also stressed the crucial role of interdisciplinary collaboration when it comes to network science.

Dr. Aaron Clauset, Network Science researcher

This week, we’ll dive into how advancements in network science are impacting predictions based on connected data.

Which domains are making the most advancements in network science?

Clauset: We’re beginning to realize that interactions – which is what networks represent – are the way to understand complexity of all different kinds. All social processes at the population scale are complex and all biological processes at scale are complex.

One area that’s particularly interesting to me is computational social science, in part because there’s a lot of new data and many companies are terribly interested in predicting how people will behave. So we have data, interest and support.

In biology, the traditional approach to disease has been to find the one thing that’s broken and fix that. But it turns out that when two things are broken it’s much harder. When 1000 genes are involved and each compensates for others in some way, that’s a network problem. Many of the advancements coming in computational biology – like gene therapies – are related to looking at biological networks.

I’m really optimistic about what these two disciplines will do to drive network science. A lot of that will depend on developing new tools for answering both new and old questions.

What kind of tools do we need to develop?

Clauset: Everyone would like a “Google Maps for networks,” so you could see the big picture but then also drill down in different areas. Just being able to visually explore some of these network datasets and develop hypotheses would be helpful.

But it’s the underlying geometry of geographic data that makes it possible to zoom in and out between different scales. Network visualization is much harder because networks aren’t three dimensional.

To see anything at all, you have to use statistical tools and have an idea of what you’re looking for. As a result, it’s inherently hard to know what’s being hidden from you by the tool. People are working to solve this problem, but progress has been pretty slow.

Another exciting area is, of course, machine learning. Networks present a special challenge for standard machine learning techniques because most techniques – including deep learning – assume that observations are independent and by definition that’s not true in a network. Building new tools that can work with networks is an important area of work now.

Hierarchical data structure visualized as a network

Taken from, “Hierarchical structure and the prediction of missing links in networks“

The machine learning community is incredibly good at developing tools and algorithms, including network methods. But, let’s be honest, most of the fancy techniques aren’t great in many practical settings because they’re too complicated. If a company wants to extract insights from their data, they need to know how the tool will fail and when it can lead them astray. The more complicated the technique, the less likely you are to understand how it actually works – and that’s a huge problem, especially in network settings where things interact.

Deep learning is a great example of a powerful technique that we don’t really understand yet, but it also clearly works in practice. But how and when does it fail? Failure modes end up being surprisingly important.

One fascinating area of work on failure modes is on algorithmic biases and measures of fairness. Another is on interpretable models, meaning models that a human can inspect to figure out what they are doing. For networks, probabilistic network models are a powerful and interpretable technique, and this area is one of my favorites for innovative network tools.

Is your research group working in these emerging areas of network science?

Clauset: One area of focus in my group is on probabilistic models for networks, and a long-running effort has been to develop techniques that learn from as many different types of auxiliary information as possible: dynamic information like node and edge changes over time and also metadata like node annotations and edge weights.

The models we build are generally interpretable and relatively simple, while still digesting all this information to make predictions: detecting anomalies; sensing change points in dynamics; and predicting missing links or attributes.

One of my students is working on an approach similar to what was used to win the Netflix Prize but for link prediction. We take a dozen state-of-the-art “community detection” algorithms, combine them with 35 measures of local network structure, and then learn a model to figure out how to combine them all into highly accurate predictions.

This approach is an example of the same ensemble method that won the Netflix Prize competition. These results are particularly exciting both because the performance is remarkably good and because the tool is revealing new insights about how different networks are structured.
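
To make "measures of local network structure" a little more concrete, the sketch below computes one classic local measure, the number of common neighbors, for every pair of nodes that is not yet connected. It is not the group's ensemble model, only an example of the kind of simple feature such a model could take as one of its inputs. The graph, labels and names are invented for illustration.

```cypher
// Statement 1: hypothetical toy friendship graph.
CREATE (a:Person {name: "Ann"}), (b:Person {name: "Bob"}),
       (c:Person {name: "Cat"}), (d:Person {name: "Dan"})
CREATE (a)-[:KNOWS]->(c), (b)-[:KNOWS]->(c),
       (a)-[:KNOWS]->(d), (b)-[:KNOWS]->(d), (c)-[:KNOWS]->(d);

// Statement 2: score each unconnected pair by its number of common neighbors.
// The name comparison keeps each pair from being reported twice.
MATCH (x:Person)-[:KNOWS]-(shared:Person)-[:KNOWS]-(y:Person)
WHERE x.name < y.name AND NOT (x)-[:KNOWS]-(y)
RETURN x.name AS a, y.name AS b, count(DISTINCT shared) AS commonNeighbors
ORDER BY commonNeighbors DESC;
```

On the toy graph the only unconnected pair is Ann and Bob, who share two neighbors (Cat and Dan). A higher count is usually read as a hint that the missing link is more plausible.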

Network science research from the Santa Fe Institute

A word cloud from Dr. Clauset’s research with the Santa Fe Institute.

This kind of algorithm development depends on having a large and diverse corpus of network datasets, and that is precisely what the ICON dataset provides. As with a lot of machine learning techniques, this kind of work is very data hungry. We used nearly 500 networks from every domain imaginable to train the model, and as a result it generalizes much better than if we had trained it on the usual half-dozen datasets.

One goal of this project is to create a website that will perform all the backend calculations for you. You upload your data, the algorithms run and out spits different predictions.

What other research or work should we watch out for?

Clauset: Another of my students, Anna Broido, just put out a paper called Scale-free Networks Are Rare, which used the ICON index data to test another idea from early in network science: the claim that all networks are “scale free.”

Network science research on scale-free networks

Taken from, “Scale-free Networks Are Rare“

Performing rigorous statistical tests on nearly 1000 network datasets of all different kinds, we found that only 4% exhibited strong evidence of being scale free. Evidently, a genuinely scale-free network is a bit like a unicorn. That said, degree distributions in most networks are, in fact, very heterogeneous, and this simple pattern by itself can lead to lots of interesting dynamics, e.g., in the spread of disease or memes.
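
For readers who want a first look at the degree distribution of their own graph, a minimal sketch follows. It only tabulates how many nodes have each degree; it does not reproduce the statistical tests described above, and a heavy-tailed histogram on its own is not evidence that a network is scale free.

```cypher
// Count how many nodes have each degree (number of attached relationships,
// in either direction). Nodes without relationships are reported with degree 0.
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n, count(r) AS degree
RETURN degree, count(*) AS nodes
ORDER BY degree DESC;
```

On a large store this scans every node and relationship, so it is best tried on a sample or a test database first.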

Much like Kansuke Ikehara’s work, the take-home message here is that there’s much more structural diversity in networks than people previously thought, and some beliefs may need to be sacrificed if network science is to move forward. One of the tremendous things about the ICON dataset is that it provides a powerful, data-driven view into network structure, and we’re really excited about all the new discoveries that will be made by studying these large groups of real-world networks.

Taken from, “Hierarchical structure and the prediction of missing links in networks“

Is the study of structure the next big thing in network science?

Clauset: The study of structure is a crucial part of the study of networks. The big question is: how does the network’s structure shape the dynamics of a complex system?

This gets right back to the micro vs. macro question from last week, but ultimately, the big question is really about function. Connecting structure and function has always been difficult. But, there are many places where significant advances are happening.

Neuroscience is one where network science is helping make sense of new and very detailed datasets to unravel the way brains work. Ecology is another area where scientists have made great progress in understanding structure and function, particularly in food webs.

The basic idea is that network structure and interaction dynamics shape a system’s function. Certainly, structure constrains dynamics by determining who interacts with whom, and dynamics can drive structure, especially if interactions can be formed or broken. Anytime a system does something or carries out some kind of task, that function has to be related to dynamics and structure.

Studying structure helps us think about function and – as it turns out – there’s just a lot of work to be done yet to understand structure. Connecting these two is surely where a lot of big discoveries are lurking. I’m excited to see what comes out of exploring the true structural diversity of networks across domains, and the scientific explanations that are developed to explain it.

Conclusion

That wraps up our interview with Dr. Aaron Clauset on the hidden field of network science and how it’s revolutionizing the world around us – even if you can’t always see it. If you missed part one of our interview, catch it here.

The post Network Science: The Hidden Field behind Machine Learning, Economics and Genetics That You’ve (Probably) Never Heard of – An Interview with Dr. Aaron Clauset [Part 2] appeared first on Neo4j Graph Database Platform.

DevOps on Graphs: The 5-Minute Interview with Ashley Sun, Software Engineer at LendingClub

February 23, 2018, 2:33 am

“Basically, anything you can think of in your infrastructure, whether it’s GitHub, Jenkins, AWS, load balancers, Cisco UCS, vCenter – it’s all in our graph database,” said Ashley Sun, Software Engineer at LendingClub.

DevOps at LendingClub is no easy feat: Due to the complexities and dependencies of their internal technology infrastructure – including a host of microservices and other applications – it would be easy for everything to spiral out of control. However, graph technology helps them manage and automate every connection and dependency from top to bottom.

In this week’s five-minute interview (conducted at GraphConnect New York), Ashley Sun discusses how the team at LendingClub uses Neo4j to gain complete visibility into its infrastructure for deployment and release automation and cloud orchestration. The flexibility of the schema makes it easy for LendingClub to add and modify its view so that their graph database is the single up-to-date source for all queries about its release infrastructure.

Talk to us about how you use Neo4j at LendingClub.

Ashley Sun: We are using Neo4j for everything related to managing the complexities of our infrastructure. We are basically scanning all of our infrastructure and loading it all into Neo4j. We’ve written a lot of deployment and release automation, cloud orchestration, and it’s all built around Neo4j. Basically, anything you can think of in your infrastructure, whether it’s GitHub, Jenkins, Amazon Web Services (AWS), load balancers, Cisco Unified Computing System (UCS), vCenter – it’s all in our graph database.

We’re constantly scanning and refreshing this information so that at any given time, we can query our graph database and receive real-time, current information on the state of our infrastructure.
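
A scan-and-refresh loop of this kind is often written with MERGE, so that repeated scans update existing nodes instead of duplicating them. The sketch below is an assumption about how that could look, not LendingClub's actual code; the labels, properties and identifiers are hypothetical.

```cypher
// Statement 1: what a scanner might run for each instance it discovers.
MERGE (i:Instance {instanceId: "i-0abc123"})
  ON CREATE SET i.firstSeen = timestamp()
SET i.lastSeen = timestamp(), i.state = "running"
MERGE (lb:LoadBalancer {name: "public-web"})
MERGE (lb)-[:ROUTES_TO]->(i);

// Statement 2: an ad hoc question. Which instances behind a load balancer
// have not been seen by a scan in the last hour? (timestamp() is in milliseconds.)
MATCH (lb:LoadBalancer)-[:ROUTES_TO]->(i:Instance)
WHERE i.lastSeen < timestamp() - 60 * 60 * 1000
RETURN lb.name AS loadBalancer, i.instanceId AS instance, i.lastSeen AS lastSeen;
```

Because no schema has to be declared up front, a later scan can attach new properties or relationship types to the same nodes without a migration.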

What made you choose Neo4j?

Sun: At the time, my manager was looking for a database that we could run ad hoc queries against, something that was flexible and scalable. He actually looked at a few different graph databases and decided Neo4j was the best.

Catch this week’s 5-Minute Interview with Ashley Sun, Software Engineer at LendingClub

What have been some of the most interesting or surprising results you’ve seen while using Neo4j?

Sun: The coolest thing about Neo4j, for us, has been how flexible and easily scalable it is. Compared with a traditional SQL database, where schemas have to be predefined, Neo4j makes it really easy to build on top of already existing nodes, already existing relationships and already existing properties. It’s really easy to modify things. Also, it’s really, really easy to query at any time using ad hoc queries.

We’ve been working with Neo4j for three years, and as our infrastructure has grown and as we’ve added new tools, our graph database has scaled and grown with us and just evolved with us really easily.

Anything else you’d like to add or say?

Sun: It would be exciting for more tech companies to start using Neo4j to map out their infrastructure and maybe automate deployments and their cloud orchestration using Neo4j. I’d love to hear about how other tech companies are using Neo4j.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com

Want to learn more on how relational databases compare to their graph counterparts? Get The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.

Get the Ebook

The post DevOps on Graphs: The 5-Minute Interview with Ashley Sun, Software Engineer at LendingClub appeared first on Neo4j Graph Database Platform.

GDPR Compliance: The Challenges and Problems with Personal Data

February 26, 2018, 2:35 am

European Union regulators are dead serious about protecting the privacy of their citizens’ personal data.

The General Data Protection Regulation (GDPR), which takes effect on 25 May 2018, applies to all EU and foreign organizations handling personal data of EU residents.

It mandates strict compliance and calls for steep fines for privacy violations. If you commit infractions or are subjected to random checks, regulators will require you to prove your compliance with GDPR requirements.

Learn about the complex challenges and problems of personal data when it comes to GDPR compliance

Personal Data Raises Difficult Questions

To meet GDPR requirements, you must be able to answer these difficult questions for any of the more than 500 million people in the European Union:

Personal data questions involved with GDPR compliance

But GDPR demands don’t end with these questions. You must know when and where breaches occur and what data was taken. You have to give people a way to view their personal data and how it’s being used. And – perhaps most importantly – you must be able to prove to regulators that you are in compliance with GDPR requirements.

GDPR rules are the most far-reaching and technically demanding personal data privacy regulations ever established. This high degree of visibility and enforcement provides an opportunity for organizations across the Continent: Enterprises that embrace the new GDPR regulations and provide transparent tracking of personal information have a big opportunity to win the hearts, minds and business of consumers.

Tracking Personal Data Requires Deep Visibility

In modern organizations, personal data resides in many applications that span servers, data centers, geographies, internal networks, and cloud service providers. GDPR holds you accountable for that data regardless of where it is stored. And it requires you to be able to access, report and remove personal information from all those systems when required by consumers or regulators.

To satisfy GDPR requirements, you must be able to track the movement, or lineage, of a contact’s personal data — where it was first acquired, whether consent was obtained, where it moves over time, where it resides in each of your systems, and how it gets used. The connections among those systems and silos are key to tracking the complex path that personal data follows through your enterprise.

Personal data lineage across enterprise systems

The key to GDPR compliance is tracking data lineage across all your enterprise applications

Conclusion

A seismic shift will occur in the data management world this coming May when GDPR becomes law in the European Union. If you’re an organization with information about European residents, then you must comply with these new, strict rules about how personal data is stored, secured, used, transmitted and even erased from your system.

Using a graph database foundation for your GDPR solution places your organization on the fastest, easiest, most cost-effective path to GDPR compliance. One of the challenges with adhering to these regulations is ensuring you find all the data related to an individual. Using a graph database like Neo4j enables you to manage all of your data and its connections, offering a natural approach to compliance with GDPR.

In the coming weeks, we’ll take a closer look at why graph technology is a superior approach to tackling the challenge of GDPR, and we’ll outline four steps to building a GDPR compliance solution.

GDPR rules are more far-reaching and technically demanding than anything your enterprise has ever tackled – that’s why you need both a fast and a future-proof solution. Click below to get your copy of The Fastest Path to GDPR Compliance and learn how Neo4j enables you to become both GDPR compliant and poised for future opportunities driven by data connections.

Read the White Paper

The post GDPR Compliance: The Challenges and Problems with Personal Data appeared first on Neo4j Graph Database Platform.

↧

Neo4j: A Reasonable RDF Graph Database & Reasoning Engine [Community Post]

February 27, 2018, 1:54 am

[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

It is widely known that Neo4j is able to load and write RDF. Until now, RDF – and certainly OWL – reasoning has been attributed to fully fledged triple stores or dedicated reasoning engines only. This post shows that Neo4j can be extended by a unique reasoning technology to deliver a very expressive and highly competitive reasoning engine for RDF, RDFS and OWL 2 RL. I will briefly illustrate the approach and provide some benchmark results.

Labeled Property Graphs (LPG) and the Resource Description Framework (RDF) have a common ground: both consider data as a graph. Not surprisingly, there are ways of converting one format into the other, as recently demonstrated nicely by Jesús Barrasa from Neo4j for the Thomson Reuters PermID RDF dataset.

If you insist on differences between LPG and RDF, then consider the varying abilities of representing schema information and reasoning.

In Neo4j 2.0, node labels were introduced for typing nodes to optionally encode a lightweight type schema for a graph. Broadly speaking, RDF Schema (RDFS) extends this approach more formally. RDFS allows you to structure the labels of nodes (called classes in RDF) and relationships (called properties) in hierarchies. On top of this, the Web Ontology Language (OWL) provides a language to express rule-like conditions to automatically derive new facts such as node labels or relationships.

Reasoning Enriches Data with Knowledge

For a quick dive into the world of rules and OWL reasoning, let us consider the very popular LUBM benchmark (Lehigh University Benchmark).

The benchmark consists of artificially generated graph data in a fictional university domain and deals with people, departments, courses, etc. As an example, a student is derived to be an attendee if he or she takes some course, that is, when he or she matches the following ontological rule:

Student and (takesCourse some) SubClassOf Attendee

This rule has to be read as follows when translated into LPG lingo: every node with label Student that has some relationship with label takesCourse to some other node will receive the label Attendee. Any experienced Neo4j programmer may rub his or her hands since this rule can be translated straightforwardly into the following Cypher expression:

match (x:Student)-[:takesCourse]->()
set x:Attendee

That is perfectly possible but could become cumbersome in case of deeply nested rules that may also depend on each other. For instance, the Cypher expression misses the subclasses of Student such as UndergraduateStudent. Strictly speaking the expression above should therefore read:

match (x)-[:takesCourse]->() where x:Student or x:UndergraduateStudent
set x:Attendee
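
To make the role of the class hierarchy more tangible, here is a minimal Python sketch (plain Python, not Neo4j or OWL reasoner code) that applies the Attendee rule to a toy in-memory graph. The node names, the SUBCLASS_OF table and the helper functions are hypothetical and only serve the illustration; the labels and the takesCourse relationship are taken from the example above.

```python
# Toy forward application of:
#   Student and (takesCourse some) SubClassOf Attendee

# Hypothetical class hierarchy: child label -> parent label.
SUBCLASS_OF = {"UndergraduateStudent": "Student"}


def superclasses(label):
    """Return the label together with all of its ancestors."""
    result = {label}
    while label in SUBCLASS_OF:
        label = SUBCLASS_OF[label]
        result.add(label)
    return result


def apply_attendee_rule(labels, edges):
    """Add the Attendee label to every student-like node that takes a course.

    labels: dict mapping node id -> set of labels
    edges:  iterable of (source, relationship type, target) triples
    """
    takes_course = {src for src, rel, _ in edges if rel == "takesCourse"}
    derived = {node: set(node_labels) for node, node_labels in labels.items()}
    for node, node_labels in labels.items():
        # Expand each label to include its superclasses before testing the rule.
        inferred = set().union(*(superclasses(l) for l in node_labels))
        if "Student" in inferred and node in takes_course:
            derived[node].add("Attendee")
    return derived


labels = {
    "alice": {"UndergraduateStudent"},
    "bob": {"Student"},
    "carol": {"Student"},
    "db101": {"Course"},
}
edges = [("alice", "takesCourse", "db101"), ("bob", "takesCourse", "db101")]

result = apply_attendee_rule(labels, edges)
print(sorted(n for n, ls in result.items() if "Attendee" in ls))  # ['alice', 'bob']
```

Because the hierarchy is looked up in one place, adding another subclass of Student only means adding an entry to the table rather than touching the rule itself.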

It’s obviously more convenient to encode such domain knowledge as an ontological rule with support of an ontology editor such as Protégé and an OWL reasoning engine that takes care of executing them.

Another nice thing about RDFS/OWL is that modelling such knowledge is on a very declarative level that is standardized by W3C. In addition, the OWL language bears some important properties such as soundness and completeness.

For instance, you can never define a non-terminating rule set, and reasoning will instantly identify any conflicting rules. In case of OWL 2 RL, it is furthermore guaranteed that all derivable facts can be derived in polynomial time (theoretical worst case) with respect to the size of the graph.

In practice, performance can of course vary a lot. In the case of our Attendee example, a reasoner – regardless of whether it is a triple store rule engine or a Cypher engine – has to loop over the graph nodes with label Student and check for takesCourse relations.

To tweak performance, one could use dedicated indexes to efficiently select nodes with particular relations (or relation degrees) or labels, and also use stored procedures. At the end of the day, it seems that this does not scale well: when doubling the data, you double the number of graph reads and writes needed to compute the consequences of such rules.

The good news is that this is not the end of the story.

Efficient Reasoning for Graph Storage

There is a technology called GraphScale that empowers Neo4j with scalable OWL reasoning. The approach is based on an abstraction refinement technique that builds a compact representation of the graph suitable for in-memory reasoning. Reasoning consequences are then incrementally propagated back to the underlying graph store.

The idea behind GraphScale is based on the observation that entities within a graph often have a similar structure. The GraphScale approach takes advantage of these similarities and computes a condensed version of the original data called an abstraction.

This abstraction is based on equivalence groups of nodes that share a similar structure according to well-defined logical criteria. This technique is proven to be sound and complete for all of RDF, RDFS and OWL 2 RL.

Learn how the Neo4j graph database (vs. a triple store) performs as a reasonable RDF reasoning engine

Here is an intuitive idea of the approach. Consider the graph above as a fraction of the original data about the university domain in Neo4j. On the right, there is a compact representation of the undergraduate students that take at least some course.

In essence, the derived fact that those students are attendees implicitly holds for all source nodes in the original graph. In other words, there is some one-to-many relationship from derived facts in the compact representation to nodes in the original graph.
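
The following Python sketch illustrates the general idea of reasoning over equivalence groups and propagating results back. It is a strong simplification and not the GraphScale algorithm: here, two nodes are assumed to be equivalent when they have the same labels and the same outgoing relationship types, and all function and node names are hypothetical.

```python
from collections import defaultdict


def signature(node, labels, edges):
    """Structural fingerprint: the node's labels plus its outgoing relationship types."""
    out_types = frozenset(rel for src, rel, _ in edges if src == node)
    return (frozenset(labels[node]), out_types)


def build_abstraction(labels, edges):
    """Group nodes that share the same signature into equivalence groups."""
    groups = defaultdict(list)
    for node in labels:
        groups[signature(node, labels, edges)].append(node)
    return groups


def derive_for_group(group_labels, out_types):
    """Toy rule: undergraduate students that take some course are attendees."""
    if "UndergraduateStudent" in group_labels and "takesCourse" in out_types:
        return {"Attendee"}
    return set()


def materialize(labels, edges):
    """Reason once per group, then copy the derived labels to every member."""
    derived = {node: set(ls) for node, ls in labels.items()}
    groups = build_abstraction(labels, edges)
    for (group_labels, out_types), members in groups.items():
        new_labels = derive_for_group(group_labels, out_types)
        for node in members:  # one-to-many propagation back to the original graph
            derived[node] |= new_labels
    return derived, groups


labels = {
    "s1": {"UndergraduateStudent"},
    "s2": {"UndergraduateStudent"},
    "s3": {"UndergraduateStudent"},
    "c1": {"Course"},
    "c2": {"Course"},
}
edges = [("s1", "takesCourse", "c1"), ("s2", "takesCourse", "c2")]

derived, groups = materialize(labels, edges)
print(len(labels), "nodes collapse into", len(groups), "groups")
```

In this toy graph the rule is evaluated three times (once per group) instead of five times (once per node); the saving grows with the number of structurally similar nodes.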

Reasoning and Querying Neo4j with GraphScale

Let’s look at some performance results with data of increasing size from the LUBM test suite.

The following chart depicts the time to derive all derivable facts (called materialization) with GraphScale on top of Neo4j (excluding loading times) for 50, 100 and 250 universities. Compared with other secondary storage systems with reasoning capabilities, the Neo4j-GraphScale duo shows a much lower growth ratio in reasoning time with increasing data than any other system (schema and data files can be found at the bottom of this post).

A benchmark of GraphScale + Neo4j using the LUBM test suite

Experience has shown that materialization is key to efficient querying in a real-world setting. Without upfront materialization, a reasoning-aware triple store has to temporarily derive all answers and relevant facts for each single query on demand. Consequently, this comes with a performance penalty and typically fails on non-trivial rule sets.

Since the Neo4j graph database is not a triple store, it is not equipped with a SPARQL query engine. However, Neo4j offers Cypher and for many semantic applications it should be possible to translate SPARQL to Cypher queries.

From a user perspective this integrates two technologies into one platform: a transactional graph analytics system as well as an RDFS/OWL reasoning engine able to service sophisticated semantic applications via Cypher over a materialized graph in Neo4j.

As a proof of concept, let us consider SPARQL query number nine from the LUBM test suite that turned out to be one of the most challenging out of the 14 given queries. The query asks for students and their advisors who teach courses taken by those students – a triangular relationship pattern over most of the dataset:

SELECT ?X ?Y ?Z {
?X rdf:type Student .
?Y rdf:type Faculty .
?Z rdf:type Course .
?X advisor ?Y .
?Y teacherOf ?Z .
?X takesCourse ?Z
}

Under the assumption of a fully materialized graph, this SPARQL query translates into the following Cypher query:

MATCH (x:Student)-[:takesCourse]->(z:Course),
      (x)-[:advisor]->(y:Faculty)-[:teacherOf]->(z)
RETURN x, y, z
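
For readers who want to see the triangular pattern spelled out, here is a small Python sketch that evaluates the same pattern over a toy, already materialized graph. It says nothing about how the Cypher engine actually plans or executes the query; the node names and the query_nine helper are hypothetical.

```python
def query_nine(labels, edges):
    """Return (student, faculty, course) triples that close the triangle."""

    def pairs(rel):
        return {(s, t) for s, r, t in edges if r == rel}

    advisor = pairs("advisor")
    teacher_of = pairs("teacherOf")
    takes_course = pairs("takesCourse")

    results = []
    for x, y in advisor:
        if "Student" not in labels.get(x, ()) or "Faculty" not in labels.get(y, ()):
            continue
        for y2, z in teacher_of:
            # The course must be taught by the advisor and taken by the student.
            if y2 == y and "Course" in labels.get(z, ()) and (x, z) in takes_course:
                results.append((x, y, z))
    return sorted(results)


# Materialized labels: the superclass labels Student and Faculty are already present.
labels = {
    "ann": {"UndergraduateStudent", "Student"},
    "prof": {"FullProfessor", "Faculty"},
    "c1": {"Course"},
    "c2": {"Course"},
}
edges = [
    ("ann", "advisor", "prof"),
    ("prof", "teacherOf", "c1"),
    ("prof", "teacherOf", "c2"),
    ("ann", "takesCourse", "c1"),
]

matches = query_nine(labels, edges)
print(matches)  # [('ann', 'prof', 'c1')]
```

Note that the match relies on the Student and Faculty labels being present on the nodes, which is exactly what upfront materialization provides.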

The Neo4j Cypher engine delivers competitive query performance on the datasets used above (the times shown are for the count(*) version of query nine). Triple store A is not listed since it is a pure in-memory system without secondary storage persistence.

Benchmark data between Neo4j + Cypher + GraphScale vs. a triple store

There is more potential in the marriage of Neo4j and the GraphScale technology. In fact, the graph abstraction can be very helpful as an index for query answering. For instance, you can instantly read from the abstraction whether there is some data matching query patterns of kind (x:)-[:]->().

Bottom line: I fully agree with George Anadiotis’ statement that labeled property graphs and RDF/OWL are close relatives.

In a follow-up blog post, I will present an interactive visual exploration and querying tool for RDF graphs that utilizes the compact representation described above as an index to deliver a distinguished user experience and performance on large graphs.

Resources

GraphScale:

GraphScale: Adding Expressive Reasoning to Semantic Data Stores. Demo Proceedings of the 14th International Semantic Web Conference (ISWC 2015): http://ceur-ws.org/Vol-1486/paper_117.pdf
Abstraction refinement for scalable type reasoning in ontology-based data repositories: EP 2 966 600 A1 & US 2016/0004965 A1

Data:

LUBM OWL 2 RL schema and RDF data (http://www.semspect.de/lubm-neo4j-graphscale.7z) used in the evaluation

Dive into the world of the labeled property graph:
Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of connected data.

Get My Free Copy

The post Neo4j: A Reasonable RDF Graph Database & Reasoning Engine [Community Post] appeared first on Neo4j Graph Database Platform.

↧

What’s Waiting for You in the Latest Release of the APOC Library [March 2018]

March 1, 2018, 3:25 am

The last release of the APOC library was just before GraphConnect New York, and in the meantime quite a lot of new features made their way into our little standard library.

We also crossed 500 GitHub stars. Thanks, everyone, for giving us a nod!

What’s New in the Latest APOC Release

Learn about the March 2018 release of the APOC library of user-defined procedures and functions built for Neo4j Desktop

Image: Warner Bros.

If you haven’t used APOC yet, you have one less excuse: it just became much easier to try. In Neo4j Desktop, just navigate to the Plugins tab of your Manage Database view and click “Install” for APOC. Then your database is restarted, and you’re ready to rock.

APOC wouldn’t be where it is today without the countless people contributing, reporting ideas and issues and everyone telling their friends. Please keep up the good work.

I also added a code of conduct and contribution guidelines to APOC, so every contributor feels welcome and safe and also quickly knows how to join our efforts.

For this release again, our friends at LARUS BA did a lot of the work. Besides many bugfixes, Angelo Busato also added S3 URL support, which is really cool. Andrea Santurbano also worked on the HDFS support (read / write).

With these, you can use S3 and HDFS URLs in every procedure that loads data, like apoc.load.json/csv/xml/graphml, apoc.cypher.runFile, etc. Writing to HDFS is possible with all the export functions, like apoc.export.cypher/csv/graphml.

Andrew Bowman worked on a number of improvements around path expanders, including:

Added support for repeating sequences of labels and/or rel-types to express more complex paths
Support for known end nodes (instead of end nodes based only on labels)
Support for compound labels (such as :Person:Manager)

I also found some time to code and added a bunch of things.

Aggregation Functions

I have wanted to add aggregation functions ever since Neo4j 3.2, when Pontus added the capability, but I just never got around to it. Below is one of the patterns that we used to use to get the first (few) elements of a collect, which is quite inefficient because the full collect list is built up even if you’re just interested in the first element:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p,m ORDER BY m.released
RETURN p, collect(m)[0] as firstMovie

Now you can just use:

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p,m ORDER BY m.released
RETURN p, apoc.agg.first(m) as firstMovie
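
As a rough analogy in Python (this is not how apoc.agg.first is implemented internally, and null handling is ignored), the difference between the two patterns looks like this. The function names and sample rows are hypothetical; the rows are assumed to be already sorted by release date.

```python
def first_by_collect(rows):
    """Mimics collect(m)[0]: builds the full list per person, then takes the head."""
    collected = {}
    for person, movie in rows:
        collected.setdefault(person, []).append(movie)
    return {person: movies[0] for person, movies in collected.items()}


def first_by_aggregate(rows):
    """Mimics a 'first' aggregation: keeps only one value per person."""
    first = {}
    for person, movie in rows:
        first.setdefault(person, movie)  # later movies are simply ignored
    return first


# Hypothetical (person, movie) rows, ordered by release date.
rows = [
    ("Keanu", "Speed"),
    ("Keanu", "The Matrix"),
    ("Carrie-Anne", "The Matrix"),
]

print(first_by_aggregate(rows))
```

Both functions return the same result, but the second one never holds more than one movie per person in memory.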

There are also some more statistics functions, including apoc.agg.statistics which computes all at once and returns a map with: {min,max,sum,median,avg,stdev}. The other statistics functions include:

More efficient variants of collect(x)[a..b]
apoc.agg.nth, apoc.agg.first, apoc.agg.last, apoc.agg.slice
apoc.agg.median(x)
apoc.agg.percentiles(x,[0.5,0.9])
apoc.agg.product(x)
apoc.agg.statistics() provides a full numeric statistic

Indexing

Implemented an idea of my colleague Ryan Boyd to allow indexing of full “documents”, i.e. map-structures per node or relationship that can also contain information from the neighborhood or computed data. Later, those can be searched as keys and values of the indexed data.

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WITH p, p {.name, .age, roles:r.roles, movies: collect(m.title) } as doc
CALL apoc.index.addNodeMap(p, doc);

Then, later you can search:

CALL apoc.index.nodes('Person','name:K* movies:Matrix roles:Neo');

apoc.index.addNodeMap(node, {map})
apoc.index.addRelationshipMap(rel, {map})

As part of that work, I also wanted to add support for deconstructing complex values or structs, such as:

apoc.map.values to select the values of a subset of keys into a mixed type list
apoc.coll.elements is used to deconstruct a sublist into typed variables (this can also be done with WITH, but requires an extra declaration of the list to be concise)

RETURN apoc.map.values({a:'foo', b:42, c:true}, ["a","c"]) -> ['foo', true]

CALL apoc.coll.elements([42, 'foo', person]) 
YIELD _1i as answer, _2s as name, _3n as person

Path Expander Sequences

You can now define repeating sequences of node labels or relationship types during expansion. Just use commas in the relationshipFilter and labelFilter config parameters to separate the filters that should apply for each step in the sequence.

relationshipFilter:'OWNS_STOCK_IN>, <MANAGES, LIVES_WITH>|MARRIED_TO>|RELATED'

The above will continue traversing only the given sequence of relationships.

labelFilter:'Person|Investor|-Cleared, Company|>Bank|/Government:Company'

All filter types are allowed in label sequences. The above repeats a sequence of a :Person or :Investor node (but not with a :Cleared label), and then a :Company, :Bank, or :Government:Company node (where :Bank nodes will act as end nodes of an expansion, and :Government:Company nodes will act as end nodes and terminate further expansion).

sequence:'Person|Investor|-Cleared, OWNS_STOCK_IN>, Company|>Bank|/Government:Company,
         <MANAGES, LIVES_WITH>|MARRIED_TO>|RELATED'

The new sequence config parameter above lets you define both the label filters and relationship filters to use for the repeating sequence (and ignores labelFilter and relationshipFilter if present).
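
To illustrate just the idea of a repeating sequence, here is a simplified Python sketch. It is not APOC's implementation: direction markers, label filters, end nodes and terminator nodes are all left out, edges are only followed from source to target, and the relationship type MANAGED_BY and the node names are hypothetical stand-ins.

```python
def expand(start, edges, rel_sequence, max_level):
    """Return all paths from start whose i-th hop uses rel_sequence[i % len(rel_sequence)].

    edges:        list of (source, relationship type, target) triples
    rel_sequence: list of sets of allowed relationship types, one set per step
    """
    paths = []
    frontier = [(start, [])]  # (current node, hops taken so far)
    for level in range(max_level):
        # The filter for this step; the sequence starts over when exhausted.
        allowed = rel_sequence[level % len(rel_sequence)]
        next_frontier = []
        for node, hops in frontier:
            for edge in edges:
                src, rel, dst = edge
                if src == node and rel in allowed and edge not in hops:
                    new_hops = hops + [edge]
                    paths.append(new_hops)
                    next_frontier.append((dst, new_hops))
        frontier = next_frontier
    return paths


edges = [
    ("ann", "OWNS_STOCK_IN", "acme"),
    ("ann", "MARRIED_TO", "dan"),      # not allowed as a first step
    ("acme", "MANAGED_BY", "bob"),
    ("bob", "MARRIED_TO", "cat"),
    ("cat", "OWNS_STOCK_IN", "globex"),  # the sequence repeats here
]
sequence = [{"OWNS_STOCK_IN"}, {"MANAGED_BY"}, {"LIVES_WITH", "MARRIED_TO"}]

paths = expand("ann", edges, sequence, max_level=4)
end_nodes = [path[-1][2] for path in paths]
print(end_nodes)  # ['acme', 'bob', 'cat', 'globex']
```

The fourth hop is checked against the first filter again, which is what makes the sequence repeat.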

Path Expansion Improvements

Compound labels (like Person:Manager) allowed in the label filter, applying only to nodes with all of the given labels.
endNodes and terminatorNodes config parameters, for supplying a list of the actual nodes that should end each path during expansion (terminatorNodes end further expansion down the path, endNodes allow expansion to continue)
For labelFilter, the whitelist symbol + is now optional. Lack of a symbol is interpreted as a whitelisted label.
Some minor behavioral changes to the end node > and termination node / filters, specifically when it comes to whitelisting and behavior when below minLevel depth.

Path Functions

(This one came from a request in neo4j.com/slack.)

apoc.path.create(startNode, [rels])
apoc.path.slice(path, offset, length)
apoc.path.combine(path1, path2)

MATCH (a:Person)-[r:ACTED_IN]->(m)
...
MATCH (m)<-[d:DIRECTED]-()
RETURN apoc.path.create(a, [r, d]) as path

MATCH path = (a:Root)<-[:PARENT_OF*..10]-(leaf)
RETURN apoc.path.slice(path, 2,5) as subPath

MATCH firstLeg = shortestPath((start:City)-[:ROAD*..10]-(stop)),
             secondLeg = shortestPath((stop)-[:ROAD*..10]->(end:City))
RETURN apoc.path.combine(firstLeg, secondLeg) as route

Text Functions

apoc.text.code(codepoint), apoc.text.hexCharAt(), apoc.text.charAt() (thanks to Andrew Bowman)
apoc.text.bytes/apoc.text.byteCount (thanks to Jonatan for the idea)
apoc.text.toCypher(value, {}) for generating valid Cypher representations of nodes, relationships, paths and values
Sørensen–Dice similarity (thanks Florent Biville)
Roman <-> Arabic conversions (thanks Marcin Cylke)
New email and domain extraction functions (thanks David Allen)
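
For readers unfamiliar with the Sørensen–Dice coefficient mentioned in the list above, here is a minimal Python sketch of one common formulation: twice the number of shared character bigrams divided by the total number of bigrams in both strings. Using sets of lowercased bigrams is an assumption made for this sketch; the APOC function may tokenize and normalize differently.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in the lowercased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a, b):
    """Sørensen–Dice coefficient over character bigram sets: 2|A ∩ B| / (|A| + |B|)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0  # treat two strings without any bigrams as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(dice_similarity("night", "nacht"))  # 0.25
```

The strings "night" and "nacht" share only the bigram "ht" out of four bigrams each, which gives 2 * 1 / 8 = 0.25.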

Data Integration

Generic XML import with apoc.import.xml() (thanks Stefan Armbruster)
Pass Cypher parameters to apoc.export.csv.query
MongoDB integration (thanks Gleb Belokrys)
    Added paging parameter in the get and find procedures
Stream apoc.export.cypher script export back to client when no file name is given
apoc.load.csv
    Handling of converted null values and/or null columns
    Explicit nullValues option to define values that will be replaced by null (global and per field)
    Explicit results option to determine which output columns are provided

Collection Functions

apoc.coll.combinations(), apoc.coll.frequencies() (Thanks Andrew)
Update/remove/insert value at collection index (Thanks Brad Nussbaum)

Graph Refactoring

Per property configurable merge strategy for mergeNodes
Means to skip properties for cloneNodes

Other Additions

Added apoc.date.field UDF

Other bugfixes in this release of the APOC library include:

apoc.load.jdbc (type conversion, connection handling, logging)
apoc.refactor.mergeNodes
apoc.cypher.run*
apoc.schema.properties.distinctCount
Composite indexes in Cypher export
ElasticSearch integration for ES 6
Made larger parts of APOC not require the unrestricted configuration
apoc.json.toTree (also config for relationship-name casing)
Warmup improvements (dynamic properties, rel-group)
Compound index using apoc.schema.assert (thanks Chris Skardon)
Explicit index reads don’t require read-write-user
Enable parsing of lists in GraphML import (thanks Alex Wilson)
Change CYPHER_SHELL format from upper case to lower case. (:begin,:commit)
Allowed apoc.node.degree() to use untyped directions (thanks Andrew)

Feedback

As always, we’re very interested in your feedback, so please try out the new APOC releases, and let us know if you like them and if there are any issues.

Please refer to the documentation or ask in neo4j-users Slack in the #neo4j-apoc channel if you have any questions.

Enjoy the new release(s)!

Take a deeper dive into the world of graph algorithms: Read this white paper – Optimized Graph Algorithms in Neo4j – and learn how to harness graph algorithms to tackle your toughest connected data challenge.

Get My Free Copy

The post What’s Waiting for You in the Latest Release of the APOC Library [March 2018] appeared first on Neo4j Graph Database Platform.

↧

The Story behind Russian Twitter Trolls: How They Got Away with Looking Human – and How to Catch Them in the Future

March 7, 2018, 6:03 am

It’s no secret that Russian operatives used Twitter and other social media platforms in an attempt to influence the most recent U.S. presidential election cycle with fake news. The question most people aren’t asking is: How did they do it?

More importantly: How can governments and social media vendors detect this sort of behavior in the future before it has unforeseen consequences?

The Backstory of the Russian Troll Network

First, some background. During the 2016 U.S. presidential election, Twitter was used to propagate fake news, presumably to influence the outcome. The House Intelligence Committee then released a list of 2,752 false Twitter accounts believed to be operated by a known Russian troll factory – the Internet Research Agency. Twitter immediately suspended these accounts, removing their information and tweets from public view.

Even though these accounts and tweets were removed from Twitter.com and the Twitter API, journalists at NBC News were able to assemble a subset of the tweets and, using Neo4j, analyze the data to help make sense of what these Russian accounts were doing.

NBC News recently open sourced the data in the hope that others could learn from this dataset, and to inspire those who have been caching tweets to contribute to a combined database.

Two days later, the Mueller indictment specifically named the Internet Research Agency and several of the Twitter accounts and hashtags that appear in the NBC News data. While most news articles explain what happened, the important question is how it happened, so that future abuse can be prevented.

How Did the Interference Work?

The interference worked in several stages with the development of Twitter accounts that used common hashtags and the posting of reply tweets to popular accounts to gain visibility and followers.

The chart below shows that the vast majority of the tweets were retweets, and only roughly 25% were original content tweets.

Russian troll behavior: Original tweets vs. retweets pie chart

When we break this analysis down on a per-account level, we see that many accounts were only retweeting others, amplifying the messages but not posting much themselves.

Russian troll tweet vs. retweet count totals

These accounts can be classified into roughly three categories:

1.) The Typical American Citizen

This screenshot is the profile image of user @LeroyLovesUSA, one of the accounts Twitter has identified as being operated by the Internet Research Agency in Russia. Many accounts were intended to appear as normal everyday Americans just like this one.

A Russian Twitter troll sharing fake news

These accounts often tried to associate themselves with real-world events and – now that the Mueller indictment has revealed many of these operatives traveled to the U.S. – it is possible they actually participated in these events. In the image above, @LeroyLovesUSA is taking credit for posting a controversial banner from a bridge in Washington, DC.

2.) The Local Media Outlet

The "Cleveland Online" Twitter handle sharing fake news

Another type of Russian troll account presented themselves as local news outlets. Here @OnlineCleveland – another of the IRA-controlled accounts – appears to be a local news outlet in Cleveland. These accounts often posted exaggerated reports of violence.

Other examples of accounts in this category include @KansasDailyNews and @WorldnewsPoli.

3.) The Local Political Party

The TEN_GOP Twitter handle identified in the Mueller indictment

The third type of Russian troll account appeared to be affiliated with local political parties. Here is the account @TEN_GOP, intended to appear as an account connected to the Tennessee Republican party. This account is specifically named in the Mueller indictment.

The Amplification Network

Analyzing the data, we were able to determine that most of the original tweets written in the Russian troll network were written by a small number of users, such as @TEN_GOP. As mentioned above, the majority of overall tweets were retweets, because many of the Russian troll accounts were solely retweeting other accounts in an attempt to amplify the message.

When we apply graph analysis to this retweet network, we can see that the graph partitions into three distinct clusters, or communities. Further, we can run the PageRank centrality algorithm to identify the most influential accounts within each cluster.

The Russian troll Twitter network using the PageRank graph algorithm

A community detection algorithm shows there are three clear communities in the Russian troll retweet network. Node size is proportional to the PageRank score for each node, showing the importance of the account in the network.
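
For readers who have not met PageRank before, here is a minimal power-iteration sketch in Python on a tiny, made-up retweet graph. It is not the Neo4j graph algorithms implementation and does not use the NBC News data; the account names, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Nodes without outgoing edges spread their rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical retweet edges: (retweeting account, original author).
retweets = [
    ("amp_1", "creator"),
    ("amp_2", "creator"),
    ("amp_3", "creator"),
    ("amp_1", "amp_2"),
]

scores = pagerank(retweets)
top = max(scores, key=scores.get)
print(top)  # creator
```

An account that is retweeted by many others accumulates rank, while accounts that only retweet end up with low scores, which mirrors the split between content generators and amplifiers described in this post.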

When we then look into the hashtags these Russian trolls were using, we can see that the red group was tweeting mainly about right-wing politics (#VoterFraud, #TrumpTrain); the yellow group was more left leaning, but not necessarily positively (#ObamasWishlist, #RejectedDebateTopics); and the purple group covered topics in the Black Lives Matter community (#BlackLivesMatter, #Racism, #BLM). Each of these three clusters tended to have a small number of original content generators, with the bulk of the community amplifying the message.

For example, one account @TheFoundingSon sent more than 3,200 original tweets, averaging about 7 tweets per day. On the other hand, accounts like @AmelieBaldwin authored only 21 original tweets out of more than 9,000 sent.

The Mueller Indictment

The Mueller indictment specifically names two Twitter accounts: @TEN_GOP and @March_For_Trump. The NBC News dataset captured thousands of tweets from these two users.

This graph shows tweets from the accounts named in the Mueller indictment and the hashtags they used. You can see a small overlap in the hashtags used by the two accounts.

Conclusion

So, what can social media platforms and governments do to monitor and prevent future abuse?

First, it’s a matter of connections. In today’s hyper-connected world, it’s difficult to identify relationships in a dataset if you’re not using a technology purpose-built for storing connected data. It’s even more difficult if you’re not looking for connections in the first place.

Second, once you’re storing and looking for connections within your datasets, it’s essential to detect and understand the patterns of behavior reflected by those connections. In this case, a simple graph algorithm (PageRank) was able to illustrate that most of the Russian troll accounts behaved like single-minded bees with a focused job – and not like normal humans.

Using a connections-first approach to analyzing these sorts of datasets, both governments and social media platforms can more proactively detect and deter this sort of meddling behavior before it has a chance to derail democracy or poison civil conversation.

You don’t have to take our word for it:
Explore the tweets for yourself with a graph database of the Russian troll dataset available via the Neo4j Sandbox. Click below to get started.

Explore Russian Troll Tweets

The post The Story behind Russian Twitter Trolls: How They Got Away with Looking Human – and How to Catch Them in the Future appeared first on Neo4j Graph Database Platform.

↧

Achievement Unlocked: 1,000 Neo4j Certified Professionals

March 8, 2018, 2:44 am

In January 2016 we launched the first-ever Neo4j Certification for professionals so that Neo4j experts could prove their know-how, and this week saw the 1000th person pass the test.

Celebrate the 1000th Neo4j Certified Professional and learn how you can become Neo4j certified as well

The 1000th person to pass the test was Gábor Szárnyas, our Featured Community Member of 17th February 2018.

Congratulations to Gábor and to everyone who’s passed the certification!

Certification Refresh

In the last two years we’ve added lots of new features to Neo4j – Bolt language drivers, Causal Clustering, native users and roles as well as other database security features – but the certification itself hasn’t been updated until now.

If you’ve already completed the certification you don’t need to do anything – your current certification remains valid. However, you may be interested in getting an official link to your existing certificate which you can learn about below.

Preparing for the Test

If you’re planning on taking the Neo4j certification soon, 2/3 of the test will cover long-lived topics such as:

Graph database basics
The specifics of the property graph model
Cypher syntax, including semantics for data import, creation and querying
Graph data modeling problems

The other 1/3 will cover new features introduced in the Neo4j 3.x series, but don’t worry if you haven’t had a chance to try them all out in detail yet. They’re all well documented, and you can use the following resources to prepare for that part of the test.

The exam still consists of 80 questions that must be answered within an hour, and you’ll need to pass with a score of 80% or higher to be certified.

There’s no charge for certification – it’s 100% free. You just need to head over to the Neo4j Certification page, fill in a few details, and then you’re ready to go.

An Official Link for Certificates

We’re frequently asked for a link to a certificate to send to potential employers or to put on a LinkedIn profile and we’re pleased to be able to offer that as well.

If you’ve already taken the certification, send an email to certification@neo4j.com and we’ll send you an email with your certificate.

If you haven’t already taken the certification, what are you waiting for?! Head over to the certification page, and once you’ve answered all the questions and achieved a score of 80% or higher, you’ll receive an email with instructions explaining how to add the certificate to your LinkedIn profile.

You can see an example below of my certificate on LinkedIn.

Yay, I passed!

Now it’s your turn! Prepare using the resources on this page, and let us know when you’ve passed.

Hello @neo4j! I’ve just passed the #Neo4j Professional Certification! Get certified here: http://neo4j.com/graphacademy/neo4j-certification/


If you have questions around the Neo4j Certification program or the exam, please send us an email to certification@neo4j.com.

Good luck!

What are you waiting for?

Get Neo4j Certified

The post Achievement Unlocked: 1,000 Neo4j Certified Professionals appeared first on Neo4j Graph Database Platform.

↧

Knoten auf der Achterbahn – Neo4j beim Javaland Brühl

March 8, 2018, 3:27 am

≫ Next: Integrate Neo4j with RDKit for the Google Summer of Code [2018]

≪ Previous: Achievement Unlocked: 1,000 Neo4j Certified Professionals

This year, Neo4j will be represented at JavaLand in Brühl in no fewer than 6 talks and workshops given by 7 Neo4j experts. So there will be plenty of opportunity to get to know the versatility of graph databases through practical examples and to put all your questions to the experts on site.

Monday

In a workshop at JavaLand4Kids on Monday, Iryna Feuerstein will dive into the world of graphs with a group of young people and develop an application based on Spring Data Neo4j. Participants are welcome to bring their own ideas for the use case.

Tuesday

On Tuesday morning, March 13, at 08:30 in Quantum, Dirk Mahler and Stephan Pirnbaum will show how to tackle the monolith in “organically grown” software systems step by step. In “Wir schlachten einen Monolithen!” (“Let’s Slaughter a Monolith!”), they present an approach for iteratively and systematically extracting microservices. jQAssistant and Neo4j are used to analyze existing software structures and to safeguard newly emerging architectural concepts over the long term.

Iryna Feuerstein is a newcomer at JavaLand. She will break the jungle of paragraphs in German law into bite-sized pieces, import them into a graph model and thereby make them more accessible. In the form of connected entities, legal paragraphs lose their terror. Her talk “Zwischen den Zeilen lesen – Datenanalyse mit Graphen” (“Reading Between the Lines – Data Analysis with Graphs”) is on March 13 at 11:00 in the Lecture Tent.

On Tuesday evening, Gerrit Meier will organize a BOF on Neo4j & graph databases, where you can ask all your questions. We will try to gather all the speakers mentioned here at Meet-the-Lib. Topics range from Cypher and Spring Data Neo4j through deployment and clustering to modeling, graph algorithms, data analysis with Python and, of course, visualization.

Wednesday

Markus Harrer has already shown in several blog posts and talks how he successfully uses Neo4j, jQAssistant, Pandas and Jupyter Notebooks to analyze software systems. In his JavaLand talk “Mit Datenanalysen Probleme in der Entwicklung aufzeigen” (“Using Data Analysis to Reveal Problems in Development”; Wednesday, March 14, 10:00 in Rotunde), he presents not only the analysis tools but also the underlying motivation and methodology.

Johannes Unterstein speaks on Wednesday, March 14, at 14:00 in the Wintergarten about “Container: check! Aber wohin mit Big Data oder Fast Data?!” (“Containers: Check! But Where to Put Big Data or Fast Data?!”). He shows how Mesos’ infrastructure can be used to reliably orchestrate and operate large container setups. One of his examples uses Neo4j. Johannes developed the Neo4j Universe package, which makes it possible to run Neo4j clusters on Mesos.

Thursday

On Thursday, in Michael Hunger’s full-day workshop (from 9:00, Konfuzius) “Echtzeitempfehlungen selbstgemacht, einfach mit Neo4j” (“Homemade Real-Time Recommendations, Made Easy with Neo4j”), you can learn how to use a graph database to compute recommendations for products, people and topics with little effort. The scoring takes into account item similarity as well as user behavior and geo-information.

We look forward to seeing you! Take the opportunity to talk with us about graph databases. If you would like to have a conversation in a small group, just send an email to michael@neo4j.com.

The post Knoten auf der Achterbahn – Neo4j beim Javaland Brühl appeared first on Neo4j Graph Database Platform.

↧

Integrate Neo4j with RDKit for the Google Summer of Code [2018]

March 9, 2018, 2:46 am

≫ Next: Graph Gopher: The Neo4j Browser Built on Swift for Your iOS Device [Community Post]

≪ Previous: Knoten auf der Achterbahn – Neo4j beim Javaland Brühl

We’ll get straight to the point: We’re looking for a student willing to participate in Google Summer of Code to develop an integration of RDKit into Neo4j.

The project will be supported by

one of the largest chemistry companies of the world,
one of the main developers of RDKit and
Neo4j, Inc.

who will all be serving as mentors on the project.

If you’re interested please get in touch with Stefan Armbruster at stefan@neo4j.com.

RDKit & Neo4j

Academic and industrial research projects in areas such as medicinal chemistry and materials sciences accumulate – on a large scale – data of completely different natures, typically recipe, characterization and performance data. Both the sheer amount of data and its inherent complexity make it difficult for researchers to join the data in their projects optimally.

However, properly integrated data is key to success. Manual data processing and integration blocks a substantial amount of working time, is error prone, and is usually not sustainable.

Knowledge graphs have proven versatile and flexible enough to meet the data management requirements mentioned above. Moreover, a substantial number of the world’s fastest-growing enterprises are successful because they realized that there is a higher value in considering the relations between “things” rather than the “things” by themselves.

One of the strengths of a knowledge graph is its ability to cope with arbitrary path lengths, a challenge frequently met in chemical research (e.g., process sequences). Moreover, the knowledge graph itself serves as an efficient communication vehicle, helping to resolve and pinpoint the complex data situations that are typical of chemical research and development.

Knowledge graphs are based on the fact that “things” are usually connected to each other by “relations.” The relation is expressed semantically, e.g., a process and a substance are connected by the verb “has product”: “process” – [“has product”] → “substance”. This represents an information triple. Upon combination with other triples a network is formed – the knowledge graph. Virtually anything can be mapped to such networks.
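As a rough illustration of how triples combine into a network that can be traversed over paths of any length, here is a minimal Python sketch. The entity and relation names are hypothetical and chosen only for illustration; an in-memory dictionary stands in for a real graph database such as Neo4j.

```python
from collections import defaultdict, deque

# Hypothetical triples (subject, relation, object); all names are illustrative.
TRIPLES = [
    ("process_1", "has product", "substance_A"),
    ("substance_A", "is input of", "process_2"),
    ("process_2", "has product", "substance_B"),
    ("substance_B", "has measurement", "melting_point_B"),
]


def build_graph(triples):
    """Combine triples into an adjacency map: subject -> [(relation, object)]."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the triples linking start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None


graph = build_graph(TRIPLES)
path = find_path(graph, "process_1", "melting_point_B")
for subject, relation, obj in path:
    print(f"{subject} -[{relation}]-> {obj}")
```

The search does not need to know in advance how many process steps separate the two entities, which is the "arbitrary path length" property described above.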

A prominent tool to instantiate knowledge graphs is Neo4j. Having been developed for more than a decade, Neo4j has not only reached a mature status but also set standards for how to interact with graphs by means of our query language, Cypher. Neo4j is open source. A GPLv3 licensed community edition is available with only minor limitations compared to the commercial enterprise version.

For those parts of chemical research dealing with small organic molecules, functionality such as (sub-)structure search is usually indispensable. Here, RDKit has proven to be a versatile and stable tool offering a vast variety of options. RDKit can already be used in conjunction with the relational database Postgres. Taking into account the value graph databases offer, a similar integration with Neo4j would be highly desirable.

Learn how to work with both Neo4j and RDKit for the Google Summer of Code in 2018.

The proposal is to marry RDKit with Neo4j to provide chemical cartridge functionality, both to find entry points into the graph and to efficiently prune paths using chemical know-how while traversing the graph.

Interested? Let us know if you want to join this project by clicking below.

Apply Today

The post Integrate Neo4j with RDKit for the Google Summer of Code [2018] appeared first on Neo4j Graph Database Platform.

↧

Graph Gopher: The Neo4j Browser Built on Swift for Your iOS Device [Community Post]

March 13, 2018, 4:18 am

≫ Next: Now You Can Express Cypher Queries in Pure Python using Pypher [Community Post]

≪ Previous: Integrate Neo4j with RDKit for the Google Summer of Code [2018]

Learn about the Neo4j Browser for your iPhone or iPad built using the Swift programming language

Graph Gopher is a Neo4j browser for iPhone that was recently released to the App Store.

Graph Gopher lets you interact natively with your Neo4j graphs through easy browsing and quick entry for new nodes and relationships. It gives you a full Cypher client at your fingertips and fast editing of your existing data.

Start by browsing labelled nodes or relationships or their property keys

A custom Cypher client for Neo4j on iOS using Graph Gopher

A full Cypher client ready at your fingertips

Add a node or relationship to Neo4j on iOS using Graph Gopher

Quickly add new nodes and relationships

Edit nodes and relationships in Neo4j on iOS using Graph Gopher

Easily edit nodes and relationships

See what graph relationships share a common property key in Neo4j

See what relationships share a common property key, in this case createdDate

How Graph Gopher Got Started

Graph Gopher came out of a few questions I explored. First of all, I was exploring different ways to browse the graphs stored in my Neo4j graph database. The graph visualization of a Cypher query, the approach we know from the Neo4j web interface, was one option, but I thought it demanded quite a bit from the user before they could start exploring, and it was perhaps not as good a fit on a phone-sized device.

After spending a lot of time trying to adapt that, I found that the classic navigation interface was one I thought worked well for exploring the graph. To me, the navigation interface looks a lot like Gopher, the navigation paradigm we used to explore the internet before web browsers, and hence the name was born.

Building Graph Gopher in Swift

The second road to Graph Gopher was that Swift – a language used to write iOS apps – had become open source, and it was starting to be used to write server applications. While databases like MySQL and SQLite were available and used by many, Neo4j was absent.

I knew I could do something about that, and joined Cory Wiles’s Theo project in late 2016. After completing the Swift 3.0 transition together with him, I implemented Bolt support for 3.1 and 3.2.

For version 4.0, I improved the API, made it support Swift 4, and made it a lot easier to use. I used the development of Graph Gopher to validate the work done there, and Graph Gopher is a great demonstration of what you can do with Theo. Along the way, other developers started using the betas of Theo 4, giving me great feedback.

Faster than the Neo4j Browser and Available Wherever You Need It

An ambition for Graph Gopher was to be way faster to load and use than loading up the web interface in a browser tab and interacting with your Neo4j instance that way. In practice, the web interface has been no match for it: Graph Gopher is a very convenient tool. Even though I use a Mac all through my working day, I still access my Neo4j instances primarily through Graph Gopher.

The exception to this is when I write longer Cypher statements as part of my development work, but I have gotten good feedback on how to improve this. Look forward to updates here in the coming versions.

In practice, Graph Gopher makes it so that you always have your Neo4j instance available to you. It helps you add or edit nodes and relationships, prototype ideas and look up queries from your couch, coming out of the shower, on the train, or wherever you are. That is wonderfully powerful.

Another important feature is multi-device support. I use both an iPhone and an iPad, and I know people will use it on both work and private devices. Therefore it was important to me that session configuration was effortlessly transferred between devices, as well as favourite nodes. This has been implemented using iCloud, so that if you add a new instance configuration on one device, it will be available to all devices using the same iCloud account. Likewise, when you favourite a node, the favourite is available on your other devices.

Unique to mobile devices is connectivity, and a lot of work was done to help Graph Gopher keep a stable connection over flaky network connections. If the connection still drops, it will reconnect to allow you to continue working where you left off.

The Future of Graph Gopher

The road forward with Graph Gopher will be exciting. Now that it is out, I get contacted by people in situations I hadn’t imagined at all. Where people use it will be the primary driver of what features get added and how it will evolve. I would absolutely love to hear from you about how you use it, or how you would like to use it.

Want to learn more about graph databases and Neo4j?
Click below to register for our online training class, Introduction to Graph Databases and master the world of graph technology in no time.

Sign Me Up

The post Graph Gopher: The Neo4j Browser Built on Swift for Your iOS Device [Community Post] appeared first on Neo4j Graph Database Platform.

↧

Now You Can Express Cypher Queries in Pure Python using Pypher [Community Post]

March 15, 2018, 4:43 am

≫ Next: GDPR Compliance: Why Graph Technology Is the Fastest (and Most Future-Proof) Solution

≪ Previous: Graph Gopher: The Neo4j Browser Built on Swift for Your iOS Device [Community Post]

Learn more about the Pypher library that allows you to express Cypher queries in pure Python

Cypher is a pretty cool language. It allows you to easily manipulate and query your graph in a familiar – but at the same time – unique way. If you’re familiar with SQL, mixing in Cypher’s ASCII node and relationship characters becomes second nature, allowing you to be very productive early on.

A query language is the main interface for the data stored in a database. In most cases, that language is completely different than the programming language interacting with the actual database. This results in query building through either string concatenation or with a few well-structured query-builder objects (which themselves resolve to concatenated strings).

In my research, the majority of Python Neo4j packages either offered no query builder or a query builder that is a part of a project with a broader scope.

Being a person who dislikes writing queries by string concatenation, I figured that Neo4j should have a simple and lightweight query builder. That is how Pypher was born.

What Is Pypher?

Pypher is a suite of lightweight Python objects that allow the user to express Cypher queries in pure Python.

Its main goals are to cover all of the Cypher use-cases through an interface that isn’t too far from Cypher and to be easily expandable for future updates to the query language.

What Does Pypher Look Like?

from pypher import Pypher

p = Pypher()
p.Match.node('a').relationship('r').node('b').RETURN('a', 'b', 'r')

str(p) # MATCH (a)-[r]-(b) RETURN a, b, r

Pypher is set up to look and feel just like the Cypher that you’re familiar with. It has all of the keywords and functions that you need to create the Cypher queries that power your applications.

All of the examples found in this article can be run in an interactive Python Notebook located here.

Why Use Pypher?

No need for convoluted and messy string concatenation. Use the Pypher object to build out your Cypher queries without having to worry about missing or nesting quotes.
Easily create partial Cypher queries and apply them in various situations. These Partial objects can be combined, nested, extended and reused.
Automatic parameter binding. You do not have to worry about binding parameters as Pypher will take care of that for you. You can even manually control the bound parameter naming if you see fit.
Pypher makes your Cypher queries a tad bit safer by reducing the chances of Cypher injection (this is still quite possible with the usage of the Raw or FuncRaw objects, so be careful).
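To make the injection point concrete, here is a small standard-library sketch that contrasts string concatenation with parameter binding. It does not use Pypher itself; the function names and the `$name` parameter are hypothetical and chosen only for illustration.

```python
def concatenated_query(name):
    # Unsafe: the user-supplied value becomes part of the query text itself.
    return "MATCH (u:User) WHERE u.name = '" + name + "' RETURN u"


def parameterized_query(name):
    # Safer: the query text is constant and the value travels separately.
    return "MATCH (u:User) WHERE u.name = $name RETURN u", {"name": name}


# A crafted input that closes the string literal and rewrites the predicate.
malicious = "' OR true RETURN u //"

print(concatenated_query(malicious))
text, params = parameterized_query(malicious)
print(text)
print(params)
```

With concatenation, the crafted input changes the structure of the query. With a bound parameter, the query text stays the same and the input is only ever treated as a value.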

Why Not Use Pypher?

Strings are a Python primitive and could use a lot less memory in long-running processes. Not much, but it is a fair point.
Python objects are susceptible to manipulation outside of the current execution scope if you aren’t too careful with passing them around (if this is an issue with your Pypher, maybe you should re-evaluate your code structure).
You must learn both Cypher and Pypher and have an understanding of where they intersect and diverge. Luckily for you, Pypher’s interface is small and very easy to digest.

Pypher makes my Cypher code easier to wrangle and manage in the long run. It allows me to conditionally build queries and relieves the hassle of worrying about string concatenation or parameter passing.

If you’re using Cypher with Python, give Pypher a try. You’ll love it.

Examples

Let’s take a look at how Pypher works with some common Cypher queries.

Cypher:

MATCH (u:User)
RETURN u

Pypher:

from pypher import Pypher, __

p = Pypher()
p.MATCH.node('u', labels='User').RETURN.u

str(p) # MATCH (u:`User`) RETURN u

Cypher:

OPTIONAL MATCH (user:User)-[:FRIENDS_WITH]-(friend:User)
WHERE user.id = 1234
RETURN user, count(friend) AS number_of_friends

Pypher:

p = Pypher()  # start a fresh query for this example
p.OPTIONAL.MATCH.node('user', 'User').rel(labels='FRIENDS_WITH').node('friend', 'User')
# continue later
p.WHERE.user.__id__ == 1234
p.RETURN(__.user, __.count('friend').alias('number_of_friends'))

str(p)
# OPTIONAL MATCH (user:`User`)-[FRIENDS_WITH]-(friend:`User`)
# WHERE user.`id` = $NEO_964c1_0 RETURN user, count($NEO_964c1_1)
# AS $NEO_964c1_2
print(dict(p.bound_params))
# {'NEO_964c1_0': 1234, 'NEO_964c1_1': 'friend', 'NEO_964c1_2': 'number_of_friends'}

Use this accompanying interactive Python Notebook to play around with Pypher and get comfortable with the syntax.

So How Does Pypher Work?

Pypher is a tiny Python object that manages a linked list with a fluent interface.

Each method, attribute call, comparison or assignment taken against the Pypher object adds a link to the linked list. Each link is a Pypher instance allowing for composition of very complex chains without having to worry about the plumbing and how to fit things together.

Certain objects will automatically bind the arguments passed in replacing them with either a randomly generated or user-defined variable. When the Pypher object is turned into a Cypher string by calling the __str__ method on it, the Pypher instance will build the final dictionary of bound_params (every nested instance will automatically share the same Params object with the main Pypher object).
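The idea of a fluent interface over a linked list with automatic parameter binding can be sketched in a few lines of plain Python. This is a toy model under my own assumptions, not Pypher's actual implementation: the class names `Link` and `TinyBuilder` and the methods `node`, `where_equals` and `raw` are hypothetical.

```python
class Link:
    """One element of the chain: a piece of Cypher text and a pointer to the next link."""

    def __init__(self, text):
        self.text = text
        self.next = None


class TinyBuilder:
    """Toy fluent builder; every call appends a link and returns the builder."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.bound_params = {}

    def _append(self, text):
        link = Link(text)
        if self.head is None:
            self.head = link
        else:
            self.tail.next = link
        self.tail = link
        return self  # returning self is what makes the interface fluent

    def __getattr__(self, name):
        # Only called for unknown attributes: treat them as keywords (p.MATCH).
        if name.startswith("_"):
            raise AttributeError(name)
        return self._append(name.upper())

    def node(self, variable, label=None):
        return self._append(f"({variable}:`{label}`)" if label else f"({variable})")

    def where_equals(self, variable, prop, value):
        # Bind the value under a generated name instead of inlining it.
        param = f"param_{len(self.bound_params)}"
        self.bound_params[param] = value
        return self._append(f"WHERE {variable}.`{prop}` = ${param}")

    def raw(self, text):
        return self._append(text)

    def __str__(self):
        # Walk the linked list from head to tail and join the pieces.
        parts = []
        link = self.head
        while link is not None:
            parts.append(link.text)
            link = link.next
        return " ".join(parts)


p = TinyBuilder()
p.MATCH.node("u", "User").where_equals("u", "name", "Mark").RETURN.raw("u")
print(str(p))
print(p.bound_params)
```

Note that attribute access has a side effect here (it appends a link), which mirrors the description above that every operation taken against the object adds to the list.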

Pypher also offers partials in the form of Partial objects. These objects are useful for creating complex, but reusable, chunks of Cypher. Check out the Case object for a cool example on how to build a Partial with a custom interface.

Things to Watch Out for

As you can see in the examples above, Pypher doesn’t map one-to-one with Cypher, and you must learn some special syntax in order to produce the desired Cypher query. Here is a short list of things to consider when writing Pypher:

Watch Out for Assignments

When doing assignment or comparison operations, you must use a new Pypher instance on the other side of the operation. Pypher works by building a simple linked list. Every operation taken against the Pypher instance will add more to the list and you do not want to add the list to itself.

Luckily this problem is pretty easy to rectify. When doing something that will break out of the fluent interface it is recommended that you use the Pypher factory instance __ or create a new Pypher instance yourself, or even import and use one of the many Pypher objects from the package.

p = Pypher()

p.MATCH.node('p', labels='Person')
p.SET(__.p.prop('name') == 'Mark')
p.RETURN.p

#or

p.mark.property('age') <= __.you.property('age')

If you are doing a function call followed by an assignment operator, you must get back to the Pypher instance using the single underscore member:

p.property('age')._ += 44

Watch Out for Python Keywords

Python keywords that are either Pypher Statement or Func objects are in all caps. So when you need an AS in the resulting Cypher, you simply write it as all caps in Pypher.

p.RETURN.person.AS.p

Watch Out for Bound Parameters

If you do not manually bind params, Pypher will create the param name with a randomly generated string. This is good because it binds the parameters; however, it also doesn't allow the Cypher caching engine in the Neo4j server to properly cache your query as a template.

The solution is to create an instance of the Param object with the name that you want to be used in the resulting Cypher query.

from pypher import Pypher, __
from pypher.builder import Param

p = Pypher()
name = Param('my_param', 'Mark')

p.MATCH.node('n').WHERE(__.n.__name__ == name).RETURN.n

str(p) # MATCH (n) WHERE n.`name` = $my_param RETURN n
print(dict(p.bound_params)) # {'my_param': 'Mark'}
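The effect on caching can be illustrated without Pypher or a server. The sketch below assumes, as the text above implies, that a query is cached by its exact text. The helper functions are hypothetical, and the random naming scheme only imitates the style of the generated names shown earlier.

```python
import uuid


def query_with_random_param(value):
    # A new random parameter name each time, so the query text differs per call.
    name = "NEO_" + uuid.uuid4().hex[:8] + "_0"
    return f"MATCH (n) WHERE n.`name` = ${name} RETURN n", {name: value}


def query_with_named_param(value, name="my_param"):
    # A fixed parameter name, so the query text is identical on every call.
    return f"MATCH (n) WHERE n.`name` = ${name} RETURN n", {name: value}


values = ("Mark", "Ann", "Bo")
random_texts = {query_with_random_param(v)[0] for v in values}
named_texts = {query_with_named_param(v)[0] for v in values}

print(len(random_texts))  # distinct query texts with random names
print(len(named_texts))   # one shared query text with a fixed name
```

With a fixed name, all three calls share a single query text and differ only in the parameter dictionary, which is what lets a text-keyed cache reuse the same entry.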

Watch Out for Property Access

When accessing node or relationship properties, you must either use the .property function or add a double underscore to the front and back of the property name, for example node.__name__.

Documentation & How to Contribute

Pypher is a living project, and my goal is to keep it current with the evolution of the Cypher language. So if you come across any bugs or missing features or have suggestions for improvements, you can add a ticket to the GitHub repo.

If you need any help with how to set things up or advanced Pypher use cases, you can always jump into the Neo4j users Slack and ping me @emehrkay.

Have fun. Use Pypher to build some cool things and drop me a link when you do.

Take your Neo4j skills up a notch:
Take our online training class, Neo4j in Production, and learn how to scale the #1 graph platform to unprecedented levels.

Take the Class

The post Now You Can Express Cypher Queries in Pure Python using Pypher [Community Post] appeared first on Neo4j Graph Database Platform.

↧

GDPR Compliance: Why Graph Technology Is the Fastest (and Most Future-Proof) Solution

March 19, 2018, 1:35 am

≫ Next: Theo 4.0 Release: The Swift Driver for Neo4j

≪ Previous: Now You Can Express Cypher Queries in Pure Python using Pypher [Community Post]

According to Eurostat, 81% of Europeans feel they don’t wholly control their online data and 69% worry that firms might use their data for purposes other than those advertised.

The European Union’s General Data Protection Regulation (GDPR) states that individuals have the right to ensure their personal data is private and protected.

So why is everyone taking GDPR so seriously? Because penalties for GDPR violations are costly, amounting to the higher of twenty million euros – or four percent of worldwide sales – for each breach of the new regulations.
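The "higher of" rule is simple arithmetic. Here is a minimal sketch; the function name and the sample sales figures are hypothetical and used only to show where the four percent term overtakes the fixed amount.

```python
def max_gdpr_penalty_eur(worldwide_sales_eur):
    """Upper bound per the rule above: the higher of EUR 20 million or 4% of sales."""
    return max(20_000_000, worldwide_sales_eur * 4 / 100)


for sales in (100_000_000, 500_000_000, 1_000_000_000):
    print(sales, max_gdpr_penalty_eur(sales))
```

Below 500 million euros in worldwide sales the fixed 20 million euro ceiling applies. Above that, the four percent term dominates.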

European regulators demonstrated their commitment to enforcing EU regulations with the 2.4 billion euro fine they levied against Google in June 2017 for anti-competitive search-engine practices.

Learn why graph technology is the overall best tool for building GDPR compliance solutions

In this series on GDPR compliance, we’ll break down how companies can best achieve compliance with the EU’s new privacy regulations using the power of graph database technology. Last time, we discussed the challenges and problems with personal data.

This week, we’ll take a closer look at why graph database technology is the best fit for overcoming the challenge of GDPR compliance.

Graph Databases Are the Right GDPR Foundation

Personal data seldom travels in a straight line and instead follows an unpredictable path through the enterprise.

That path is best visualized as a graph, so it’s not surprising that GDPR personal data problems are best addressed by a graph database. Graph technology is designed for connected-data applications like GDPR in which data relationships are as important as the data itself.

As the #1 platform for connected data, Neo4j includes powerful data visualization tools that enable you to model and track the movement of sensitive data through your systems. So you can provide easy, clear answers about personal data to:

Regulators who demand proof of GDPR compliance
GDPR-mandated Data Protection Officers and internal staff responsible for preserving data privacy across all your systems
Individual consumers who ask what you know about them and how you are using that data

Why Graph Technology Is Superior for GDPR

The complex data lineage problems posed by GDPR are impossible to solve with relational databases (RDBMS) and most NoSQL technologies. A modern graph database platform like Neo4j is a superior foundation for addressing the connected data requirements of GDPR compliance.

RDBMS Cannot Handle Connected Data

Relational database technologies are built for managing highly structured datasets that change infrequently and have minimal numbers of clear connections. To connect all your operational GDPR data, you need a colossal maze of JOIN tables and many thousands of lines of SQL code.

Those queries require several months to develop and are nearly impossible to debug and maintain as you add more systems and data relationships. Most importantly, queries of such complexity can take an eternity to execute and can easily hang your server.

Non-Native Graph Technologies Break Down

Some NoSQL and relational databases claim to have graph capabilities. In reality, they have cobbled a graph layer onto their non-graph storage models. These non-native approaches inevitably omit key system connections and break personal data lineage, making them easy targets for GDPR regulators.

Neo4j is a native graph database that stores and connects data as a graph — just as you visualize it on a whiteboard — making Neo4j the ideal technology for GDPR compliance.

A Picture Is Worth a Thousand Words: Proving GDPR Compliance

The ultimate test for any technology is its ability to satisfy regulators and consumers that your organization is GDPR-compliant.

Traditional approaches produce tabular results that are hard to follow. In contrast, Neo4j produces simple, easily understood pictures of how personal data flows through all your systems.

A Modern Graph Approach is Far Superior for GDPR

GDPR approaches: traditional data management vs Neo4j

Conclusion

The mandates of GDPR are looming on the horizon and your enterprise can’t afford to be caught violating these new privacy regulations. But, you also can’t afford to use the wrong technology to tackle this compliance challenge.

Even if you manage to remain compliant with a sub-optimal technology, choosing a wrong fit could cost you millions in implementation costs and months to years of development time – time you don’t have since GDPR goes into effect in May 2018.

Instead, the case is clear: In order to achieve and maintain GDPR compliance, you need a native graph technology platform that allows you to change your data model as compliance regulations and business requirements change (hint: they always do).

Next week, we’ll outline four steps to GDPR compliance using graph technology.

Catch up with the rest of the GDPR compliance blog series:

The Challenges and Problems with Personal Data

The post GDPR Compliance: Why Graph Technology Is the Fastest (and Most Future-Proof) Solution appeared first on Neo4j Graph Database Platform.

↧

Theo 4.0 Release: The Swift Driver for Neo4j

March 21, 2018, 12:00 am

≫ Next: GDPR Compliance: 4 Simple Steps to Building a GDPR Solution

≪ Previous: GDPR Compliance: Why Graph Technology Is the Fastest (and Most Future-Proof) Solution

Learn all about the 4.0 release of Theo – a Swift language driver for the Neo4j graph database

Last week, I wrote about Graph Gopher, the Neo4j client for iPhone. I mentioned that it was built alongside version 4.0 of Theo, the Swift language driver for Neo4j. Today, we’ll explore the Theo 4.0 update in more detail.

But before we dive into the Theo update, let’s have a look at what Theo looks like with a few common code examples:

Instantiating Theo

Swift programming language example for Neo4j

Creating a node and getting the newly created node back, complete with error handling

Looking up a node by ID, including error handling and handling if the node was not found

Performing a Cypher query and getting the results

Performing a Cypher query multiple times with different parameters as part of a transaction, and then rolling it back

As you can see, it reads very much as you would expect Swift code to read, and it integrates with Neo4j the way you would expect a Neo4j driver to. There is no steep learning curve, so you can start being productive right away.

What’s New in Theo 4.0

Now for the update story:

Theo 4.0 had a few goals:

Make a results-oriented API
Support Swift 4.0
Remove REST support

Theo 3.1 was our first version to support Bolt, and while it has matured since then, it turned out to be very stable, memory-efficient and fast right out of the gate.

We learned from using Theo 3 that a completion-block-based API that could throw exceptions, while easy to reason about, could be rather verbose, especially for doing many tasks in a transaction. For version 4, we explored – and ultimately decided upon – a Result type-based API.

That means that a request would still include a completion block, but it would be called with a Result type that would contain either the values successfully queried for, or an error describing the failure.
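The pattern can be sketched in Python, as a rough analogue rather than Theo's Swift API. The `Result` class, the `fetch_node` function and the in-memory store are all hypothetical stand-ins. The point is only that the completion callback receives a single value that is either a success or an error, instead of the call itself raising.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """Holds either a successfully queried value or an error describing the failure."""

    value: Optional[T] = None
    error: Optional[Exception] = None

    @property
    def is_success(self) -> bool:
        return self.error is None


def fetch_node(node_id: int, completion: Callable[[Result[dict]], None]) -> None:
    """Hypothetical lookup in an in-memory store; calls completion exactly once."""
    store = {1: {"name": "Mark"}}
    if node_id in store:
        completion(Result(value=store[node_id]))
    else:
        completion(Result(error=LookupError(f"node {node_id} not found")))


outcomes = []


def on_done(result):
    # Success and failure are handled in one place, without try/except at the call site.
    if result.is_success:
        outcomes.append(("ok", result.value))
    else:
        outcomes.append(("failed", str(result.error)))


fetch_node(1, on_done)
fetch_node(2, on_done)
print(outcomes)
```

Because the error travels inside the result, several requests in a row can share the same handling code, which is where the reduction in verbosity described above comes from.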

Theo 3 having a throwing function with a regular completion block

Theo 4, same example, but now with a Result type in the completion block instead

This allowed us to add parsing that matched each query directly, and thus the code using the driver could delete the result parsing. For our example project, Theo-example, the result was a lot less code. That means less code to debug and maintain.

Theo-example connection screen

Theo Swift driver for Neo4j graph database example

Theo-example main screen

Theo 3.2 added Swift 4 support, in addition to Swift 3. The main purpose of Theo 4, other than incorporating the improvements made to the Bolt implementation, was to remove the REST client, which had been marked as deprecated by 3.2.

Having Theo 3.2 compatible with Swift 4 meant that projects using the REST client could use this as a target for a while going forward, giving them plenty of time to update. We committed to keeping this branch alive until Swift 5 arrived.

The main reason to remove the REST client was that the legacy Cypher HTTP endpoint it was using has been deprecated. This was the endpoint Theo 1 had been built around. Bolt is the preferred way for drivers, and hence it made little sense to adapt the REST client to the transactional Cypher HTTP endpoint that succeeds the legacy Cypher HTTP endpoint.

The result of these changes is an API that is really powerful, yet easy to use. The developer feedback we’ve gotten so far has been very positive. Theo 4 was in beta for a very long time and is now mature enough that we use it in our own products, such as Graph Gopher.

Going forward with Theo 4, the main plan is bug fixes, support for new Neo4j versions, and minor improvements based on community input.

Looking Forward to Theo 5.0

The next exciting part will be Theo 5, which will start taking shape when Swift 5 nears release.

The next major API change will come when Swift updates its concurrency model, so that the API stays close to the recommended Swift style. Specifically, we are hoping that Swift 5 will bring an async-await style concurrency model that we would then adopt in Theo. But it may very well be that this will have to wait until later Swift versions.

Other Ways to Connect to Neo4j Using the Swift Programming Language

If you think Theo is holding your hands too much, you can use Bolt directly through the Bolt-Swift project. The API is fairly straightforward to use, and hey, if you need an example project you can always browse the Theo source code.

Another interesting project to come out of Theo and Bolt support is PackStream-Swift. PackStream is the format that Bolt uses to serialize objects, in a way similar to the more popular MessagePack format. So if you simply need a way to archive your data or send it over a protocol other than Bolt, perhaps PackStream will fit your needs.

Give Us Your Feedback!

You can ask questions about Theo either on Stack Overflow (preferably) or in the #neo4j-swift channel in the neo4j-users Slack.

If you find issues, we’d love a pull request with a proposed solution. But even if you do not have a solution, please file an issue on the GitHub page.

We hope you enjoy using Theo 4.0!

The post Theo 4.0 Release: The Swift Driver for Neo4j appeared first on Neo4j Graph Database Platform.

↧

GDPR Compliance: 4 Simple Steps to Building a GDPR Solution

March 26, 2018, 12:00 am

According to PwC, 92 percent of multinational companies cite compliance with the looming General Data Protection Regulation (GDPR) data privacy regulations as a top data-protection priority.

More than three-quarters of those organizations have allocated over a million dollars for related compliance efforts, with nearly ten percent planning to spend more than ten million dollars each.

For the enterprises spending their dollars on graph technology, their investment will be worth every penny.

Discover 4 simple steps to building a GDPR compliance solution using Neo4j graph technology

In this series on GDPR compliance, we’ll break down how companies can best achieve compliance with the EU’s new privacy regulations using the power of graph database technology. In previous weeks, we discussed the challenges and problems with personal data and why graph technology is the fastest (and most future-proof) solution to GDPR compliance.

This week, we’re taking a deeper dive into the practical steps you can take to get started on your GDPR compliance solution.

4 Steps to GDPR Compliance

Follow these steps to build your organization’s GDPR solution using the Neo4j graph database as its foundation:

A GDPR compliance solution building plan

Step 1: Inventory Your Systems

Identify all enterprise systems that use or could potentially use GDPR-regulated personal data. Document where and how those systems store personal data.

For more information on identifying and mapping out master data, read this white paper:

Your Master Data Is a Graph: Are You Ready?

Step 2: Build Your Logical Data Model

Build a logical data model of personal data elements, and how and when they flow across your systems. Define system connections including metadata that describes and quantifies them.

Check out these resources for more information on data modeling:

Step 3: Develop and Test Your GDPR System

Using your logical data model, load your data into Neo4j. Then develop and test your solution by creating simple queries that address the personal data requirements of GDPR.
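
As a rough illustration of the kind of question such queries answer, the sketch below models data flows in plain Swift, with no database involved. Everything in it is an assumption made for illustration: the `DataFlow` type, the system names, the `purpose` metadata and the `systemsReached` function are hypothetical, and a real solution would express the same traversal as a query against the graph loaded into Neo4j.

```swift
// Hypothetical logical model: systems are nodes, and each flow of a
// personal data element between two systems is a relationship with metadata.
struct DataFlow {
    let from: String     // source system
    let to: String       // receiving system
    let element: String  // personal data element, e.g. "email"
    let purpose: String  // metadata describing the connection
}

// Made-up sample data, not taken from any real system inventory.
let flows = [
    DataFlow(from: "CRM", to: "Billing", element: "email", purpose: "invoicing"),
    DataFlow(from: "Billing", to: "Archive", element: "email", purpose: "retention"),
    DataFlow(from: "CRM", to: "Newsletter", element: "email", purpose: "marketing"),
    DataFlow(from: "CRM", to: "Analytics", element: "postcode", purpose: "reporting"),
]

// Breadth-first traversal: every system that a given data element can
// reach from a starting system, including the starting system itself.
func systemsReached(by element: String, startingAt start: String,
                    in flows: [DataFlow]) -> Set<String> {
    var visited: Set<String> = [start]
    var queue = [start]
    while !queue.isEmpty {
        let current = queue.removeFirst()
        for flow in flows where flow.from == current && flow.element == element {
            // insert(_:) reports whether the system was newly added.
            if visited.insert(flow.to).inserted {
                queue.append(flow.to)
            }
        }
    }
    return visited
}

// "Where does a customer's email address end up?"
let emailSystems = systemsReached(by: "email", startingAt: "CRM", in: flows)
```

Here `emailSystems` contains CRM, Billing, Archive and Newsletter but not Analytics, because only the postcode flows there. Answering this for every regulated data element is the kind of traversal a graph database is built for.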

To learn more about harnessing the power of connected data, and drawing out connected insights from your existing RDBMS architecture, check out these two white papers:

Step 4: Visualize and Respond to GDPR Requests

Use Neo4j and third-party data visualization tools to display the flow of personal data across your systems. Answer questions quickly about how it is being used by your organization.

Review our listing of data visualization partners for more information on graph visualization solutions for Neo4j.

Conclusion

While GDPR might be a complex regulation, your compliance solution doesn’t have to be. Following these simple steps to identify, model, build and visualize your customers’ personal data not only keeps you ahead of regulations as they evolve, but it gives you a connections-first perspective on your data that delivers value to your bottom line.

This concludes our series on GDPR compliance and using Neo4j graph technology to manage data privacy regulations.

Catch up with the rest of the GDPR and Neo4j blog series:

The Challenges and Problems with Personal Data

Why Graph Technology Is the Fastest (and Most Future-Proof) Solution to GDPR Compliance

The post GDPR Compliance: 4 Simple Steps to Building a GDPR Solution appeared first on Neo4j Graph Database Platform.

↧

This Week in Neo4j – Graph Visualization, GraphQL, Spatial, Scheduling, Python

March 31, 2018, 12:00 pm

Welcome to this week in Neo4j where we round up what’s been happening in the world of graph databases in the last 7 days. As my colleague Mark Needham is on his well-earned vacation, I’m filling in this week.

Next week we plan to do something different. Stay tuned!

Featured Community Member: Jeffrey Miller

Jeffrey A. Miller works as a Senior Consultant in Columbus, Ohio supporting clients in a wide variety of topics. Jeffrey has delivered presentations (slides) at regional technical conferences and user groups on topics including Neo4j graph technology, knowledge management, and humanitarian healthcare projects.

Jeffrey A. Miller – This Week’s Featured Community Member

Jeffrey published a really interesting Graph Gist on the Software Development Process Model. He was recently interviewed on the Cross Cutting Concerns Podcast about his work with Neo4j.

Jeffrey and his wife, Brandy, are aspiring adoptive parents and have written a fun children’s book called “Skeeters” with proceeds supporting adoption.

On behalf of the Neo4j community, thanks for all your work Jeffrey!

Interesting, Neo4j Related Projects

The infamous Max De Marzi demonstrates how to use Neo4j for a common meeting room scheduling task. Quite impressive Cypher queries in there.
Max also demos another new feature of Neo4j 3.4 – geo-spatial indexes. In his blog post, he describes how to use them to find the right type of food place for your tastes via the geolocation of the city that you’re both in.
There seems to be a lot of recent interest in Python front-ends for Neo4j. Timothée Mazzucotelli created NeoPy, which is in early alpha but contains some nice ideas.
Zeqi Lin has a number of cool repositories for importing different types of data into Neo4j, e.g. Java classes, Git commits or parts of Docx documents, and even SnowGraph, a software data analytics platform built on Neo4j.
I think I came across this before, but newrelic-neo4j is a really neat way of getting Neo4j metrics into NewRelic, thanks Ștefan-Gabriel Muscalu. While browsing his repositories I also came across this WikiData Neo4j Importer, which I need to test out.
This AutoComplete system uses Neo4j to store terms, counts and other associated information. It returns the top 10 suggestions for auto-complete and tracks usage patterns.
Sam answered a question on counting distinct paths on StackOverflow.

Nigel is teasing us

A new version of py2neo is coming soon. Designed for Neo4j 3.x, this will remove the previously mandatory HTTP dependency and include a new set of command line tools and other goodies. Expect an alpha release within the next few days.

Graph Visualizations

I had some fun this week with 3d-force-graph and Neo4j. It was really easy to combine the 3D graph visualization project, which is based on three.js and available in 2D, 3D, for VR and as React Components, with the Neo4j JavaScript driver. Graphs of up to 5,000 relationships load in under a second.

See the results of my experiments in my repository, which also links to several live versions of different setups (thanks to rawgit).

My colleague Will got an access key to Graphistry and used this Jupyter Notebook to load the Russian Twitter trolls from Neo4j.

I also came across another Cytoscape plugin for Neo4j, which looks quite useful.

Zhihong SHEN created a Data Visualizer for larger Neo4j graphs using vis.js; an online demo is available.

Desktop & GraphQL

This week’s update of Neo4j Desktop adds the neo4j-graphql extension that our team has been working on for a while.

There will be more detail around it from Will next week but I wanted to share a sneak preview for all of you that want to have some fun with GraphQL & Neo4j over the weekend.

Next Week

What’s happening in the next two weeks in the world of graph databases?

Date	Title	Group	Speaker
April 3rd	Importer massivement dans une base graphe !	GraphDB Lyon	Gabriel Pillet
April 5th	GraphTour Afterglow: Lightning Talks	GraphDB Brussels	Tom Michiels, Dirk Vermeylen, Ignaz Wanders, Surya Gupta
April 9-10th	Training – Neo4j Masterclass – Amsterdam	GoDataDriven	Ron van Weverwijk
April 10th	Training – Atelier – Les basiques Neo4j – Paris	Paris	Benoit Simard
April 10th	Meetup – The Night Before the Graphs – Milan	Milan	Michele Launi, Matteo Cimini, Roberto Franchini, Omar Rampado, Alberto De Lazzari
April 11th	Conference – Neo4j GraphTour – Milan	Milan	several
April 12th	Training Data Modeling	Milan	Lorenzo Speranzoni, Fabio Lamanna
April 12th	Neo4j GraphTour USA #1	Arlington, VA	several
April 12th	Meetup: Paradise Papers	Munich	Stefan Armbruster
April 13th	Training Graph Data Modeling	Amsterdam	Kees Vegter
April 29th	Searching for Shady Patterns	PyData London	Adam Hill

Tweet of the Week

My favourite tweet this week was our own Easter Bunny

#HappyEaster #Neo4j Community

CREATE
(h:Head)<-[:EAR]-(h),
(h)-[:EAR]->(h),
(b:Body)-[:NECK]->(h),
(b)-[:WAG]->(t:Tail),
(fl:Leg)<-[:JNT]-(b)-[:JNT]->(fr:Leg),
(hl:Leg)<-[:JNT]-(b)-[:JNT]->(hr:Leg)
RETURN *
pic.twitter.com/eItnZCzFBj
— Neo4j (@neo4j) March 30, 2018

Don’t forget to RT if you liked it too.

That’s all for this week. Have a great weekend! And Happy Easter or Passover, if you celebrate it.

Cheers, Michael

The post This Week in Neo4j – Graph Visualization, GraphQL, Spatial, Scheduling, Python appeared first on Neo4j Graph Database Platform.

↧
