[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]
Why Are We Doing This?
Data analysis is the phenomenon of dissecting, structuring and understanding data. In a nutshell, we want to find meaning from our data. In this tutorial, we aim to
analyze a dataset from
issuu.com. The goal is to find answers to a variety of simple and complex questions.
There are a plethora of tools, techniques and methods available to pursue data analysis. We will use
Neo4j – a
graph database – to represent and visualize the data. It uses a query language called
Cypher which allows us to build queries and find all the answers that we seek. By the end of this tutorial, we will be able to import a JSON dataset to Neo4j and comfortably perform queries on our data.
Recommended Reading
Readers are encouraged to read and possess a basic understanding about Neo4j and graph databases prior to starting this tutorial.
I previously published a beginner’s guide to Neo4j –
Getting Started with Neo4j – which introduces the concepts of graph databases and elaborates on how to perform basic CRUD (Create, Read, Update, Delete) operations using Neo4j.
Getting Set Up
The first thing we need to do is to
download and
install Neo4j. We will use Neo4j Desktop which provides a
user-friendly UI to visualise the graph and run queries.
The next step is to acquire and understand the dataset!
1. The Dataset
The dataset that we will analyze comes from Issuu – an online repository for magazines, catalogs, newspapers and other publications. They published the
Issuu Research Dataset with a treasure of data about documents and visitors. The dataset is completely anonymised and provides an insight into the usage of the website.
The data is available in the JSON format. It can be downloaded and accessed from this GitHub repository. There are two flavors of this file:
- A small version –
issuu_sample.json
(This version of the dataset has 4
entries).
- A large version –
issuu_cw2.json
(This version of the dataset has 10,000
entries).
Both these datasets have been slightly modified for the purpose of this tutorial. To summarize the modification, all JSON entries are now stored in an array and referenced by a key called
items
.
The dataset is vast and the detailed specification is available
here. However, we are interested in the following attributes:
Attribute |
Purpose |
env_doc_id |
Uniquely identify each document |
visitor_uuid |
Uniquely identify each visitor |
visitor_country |
Two-letter code to identify visitor’s country |
event_type |
Type of action accomplished by visitor on the document |
2. Understanding the Graph
Now that we have selected our dataset and cherry-picked the necessary attributes, the next step is to formulate the data as a
graph. To create the graph in Neo4j, we need to identify the following elements:
- Nodes
- Relationships
- Properties
From our dataset, we can identify two nodes with the following properties:
Node |
Properties |
Document |
doc_uuid |
Visitor |
visitor_uuid , country |
Note: uuid
stands for “Universally Unique IDentifier.”
Tip: A node can also be thought of as a
class in object-oriented programming.
What about the relationships? We can create one relationship between the
document
and
visitor
nodes.
Relationship |
Properties |
Visitor viewed document |
type |
The relationship
viewed
is
generic in nature. The
type
property indicates the
specific type of event that was accomplished.
As an example, if we consider the visitor
Thomas and the document
A Diary of Jane, then the relationship can be illustrated as: Thomas
viewed A Diary of Jane. However, the
type of viewership could be any one of the following:
impression
click
read
download
share
pageread
pagereadtime
continuation_load
For the purpose of this exercise, we will use a single property on the relationship called as
type
. The relationship can now be illustrated as: visitor
Thomas viewed (
and specifically downloaded) document
A Diary of Jane.
3. Creating Constraints & Indexes
A
constraint is a mechanism to control and ensure
data integrity in Neo4j. Constraints can be created on either nodes or relationships. There are basically three types of constraints in Neo4j:
- Unique node property constraints – to ensure that the graph contains only a single node with a specific label and property value.
- Node property existence constraints – to ensure that a certain property of a specific label exists within all nodes in the graph.
- Relationship property existence constraints – to ensure that a certain property exists within all relationships of a specific structure.
- Node Keys – to ensure that, for a given label and set of properties, there exists all those properties for that label and that the combination of property values is unique. Essentially, a combination of existence and uniqueness.
We will create unique node property constraints for our graph as follows. On the
Document
label for the property
doc_uuid
:
CREATE CONSTRAINT ON (d:Document) ASSERT d.doc_uuid IS UNIQUE
On the
visitor
label for the property
visitor_uuid
:
CREATE CONSTRAINT ON (v:Visitor) ASSERT v.visitor_uuid IS UNIQUE
The main goal of this exercise is to query the graph to derive insights about the dataset. One way to improve retrieval efficiency is by using indexes. The idea behind an index here is the same as in relational and NoSQL databases.
In Neo4j, an index can be created on a single property of a label. These are known as single-property indexes. An index can also be created on multiple properties of a label, and these are known as composite indexes.
It is important to understand that, by creating a unique node property constraint on a property, Neo4j will also create a single-property index on that property. Hence, in our situation, the indexes will be created on the
doc_uuid
property for the
document
label and on the
visitor_uuid
property for the
visitor
label.
Now that we have created the constraints, we can view existing indexes in Neo4j using the query:
CALL db.indexes
It should return:
╒══════════════════════════════════╤══════╤══════════════════════╕
│description │state │type │
╞══════════════════════════════════╪══════╪══════════════════════╡
│INDEX ON :Document(doc_uuid) │online│node_unique_property│
├──────────────────────────────────┼──────┼──────────────────────┤
│INDEX ON :Visitor(visitor_uuid)│online│node_unique_property│
└──────────────────────────────────┴──────┴──────────────────────┘
Tip: More information about constraints and their effect on indexes is elaborated
in the Neo4j documentation on constraints.
4. Importing the JSON Dataset
Let’s now get our hands dirty!
Assuming that Neo4j is started (with an appropriate
Database Location
selected), we should first see an empty graph. This means that there are no nodes or relationships. Our goal is to populate the graph with the data from the JSON dataset by defining its skeleton (nodes and relationships).
To accomplish this, we will use the concept of
user-defined procedures in Neo4j. These procedures are specific functionalities which can be re-used to manipulate the graph. We will specifically use a procedure from the
APOC library which is a collection of 200+ commonly used procedures.
On Neo4j Desktop, APOC can be installed with the click of a single button.
To do this, open your project and click on the
Manage
button. Next, click on the
Plugins
tab. Under this tab, you will see the APOC heading accompanied with its version and a description. Click on the
Install and Restart
button. Once APOC is successfully installed, you should see a label with the message ✓
Installed
.
Here is a screenshot of a successful APOC installation:
Tip: More information about user-defined procedures and the APOC library (installation, usage, examples) is elaborated in this Neo4j blog article:
APOC: An Introduction to User-Defined Procedures and APOC.
We are interested in a specific procedure –
apoc.load.json
– which will allow us to load data from a JSON document. This procedure will return a singular map if the result is a JSON object or a stream of maps if the result is an array.
The following code snippet illustrates the use of the
apoc.load.json
procedure with our large dataset –
issuu_cw2.json
:
WITH "/path/to/issuu_cw2.json" AS url
CALL apoc.load.json(url)
YIELD value
Now,
value
contains our data which we need to utilize to create the nodes and relationships. However, we first need to parse this bulk JSON data.
To do this, we can use the
UNWIND
keyword in Neo4j. In a nutshell,
UNWIND
will expand a list into a sequence of rows. (More information about
UNWIND
is available in
the Neo4j documentation on Cypher clauses.) At this stage, we are required to possess an understanding of the structure of the JSON data. As mentioned in step 1, we can access the entries using the key
items
.
This is illustrated as:
UNWIND value.items AS item
At this stage, we have access to a single row via
item
. Now, we can use the API to access the values.
In Neo4j, the
WITH
clause allows using the output of a sub-query with following sub-query parts. (More information about
WITH
is available in
the Neo4j documentation on Cypher clauses).
From our data, we are concerned about documents from the Issuu “Reader” software. We uniquely identify these documents using the
env_doc_id
attribute. However, it is to be noted that not all documents have the
env_doc_id
attribute. Thus, we are required to explicitly select those entries which possess the attribute. The following code snippet illustrates this:
WITH item
WHERE NOT item.env_doc_id IS NULL
Notice how we access the value using
item.env_doc_id
. This style of retrieving the value makes working with JSON data on Neo4j a smooth experience.
Now that we have access to the values of each entry, it is time to create the nodes and relationships. This is accomplished using the
MERGE
keyword in Neo4j. It is
crucial to know the difference between
CREATE
and
MERGE
.
For example, if we execute the following statements sequentially (provided that the constraints were
not created):
CREATE (d:Document {doc_uuid:1}) RETURN (d)
CREATE (d:Document {doc_uuid:1}) RETURN (d)
This will result in two nodes being created. A way to control this is by using
MERGE
, which will create the node only if it does not exist. This can be illustrated as follows:
MERGE (d:Document {doc_uuid:1}) RETURN (d)
This same principle can be applied to relationships as well. The final Cypher query consisting of all the above components will look like this:
WITH "https://raw.githubusercontent.com/.../issuu_cw2.json" AS url
CALL apoc.load.json(url)
YIELD value
UNWIND value.items AS item
WITH item
WHERE NOT item.env_doc_id IS NULL
MERGE (document:Document
{doc_uuid:item.env_doc_id})
MERGE (visitor:Visitor
{visitor_uuid:item.visitor_uuid})
ON CREATE SET visitor.visitor_country
= item.visitor_country
MERGE (visitor)-[:VIEWED{type:item.event_type}]->(document)
If we run this Cypher query verbatim on Neo4j, the output should be (similar to):
Added 2293 labels, created 2293 nodes, set 5749 properties,
created 2170 relationships, statement executed in 15523 ms.
To check whether the graph was populated successfully, we can run the Cypher query…
MATCH (n) RETURN (n) LIMIT 200
…which will only display the top 200 results. The output can be visualized as follows:
5. Let’s Query
Finally, we can derive insights from our data.
We need to ask our graph questions. These questions need to be translated to Cypher queries which will return the appropriate results. Let’s start by answering some basic and advanced questions about the dataset.
Note: For the purposes of this tutorial, we will only display the top 10 results for queries with a large number of rows. This is achieved by using the
LIMIT 10
constraint.
Query 1. Find the number of visitors from each country and display them in the descending order of count.
MATCH (v:Visitor)
RETURN v.visitor_country AS Country, count(v) AS Count
ORDER BY count(v) DESC
LIMIT 10
Result:
╒════════╤═════╕
│Country│Count│
╞════════╪═════╡
│US │312 │
├────────┼─────┤
│BR │143 │
├────────┼─────┤
│MX │135 │
├────────┼─────┤
│PE │47 │
├────────┼─────┤
│CA │46 │
├────────┼─────┤
│ES │43 │
├────────┼─────┤
│GB │36 │
├────────┼─────┤
│AR │35 │
├────────┼─────┤
│FR │34 │
├────────┼─────┤
│CO │32 │
└────────┴─────┘
This query simply performs an internal
group by operation where visitor nodes are grouped based on the
visitor_country
property. The count is computed using the
count()
aggregate function. We sort the results in the descending order using the
ORDER BY <column> DESC
clause in Neo4j.
Query 2. For a given document, find the number of visitors from each country.
MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
WHERE d.doc_uuid='140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9'
RETURN v.visitor_country AS Country, count(v.visitor_country) AS Count
ORDER BY count(v.visitor_country) DESC
Result:
╒════════╤═════╕
│Country│Count│
╞════════╪═════╡
│GY │15 │
├────────┼─────┤
│CA │12 │
├────────┼─────┤
│US │11 │
├────────┼─────┤
│CW │1 │
├────────┼─────┤
│BB │1 │
└────────┴─────┘
This query is very similar to the one above. Here, we perform an internal
group by operation to group the visitor nodes based on the
visitor_country
property. However, this query differs from the previous one in the sense that we want to filter the counts for a particular document UUID.
In order to achieve this filtration, we need to utilise the relationship within the graph. Hence, we first
MATCH
, filter using the
WHERE
clause and then return the desired values.
Tip: The relationship given here…
MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
...can also be written as…
MATCH (v:Visitor)-[:VIEWED]->(d:Document)
Query 3. Find the number of occurrences for each type of viewership activity.
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN r.type AS Type, count(d.doc_uuid) AS Count
ORDER BY Count ASC
Result:
╒═════════════╤══════╕
│Type │Count│
╞═════════════╪══════╡
│click │1 │
├─────────────┼──────┤
│read │62 │
├─────────────┼──────┤
│pageread │369 │
├─────────────┼──────┤
│pagereadtime│779 │
├─────────────┼──────┤
│impression │959 │
└─────────────┴──────┘
This query also performs an internal
group by operation on the relationship property
type
. An interesting aspect of this query is the
ORDER BY Count ASC
. Previously, we followed the style of using
ORDER BY count(d.doc_uuid) ASC
. However, once we add a column name such as
Count
, we can use that in subsequent parts of the query.
Hence,
ORDER BY count(d.doc_uuid) ASC
can also be written as
ORDER BY Count ASC
.
Query 4. Find the visitors for each document and display the top three in the descending order of number of visitors.
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN DISTINCT d.doc_uuid AS DocUUID, collect(DISTINCT v.visitor_uuid) AS Visitors, count(DISTINCT v.visitor_uuid) AS Count
ORDER BY Count DESC
LIMIT 3
Result:
╒═════════════════════════════════╤═════════════════════════════════╤═════╕
│DocUUID │Visitors │Count│
╞═════════════════════════════════╪═════════════════════════════════╪═════╡
│140224101516-e5c074c3404177518│[4f4bd7a35b20bd1f, 78e1a8af51d│26 │
│bab9d7a65fb578e │44194, 4d49271019c7ed96, 6b2d3│ │
│ │cca6c1f8595, 19f5285fef7c1f00,│ │
│ │ 3819fc022d225057, f102d9d4fc4│ │
│ │bacdc, e5d957682bc8273b, abefb│ │
│ │3fe7784f8d3, 6170372b90397fb3,│ │
│ │ 797846998c5624ca, 43dd7a8b2fa│ │
│ │fe059, 3ec465aa8f36302b, d6b90│ │
│ │f07f29781e0, 7bd813cddec2f1b7,│ │
│ │ 3db0cb8f357dcc71, e1bfcb29e0f│ │
│ │3664a, 6d87bcdc5fa5865a, b0ba1│ │
│ │42cdbf01b11, 0930437b533a0031,│ │
│ │ e3392e4a18d3370e, ee14da6b126│ │
│ │3a51e, 502ddaaa898e57c4, 6fd04│ │
│ │0328d2ad46f, 23f8a503291a948d,│ │
│ │ 923f25aa749f67f6] │ │
├─────────────────────────────────┼─────────────────────────────────┼─────┤
│140228202800-6ef39a241f35301a9│[2f21ee71e0c6a2ce, 55ac6c3ce63│25 │
│a42cd0ed21e5fb0 │25228, e8fa4a9e63248deb, b2a24│ │
│ │f14bb5c9ea3, d2d6e7d1a25ee0b0,│ │
│ │ 6229cca3564cb1d1, 13ca53a93b1│ │
│ │594bf, 47d2608ec1f9127b, 4e2a7│ │
│ │5f30e6b4ce7, 43b59d36985d8223,│ │
│ │ 355361a351094143, 51fd872df55│ │
│ │686a5, 2f63e0cca690da91, febc7│ │
│ │86c33113a8e, 52873ed85700e41f,│ │
│ │ ca8079a4aaff28cb, 17db86d2605│ │
│ │43ddd, b3ded380cc8fdd24, b6169│ │
│ │f1bebbbe3ad, 458999cbf4307f34,│ │
│ │ 280bd96790ade2d4, 32563acf872│ │
│ │f5449, fabc9339a406616d, 36a12│ │
│ │501ee94d15c, 6d3b99b2041af286]│ │
├─────────────────────────────────┼─────────────────────────────────┼─────┤
│140228101942-d4c9bd33cc299cc53│[b1cdbeca3a556b72, a8cf3c4f144│24 │
│d584ca1a4bf15d9 │9cc5d, 1435542d699350d9, 06d46│ │
│ │5bfb51b0736, 2d41536695cc4814,│ │
│ │ a96854d21780c1f9, d1c98b02398│ │
│ │e9677, 78deb8ffdb03d406, 6c661│ │
│ │964d1d13c61, 0d47795fb1ddba9d,│ │
│ │ 667283570b5cedfe, 5b2baf03296│ │
│ │63564, 08c069dc405cad2e, 6823c│ │
│ │573efad29f6, 9b2cb60327cb7736,│ │
│ │ 0e8ddc2d2a60e14f, f5986e1cb02│ │
│ │378e4, fa3810e505f4f792, d5ed3│ │
│ │cfc4a454fe9, ba76461cdd66d337,│ │
│ │ ee42ba15ed8618eb, 688eb0dcd6a│ │
│ │d8c86, 67c698e88b4fbdcc, c97c3│ │
│ │83d774deae0] │ │
└─────────────────────────────────┴─────────────────────────────────┴─────┘
This query utilizes the
collect()
aggregate function which groups multiple records into a list. An important consideration made here is the use of the
DISTINCT
operator to ensure that duplicate values are omitted from the output. Finally, we display the top three using the
LIMIT 3
constraint.
Query 5. For a given document, find recommendations of other documents like it.
Example 1: document UUID =
130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document)
WHERE d1<>d
AND
d.doc_uuid='130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8'
RETURN d1 AS Recommendations, count(*) AS Views,
sum(
CASE r1.type
WHEN "impression" THEN 1
WHEN "pageread" THEN 1.5
WHEN "pagereadtime" THEN 1.5
WHEN "read" THEN 2
WHEN "click" THEN 0.5
ELSE 0
END
) as Score
ORDER BY Score DESC
Result:
╒═════════════════════════════════╤═════╤═════╕
│Recommendations │Views│Score│
╞═════════════════════════════════╪═════╪═════╡
│{doc_uuid: 130810070956-4f21f4│12 │16 │
│22b9c8a4ffd5f62fdadf1dbee8} │ │ │
└─────────────────────────────────┴─────┴─────┘
Example 2: document UUID =
120831070849-697c56ab376445eaadd13dbb8b6d34d0
MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document)
WHERE d1<>d
AND
d.doc_uuid='120831070849-697c56ab376445eaadd13dbb8b6d34d0'
RETURN d1 AS Recommendations, count(*) AS Views,
sum(
CASE r1.type
WHEN "impression" THEN 1
WHEN "pageread" THEN 1.5
WHEN "pagereadtime" THEN 1.5
WHEN "read" THEN 2
WHEN "click" THEN 0.5
ELSE 0
END
) as Score
ORDER BY Score DESC
Result:
╒═════════════════════════════════╤═════╤═════╕
│Recommendations │Views│Score│
╞═════════════════════════════════╪═════╪═════╡
│{doc_uuid: 130701025930-558b15│6 │6 │
│0c485fc8928ff65b88a6f4503d} │ │ │
├─────────────────────────────────┼─────┼─────┤
│{doc_uuid: 120507012613-7006da│6 │6 │
│2bc335425b93d347d2063dc373} │ │ │
└─────────────────────────────────┴─────┴─────┘
This query aims to find documents similar to a given document by assigning a score based on the type of viewership activity.
Activity |
Score |
impression |
1 |
pageread |
1.5 |
pagereadtime |
1.5 |
read |
2 |
click |
0.5 |
First, we perform a
MATCH
operation to capture the 1st degree and 2nd degree viewership of a visitor node along with the document nodes and relationships. We ensure that two documents nodes are not the same by using the
<>
operator and also specify the initial document UUID for which we would like to find related documents.
Next, we simply return the recommended document UUIDs, their overall viewership counts and their score. To calculate the score, we utilize the
CASE
expression which is then tallied using the
sum()
aggregate function.
The
CASE
expression has the following syntax:
CASE
WHEN THEN //if value matches, then return result
[WHEN ...] //repeat until all values are handled
[ELSE ] //else return a default result
END
Finally, we sort the results in the descending order of score using the
ORDER BY Score DESC
clause!
Summary
In this tutorial, we saw an example of performing data analysis using Neo4j. We examined the Issuu Research Dataset and elaborated on its structure, format and fields.
Next, we formulated the model/schema of our desired graph by choosing appropriate nodes, relationships and properties. Further, we created constraints and indexes in Neo4j to ensure uniqueness and improve the performance of querying.
After this, we discussed how to import the raw JSON dataset, parse it and populate our graph by following the previously determined schema. Lastly, we saw some sample Cypher queries which helped us derive insight from our vast dataset.
What's Next?
The possibilities are endless! If you enjoyed this tutorial, then you can try to derive data analysis insights using another dataset of your choice.
While we chose to construct a rather simple graph, you can make it much more complex and detailed. In addition to this, you can also explore and experiment with the various
APOC user defined procedures on the graph.
References
Want to learn more about what you can do with graph databases? Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of connected data.
Download My Free Copy
The post Getting Started with Data Analysis using Neo4j [Community Post] appeared first on Neo4j Graph Database Platform.