
We’ve been building Cypher for Apache Spark for over a year now and have donated it to the openCypher project under an Apache 2.0 license, allowing for external contributors to join at this early juncture. Find the current language toolkit on GitHub here.
Making Cypher More Accessible to Data Scientists
Cypher for Apache Spark will allow big data analysts and data scientists to incorporate graph querying into their workflows, making it easier to leverage graph algorithms and dramatically broadening how they reveal data connections.
Up until now, the full power of graph pattern matching has been unavailable to data scientists using Spark or for data wrangling pipelines. Now, with Cypher for Apache Spark, data scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly.
As graph-powered applications and analytic projects gain success, big data teams are looking to connect more of their data and personnel into this work. This is happening at places like eBay for recommendations via conversational commerce, Telia for smart home, and Comcast for smart home content recommendations.
CAPS: A Closer Look
Follow the openCypher blog and read the latest post for the full technical details of Cypher on Apache Spark (CAPS).
Cypher for Apache Spark enables the execution of Cypher queries on property graphs stored in an Apache Spark cluster in the same way that SparkSQL allows for the querying of tabular data. The system provides both the ability to run Cypher queries as well as a more programmatic API for working with graphs inspired by the API of Apache Spark.
CAPS is the first implementation of Cypher with support for working with multiple named graphs and query composition. Cypher queries in CAPS can access multiple graphs, dynamically construct new graphs, and return such graphs as part of the query result.
Furthermore, both the tabular and graph results of a Cypher query may be passed on as input to a follow-up query. This enables complex data processing pipelines across multiple heterogeneous data sources to be constructed incrementally.
CAPS provides an extensible API for integrating additional data sources for loading and storing graphs. Initially, CAPS will support loading graphs from HDFS (CVS, Parquet), the file system, session local storage, and via the Bolt protocol (i.e., from Neo4j). In the future, we plan to integrate further technologies at both the data source level as well as the API level.
Cypher for Apache Spark also is the first open source implementation of Cypher in a distributed memory / big data environment outside of academia. Property graphs are represented as a set of scan tables that each correspond to all nodes with a certain label or all relationships with a certain type.
Conclusion
True to our open source roots, CAPS is the first release of a full Cypher system within the ecosystem of the openCypher project, making it available for re-use, modification and extension by the wider community. This is an early alpha release, and we will help further develop and refine CAPS until the first public release of 1.0 next year.
Until then, we look forward to your feedback and contributions. The data industry is recognizing the true power of graph technology, and we’re happy to be building the de facto graph query language alongside our amazing community.
New to the world of graph technology?
Click below to get your free copy the O’Reilly Graph Databases book and discover how to harness the power of graph database technology.
Get My Free Book
Click below to get your free copy the O’Reilly Graph Databases book and discover how to harness the power of graph database technology.
Get My Free Book
The post Cypher – the SQL for Graphs – Is Now Available for Apache Spark appeared first on Neo4j Graph Database.