InterviewSolution
| 1. |
What is Graph Analytics in Spark? |
|
Answer» Graphs are data structures composed of nodes (also called vertices), which are arbitrary objects, and edges, which define relationships between those nodes. Graph analytics is the process of analysing these relationships. An example graph might be your friend group: in the context of graph analytics, each vertex would represent a person, and each edge would represent a relationship between two people. Graphs are a natural way of describing relationships and many different problem sets, and Spark provides several ways of working in this analytics paradigm. Some business use cases are detecting credit card fraud, motif finding, determining the importance of papers in bibliographic networks (i.e., which papers are most referenced), and ranking web pages, as Google famously used the PageRank algorithm to do. Spark has long contained an RDD-based library for graph processing: GraphX. It provides a very low-level interface that is extremely powerful but, just like RDDs, is not easy to use or optimize. GraphX remains a core part of Spark. Companies continue to build production applications on top of it, and it still sees some minor feature development. The GraphX API is well documented simply because it hasn’t changed much since its creation. However, some of the developers of Spark (including some of the original authors of GraphX) have created a next-generation graph analytics library on Spark: GraphFrames. GraphFrames extends GraphX to provide a DataFrame API and support for Spark’s different language bindings, so that Python users can also take advantage of the tool's scalability. |
|
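To make the vertex/edge vocabulary and the PageRank use case concrete, here is a minimal sketch in plain Python. It does not use Spark, GraphX, or GraphFrames; the names (`vertices`, `edges`, `pagerank`), the four people, and the damping factor of 0.85 are assumptions chosen for illustration, and the friend-group data is made up.

```python
from collections import defaultdict

# Hypothetical friend group: each vertex is a person,
# each directed edge (src, dst) means "src follows dst".
vertices = ["alice", "bob", "carol", "dave"]
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
]

# In-degree: how many edges point at each vertex.
in_degree = {v: 0 for v in vertices}
for _src, dst in edges:
    in_degree[dst] += 1


def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Simple iterative PageRank (illustrative, not the Spark implementation)."""
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts each round with the "random jump" share.
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            # A vertex with no outgoing edges spreads its rank over all vertices.
            targets = out_links[v] or vertices
            share = damping * ranks[v] / len(targets)
            for t in targets:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks


ranks = pagerank(vertices, edges)
for person in sorted(ranks, key=ranks.get, reverse=True):
    print(person, round(ranks[person], 4), "in-degree:", in_degree[person])
```

In this toy graph the person with the most incoming edges also ends up with the highest rank. On a real cluster the same idea would be expressed through the graph library's own API rather than Python dictionaries.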