Tuesday, December 20, 2011

Pregel: a system for large-scale graph processing


For large-scale graph processing one way to go is, of course, to use Hadoop and code the graph algorithm as a series of chained MapReduce invocations. MapReduce, however, is essentially a stateless functional model, so using MapReduce requires passing the entire state of the graph from one stage to the next, which is inefficient (as I alluded to at the end of this summary).

Google Pregel provides a simple, straightforward solution to large-scale graph processing problems. The Pregel solution is to use round-based (superstep-based) synchronized computation at the vertices, supported with message passing between the rounds. Pregel keeps vertices and edges on the machines that perform the computation, and uses network transfers only for messages. This way Pregel avoids the communication overhead and programming complexity incurred by chained MapReduce iterations.

Model
In Pregel, in each iteration (superstep), a vertex can receive messages sent to it in the previous iteration, send messages to other vertices, modify its own state and its outgoing edges' states, and mutate the graph's topology. This synchronized superstep model is inspired by Valiant's Bulk Synchronous Parallel model. More specifically, at a superstep S, each active vertex V (in parallel and in isolation) reads the messages sent to it in superstep S-1, sends messages to other vertices that will be received at superstep S+1, and modifies the state of V and its outgoing edges.
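To make the model concrete, here is a minimal sketch of what a vertex program might look like under this model, computing PageRank. The Python API here (the compute signature, the send_message callback, the active flag) is my own hypothetical rendering, loosely echoing the paper's C++ Vertex class, not Pregel's actual interface.

```python
# Minimal sketch of a Pregel-style vertex program (hypothetical Python API,
# loosely modeled on the paper's C++ Vertex class). Computes PageRank.

class PageRankVertex:
    def __init__(self, vertex_id, out_edges, num_vertices):
        self.id = vertex_id
        self.out_edges = out_edges          # list of destination vertex ids
        self.num_vertices = num_vertices
        self.value = 1.0 / num_vertices     # current PageRank estimate
        self.active = True

    def compute(self, superstep, messages, send_message):
        # Read messages sent to this vertex in superstep S-1.
        if superstep > 0:
            incoming = sum(messages)
            self.value = 0.15 / self.num_vertices + 0.85 * incoming

        if superstep < 30:
            # Send my PageRank share along outgoing edges; these messages
            # will be delivered at superstep S+1.
            if self.out_edges:
                share = self.value / len(self.out_edges)
                for dest in self.out_edges:
                    send_message(dest, share)
        else:
            # Vote to halt; the vertex stays inactive unless a message arrives.
            self.active = False
```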

Messages are accumulated and sent in batch mode along outgoing edges, but a message may be sent to any vertex whose identifier is known. For example, the graph could be a clique with well-known vertex identifiers V1 through VN, in which case there may be no need to even keep explicit edges in the graph. (This way Pregel reduces to a distributed message-passing programming system with N nodes.) A vertex can inspect and modify the values of its out-edges. Conflicts can arise due to concurrent vertex add/remove requests; these are relegated to user-defined handlers for resolution. This introduces significant complexity and could become a source of programmer errors.
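A hedged sketch of what such a user-defined handler might look like (the handler interface below is hypothetical, not the paper's API; the point is only that the programmer must pick a deterministic resolution policy):

```python
# Hypothetical sketch of user-defined conflict handlers for graph mutations.
# If several vertices concurrently request creation of the same vertex with
# different initial values, the handler must pick a single deterministic outcome.

def resolve_vertex_add_conflict(vertex_id, requested_values):
    """Called when a vertex id receives multiple add requests in one superstep."""
    # One simple deterministic policy: keep the maximum requested value.
    return max(requested_values)

def resolve_edge_add_conflict(src, dest, requested_weights):
    # Another policy: merge duplicate edge-add requests by summing weights.
    return sum(requested_weights)
```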

Implementation
It seems that the default Pregel partitioning is not locality-preserving. This was surprising to me, as it could cause excessive communication across nodes and lead to inefficiency/waste. From the paper: "The default partitioning function is just hash(ID) mod N, where N is the number of partitions, but users can replace it. The assignment of vertices to worker machines is the main place where distribution is not transparent in Pregel. Some applications work well with the default assignment, but some benefit from defining custom assignment functions to better exploit locality inherent in the graph. For example, a typical heuristic employed for the Web graph is to colocate vertices representing pages of the same site."
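In code, the default assignment and a locality-exploiting override might look roughly like this. The hash(ID) mod N rule is quoted from the paper; the partitioner hook and the URL-parsing details are my own illustration (a real system would also use a stable hash rather than Python's randomized one).

```python
# Default Pregel-style partitioning: hash(ID) mod N (per the paper).
def default_partition(vertex_id, num_partitions):
    return hash(vertex_id) % num_partitions

# A custom, locality-exploiting assignment for a Web graph: colocate pages of
# the same site by hashing only the host part of the URL (the heuristic the
# paper mentions). Illustrative only.
from urllib.parse import urlparse

def site_partition(url_vertex_id, num_partitions):
    host = urlparse(url_vertex_id).netloc
    return hash(host) % num_partitions
```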

The user program begins executing on a cluster of machines over the partitioned graph data. One of the machines acts as the master and coordinates worker activity. "The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. ... Each worker is also given the complete set of assignments for all workers [so that the worker knows which other worker to enqueue messages for its outgoing edges]." Fault-tolerance is achieved by checkpointing and replaying on machine failure. Note that if you write a self-stabilizing graph algorithm, then you can disable fault-tolerance and finish early.
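To fix ideas, here is a toy, single-process sketch of the superstep loop with periodic checkpointing. All names are mine (it reuses the hypothetical vertex API sketched above), not Pregel's actual master/worker protocol; in the real system the loop is distributed across workers and the master only coordinates.

```python
import pickle
from collections import defaultdict

def run_supersteps(vertices, max_supersteps=50, checkpoint_every=5):
    """Toy, single-process sketch of the superstep loop (illustrative only)."""
    inbox = defaultdict(list)                      # vertex id -> pending messages
    for superstep in range(max_supersteps):
        if superstep % checkpoint_every == 0:
            # Checkpoint vertex state and pending messages; after a machine
            # failure the system reloads the latest checkpoint and replays.
            with open(f"checkpoint_{superstep}.pkl", "wb") as f:
                pickle.dump((vertices, dict(inbox)), f)

        outbox = defaultdict(list)
        for v in vertices:
            msgs = inbox[v.id]
            if v.active or msgs:                   # an incoming message reactivates a vertex
                v.compute(superstep, msgs,
                          send_message=lambda dest, m: outbox[dest].append(m))
        inbox = outbox
        if not any(v.active for v in vertices) and not inbox:
            break                                  # every vertex voted to halt
```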

Discussion
The key to the scalability of Pregel is batch messaging. The message-passing model allows Pregel to amortize latency by delivering messages asynchronously in batches between supersteps. Pregel is said to scale to billions of vertices and edges, but I am not sure what this means. For some graphs, I reckon superhubs would limit scalability significantly. It is not clear whether Pregel has mechanisms/optimizations to handle superhubs in such graphs.
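A rough sketch of what sender-side batching could look like: messages destined for the same worker are buffered and shipped in bulk rather than one network round-trip at a time. The buffering threshold and transport interface below are my assumptions, not Pregel internals.

```python
# Sketch of sender-side message batching (illustrative, not Pregel's code).
from collections import defaultdict

class BatchingSender:
    def __init__(self, partition_of, transport, flush_threshold=4096):
        self.partition_of = partition_of      # vertex id -> worker id
        self.transport = transport            # callable: (worker_id, [msgs]) -> None
        self.flush_threshold = flush_threshold
        self.buffers = defaultdict(list)

    def send(self, dest_vertex, message):
        worker = self.partition_of(dest_vertex)
        self.buffers[worker].append((dest_vertex, message))
        if len(self.buffers[worker]) >= self.flush_threshold:
            self.flush(worker)

    def flush(self, worker):
        self.transport(worker, self.buffers.pop(worker, []))

    def flush_all(self):                      # called at the end of a superstep
        for worker in list(self.buffers):
            self.flush(worker)
```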

Another question that comes to my mind is how much of the work that is currently done by Hadoop can (and should) be moved to Pregel. I guess for any job where the data can be easily/naturally modeled as a graph (PageRank, social graph analysis, network analysis), Pregel is applicable and may be preferable to Hadoop. In particular, the ability to modify vertices/edges on the fly makes Pregel flexible enough to accommodate a rich class of applications.

A major downside of Pregel is that it offloads a lot of responsibility to the programmer. The programmer has to develop code for this decentralized vertex-centric model with round-based messaging. The model leads to some race conditions, as discussed above, and those conflicts are also left to the programmer to deal with.

I am working on a Maestro architecture that can alleviate these problems. (I plan to write about Maestro here soon.) Maestro accepts as input a centralized program and takes care of decentralization and synchronization/locking of shared variables in an efficient manner. Maestro also uses a master for coordinating workers (unsurprisingly). But the master has more responsibility in Maestro; it is involved in synchronizing access to shared variables. (Recall that there are no shared variables in Pregel, so the master does not get involved in synchronization and locking.) In return, Maestro relieves the programmer from writing decentralized code and handlers for data races among vertices.

Pregel already has an open-source cloud implementation (Golden Orb). My next plan is to modify Golden Orb to see whether we can quickly develop a cloud implementation for Maestro.

Monday, December 19, 2011

Reverse-scooping


"scooping (v): publish a news story before (a rival reporter, newspaper, or radio or television station)." Scooping is what happens when another team beats you to publishing results on a problem. Getting scooped is very frustrating and is dreaded by many PhD students. I heard stories about poor Math PhD students who worked on a problem for years only to discover that they got scooped by a couple months.

OK, then what is reverse-scooping? It is a term I coined last year. (Of course, the irony is that after a Google search I discovered that I was almost reverse-scooping someone else ;-). In reverse-scooping, you solve a problem and publish it first. Then, several months later, another team (generally from a better-known university) solves the same problem and publishes it at a more visible venue. They get all the credit, their work gets cited a lot, and it is as if your work doesn't exist! Congratulations, you got reverse-scooped.

I have been reverse-scooped more than a couple of times (obviously I am not going to name publications here). There is no clear definition of same or similar work, so there is no clear definition of how many times you got reverse-scooped. But it happens, and you know it when it happens to you. Reverse-scooping is often something that could have been easily avoided by doing a simple Google search for the key terms (or even the title) of the paper. Could it be a simple omission that the offending authors failed to do a Google search and missed your work? That is hard to believe (but I am often generous enough to give the benefit of the doubt). Sometimes reverse-scooping is more insidious: the offending paper cites your paper while seriously downplaying your results. The result is the same: your work will not get the credit and will not be cited further by papers citing this new paper.

Getting reverse-scooped is at least as frustrating as getting scooped. The first few times it happened to me I was very angry. But now I have come to find fault with myself when I get reverse-scooped. I take getting reverse-scooped to mean that I should have polished and published that work at the best venue I possibly could. Maybe I should have publicized the idea better and elaborated on it further to make my contributions crystal clear. Reverse-scooping is ugly, and I am not trying to rationalize it. But I find this new way of thinking more constructive than getting angry and complaining.

Friday, December 16, 2011

Scalable SPARQL Querying of Large RDF Graphs

Due to its excellent price/performance ratio, Hadoop has become the kitchen sink of big data management and analysis. Hadoop's defaults are, of course, not suitable for every kind of data analysis application. For example, using Hadoop for relational data processing incurs a lot of waste/inefficiency. Similarly, for graph data processing Hadoop is very inefficient. This paper by Abadi et al. shows that Hadoop's efficiency on graph data (for a semantic web subgraph matching application) can be improved 1000-fold by fixing the following defaults in Hadoop.

1. Hadoop, by default, hash-partitions data across nodes. For graph processing this is very inefficient. The paper advocates using a locality-preserving partitioning (the METIS partitioner), which maps nearby vertices to the same worker as much as possible.

2. Hadoop, by default, replicates each data block 3 times. "However, ... the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. It is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors." For this, the paper uses a custom triple replication module.
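To illustrate the idea (this is not the paper's triple replication module, and it only captures a 1-hop guarantee), one can compute which vertices sit on partition boundaries and therefore need local copies of their foreign neighbors:

```python
# Illustrative sketch: find the boundary vertices of each partition and
# replicate their missing neighbors locally, so that 1-hop subgraph matching
# needs no network traffic.
def replicate_boundaries(edges, partition_of):
    """edges: iterable of (u, v); partition_of: vertex -> partition id.
    Returns {partition: set of foreign vertices to copy into that partition}."""
    to_replicate = {}
    for u, v in edges:
        pu, pv = partition_of(u), partition_of(v)
        if pu != pv:
            to_replicate.setdefault(pu, set()).add(v)   # copy v next to u
            to_replicate.setdefault(pv, set()).add(u)   # copy u next to v
    return to_replicate
```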

3. Hadoop, by default, uses HDFS and HBase for storing data. These are not optimal for semantic web graph data, which is in RDF form (subject-predicate-object triples). The paper instead uses a dedicated RDF store for the semantic web graph data.

Daniel's blog mentions that each fix contributes a roughly 10-fold improvement in efficiency, which yields a 1000-fold improvement in total. The experimental results are taken using the Lehigh University Benchmark (LUBM) for semantic web querying. For queries that take less than a second to compute on a single machine, the single-machine solution was faster than both Hadoop-default and Hadoop-optimized. Of course, for these fast queries a lookup to another worker requires network communication and incurs a relatively large overhead, so Hadoop-default is at a big disadvantage for fast queries. For slow queries (that take from 10 sec to 1000 sec on a single machine) there were still cases where Hadoop-optimized was 1000 times faster than Hadoop-default.

It would have been nice if the paper included Hadoop-default pseudocode as well as Hadoop-optimized pseudocode; I want to see what (if anything) changed in the code. Here is another noteworthy implementation detail from the paper. The paper had to resort to some customization in vertex partitioning: "To facilitate partitioning a RDF graph by vertex, we remove triples whose predicate is rdf:type (and other similar predicates with meaning "type"). These triples may generate undesirable connections, because if we included these "type" triples, every entity of the same type would be within two hops of each other in the RDF graph (connected to each other through the shared type object). These connections make the graph more complex and reduce the quality of graph partitioning significantly, since the more connected the graph is, the harder it is to partition it."
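That preprocessing step is easy to picture in code (a sketch; the exact set of "type-like" predicates to drop would be application specific):

```python
# Sketch of the preprocessing step: drop "type"-like triples before handing
# the graph to the partitioner, so shared type objects don't connect everything.
TYPE_PREDICATES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",   # rdf:type
    # ... other predicates with "type" meaning, application specific
}

def triples_for_partitioning(triples):
    """triples: iterable of (subject, predicate, object) strings."""
    return [(s, p, o) for (s, p, o) in triples if p not in TYPE_PREDICATES]
```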

So, what are the fundamental contributions in this work? After all, it is now becoming a folk theorem that it is easy to make minor modifications/configurations to Hadoop that yield large performance improvements. Gun Sirer puts it nicely: "if you have a Hadoop job whose performance you're not happy with, and you're unable to speed it up by a factor of 10, there is something wrong with you." The first technique, locality-preserving distribution of graph data over workers, is a fairly obvious idea because it is the most sensible thing to do. The second technique, replicating border vertices, is interesting and promising. However, this technique is inapplicable for graph applications that modify the graph data. The semantic web subgraph matching application did not modify the graph; it only read it. If, instead of subgraph matching, we had considered a graph sub-coloring application (or any application that modifies the graph), the replication would not be valid because it would be very hard to maintain consistency among the replicas of the boundary vertices.

For applications that modify the graph, even after fixing the inefficient defaults to sensible alternatives, there would still be inherent inefficiency/waste in Hadoop due to the functional nature of MapReduce programming.  For such graph-modifying applications, MapReduce is not a good fit as it neccessitates numerous iterations over the graph data and is wasteful. People don't care about this waste, because in batch execution mode this waste is not obvious/visible. Also, in return for this waste Hadoop enables hassle-free scale-out, which makes it acceptable. However, for real-time tight-synchronization-requiring applications this waste becomes clear by way of unacceptable latency and has to be dealt with. Obviously, there are other data processing tools for graphs, such as Google Pregel. The paper plans to compare with Pregel, and I also plan to write a summary of Pregel soon.