Saturday, August 15, 2015

New directions for distributed systems research in cloud computing

This post is a continuation of my earlier post on "a distributed systems research agenda for cloud computing". Here are some directions I think are promising for distributed systems research in cloud computing.

Data-driven/data-aware algorithms

Please check the Facebook and Google software architecture diagrams in these two links: Facebook Software Stack, Google Software Stack. You will notice that the architecture is all about data: almost all components are about either data processing or data storage.

This trend suggests that distributed algorithms need to adapt to the data they operate on in order to improve performance. So we may see machine learning adopted as input/feedback to the algorithms, and the algorithms becoming data-driven and data-aware. (For example, this could be a good way to attack the tail-latency problem discussed here.)
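
As one concrete flavor of a data-aware technique for tail latency, here is a minimal sketch (not from the post; the replica function and the 20 ms hedge delay are made-up placeholders) of a hedged read in the style of the "Tail at Scale" paper: if the first replica has not answered within a delay, a backup request goes to a second replica and the first answer wins.

```python
import time
import random
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def query_replica(replica_id, key):
    """Placeholder for an RPC to a storage replica: usually ~10 ms,
    but 5% of requests are stragglers that take an extra 200 ms."""
    delay = random.expovariate(1 / 0.010)
    if random.random() < 0.05:
        delay += 0.200
    time.sleep(delay)
    return f"value({key})@{replica_id}"

def hedged_read(replicas, key, hedge_after=0.020):
    """Send the read to one replica; if no answer arrives within hedge_after
    seconds (e.g., the observed 95th-percentile latency), send a backup
    request to a second replica and return whichever answer comes first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(query_replica, replicas[0], key)
        done, _ = wait([primary], timeout=hedge_after)
        if not done:  # primary is lagging: hedge with a backup request
            backup = pool.submit(query_replica, replicas[1], key)
            done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

print(hedged_read(["replica-A", "replica-B"], key="x"))
```

A truly data-driven version would keep per-replica latency statistics and derive both the target replica and the hedge delay from them, which is where the machine-learning feedback loop could come in.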

Similarly, driven by demand from large-scale cloud computing services, we may see power management, energy efficiency, and electricity-cost efficiency become requirements for distributed algorithms. Big players already partition data as hot, warm, and cold, and employ tricks to reduce power. We may see algorithms becoming more aware of this.
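
As a toy illustration of the kind of data-awareness this implies, here is a sketch (the thresholds and tier names are made up for illustration) that classifies keys into hot/warm/cold tiers from observed access counts, so cold data can be parked on storage that stays powered down.

```python
from collections import Counter

# Illustrative thresholds (accesses per day); a real system would derive
# these from measured access distributions and per-tier power profiles.
HOT, WARM = 100, 1

class AccessTracker:
    """Count accesses per key and suggest a storage tier for each key."""
    def __init__(self):
        self.daily_accesses = Counter()

    def record_access(self, key):
        self.daily_accesses[key] += 1

    def tier(self, key):
        n = self.daily_accesses[key]
        if n >= HOT:
            return "hot"    # in-memory / SSD on always-on servers
        if n >= WARM:
            return "warm"   # spinning disks
        return "cold"       # archival storage that can stay powered down

tracker = AccessTracker()
for _ in range(150):
    tracker.record_access("profile:alice")
tracker.record_access("photo:2009-vacation")
print(tracker.tier("profile:alice"), tracker.tier("photo:2009-vacation"),
      tracker.tier("old-log:2007"))   # hot warm cold
```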

Scalable coordination 

Again refer to the Facebook and Google service stacks linked above. The Facebook stack does not have dedicated coordination services, only monitoring tools. (Of course, the data stores employ Paxos to replicate their masters.) The Google stack has some coordination and cluster management tools. These large-scale systems already seem to embrace the principle of operating with as little coordination as possible.

Heeding the advice in the first challenge of my previous post, this suggests that we should look into implicit/diffusing/asynchronous/eventual coordination, such as coordinating by writing to datastores and having other processes read off of them. Pat Helland's article suggested the entity and activity abstractions, which can be useful primitives to get started on implicit/diffusing coordination.
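
To make the idea of coordinating through a datastore concrete, here is a minimal sketch (the Datastore class below is a stand-in for a real store that offers conditional writes, e.g., compare-and-set or put-if-absent): workers claim tasks simply by writing an owner record, and other workers learn of the claim when they read the same key, with no dedicated lock service.

```python
import threading

class Datastore:
    """Stand-in for a shared datastore that offers a conditional-write
    (compare-and-set / put-if-absent) primitive."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def put_if_absent(self, key, value):
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        return self._data.get(key)

def try_claim(store, task_id, worker_id):
    """A worker 'claims' a task by writing an owner record; other workers
    coordinate implicitly by reading the same key."""
    key = f"task/{task_id}/owner"
    if store.put_if_absent(key, worker_id):
        print(f"{worker_id} claimed task {task_id}")
    else:
        print(f"task {task_id} already owned by {store.get(key)}")

store = Datastore()
try_claim(store, 42, "worker-1")   # worker-1 claimed task 42
try_claim(store, 42, "worker-2")   # task 42 already owned by worker-1
```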

Another way to scale coordination is to relax consistency. It is easy to scale consistency, and it is easy to scale availability, but it is hard to scale both at once! Eventual-agreement/convergent-consistency provides a way out of this. There is already a lot of exciting work in this area, and it will receive even more attention. Brewer, in his "CAP twelve years later" article, has given nice clues for pursuing these kinds of systems. We may also see systems that associate uncertainty with the consistency of the current state in order to facilitate conflict recovery and eventual consistency.
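
One well-studied embodiment of convergent consistency is the state-based CRDT. Here is a minimal sketch (illustrative, not tied to any particular system) of a grow-only counter whose merge is commutative, associative, and idempotent, so replicas converge no matter how often or in what order they exchange state.

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot,
    and replicas converge by taking element-wise maximums when they merge."""
    def __init__(self, replica_id, replica_ids):
        self.replica_id = replica_id
        self.counts = {r: 0 for r in replica_ids}

    def increment(self, n=1):
        self.counts[self.replica_id] += n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent: safe under any gossip order.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

a = GCounter("A", ["A", "B"])
b = GCounter("B", ["A", "B"])
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5
```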

Extremely resilient systems

Cloud computing transformed the fault-tolerance landscape. Node failures are not a big deal: thanks to the abundance and replication in cloud systems, the nodes are replaceable. Now complex failure modes and failures caused by distributed state corruption have become the more critical problems. Improving the availability of cloud systems in the face of these unanticipated failure modes is very important; it makes the news if a large cloud service is unavailable for several minutes. In his advice for research on cloud computing, Matt Welsh mentions these two: 1) "Building failure recovery mechanisms that are robust to massive correlated outages." 2) "Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on."

Self-stabilization is a great approach for dealing with unanticipated faults. I am guessing we will see a surge in research on self-stabilizing algorithms to achieve extreme resiliency in the face of unanticipated faults in cloud computing systems. Recovery-oriented computing (ROC), with its resettable systems (crash-only software), is a special case of self-stabilization, and we may see extensions of that work for distributed systems. A critical question here will be "How can we make ROC compose nicely for distributed systems?"

To insure against correlated failures, we may see multiversion programming approaches revisited. This can also be helpful for avoiding spooky/self-organizing synchronization in cloud computing systems.

For scalable fault-tolerance, asynchronous algorithms like self-stabilizing Propagation of Information with Feedback (PIF) algorithms may be adopted in the cloud domain. Furthermore, pheromone/hormone-based algorithms that run in the background in a slow mode can be made extremely scalable by exploiting peer-to-peer random-gossip techniques.
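
For a flavor of how far random gossip can scale, here is a toy push-gossip round (the node count, fanout, and max-merge are illustrative choices): a new value spreads through 64 nodes in roughly a logarithmic number of rounds, with no central coordinator.

```python
import random

def gossip_round(states, fanout=1):
    """One round of push gossip: every node pushes its current value to a few
    randomly chosen peers, and each node keeps the maximum it has seen.
    (Max is the convergent merge here; any idempotent merge would do.)"""
    updates = []
    nodes = list(states)
    for node in nodes:
        for peer in random.sample([n for n in nodes if n != node], fanout):
            updates.append((peer, states[node]))
    for peer, value in updates:
        states[peer] = max(states[peer], value)

# 64 nodes, one of which holds a new value; gossip spreads it to everyone.
states = {i: 0 for i in range(64)}
states[0] = 1
rounds = 0
while any(v == 0 for v in states.values()):
    gossip_round(states)
    rounds += 1
print(f"converged in {rounds} rounds")
```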

New graph-based programming abstractions for the cloud

Good programming abstractions are like good tools: they can boost productivity severalfold. By way of analogy, in wireless sensor networks several interesting programming abstractions were proposed, including treating a neighborhood or area of nodes (instead of a single node) as the unit of programming, stream-based programming (map/join), and Excel-spreadsheet-like high-level programming. These abstractions bring a different perspective, which can be very helpful. There has been work on designing programming abstractions for cloud computing systems, especially for dealing with big data and big graphs. I hope we can see new useful abstractions emerge for programming large-scale distributed cloud services. Since scalability is very important for cloud systems, we may see hierarchical abstractions and logical tree abstractions. We may also see abstractions that capture the call graph of services or the dataflow through services.
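
As one example of the kind of graph abstraction that helps, here is a toy "think like a vertex" sketch in the spirit of Pregel/Giraph (the superstep loop and message delivery are simulated in-process; a real framework would distribute them across machines), computing connected components by label propagation.

```python
def connected_components(neighbors):
    """neighbors: dict mapping each vertex to its (undirected) adjacency list.
    The per-vertex logic is the only 'user code'; a Pregel-style framework
    would own the supersteps and message delivery."""
    label = {v: v for v in neighbors}                      # initial label = own id
    inbox = {v: [label[u] for u in neighbors[v]] for v in neighbors}
    while any(inbox.values()):                             # supersteps until no messages
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:      # vertex compute(): adopt smaller label
                label[v] = min(messages)
                for u in neighbors[v]:                     # tell the neighbors
                    outbox[u].append(label[v])
        inbox = outbox
    return label

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(graph))   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```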

Auditability tools

With very large scale complex distributed systems, observability/auditability becomes very important. We recently presented our proposal on this topic in HotCloud'15. I hope to write a post about this work soon.

Abstract models to kickstart algorithms work

Finally, I hope we will see theoretical abstract modeling that simplifies the cloud computing model (goals, challenges, environment) and kickstarts more algorithms work in the area. As an analogy, Dijkstra's token ring formulation was really transformative and started the self-stabilization field of distributed systems. A useful abstraction hides irrelevant/accidental details, allows work to focus on the inherent, most important parts of the problem, and lets other researchers adopt the same terminology/model, build on each other's work, and improve it.
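
To make the analogy concrete, here is a small simulation sketch of Dijkstra's K-state token ring (the step budget and random scheduler are illustrative choices): started from an arbitrary, possibly corrupted state, the ring converges to exactly one circulating token.

```python
import random

def dijkstra_token_ring(n, steps=500):
    """Dijkstra's K-state self-stabilizing token ring (with K > n).
    Node 0 is privileged ('holds the token') when its value equals its
    predecessor's; every other node is privileged when its value differs
    from its predecessor's. From any initial state, the ring converges
    to exactly one circulating token."""
    K = n + 1
    x = [random.randrange(K) for _ in range(n)]        # arbitrary, possibly corrupted state

    def privileged(i):
        return x[i] == x[i - 1] if i == 0 else x[i] != x[i - 1]

    for _ in range(steps):
        i = random.choice([j for j in range(n) if privileged(j)])  # scheduler's pick
        x[i] = (x[i] + 1) % K if i == 0 else x[i - 1]              # the node's move

    return sum(privileged(j) for j in range(n))        # tokens remaining

print("tokens remaining:", dijkstra_token_ring(8))     # prints 1 after stabilization
```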

Thursday, August 13, 2015

How to go for 10X

I think the 10X term originated from this book. (Correct me if I am wrong. I didn't check this.)

It seems like Larry and Sergey are fans of this concept (so should you be!). Actually, reading this January 2013 piece, you can sense that the Alphabet transition was in the works by then.

10X doesn't just mean go fast, get quick results, and get 10X more done in the same time. If you think about it, that is actually a pretty incremental mode of operation, and that is how you incur technical debt. It means it is just a matter of time before others do the same thing, probably much better and more completely. Trading off quality for time is often not a good deal (at least in the academic research domain).

10X means transformative rather than incremental improvement. Peter Thiel explains this well in his book Zero to One (2014). The main theme of the book is: don't do incremental business, invent a new transformational product/approach. Technology is 0-to-1; globalization is 1-to-n. Most people think the future of the world will be defined by globalization, but the book argues that technology matters more. The book says globalization (copying and incrementalism, as China has been doing) doesn't scale; it is unsustainable. Another way to put that argument is that technology creates more value than globalization.

Below I propose some strategies for achieving 10X, and I also approach 10X from the perspective of how it applies to academic research.

Aim big: Don't go for the incremental, pursue the transformative

10X is a mentality, a frame of mind. The idea is that if you go for a moonshot and fail, you land among the stars. If you go for incremental improvements, you may be obsolete by the time you get there, because the world also moved on. The Silicon Valley motto "fail big, fail fast!" embodies this thinking.

For the research part, Dijkstra captured this thinking well in his advice to a promising researcher, who asked how to select a topic for research: "Do only what only you can do!" Anybody can pick the low hanging fruit.

Use the Pareto principle effectively and you are halfway there

The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes.

On a related point, if you have to eat two frogs, eat the big frog first. Have the courage to confront the big and ugly head-on. That's where the biggest results/benefits/outcomes will come from. There is an entire book on eating the big frog, and this is how the term "eating the frog" originated, if you are curious.

From the academic research perspective, the lesson is: attack the inherent complexity of the problem, not the incidental complexities, which time and improvement in technology will take care of.

Adopt/Invent better tools 

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." -- Abraham Lincoln

Here is a more modern perspective from an XKCD cartoon.

I mean not just better but transformative tools, of course -- remember the first point. Most often, you may need to invent that transformative tool yourself. When you are faced with an inherently worthy problem, don't just go for a point solution; generalize your solution, and ultimately turn it into a tool that benefits future work. Generalizing and constructing the tool/system pays off that technical debt and gets you a truly 10X benefit. MapReduce comes to my mind as a good example of such a tool. By generalizing over the kinds of big-data processing tasks/programs written at Google, Jeff Dean was able to see an underlying pattern and write a good tool to solve the problem once and for all.
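
As a reminder of how small the programmer-facing part of that abstraction is, here is a toy single-process stand-in for MapReduce (the "runtime" below is just a dictionary-based shuffle; the real system adds partitioning, scheduling, and fault tolerance), running the classic word count.

```python
from collections import defaultdict
from itertools import chain

# The programmer supplies only map() and reduce(); the framework owns
# partitioning, shuffling, scheduling, and fault tolerance.
def map_wordcount(document):
    for word in document.split():
        yield (word.lower(), 1)

def reduce_wordcount(word, counts):
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """A single-process stand-in for the MapReduce runtime."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(item) for item in inputs):
        groups[key].append(value)                  # the "shuffle" phase
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = ["the cat sat", "the dog sat"]
print(run_mapreduce(docs, map_wordcount, reduce_wordcount))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```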

Scientists spend decades inventing transformative tools (the Hadron Collider, the Hubble telescope) and then get breakthrough results. As researchers in computer science, we should try to adopt/cultivate this mentality more.

Be agile and use rapid prototyping

Here is a brief informative video about the rapid prototyping idea. 

The point of prototypes is to fail fast, learn, and move on to the next attack. If you have a plan of attack for a worthy problem, sketch it, model it, pursue it to see if it holds water. As the first/easiest step, write down your idea to explain it.
"If you think without writing, you only think you're thinking." -- Leslie Lamport

"Writing is nature's way of letting you know how sloppy your thinking is." -- Guindon.

The next step for prototyping is mathematical modeling (or writing a specification-level program).

"Math is nature's way of letting you know how sloppy your writing is." -- Leslie Lamport

Collaborate

This advice is in the same vein as using better/transformative tools. Collaborate with the best minds on the topic that you can reach. Academics are pretty open to collaboration, especially compared to industry, where there are many obstacles to collaboration. If you have an interesting question, and you demonstrate that you did your homework, you can recruit experts on the topic as collaborators.

Employ meta thinking

Focus on results, but also on processes. If you can't solve a hard problem, spin the problem into a different form and attack that version instead. Or maybe go for proving an impossibility result. Wouldn't that be transformative? (Here is a very recent example.)

Practice deliberately, and see what works

We tend to get more conservative as we get older. So deliberately practice being open-minded. Experiment! We are researchers after all. Collect good practices, and form useful habits. This is a process. Good luck.

Disclaimer: I don't claim to be a 10X engineer or researcher.

Tuesday, August 11, 2015

A distributed systems research agenda for cloud computing

Distributed systems (a.k.a. distributed algorithms) is an old field, almost 40 years old. It gave us impossibility proofs on the theory side, as well as algorithms like Paxos, logical/vector clocks, 2/3-phase commit, leader election, dining philosophers, graph coloring, and spanning tree construction, which are widely adopted in practice. Cloud computing, in contrast, is a relatively new field. It provides new opportunities as well as new challenges for the distributed systems/algorithms area. Below I briefly discuss some of these opportunities and challenges.

Opportunities

Cloud computing provides abundance. Nodes are replaceable, even hot-swappable. You can dedicate several nodes to running customized support services, such as monitoring, logging, and storage/recovery services. These opportunities are likely to have an impact on how fault-tolerance is treated in distributed systems/algorithms work.

Programmatic interfaces and service-oriented architecture are also hallmarks of the cloud computing domain. Similarly, virtualization/containerization facilitates many things, including moving the computation to the data. However, it is unclear how these technologies can be employed to provide substantial benefits for distributed algorithms, which operate at a more abstract plane.

Challenges

Coordination introduces synchronization, which introduces the potential for cascading shutdowns and halts. Especially in a large-scale system of systems, any coordination introduced may lead to spooky/latent/self-organized synchronization. This point is explained nicely in "Towards A Cloud Computing Research Agenda". Thus the challenge is to avoid coordination as much as possible, and still build useful and "consistent" systems.

Another curse of extreme scale in cloud computing is the large fan-in/fan-out of components, a challenge explained nicely in the "Tail at Scale" paper. Large fan-in/fan-out can lead to broadcast storms and incast storms, and algorithms/systems should be designed to avoid these problems.

Finally, another major challenge is geographically distributed services. Due to the speed of light, communication over large distances suffers high latencies, and geo-distribution especially poses consistency-versus-availability challenges due to the CAP theorem.

The current state of distributed systems research on cloud computing

This is not an exhaustive list. From what I can recollect now, here is a crude categorization of current research in distributed systems in the cloud computing domain.


What is next?

If only I knew... I can only speculate; I have some ideas, but those will have to wait for another post.

In the meanwhile, I can point to these two articles which talked about what would be good/worthy research areas in cloud computing.
How can academics do research on cloud computing?
Academic cloud: pitfalls, opportunities

UPDATE: Part 2 of this post (i.e., new directions) is available here.