The Windows Azure cloud platform is a solid, highly available and scalable environment, but like any system (on-premise, or in the cloud) there are risks which threaten the desired operation of your application.
With Windows Azure comes the opportunity to vastly lower your infrastructure costs, fluidly manage your system architecture and switch on and pay for extra capacity only when you need it. A successful implementation of your application on Windows Azure will present your developers with a scalable set of fault-tolerant globally distributed services that will enhance productivity, increase reliability and empower your business to react quicker and more cheaply to changing needs than ever before.
Over the past 18 months in my role at Microsoft, I’ve met countless customers who are considering, or are in the process of, a move to Windows Azure. Almost all of them feel that they can just take their existing software and put it in the cloud, and in many cases there’s nothing stopping you from doing exactly that. Unfortunately, though, there are a few myths I have to dispel for you:
- Myth #1: The cloud is invincible.
- Myth #2: I just write the code – it’s then fire, and forget.
- Myth #3: Any decent cloud platform worth its weight will deal with failure for me.
We’ll explore why these are myths later in the article, but for now, I’d like you to ask yourself why it is you are moving (or have moved) to Windows Azure: was it the cost savings? Many people start their move to Azure based on a conversation focused on cost. In my personal experience, eight in every ten customers who move to Windows Azure and make only the minimum required code changes to have it operate in Azure (where changes are needed) are missing a trick, and within six to twelve months, engage in additional development work to take advantage of other services on offer. I think of it as somewhat like buying an expensive sports car, and only ever using the first two gears and driving at 30 MPH. And who does that?
To sweeten the deal, I’m going to share my number one tip with you for getting the most out of your venture to the Windows Azure platform. In fact, it’s such a golden tip, that it doesn’t matter where you use it, it will yield value. Cloud or on-premise, Windows Azure or elsewhere. Embed it within your development practices and watch your team build the most reliable software you’ve ever seen.
Myth #1: The cloud is invincible.
Allow me to present the notion that the most robust systems are those that are hardened against risk, in much the same way sensitive equipment is often shielded from harmful interference. It may be that your application – under normal operating conditions – will never suffer from interference or data corruption. After all, you employ smart developers and pair program.
But what happens when a hard drive fails? Or a network call to a remote database fails?
We tend to think of these things as ‘exceptions’, and most of the error conditions we refer to in this article will surface as an exception in your code. But on Windows Azure, as on any other cloud platform, you have to keep in mind that economies of scale come into effect: your code and your database sit on a few drives among hundreds of thousands. Networks are (securely) shared with others, and it is impossible to guarantee that every electron that travels across the wire will arrive in sequence, or even at all.
Myth #2: I just write the code. It’s fire and forget.
It helps to think of Windows Azure as a collection of discrete services (building blocks) that cooperate to provide, at the base level, high scale, high performance and highly-available geo-redundant network connectivity: virtual machines and persistent storage in various forms glued together with an extremely efficient, intelligent and almost invincible routing layer that abstracts developers away from the complexities of all of this, and exposes everything through a set of familiar, managed APIs.
As developers, when we target Windows Azure, we’re targeting a rich set of capabilities that were probably not part of our application’s original design specifications. Triple-redundant and geo-redundant storage, anyone? An elastic load-balancer? It’s true that many of the more basic building blocks of Windows Azure (including storage, load balancing, and even the virtual machine capabilities) are available to you, often without any requirement to modify your application. Still, it is useful to understand that there is a rich ecosystem of additional services which you can and should leverage, not only to offer enhanced functionality within your application, but also to help the cloud platform keep your app running: to toughen it against risk.
Myth #3: Any decent cloud platform worth its weight will deal with failure for me.
It is incredibly rewarding, professionally and personally, to watch an application you designed support its users successfully and do its job right every day, for years. Let me just say that I think it can be even more rewarding (not just professionally or personally, but to your business stakeholders) to watch it do that no matter what is happening on the underlying platform. Imagine a platform on which your application can often completely mask from your end-users many failures that would be catastrophic in an on-premise datacenter, by intelligently deploying pre-programmed mitigations to forecast risks. Wouldn’t it be great if you had a guide to help you figure out what those are?
Simply by adopting Windows Azure, we get many of the base-level risks (a disk failure, a virtual machine failure, a switch hardware failure, etc.) mitigated for us, essentially for free. But there are other things we need to think about, too. For example, many of our risks will be mitigated within the boundaries of a single datacenter. But what if that datacenter were to suffer a catastrophic failure? An Act of God? Or what if we simply wanted to bypass it because routing connectivity to it was slow?
These are things the cloud platform should not provide for us unless we tell it how. Each mitigation will have some kind of implication, whether it is financial or in terms of the functionality you are able to provide during the failure condition. Yet, many customers I have worked with have a somewhat romantic notion that “the cloud should just do it all for me!”
It’s all about risk.
In my experience of working with all of these customers, these questions fundamentally boil down to a conversation about risk. Specifically, it is about understanding what the risks are, which are mitigated for you, and which you have to think about and deal with yourself. Once we think in these terms, and accept that we as developers have to share this responsibility in order to cash in on the ‘promise’ of an invincible cloud platform, that’s when the platform magic actually happens.
So, here’s the first golden tip for you:
Hardware fails. Networks fail. Memory fails. Your code needs to be hardened against as many of these risks as possible, and you need a mitigation strategy for all of them. Windows Azure will make it very easy for you to detect many of these risks (even the ‘catastrophic’ ones) and prevent them from bringing your business to its knees.
So, design for failure.
Identifying the risks and understanding & categorising the effects
“Risk is the potential that a chosen action or activity (including the choice of inaction) will lead to a loss (an undesirable outcome). The notion implies that a choice having an influence on the outcome exists (or existed). Potential losses themselves may also be called risks”.1
This definition hints at two things: that there is the potential for risk in any situation, and that the outcome of any given situation may be influenced in some way (otherwise, it is a certainty) so as to lessen its effect or prevent it from being noticeable. In this section, we will identify what the risks are and what the effect of each risk manifesting itself is.
Integral to your deployment on Windows Azure should be an understanding of:
- What the risks are;
- What steps can be taken to mitigate the effect of the risk surfacing;
- What category the effect falls within.
For example, when you buy a car, you know that there is a risk that it might get damaged, either by you (racing around again!), or by another road user. Assuming you’re a law abiding citizen, you’ll buy insurance to mitigate against the risk of damage to your car, or somebody else’s. But within your policy document will be a list of expectations around what happens when your car is damaged: you’ll be told how long your car will be unavailable, whether you’ll have the use of a rental car, etc.
It is the same for deployments on Windows Azure, except this time we’re not talking about the effects to your car, rather the effect to your business caused by risk actually becoming a reality (or, ‘surfacing’).
I’ve often found that the effect of a risk (what happens to your app once the risk has manifested) can generally be categorised according to the following scale (in order of descending severity):
- Catastrophic: there is nothing that can be done to mitigate the effect to normal operation;
- Fault: with careful planning and development work, a suitable mitigation can be automatically implemented to prevent the effect from surfacing;
- Avoidable: the effect can be avoided with a trivial amount of effort.
In this discussion, I’m assuming that the primary risk we’re attempting to mitigate is downtime caused by loss of connectivity to the data centre. In my example deployment, we’re talking about a simple web application with two web roles, two worker roles and a dependency on a database on SQL Azure. If we dig further, our full risk register may look similar to the following:
|Risk|Effect|Category|
|---|---|---|
|Instance taken offline for patching/maintenance, where only one instance of that role is deployed|Your app goes offline.|Catastrophic|
|Instance taken offline for patching/maintenance, where two or more instances of the role are deployed|Potential for increased load on remaining instances, but otherwise no disruption to service.|Avoidable|
|Instance (in a multi-instance deployment) goes offline due to failure of the instance itself|As above.|Avoidable|
|Connectivity failure to a dependent resource in the data centre|The resource is unavailable for the duration of the disruption to connectivity.|Fault|
|Failure of the dependent resource|Potential for data loss. The resource is unavailable until it is recovered either manually or automatically.|Fault|
|Total loss of inbound and/or outbound connectivity to the data centre|Your app goes offline.|Catastrophic|
|Catastrophic loss of the data centre|Your app goes offline.|Catastrophic|
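A register like this can also be kept as data alongside the code, so that it can be sorted, filtered and reviewed as the deployment changes. The following is only a minimal sketch: the names `Category`, `Risk`, `REGISTER`, `by_severity` and `needs_engineering` are illustrative assumptions, and the rows are abbreviated from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    """Effect categories from the scale above; a higher value is more severe."""
    AVOIDABLE = 1
    FAULT = 2
    CATASTROPHIC = 3


@dataclass(frozen=True)
class Risk:
    description: str
    effect: str
    category: Category


# A few rows from the register above, abbreviated for illustration.
REGISTER = [
    Risk("Single-instance role taken offline for patching",
         "App goes offline", Category.CATASTROPHIC),
    Risk("Multi-instance role loses one instance",
         "Increased load on remaining instances", Category.AVOIDABLE),
    Risk("Connectivity failure to a dependent resource",
         "Resource unavailable during the disruption", Category.FAULT),
    Risk("Catastrophic loss of the data centre",
         "App goes offline", Category.CATASTROPHIC),
]


def by_severity(register):
    """Return the risks ordered most severe first (stable within a category)."""
    return sorted(register, key=lambda r: r.category, reverse=True)


def needs_engineering(register):
    """Return the 'Fault' risks: those mitigable with planned development work."""
    return [r for r in register if r.category is Category.FAULT]
```

Sorting by severity gives the business conversation a natural agenda, while the ‘Fault’ subset is the list of items where development effort can actually change the outcome.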
Only once both your technical team and your business leaders are aware of the risks, their manifested effects and what can technically be achieved to mitigate them, can a discussion take place about the extent to which you wish to implement these measures. Try to avoid the tendency to shoot for 100% availability across 100% of your dependent resources, and remember that different parts of an app can often tolerate different failures differently! Understand that risks have a field of impact, too. For example, a catastrophic data centre failure would affect the whole of your app, whereas the failure of a database would impact only those sections which require connectivity to it.
Crucial to this is an open and honest conversation with the business, and with your customers, about what level of risk is acceptable to them. This will determine how much effort goes into your risk avoidance strategy.
On Windows Azure, one significant advantage is that the cost of maintaining a highly available, highly scalable solution that is both maintained and secure is generally orders of magnitude cheaper than the equivalent private, on-premise set up. The last thing you’d want to do is erode that saving by planning and deploying avoidance techniques that are completely over the top: so be reasonable with your understanding of acceptable risk.
This exercise may seem academic and fairly obvious but it is often overlooked for that reason. Without it, though, it is difficult to fully appreciate what steps are necessary, and to inform your UX designers properly about the types of scenarios that could naturally occur that you may well need to surface in your app to let your users know.
We’ve covered risk, now let’s turn our attention to what we need to do should the worst happen: a risk has manifested and the effect has begun.
It’s a common misconception that disaster recovery and risk mitigation are the same things.
‘Disaster recovery’ refers to the things you do (either automatically or as part of a manual activity) that restore you to your normal scenario; for example, something exceptional occurred and you have suffered a catastrophic event and need to get back to ‘business as usual’ as fast as possible, while minimising loss. Risk mitigation, on the other hand, is about the things you can do before a condition occurs that triggers your failure scenario.
So that you can do this effectively, you need to first understand what risk has surfaced, what your recovery options are for that particular risk, and therefore what your recovery strategy and objectives actually are.
Let’s put this into context:
Your app went offline due to a failure of a database connection. The effect was that users of your app could no longer publish new content. There are potentially two recovery options available to you here: you could write new content to a separate store temporarily and automatically update the failed database when it becomes available, or you could simply wait until the failed database is available again. Your strategy for recovery from this particular risk is therefore directly dependent on what your business expects you to be able to achieve in this scenario.
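The first of those two options can be sketched roughly as follows. This is an illustration under stated assumptions rather than a recommended implementation: `InMemoryDatabase`, `DatabaseUnavailable` and `BufferedPublisher` are hypothetical names, the database is a stand-in, and a real temporary store would need to be durable rather than an in-memory queue.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) primary store when it cannot be reached."""


class InMemoryDatabase:
    """Stand-in for the real database; flip `available` to simulate an outage."""

    def __init__(self):
        self.available = True
        self.rows = []

    def insert(self, item):
        if not self.available:
            raise DatabaseUnavailable("primary database is offline")
        self.rows.append(item)


class BufferedPublisher:
    """Writes to the primary store, parking items locally while it is down."""

    def __init__(self, primary):
        self.primary = primary
        self.pending = deque()  # assumption: a real system would use durable storage

    def publish(self, item):
        # Preserve ordering: if anything is already parked, park this item too.
        if self.pending:
            self.pending.append(item)
            return "buffered"
        try:
            self.primary.insert(item)
            return "stored"
        except DatabaseUnavailable:
            self.pending.append(item)
            return "buffered"

    def replay(self):
        """Push parked items to the primary in order; stop at the first failure."""
        replayed = 0
        while self.pending:
            try:
                self.primary.insert(self.pending[0])
            except DatabaseUnavailable:
                break
            self.pending.popleft()
            replayed += 1
        return replayed
```

In this sketch `replay()` would be called periodically, or whenever the database is detected to be back. Users can keep publishing throughout the outage, at the cost of new content not being visible from the database until the replay completes, which is exactly the kind of trade-off the business needs to agree to.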
Putting it all together…
I’ve introduced the notion that risks are no less likely to occur on the Windows Azure platform than on-premise, and we know that Azure is capable of recovering from most of these risks without any input from you. What we’re trying to look at here is what steps you can take as developers to stop any non-catastrophic effects from impacting your app, causing a ‘failure scenario’. If you embrace the concept of expecting failure, it becomes quite easy to see what you must do in order to maintain normal operation during a failure situation. In general, remember you can:
- Use alternative persistent storage should a database become unavailable, and re-synchronise when it is available again;
- Continue retrying a failed connection until it succeeds, or ‘defer failure’ until after a certain number of retries.
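The second of these, retrying and deferring failure, might look something like the sketch below. `TransientError` and `call_with_retry` are hypothetical names, and the attempt count and delays are arbitrary illustrative values, not platform recommendations.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying on TransientError with exponential backoff.

    The failure is 'deferred': it only reaches the caller once every
    attempt has been used up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # Wait longer after each failed attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Only errors explicitly marked as transient are retried here; anything else propagates immediately, since retrying a genuine bug merely delays the bad news. The `sleep` parameter exists so that the waiting can be replaced in tests.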
When designing for high availability, it is a good idea to keep these questions in mind:
- Prevention: what can you do to stop the risks you’ve anticipated from occurring?
- Detection: how will you detect that your app is no longer in its ‘normal state’?
- Recovery: what can your app do to either temporarily mask the failure condition and maintain the appearance that everything is OK, or what steps must take place to get things back to normal operation?
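As one illustration of the detection question, an app can probe its own dependencies and treat a run of failed probes as having left its ‘normal state’. This is a minimal sketch: `HealthMonitor`, the state names and the threshold of three consecutive failures are all assumptions made for the example.

```python
class HealthMonitor:
    """Tracks probe results and reports when the app has left its normal state."""

    def __init__(self, probe, failure_threshold=3):
        self.probe = probe  # callable returning True when the dependency is healthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self):
        """Run one probe and return the current state: 'normal' or 'degraded'."""
        try:
            healthy = bool(self.probe())
        except Exception:
            healthy = False  # a probe that raises counts as a failed probe
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.state

    @property
    def state(self):
        if self.consecutive_failures >= self.failure_threshold:
            return "degraded"
        return "normal"
```

Requiring several consecutive failures avoids reacting to a single dropped request, and a successful probe returns the monitor to ‘normal’ straight away. What the app does on ‘degraded’ is the recovery question.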
Do not rely on the availability guarantee: it isn’t enough (a 100% up-time guarantee wouldn’t be, either), and remember that availability is only one part of the equation. If we go back to the car insurance metaphor, you don’t just buy car insurance to mitigate against the risk of injury or damage to yourself or to others: you also drive safely and obey traffic rules. So what really matters is adopting a philosophy and taking a series of actions.
In summary, Windows Azure is and will remain a highly available, stable and reliable cloud platform and it will continue to be enhanced and improved over time. As developers though, we have to appreciate that failures of course can, and do, occur. Every object is subject to entropy, and hard disks, network cables and switches are no exception. Understanding that there are parts of the availability equation that you can – and should – take responsibility for is essential to a healthy cloud deployment and arguably, even if your app is deployed on-premise, you might want to consider adopting ‘cloud risk principles’, too!
My point, ultimately, is that risks aren’t the problem. The problem is not knowing what they are, what the cloud platform is responsible for mitigating, and how you can efficiently deploy platform services to assist you.
Microsoft’s Premier Support for Developers team is able to provide your developers with specific, technical and process guidance to help you mitigate risks to your business as you move to Windows Azure and short cut your time-to-market.