A couple of weeks ago, one of the major storage vendors had two big problems to resolve after one of their arrays suffered a firmware bug-induced failure at one of their cloud (email) service provider customers. They had to:
- Help the customer get back to normal service levels after they had become unacceptable.
- Confront a public relations problem after the failure was exposed by a leading storage publisher.
Meanwhile, their service provider customer had four major problems to resolve:
- Get service levels back to acceptable levels.
- Communicate to their customers what the problem was and how it was being addressed.
- Re-engineer a solution to avoid the same happening again.
- Credit customers for not delivering against SLA terms.
A vendor employee tried to address their public relations problem this way in his blog:
"OK, I'll take the blame for this -- sort of. We pride ourselves in putting a lot of thought into our customer designs. I'd argue that we're really, really good at it as well.
But not everyone is 100% sure of how their application will grow over time -- unfortunately, we're not psychics. And, let's be honest, not everyone necessarily wants to pay for redundancy we like to put into our designs.
We don't always get to directly engage all the time, either -- with products such as the (blanked out), most of this stuff moves through the channel. Somebody calls up one of our partners, says that they want to buy one of our products, and one gets sold -- and a lot of product gets sold that way."
I understand the desire to explain how messes become messy, but I'm not sure why he felt the need to speculate that his company's business partners or their customer's budget were key elements of the problem. That is tantamount to saying, "All of our (blanked out) customers could have the same thing happen to them too." Anybody who has ever been close to one of these meltdowns knows there are many variables involved - including vendors underbidding each other and shaving elements from their bids in order to win the business.
From a distance, it looks like the vendor's response to the customer was good, although there apparently were some issues with failure notification from the array when the event occurred. I wouldn't call these sorts of things "Perfect Storms", but there are unfortunate scenarios where multiple things go awry. All vendors have these sorts of bad days, which serve as painful learning experiences. Unfortunately for customers, it's one of the ways vendors improve their customer support processes.
The customer also wrote in his blog, explaining the situation to their customers:
"Our SAN vendor analyzed the system logs for the event and determined that the service processor failure occurred due to a unique bug in the specific version of firmware on the system. Our vendor performed an emergency upgrade. The newer version of firmware includes a fix for the bug. We are taking additional corrective actions to make certain that there is enough spare capacity on the SAN. This will assure it performs without performance degradation in the event of a single hardware failure."
The remediation sounds reasonable, but it's not what I would call best of breed either. I'll explain why in the remainder of this post.
The old trusted dual controller just can't keep up
The explanation the service provider gave to their customers was only half correct. Yes, the failure in one controller was due to a firmware bug - and yes, all vendors find out about some of them at customer sites - but the inability of the surviving controller to handle the workload was another matter altogether.
The major defect of all dual controller designs for service provider applications is the uselessness of write cache when operating in degraded mode on a single controller.
When a dual controller array has a controller failure, all traffic is failed over to the surviving controller. However, this controller can't afford to place writes in cache because, if it also fails, any unflushed writes in cache would be lost - making the recovery process all the more painful. As a result, the throughput of the controller degrades significantly because writes now take several orders of magnitude longer to process, as each write must be completed at the physical disk level instead of in fast cache memory. When you consider the sort of read/write ratios involved with an email application (heavy writes), it's not surprising to hear that it took 32 hours for the system to get caught up. I suspect that if the surviving controller had been able to use write cache, the customer might have experienced some amount of service level problems, but not nearly as bad as they suffered.
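To get a feel for why write-through mode hurts so much, here is a back-of-the-envelope sketch in Python. Every number in it (the cache acknowledgement time, the disk commit time, the queue depth, the arrival rate, the outage length) is a placeholder assumption of mine, not a measurement from this incident or from any particular array, and the function names are made up for illustration. The point is the shape of the problem: once writes arrive faster than the degraded controller can commit them, a backlog builds, and it only drains at the rate of whatever headroom is left over.

```python
# All numbers below are illustrative assumptions, not measurements.
CACHE_ACK_MS = 0.5    # assumed time to acknowledge a write from mirrored cache
DISK_WRITE_MS = 10.0  # assumed time to commit a write to physical disk
QUEUE_DEPTH = 32      # assumed number of writes in flight at once


def write_iops(service_time_ms, queue_depth=QUEUE_DEPTH):
    """Writes per second sustained when each write takes service_time_ms."""
    return queue_depth * 1000.0 / service_time_ms


def backlog_after(hours, arrival_iops, capacity_iops):
    """Writes left queued after `hours` when arrivals exceed capacity."""
    shortfall = max(0.0, arrival_iops - capacity_iops)
    return shortfall * hours * 3600.0


def catch_up_hours(backlog, arrival_iops, capacity_iops):
    """Hours needed to drain `backlog` while new writes keep arriving."""
    headroom = capacity_iops - arrival_iops
    if headroom <= 0:
        return float("inf")  # the backlog never drains at this load
    return backlog / headroom / 3600.0


if __name__ == "__main__":
    cached = write_iops(CACHE_ACK_MS)
    write_through = write_iops(DISK_WRITE_MS)
    arrival = 5000.0  # assumed steady write load, in writes per second
    queued = backlog_after(4, arrival, write_through)  # assumed 4-hour outage
    print("cached write IOPS:       ", cached)
    print("write-through write IOPS:", write_through)
    print("writes queued in 4 hours:", queued)
    print("catch-up hours, cached:  ", catch_up_hours(queued, arrival, cached))
```

With these placeholder latencies the gap is a factor of 20; plug in your own array's numbers to see how wide it is for you.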
Write performance during array component failures is an important point that many customers give insufficient weight to when making their purchases. Public service providers certainly need to understand this. The exact same scenario - controller failure and subsequent drop in service levels - could certainly happen to a traditional data center customer, but the ramifications of this scenario are not as ugly as they are for a multi-tenant public service provider.
This case is a perfect example of how an older architecture is incapable of meeting the requirements of the new cloud service business model. If you are a cloud service provider reading this and wondering if you might have a similar exposure to a controller failure (including 3PAR customers with dual-controller arrays), my advice is to review what you have and start thinking about what you should expect if you have a controller failure and how you might want to deal with it on both a short-term and long-term basis. Best of breed cloud storage should not include dual controller arrays.
Their solution is to buy more and utilize it less
One of the identified corrective actions is having "enough spare capacity on the SAN", which in this case involves installing a second array. Without knowing the inside scoop, it looks like the idea is to split the workload across the two arrays so that if a controller failure occurs in either array, the performance drop won't be as noticeable. The array that doesn't suffer the failure will keep working as expected and the array that has the failure will only have half the load to deal with.
There are two primary problems with this "fix":
- Performance will still suffer on the array with the controller failure
- The I/O load will continue to increase over time
You are always going to have performance degradation of some sort when you can't use write caching, unless you are only reading data - which isn't the case here. It is flat out wrong to assume that a performance problem will not occur. Regardless, with the new two-array SAN, whichever system has the controller failure should be able to get caught up much faster than the 32 hours this customer had to wait. Of course, the customer's capacity and I/O load will almost certainly increase over time, and as that happens, the strategy of splitting the load between two arrays loses its effectiveness.
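The second problem is easy to put into a few lines of Python. This is only a sketch under my own assumptions: the starting load, the degraded (write-through) capacity of one array, and the 30% annual growth rate are invented for illustration, and `years_until_overload` is a hypothetical helper, not anything from a vendor tool. It estimates how long the split-the-load strategy lasts before each array's share of the workload exceeds what a single surviving controller can handle.

```python
def years_until_overload(total_load_iops, degraded_capacity_iops,
                         arrays=2, annual_growth=0.30, horizon_years=50):
    """Years until one array's share of the load exceeds its degraded capacity.

    Returns 0 if the share is already too large, or None if it stays
    within capacity for the whole horizon.
    """
    per_array = total_load_iops / arrays
    for year in range(horizon_years + 1):
        if per_array > degraded_capacity_iops:
            return year
        per_array *= 1.0 + annual_growth  # load compounds year over year
    return None


if __name__ == "__main__":
    # Assumed: 4000 write IOPS today, 3200 IOPS per array in degraded mode.
    print(years_until_overload(4000.0, 3200.0))
```

The growth rate is the parameter that matters most here, which is why a fix based on spare capacity has a shelf life.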
Along with adding the controllers, they are also certainly adding disk drives, and some notion of what "reasonable" utilization limits should be for them. The problem with limiting utilization as a best practice is that it puts the stamp of approval on inefficiency - not only for capacity utilization but also for the power and cooling required to support all those underutilized drives. Most legacy arrays have built-in inefficiencies in the way data is laid out on disks, making it virtually impossible to achieve uniform utilization across all disk resources. The result is uneven consumption of disk capacity, as well as uneven I/O service levels among different disk groups, which is another variable in how much performance degrades following a controller failure in a dual controller array.
Finally, the customer now has two arrays to manage, including multipath connections, SAN zones, and all other aspects of the configuration, which all contribute down the road to change management complexities. The result is a net drag on administrator effort and an increased TCO.
How many do you need?
A true best of breed solution would address the root-cause deficiency in the array's design, without creating additional management and cost burdens to the customer. Obviously, more than two controllers are needed. But how many controllers does a cloud service provider need in an array? The answer is at least three. Why? Because when a single controller fails, there can still be two surviving controllers working together, mirroring their cache contents, and performing fast writes to cache memory. That said, controllers are usually packaged in pairs for redundancy purposes, which means that the most likely configurations will have four controllers.
If you compare a single quad controller array with two dual controller arrays, there are some key advantages that immediately jump out:
- No or limited loss of performance after a controller failure
- All drives and cache can be used to service all workloads
- Managing a single array significantly reduces cost and complexity
A better recipe for maintaining performance levels
The next question is: "Is there a suitable quad controller array that the customer could have used instead of the two dual controller arrays they have?" Yes, 3PAR's F400 and T400 are both quad controller arrays. The disk drives in these arrays can be either SATA or FC, or a mix of both types if the customer wants to implement tiering. Product information of the F400 can be found here, and the T400 here.
However, simply putting four controllers in an array does not necessarily guarantee that they will be able to sustain write caching if one of them fails. The array must have the ability to remap and re-mirror the write cache contents of all four controllers to the surviving controllers following the loss of a controller. It's an interesting geometric sort of problem: There are four controllers, each with its own cache and cache that is mirrored from the other controllers in the array. All cache contents, including mirrors, need to be distributed evenly across all controllers to avoid congestion and load imbalances. All cache content, including mirrors, needs to be accounted for within the array so that if a controller fails, the other controllers will be able to identify all the surviving original and mirrored copies of data. For cache data that has lost either a primary or mirrored copy, a second (new) copy needs to be made. Finally, the amount of data in cache may need to be re-leveled (decreased) to fit into the degraded cache capacity (3 controllers instead of 4).
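The bookkeeping described above can be sketched in a few lines of Python. To be clear, this is a toy model of my own and not how any shipping array implements it: the `CachePage` record, the `remirror_after_failure` function, and the "put the new copy on the least loaded survivor" rule are all assumptions made for illustration. It also skips the final re-leveling step and simply assumes the surviving cache has room.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CachePage:
    page_id: int
    primary: int  # controller holding the original copy
    mirror: int   # controller holding the mirrored copy (never the primary)


def remirror_after_failure(pages, controllers, failed):
    """Return new placements so every page again has two surviving copies."""
    survivors = [c for c in controllers if c != failed]
    if len(survivors) < 2:
        raise ValueError("mirrored write caching needs two surviving controllers")

    # Count the copies each survivor already holds, so that replacement
    # copies can be spread toward the least loaded controllers.
    load = Counter({c: 0 for c in survivors})
    for page in pages:
        for holder in (page.primary, page.mirror):
            if holder != failed:
                load[holder] += 1

    placements = []
    for page in pages:
        if failed not in (page.primary, page.mirror):
            placements.append(page)  # both copies survived; nothing to do
            continue
        # The surviving copy becomes (or stays) the primary.
        surviving = page.mirror if page.primary == failed else page.primary
        candidates = [c for c in survivors if c != surviving]
        new_mirror = min(candidates, key=lambda c: (load[c], c))
        load[new_mirror] += 1
        placements.append(CachePage(page.page_id, surviving, new_mirror))
    return placements


if __name__ == "__main__":
    pages = [CachePage(i, i % 4, (i + 1) % 4) for i in range(8)]
    for page in remirror_after_failure(pages, [0, 1, 2, 3], failed=3):
        print(page)
```

Note that calling it with only two controllers raises an error, which is the dual controller problem in a nutshell: with one survivor there is nowhere left to put the second copy.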
The software for doing this in a 3PAR array is Persistent Cache. Product information on Persistent Cache is here (PDF).
I made a 9-minute video last year describing how Persistent Cache works. Here it is again. Thanks for watching.
Non-mirrored single storage processor write cache operations can continue when a peer storage processor is unavailable when the system is running a recent version of code.
In this case it wasn't, which was unfortunate.
Posted by: Storagezilla | May 05, 2010 at 06:51 AM
Hi Marc
In the spirit of fairness, there are many choices in the market for multi-controller storage arrays that exhibit the properties you describe.
And I know you know that EMC's V-Max does this well, among others.
Turns out V-Max is getting increasingly popular with service providers for that very reason -- as you might know!
Low cost-to-serve at scale, ability to "turn up" new features and capabilities non-disruptively, trusted brand -- it's a nice package for an SP.
I wouldn't entirely count out dual-controller designs just yet, though. Lots of SP use cases where this sort of approach makes all sorts of sense.
As always, the challenge is matching the right technology to the right use case (and right business model for SPs!), and not uniformly claiming that "A" is always better than "B".
BTW, congrats on 3PAR's results .. it's progress!
-- Chuck
Posted by: Chuck Hollis | May 05, 2010 at 09:03 AM
Chuck, Thanks for the Netapp-ish spam-ment. As for v-Max, you forgot to mention features such as "outdated disk management software" and "cache hogging, weak performing snapshots" and "expensive professional services".
Definitely disagree with you about SPs - at least cloud SPs. You yourself have expounded eloquently on how much different the new world of cloud is. When there is a paradigm shift - and the word applies here - old technology gets put out to pasture. And the people who choose to stay with the old technology risk getting bypassed by more agile competitors. Dual controller array designs are the modern equivalent of steamshovels. Great while they lasted.
Posted by: marc farley | May 05, 2010 at 09:52 AM
From what I have seen of the T800s, disk shelves are only connected to a single node pair, meaning each disk is accessible by only 2 of the 4 controllers. Your disk shelves end up being evenly divided between your node pairs just like in your "2 controller split between 2 array" picture. So in essence, a 4 Node 3Par is a federation of two dual-controller systems in one rack, is it not?
Posted by: Richard Siemers | May 09, 2010 at 01:55 AM
Hi Richard,
The T800 has 8 controllers (or nodes), the T400 and F400 have 4 nodes. As you point out, disk magazines are connected to a specific pair of nodes but the write cache is distributed over all the nodes in the cluster. Also, wide striping in a 3PAR system typically uses all the drives connected to all nodes.
3PAR storage systems are designed as tightly coupled clusters and are much more than aggregations or federations of dual clustered arrays.
Posted by: marc farley | May 10, 2010 at 01:15 AM
I'm a little late to this party but Zilla, are you referring to Major Vendor's "WCA" (write cache availability) feature? If I understand correctly, this capability was designed precisely to address the design flaw that Marc discusses here. Dual-controller setups from other vendors that blindly transition to write-through during single-SP failure mode remain, of course, the general rule.
Posted by: Jeremy Barth | September 29, 2010 at 04:56 AM