A couple of weeks ago, one of the major storage vendors had two major problems to resolve after one of their arrays suffered a firmware-bug-induced failure at one of their cloud (email) service provider customers. They had to:
- Help the customer get back to normal service levels after they had become unacceptable.
- Confront a public relations problem after the failure was exposed by a leading storage publication.
Meanwhile, their service provider customer had four major problems to resolve:
- Get service back to acceptable levels.
- Communicate to their customers what the problem was and how it was being addressed.
- Re-engineer a solution to avoid the same happening again.
- Issue credits to customers for failing to deliver against SLA terms.
A vendor employee tried to address their public relations problem this way in his blog:
"OK, I'll take the blame for this -- sort of. We pride ourselves in putting a lot of thought into our customer designs. I'd argue that we're really, really good at it as well.
But not everyone is 100% sure of how their application will grow over time -- unfortunately, we're not psychics. And, let's be honest, not everyone necessarily wants to pay for redundancy we like to put into our designs.
We don't always get to directly engage all the time, either -- with products such as the (blanked out), most of this stuff moves through the channel. Somebody calls up one of our partners, says that they want to buy one of our products, and one gets sold -- and a lot of product gets sold that way."
I understand the desire to explain how messes become messy, but I'm not sure why he felt the need to speculate that his company's business partners or their customer's budget were key elements of the problem. That is tantamount to saying, "All of our (blanked out) customers could have the same thing happen to them too." Anybody who has ever been close to one of these meltdowns knows there are many variables involved - including vendors underbidding each other and shaving elements from their bids in order to win the business.
From a distance, it looks like the vendor's response to the customer was good, although there apparently were some issues with failure notification from the array when the event occurred. I wouldn't call these sorts of things "Perfect Storms", but there are unfortunate scenarios where multiple things go awry. All vendors have these sorts of bad days, which serve as painful learning experiences. Unfortunately for customers, it's one of the ways vendors improve their customer support processes.
The customer also wrote in his blog, explaining the situation to their customers:
"Our SAN vendor analyzed the system logs for the event and determined that the service processor failure occurred due to a unique bug in the specific version of firmware on the system. Our vendor performed an emergency upgrade. The newer version of firmware includes a fix for the bug. We are taking additional corrective actions to make certain that there is enough spare capacity on the SAN. This will assure it performs without performance degradation in the event of a single hardware failure."
The reparation sounds reasonable, but it's not what I would call best of breed either. I'll explain why in the remainder of this post.
The old trusted dual controller just can't keep up
The explanation the service provider gave to their customers was only half correct. Yes, the failure in one controller was due to a firmware bug - and yes, all vendors find out about some of them at customer sites - but the inability of the surviving controller to handle the workload was another matter altogether.
The major defect of all dual controller designs for service provider applications is the uselessness of write cache when operating in degraded mode on a single controller.
When a dual controller array has a controller failure, all traffic fails over to the surviving controller. However, that controller can't afford to place writes in cache, because if it also fails, any un-flushed writes in cache would be lost - making the recovery process all the more painful. As a result, the throughput of the controller degrades significantly: writes now take orders of magnitude longer to process, because each write must be completed at the physical disk level instead of in fast cache memory. When you consider the sort of read/write ratios involved with an email application (heavy writes), it's not surprising to hear that it took 32 hours for the system to get caught up. I suspect that if the surviving controller had been able to use write cache, the customer might still have experienced some service level problems, but nothing nearly as bad as what they suffered.
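To put rough numbers on that, here's a back-of-the-envelope sketch. The latencies and read/write mix are assumptions I've picked for illustration - none of them come from this incident - but they show how a large per-write penalty turns into a painful slowdown for a write-heavy workload:

```python
# Back-of-the-envelope model with assumed numbers (not data from this incident):
# how much a write-heavy workload slows down when the surviving controller has
# to run write-through, committing every write to physical disk instead of
# acknowledging it from mirrored cache.

CACHE_WRITE_MS = 0.3   # assumed latency of a write acknowledged from mirrored cache
DISK_WRITE_MS  = 8.0   # assumed latency of a random write committed to spinning disk
READ_MS        = 6.0   # assumed average read latency (roughly unchanged by the failure)
WRITE_MIX      = 0.6   # assumed fraction of I/Os that are writes (email is write-heavy)

def avg_latency_ms(write_latency_ms: float) -> float:
    """Weighted average I/O latency for the assumed read/write mix."""
    return WRITE_MIX * write_latency_ms + (1 - WRITE_MIX) * READ_MS

normal = avg_latency_ms(CACHE_WRITE_MS)    # both controllers up, write-back cache
degraded = avg_latency_ms(DISK_WRITE_MS)   # single controller, write-through

print(f"per-write slowdown:        {DISK_WRITE_MS / CACHE_WRITE_MS:.0f}x")
print(f"normal avg I/O latency:    {normal:.1f} ms")
print(f"degraded avg I/O latency:  {degraded:.1f} ms")
print(f"overall workload slowdown: {degraded / normal:.1f}x")
```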
Write performance during array component failures is an important point that many customers give insufficient weight to when making their purchases. Public service providers certainly need to understand this. The exact same scenario - controller failure and subsequent drop in service levels - could certainly happen to a traditional data center customer, but the ramifications of this scenario are not as ugly as they are for a multi-tenant public service provider.
This case is a perfect example of how an older architecture is incapable of meeting the requirements of the new cloud service business model. If you are a cloud service provider reading this and wondering if you might have a similar exposure to a controller failure (including 3PAR customers with dual controller arrays), my advice is to review what you have and start thinking about what to expect if you have a controller failure, and how you might want to deal with it on both a short-term and a long-term basis. Best of breed cloud storage should not include dual controller arrays.
One of the identified corrective actions is having "enough spare capacity on the SAN", which in this case involves installing a second array. Without knowing the inside scoop, it looks like the idea is to split the workload across the two arrays so that if a controller failure occurs in either array, the performance drop won't be as noticeable. The array that doesn't suffer the failure will keep working as expected and the array that has the failure will only have half the load to deal with.
There are two primary problems with this "fix":
- Performance will still suffer on the array with the controller failure
- The I/O load will continue to increase over time
You are always going to have performance degradation of some sort when you can't use write caching, unless you are only reading data - which isn't the case here. It is flat out wrong to assume that a performance problem will not occur. Regardless, with the new two-array SAN, whichever system has the controller failure should be able to get caught up much faster than the 32 hours this customer had to wait. Of course, the customer's capacity and I/O load will almost certainly increase over time, and as that happens, the strategy of splitting the load between two arrays loses its effectiveness.
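Here's a toy model of that trade-off. All of the numbers are assumptions for illustration (none come from the incident), but they show why halving the load per array shortens catch-up time today, and why that benefit shrinks as the offered load grows:

```python
# Toy model with assumed numbers (none taken from the incident): how much
# catch-up time the degraded array needs after a controller is repaired, and
# why splitting the load across two arrays helps now but erodes as load grows.

FULL_IOPS     = 20_000   # assumed capacity of an array with both controllers healthy
DEGRADED_IOPS = 6_000    # assumed capacity in write-through mode on one controller
REPAIR_HOURS  = 4        # assumed hours spent running degraded before repair

def catch_up_hours(offered_iops: float) -> float:
    """Hours needed after repair to drain the backlog built up while degraded."""
    if offered_iops <= DEGRADED_IOPS:
        return 0.0                                   # no backlog ever forms
    backlog = (offered_iops - DEGRADED_IOPS) * REPAIR_HOURS
    headroom = FULL_IOPS - offered_iops              # spare capacity once repaired
    return float("inf") if headroom <= 0 else backlog / headroom

total_load = 16_000                                  # assumed total offered IOPS today
print(f"one array, full load:    {catch_up_hours(total_load):.1f} h")
print(f"two arrays, half load:   {catch_up_hours(total_load / 2):.1f} h")
print(f"two arrays, load +60%:   {catch_up_hours(total_load * 1.6 / 2):.1f} h")
```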
Along with adding the controllers, they are certainly also adding disk drives, and some notion of what "reasonable" utilization limits should be for them. The problem with limiting utilization as a best practice is that it puts the stamp of approval on inefficiency - not only for capacity utilization but also for the power and cooling required to support all those underutilized drives. Most legacy arrays have built-in inefficiencies in the way data is laid out on disks, making it virtually impossible to achieve uniform utilization across all disk resources. The result is uneven consumption of disk capacity, as well as uneven I/O service levels among different disk groups, which is another variable in how much performance degrades following a controller failure in a dual controller array.
Finally, the customer now has two arrays to manage, including multipath connections, SAN zones, and all other aspects of the configuration, all of which contribute to change management complexities down the road. The result is a net drag on administrator productivity and an increased TCO.
How many do you need?
A true best of breed solution would address the root-cause deficiency in the array's design, without creating additional management and cost burdens to the customer. Obviously, more than two controllers are needed. But how many controllers does a cloud service provider need in an array? The answer is at least three. Why? Because when a single controller fails, there can still be two surviving controllers working together, mirroring their cache contents, and performing fast writes to cache memory. That said, controllers are usually packaged in pairs for redundancy purposes, which means that the most likely configurations will have four controllers.
If you compare a single quad controller array with two dual controller arrays, there are some key advantages that immediately jump out:
- No or limited loss of performance after a controller failure
- All drives and cache can be used to service all workloads
- Managing a single array significantly reduces cost and complexity
A better recipe for maintaining performance levels
The next question is: "Is there a suitable quad controller array that the customer could have used instead of the two dual controller arrays they have?" Yes, 3PAR's F400 and T400 arrays are both quad controller arrays. The disk drives in these arrays can be either SATA or FC, or a mix of both types if the customer wants to implement tiering. Product information for the F400 can be found here, and for the T400 here.
However, simply putting four controllers in an array does not guarantee that they will be able to sustain write caching if one of them fails. The array must have the ability to remap and re-mirror the write cache contents of all four controllers to the surviving controllers following the loss of a controller. It's an interesting geometric sort of problem: there are four controllers, each holding its own cache plus cache mirrored from the other controllers in the array. All cache contents, including mirrors, need to be distributed evenly across all controllers to avoid congestion and load imbalances. All cache content, including mirrors, needs to be accounted for within the array so that if a controller fails, the other controllers can identify all the surviving original and mirrored copies of data. For cache data that has lost either a primary or mirrored copy, a second (new) copy needs to be made. Finally, the amount of data in cache may need to be re-leveled (decreased) to fit into the degraded cache capacity (3 controllers instead of 4).
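To make that bookkeeping concrete, here is a minimal sketch of the general idea - my own illustration, not 3PAR's actual implementation - in which each dirty cache page has a primary and a mirror copy, and any page that loses a copy when a controller fails is re-mirrored onto the least-loaded survivor:

```python
# A minimal sketch of the bookkeeping described above -- my own illustration,
# not 3PAR's implementation. Every dirty cache page has a primary copy on one
# controller and a mirror on another. When a controller fails, each page that
# lost one of its two copies gets a new copy on the least-loaded survivor, so
# redundancy (and write-back caching) continues on the remaining controllers.

from collections import defaultdict

CONTROLLERS = ["C0", "C1", "C2", "C3"]

# pages[page_id] = (primary_controller, mirror_controller)
pages = {}
for page_id in range(12):
    primary = CONTROLLERS[page_id % 4]          # spread primaries round-robin
    mirror = CONTROLLERS[(page_id + 1) % 4]     # mirror on the next controller
    pages[page_id] = (primary, mirror)

def fail_controller(failed: str) -> None:
    """Re-mirror every cache page that lost a copy on the failed controller."""
    survivors = [c for c in CONTROLLERS if c != failed]
    load = defaultdict(int)                     # copies currently held per survivor
    for primary, mirror in pages.values():
        for c in (primary, mirror):
            if c != failed:
                load[c] += 1
    for page_id, (primary, mirror) in pages.items():
        if failed not in (primary, mirror):
            continue                            # page kept both copies, nothing to do
        keeper = mirror if primary == failed else primary
        # choose the least-loaded survivor that doesn't already hold this page
        target = min((c for c in survivors if c != keeper), key=lambda c: load[c])
        load[target] += 1
        pages[page_id] = (keeper, target)

fail_controller("C2")
for page_id, copies in sorted(pages.items()):
    print(page_id, copies)      # every page still has two copies on survivors
```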
Last year I made a 9-minute video describing how Persistent Cache works. Here it is again. Thanks for watching.