Storagebod is rapidly becoming one of my favorite storage blogs for its clear thinking and informative content with a bit of an edge. He recently wrote about the death of RAID-5 and referred to a blog post that appeared on Storagemojo's site last year. The gist is that long RAID-5 rebuild times on large SATA disk drives, combined with the risk of hitting additional hard read errors while an array is operating in degraded mode, make it more likely that an unrecoverable read error will occur in an array. 3PAR does things differently, and I thought I would use Storagebod’s inspiration to write about it.
3PAR InServ storage systems subdivide disk drive raw storage capacity into small, granular 256 MB sized units we call “chunklets”. If the term chunklet doesn’t work for you, it might help you to think of them as data compartments. A 300 GB disk drive would have 1170 data compartments.
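The chunklet count is simple division, sketched below. The function name and the choice of decimal units (1 GB = 1000 MB) are assumptions for illustration. The post does not say how capacity is counted, so the result only approximates the 1170 figure quoted above.

```python
CHUNKLET_MB = 256  # chunklet size stated in the post


def chunklets_per_drive(capacity_gb, mb_per_gb=1000):
    """Whole chunklets that fit in a drive of the given raw capacity.

    mb_per_gb is an assumption: drive vendors usually quote decimal units.
    """
    return (capacity_gb * mb_per_gb) // CHUNKLET_MB


print(chunklets_per_drive(300))                  # 1171 with decimal units
print(chunklets_per_drive(300, mb_per_gb=1024))  # 1200 with binary units
```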
An InServ’s volume manager uses chunklets when forming RAID groups, which I like to think of as micro-RAID arrays. For instance, a micro-RAID 1 array is formed by writing data to two chunklets on different disk drives. Likewise, a micro-RAID 5 array is made by writing application data to data chunklets and a parity chunklet. All chunklets belonging to a single micro-RAID array are located on different disks, and those disks sit in different field-replaceable units (FRUs).
Multiple micro-RAID arrays are combined to form real-world-sized logical disks, which are then exported to host systems as LUNs. The bottom line is that LUNs are protected by multiple micro-RAID arrays spread throughout the system. This means that 3PAR RAID 5 arrays can withstand multiple drive failures (although they can’t survive the failure of two drives that hold chunklets from the same micro-RAID array).
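The failure rule in the previous paragraph can be modeled in a few lines. This is a toy sketch, not the InServ placement algorithm: the round-robin layout, the function names, and the eight-drive example are all assumptions, and the FRU constraint is not modeled.

```python
def build_sets(num_drives, set_width, num_sets):
    """Place each micro-RAID set on set_width distinct drives (round-robin)."""
    assert set_width <= num_drives, "a set needs one drive per member"
    sets = []
    for i in range(num_sets):
        start = (i * set_width) % num_drives
        sets.append([(start + k) % num_drives for k in range(set_width)])
    return sets


def survives(sets, failed_drives):
    """RAID 5 tolerates at most one lost member per set."""
    failed = set(failed_drives)
    return all(len(failed.intersection(members)) <= 1 for members in sets)


raid_sets = build_sets(num_drives=8, set_width=4, num_sets=2)
print(raid_sets)                    # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(survives(raid_sets, [0, 4]))  # True: the failures hit different sets
print(survives(raid_sets, [0, 1]))  # False: both failures hit the same set
```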
Sparing in InServ arrays also uses chunklets. That means there are no physical spare drives wasting energy and generating heat while waiting for a productive drive to fail. When a disk fails in an InServ system, its used chunklets are evacuated to spare chunklets on other drives. The wide striping algorithms that spread micro-RAID arrays throughout an InServ system are also used to relocate evacuated chunklets from a failed drive.
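A rough sketch of chunklet-level sparing follows. The selection policy (pick the drive with the most free spare chunklets that is not already a member of the affected set) is an assumption made for illustration, as are all the names; the post does not describe the actual policy.

```python
def evacuate(failed, sets, spare_count):
    """Relocate every chunklet that lived on the failed drive.

    sets: list of lists of drive ids, one chunklet per drive per set.
    spare_count: dict mapping drive id to free spare chunklets.
    Returns a list of (set_index, target_drive) moves.
    """
    moves = []
    for idx, members in enumerate(sets):
        if failed not in members:
            continue
        # Keep set members on distinct drives and skip the failed drive.
        candidates = [d for d, free in spare_count.items()
                      if free > 0 and d != failed and d not in members]
        if not candidates:
            raise RuntimeError("no spare chunklet available")
        # Prefer the drive with the most free spares; lowest id breaks ties.
        target = max(candidates, key=lambda d: (spare_count[d], -d))
        spare_count[target] -= 1
        members[members.index(failed)] = target
        moves.append((idx, target))
    return moves


layout = [[0, 1, 2, 3], [0, 4, 5, 6], [0, 1, 4, 7]]
spares = {drive: 1 for drive in range(8)}
relocations = evacuate(0, layout, spares)
print(relocations)  # [(0, 4), (1, 1), (2, 2)]
```

In this example the three evacuated chunklets land on three different drives, so the rebuild work is shared rather than funneled into a single spare disk.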
Wide striping circumvents the performance bottlenecks and dangerously high duty cycles that are characteristics of typical RAID degraded mode and rebuild operations, including those performed on dual-redundancy RAID such as RAID 6. Micro-RAID array parity rebuilds complete very quickly with far less stress to individual disk drives.
In summary, the combination of chunklet-based micro-RAID arrays and wide striping creates a significantly more robust and safer environment for RAID 5 than any competing product offers - without the inherent performance disadvantages of RAID 6.
But wait, what about SUPER FAST RAID-DP? Then you get both protection and performance! Which, by the way, NetApp owns the largest market share of RAID-DP deployments!
Ok, sarcastic hat off now, take a look at what Atrato and Xiotech (ICE) are doing, because I think they have interesting similarities to the theory behind 3Par's data protection.
I thought I was your favorite Storage Blog?
Posted by: Steven Schwartz - The SAN Technologist | October 22, 2008 at 08:17 PM
Oh SAN Technologist - I am so fickle. U R my favorite storage blog right now! And thanks for the pointers here.
Posted by: marc farley | October 23, 2008 at 12:32 AM
That's no different from RAID-5; for instance, a LUN created across two RAID-5 groups can sustain a double disk failure as long as the disks aren't in the same group. Same for micro-raid.
And not having spares doesn't save on fuel bills when you have to have enough free space in chunklets to do a rebuild. The spare chunklet overhead cost is just amortized across more disks.
Or am I missing something here?
Posted by: Alex McDonald | October 23, 2008 at 04:41 AM
Atrato is a company to watch, they are doing some clever stuff at the moment. However, I'm expecting chunk based schemas to become the norm, what that chunk is will vary from vendor to vendor. In a couple of years time, we'll look back and wonder why it was any different. Obviously once we go to chunk based schemas, protection will be done at the chunk level. And, I'm quite happy to be one of Marc's favourites, I'm not proud!
Posted by: Martin G | October 24, 2008 at 12:23 AM
This is precisely how the HP EVA family of arrays has been doing things for the past 7 years. The argument usually levelled against the performance benefits and availability of wide striped arrays concerns the ability to isolate I/O load, which BTW is still possible. The counter argument, I suppose, is that if you don't have enough spindles to service the I/O load, you don't have enough spindles. So add some.
Posted by: John H | October 27, 2008 at 03:33 AM
Thanks for commenting John,
The ability to isolate I/O load depends on the architecture and administrative tools of the storage system (array).
People tend to believe they can conquer the problems of I/O tuning by themselves (or with the help of professional services). Often the result is spending a lot of money or time to get results that are almost guaranteed to disappoint over time as additional LUNs are exported for other hosts and applications. It's difficult to balance system resources (and remove bottlenecks) if the granularity of resources is too coarse.
But it's not just a matter of granularity. The architecture and tools of some products limit performance tuning to manual methods. If that's the only thing you know, you tend to believe in it. (John, is there a techie's twist on Russell's teapot in here somewhere?)
Most of the time, storage admins would be better off letting an intelligent system manage the placement of data on fine-grained resources. Assuming adequate intelligence and granularity, the focus of storage administration moves to a higher plane, where creating and maintaining a balanced system is the goal. FWIW, that's not necessarily child's play, but it is much easier to understand than all the minutiae that storage admins typically have to deal with.
Posted by: marc farley | October 27, 2008 at 09:39 AM
I ran across your blog when I googled wide striping (which seems to be a recent favorite buzzword).
Anyway, I have a basic understanding of the InSpire architecture and how data is distributed in chunklets. I do have some questions, of course. Is there a sub-chunklet level? For instance, an individual application I/O will surely be under 256 MB in size. Will all of this one request go to one chunklet, or will it be spread evenly among a group of chunklets? In other words, is there a 3Par equivalent to "chunk size/stripe depth" that is even smaller than a chunklet?
The best comparison I can think of is the HP EVA. All of their disks are grouped into RSS groupings. Data in a LUN is first written to one RSS and then moves onto another RSS group. I believe the "stripe width" for the RSS group is 2MB. However, the amount written to each individual drive (chunk size) is 128 KB I believe. So, in the very rare occurrence that there is a 2MB hotspot, it will always be hitting those 7 to 11 drives instead of the 100+ drives in a system.
Another question is whether or not it is better to isolate different I/O streams into different pools. Throwing a large sequential stream at the same time as a small random stream onto the same set of drives is absolute murder for performance. Even HP advises separating out these workloads into different LDADs (pools) such as Exchange logs (100% sequential) and EDBs (small random reads). Your thoughts on this? The 3Par SE finally admitted this was optimal as well when I pushed hard enough.
Posted by: maobacks | December 16, 2008 at 03:19 AM
Thanks for the inquiry maobacks. Yes, there are smaller sub-units within the chunklet. The best way to think of it is that the chunklet replaces the disk drive as the granular storage target that the controller operates on. Just as there are multiple sub units of storage on a disk drive, there are multiple sub units of storage within chunklets too.
Data is striped across chunklets using RAID algorithms and as you would expect you can adjust the stripe depth to fit the I/O requirements of your application. RAID is implemented as a series of concatenated "micro-RAID" arrays. If you use a RAID5(3+1) configuration and you are wide striping across 40 drives, then you would have 10 logical arrays that are concatenated to create a much larger volume (or LUN).
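To make the arithmetic concrete, here is a small sketch of concatenated micro-RAID arrays. The layout is deliberately simplified and is an assumption, not the InServ on-disk format: parity placement is ignored, each array is taken to hold three 256 MB data chunklets, and the 128 KB stripe depth is a hypothetical value.

```python
def arrays_across(drives, data, parity=1):
    """Number of micro-RAID arrays when each array spans data + parity drives."""
    return drives // (data + parity)


def locate(offset_kb, array_data_kb, data_members, stripe_depth_kb):
    """Map a logical offset to (array, stripe row, data column, offset in unit)."""
    array_index, within = divmod(offset_kb, array_data_kb)
    stripe_unit, unit_offset = divmod(within, stripe_depth_kb)
    row, column = divmod(stripe_unit, data_members)
    return array_index, row, column, unit_offset


print(arrays_across(40, data=3))  # 10, matching the RAID5(3+1) example

ARRAY_DATA_KB = 3 * 256 * 1024    # three data chunklets of 256 MB each
print(locate(ARRAY_DATA_KB + 300, ARRAY_DATA_KB, 3, 128))  # (1, 0, 2, 44)
```

The second result reads as: the request falls in the second concatenated array, in its first stripe row, on the third data chunklet, 44 KB into that stripe unit.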
Wide striping addresses disk contention bottlenecks through massive queuing. Data is spread more or less at random over the available drives. Hotspots are far less likely to occur.
The question about isolating different I/O streams into different pools is a really good one. I think it makes sense to put low priority data on slower, lower cost drives. (Duh!) But what about mixing transaction processing and streaming data? I don't think any other vendor's products come close to the performance of 3PAR systems for mixed workloads. Nonetheless, if you want to create special pools of disks for certain purposes you can, but I'm not convinced that your results will be better. FWIW, the striping in a 3PAR array is done on a much wider scale than on an EVA, and I don't think it makes sense to call what the EVA does "wide striping".
Application developers often make sub-optimal recommendations about how to use storage with their applications. They tend to think their applications justify special treatment without consideration for the cost and problems of owning and managing storage. Of course, every situation is different.
Posted by: marc farley | December 17, 2008 at 12:01 AM