
February 24, 2010



Alex McDonald

Good digging Mark, and I'm glad that an old SAN guy can pick up on file technology patents and give them a read. But you're too obsessed with RAID. Look elsewhere for illumination, and you might see why tiering isn't a problem for us.

The RAID or disk level isn't where the action's at; it's the aggregate, which is a collection of RAID groups.

Love the picture btw.

marc farley

Thanks Alex, I'm glad to hear Netapp can do some virtualization tricks, but the question is how closely coupled are WAFL and the disk subsystem? Is it still using pointers from each drive to determine write locations?

Alex McDonald

I can see why you're having trouble with this from the comment, so here's a quick overview of the way WAFL really works.

"As writes come into a WAFL system, they are first staged to NVRAM in order to eliminate parity RAID write penalties..."

That's not correct. Blocks (because that's what WAFL at this level is dealing with) are staged so we can do a stripe write; there's no parity RAID penalty because we *never rewrite blocks* and hence never have to recalculate parity for each updated block. You have to -- and that's why traditional SANs have large write caches, and RAID6 is such a performance pain -- so that

(1) the parity can be done once for a set of updated blocks on the same parity stripe
(2) writes can be coalesced to "sweep" across the disk and minimize the effects of seeking
(3) you can defer expensive seeking writes (including the re-writing of parity blocks) for as long as is possible.

WAFL doesn't have those issues. It doesn't need huge write caches; by comparison with just about every other system out there, they're tiny.

Collect the blocks, calculate the 2 parities, write the stripe across the RAID group, rinse, lather, repeat. Simple; and *fast*.
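The difference between updating blocks in place and writing whole new stripes can be put in numbers. The following is a minimal sketch, not NetApp's code: it counts disk I/Os under the textbook read-modify-write model for parity RAID, and the function names are hypothetical. `xor_parity` shows only plain row parity; the second (diagonal) parity of RAID-DP is not modelled.

```python
def small_write_ios(parity_disks: int) -> int:
    """Disk I/Os to update ONE block in place on parity RAID.

    Read-modify-write: read the old data block and each old parity
    block, then write the new data block and each new parity block.
    """
    reads = 1 + parity_disks
    writes = 1 + parity_disks
    return reads + writes


def full_stripe_ios(data_disks: int, parity_disks: int) -> int:
    """Disk I/Os to write a whole new stripe: no reads, one write per disk."""
    return data_disks + parity_disks


def xor_parity(blocks: list) -> bytes:
    """Row parity for a stripe: byte-wise XOR of equally sized data blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


if __name__ == "__main__":
    data_disks, parity_disks = 14, 2  # illustrative group size
    in_place = data_disks * small_write_ios(parity_disks)
    striped = full_stripe_ios(data_disks, parity_disks)
    print(f"{data_disks} blocks updated in place: {in_place} I/Os")
    print(f"{data_disks} blocks written as one stripe: {striped} I/Os")
```

Under this model a single-block update costs 4 I/Os with one parity disk and 6 with two, which is the penalty a large write cache tries to amortize; a full-stripe write pays one write per disk and no reads.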

Built on top of that there are consistency points, which a traditional SAN doesn't have. The on-disk version of the data moves from consistent state to consistent state; non-volatile RAM (mirrored in an HA cluster) covers the gaps between the consistency points. Makes failover and recovery easy if a system dies; just replay the log.
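The consistency point idea can be sketched in a few lines. This is a toy model under loud assumptions: the class and method names are invented for illustration, the "disk" is a dictionary, and the NVRAM log is a plain list of operations. It shows only the principle that the on-disk image moves from one consistent state to the next and that recovery replays the logged operations on top of the last one.

```python
class ToyFiler:
    """Toy model: on-disk state changes only at consistency points (CPs)."""

    def __init__(self):
        self.disk = {}       # state as of the last consistency point
        self.memory = {}     # current in-memory state
        self.nvram_log = []  # operations accepted since the last CP

    def write(self, key, value):
        # Log the operation first, then apply it in memory.
        self.nvram_log.append((key, value))
        self.memory[key] = value

    def consistency_point(self):
        # Flush a complete, consistent image and empty the log.
        self.disk = dict(self.memory)
        self.nvram_log.clear()

    def recover(self):
        # After a crash: start from the last CP and replay the log.
        self.memory = dict(self.disk)
        for key, value in self.nvram_log:
            self.memory[key] = value


if __name__ == "__main__":
    filer = ToyFiler()
    filer.write("a", 1)
    filer.consistency_point()
    filer.write("b", 2)   # logged, not yet on disk
    filer.memory = {}     # simulate losing volatile memory
    filer.recover()
    print(filer.memory)
```

The on-disk image never contains a half-applied update in this model, which is why recovery needs nothing more than the replay loop.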

And above that again, something that you've omitted; the aggregate, a collection of one or more RAID groups, from which we carve objects like LUNs or filesystems for NFS or CIFS.

Which is why I see your problem, and why you say "... creating block abstraction layers is highly improbable". No, we've done it. All blocks for the object are in the aggregate as a tree-like structure, with meta-data pointers to each 4K block. That's what makes thin provisioning, clones, snapshots, dedupe and all the other NetApp features possible, and what makes for such a high degree of storage virtualization; abstraction of the physical.

With that WAFL 101, perhaps you can see how storage tiering isn't a problem. Look up the stack; the virtualization NetApp systems provide has little to do with physical disks.

marc farley

I DID understand the write striping no RAID penalty thing and thought I covered it concisely, but it's OK that you spelled it out in more detail.

You mentioned "meta-data pointers to each 4K block". Convinced by your comment that Netapp can aggregate storage resources, the question for me is still around a relationship between WAFL and physical storage. It is certainly possible to create aggregated storage that still exposes some details of individual disk drives to WAFL. The patent I referenced discusses exactly that.

I assume writes to disk drives are still based on these meta-data pointers? In addition, the block mappings for aggregated storage on a Netapp Filer still carry information about sequential (contiguous) blocks on individual disk drives as well as some useful locality information - which is used to minimize disk latencies during writes. In other words, the use of meta-data pointers is structured and is not random.

Then the next question is how static or dynamic those aggregate block maps are. When WAFL writes to a block it updates its inodes. As far as WAFL is concerned, the meta-data pointer for a used block is static and does not change until it is returned to the free pool through any number of maintenance procedures.

But of course, things can be different on the subsystem side of the equation. Pointers inside storage can be remapped, as our Dynamic Optimization tool does when it redistributes data over a larger number of disks. The question is whether or not Netapp Filers can perform dynamic remapping of blocks. I'm not asking about returning blocks to the free pool, I'm asking whether or not a reference to a block in an inode can access a block that is physically different than the block that was associated with that inode at the time it was written.

Mike Richardson

"The question is whether or not Netapp Filers can perform dynamic remapping of blocks. I'm not asking about returning blocks to the free pool, I'm asking whether or not a reference to a block in an inode can access a block that is physically different than the block that was associated with that inode at the time it was written."

It looks like you wasted a blog post on this chaos theory of yours. We have no issues in dynamically changing the relationship of physical blocks to logical on the fly.

Using your logic, perhaps you can explain how we perform deduplication and drastically change/reduce the physical blocks of the aggregate without logically changing how the data appears to the application.

Seems like your mind is set on forcing legacy array logic to the present day WAFL architecture.

Alex McDonald

My, this is deep.

Two questions from what I can see;

1. "writes to disk drives are still based on these meta-data pointers [cwl pointers in the patent]" -- don't know. Probably to something similar, since WAFL is careful to make sure that it writes where is best; the WA in WAFL isn't really "write anywhere", it's write where we choose to.

2. "a reference to a block in an inode can access a block that is physically different than the block that was associated with that inode at the time it was written" -- true. Think of a block update; WAFL doesn't do that as a rewrite in place. WAFL writes a new block and adjusts the inode ptrs. That's also how snapshots, clones, dedupe etc do their magic.

What's your point?
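The "write a new block and adjust the inode pointers" behaviour in point 2 can be sketched as follows. This is a minimal illustration, not WAFL: the names `ToyVolume`, `fbn` (file block number) and `pbn` (physical block number) are assumptions made for the example, a snapshot is modelled as a frozen copy of the pointer table, and freeing of unreferenced blocks is left out.

```python
class ToyVolume:
    """Toy redirect-on-write volume: updates never overwrite a block."""

    def __init__(self):
        self.blocks = {}     # pbn -> data
        self.next_free = 0   # next unused physical block number
        self.inode = {}      # fbn -> pbn (the live pointer table)
        self.snapshots = {}  # snapshot name -> frozen copy of the table

    def write(self, fbn, data):
        # Always allocate a fresh physical block, then repoint the inode.
        pbn = self.next_free
        self.next_free += 1
        self.blocks[pbn] = data
        self.inode[fbn] = pbn
        return pbn

    def snapshot(self, name):
        # Only pointers are copied; no data blocks are duplicated.
        self.snapshots[name] = dict(self.inode)

    def read(self, fbn, snapshot=None):
        table = self.inode if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[fbn]]


if __name__ == "__main__":
    vol = ToyVolume()
    vol.write(0, b"old")
    vol.snapshot("hourly.0")
    vol.write(0, b"new")  # lands in a different physical block
    print(vol.read(0), vol.read(0, snapshot="hourly.0"))
```

The same file block resolves to different physical blocks before and after the update, while the snapshot keeps pointing at the old one.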

marc farley

Mike, There are no wasted posts - just wasted posters. You ask a good question: if a Filer dedupes at the block level, wouldn't that require an addressing remapping?

I don't know - would it? There are still other copies of the same data that are not being remapped. As the Filer knows its own block map, I can imagine a scenario where the adjustments are made in the inodes instead of within the block storage subsystem. I'm sure people would be interested in knowing the answer to that, but that wasn't where I was going with this chaos theory of mine.

All us knuckle draggers need to be brought up to speed on the new wonders of WAFL. Are you the guy for the job? - because I'm getting tired of tweaking Alex, who unwittingly volunteered for this task and I'd rather not irritate HIM anymore - it's getting close to his bedtime.

marc farley

Alex, it is getting deep. Maybe it's time to knock off. These social things just have a way of going on and on.

Anyway, you answered the first question about writing where you want to. Good answer.

The 2nd question was slightly different. I understand how writes are always made to free blocks and do not overwrite existing blocks. The question I was asking here is being tackled by Mike Richardson now it seems. And that is whether or not there can be any "back end" substitutions made in the block mapping. To be clear (is that possible?) - that only makes sense if there is a block mapping that is kept by the storage subsystem. WAFL maintains a block map file that indicates if a certain block is being used by any version of any file in the system. Is there a similar construct on the storage subsystem side that allows blocks to be changed, swapped, substituted, whatever?

If so, block based (storage) tiering can work. If not, then it's much more difficult - and could explain why Tom Georgens said the things he did about tiering.

But seeing as how it's getting to be the end of your day, if you want to let this slide, that's OK with me.

Mike Richardson

I don't know much about your Dynamic Optimization tool. But it sounds similar to NetApp Reallocation. For example, if you add additional disks to an aggregate and want your existing LUNs/Volumes to make use of the performance capabilities of the new spindles more quickly, reallocation can redistribute already written data from the existing spindles across all spindles.

With the same facility, we also have the capability to analyze data as it is being read by clients, determine if the data is appropriately optimized on disk and, if necessary, rewrite the physical segments with a better, more sequential layout completely transparent to the client reading the data.

I hope this specifically answers your question about our physical layer flexibility. However, this feature in itself is just one small part of the robust efficiency and performance solution set we have in OnTap. Alex already explained some of the performance benefits inherent in WAFL, even apart from the additional performance and efficiency benefits you get with things like Performance Acceleration Modules and High Performance Deduplication.

We can already meet customers' efficiency needs by storing and retrieving data more intelligently than the brute force hardware vendors. As such, I don't think NetApp is starving for a tiering solution. We haven't created the same siloed problem that tiering will fix.

Alex McDonald

No, that's fine, I'm still up.

I think you answered your own question. How it does it shall remain sooper seekrit. Actually, I don't know how it does it in detail, since I'm not a WAFL engineer. I have a much smaller brain than Eric Hamilton or Peter Corbett, two of our chief WAFL architects, some of whose papers and patents you may have run across in your research.

What I do know is that any blocks pointed to by inodes can be remapped onto any other block on any other drive in any other RAID group. As Mike pointed out, that's how we do dedupe -- and much other magic now and in the future.
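Dedupe by pointer remapping can be illustrated with a toy block map. The sketch below is an assumption-laden example rather than a description of NetApp's implementation: the `dedupe` function and its arguments are hypothetical, and blocks are matched by SHA-256 digest alone, where a production system would typically also verify the bytes before sharing a block.

```python
import hashlib


def dedupe(pointers, blocks):
    """Repoint duplicate blocks at one shared copy and free the rest.

    pointers: dict of logical block number -> physical block number
    blocks:   dict of physical block number -> bytes
    Returns the number of physical blocks freed.
    """
    keeper_by_digest = {}
    for logical in sorted(pointers):
        physical = pointers[logical]
        digest = hashlib.sha256(blocks[physical]).hexdigest()
        keeper = keeper_by_digest.setdefault(digest, physical)
        if keeper != physical:
            pointers[logical] = keeper  # remap; logical view is unchanged

    live = set(pointers.values())
    freed = 0
    for physical in list(blocks):
        if physical not in live:
            del blocks[physical]
            freed += 1
    return freed


if __name__ == "__main__":
    pointers = {0: 10, 1: 11, 2: 12}
    blocks = {10: b"A" * 4096, 11: b"B" * 4096, 12: b"A" * 4096}
    before = [blocks[pointers[i]] for i in sorted(pointers)]
    freed = dedupe(pointers, blocks)
    after = [blocks[pointers[i]] for i in sorted(pointers)]
    print(freed, before == after)
```

The data seen through the logical block numbers is identical before and after, while the set of physical blocks shrinks.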

I think I've also explained why PAM-II is a read-only cache; WAFL doesn't need a write cache beyond NVRAM.

Lastly, Georgens may be right about storage tiering. But he didn't say it because there's something inherent in WAFL that makes it difficult.

Thanks for the opportunity to explain. Still like the photo, btw.

marc farley

Good answers Alex, Sooper Seekricy is acceptable. Glad we didn't "blunt-object you into submission" today!

Chuck Hollis

Sorry for wading in. And I am no expert.

Have you noticed that you don't end up mixing drive types in an aggregate? AFAIK they all have to be exactly the same.

That would, for example, prevent you from inserting a single flash drive (or two) and using it intelligently. Or mixing FC and SATA and placing the data intelligently.

My sense is that WAFL is not aware of individual drive characteristics, and can't intelligently exploit them. It wants a big pool of homogeneous disks to do its stuff.

Simple, yes. But every approach has its limitations.

My two cents ...

-- Chuck

marc farley

Chuck, you are always welcome to wade in - no apologies necessary. I suspect you are correct about their volumes needing homogeneous pools. The business of PAMs doesn't add up to me. There is something else there, but it might take a while to find out.

Alex McDonald


You're right. Currently, all drives have to be of the same type in an aggregate.

@Marc; back to PAM-II cards. In WAFL, writes are cheap but reads are more expensive. Hence, large cache for read, and none for write. There's actually a NetApp SPECsfs benchmark that shows this; here's a link to a good blog about it. http://blogs.netapp.com/extensible_netapp/2009/08/shed-a-tier-with-pam-ii-an-alternative-to-emc-fast.html. It's by Chuck's old friend Kostadis.

Mike Riley

You can mix drive types in an aggregate and there may be good use cases for these types of aggregates for NetApp customers. We haven't taken that off the table. However, it's not a technical limitation of WAFL.

I don't know how super secret all this stuff is. You can find the basics in this 2008 Usenix paper by NetApp engineers: http://www.usenix.org/event/usenix08/tech/full_papers/edwards/edwards_html/.

When we were dealing with what we called "Traditional Volumes" we used direct block mappings. When we introduced Flexible Volumes in ONTAP 7.0 (released in 2004) we introduced a logical construct - that block abstraction layer you were asking about - between the aggregate and the data container, a FlexVol. This virtualization layer (also called a level of indirection in the whitepaper) allows you to seamlessly introduce storage features dependent on virtualized storage such as (but not limited to) thin provisioning, cloning, deduplication. This level of indirection has also served as the basis for EMC's research into CBFS. It shouldn't be a foreign concept to EMC readers but as EMC begins to introduce this concept into their own products, I suspect we'll hear a campaign along the lines of "WAFL done right." Regardless, we already have a virtualization layer inside of ONTAP and have for years now. It's not a technical gating factor.

As far as accessing the blocks of data, it's important to note that WAFL stores metadata in files. There are block map files, inode files, etc. WAFL can find any piece of data or metadata by simply looking it up in these cross-indexed files. The index has a tree structure to it as Alex mentioned. This tree structure is rooted in something we call a vol_info block (like a superblock). As long as it can find the vol_info block, it doesn't matter where any of the other blocks are allocated on disk.

WAFL is also "RAID aware" which is somewhat unique but for more on how, why and WAFL integration of RAID-DP, I will point you to the Usenix paper of the year in 2004 written by Peter Corbett and colleagues: http://usenix.org/events/fast04/tech/corbett.html. So, having a RAID-aware virtual storage system is not a technical gating factor.

I know that's a lot of reading but you'll find that the story behind PAM has everything to do with economics and nothing to do with some conspiratorial technical cover-up. There's simply no there there.

marc farley

Alex says the drives have to be the same and Mike Riley says they can be mixed. Is this the kind of flexibility Mike Richardson was talking about? :)

Mike Riley

Whoops - that's a future. My mistake.

Mike Riley

Technically possible vs. shipping. The takeaway is it's not a WAFL limitation.

Mike Riley

Sorry - this was bugging me so I had to go look it up. SAS & FC can be combined in the same aggregate (same RPM). SATA cannot.


nate

I think the 3PAR dynamic optimization goes a bit further than what the NetApp restriping stuff can do (unless the NetApp stuff has changed significantly recently). It was my understanding that the restriping was limited to within an aggregate. And there were space limitations as far as how large an aggregate could be, which thus limited the number of spindles a particular volume could span.

With 3PAR the only upper limit is the limit of the system as a whole and the lower limit is only for real small data sets(1GB of data won't get striped very far).

I just added another 96 disks to my 3PAR array, and am in the process of migrating a bunch of volumes to them to free up space on my other spindles, converting from RAID 5 5+1 to 2+1 in the process. Once this initial set of volumes is moved, that will free up enough space on the original spindles to restripe all volumes across all 300 spindles with RAID 5 2+1.

Really nice for me to be able to do this. I remember having talks with HDS a bit over a year ago; they claimed to be able to do the same thing, with the fine print of requiring blank spindles to migrate to. Since of course the RAID arrays are virtualized, in 3PAR's case all you need is space to migrate the data to, rather than needing any new blank disks.

An even more interesting technology IMO than DO though is the 3PAR system tuner, which they so rarely (if ever) talk about, I guess because not many use it; their systems balance themselves pretty well. But what system tuner does is analyze the performance characteristics of the individual 256MB chunklets, and it can surgically re-stripe hot data at the sub volume level to other less busy disks on the fly. If I recall right you just give it response time thresholds for the chunklets and thresholds for the spindles and it will try to find those hot chunklets and move them to the spindles that meet your requirements while still maintaining availability characteristics (such as being able to survive a shelf failure if you happened to keep it in the default configuration).

It's not an automated process (last I checked); it's a batch job you kick off on the array itself, and it usually runs for a few minutes looking at the data. The amount of performance data from the chunklets is just incredible. If you think that your data is divided up into 256MB chunks, times the amount of written data on the array, I can't even imagine the number of unique things the system has to track in order to perform this process. Some systems could literally have millions of points of data to track and compare.
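The scale described here is easy to estimate. The sketch below is back-of-the-envelope arithmetic plus a toy threshold filter; it is not how System Tuner is implemented, the function names are hypothetical, and treating "256MB" as 256 MiB is an assumption.

```python
CHUNKLET_MIB = 256  # chunklet size from the comment, read as MiB


def chunklet_count(written_tib):
    """Number of whole chunklets needed to hold the written data."""
    return int(written_tib * 1024 * 1024 // CHUNKLET_MIB)


def hot_chunklets(service_ms, threshold_ms):
    """IDs of chunklets whose response time exceeds the threshold."""
    return sorted(cid for cid, ms in service_ms.items() if ms > threshold_ms)


if __name__ == "__main__":
    print(chunklet_count(100))  # chunklets behind 100 TiB of written data
    sample = {"c1": 12.0, "c2": 85.0, "c3": 40.0}  # made-up response times
    print(hot_chunklets(sample, threshold_ms=50.0))
```

With 100 TiB written this gives 409,600 chunklets, so a handful of counters per chunklet is already in the millions of data points.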

I'm sure 3PAR can(maybe has) leverage the system tuner technology and adapt it for automagic tiering, not exactly Compellent's "block" level(not sure how big their blocks are), but significantly better than volume level. As it stands today I believe system tuner will only place moved data on similar spindles(SATA, 10k, 15k etc).

Alex McDonald


"300 spindles with RAID 5 2+1".

My word, that's seriously inefficient. Plus protection is considerably less than RAID-6 by a couple of orders of magnitude; performance won't improve either given the amount of parity IO you'll have to do.

Why not RAID-6 at something like 14+2? More space, more protection and the performance can't be much worse than RAID5 2+1, surely?
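The space side of this argument is simple arithmetic. A minimal sketch, assuming ideal layouts with no spares or metadata overhead and ignoring that 300 spindles do not divide evenly into 16-disk groups; `usable_fraction` is a hypothetical helper.

```python
def usable_fraction(data_disks, parity_disks):
    """Share of raw capacity left for data in a data+parity layout."""
    return data_disks / (data_disks + parity_disks)


if __name__ == "__main__":
    spindles = 300
    for name, data, parity in [("RAID 5 2+1", 2, 1), ("RAID 6 14+2", 14, 2)]:
        frac = usable_fraction(data, parity)
        print(f"{name}: {frac:.1%} usable, "
              f"about {spindles * frac:.1f} spindles' worth of data")
```

RAID 5 at 2+1 leaves two thirds of the raw capacity for data; RAID 6 at 14+2 leaves 87.5%. The sketch says nothing about the performance or protection side of the trade-off.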

Alex McDonald


One more (sorry Marc, I prematurely posted). Aggregates in Ontap8 are up to 100TB currently; and we'll be moving that number further out over time.

the storage anarchist

FWIW, V-Max can also dynamically rebalance data layout across the back-ends and across drives, for both thick and thin volumes.

On thick, this is done by both/either FAST v1 or Symm Optimizer (different but overlapping use cases - FAST goes across drives, Optimizer does not).

On thin volumes, the rebalance feature is included (for free) as part of Virtual Provisioning (which is also free). Drives can be added to a logical pool (1-N on a system, admin's choice), and the rebalancing occurs at a rate faster than either HDS or 3PAR, and with far less impact on production applications (rebalancing does not need to be scheduled for off-hours).

And yes - all the sub-LUN FAST stuff is a future, on track for its announced availability in the 2nd half of 2010.

John F.

Hi Marc,

When I was a kid, and didn't want to eat {fill in the blank - most despised vegetable name}, my mother would say "eat all of your {fill in the blank}, there are millions of people starving in {name your third world poverty stricken region}". It sounds a lot like the topic of this discussion:

1. Problem: Starvation in poverty stricken third world area.
2. Solution: Eat despised vegetable.

The hole in the logic was of course that I was not the one starving. A better solution in my mind would have been to ship the despised vegetable to the poverty stricken region.

It's a lot like the current conundrum:

1. Problem: Write hot spots
2. Solution: Automated data movement

The flaw in the logic is that NetApp doesn’t have the write hot spots to begin with. What do I mean by that?

In the Traditional Legacy Array, if you change a block of data you have to overwrite. If you have just a few small regions of data that you overwrite a lot, there’s a benefit to moving those regions of data somewhere else that isn’t as busy or is capable of higher performance. On NetApp, the writes are coalesced and written as full stripes to free segments. NetApp doesn’t overwrite. All the writes end up leveled across all the disks in the aggregate without any need to move stuff around after the fact to adjust for “hot spots” like you have on the TLA.
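The hot-spot argument can be illustrated with a small simulation. This is a toy model under stated assumptions: in the in-place case each logical block has a fixed home disk (block number modulo the disk count), in the write-anywhere case every write simply goes to the next location in rotation, and the workload is made up. Neither function represents a real array's placement policy.

```python
def in_place_writes(workload, disks):
    """Writes per disk when every update overwrites the block's home disk."""
    counts = [0] * disks
    for block in workload:
        counts[block % disks] += 1
    return counts


def write_anywhere(workload, disks):
    """Writes per disk when each write goes to the next free location."""
    counts = [0] * disks
    for position, _block in enumerate(workload):
        counts[position % disks] += 1
    return counts


if __name__ == "__main__":
    # 900 updates to one hot block, then one write to each of 100 blocks.
    workload = [0] * 900 + list(range(100))
    print("in place:      ", in_place_writes(workload, 4))
    print("write anywhere:", write_anywhere(workload, 4))
```

In this model the in-place layout puts 925 of the 1,000 writes on one disk, while the write-anywhere layout puts 250 on each, regardless of which logical blocks were hot. Reads are not modelled, which is where the comment says locality pressure remains.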

For both the TLA and NetApp, if you keep reading the same small region of disk then the disk that region resides on can get busy. For the TLA if you’ve already invested in technology to move stuff around to address the “hot spot” issue for writes, then I suppose you could leverage it. Cache works well too, as evidenced by the large cache sizes in Traditional Legacy Arrays. For NetApp, the obvious answer is read cache since pressure based on locality is limited to reads. Instead of applying a solution to a problem that NetApp doesn’t have, NetApp chose to invest in increasing the efficiency and effectiveness of the solution that does apply; read cache.


marc farley

Mike Richardson - look what happens when you introduce shilling into a post about technology implementations - it turns the discussion into a virtual progression of billboards! Shame on you (and yes, pot@kettle.blackness applies)

It's so great to see everybody talking about futures here! And Anarchist, did you have to bring HDS into the discussion? Do we really need to hear about their futures too?

nate, System Tuner is an amazing tool, and you are correct that not many of our customers use it. It was developed for a large customer several years ago as an insurance policy to deal with the unlikely event of a water landing and it is rarely purchased and even less frequently used. FWIW, I wouldn't exactly say that we are leveraging System Tuner in the development of our tiering technology because we have a core competency in block virtualization/redirection from implementing thin provisioning and snapshots the way we do.

John F, Write hot spots in TLAs are solved by cache. I hadn't thought about this before, but I assume overwrites (high frequency updates to the same block) in Netapp systems take up multiple locations in NVRAM? Can you clarify how this is handled?

I'm not sure who you were referring to when you mentioned Traditional Legacy Arrays, but I want to make sure readers understand TLA is not the same as block storage. 3PAR makes non-traditional block arrays that spread data over lots of spindles in tiny increments - as nate discussed above. Wide distribution of data across hundreds of disk drives flattens hot spots too, but without creating "write holes" and without the need for ongoing garbage collection.

Of course, the advantage of garbage collection in Filers is returning unused capacity to the free pool. Comparatively, the relative weakness in block storage is wasted capacity in blocks that are occupied with data that the file system deleted. That's why we have been delivering tools to deal with that, such as Thin Reclamation and Thin Persistence. (OK, that was the pot.kettle.black part of this). I just wanted to reinforce that 3PAR was NOT a traditional legacy array.

John F.

Hi Marc,

1. My understanding is that 3Par effectively utilizes wide striping, however is still constrained by overwrites. In that sense, it’s not a TLA. The specific behavior of the TLA I was discussing was overwrites. Cache does help to a degree. Because of the overwrite constraint, there’s much more head positioning. Cache does allow holding and then reordering the writes, but in the end you still have the excess head movement and it can require increasing amounts of cache proportional to the size of the volume.

2. NVRAM stores a log of operations, not the actual blocks. I think you’re not quite clear in this area yet, and that is a source of confusion.

3. NetApp has been doing space reclamation for some time now. The interface is integrated into SnapDrive. It coordinates with the storage to reclaim non-zero space that the host OS considers deleted. I’m not sure how this subject crept into the discussion, but just thought you’d like to know.

Thanks for your quick reply and don’t forget to eat your Lima Beans.



nate

I'm running RAID 5+1 today, which is faster than anything RAID DP, since I'm not depending on multiple parity disks, and I was topping out at 120 IOPS/disk for SATA drives. Our load has gone up quite a bit since we originally put the system in; spindles were running at ~80ms write and ~50ms read service times, that's how hard the system was being slammed 24/7. When the IOPS got to 120/disk, service times on the disks jumped to the ~200ms range.

A blog entry on this a few months ago:

The decision to go RAID 2+1 was mostly based on performance, though also on data distribution: being able to ensure that we can survive a full shelf failure (in 3PAR's terms, "cage level availability"). Not that we expect such a failure, but I like the option being there. Combined with mismatched drive sizes (750G & 1024G), evenly distributing the data to maximize availability becomes harder with RAID arrays with larger data+parity ratios. The array handles it automatically of course, but you're more likely to have uneven data distribution once you exhaust the space of the smaller drives.

I'll measure performance and adjust again if needed, I collect a few thousand data points a minute from our system(using a system I wrote, I don't like the 3PAR system reporter). But my main goal was to cut spindle response time in half (despite having high spindle response times the front end response times were OK most of the time, though out of my own comfort zone.)

I was *this* close to just going RAID 1+0, but RAID 2+1 seems like a good compromise: for a 2TB volume it appears to consume 3.1TB of raw space, vs 2.5TB on RAID 5+1 and ~4TB on RAID 1+0.
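Those raw-space figures can be checked against the ideal data-to-parity ratios. A minimal sketch: `raw_tb` is a hypothetical helper, RAID 1+0 is modelled as one data copy plus one mirror copy, and no array overhead is included.

```python
def raw_tb(usable_tb, data, redundancy):
    """Ideal raw capacity consumed by a volume in a data+redundancy layout."""
    return usable_tb * (data + redundancy) / data


if __name__ == "__main__":
    layouts = {
        "RAID 5 2+1": (2, 1),
        "RAID 5 5+1": (5, 1),
        "RAID 1+0": (1, 1),  # one mirror copy per data copy
    }
    for name, (data, redundancy) in layouts.items():
        print(f"{name}: {raw_tb(2.0, data, redundancy):.1f} TB raw for 2 TB")
```

The ideal numbers come out at 3.0, 2.4 and 4.0 TB, so the observed 3.1 and 2.5 TB sit about 0.1 TB above the pure ratio; the sketch does not attempt to explain that difference.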

I believe with RAID 5 2+1 I'll be within 2 or 3% of RAID 1+0 performance with the fast RAID that 3PAR has.

But that really is one of the big things I like about the system, if I screw up the data distribution I can just change it later. I screwed it up the first time around making an assumption our NAS cluster would be thin provisioning friendly, and it was not. I had to convert from RAID 5 3+1 to 5+1 to keep us from having to buy more disks to support it, but I managed to convert the volumes in time. Have been running at ~88% space utilization on the spindles for the past 6 months, which for a rapidly growing organization can be a scary situation to manage, but we got through it and installed the new capacity recently. The big caches on the controllers allowed us to last long enough I/O wise as well.

My previous 3PAR array at another company was an E200, and I can tell you from personal experience when the spindles on that box got to 50% busy response times spiked and alerts were triggered. Here we have been operating 24/7 at north of 60% busy and the system runs fine, I have seen it hit 90%+ and even 100% busy(holding my breath when that happens, I can see in real time the DB queries stall at that point).

Same goes for response times, on the E200 with 10k RPM drives the limit seemed to be about 125-130ms before alerts got triggered, here(T400) it's about 200-220ms. The controllers can certainly take a lot more smack than the older box. Which is kind of expected given it has 4-8 times the IOPS capacity of an E200 (depending on whether you have 2 or 4 controllers).

RAID 6 is an option now on 3PAR arrays with the most recent code(which I have), while it's fast, it's no RAID 5 2+1. If/when we become space constrained again I'll certainly be re-evaluating our performance/space setup, and can of course change it whenever it needs to be changed. Right now we have a bunch of "excess" space(because I needed the extra I/O), so I'm taking advantage of it by going to RAID 5 2+1.

The system as a whole has far exceeded my performance expectations to date, which were mostly based on my experience on the E200. Back when I worked at a company that used EMC/HDS/NetApp boxes, we didn't get good performance metrics from them for whatever reason (I wasn't responsible for those systems).


And - as for restriping of data and overhead associated with it, with a VERY busy array last year I restriped ~100TB of data online with no noticeable impact(reading+writing back to the same disks and changing RAID level in the process). It took about 5 months(pretty much 24/7) to do given the array uses only idle cycles to move data around. All SATA disks. But in the end it worked, didn't have to buy extra disks.

With my recent upgrade there have been more idle resources(2->4 controllers, 200->296 disks) so I've managed to so far move about 46TB of data in the past 6 days(24/7), again no noticeable impact, still SATA disks. I'm astonished at the rate of data movement this time around myself, last year towards the end I was averaging maybe 1TB/week, now I'm doing multiple TB a day.

3PAR is by no means perfect, I do really like their products(there's only two companies that I myself am such an advocate of and 3PAR is one of them, the other is a networking company - I do more than storage!) though like anyone they do have their faults.

marc farley

nate, thanks for not listing our faults and I appreciate the time you spent discussing your implementation. When Alex asked about your choice of RAID I figured you might explain with your usual incredible attention to detail. :)

marc farley

John F., Lima beans :) I used to put unwanted vegetables in my pockets where my mother would find them doing laundry - at least that is her side of the story. My personal cache, so to speak.

What do you mean when you say, "overwrite constraint"? If it's disk head movements, we obviously have a lot of that going on in 3PAR arrays because we are writing across so many disks simultaneously. As I said previously, writing across so many disks is how we avoid hot spots. But in general, write I/O performance stays high due to write caching.

BTW, thanks for clarifying that Filers are tuned to reduce disk drive head movements. It's obviously an important element to support performance in Netapp's architecture. Of course, this sort of disk drive oriented tuning wouldn't buy WAFL any advantages with flash SSDs - which only partially explains why Netapp doesn't see any advantages from tiering with them.

Mike Riley commented that different disk drives with the same rotational speed can be mixed in an aggregate - such as 10k SAS and FC drives. The next question related to that is if disk drives with different capacities can be mixed in an aggregate? More precisely, can drives with different capacities contribute all their capacity to the aggregate - as opposed to being short stroked to match the capacity of smaller drives.

All that said, the main issue for Netapp and tiering appears to be WAFL's write placement method. As data updates are always being written to new block locations, how would the system determine if that write should be placed on an SSD? Updates to blocks necessarily obsolete the old block locations in a Filer (except when accessed in a snapshot). A write to an SSD in a Filer only makes sense if it is going to be read many times again without being written to. However, there is no way of knowing if a block will be updated the next millisecond or next week.

Tiering to SSDs is a feature that works particularly well with storage that has "overwrite constraints" (if I am interpreting this term correctly). Updates to data in SSDs overwrite the old data and can be read and written as quickly as necessary from a single block location. The block does not become obsolete.

In summary, tiering is a storage application that doesn't fit Netapp's architecture very well. That's fine - Netapp does things differently. But it certainly doesn't mean that tiering is a dead technology - as Tom Georgens declared it to be. It simply means that Netapp isn't likely to adopt it any time soon.

Alex McDonald


Stop it with the FREE stuff! You mean "at no additional charge".

Alex McDonald


"I'm running RAID 5+1 today, which is faster than anything RAID DP, since I'm not depending on multiple parity disks".

RAID-DP isn't RAID-6, so parity disk count isn't an issue; RAID-4 (single parity) to RAID-DP overhead is around 2-3%. We can drive SATA up towards the 300 IOPS mark on sequential writes, and with a PAM-II card for read cache you could make those 300 SATA drives run like double or more the number of FC drives. You're wasting money on seeks, and lots of space on RAID5 at 2+1.

I'm not trying to be super-critical here, but you really ought to investigate other solutions. The one you have works, and works well -- but what a price!


I may have misled Marc about what ends up in NVRAM, to simplify the argument. Sorry.

John F.

Hi Marc,

Glad you found a place to dispose of the most despised vegetables. To the point, there's nothing in the NetApp architecture that would make tiering difficult or impossible. There's also nothing in the NetApp architecture that compels the use of tiering; the customer needs can be met today with simpler and more cost effective solutions that are a closer match to the problems we attempt to solve. That doesn't mean tiers are inherently bad. That doesn't mean that you won't see tiers on NetApp later if needed. It's actually neutral on tiering.

You do know that, internally, SSDs have an overwrite constraint. The controllers within the SSD go to extraordinary lengths to ensure a cell is not overwritten too frequently. It's called wear leveling. The SSD controller transforms a workload that frequently overwrites specific locations to one that spreads the writes evenly across all the cells in the device. They do this to extend the lifetime of the device.
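The idea of wear leveling can be sketched in a few lines of Python. This is a toy model only, not how any real SSD controller works: the `ToyWearLeveler` class, its "pick the least-written free cell" policy, and the four-cell device are all assumptions made for illustration.

```python
class ToyWearLeveler:
    """Toy model: redirect each logical write to the least-worn free cell."""

    def __init__(self, num_cells):
        self.write_counts = [0] * num_cells  # writes absorbed by each cell
        self.data = [None] * num_cells       # contents of each cell
        self.mapping = {}                    # logical block -> physical cell

    def write(self, logical_block, value):
        live_cells = set(self.mapping.values())
        # Cells holding live data (including this block's current copy)
        # cannot be reused until they are freed.
        candidates = [c for c in range(len(self.write_counts))
                      if c not in live_cells]
        if not candidates:
            raise RuntimeError("no free cells available")
        # Choose the free cell that has been written the fewest times.
        target = min(candidates, key=lambda c: self.write_counts[c])
        old = self.mapping.get(logical_block)
        self.write_counts[target] += 1
        self.data[target] = value
        self.mapping[logical_block] = target
        if old is not None:
            self.data[old] = None  # the old copy is now free for reuse
        return target


ssd = ToyWearLeveler(num_cells=4)

# The host overwrites the same logical block eight times in a row.
for i in range(8):
    ssd.write(0, f"version-{i}")

print(ssd.write_counts)  # writes are spread across all four cells
```

Even though the host keeps hitting one logical location, no physical cell in this model absorbs more than its share of the writes.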

Interesting, isn't it?


marc farley

Oh, you meant wear leveling !! Why didn't you say so in the first place??


Our workload is random rather than sequential, dozens of systems hitting the array for different apps and workloads simultaneously, a lot of NFS, some CIFS, a bunch of fiber channel, some iSCSI.

Just touched 30k disk IOPS for the first time a few minutes ago, those new disks certainly getting a workout, 130k of IOPS capacity left in the controllers.

As for costs, we're doing significantly more I/O and stuff with 200 disks (now almost 300) than we were with over 500 disks (half of which were 10k RPM) on the previous system so it was a pretty good savings :) Older array has been turned off and sitting in the corner (4 racks) for the past 7 months, so far can't get anyone to take it even for free.


@Alex: Of course when you turn random writes into sequential you can get better IOPs. SAS drives inside a low end Xeon server can get 1500 IOPs sequential (begs the question why you can only get 300).

How many random read IOPs can your TB-class SATA disks get? Answer: the same as everyone else, no more, YMMV. Since most applications in enterprise consolidation environments are predominated by reads, it's vitally important to have good random read performance as well as random write performance.

@everyone: You are all wrong about where the solid state storage should go. But I'll leave you to figure out where that should be on your own... ;)

Alex McDonald

1500 IOPS sequential? From a single SAS drive! Really? Where can I buy one?

I think you're measuring IOPS across 5 or more drives for that figure. I'm talking about a *single* SATA drive.


I'm talking about physical disk IOPS as well not cache IOPS or anything else.

But in any case Xiotech claims to somehow violate the SCSI protocol in conjunction with their tight integration with Seagate disks and claim to get 25% more IOPS per disk than anyone else because they violate the spec. Apparently it's a supportable thing by Seagate, which is partially why they only support Seagate disks. I've never used their stuff, but the concept did sound interesting at least.

I've talked to a couple other 3PAR customers that are using V-series with 3PAR gear and they claim significant performance gains over their previously native NetApp stuff.

I may be in the market for some new NAS gear soon, though so far NetApp has been their usual stubborn selves and are refusing to be flexible with evaluation gear. NetApp drove me into the arms of 3PAR at my last company (my first 3PAR deployment) because they refused to do an eval, and they may yet again drive me to the competition.

One mistake, not letting us eval a small array, has cost them the equivalent of I'd say at least 6 or 7 array purchases from me and from companies where I have influence in this area (over the past couple of years). Though in the end maybe a good thing since I found a great system in 3PAR.

I'm sure small change for them but just goes to show how not being friendly with the customer can cascade against you in other accounts, small world indeed.



1500 IOPS at 4KB per I/O is only 6MB/s. Any SAS drive, or any SATA drive for that matter, can do that sequentially if it's the only thing the drive is doing. Sequential access has minimal access/seek times; it's literally spooling off the disk. I have a 2.5" laptop drive connected to a Mac mini via an old FireWire 400 cable that I can do 365 IOPS on, if it's sequential.
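The arithmetic behind that 6MB/s figure is just IOPS multiplied by the size of each I/O. A minimal Python sketch, assuming 4KB per I/O and decimal units (1MB = 1000KB); the function name `throughput_mb_per_s` is made up for the example:

```python
def throughput_mb_per_s(iops, io_size_kb, kb_per_mb=1000):
    """Return throughput in MB/s for a given IOPS rate and I/O size in KB."""
    return iops * io_size_kb / kb_per_mb


# 1500 sequential IOPS at an assumed 4KB per I/O.
sequential_1500 = throughput_mb_per_s(1500, 4)
print(sequential_1500)  # 6.0
```

Seen this way, a high sequential IOPS number at a small I/O size is a modest amount of bandwidth for a single drive.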

DAS systems regularly push over 1000 IOPs per spindle (!!! sequential or semi-sequential !!!), because there is no sharing of the drive amongst applications or servers. External RAID arrays (specifically SAN gear like 3PAR, NetApp and others) often share spindles between applications and hosts, meaning that all IO essentially becomes randomized. Unless, of course, like the Symmetrix and DMX and CLARiiON of days gone past, you segregate everything manually (i.e. painfully) in order to ensure that host sequential IO translates to spindle sequential IO.

Now, if you were averaging 300 IOPs per spindle for random access, I'd be impressed, or more likely, dubious, if it represented a highly consolidated (several hosts) workload. Customers that try to use SATA for performance are paying too much per IOP.

The comments to this entry are closed.
