There's a lot to be said for thinning storage, and de-dupe has some pretty impressive advantages for thinning backups. I'm less sure about de-dupe on primary storage, given its potential for smacking performance down into the unacceptable range.
Then there is the question of what the benefits actually are. Does primary de-dupe help a customer reduce the amount of capacity they need to buy initially, or does it simply remove clutter from storage they already bought? If it's the de-clutter case, I can see how it could reduce the need to buy more storage later, but I would think most people want to start getting their ROI right away, like you do with TP (thin provisioning). The cost-benefit analysis of primary dedupe is probably not all that simple and probably requires a complete data analysis.
It seems like primary dedupe is promising stuff with a ways to go yet. Maybe Ocarina Networks will make a breakthrough here - who knows?
The situation for dedupe with backup seems like it would be simpler, but when you look closely, it's not as simple as it appears. Scalability and performance are both very important to customers, but it's difficult to tell what your reality would be as a prospective customer. The vendors that specialize in dedupe, such as Data Domain, probably make it easiest to understand.
Curtis Preston's blog on his Backup Central site has some very interesting stuff on dedupe technology. Recently he posted a dedupe performance comparison that was an industry first. As somebody who gets in EMC's drawers from time to time, he gets attention (if not love) from EMC social media folks - and this smacking around over dedupe technology qualifies as modern entertainment for me. (Yes, I like it much more than our industry's Twitter dribblings!)
The takeaway from reading Curtis' thorough analysis (you may need to read more than one post) and the various rebuttals is that there are a lot of important details involved with dedupe solutions that don't often see the light of day. Here is a quote from one of Curtis' comments in the comment thread:
I completely agree with you on the need for an independent test. It will be the subject of a later blog.
Yes, but seeing as how we have such a hard time getting independent testing done for primary storage, I think this could end up being an exercise in jousting at windmills, Curtis. If we can't find common ground on what plain old storage performance is, how are we ever going to know what dedupe performance is?
Before really understanding the technology behind dedupe and what it can and can't do, we tested some Data Domain gear in the hope that it could reduce the primary storage we needed for our recent storage purchase. We fed it uncompressed feeds of the sample data we wanted it to de-dupe, about 300GB in all. It didn't work out the way we were hoping: it turns out that just compressing the data gave us better results, so we ended up not going the de-dupe route (we were originally looking at Data Domain's high-end ~$100k SAN-attached boxes).
They confirmed fairly late in the eval that our data wasn't a good candidate for de-dupe.
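For anyone who wants a rough read on this before an eval gets that far, here is a minimal sketch of the comparison described above: estimate how much a sample would shrink from block-level dedupe versus plain compression. It assumes fixed-size 4 KB chunks, SHA-256 fingerprints, and zlib, all of which are illustrative choices and not how Data Domain or any other appliance actually works (commercial products generally use more sophisticated chunking and combine dedupe with compression). The function names are hypothetical.

```python
import hashlib
import zlib


def dedupe_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Logical size divided by the size of the unique fixed-size chunks.

    Assumption: naive fixed-size chunking; real products often differ.
    """
    if not data:
        return 1.0
    unique_chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Keep one entry per distinct fingerprint, remembering its size.
        unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(data) / sum(unique_chunks.values())


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Logical size divided by the zlib-compressed size."""
    if not data:
        return 1.0
    return len(data) / len(zlib.compress(data, level))


if __name__ == "__main__":
    # Synthetic stand-in for "compressible but not repetitive" data.
    sample = b"".join(str(i).encode() for i in range(100_000))
    print("dedupe ratio:     ", round(dedupe_ratio(sample), 2))
    print("compression ratio:", round(compression_ratio(sample), 2))
```

If the dedupe ratio on a representative sample stays close to 1 while the compression ratio is well above it, the data is probably in the same boat as ours. A small sample can understate dedupe on data with long-range repetition (such as repeated full backups), so treat this as a sanity check and not a sizing tool.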
Due to the somewhat static nature of the bulk of our data (store for X days and delete the oldest day each day; data is used the first day and only really gone back to for backup or testing purposes after that), we opted for an entirely SATA-based storage system, which gave us the raw storage to house all of this stuff. At the same time, because much of the data (currently 50%) isn't accessed much after the first day, we can live with much higher utilization rates on the SATA disks with everything else sharing them. We probably could not run full SATA if we had really high utilization rates for the majority of data on the system, at least not without a lot more spindles.
We thought about getting tier 3 or similar storage to put this data on, but the math in the end just didn't work out. If we did that, we wouldn't need as many spindles on the main array, which would force us to go to FC (for performance) instead of SATA, and that drove up the cost per TB by about 2x (based on list pricing for each solution).
Going with SATA from the start gave us enough spindles for a good baseline level of performance, and with linear scalability we get good benefits as we add more disks. We probably have more raw space than we really need, but that just means we can keep things like snapshots around for longer periods of time.
With an FC array I think we wouldn't be able to drive enough I/O to justify the price after a certain point. Of course you can mix and match SATA and FC, but then you have to balance what workload goes where, and coming from another storage array that had at least 4 different pools/tiers of storage, that wasn't an easy task. Determining the workloads of various tasks using shared resources (e.g. large NFS data stores) can be difficult and complicated as well, and splitting everything out has its own complications as far as managing space etc.
It wasn't an easy task to determine what the configuration of the new system should be; it took a lot of work! I'm happy with the results myself.
Looking forward to the next rev of Inform OS myself; I want that thin built in turned on!
Posted by: nate | March 12, 2009 at 09:37 AM
Hey, Mark! It's Curtis!
I still haven't gotten around to posting that post that I referred to above, but let me give you a hint. I am trying to form a company that would do such testing. Think Consumer Reports for IT.
We're not talking benchmarks, like the ones that the industry fights over. We're talking about tests in real-world environments with real-world data and people.
Anyway, I'm crawling back to my hole. BTW, I fixed the typo in the original comment. Can you fix it in your quote of said comment? It's "...how MUCH each of you are exaggerating..." That'll be great. (Office Space manager voice there.)
Posted by: W. Curtis Preston | April 23, 2009 at 10:05 PM