Tuesday, October 9, 2007

Data DeDuplication - Been There Done That!

I just got off a pretty good NetApp webcast covering their VTL and FAS solutions. One of the items they discussed was the data deduplication feature in their NAS product. When the IBM rep spoke up, they discussed TSM's progressive backup terminology, and I find it interesting to contrast TSM's process with deduplication, which is a growing segment of the disk-based storage market. The feature really helps save TONS of space with the competing backup tools, since they usually follow the FULL+INC model, which causes them to back up files even when they haven't changed. There, deduplication saves room by removing the duplicate unchanged files, but this shows how superior TSM is, in that it never stores those duplicates in the first place and so doesn't require this kind of wasted processing. What would be interesting is to see how much space is saved on redundant OS files, but that is still minor compared to the weekly full process that wastes so much space.
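To make the FULL+INC versus progressive comparison concrete, here is a rough back-of-the-envelope sketch. The function names and all of the numbers (1,000 GB of client data, a 2% daily change rate, 28 days, a weekly full) are assumptions picked for illustration, not measurements from any real environment, and the model ignores expiration, compression and file versioning.

```python
def full_plus_incremental(total_gb, daily_change_pct, days, full_every=7):
    """GB written when a full backup runs every `full_every` days
    and an incremental runs on every other day (hypothetical model)."""
    stored = 0.0
    for day in range(days):
        if day % full_every == 0:
            stored += total_gb  # full: every file is copied again
        else:
            stored += total_gb * daily_change_pct / 100  # changed files only
    return stored


def progressive_incremental(total_gb, daily_change_pct, days):
    """GB written when only the first backup is a full and every
    later backup copies changed files only (hypothetical model)."""
    first_full = total_gb
    later_incrementals = total_gb * daily_change_pct / 100 * (days - 1)
    return first_full + later_incrementals


# Assumed example: 1,000 GB of data, 2% changes per day, 28 days of backups.
full_inc_gb = full_plus_incremental(1000, 2, 28)
progressive_gb = progressive_incremental(1000, 2, 28)
print(f"FULL+INC:    {full_inc_gb:.0f} GB")
print(f"Progressive: {progressive_gb:.0f} GB")
```

Under these assumed numbers the weekly-full model writes about 4,480 GB against roughly 1,540 GB for the progressive model. Most of that gap is unchanged files copied again by each full, which is exactly the data a deduplication feature would later have to remove.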

This brings us to the next item, disk-based backup. This is definitely going to grow over time, but costs are going to have to come down for it to fully replace tape. The two issues I see with disk-only backups are DRM/portability and capacity/cost. If you cannot afford to have duplicate sites with the data mirrored, then you are left having to use a tape solution for offsite storage. Portability can also be an issue with disk. For example, we are migrating some servers from one data center to another, and we used the export/import feature. We have also moved TSM tapes from one site to another and rebuilt the TSM environment. Doing this with disk is a little more time consuming: you would need the same disk solution and the network capacity to mirror the data (time consuming on a slow connection), or you would have to move the whole hardware solution. Tape in this scenario is a lot easier to deal with. Now, when it comes to capacity vs. cost, there is a definite difference that will keep many on tape for years to come. Many customers want long-term retention of their data, say 30+ days for inactive files and TDP backups (sometimes longer with e-mail and SARBOX data). So what is the cost comparison for that type of disk retention (into the PB) compared to tape? Currently it's no contest and tape wins in the cost vs. capacity realm, but hopefully that can someday change. So if any of you have disk-based solutions or VTL solutions, chime in. I'd like to hear what you have to say and how it's worked for you.

2 comments:

  1. The c:\windows savings sound big in theory, but I think if you added up the backup occupancy of all your \\server\c$ filespaces and took that as a percentage of the total occupancy, it would be less impressive. My avg c:\windows directory is less than 4G. Across 200 servers that's 800Gig. If I could dedupe that at 20:1, then I'd go from storing 800Gig to 40G, a saving of 760Gig. But if the total onsite occupancy is 50TB, then your overall saving is less than 1/50th of the capacity. Anything is good I suppose?

    I think that most shops could achieve better savings just by putting more thought into retention policies or by deleting data from TSM that nobody cares about anymore.

    Chad is correct. DeDupe is going to be most beneficial in cases where the backup tool forces periodic full copies. For TSM, this could be (but is not necessarily) DB backups, NDMP backups and perhaps archives (especially for people with poor data mgmt skills who archive the same files over and over again - oh wait, that's what the NetBackup guys do every quarter). The DB backup one is interesting: what happens if you do a defrag/re-org of the DB? The actual data doesn't change, but the structure on the disk could change quite dramatically.

  2. Problem with deleting is, of course, that people actually have to care about keeping their data. These days there are more and more regulations that demand data to be kept for years on end, including email.

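The c:\windows estimate in the first comment is easy to check with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, the inputs are the commenter's own figures (200 servers, about 4 GB each, a 20:1 ratio, 50 TB onsite), and it assumes decimal units, so 50 TB is treated as 50,000 GB.

```python
def dedupe_savings(servers, gb_per_server, dedupe_ratio, total_occupancy_gb):
    """Return (raw_gb, stored_gb, saved_gb, saved_fraction_of_total)
    for a set of near-identical directories deduplicated at a given ratio."""
    raw_gb = servers * gb_per_server      # what is stored without dedupe
    stored_gb = raw_gb / dedupe_ratio     # what is left after dedupe
    saved_gb = raw_gb - stored_gb
    return raw_gb, stored_gb, saved_gb, saved_gb / total_occupancy_gb


# Commenter's figures; 50 TB is assumed to mean 50,000 GB.
raw, stored, saved, fraction = dedupe_savings(200, 4, 20, 50_000)
print(f"raw {raw} GB, stored {stored:.0f} GB, saved {saved:.0f} GB")
print(f"saving as a share of total occupancy: {fraction:.2%}")
```

With those figures the saving is 760 GB, or about 1.5% of the 50 TB total, which backs up the commenter's point that OS-file dedupe alone does not move the needle much.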