Most recently, there are those who have been going for more of a "shock" factor in their proclamations of drive and RAID failures. The vast majority of the time, though, this is totally transparent to the user/operator: you won't see UREs in your log because the drive recovered the data and re-wrote it to spare sectors. Very cool feature.
Writes across a pool of vdevs are distributed, not striped, and that's an important distinction. The URE could happen on the very first read operation, or you might read 100 TB without a URE. Face it, even if it is to another drive array, backup is your best option.
That is great for drive life, but it doesn't help you read the data that was lost during that URE. Data integrity is taken really seriously in ZFS/OpenZFS, and every issue of this sort I've observed was fixed very promptly. So I suspect that overall consumer drives are way more reliable than their spec and that the risks of UREs are not as high as people may think.
When it happens, the entire array is failed. Also, 12.5 TB is not a hard limit, just an average; the error could be a sector gone bad. Consider a four-drive RAID 5 (SHR) array composed of 2 TB drives. So how, in any reasonably logical way, could Robin Harris get the idea that because a hard drive manufacturer publishes a URE rate of 10^14 bits, having 5 drives of 3 TB guarantees a failed rebuild?
It's all about frequency. If you have 2 TB drives, then you will have a 12 TB array. Once two drives have failed, assuming he is using enterprise drives, there is roughly a 33% chance the entire array fails if he tries to rebuild. I am an Oracle employee; my opinions do not necessarily reflect those of Oracle or its affiliates.
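Numbers like these can be sanity-checked with the textbook model: treat every bit read as an independent trial at the manufacturer's quoted bit error rate. This is only a sketch (real UREs cluster, and field rates differ from spec, as noted elsewhere in this thread); the function name and the 12 TB figure below are illustrative, not from the thread.

```python
import math

def p_rebuild_ure(tb_read, ber=1e-14):
    """Probability of at least one URE while reading `tb_read` decimal
    terabytes, assuming independent bit errors at rate `ber`
    (1e-14 is the common consumer-drive spec, 1e-15 enterprise)."""
    bits = tb_read * 1e12 * 8               # decimal TB -> bits
    # 1 - (1 - ber)**bits, computed stably via log1p/expm1
    return -math.expm1(bits * math.log1p(-ber))

print(f"{p_rebuild_ure(12):.1%}")           # ~61.7% at consumer spec
print(f"{p_rebuild_ure(12, 1e-15):.1%}")    # ~9.2% at enterprise spec
```

Note how sensitive the result is to the quoted rate: one order of magnitude in the spec is the difference between "rebuilds usually hit a URE" and "rebuilds usually don't."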
My own array is HGST entirely. The checksumming is not how ZFS saves you, BTW; it's the fact that ZFS does file-level RAID and not block-level RAID, right? It is time now to look at the maths. Another non-issue, not because it wasn't a huge problem, but because engineers like me worked for years to prevent it from being an issue and spent the cost BEFORE the disaster.
I am perplexed, then, at what the issue really is. The Synology RAID solution is based on the Linux md software RAID solution. I think that part of the issue is that Linux software RAID (and I would think that pretty much all block-level RAID solutions work the same way) has no concept of files. So, instead of a replication factor of 1.2 in a RAID-5 flash array, for the same money I can have a replication factor of 12.2 in a hard disk array. And 10^14 bits approaches one URE at around 12 TB of data read, as the article says.
Manufacturers post specs for their drives. A single disk has ZERO redundancy, and so it can't simply calculate the missing bit of data from a URE. Unless one is a hard drive manufacturer, an OEM licensee, or a reasonably talented hacker with the right equipment and software to access the drive firmware, claims about actual in-the-field URE rates are speculation at best (as are attributions to solar flares or other sources of high-energy particles).
Just as it is possible to roll a six-sided die 10 times and never have a six come up on top, it is a problem of probability, not certainty. I located two documents about the TLER feature that might be of interest to you; the first seems to do a better job of showing how it works. SMR, in fact, should be a huge help in this direction, because it will force read/write cycles behind the scenes that refresh stale data. What is true for traditional disk, however, is not necessarily true for flash.
Simple googling will find exceptions to each of these broad categories, but sadly not many. That is why your disk does not fail and seems to keep on working day after day. So then you get to your RAID scenario. I very frequently see pools with a disk that has 1-10 checksum errors.
Backups. If you have a RAID 5 array using 8 drives, you will often see it set up with 7 data-plus-parity drives and one hot spare. But that has always been true. Last night I was having some difficulty remembering the specific term that Western Digital uses to describe the time-limiting functionality.
Each vdev's maximum IOPS, shared between reads and writes, is effectively that of a single disk, though throughput is unencumbered. It's not a URE per se, but it looks like a bad part of the disk to the firmware, and gets flagged that way. "Thus, an URE during a RAID 5 rebuild typically leads to a complete rebuild failure." (http://en.wikipedia.org/wiki/Hard_disk_ ...) Don't ask me how I know ;) Replacement of drives before resilvering is complete.
If you had 12 TB from 4x3 TB disks, it would have a lot to go through. And 10^14 is bits, not bytes. https://github.com/zfsonlinux/zfs/commit/bb3250d07ec818587333d7c26116314b3dc8a684 From what I understand, Illumos and BSD have this same issue until they pull in this patch, which was only committed on June 22, 2015. Now, it's a statistic; that means you are dealing with a bell curve, with some disks failing well before that point and some disks failing well after it. The premise is that when doing a RAID rebuild, the process will stop on the occurrence of one of these read errors, which WILL happen at some point in time.
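The bits-versus-bytes point is worth making concrete, since it is exactly where the often-quoted 12.5 TB figure comes from:

```python
# A spec of 1 URE per 10^14 bits, converted to bytes read
# (decimal units, as drive vendors quote capacity).
ure_bits = 10 ** 14
ure_bytes = ure_bits / 8        # 1.25e13 bytes
print(ure_bytes / 1e12)         # → 12.5 (terabytes read per expected URE)
```

Anyone quoting 10^14 *bytes* is overstating the spec'd reliability by a factor of eight.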
EDIT: I think I must clarify that I'm mainly interested in a risk perspective for the 'average home user' who wants to build their own NAS solution, not a mission-critical corporate environment. I do admit that before ZFS it was very difficult to know about silent data corruption, and that may skew my own view of the actual risks.
Bad blocks: sectors or areas on the hard drive that the drive itself and the OS are not aware of. So indeed the 10^14 number is a worst-case scenario, and drives are way more reliable. Trevor is ignoring the economics. Let's look at the maths of rebuild times and how they are different when using flash.
If an unrecoverable read error (URE) is encountered in this process, one or more data blocks will be lost. If the OS, rather than simply flagging that file on the drive as corrupt, instead flags the whole drive, it comes across as a rather short-sighted screwup. The very rare occasion where a disk may return corrupted data is unrelated to the RAID-5 discussion; that's more about ZFS vs. 'legacy' hw/sw RAID in general. So it seems to me that my claim still stands: the ZDNet article is way overblown.
Suddenly those folk making all-flash arrays look a lot less crazy. The short answer to our problems is to look to SSDs. When a disk in a RAID-5 array fails and is replaced, all the data on the other drives in the array must be read to reconstruct the data from the failed drive. How long does it take for the ZFS changes to move downstream to, say, Solaris and OpenIndiana-type variants? It is simply gone.
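As a rough back-of-the-envelope sketch of why rebuild times differ between disk and flash (the capacities and sustained transfer rates below are my illustrative assumptions, not figures from the thread), the best-case rebuild time is simply capacity divided by sequential throughput:

```python
def rebuild_hours(drive_tb, mb_per_s):
    """Best-case rebuild time: the replaced drive's full capacity must
    be reconstructed and written at the sustained sequential rate."""
    seconds = (drive_tb * 1e12) / (mb_per_s * 1e6)
    return seconds / 3600

print(f"{rebuild_hours(3, 150):.1f} h")   # 3 TB HDD at ~150 MB/s -> ~5.6 h
print(f"{rebuild_hours(3, 500):.1f} h")   # 3 TB SATA SSD at ~500 MB/s -> ~1.7 h
```

Real rebuilds are slower still, since the array usually keeps serving I/O while rebuilding; the point is that the exposure window (and thus the odds of a second failure or a URE mid-rebuild) shrinks roughly in proportion to media speed.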
TL;DR: Keep your RAIDz/RAIDz2 stripe widths as narrow as practical and stripe multiple vdevs for maximum performance with minimum pain.
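As a config sketch of that layout advice (pool and device names here are placeholders, not from the thread): two narrow raidz2 vdevs striped into one pool, rather than one wide eight-disk vdev, looks like this:

```shell
# Two 4-disk raidz2 vdevs striped into one pool. Each vdev delivers
# roughly the IOPS of a single disk, so two vdevs ~doubles pool IOPS
# versus a single wide 8-disk raidz2, at the cost of more parity space.
zpool create tank \
    raidz2 sdb sdc sdd sde \
    raidz2 sdf sdg sdh sdi
zpool status tank
```

Narrow vdevs also mean each resilver touches fewer disks' worth of data, which ties back to the URE-during-rebuild exposure discussed above.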