Also, to get more debugging info from the SCSI layer, is this the appropriate flag to set for this issue? The above logs demonstrate my problem perfectly. We are running a data integrity test tool which reports data miscompare errors when such errors happen.

Jan 29 22:17:15 s01 kernel: [772716.664331] end_request: I/O error, dev sdg, sector 4096
Jan 29 22:17:15 s01 kernel: [772716.672247] end_request: I/O error, dev sdg, sector 3388584944

Both of the above were
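When the logs are large, it can help to pull out just the end_request lines and group the failing sectors by device. The following is a minimal sketch, assuming the log text uses the same "end_request: I/O error, dev X, sector N" wording as the lines above; the function name failed_sectors_by_device is made up for this illustration.

```python
import re
from collections import defaultdict

# Matches fragments such as:
#   "end_request: I/O error, dev sdg, sector 4096"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors_by_device(text):
    """Return {device: [sector, ...]} for every end_request I/O error in text.

    finditer is used instead of line-by-line matching so that the function
    still works when several log entries have been joined onto one line.
    """
    sectors = defaultdict(list)
    for match in IO_ERROR_RE.finditer(text):
        sectors[match.group(1)].append(int(match.group(2)))
    return dict(sectors)


if __name__ == "__main__":
    sample = (
        "Jan 29 22:17:15 s01 kernel: [772716.664331] end_request: I/O error, dev sdg, sector 4096\n"
        "Jan 29 22:17:15 s01 kernel: [772716.672247] end_request: I/O error, dev sdg, sector 3388584944\n"
    )
    for device, failed in failed_sectors_by_device(sample).items():
        print(device, failed)
```

A summary like this makes it easier to see whether errors cluster on one disk or are spread across several, which is relevant when deciding between a bad disk and a shared cable, path, or controller.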
We checked our storage array for the specific block whenever a data compare error was reported, and the data on the array is correct. Why does the kernel report "Unhandled error code"?
---------------------
"Sep 16 15:51:05 per610-01 kernel: sd 9:0:0:25: [sdat] Unhandled error code"
---------------------
snippet of message log
===============================
Sep 16 15:51:05 per610-01 kernel:

We want to have a conference call to further discuss how to root cause this issue.

Anyhow, here are chunks of some events:
> Sep-17 ~18:59 test started and everything was good till… (more than a day)
> Sep-19 ~01:47 Failing path; one or more paths
May 13 16:52:24 server1 kernel: [ 988.581316] sd 0:0:1:0: [sda] Unhandled error code
May 13 16:52:24 server1 kernel: [ 988.581319] sd 0:0:1:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 13 16:52:24 server1 kernel:
May 4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.

A reboot fixes the problem, but it happens again. Was your swap space not on a RAID-1 device?
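The "Result:" line is the useful part of an "Unhandled error code" report, because it names the host byte and driver byte returned for the failed command. Below is a small sketch that tallies those values per device, assuming the log uses the "sd H:C:T:L: [sdX] Result: hostbyte=... driverbyte=..." layout shown above; summarize_results is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the H:C:T:L address, the device name, and both status bytes from
# lines like:
#   "sd 0:0:1:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK"
RESULT_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): \[(\w+)\]\s+Result: hostbyte=(\w+)\s+driverbyte=(\w+)"
)


def summarize_results(text):
    """Count occurrences of each (device, hctl, hostbyte, driverbyte) tuple."""
    counts = Counter()
    for hctl, device, hostbyte, driverbyte in RESULT_RE.findall(text):
        counts[(device, hctl, hostbyte, driverbyte)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "May 13 16:52:24 server1 kernel: [ 988.581319] sd 0:0:1:0: [sda] "
        "Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK\n"
    )
    for key, count in summarize_results(sample).items():
        print(key, count)
```

If the same host byte keeps appearing for every disk behind one controller, that points away from an individual disk and toward the path or the controller.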
Code:
[19510.024140] sd 9:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 3f 00 00 08 00
[19510.024146] end_request: I/O error, dev sdb, sector 63
[19510.024148] __ratelimit: 1 callbacks suppressed
[19510.024150]

I will try other disks first, so I can check if it's the same problem.
After doing a fair bit of research around the Marvell 88SX7042 online, I did find a number of references to similar issues with other Marvell chipsets, such as issues with

the logs too by downloading smartmontools.

At this time do you have more paths available?

I could still see the data, but nothing was working.
I had used their SiI-based PCI-e 1x controllers successfully before and found them good budget-friendly JBOD controllers, but being a budget no-name vendor card, I didn't hold high hopes for them.

You are probably right about the power/cables.

Yes. Even the read in count 57 failed.
Hardware can be time consuming and expensive. http://askubuntu.com/questions/597010/main-data-drive-giving-i-o-errors

If two of them are going bad, I'd suspect overheating, but it could be just plain bad luck.

MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.utf8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: root=LABEL=hostname-root ro
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: Home directory /home/mythtv not ours.

The same applies to RAID-like filesystem technologies such as ZFS.
e2fsck -c /dev/sdb

The man page for badblocks recommends running the check this way, through e2fsck, rather than invoking badblocks directly.

Jul 4 20:20:05 host-37 kernel: [413193.748847] XFS (sdf1): xfs_log_force: error 5 returned.

The bug report wasn't for the disk error but that Linux didn't handle the error!

"Either the initiator gave us bad WRITE data or the data got garbled somewhere along our stack after reaching the target port."

Running Panic on storage node
======================================
We also did
May 4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May 4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB:

I will have to go to the datacenter etc.

Return address = 0xffffffffa014d1c2
May 13 16:52:24 server1 kernel: [ 988.565814] xfs_force_shutdown(sda1,0x2) called from line 1043 of file /build/buildd-linux-2.6_2.6.32-31-amd64-vrfdM4/linux-2.6-2.6.32/debian/build/source_amd64_openvz/fs/xfs/xfs_log.c.

The Marvell 88SX7042 controller is advertised as being a 4-channel controller, so I do not expect issues on one channel to impact the activities on the other channels.
Snaptest data verification tool: Running long duration I/O
================================
When we run the snaptest I/O tool and are just doing I/O, we hit a data corruption error, sometimes within 2 hours and sometimes after longer.

One thing to do is check the cables.

The SiI controller is much larger thanks to the PCI-X to PCI-e converter chip on its PCB.
> Why the kernel reports as Unhandled Error code"

It is just a bad log message.

May 4 16:52:13 phobos kernel: md/raid:md2: Operation continuing on 6 devices.

When you tested with qla2xxx/lpfc, did you see SCSI error messages and dm-multipath path failures, or did the SCSI layer and driver internally retry in those cases?
Had to use power button to reboot.

bit_waitqueue+0x17/0xa8
Nov 11 02:14:56 proxmox01 kernel: [
I have virtually the same Supermicro hardware with the same issue.

lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg

I was at an absolute loss and grasping at straws as to why I was having so many issues with two different, brand new cards.

Code:
uname -a
Linux server1 2.6.32-5-openvz-amd64 #1 SMP Mon Mar 7 22:25:57 UTC 2011 x86_64 GNU/Linux

Code:
May 13 16:51:49 server1 kernel: [ 954.196601] ata1.01: exception Emask 0x0 SAct 0x0 SErr
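The symlink listing for sde and sdf earlier in this post (it appears to be from /sys/block) already shows which controller each disk hangs off, which matters when several disks fail together. Here is a rough sketch that extracts the SCSI host, the H:C:T:L address, and the last PCI function from such a link target; describe_block_link is a hypothetical helper, and the regular expressions only cover paths shaped like the two shown in the listing.

```python
import re

# PCI function address, e.g. "0000:03:00.0" (domain:bus:device.function).
PCI_RE = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
# Tail of the path, e.g. "/host4/target4:0:0/4:0:0:0/block/sde".
SCSI_RE = re.compile(r"/host(\d+)/target[\d:]+/(\d+:\d+:\d+:\d+)/block/(\w+)$")


def describe_block_link(target):
    """Summarize a sysfs block-device symlink target, or return None."""
    scsi = SCSI_RE.search(target)
    if scsi is None:
        return None
    pci_functions = PCI_RE.findall(target)
    return {
        "device": scsi.group(3),
        "host": int(scsi.group(1)),
        "hctl": scsi.group(2),
        # The last PCI address in the path is the function the host sits on.
        "pci_function": pci_functions[-1] if pci_functions else None,
    }


if __name__ == "__main__":
    link = (
        "../devices/pci0000:00/0000:00:03.0/0000:03:00.0"
        "/host4/target4:0:0/4:0:0:0/block/sde"
    )
    print(describe_block_link(link))
```

In the listing, sde and sdf sit on different SCSI hosts (host4 and host5) but share the PCI function 0000:03:00.0, so both are ports on the same controller.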
After rebuilding the array, a disk would fail and lead to a cascading failure where other disks would throw up "failed to IDENTIFY" errors until they eventually all failed one-by-one until

I anticipated just needing to arrange the replacement of one disk and being able to move on with my life.

Running without device mapper
===================================
We ran directly on a block device (e.g. /dev/sdb) with device mapper disabled; snaptest ran fine for 24 hours with no corruption reported.
What about Enterprise vs Consumer disks
I've looked into the differences between Enterprise and Consumer grade disks before.

When I checked /mnt/backup1, which isn't used for snapshot back-ups, I noticed it was also mounted as read-only.
Once it happens the machine has long pauses between certain operations (GUI updates like alt-tab, responding to command-line commands like lsb_release -a, and file accesses such as directory reads) as if

We have host logs which are huge; how do we provide them for your analysis?

3. Running on Fibre Channel:
================================
We removed the iSCSI connection, attached a Fibre Channel HBA to the storage, and ran snaptest; it ran without any data corruption for 8-10 hours.

These cards use the standard AHCI driver, which I hoped would reduce the chances of such problems.
The fact that, when I added a bad disk to the onboard AMD server controller, I didn't see the same cascading failure also helped eliminate a kernel RAID bug as the