Sun X4100 and X4200 Lower Non-critical going low
For over a year now our team of on-call engineers has been tortured by an error generated periodically by our racks of Sun X4100 and X4200 servers. These alerts come from the integrated ILOMs, which we have set to send syslog messages to our EM7 monitoring platform. About once a week, one of our many servers reports something along the lines of the following error:
FIRST REPORTED: 2009-04-29 14:50:33
LAST REPORTED: 2009-04-29 14:50:34
SEVERITY: CRITICAL
OCCURRENCES: 2
SOURCE: Syslog
ORGANIZATION: Management
DEVICE: prsun1-sc
Full message text for most recent occurrence:
<130>logmgr: ID = 343 : Wed Apr 29 14:52:39 2009 : IPMI : Log : critical : ID = 7f : 04/29/2009 : 14:52:39 : Voltage : mb.v_+12v : Lower Non-critical going high : reading 12.16 > threshold 10.96 Volts
This event has not been acknowledged
Sent by notification policy: Major/Critical Events
The EM7 has received a CRITICAL syslog notification from this server.
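A side note on the `<130>` at the front of that message: it is the standard syslog priority value, which packs facility and severity into one number (facility * 8 + severity). The sketch below decodes it in Python. The `decode_pri` helper is a hypothetical name of my own, and whether EM7 derives its CRITICAL label from this field is an assumption, not something I have confirmed.

```python
# Severity names indexed by the syslog severity code (0-7).
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def decode_pri(pri):
    """Split a syslog PRI value into (facility number, severity name)."""
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


# 130 = 16 * 8 + 2, i.e. facility 16 (local0) with severity 2 (critical).
print(decode_pri(130))
```

So the ILOM itself is stamping these voltage blips as critical before the monitoring platform ever sees them.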
If you go look at the event log on the ILOM it looks more like this:
04/29/2009 : 14:52:39 : Voltage : mb.v_+12v : Lower Non-critical going high : reading 12.16 > threshold 10.96 Volts
04/29/2009 : 14:52:38 : Voltage : mb.v_+12v : Lower Non-critical going low : reading 7.37 < threshold 10.96 Volts
Looking at the event log on another server with the same type of issue, the error is for a different sensor, yet it shows the same behavior:
02/21/2009 : 06:25:01 : Voltage : p1.v_vddio : Lower Non-critical going high : reading 1.85 > threshold 1.60 Volts
02/21/2009 : 06:24:55 : Voltage : p1.v_vddio : Lower Non-critical going low : reading 0.97 < threshold 1.60 Volts
I should note that these errors *never* seem to turn out to be anything but noise… We all just acknowledge the alarm and go back to bed.
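Since the noise always seems to show up as a "going low" event followed within seconds by a matching "going high", it could in principle be filtered by pairing the two. Here is a minimal sketch in Python, assuming the line format shown in the ILOM event log excerpts above. The function names and the 10-second window are my own illustrative choices, not anything provided by ILOM or EM7.

```python
import re
from datetime import datetime

# Matches voltage threshold lines in the format shown in the ILOM event log.
EVENT_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4}) : (?P<time>\d{2}:\d{2}:\d{2}) : Voltage : "
    r"(?P<sensor>\S+) : (?P<level>Lower|Upper) Non-critical going "
    r"(?P<direction>low|high) : reading (?P<reading>[\d.]+) [<>] "
    r"threshold (?P<threshold>[\d.]+) Volts"
)


def parse_event(line):
    """Return a dict describing one event line, or None if it does not match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    when = datetime.strptime(m["date"] + " " + m["time"], "%m/%d/%Y %H:%M:%S")
    return {
        "when": when,
        "sensor": m["sensor"],
        "level": m["level"],
        "direction": m["direction"],
        "reading": float(m["reading"]),
    }


def find_transients(lines, window_seconds=10):
    """Pair each threshold crossing with its recovery.

    Returns (sensor, reading at the crossing, seconds until recovery) for
    every excursion that cleared within window_seconds (an assumed cutoff).
    """
    events = sorted(
        (e for e in map(parse_event, lines) if e is not None),
        key=lambda e: e["when"],
    )
    open_excursions = {}
    transients = []
    for e in events:
        key = (e["sensor"], e["level"])
        # A Lower threshold excursion starts by going low; an Upper one by going high.
        start_direction = "low" if e["level"] == "Lower" else "high"
        if e["direction"] == start_direction:
            open_excursions[key] = e
        elif key in open_excursions:
            start = open_excursions.pop(key)
            duration = (e["when"] - start["when"]).total_seconds()
            if duration <= window_seconds:
                transients.append((e["sensor"], start["reading"], duration))
    return transients
```

Fed the four event lines above, this reports both excursions as short-lived: 6 seconds on p1.v_vddio and 1 second on mb.v_+12v. Anything that stays below threshold longer than the window would not be returned, so it would still deserve a page.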
This week I finally got annoyed enough to go look further into this issue as I do participate in the on-call rotation which covers these systems (even though I don’t *own* these systems).
After doing some digging, I found the following obscure note in the release notes for some firmware update bundle which includes ILOM firmware:
ILOM Service Processor firmware 2.0.2.5
* Fixed the bug of lower non-critical voltage sense issue.
So I have gone ahead and upgraded a couple of my servers thus far. Hopefully this will resolve the issue!
I have to get in a couple of jabs at Sun here since I burned an entire day today messing with their servers:
- When you upload the ILOM firmware (which includes a system BIOS upgrade also) your server may get powered off during the upgrade without any warning.
- When you upgrade to a 2.0 BIOS from a 1.x version, you have to manually clear the CMOS according to their release notes (the update utility seriously could not do this for us?)
- And my personal favorite, their documentation makes some obscure reference to some bug you might run into and so they tell you that you must upload the new firmware *twice* in order to ensure it applied properly. Mind you they don’t tell you what the problem you might run into is, and they give you no way to tell if the person that upgraded the firmware for you previously did the double firmware update properly.
- After the ILOM firmware and system BIOS updates I did today, the servers somehow managed to change the device IDs (or something) on the onboard NVIDIA NICs in such a way that Windows recognized them as new NICs (5 and 6). This caused them to lose all IP settings and I had to log in through the ILOM and reset them. This happened on both of the servers I upgraded.
- To upgrade the RAID card firmware/BIOS you must boot the server from a CD that runs DOS. By comparison, on a Dell box you drop in the OpenManage CD, it scans your system to determine what needs updating to get you to a “known good set” of drivers, and you click the go button. It takes care of all firmware, drivers, and software for you.
- The LSI software for Windows to monitor the built-in RAID card is a joke. It looks like an intern wrote it.
- At least Sun does provide a streamlined Windows driver installer package, and this did work well.
Overall, I am not completely thrilled with Sun’s x86 hardware lines, though I suppose things may be better if you are a Solaris-on-x86 shop.
-Eric
UPDATE 5/13/09
I got another voltage error on one of my fully updated servers. I have called Sun and opened another case on this, though so far Tier 1 and Tier 2 techs do not seem to have any ideas as to what is causing this issue. I sent them a bunch of output from the ipmi tool that they are looking through.
ID = 1 : 05/10/2009 : 23:58:42 : Voltage : p1.v_vtt : Upper Non-critical going high : reading 1.79 > threshold 1.00 Volts
I should also note that after the firmware updates, one of the machines is now reporting ECC errors. This makes me wonder if the previous firmware was not properly reporting them. We have had almost zero RAM problems with our dozens of Sun x86 servers, which makes me worry that they were just hiding their problems. I must say the server handled the failure gracefully. It was getting dual-bit (uncorrectable) ECC errors, so upon boot it disabled the two (of four) offending DIMMs. Very nice.
Also, I would like to take a moment to comment on Sun’s build quality in the x4100 and x4200 servers. I opened a couple of them up today for the first time and I must say, I am *very* impressed with the physical build quality. Sun has some very talented hardware engineers (almost over-built I would say). The servers are made from some heavy gauge metal among other things.
So while I have changed my mind a bit on Sun’s build quality, they are certainly lacking some of the finer touches needed for x86 servers. Their out-of-band management controllers (previously ALOMs, now ILOMs) have been quite the fiasco for us. It is also a royal pain to bring all the different firmware and drivers up to “known good sets”. Dell has quite a nice tool for this.
One of the techs also mentioned that there is a firmware update for the power supplies to keep them from powering the machine off in the event of a momentary power loss (such as when a UPS kicks in). Apparently they are programmed to power down after 20ms of lost power, even though they should be able to run for over 100ms after power is lost.