Friday, August 3, 2012

HP Servers disconnecting

We have recently come across an issue with several types of HP servers that have QLogic/NetXen NC375i network cards in them. The cards disconnect, causing a disruption of service. You can imagine that having this happen to an NFS mount or iSCSI target is less than desirable; it has caused Windows clusters to fail over and ESX/ESXi hosts to go crazy. Rebooting the host clears the problem. This issue is very much OS independent!

In the Windows event log you may see entries like:
DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter #4
PROBLEM: Tx path is hung. The device is being reset.

In ESX you will see entries in /var/log/vmkernel like:
Jul 31 21:02:12 server01 vmkernel: 165:01:40:09.914 cpu19:4295)<5>nx_nic[vmnic8]: Device is DOWN. Fail count[8]
Jul 31 21:02:12 server01 vmkernel: 165:01:40:09.915 cpu19:4295)<3>nx_nic[vmnic8]: Firmware hang detected. Severity code=0 Peg number=2 Error code=1 Return address=0
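
If you have many hosts to check, a small script can pull these lines out of a copy of the vmkernel log. The following is only a rough sketch: the regular expression is modelled on the two sample lines above, and the function name find_firmware_hangs is made up for this example. Other driver versions may word these messages differently.

```python
import re

# Pattern modelled on the sample vmkernel lines in this post (an assumption:
# other nx_nic driver versions may format the message differently).
HANG_RE = re.compile(
    r"nx_nic\[(?P<nic>vmnic\d+)\]: Firmware hang detected\. "
    r"Severity code=(?P<severity>\d+) Peg number=(?P<peg>\d+) "
    r"Error code=(?P<error>\d+)"
)


def find_firmware_hangs(lines):
    """Return (nic, severity, peg, error) for each firmware-hang line."""
    hits = []
    for line in lines:
        m = HANG_RE.search(line)
        if m:
            hits.append(
                (m["nic"], int(m["severity"]), int(m["peg"]), int(m["error"]))
            )
    return hits


sample = [
    "Jul 31 21:02:12 server01 vmkernel: 165:01:40:09.914 cpu19:4295)<5>nx_nic[vmnic8]: Device is DOWN. Fail count[8]",
    "Jul 31 21:02:12 server01 vmkernel: 165:01:40:09.915 cpu19:4295)<3>nx_nic[vmnic8]: Firmware hang detected. Severity code=0 Peg number=2 Error code=1 Return address=0",
]
print(find_firmware_hangs(sample))
```

To run it against a real log, pass an open file object instead of the sample list.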


HP has published an advisory confirming the problem:

Network Adapters and Affected Firmware Versions

Network Adapter                                                   Affected Firmware Versions
CN1000Q Dual Port Converged Network Adapter                       EARLIER than firmware version 4.8.22
NC375i Integrated Quad Port Multifunction Gigabit Server Adapter  EARLIER than firmware version 4.0.585
NC375T PCI Express Quad Port Gigabit Server Adapter               EARLIER than firmware version 4.0.585
NC522m Dual Port Flex-10 10GbE Multifunction BL-c Adapter         EARLIER than firmware version 4.0.585
NC522SFP Dual Port 10GbE Server Adapter                           EARLIER than firmware version 4.0.585
NC523SFP 10Gb 2-port Server Adapter                               EARLIER than firmware version 4.9.81

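
The adapter list above boils down to a simple lookup. As a minimal sketch, assuming firmware versions are always three dot-separated numbers, a check could look like this. The short model keys and the names MIN_FIXED and is_affected are invented for this example and are not part of any HP tool.

```python
# First fixed firmware version per adapter, copied from the advisory list
# above. The short model keys are a shorthand chosen for this sketch.
MIN_FIXED = {
    "CN1000Q": (4, 8, 22),
    "NC375i": (4, 0, 585),
    "NC375T": (4, 0, 585),
    "NC522m": (4, 0, 585),
    "NC522SFP": (4, 0, 585),
    "NC523SFP": (4, 9, 81),
}


def parse_version(text):
    """Turn a string such as '4.0.579' into a comparable tuple (4, 0, 579)."""
    return tuple(int(part) for part in text.split("."))


def is_affected(model, installed):
    """True if the installed firmware is earlier than the first fixed version."""
    return parse_version(installed) < MIN_FIXED[model]


print(is_affected("NC375i", "4.0.579"))
print(is_affected("NC375i", "4.0.585"))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "4.0.99" would sort after "4.0.585".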
The NC375i adapter is integrated on the following servers and storage systems:
  • ProLiant DL370 G6 Server
  • ProLiant DL580 G7 Server
  • ProLiant DL585 G7 Server
  • ProLiant DL980 G7 Server
  • HP Business Data Warehouse Appliance
  • StorageWorks D2D4312 Backup System
  • StorageWorks D2D4324 Backup System

Servers manufactured after 1 April 2012 are not affected, but check the firmware level anyway if you suffer from this issue: an older network card installed in a newer machine may still carry affected firmware.

How to check the firmware version:

Windows:
Go to the HP network utilities, select the network interface you are having issues with, and click Properties. The Information tab shows the Boot Code, which is the firmware version.


Alternatively, you can run the update tool, and it will tell you which version you are currently running as well.


Linux:

Type "modinfo netxen_nic" and look for the firmware line.
[user@server-01 ~]$ modinfo netxen_nic | grep firmware
firmware: phanfw-4.0.579.bin   <--------  version 4.0.579, so needs an update
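
On a larger number of Linux hosts you may want to script this check. Below is a rough sketch that extracts the version from saved modinfo output. It assumes the firmware line always follows the phanfw-X.Y.Z.bin pattern seen in the sample above; the function name firmware_versions is invented for this example.

```python
import re

# Assumption: the firmware line looks like the sample in this post,
# i.e. "firmware: phanfw-<major>.<minor>.<build>.bin".
FIRMWARE_RE = re.compile(
    r"^firmware:\s+phanfw-(\d+)\.(\d+)\.(\d+)\.bin", re.MULTILINE
)


def firmware_versions(modinfo_output):
    """Return a list of (major, minor, build) tuples found in modinfo output."""
    return [
        tuple(int(group) for group in match.groups())
        for match in FIRMWARE_RE.finditer(modinfo_output)
    ]


sample_output = "firmware: phanfw-4.0.579.bin\n"
for version in firmware_versions(sample_output):
    # 4.0.585 is the first fixed version the advisory lists for the NC375i.
    status = "needs an update" if version < (4, 0, 585) else "ok"
    print(version, status)
```

The text to feed in can be captured with subprocess.run(["modinfo", "netxen_nic"], capture_output=True, text=True).stdout on the host itself.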

ESX/ESXi:
VMware have released a KB article to get the firmware and driver version, available here.

Resolution:

The resolution is to update the firmware of the network cards. The advisory lists the latest drivers and firmware. For Windows and Linux there are proper update tools, but unfortunately no firmware update utility is provided for VMware, and the Linux firmware utility does not work on ESX/ESXi.

On ESX/ESXi you have to boot the host from a Linux LiveCD (put the ESX server in Maintenance mode first, then reboot). In our case we used a Novell SLES11 CD (free ISO download, registration required), as the Rescue CD for RHEL5 gave several errors when running the firmware update utility. Perhaps an openSUSE, Fedora, Ubuntu or other distro LiveCD can be used as well, but we haven't tested those.

Many thanks go to my colleague Sven for the info :-)
