Current revision updated by kpark322 on
Originally created by kpark322 on

Submitted by David Mercer on May 22, 2013

Troubleshooting

These are our InfiniBand troubleshooting steps.  Similar advice can be found at the link below:

 

Check installation

Use the following commands to see if the InfiniBand module is properly installed and configured.

# lspci | grep -i infiniband    (on a Dell node)
05:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
# dmesg | grep -i infiniband
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
# modinfo mlx4_ib
filename:       /lib/modules/2.6.32-131.6.1.el6.x86_64/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
version:        1.0
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA InfiniBand driver
author:         Roland Dreier
srcversion:     FF725F6011FD56274ABD04E
depends:        mlx4_core,ib_core,ib_mad
vermagic:       2.6.32-131.6.1.el6.x86_64 SMP mod_unload modversions 

 

On RHEL, check that the ibutils, infiniband-diags, and perftest packages are installed.

*** Be sure that the InfiniBand tools and the kernel are upgraded together.
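
The package check above can be scripted. This is a minimal sketch, assuming an rpm-based system such as RHEL. The function name check_ib_packages and the RPM_QUERY override are hypothetical and exist only for this illustration.

```shell
# Report which of the three packages named above are missing.
# RPM_QUERY is a hypothetical override so the check can be exercised
# on a machine without rpm; by default the function runs "rpm -q".
check_ib_packages() {
    missing=0
    for pkg in ibutils infiniband-diags perftest; do
        # rpm -q exits non-zero when the package is not installed
        if ! ${RPM_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            echo "missing: $pkg"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_ib_packages on a node prints one "missing:" line per absent package and returns non-zero if any are absent.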

Check local IB link

# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: b0
        Node GUID: 0x0002c903005352fc
        System image GUID: 0x0002c903005352ff
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 50
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903005352fd
                Link layer: InfiniBand
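
A healthy port in the ibstat output above shows "State: Active" and "Physical state: LinkUp". The sketch below checks for exactly those two values. It reads ibstat output on standard input so it can be tried against saved output. The function name ib_ports_ok is hypothetical.

```shell
# Read ibstat output on stdin. Succeed only if at least one port is
# listed and every listed port is Active with a physical state of LinkUp.
ib_ports_ok() {
    awk '
        $1 == "State:"                     { ports++; if ($2 != "Active") bad++ }
        $1 == "Physical" && $2 == "state:" { if ($3 != "LinkUp") bad++ }
        END { exit ((ports > 0 && bad == 0) ? 0 : 1) }
    '
}
```

On a node this would be run as: ibstat | ib_ports_ok && echo "IB link OK"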

Check IB Network Status

Display all hosts visible on the network

# ibhosts
Ca : 0x78e7d10300236d24 ports 2 "jinx17 mlx4_0"
Ca : 0x78e7d1030023721c ports 2 "jinx22 mlx4_0"
Ca : 0x78e7d10300236d9c ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x78e7d10300238584 ports 2 "jinx21 mlx4_0"
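
To check whether a particular node shows up, it can help to reduce the ibhosts output to a list of names. This sketch assumes the quoted node description starts with the host name, as in the "jinx17 mlx4_0" entries above. An entry such as "MT25408 ConnectX Mellanox Technologies" yields only its first word. The function name ib_host_names is hypothetical.

```shell
# Read ibhosts output on stdin and print the first word of each quoted
# node description, one per line.
ib_host_names() {
    awk -F'"' '/^Ca/ { split($2, words, " "); print words[1] }'
}
```

For example: ibhosts | ib_host_names | grep -x jinx17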

Display all switches visible on the network

# ibswitches
Switch : 0x0008f105002029c2 ports 36 "Voltaire 4036 # 4036-29C2" enhanced port 0 lid 1 lmc 0
Switch : 0x0008f105002029de ports 36 "Voltaire 4036 # 4036-29DE" enhanced port 0 lid 2 lmc 0

Display speed and status of all links on the network

# iblinkinfo
CA: jinx28 mlx4_0:
      0x0002c90300589731     51    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       2    6[  ] "Voltaire 4036 # 4036-29DE" ( )
CA: MT25408 ConnectX Mellanox Technologies:
      0x0002c903004e425b     48    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>       2    9[  ] "Voltaire 4036 # 4036-29DE" ( )
Switch: 0x0008f105002029c2 Voltaire 4036 # 4036-29C2:
           1    1[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      20    1[  ] "jinx14 mlx4_0" ( )
           1    2[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      15    1[  ] "jinx10 mlx4_0" ( )
           1    3[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      22    1[  ] "jinx13 mlx4_0" ( )
           1    4[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      17    1[  ] "jinx9 mlx4_0" ( )
           1    5[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      32    1[  ] "jinx4 mlx4_0" ( )
           1    6[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      19    1[  ] "jinx8 mlx4_0" ( )
           1    7[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      18    1[  ] "jinx7 mlx4_0" ( )
           1    8[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      13    1[  ] "jinx5 mlx4_0" ( )
           1    9[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           1   10[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           1   11[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      30    1[  ] "jinx19 mlx4_0" ( )
           1   12[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      28    1[  ] "jinx23 mlx4_0" ( )
           1   13[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      24    1[  ] "jinx20 mlx4_0" ( )
           1   14[  ] ==( 4X          10.0 Gbps Active/  LinkUp)==>      31    1[  ] "jinx24 mlx4_0" ( )
           1   15[  ] ==(                Down/ Polling)==>             [  ] "" ( )
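
In the listing above, "4X 10.0 Gbps" means four lanes at 10 Gbps each, which matches the "Rate: 40" that ibstat reports. Ports shown as Down/Polling have no active link. That may only mean the switch port is unused, so the list still has to be compared against the expected cabling. The sketch below pulls the down ports out of iblinkinfo output read on standard input. The function name ib_down_ports is hypothetical.

```shell
# Read iblinkinfo output on stdin and print "<section header> port <N>"
# for every port whose link state is Down.
ib_down_ports() {
    awk '
        /^(Switch|CA):/ { section = $0 }
        /Down\// {
            # the local port number is the first run of digits followed by "["
            if (match($0, /[0-9]+\[/))
                print section " port " substr($0, RSTART, RLENGTH - 1)
        }
    '
}
```

For example: iblinkinfo | ib_down_ports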

Test IB Connectivity

This test requires two machines on the same IB network.

Machine one will be the server:

# ibping -S

Machine two will be the client (you can get the port GUID of the server via ibstat on machine one):

# ibping -G 0x0002c903005352fd
Pong from jinx-login.(none) (Lid 50): time 0.288 ms
Pong from jinx-login.(none) (Lid 50): time 0.322 ms
Pong from jinx-login.(none) (Lid 50): time 0.244 ms
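
To compare nodes it can be handy to boil the pong lines down to one number. This sketch averages the round-trip times from ibping output read on standard input, assuming the "Pong from ... time N ms" line format shown above. The function name ib_ping_avg is hypothetical.

```shell
# Read ibping client output on stdin and print the reply count and the
# mean round-trip time. Fails if no "Pong from" lines were seen.
ib_ping_avg() {
    awk '
        /^Pong from/ { sum += $(NF - 1); n++ }
        END {
            if (n == 0) exit 1
            printf "%d replies, avg %.3f ms\n", n, sum / n
        }
    '
}
```

For the three replies shown above this prints "3 replies, avg 0.285 ms". On the client, ibping's -c option limits the number of pings so the pipeline terminates, for example: ibping -c 5 -G 0x0002c903005352fd | ib_ping_avg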

Test RDMA Performance

These tests require two machines on the same IB network.

Test Latency

Machine one will be the server:

# rdma_lat

Machine two will be the client (pass it the hostname of machine one):

# rdma_lat <hostname-of-server>
   local address: LID 0x1f QPN 0x0049 PSN 0x42d29b RKey 0x001d00 VAddr 0x00000000731001
  remote address: LID 0x32 QPN 0x004b PSN 0x3e2d9 RKey 0x054300 VAddr 0x00000001dc2001
Conflicting CPU frequency values detected: 1600.000000 != 2666.000000
Latency typical: inf usec
Latency best   : inf usec
Latency worst  : inf usec

Test Bandwidth

Machine one will be the server:

# rdma_bw

Machine two will be the client (pass it the hostname of machine one):

# rdma_bw <hostname-of-server>
9242: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |
9242: Local address:  LID 0x1f, QPN 0x004a, PSN 0x710f49 RKey 0x001e00 VAddr 0x007f4b59ef6000
9242: Remote address: LID 0x32, QPN 0x004c, PSN 0xafa0b9, RKey 0x054400 VAddr 0x007f43e1a44000
Conflicting CPU frequency values detected: 1600.000000 != 2666.000000
9242: Bandwidth peak (#0 to #982): 0 MB/sec
9242: Bandwidth average: 0 MB/sec
9242: Service Demand peak (#0 to #982): 803 cycles/KB
9242: Service Demand Avg  : 803 cycles/KB
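
Both sample runs above print "Conflicting CPU frequency values detected" and then report unusable numbers (inf usec, 0 MB/sec). The sketch below treats that warning as a sign the timing results should not be trusted. That the warning causes the bad numbers is an assumption drawn from these two samples, not something this page confirms. The function name perf_output_trustworthy is hypothetical.

```shell
# Read rdma_lat or rdma_bw output on stdin. Return non-zero, with a note
# on stderr, if the CPU frequency warning appears in the output.
perf_output_trustworthy() {
    if grep -q 'Conflicting CPU frequency values detected'; then
        echo "warning: conflicting CPU frequencies; timing results may be invalid" >&2
        return 1
    fi
    return 0
}
```

For example: rdma_bw <hostname-of-server> | perf_output_trustworthy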

Questions that we need to answer

  1. What to do when ibstatus shows port 1 as INIT?
  2. When to reset the IB card on a node?
  3. When to reset the IB switch in the rack?
  4. How to test code using the IB switch?