Submitted by David Mercer on May 22, 2013
Troubleshooting
Here are our InfiniBand troubleshooting steps. Similar advice can be found in the references at the end of this page.
Check Installation
Use the following commands to see if the InfiniBand module is properly installed and configured.
# lspci | grep -i infiniband (on a Dell node)
05:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
# dmesg | grep -i infiniband
mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
# modinfo mlx4_ib
filename:       /lib/modules/2.6.32-131.6.1.el6.x86_64/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
version:        1.0
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA InfiniBand driver
author:         Roland Dreier
srcversion:     FF725F6011FD56274ABD04E
depends:        mlx4_core,ib_core,ib_mad
vermagic:       2.6.32-131.6.1.el6.x86_64 SMP mod_unload modversions
For RHEL, check that the ibutils, infiniband-diags, and perftest packages are installed.
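For example, rpm can confirm all three are present in one step:
# rpm -q ibutils infiniband-diags perftest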
*** Be sure that InfiniBand tools and the kernel are upgraded together
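One quick way to catch a tools/kernel mismatch is to compare the running kernel against the module's vermagic; the two should agree:
# uname -r
# modinfo -F vermagic mlx4_ib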
Check local IB link
# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.7.0
Hardware version: b0
Node GUID: 0x0002c903005352fc
System image GUID: 0x0002c903005352ff
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 50
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x0002c903005352fd
Link layer: InfiniBand
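State should be Active and Physical state should be LinkUp. To spot-check every node at once, a loop like the following can help (a rough sketch: the node names are placeholders, and passwordless ssh is assumed):
# for h in jinx1 jinx2 jinx3; do echo -n "$h: "; ssh $h 'ibstat | grep -q "State: Active" && echo OK || echo NOT ACTIVE'; done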
Check IB Network Status
Display all hosts visible on the network
# ibhosts
Ca : 0x78e7d10300236d24 ports 2 "jinx17 mlx4_0"
Ca : 0x78e7d1030023721c ports 2 "jinx22 mlx4_0"
Ca : 0x78e7d10300236d9c ports 2 "MT25408 ConnectX Mellanox Technologies"
Ca : 0x78e7d10300238584 ports 2 "jinx21 mlx4_0"
Display all switches visible on the network
# ibswitches
Switch : 0x0008f105002029c2 ports 36 "Voltaire 4036 # 4036-29C2" enhanced port 0 lid 1 lmc 0
Switch : 0x0008f105002029de ports 36 "Voltaire 4036 # 4036-29DE" enhanced port 0 lid 2 lmc 0
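To dump the complete fabric topology (every CA, switch, and link in one report), ibnetdiscover from the infiniband-diags package can also be used:
# ibnetdiscover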
Display the speed and status of all links on the network
# iblinkinfo
CA: jinx28 mlx4_0:
0x0002c90300589731 51 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 6[ ] "Voltaire 4036 # 4036-29DE" ( )
CA: MT25408 ConnectX Mellanox Technologies:
0x0002c903004e425b 48 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 9[ ] "Voltaire 4036 # 4036-29DE" ( )
Switch: 0x0008f105002029c2 Voltaire 4036 # 4036-29C2:
1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 20 1[ ] "jinx14 mlx4_0" ( )
1 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 15 1[ ] "jinx10 mlx4_0" ( )
1 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 22 1[ ] "jinx13 mlx4_0" ( )
1 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 17 1[ ] "jinx9 mlx4_0" ( )
1 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 32 1[ ] "jinx4 mlx4_0" ( )
1 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 19 1[ ] "jinx8 mlx4_0" ( )
1 7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 18 1[ ] "jinx7 mlx4_0" ( )
1 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 13 1[ ] "jinx5 mlx4_0" ( )
1 9[ ] ==( Down/ Polling)==> [ ] "" ( )
1 10[ ] ==( Down/ Polling)==> [ ] "" ( )
1 11[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 30 1[ ] "jinx19 mlx4_0" ( )
1 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 28 1[ ] "jinx23 mlx4_0" ( )
1 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 24 1[ ] "jinx20 mlx4_0" ( )
1 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 31 1[ ] "jinx24 mlx4_0" ( )
1 15[ ] ==( Down/ Polling)==> [ ] "" ( )
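On a larger fabric this listing gets long; filtering out the healthy links is a quick way to find problems (a simple sketch):
# iblinkinfo | grep -v LinkUp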
Test IB Connectivity
This requires two machines on the same IB network.
Machine one will be the server
# ibping -S
Machine two will be the client (you can get the server's Port GUID via ibstat on machine one)
# ibping -G 0x0002c903005352fd
Pong from jinx-login.(none) (Lid 50): time 0.288 ms
Pong from jinx-login.(none) (Lid 50): time 0.322 ms
Pong from jinx-login.(none) (Lid 50): time 0.244 ms
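ibping can also address the server directly by LID (the Base lid reported by ibstat, 50 in the example above):
# ibping 50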
Test RDMA Performance
This requires two machines on the same IB network.
Test Latency
Machine one will be the server
# rdma_lat
Machine two will be the client; connect to the server by hostname
# rdma_lat <hostname-of-server>
local address: LID 0x1f QPN 0x0049 PSN 0x42d29b RKey 0x001d00 VAddr 0x00000000731001
remote address: LID 0x32 QPN 0x004b PSN 0x3e2d9 RKey 0x054300 VAddr 0x00000001dc2001
Conflicting CPU frequency values detected: 1600.000000 != 2666.000000
Latency typical: inf usec
Latency best : inf usec
Latency worst : inf usec
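The "Conflicting CPU frequency values" warning means CPU frequency scaling changed the clock rate during the run, so the cycle-counter-based timing is invalid and the latencies report as inf. A common workaround is to pin the cores to a fixed frequency before re-running, for example with the cpupower utility (assuming it is available; older systems use cpufreq-set from cpufrequtils):
# cpupower frequency-set -g performance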
Test Bandwidth
Machine one will be the server
# rdma_bw
Machine two will be the client; connect to the server by hostname
# rdma_bw <hostname-of-server>
9242: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=0 | cma=0 |
9242: Local address: LID 0x1f, QPN 0x004a, PSN 0x710f49 RKey 0x001e00 VAddr 0x007f4b59ef6000
9242: Remote address: LID 0x32, QPN 0x004c, PSN 0xafa0b9, RKey 0x054400 VAddr 0x007f43e1a44000
Conflicting CPU frequency values detected: 1600.000000 != 2666.000000
9242: Bandwidth peak (#0 to #982): 0 MB/sec
9242: Bandwidth average: 0 MB/sec
9242: Service Demand peak (#0 to #982): 803 cycles/KB
9242: Service Demand Avg : 803 cycles/KB
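The 0 MB/sec figures here are most likely the same CPU frequency artifact described under Test Latency, not an actual transfer failure. Newer perftest releases also ship ib_write_lat and ib_write_bw, which follow the same server/client pattern:
# ib_write_bw (on the server)
# ib_write_bw <hostname-of-server> (on the client)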
References
- Getting Started with Infiniband
- Setting up a Basic Infiniband Network
- Mellanox Infiniband Training presentation
- Infiniband How-To
Questions that we need to answer
- What to do when ibstatus shows port 1 as INIT?
- When to reset the IB card on a node?
- When to reset the IB switch in the rack?
- How to test code using the IB switch?