Skip to content

Infiniband Subsystem

Current Status

Lego’s IB stack is ported based on linux-3.11.1. We ported:

  • ib_core
  • mlx4_ib
  • mlx4_core

Lego does not support uverbs. At the time of writing, Lego IB stack has only been tested on Mellanox Technologies MT27500 Family [ConnectX-3].

Random summary

The stack is SUPER complex, a lot data structures and pointers fly all over. Good thing is the whole stack is layered clearly.

Top down


  • IB core code is in driver/infiniband/core, which exposes the major IB API to both user and kernel applications. Inside, it has two parts. The first part is function callback, that call back to underlying device-specific functions. The second part is the management stack, including communication manager (cm), management datagram (mad), and so on.
  • In IB, each port’s QP0 and QP1 are reserved for management purpose. They will receive/send MAD from/to subnet manager, who typically runs on switch. All the IB management stuff is carried out by exchanging MAD.
  • There are several key data structures: ib_client, ib_device, and mad_agent. MAD, CM, and some others are ib_client, which means they use IB device, and will be called back whenever a device has been added. mad_agent is something that will be called back whenever a device received a MAD message from switch (see ib_mad_completion_handler()). A lot layers, huh?
  • ib_mad_completion_handler(): we changed the behavior of it. we use busy polling instead of interrupt. Originally, it will be invoked by mlx4_core/eq.c

mlx4_ib and mlx4_core

  • mlx4_core is actually the Ethernet driver for Mellanox NIC device (drivers/net/ethernet/mellanox/hw/mlx4), which do the actual dirty work of talking with device. On the other hand, mlx4_ib is the glue code between ib_core and mlx4_core, who do the translation.

  • A lot IB verbs are ultimately translated into fw.c __mlx4_cmd(), which actually send commands to device and get the result. There are two ways of getting result: 1) polling: after writing to device memory the command, the same thread keep polling. 2) sleep and wait for interrupt. By default, the interrupt way is used (obviously). But, at the time of writing (Aug 20, 2018), we don’t really have a working IRQ subsystem, so we use polling instead. I’m still a little concerned that without interrupt handler, we might lose some events and the NIC may behavave incorrectly if interrupts are not handled.

Init Sequence

  1. Init PCI subsystem, build data structures
  2. Core IB layer register ib_client
  3. mlx4_init(): register PCI driver, provide a callback
  4. __mlx4_init_one(): initialize the hardware itself, register interrupt handler.
  5. mlx4_ib_init(): allocate a ib_device, and register, which will callback through all ib_client registered at step 1.

Yizhou Shan
Created: Aug 20, 2018
Last Updated: Aug 20, 2018