How to configure a Mellanox 100G NIC

  • download the appropriate Mellanox Linux driver from the URL below

Linux Drivers (MLNX_OFED) download

The driver below supports 10Gbps, 40Gbps, and 56Gbps per port, but not 100Gbps;
therefore, for a 100G port, use the MLNX_OFED driver above instead.
Mellanox EN Driver for Linux (EN means Ethernet-only NIC)

  • download and install MFT (Mellanox Firmware Tools)
tar xvfz mft-4.9.0-38-x86_64-deb.tgz
cd mft-4.9.0-38-x86_64-deb/
./install.sh
service mst start
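After starting mst, it may help to confirm that MFT can actually see the adapter. A minimal sketch; the device name `/dev/mst/mt4115_pciconf0` is the ConnectX-4 name used throughout this guide, but it varies per host:

```shell
# list the MST devices MFT discovered; the ConnectX-4 should appear
# as something like /dev/mst/mt4115_pciconf0
mst status

# print the installed firmware-tools version
mst version
```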

  • install MLNX_OFED and configure
lspci |grep -i mellanox
./mlnxofedinstall --upstream-libs --dpdk
# firmware version should be 12.21.1000 and above
ibv_devinfo  | grep vendor_part_id
mlxconfig -d /dev/mst/mt4115_pciconf0 query |grep LINK_TYPE
mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep SRIOV_EN
mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8
mlxfwreset -d /dev/mst/mt4115_pciconf0 reset
service openibd restart
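Since the firmware must be 12.21.1000 or above, it is worth checking the version explicitly after installation. A sketch, assuming the mlx5 device is already visible to the verbs layer:

```shell
# firmware version as reported by the verbs layer
ibv_devinfo | grep fw_ver

# the same information, queried through the firmware tools
mlxfwmanager --query
```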

# if mlxconfig cannot find the device, start mst (Mellanox Software Tools) first
service mst start

# ConnectX-4 EN is an Ethernet-only NIC, so there is no need to change
# LINK_TYPE_P1 and LINK_TYPE_P2 from the default 1 (InfiniBand) to 2 (Ethernet).
# if needed (e.g. on a VPI card), change those values with the commands below
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P2=2

# if link type was changed, firmware must be reset as well
mlxfwreset -d /dev/mst/mt4115_pciconf0 reset

# dynamic VF assignment
echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

# configure IPs and MTU
ifconfig ens785f0 up
ifconfig ens785f0 mtu 9000
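The commands above bring the link up but never assign an address. A minimal sketch of a full assignment; the 192.168.1.0/24 subnet is an assumption, so adjust it to your testbed:

```shell
# example address on the first 100G port; the subnet is an assumption
ifconfig ens785f0 192.168.1.1 netmask 255.255.255.0 mtu 9000 up
```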

# for useful information
lspci -s 04:00.0 -vvv

  • after reboot, check that the port type was changed to Ethernet

# alternatively, you can make sure Ethernet interfaces are in working order and linked to kernel verbs
ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5
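To confirm that the link actually negotiated 100G, ethtool can be queried per interface. A sketch, assuming the interface name used earlier in this guide:

```shell
# expect 'Speed: 100000Mb/s' and 'Link detected: yes'
ethtool ens785f0 | egrep 'Speed|Link detected'
```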

  • pre-test the link and interfaces using iperf


numactl --cpunodebind=0 iperf -s -t $TIME -P $PTH & \
ssh $CLIENT "numactl --cpunodebind=0 iperf -c $SERVER -t $TIME -P $PTH "
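The iperf line above relies on shell variables that are never defined in this guide; they are assumptions, so set them to match your testbed, e.g.:

```shell
SERVER=192.168.1.1   # address the iperf server listens on (this host)
CLIENT=192.168.1.2   # host that generates traffic toward the server
TIME=30              # test duration in seconds
PTH=8                # number of parallel iperf streams
```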

  • check kernel modules are loaded using lsmod
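A minimal sketch of the check; the exact module list depends on the OFED version, but the mlx5 core and Ethernet modules should at least be present:

```shell
# expect mlx5_core and mlx5_ib (plus ib_uverbs for DPDK) among the results
lsmod | egrep 'mlx|ib_'
```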


  • configure hugepages
vi /etc/sysctl.conf
vm.nr_hugepages = 32    # 32 x 1GB = 32GB

# apply the change and make it permanent
sysctl -p
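To double-check the reservation, /proc/meminfo reports both the page count and the page size; on a live system the check is simply `grep Huge /proc/meminfo`. A sketch of the arithmetic on a sample meminfo excerpt (with 1GB pages, Hugepagesize is reported as 1048576 kB):

```shell
# compute total reserved hugepage memory from a meminfo-style excerpt
meminfo="HugePages_Total:      32
Hugepagesize:    1048576 kB"
echo "$meminfo" | awk '/HugePages_Total/ {n=$2} /Hugepagesize/ {sz=$2} END {print n*sz, "kB"}'
# prints: 33554432 kB (= 32GB)
```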

  • download DPDK and edit its build config to enable the MLX PMD, librte_pmd_mlx5
vi config/common_linuxapp
CONFIG_RTE_LIBRTE_MLX5_PMD=y



  • build and install DPDK
make install T=x86_64-native-linuxapp-gcc DESTDIR=/usr/local
# configure environment variables and add them to ~/.bashrc
vi ~/.bashrc
export RTE_SDK=/usr/local/share/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc

# if you encounter 'numa.h' missing error, you need to install libnuma-dev
sudo apt install libnuma-dev

# copy the shared glue library into the LD library path
cp x86_64-native-linuxapp-gcc/lib/librte_pmd_mlx5_glue* $(ldconfig -p | grep librte_pmd_mlx5_glue | awk '{print $4}')

# if you will use VPP with DPDK, building VPP requires the DPDK shared libraries
vi ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

  • configure performance tuning: BIOS configuration

– disable hyperthreading
– disable I/O non-posted prefetch
– if you want to use SR-IOV, enable it in the BIOS as well

  • configure performance tuning: 1GB hugepage size and CPU isolation
vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G isolcpus=16-31 nohz_full=16-31 rcu_nocbs=16-31 iommu=pt intel_iommu=on"

vm.nr_hugepages in /etc/sysctl.conf should be adapted accordingly

  • configure performance tuning: NUMA-aware assignment of NIC
lstopo topo.png
numactl -H
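Beyond lstopo, the NUMA node of each port can be read directly from sysfs, which is what matters when picking lcores for DPDK. A sketch; the interface name is an assumption:

```shell
# prints the NUMA node the NIC is attached to; -1 means the platform
# reports no locality. pin worker cores to this node for best throughput
cat /sys/class/net/ens785f0/device/numa_node
```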

  • configure performance tuning: disable pause frame
ethtool -A <netdev> rx off tx off

# turn off auto-negotiation if you want
ethtool -s <netdev> autoneg off

  • configure performance tuning: CPU frequency policy
# let frequency scaling be 'performance' rather than 'powersave'

# R730
declare -a CPUs=("8" "9" "10" "11" "12" "13" "14" "15")

for i in "${CPUs[@]}"; do
    echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

# cpufrequtils utility
sudo apt install cpufrequtils
sudo cpufreq-set -r -g performance

  • configure performance tuning: set the Max_Read_Req PCIe parameter to 1K or another value
dpdk-devbind -s
setpci -s <NIC PCI address> 68.w
setpci -s <NIC PCI address> 68.w=3XXX

if the first command prints something other than 3XXX, set the leading digit to 3 and keep the lower three digits unchanged.

Even though Mellanox recommends this configuration, I have observed a negative effect from it, so be careful when adopting this optimization. In my experiments, 3XXX on the R730 and 2XXX on the R930 showed the best throughput.

lspci | grep Mellanox
lspci -s 81:00.0 -vvv | grep PCIeGen
lspci -s 81:00.0 -vvv | grep MaxReadReq
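The value written to 68.w is easier to reason about once decoded: bits 14:12 of the PCIe Device Control register hold the Max_Read_Request_Size code, and the size in bytes is 128 shifted left by that code. A sketch with a hypothetical register value:

```shell
# 0x3a10 is a made-up example of what 'setpci -s <addr> 68.w' might print
word=0x3a10
code=$(( (word >> 12) & 0x7 ))              # 3
echo "MaxReadReq: $((128 << code)) bytes"   # MaxReadReq: 1024 bytes
```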

  • configure performance tuning: interrupt configuration
service irqbalance stop
echo '2,4,6,8,10,12,14' | sudo tee /proc/irq/*/smp_affinity_list

# Set all possible interrupts to different NUMA. for example, 
echo '6-9' | sudo tee /proc/irq/*/smp_affinity_list

# Set NIC interrupts to same NUMA. for example, 0-1 ethX

# Set other NIC interrupts to different NUMA. for example, 6-9 ethY

perform this step only if the required performance has not been achieved.
Mellanox provides this tool:
/usr/sbin/ 8 ethN
/usr/sbin/ 1 enp65s0f0
/usr/sbin/ 1 enp65s0f1
/usr/sbin/ 1 enp67s0f0
/usr/sbin/ 1 enp67s0f1
/usr/sbin/ 2 enp129s0f0
/usr/sbin/ 2 enp129s0f1
/usr/sbin/ 2 enp131s0f0
/usr/sbin/ 2 enp131s0f1

  • configure performance tuning: disable kernel memory compaction
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
sysctl -w vm.swappiness=0
sysctl -w vm.zone_reclaim_mode=0
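After these writes, the kernel should report 'never' as the active transparent-hugepage policy. A minimal check:

```shell
# the active value is shown in brackets, e.g. 'always madvise [never]'
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
```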

  • configure performance tuning: aggressive CQE Zipping
mlxconfig -d /dev/mst/mt4115_pciconf0 q | grep CQE
mlxconfig -d <mst device> s CQE_COMPRESSION=1

You can see all MST devices in the /dev/mst directory; if it is empty, run 'service mst start'.
how to roll back CQE Zipping to its default value:
mlxconfig -d <mst device> s CQE_COMPRESSION=0

  • run simple test
cd /usr/local/share/dpdk/examples/l3fwd
vi l3fwd_lpm.c
static struct ipv4_l3fwd_lpm_route ipv4_l3fwd_lpm_route_array[] = {
    {IPv4(1, 1, 1, 0), 24, 0},
    {IPv4(2, 1, 1, 0), 24, 1},
    {IPv4(3, 1, 1, 0), 24, 2},
    {IPv4(4, 1, 1, 0), 24, 3},
    {IPv4(5, 1, 1, 0), 24, 4},
    {IPv4(6, 1, 1, 0), 24, 5},
    {IPv4(7, 1, 1, 0), 24, 6},
    {IPv4(8, 1, 1, 0), 24, 7},
    {IPv4(192, 168, 1, 0), 24, 0},
    {IPv4(192, 168, 2, 0), 24, 1},
    {IPv4(192, 168, 3, 0), 24, 2},
    {IPv4(192, 168, 4, 0), 24, 3},
    {IPv4(192, 168, 5, 0), 24, 4},
    {IPv4(192, 168, 6, 0), 24, 5},
    {IPv4(192, 168, 7, 0), 24, 6},
    {IPv4(192, 168, 8, 0), 24, 7},
};

cd build
numactl -H

# simple test with one queue pair per port
./l3fwd -l 13-14,17-18,21-22,25-26,29-30 -n 4 -- -p 0xff --config="(0,0,17),(1,0,21),(2,0,25),(3,0,29),(4,0,18),(5,0,22),(6,0,26),(7,0,30)" -L

# for best performance, use two queue pairs per port
./l3fwd -l 1-2,5-6,9-10,13-14,17-18,21-22,25-26,29-30 -n 4 -- -p 0xff --config="(0,0,1),(0,1,17),(1,0,5),(1,1,21),(2,0,9),(2,1,25),(3,0,13),(3,1,29),(4,0,2),(4,1,18),(5,0,6),(5,1,22),(6,0,10),(6,1,26),(7,0,14),(7,1,30)" -L

# hyperthreading case
./l3fwd -l 45-46,49-50,53-54,57-58,61-62 -n 4 -- \
        -p 0xff --config="(0,0,49),(1,0,53),(2,0,57),(3,0,61),(4,0,50),(5,0,54),(6,0,58),(7,0,62)" -L

./l3fwd -c 0xf -- -p 0x3 --config="(0,0,0),(1,0,1)" -L
./l3fwd -c 0xf -- -p 0x3 --config="(0,0,0),(0,1,1),(1,0,2),(1,1,3)" -L

  • how to retrieve PCI addresses of interfaces for whitelisting

INTERFACES="enp65s0f0 enp65s0f1 enp66s0f0 enp66s0f1 enp129s0f0 enp129s0f1 enp130s0f0 enp130s0f1"

{
for intf in $INTERFACES; do
    (cd "/sys/class/net/${intf}/device/" && pwd -P)
done
} |
sed -n 's,.*/\(.*\),-w \1,p'
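The sed expression keeps only the last component of the resolved sysfs path, i.e. the PCI address, and prefixes it with the EAL whitelist flag. A standalone illustration; the sysfs path below is a made-up example of what `pwd -P` might resolve to:

```shell
# extract the PCI address from a resolved sysfs device path
echo "/sys/devices/pci0000:40/0000:40:02.0/0000:41:00.0" | sed -n 's,.*/\(.*\),-w \1,p'
# prints: -w 0000:41:00.0
```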

# 'dpdk-devbind -s' will do the same thing for you

# the DUT (i.e. R930) should enable IP forwarding
vi /etc/sysctl.conf
net.ipv4.ip_forward = 1
sysctl -p

two aspects of /etc/default/grub should be considered here:
– interface name
– hugepage size

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G isolcpus=2-7 net.ifnames=0 biosdevname=0 iommu=pt intel_iommu=on"

this README mainly follows the instructions in the URLs below.

Linux Driver Solutions

HowTo Install MLNX_OFED Driver

Mellanox DPDK Quick Start Guide

Getting started with ConnectX-4 100Gb/s Adapter for Linux

Performance Tuning for Mellanox Adapters

HowTo Configure SR-IOV for ConnectX-4/ConnectX-5 with KVM (Ethernet)

HowTo Set Virtual Network Attributes on a Virtual Function (SR-IOV)

HowTo Configure QoS over SR-IOV

HowTo Configure Rate Limit per VF for ConnectX-4/ConnectX-5

Ethtool Commands

How to get best performance with NICs on Intel platforms

Understanding PCIe Configuration for Maximum Performance
