how to configure a Mellanox 100G NIC

  • download the appropriate Mellanox Linux driver from the URL below

Linux Drivers (MLNX_OFED) download
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

FYI:
the Mellanox EN driver below supports 10Gbps, 40Gbps, and 56Gbps per port, but not 100Gbps;
therefore, use the MLNX_OFED driver for 100G ports.
Mellanox EN Driver for Linux (EN means Ethernet-only NIC)
http://www.mellanox.com/page/products_dyn?product_family=27


  • download and install MFT (Mellanox Firmware Tools)
wget http://www.mellanox.com/downloads/MFT/mft-4.9.0-38-x86_64-deb.tgz
tar xvfz mft-4.9.0-38-x86_64-deb.tgz
cd mft-4.9.0-38-x86_64-deb/
./install.sh
service mst start
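
# the MFT tools expose the adapter as an MST device (e.g. /dev/mst/mt4115_pciconf0,
# used by the mlxconfig commands below); to list the MST devices on your system:
mst status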

  • install and configure MLNX_OFED
lspci |grep -i mellanox
./mlnxofedinstall --upstream-libs --dpdk
# firmware version should be 12.21.1000 or above
ibv_devinfo 
ibv_devinfo  | grep vendor_part_id
mlxconfig -d /dev/mst/mt4115_pciconf0 query |grep LINK_TYPE
mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep SRIOV_EN
mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8
mlxfwreset -d /dev/mst/mt4115_pciconf0 reset
service openibd restart

# if mlxconfig fails to run (no device under /dev/mst), start mst (Mellanox Software Tools) first
service mst start

# ConnectX-4 EN is an Ethernet-only NIC, so there is no need to change
# LINK_TYPE_P1 and LINK_TYPE_P2 from the default 1 (InfiniBand) to 2 (Ethernet).
# if needed, change those values with the commands below
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2
mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P2=2

# if the link type was changed, the firmware must be reset as well
mlxfwreset -d /dev/mst/mt4115_pciconf0 reset

# dynamic VF assignment
echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
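
# for example, to bring up the 8 VFs configured with NUM_OF_VFS=8 above (adjust the count to your setup):
echo 8 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
cat /sys/class/infiniband/mlx5_0/device/sriov_numvfs   # verify the value took effect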

# configure IPs and MTU
ifconfig ens785f0 192.168.100.2/24 up
ifconfig ens785f0 mtu 9000

# for detailed PCI information about the device
lspci -s 04:00.0 -vvv

  • after reboot, check that the port type was changed to Ethernet
ibdev2netdev

# alternatively, you can make sure Ethernet interfaces are in working order and linked to kernel verbs
ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5


  • pre-test the link and interfaces using iperf
cat iperf.sh
#!/bin/sh
#

PTH=8
TIME=10
SERVER="192.168.1.1"
CLIENT="192.168.1.2"

numactl --cpunodebind=0 iperf -s -t $TIME -P $PTH & \
ssh $CLIENT "numactl --cpunodebind=0 iperf -c $SERVER -t $TIME -P $PTH "

  • check that the required kernel modules are loaded using lsmod (see the check after this list)

ib_uverbs
mlx5_core
mlx5_ib
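
# a quick way to check them in one shot, and to load any that are missing
# (module names as listed above, provided by the MLNX_OFED install):
lsmod | grep -E 'ib_uverbs|mlx5_core|mlx5_ib'
modprobe -a ib_uverbs mlx5_core mlx5_ib   # load them if the grep shows nothing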


  • configure hugepages
vi /etc/sysctl.conf
# 32 x 1GB hugepages = 32GB (assuming a 1GB hugepage size, configured via GRUB below)
vm.nr_hugepages=32

# apply the setting now (the entry in /etc/sysctl.conf makes it persistent)
sysctl -p
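
# to verify the allocation took effect (HugePages_Total should match vm.nr_hugepages,
# Hugepagesize should match your configured page size):
grep Huge /proc/meminfo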

  • download DPDK and enable the MLX5 PMD, librte_pmd_mlx5, in the build config
vi config/common_linuxapp

CONFIG_RTE_LIBRTE_MLX5_PMD=y
CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS=y
CONFIG_RTE_LIBRTE_MLX5_DEBUG=n

CONFIG_RTE_LIBEAL_USE_HPET=y

  • build and install DPDK
make install T=x86_64-native-linuxapp-gcc DESTDIR=/usr/local
# configure environment variables and add them to ~/.bashrc
vi ~/.bashrc
export RTE_SDK=/usr/local/share/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc

# if you encounter 'numa.h' missing error, you need to install libnuma-dev
sudo apt install libnuma-dev

# copy the mlx5 glue shared library into the LD library path
cp x86_64-native-linuxapp-gcc/lib/librte_pmd_mlx5_glue* $(ldconfig -p | grep librte_pmd_mlx5_glue | awk '{print $4}')

# if you will use VPP with DPDK, the shared libraries must be visible when building and running VPP
vi ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

  • configure performance tuning: BIOS configuration

– disable hyperthreading (a quick check from Linux follows this list)
– disable I/O non-posted prefetch
– if you plan to use SR-IOV, enable it in the BIOS as well
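
after rebooting, you can sanity-check from Linux that hyperthreading is really off (expect 1 thread per core):
lscpu | grep -i 'thread(s) per core'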


  • configure performance tuning: 1GB hugepage size and CPU isolation
vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G isolcpus=16-31 nohz_full=16-31 rcu_nocbs=16-31 iommu=pt intel_iommu=on"
update-grub
reboot

vm.nr_hugepages in /etc/sysctl.conf should be adapted accordingly


  • configure performance tuning: NUMA-aware assignment of NIC
lstopo topo.png
numactl -H
lscpu
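
# for example, to see which NUMA node a given port is attached to
# (replace <netdev> with the interface name, as elsewhere in this guide):
cat /sys/class/net/<netdev>/device/numa_node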

  • configure performance tuning: disable pause frame
ethtool -A <netdev> rx off tx off

# turn off auto-negotiation if you want
ethtool -s <netdev> autoneg off
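
# to confirm the settings (replace <netdev> as above):
ethtool -a <netdev>   # pause frame settings
ethtool <netdev>      # link speed and auto-negotiation state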

  • configure performance tuning: CPU frequency policy
# set the CPU frequency scaling governor to 'performance' rather than 'powersave'
cat cpu-freq-policy.sh
#!/bin/bash
#

# R730
declare -a CPUs=("8" "9" "10" "11" "12" "13" "14" "15")

for i in "${CPUs[@]}"
do
    echo performance > /sys/devices/system/cpu/cpu${i}/cpufreq/scaling_governor
done

# or use the cpufrequtils utility
sudo apt install cpufrequtils
sudo cpufreq-set -r -g performance
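
# to verify that every core is now on the 'performance' governor:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c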


  • configure performance tuning: set the Max_Read_Req PCIe parameter to 1K (or another value)
dpdk-devbind -s
setpci -s <NIC BIOS address> 68.w
setpci -s <NIC BIOS address> 68.w=3XXX

if the setpci query above outputs something other than 3XXX, set it to 3XXX (the leading hex digit encodes the max read request size; 3 selects 1K).

even though Mellanox recommends this configuration, I observed a negative effect from it in some cases, so be careful when adopting this optimization. in my experiments, 3XXX (1K) on the R730 and 2XXX (512B) on the R930 showed the best throughput.

lspci | grep Mellanox
lspci -s 81:00.0 -vvv | grep PCIeGen
lspci -s 81:00.0 -vvv | grep MaxReadReq


  • configure performance tuning: interrupt configuration
service irqbalance stop
echo '2,4,6,8,10,12,14' | sudo tee /proc/irq/*/smp_affinity_list

# Set all possible interrupts to a different NUMA node, for example:
echo '6-9' | sudo tee /proc/irq/*/smp_affinity_list

# Set the NIC's interrupts to the same NUMA node, for example:
set_irq_affinity_cpulist.sh 0-1 ethX

# Set the other NICs' interrupts to a different NUMA node, for example:
set_irq_affinity_cpulist.sh 6-9 ethY

this should be done only if the required performance has not been achieved.
Mellanox provides these scripts:
/usr/sbin/set_irq_affinity_cpulist.sh 8 ethN
/usr/sbin/set_irq_affinity_bynode.sh 1 enp65s0f0
/usr/sbin/set_irq_affinity_bynode.sh 1 enp65s0f1
/usr/sbin/set_irq_affinity_bynode.sh 1 enp67s0f0
/usr/sbin/set_irq_affinity_bynode.sh 1 enp67s0f1
/usr/sbin/set_irq_affinity_bynode.sh 2 enp129s0f0
/usr/sbin/set_irq_affinity_bynode.sh 2 enp129s0f1
/usr/sbin/set_irq_affinity_bynode.sh 2 enp131s0f0
/usr/sbin/set_irq_affinity_bynode.sh 2 enp131s0f1


  • configure performance tuning: disable kernel memory compaction
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
sysctl -w vm.swappiness=0
sysctl -w vm.zone_reclaim_mode=0
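
# to confirm transparent hugepages are off ('[never]' should be the selected value):
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag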

  • configure performance tuning: aggressive CQE Zipping
mlxconfig -d /dev/mst/mt4115_pciconf0 q | grep CQE
mlxconfig -d <mst device> s CQE_COMPRESSION=1

you can see all MST devices in the /dev/mst directory; if it is empty, run 'service mst start'.
to roll back CQE Zipping to its default value:
mlxconfig -d <mst device> s CQE_COMPRESSION=0


  • run a simple test
cd /usr/local/share/dpdk/examples/l3fwd
vi l3fwd_lpm.c
static struct ipv4_l3fwd_lpm_route ipv4_l3fwd_lpm_route_array[] = {
    {IPv4(1, 1, 1, 0), 24, 0},
    {IPv4(2, 1, 1, 0), 24, 1},
    {IPv4(3, 1, 1, 0), 24, 2},
    {IPv4(4, 1, 1, 0), 24, 3},
    {IPv4(5, 1, 1, 0), 24, 4},
    {IPv4(6, 1, 1, 0), 24, 5},
    {IPv4(7, 1, 1, 0), 24, 6},
    {IPv4(8, 1, 1, 0), 24, 7},
    {IPv4(192, 168, 1, 0), 24, 0},
    {IPv4(192, 168, 2, 0), 24, 1},
    {IPv4(192, 168, 3, 0), 24, 2},
    {IPv4(192, 168, 4, 0), 24, 3},
    {IPv4(192, 168, 5, 0), 24, 4},
    {IPv4(192, 168, 6, 0), 24, 5},
    {IPv4(192, 168, 7, 0), 24, 6},
    {IPv4(192, 168, 8, 0), 24, 7},
};
make
cd build
numactl -H

# simple test with one queue pair per port
./l3fwd -l 13-14,17-18,21-22,25-26,29-30 -n 4 -- -p 0xff --config="(0,0,17),(1,0,21),(2,0,25),(3,0,29),(4,0,18),(5,0,22),(6,0,26),(7,0,30)" -L

# for best performance, use two queue pairs per port
./l3fwd -l 1-2,5-6,9-10,13-14,17-18,21-22,25-26,29-30 -n 4 -- -p 0xff --config="(0,0,1),(0,1,17),(1,0,5),(1,1,21),(2,0,9),(2,1,25),(3,0,13),(3,1,29),(4,0,2),(4,1,18),(5,0,6),(5,1,22),(6,0,10),(6,1,26),(7,0,14),(7,1,30)" -L

# hyperthreading case
./l3fwd -l 45-46,49-50,53-54,57-58,61-62 -n 4 -- \
        -p 0xff --config="(0,0,49),(1,0,53),(2,0,57),(3,0,61),(4,0,50),(5,0,54),(6,0,58),(7,0,62)" -L

./l3fwd -c 0xf -- -p 0x3 --config="(0,0,0),(1,0,1)" -L
./l3fwd -c 0xf -- -p 0x3 --config="(0,0,0),(0,1,1),(1,0,2),(1,1,3)" -L

https://www.dpdk.org/doc/guides/sample_app_ug/l3_forward.html


  • how to retrieve PCI addresses of interfaces for whitelisting
cat whitelist.sh
#!/bin/sh

INTERFACES="enp65s0f0 enp65s0f1 enp66s0f0 enp66s0f1 enp129s0f0 enp129s0f1 enp130s0f0 enp130s0f1"

{
    for intf in $INTERFACES;
    do
        (cd "/sys/class/net/${intf}/device/" && pwd -P);
    done;
} |
sed -n 's,.*/\(.*\),-w \1,p'

# 'dpdk-devbind -s' will show the same information
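
# the script's output can be fed straight into a DPDK application's EAL arguments,
# for example (a sketch only; core list and port mask are placeholders, adapt to your system):
./l3fwd -l 1-2 -n 4 $(sh whitelist.sh) -- -p 0x3 --config="(0,0,1),(1,0,2)" -L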

# the DUT (i.e. R930) should enable IP forwarding
vi /etc/sysctl.conf
net.ipv4.ip_forward=1
# save and exit the editor, then apply
sysctl -p


two aspects should be considered in /etc/default/grub:
– interface naming (net.ifnames=0 biosdevname=0 reverts to the classic ethX names)
– hugepage size

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G isolcpus=2-7 net.ifnames=0 biosdevname=0 iommu=pt intel_iommu=on"

this README mainly follows the instructions in the URLs below.
http://dpdk.org/doc/guides/nics/mlx5.html

Linux Driver Solutions
https://community.mellanox.com/docs/DOC-2287

HowTo Install MLNX_OFED Driver
https://community.mellanox.com/docs/DOC-2688

Mellanox DPDK Quick Start Guide
http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v16.11_1.5.pdf

Getting started with ConnectX-4 100Gb/s Adapter for Linux
https://community.mellanox.com/docs/DOC-2294

Performance Tuning for Mellanox Adapters
https://community.mellanox.com/docs/doc-2489

HowTo Configure SR-IOV for ConnectX-4/ConnectX-5 with KVM (Ethernet)
https://community.mellanox.com/docs/DOC-2386

HowTo Set Virtual Network Attributes on a Virtual Function (SR-IOV)
https://community.mellanox.com/docs/DOC-1123

HowTo Configure QoS over SR-IOV
https://community.mellanox.com/docs/DOC-1480

HowTo Configure Rate Limit per VF for ConnectX-4/ConnectX-5
https://community.mellanox.com/docs/DOC-2565

Ethtool Commands
https://community.mellanox.com/docs/DOC-2813

How to get best performance with NICs on Intel platforms
https://doc.dpdk.org/guides-16.04/linux_gsg/nic_perf_intel_platform.html

Understanding PCIe Configuration for Maximum Performance
https://community.mellanox.com/docs/DOC-2496
