(c) Copyright 2019-2020 Xilinx, Inc. All rights reserved.
This file contains confidential and proprietary information of Xilinx, Inc.
and is protected under U.S. and international copyright and other
intellectual property laws.

DISCLAIMER
This disclaimer is not a license and does not grant any rights to the
materials distributed herewith. Except as otherwise provided in a valid
license issued to you by Xilinx, and to the maximum extent permitted by
applicable law: (1) THESE MATERIALS ARE MADE AVAILABLE "AS IS" AND WITH ALL
FAULTS, AND XILINX HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS,
IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE;
and (2) Xilinx shall not be liable (whether in contract or tort, including
negligence, or under any other theory of liability) for any loss or damage
of any kind or nature related to, arising under or in connection with these
materials, including for any direct, or any indirect, special, incidental,
or consequential loss or damage (including loss of data, profits, goodwill,
or any type of loss or damage suffered as a result of any action brought by
a third party) even if such damage or loss was reasonably foreseeable or
Xilinx had been advised of the possibility of the same.

CRITICAL APPLICATIONS
Xilinx products are not designed or intended to be fail-safe, or for use in
any application requiring fail-safe performance, such as life-support or
safety devices or systems, Class III medical devices, nuclear facilities,
applications related to the deployment of airbags, or any other
applications that could lead to death, personal injury, or severe property
or environmental damage (individually and collectively, "Critical
Applications"). Customer assumes the sole risk and liability of any use of
Xilinx products in Critical Applications, subject only to applicable laws
and regulations governing limitations on product liability.

THIS COPYRIGHT NOTICE AND DISCLAIMER MUST BE RETAINED AS PART OF THIS FILE
AT ALL TIMES

==============================================================
README for Onload Operator for Kubernetes and OpenShift v2.0.0
==============================================================

This software package is the v2.0.0 release of the Onload Operator for
using Onload with OpenShift or Kubernetes/Calico.

See the Software_License_Agreement_for_Onload_Kubernetes_Operator.doc file
in this distribution for licensing details.

======
Onload
======

Onload is Xilinx's application acceleration platform for cloud data
centers. Onload software dramatically accelerates and scales
network-intensive applications such as in-memory databases, software load
balancers, and web servers.

=================
Operator features
=================

The Onload Operator automates the deployment of Onload for Kubernetes and
Red Hat OpenShift. It allows for the creation of pods with interfaces that
can run accelerated Onload applications.

There are two use cases:

* OpenShift/Multus: Creates accelerated secondary interfaces on the pod.
* Kubernetes/Calico: Accelerates the primary interface on the pod.

OpenShift/Multus use case
-------------------------

Red Hat OpenShift 4 includes the Multus multi-networking plugin, which is
used to configure secondary networks. One typical use case is deploying an
application network which is separate from the primary management network.
The MACVLAN or IPVLAN networking plugins can then be used to provision
sub-interfaces for each pod.
Onload can accelerate workloads using these secondary networks. Because
Onload bypasses the kernel's MACVLAN or IPVLAN driver, performance is
equivalent to using Onload on the parent interface.

Kubernetes/Calico use case
--------------------------

Onload can seamlessly accelerate applications in a Kubernetes cluster using
the Calico networking plugin.

NOTE: Calico support is currently limited to configurations without an
overlay. Calico network policy is not enforced by Onload.

Functionality
-------------

When a pod is created with an annotation requesting Onload acceleration,
the operator will do three things:

1. Ensure the pod is scheduled on a node that has a suitable Solarflare
   interface.
2. Allocate an accelerated soft interface based on the Solarflare
   interface.
   i.  For the OpenShift multi-network use case, this is a macvlan or
       ipvlan interface created and passed through into the pod using
       Multus.
   ii. For the Kubernetes/Calico use case, this is acceleration of the veth
       interface allocated to the pod by Calico.
3. Install the binaries, libraries and device files required to run Onload
   inside the pod.

============
Requirements
============

The following sections explain the requirements for the two use cases in
more detail.

Onload requirements
-------------------

Onload is compatible with Solarflare XtremeScale X2 network adapters.

Supported Linux Kernels
-----------------------

Xilinx provides prebuilt drivers for a number of recent Red Hat kernels. On
nodes running one of these kernels, the operator will download and install
a suitable driver. For kernels not on this list, we provide the tools to
build a custom driver image for use in your cluster. See the "Building for
custom kernels" section below.

We provide prebuilt drivers for the following kernel versions:

* 3.10.0-957.27.2.el7.x86_64
* 3.10.0-1062.el7.x86_64
* 4.18.0-147.3.1.el8_1.x86_64
* 4.18.0-147.5.1.el8_1.x86_64

Cluster requirements for OpenShift/Multus use case
--------------------------------------------------

In order to install the operator, the user needs an OpenShift cluster
meeting the following requirements:

* OpenShift version >= 4.1
* Cluster must be configured to use the Multus CNI
* Any nodes running onloaded pods must be running RHEL7, RHEL8 or RHCOS8
* At time of writing all RHEL/RHCOS distributions must be patched with the
  included SELinux policy update (see below for details)

Cluster requirements for Kubernetes/Calico use case
---------------------------------------------------

In order to install the operator, the user needs a Kubernetes cluster
meeting the following requirements:

* Kubernetes version >= 1.11.1
* Calico version >= 3.1.3
* Cluster must be configured to use Calico networking with no encapsulation
  on the wire
* Any nodes not running RHEL must have SELinux disabled
* At time of writing all RHEL/RHCOS distributions must be patched with the
  included SELinux policy update (see below for details)

Solarflare adapter firmware and boot configuration
--------------------------------------------------

Before Onload can be installed, please download the Solarflare Linux
Utilities package (SF-107601-LS) and set up your adapter as follows:

- update your Solarflare adapter's firmware to the latest version
- select either full-featured or ultra-low latency firmware for
  compatibility with Onload
- configure the correct boot mode if necessary

SELinux policy update
---------------------

The latest versions of the selinux-policy and container-selinux packages
will allow applications running inside OpenShift containers to run under
Onload. However, as of August 2019 most Linux distributions still package
older versions of these packages which do not include the fix. In the
meantime this package includes an SELinux policy update to allow onloaded
pods.

The fix is distributed as source and must be compiled and then installed on
each node in the cluster that will run accelerated pods. To compile and
install the patch you will need the following packages:

* selinux-policy-devel
* policycoreutils

You will also require the onload.te config file found in the same bundle as
this readme. The steps to apply the policy update are (as root):

# cd /tmp
# cp <path-to-bundle>/onload.te .
# make -f /usr/share/selinux/devel/Makefile
# semodule -i onload.pp

NOTE: If all nodes are running the same distribution and kernel then it is
sufficient to compile on a single node and distribute the onload.pp binary
to each node for installation.

=======================
Installing the operator
=======================

Installing the operator can be split into the following steps:

1. Ensure the operator's container images are available to the cluster
2. Prepare the operator deployment manifests
3. Apply the manifests
4. Verify that the operator is now running

The following sections describe these steps in more detail. Debugging and
uninstallation steps are also described.

Container images (1)
--------------------

This bundle contains several container images in tar form. These must be
loaded into a container registry that is visible to the nodes in the
cluster. The following images are required:

solarflare_communications/cloud-onload-operator:<tag>
solarflare_communications/cloud-onload-driver:<kernel-version>.<onload-version>.<tag>
solarflare_communications/cloud-onload-device-plugin:<tag>
solarflare_communications/cloud-onload-device-plugin-test:<tag>
solarflare_communications/cloud-onload-ocka:<tag>
solarflare_communications/cloud-onload-node-manager:<tag>

NOTES:

1. The tag for the driver image is prefixed with the kernel version of the
   nodes you will be running Onload on and the onload version to use. If
   your cluster contains nodes with different kernel versions, you will
   need a separate driver image for each kernel.
2. The image names and tags in your registry must exactly match the format
   above.

For example, to load the operator image:

# docker load -i cloud-onload-operator-<version>.tar
# docker tag solarflare_communications/cloud-onload-operator:<tag> my-docker-registry/solarflare_communications/cloud-onload-operator:<tag>
# docker push my-docker-registry/solarflare_communications/cloud-onload-operator:<tag>

Prepare the manifests (2)
-------------------------

The bundle containing this readme provides sets of example manifests for
Kubernetes and OpenShift 4 clusters in their respective folders.

0500_operator.yaml:
* Update the cloud-onload-operator image with the URL of your registry (if
  using a custom image registry).

9000_example_cr.yaml:
* Update any fields under "spec" as desired. See the "Customizing the
  operator install" section below for notes on what the fields do.

Apply the manifests (3)
-----------------------

Use the kubectl CLI tool (or oc on OpenShift) to apply the manifests in
order:

# for yaml_spec in manifests/*; do kubectl apply -f $yaml_spec; done

This will create the operator, example CR and pods in the
"cloud-onload-operator" namespace.

Verifying operator installation (4)
-----------------------------------

The kubectl commands above will initially create a single deployment which
runs the main operator daemon. This daemon will create a number of
daemonsets on the cluster to manage operator components. Installation of
these daemonsets will take a few minutes.

To determine when installation has completed, monitor the "Phase" field on
the operator's status as reported by "kubectl get", e.g.:

# kubectl -n cloud-onload-operator get cloud-onload-operator example -o=jsonpath="{.status.phase}{\"\n\"}"

Once this reports "Success", pods can request onload accelerated
interfaces.

Customizing the operator install
--------------------------------

The Custom Resource defined in 9000_example_cr.yaml can be edited prior to
applying the manifests, to customize the install. If customization is
required after installation, simply edit the manifest and re-apply.

The fields that you can set on the CR are as follows:

NOTE: All field names are case sensitive and start with a lower-case
letter. If you get the name or case of a field wrong, Kubernetes will not
warn you, but simply ignore it.

maxPodsPerNode
  Sets the maximum number of onloaded containers that can be run on each
  node.

nodeManagerImageFqin
onloadDriverImageFqin
ockaImageFqin
devicePluginImageFqin
devicePluginTestImageFqin
  These five fields all do a similar thing: they allow you to override the
  docker image to use for each of the different pods that the operator
  manages.
  NOTE: For the OnloadDriver pod only, you can include the template string
  {{kernel}} in the image name. This will evaluate on each node to the
  kernel that node is running.

version
  This field controls the version of Onload to install. Note that if you
  update it you must download or build a cloud-onload-driver image for the
  new version and make it available to the cluster.

onloadDebug
  If set to true, this will load the Onload drivers in debug mode. This has
  a performance impact and so is only recommended for testing or diagnosing
  issues.

nodeSelector
  The operator will attempt to install Onload on all nodes matching this
  selector.

tolerations
  Tolerations to apply to the pods managed by the operator. This key is
  optional, and should be set if any of the nodes on which you want to run
  Cloud Onload have taints that would normally prevent pods being
  scheduled.
  The format of this key should be a list of strings, each in the form
  "<key>=<value>:<effect>" (for a toleration with operator: Equal) or
  "<key>:<effect>" (for a toleration with operator: Exists).

installOcka
  If set to true, the operator will install the OCKA alongside Onload on
  all nodes. The OCKA is an extension to the Onload control plane which
  keeps it in sync with updates to Kubernetes services. It is required to
  accelerate service traffic under Calico, but is not necessary in other
  environments.

nodeReadyTimeoutSec
  This timer starts counting down when the operator starts to install on a
  node. If it expires before installation is complete, the operator will
  report that node as "failed" in its status (see the "Querying operator
  status" section below for details).

postLoadHook
  This field provides a way to run custom commands on each node after
  Onload drivers have been loaded. This may be necessary when a node's
  network subsystem does not automatically configure interfaces when they
  are created (for example, the legacy "network" service). If specified,
  the value should be a single executable which must be available on all
  nodes in the cluster. You may not specify any arguments to the
  executable. The executable will be run for each sfc interface on the
  node, with the interface's name as the sole command-line argument.

Querying operator status
------------------------

The Custom Resource defined in 9000_example_cr.yaml includes a number of
status fields that can be queried. These include:

* currentVersion: The version of Onload currently in use
* operatorImageTag: The version of the operator image currently running
* phase: Reports "Running" while the operator is installing, then "Success"
  once installation has succeeded and it is ready to use
* readyNodeCount: The number of nodes which the operator has successfully
  set up to run Onload pods
* notReadyNodeCount: The number of nodes which the operator is currently in
  the process of setting up
* failedNodeCount: The number of nodes which the operator has tried and
  failed to set up
* failedNodes: A list of nodes on which setup has failed (only present when
  failedNodeCount > 0)
* lastUpdate: Timestamp at which any of the above status fields last
  changed

The failedNodes field deserves particular attention - if a node appears in
this list it is likely to require manual intervention to recover it. See
the next section for hints on diagnosing failures.

Diagnosing operator failures
----------------------------

If the operator doesn't become ready within about 10 minutes of installing
the manifests, something has probably gone wrong. The following steps may
help to diagnose the failure.

First, check whether the main operator pod is running:

# kubectl -n cloud-onload-operator get pods --selector name=cloud-onload-operator

This command should report a single pod which is both running and ready. If
not, "kubectl describe" or "kubectl logs" may explain why:

# kubectl -n cloud-onload-operator describe pod <pod-name>
# kubectl -n cloud-onload-operator logs <pod-name>

If the pod is running but the operator still isn't working, then first
check the CR status with:

# kubectl -n cloud-onload-operator get cloud-onload-operators -o yaml

The "status" section in the output will tell you more about the failure
mode. See the "Querying operator status" section above for details on what
each of the fields means.

Some possible problems you may encounter:

1. Status section is missing or has very few fields: This indicates that
   the main operator pod has hit a failure; run the kubectl logs command
   above to find out why.
2. All the "NodeCount" fields report zero: The "nodeSelector" field on the
   CR is not matching any nodes in your cluster; try updating it.
2b. The "NodeCount" fields are still zero after checking the nodeSelector:
    Your nodes have taints which are preventing the operator from
    scheduling pods; set the "tolerations" field on the CR to match the
    taints.
3. The "failedNodeCount" field is nonzero: The operator has failed to
   configure one or more nodes to run Onload. Look at the "failedNodes"
   field to see which nodes have failed, then read on for more debugging
   hints.

Diagnosing failed nodes
-----------------------

If the operator has failed to install on one or more nodes, then you will
need to examine the pods it has created on that node to determine what has
failed.

The first pod created by the operator on each node is the node-manager;
this pod is responsible for managing all other operator pods on that node.
Query the node manager status with:

# kubectl -n cloud-onload-operator get pods --field-selector spec.nodeName=<node-name> --selector name=cloud-onload-node-manager

If this pod is not running, get more details with:

# kubectl -n cloud-onload-operator describe pod <pod-name>

If it is running, view its logs with:

# kubectl -n cloud-onload-operator logs <pod-name>

If there are no obvious failures in either set of output, then you will
need to repeat this procedure for each of the other operator pods on the
node. To list all the pods, run:

# kubectl -n cloud-onload-operator get pods --field-selector spec.nodeName=<node-name>

On a healthy node you should see the following pods reported, all running
and ready:

* cloud-onload-node-manager
* cloud-onload-driver
* cloud-onload-ocka
* cloud-onload-device-plugin
* cloud-onload-device-plugin-test

NOTE: The cloud-onload-ocka pod will only be present if the installOcka
field is set on the CR manifest.

The above list is ordered; the operator will attempt to create each pod in
turn, waiting until the pod is healthy before creating the next one.
Identify the first pod that is not healthy and examine it further:

# kubectl -n cloud-onload-operator describe pod <pod-name>
# kubectl -n cloud-onload-operator logs <pod-name>

Altering operator settings
--------------------------

The operator has a number of settings that are controlled through
parameters in the spec of the operator CR. See the section "Customizing the
operator install" for details on the available parameters.

Operator settings can be altered while the operator is running by
re-applying the manifest (9000_example_cr.yaml) with modified parameters.
This will cause all operator pods in the cluster to be evicted and replaced
with pods using the new settings. Eviction and replacement of operator pods
will reload the Onload driver on each node and thus interrupt traffic.

The eviction of pods can be controlled using a PodDisruptionBudget (PDB).
Adding a PDB as below will make the update process try to keep the Onload
driver loaded on at least 1 node at any given time.

> apiVersion: policy/v1beta1
> kind: PodDisruptionBudget
> metadata:
>   namespace: cloud-onload-operator
>   name: cloud-onload-node-manager-pdb
> spec:
>   minAvailable: 1
>   selector:
>     matchLabels:
>       name: cloud-onload-node-manager

Increasing minAvailable will keep the driver loaded on more nodes at once.
Increasing it too much, however, may make the update slow or impossible to
perform. The number should be chosen to balance upgrade speed against the
needs of applications using Onload.
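As a reference point, an edited CR spec combining several of the settings
described in "Customizing the operator install" might look like the
following sketch. The values shown are illustrative only, and the exact
formats of fields such as nodeSelector and tolerations should be checked
against the bundled 9000_example_cr.yaml, which remains the authoritative
example.

> apiVersion: solarflare.com/v1beta1
> kind: CloudOnloadOperator
> metadata:
>   name: example
>   namespace: cloud-onload-operator
> spec:
>   version: "<onload-version>"           # the Onload version shipped with this bundle
>   maxPodsPerNode: 100                   # illustrative value
>   onloadDebug: false
>   installOcka: true                     # required to accelerate Calico service traffic
>   nodeSelector:
>     node-role.kubernetes.io/worker: ''  # illustrative label
>   tolerations:
>   - "mykey=myvalue:NoSchedule"          # illustrative "<key>=<value>:<effect>" toleration

Re-applying such a manifest, together with the PDB above, lets a settings
change roll out while keeping the Onload driver loaded on at least one node
at a time.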
Uninstalling the operator
-------------------------

The operator can be uninstalled by deleting all objects created when the
manifests were applied.

# kubectl delete -f 0000_namespace.yaml
# kubectl delete -f 0100_role.yaml
# kubectl delete -f 0200_role_binding.yaml
# kubectl delete -f 0400_crd.yaml

Objects in all other manifests were created inside the
cloud-onload-operator namespace and are automatically deleted with that
namespace. The following command is also safe, but will result in error
messages when attempting to delete those objects twice:

# for yaml_spec in manifests/*; do kubectl delete -f $yaml_spec; done

NOTE: This uninstalls the operator, but does not completely remove Onload
from cluster nodes. A total uninstall may be available in future operator
versions. Kernel modules loaded by the operator will remain loaded until
the next time the node is rebooted, at which point they will not be
reloaded. An uninstall followed by a reboot should bring the node back to
the state prior to installation, with the exception of some libraries left
on the file system.

Network Attachment Definitions and Multus configuration files created in
the section below are not directly associated with the operator and will
not be deleted by the uninstall. NADs can be removed through kubectl:

# kubectl delete network-attachment-definition onload-network

Multus configuration files used to configure onload networks can simply be
deleted from each node.

========================
Creating Multus networks
========================

The OpenShift multi-network use case uses Multus to create accelerated
interfaces inside pods. For the Kubernetes/Calico use case, this section
can be skipped.

The accelerated interfaces must be macvlan or ipvlan interfaces configured
on top of an underlying Solarflare interface. The two behave similarly:
they are both virtual interfaces that use the configured underlying
interface for transmitting and receiving packets, but a macvlan interface
is given a randomly-generated MAC address, while an ipvlan interface uses
the MAC address of its underlying interface.

The Multus network configuration can vary from node to node, which is
useful if you want to assign static IPs to pods, or if the name of the
Solarflare interface to use varies between nodes. The following steps will
create a Multus network that assigns static IPs to each pod. For other
configurations, see the Multus documentation.

NOTE: Multus networks provide a means to communicate using Onload, but are
not specifically associated with Onload. Even without Onload installed, the
steps below will create a viable network; Onload just won't be available to
use on it.

Network Attachment Definition
-----------------------------

In order to create accelerated pod interfaces you must define a Network
Attachment Definition (NAD) in the Kubernetes API. This object specifies
which of the node's interfaces to use for accelerated traffic, and also how
to assign IP addresses to pod interfaces. The NAD is essentially a type of
virtual network to which pod secondary interfaces can be connected.

Create the NAD object in Kubernetes:

# cat << EOF | kubectl apply -f -
# apiVersion: "k8s.cni.cncf.io/v1"
# kind: NetworkAttachmentDefinition
# metadata:
#   name: onload-network
# EOF

NOTES:

1. OpenShift pods are unable to use NADs that exist in a different
   namespace to the pod. Create NADs for each namespace that will have pods
   that use the network defined in the NAD.
2. Onload does not perform intra-node communication with Multus.
   Pods on the same node cannot generally communicate using Onload. (There
   are exceptions, e.g. networks returning traffic to the node with a
   hairpin switch.)

Multus network configuration
----------------------------

The NAD defines the network within Kubernetes, but Multus must additionally
be configured with the properties of that network. This will determine the
attributes given to pod secondary interfaces that connect to the network
and the physical interface that the network will use to communicate with
other nodes.

On each node that will use the network defined in the NAD, write a Multus
config file specifying the properties of this network. The following
example defines a macvlan network:

# mkdir -p /etc/cni/multus/net.d
# cat << EOF > /etc/cni/multus/net.d/onload-network.conf
# {
#   "cniVersion": "0.3.0",
#   "type": "macvlan",
#   "name": "onload-network",
#   "master": "sfc0",
#   "mode": "bridge",
#   "ipam": {
#     "type": "host-local",
#     "subnet": "172.20.0.0/16",
#     "rangeStart": "172.20.10.1",
#     "rangeEnd": "172.20.10.253",
#     "routes": [
#       { "dst": "0.0.0.0/0" }
#     ]
#   }
# }
# EOF

Creation of an ipvlan network is similar. The "type" field should be set to
"ipvlan", and there is no "mode" field. For example:

# mkdir -p /etc/cni/multus/net.d
# cat << EOF > /etc/cni/multus/net.d/onload-network.conf
# {
#   "cniVersion": "0.3.0",
#   "type": "ipvlan",
#   "name": "onload-network",
#   "master": "sfc0",
#   "ipam": {
#     "type": "host-local",
#     "subnet": "172.20.0.0/16",
#     "rangeStart": "172.20.10.1",
#     "rangeEnd": "172.20.10.253",
#     "routes": [
#       { "dst": "0.0.0.0/0" }
#     ]
#   }
# }
# EOF

NOTES:

1. "name" must match the name in the NAD for the NAD to use the interface.
2. "master" specifies the name of the Solarflare interface on the node.
3. "subnet" should be the same on all nodes for a given network, but should
   not overlap between different networks, regardless of type.
4. "rangeStart" and "rangeEnd" should specify subsets of the subnet that do
   not overlap between nodes.

=========================
Creating accelerated pods
=========================

At this point you are ready to create pods that can run Onload. Pods based
on most common images can be accelerated, provided they include glibc >=
2.12-1.212. This does exclude very old or extremely lightweight images,
such as Alpine Linux.

The pods also need to have the following packages installed in order to run
with the onload prefix, although not to accelerate using LD_PRELOAD:

* sed
* gawk
* grep
* which
* kmod
* coreutils

The Custom Resource defined in 9000_example_cr.yaml and applied earlier
causes the operator to consider a node to be eligible to run onloaded pods
if both of the following conditions are met:

* The node has all the labels specified in the "nodeSelector" CR field
* If the node has taints, they are listed in the "tolerations" CR field

Note that if a node meeting the above constraints does not have a
Solarflare NIC, then the operator will still attempt to install on it, but
will fail to do so and report the node as "failed" in its status.

To assist with scheduling, you may wish to use Node Feature Discovery. The
NFD operator automatically detects hardware features and advertises them
using node labels. This would allow you to, for example, install Onload
only on nodes with a Solarflare device present by using the node selector:

feature.node.kubernetes.io/pci-1924.present=true

NOTE: 0x1924 is the PCI Vendor ID assigned to Solarflare NICs.
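For example, assuming the NFD operator is installed and publishing these
PCI labels, and assuming the CR's nodeSelector takes label key/value pairs
as in the bundled example manifest, restricting the operator to
Solarflare-equipped nodes might look like the following sketch:

> spec:
>   nodeSelector:
>     feature.node.kubernetes.io/pci-1924.present: 'true'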
Example benchmark test image
----------------------------

The examples below use an onloadable image that contains the Netperf and
ApacheBench performance benchmarks and other tools. The following
Dockerfile produces the required image.

benchmark.Dockerfile:

> FROM registry.access.redhat.com/ubi8/ubi:8.0
> RUN dnf -y install gcc make net-tools httpd iproute iputils procps-ng kmod which
> ADD https://github.com/HewlettPackard/netperf/archive/netperf-2.7.0.tar.gz /root
> RUN tar -xzf /root/netperf-2.7.0.tar.gz
> RUN netperf-netperf-2.7.0/configure --prefix=/usr
> RUN make install
> CMD ["/bin/bash"]

Build with, e.g.:

# docker build -t benchmark -f benchmark.Dockerfile .

This image should be pushed to a docker registry accessible from the
Kubernetes cluster.

Example benchmark daemonset
---------------------------

This is an example daemonset that will run benchmark test pods on all nodes
that have Solarflare interfaces.

> apiVersion: apps/v1
> kind: DaemonSet
> metadata:
>   name: benchmark
> spec:
>   selector:
>     matchLabels:
>       name: benchmark
>   template:
>     metadata:
>       labels:
>         name: benchmark
>       annotations:
>         k8s.v1.cni.cncf.io/networks: onload-network
>     spec:
>       nodeSelector:
>         node-role.kubernetes.io/worker: ''
>       containers:
>       - name: benchmark
>         image: {{ docker_registry }}/benchmark:latest
>         stdin: true
>         tty: true
>         resources:
>           limits:
>             solarflare.com/sfc: 1

NOTES:

1. Replace "{{ docker_registry }}" with the registry hostname (and :port if
   required).
2. The "annotations" section under "spec/template/metadata" specifies which
   Multus network to use. This is required for the OpenShift multi-network
   use case only; it should be omitted for the Kubernetes/Calico use case.
3. The "resources" section under "containers" indicates that the pod must
   be scheduled on a node with a Solarflare NIC. This is required for
   either use case.
4. Different Kubernetes distributions use different labels to identify
   worker nodes. OpenShift 4 uses node-role.kubernetes.io/worker='' as
   above. Another common arrangement is to set
   node-role.kubernetes.io/compute='true'. Please consult the documentation
   for your Kubernetes distribution.

Running Onload inside pods with OpenShift/Multus
------------------------------------------------

Once you have created some accelerated pods, you should see that each has
two network interfaces:

eth0: the default OpenShift interface
net1: the Onload interface

NOTE: net* is the default name for secondary interfaces created by Multus.
It is not configured by anything in this readme.

Any traffic between the net1 interfaces of two pods can be accelerated
using Onload by either:

1. Prefixing the command with "onload"
2. Running with the environment variable LD_PRELOAD=libonload.so

NOTE: One caveat to the above is that two accelerated pods can only
communicate using Onload if they are running on different nodes. (Onload
bypasses the kernel's macvlan and ipvlan drivers to send traffic directly
to the NIC, so traffic directed at another pod on the same node will not
arrive, unless the port is connected to a switch which is configured to
loop packets back to the host where necessary.)

To run a simple Netperf latency test we select two benchmark pods from the
benchmark daemonset running in the cluster and run an Onload accelerated
TCP_RR Netperf test between them.

First list all pods:

# kubectl get pods

and select two pods from the list.
Then, on pod 1:

* get the IP address of the "net1" Onload interface:

  # kubectl get pod <pod-name> -o yaml

* accelerate netserver by prefixing its command with "onload
  --profile=latency":

  # kubectl exec <pod-name> -- onload --profile=latency netserver -p 4444

and on pod 2:

* accelerate Netperf by prefixing its command with "onload
  --profile=latency":

  # kubectl exec <pod-name> -- onload --profile=latency netperf -p 4444 -H <netserver-ip> -t TCP_RR

Internal tests have shown a transaction rate in the region of 260000
transactions per second. For the purpose of comparison, a non-onloaded run
of the same Netperf test reported a transaction rate in the region of 24000
transactions per second (i.e. 10 times slower!).

Running Onload inside pods with Kubernetes/Calico
-------------------------------------------------

For Calico, any traffic between the default eth0 interfaces of two pods can
be accelerated using Onload by either:

1. Prefixing the command with "onload"
2. Running with the environment variable LD_PRELOAD=libonload.so

NOTE: One caveat to the above is that communication between two pods will
only be accelerated by Onload if they are running on different nodes in
this release. Pods run with onload on the same node will be able to
communicate, but will not actually use onload to do so. This limitation
will be addressed in a future version.

Create the example benchmark daemonset, making the following modifications:

* Remove the annotations section from the spec as we are no longer using
  Multus.
* Modify the nodeSelector to suit your distribution; on base Kubernetes it
  is likely to be node-role.kubernetes.io/compute='true'.

To run a simple Netperf latency test we select two benchmark pods from the
benchmark daemonset running in the cluster and run an Onload accelerated
TCP_RR Netperf test between them.

First list all pods:

# kubectl get pods

and select two pods from the list. Then, on pod 1:

* get the IP address of the pod with:

  # kubectl get pod <pod-name> -ojsonpath='{.status.podIP}{"\n"}'

* accelerate netserver by prefixing its command with "onload
  --profile=latency":

  # kubectl exec <pod-name> -- onload --profile=latency netserver -p 4444

and on pod 2:

* accelerate Netperf by prefixing its command with "onload
  --profile=latency":

  # kubectl exec <pod-name> -- onload --profile=latency netperf -p 4444 -H <netserver-pod-ip> -t TCP_RR

Internal tests have shown transaction rates in the region of 260000
transactions per second. For the purpose of comparison, non-onloaded runs
of the same Netperf test reported a transaction rate in the region of 15000
transactions per second (i.e. 17 times slower!).

================================
Accelerating Kubernetes Services
================================

Onload can be used to accelerate Kubernetes Services in the Calico use
case, provided the OCKA is installed. This is determined by the
"installOcka" flag, which is set to true in the provided
9000_example_cr.yaml manifest. If your cluster does not use Calico, you can
safely set it to false.

Netperf cannot be run over Services as it opens an extra connection on a
random port, while Services only advertise a predetermined set of ports
(usually only one). A different tool is therefore required to test
Services.

Example Nginx test image
------------------------

This test will use Nginx as the server. ApacheBench will be used as the
client process and is included in the benchmark test image.

Nginx requires configuration files as well as the Dockerfile. To build the
image, create the following files in a temporary directory:

index.html:

> Hello World!
default.conf:

> server {
>     listen 80;
>     server_name localhost;
>
>     location = /basic_status {
>         stub_status;
>     }
>
>     location / {
>         root /usr/share/nginx/html;
>         index index.html index.htm;
>     }
>
>     error_page 500 502 503 504 /50x.html;
>     location = /50x.html {
>         root /usr/share/nginx/html;
>     }
> }

nginx.Dockerfile:

> FROM nginx:1.16.0
> RUN mkdir -p /usr/share/nginx/html
> COPY index.html /usr/share/nginx/html
> COPY default.conf /etc/nginx/conf.d/default.conf
>
> RUN apt-get update && \
>     apt-get install -y --no-install-recommends \
>     iproute2=4.9.0-1+deb9u1 \
>     kmod=23-2 \
>     curl=7.52.1-5+deb9u9 \
>     && \
>     apt-get clean && \
>     rm -rf /var/lib/apt/lists/*

Build with:

# docker build -t nginx -f nginx.Dockerfile .

Example Nginx service
---------------------

This is an example service and deployment that will create a service backed
by a single pod running the nginx image:

> apiVersion: v1
> kind: Service
> metadata:
>   name: nginx
> spec:
>   selector:
>     app: nginx
>   ports:
>   - port: 80
> ---
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: nginx-deployment
>   labels:
>     app: nginx
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: nginx
>   template:
>     metadata:
>       labels:
>         app: nginx
>     spec:
>       nodeSelector:
>         node-role.kubernetes.io/compute: 'true'
>       containers:
>       - name: nginx
>         image: {{ docker_registry }}/nginx:latest
>         ports:
>         - containerPort: 80

Replace "{{ docker_registry }}" with the registry hostname (and :port if
required).

This deployment is not onloaded. There is no reason why this deployment
could not be accelerated in a similar way to the client. However, effective
acceleration of Nginx would require tuning that is not in the scope of this
demo, as all service IP resolution functionality is on the client side.

Running the services test
-------------------------

After setting up the Nginx service and the benchmark test image DaemonSet,
the cluster should have an nginx pod running on one node and one benchmark
pod running on each node in the cluster.

Make a note of the service IP address:

# kubectl get service nginx -o=jsonpath='{.spec.clusterIP}{"\n"}'
# 10.100.109.98

Check which node the nginx pod is on and open a terminal on a benchmark pod
on a different node:

# kubectl get pods -o wide
# kubectl exec -it <benchmark-pod-name> bash

Then, on the benchmark pod:

* set some environment variables to tune Onload for this use case
* accelerate ApacheBench by prefixing its command with "onload
  --profile=latency"

For example:

# export EF_TCP_SHARED_LOCAL_PORTS=1
# export EF_TCP_SHARED_LOCAL_PORTS_REUSE_FAST=1
# export EF_UL_EPOLL=3
# export EF_EPOLL_SPIN=1
# onload --profile=latency ab -c 1 -n 10000 http://10.100.109.98:80/

Internal tests have shown request rates in the region of 2600 requests per
second. For the purpose of comparison, a non-onloaded run of the same
ApacheBench test reported a requests-per-second rate in the region of 1000
(i.e. ~2.5 times slower).

NOTES:

1. The results of this test are not directly comparable with results from
   the Netperf tests as the tools used operate differently.
2. Unlike the Netperf tests, onload requires tuning to perform well in this
   test. Without the tuning parameters there is no gain in performance.
3. We have chosen not to accelerate the Nginx server in this test in order
   to demonstrate the performance gain from client and Service
   acceleration. Accelerating Nginx with appropriate tuning parameters
   would give an additional large performance increase.
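As an alternative to exporting the tuning variables in each interactive
session, the same EF_* settings could be baked into the benchmark
DaemonSet's container spec so that every process started in the container
inherits them. A minimal sketch of the relevant fragment, based on the
benchmark daemonset defined earlier (values mirror the exports above):

>       containers:
>       - name: benchmark
>         image: {{ docker_registry }}/benchmark:latest
>         env:
>         - name: EF_TCP_SHARED_LOCAL_PORTS
>           value: "1"
>         - name: EF_TCP_SHARED_LOCAL_PORTS_REUSE_FAST
>           value: "1"
>         - name: EF_UL_EPOLL
>           value: "3"
>         - name: EF_EPOLL_SPIN
>           value: "1"

With the variables set in the pod spec, the ApacheBench command only needs
the "onload --profile=latency" prefix.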
=====================
MetalLB load balancer
=====================

MetalLB is a load-balancer for bare-metal Kubernetes clusters. It uses the
BGP protocol to configure network routers to load-balance external traffic
to Kubernetes services inside the cluster. Please refer to the MetalLB
documentation on how to configure MetalLB.

MetalLB support for Calico, when Calico is also configured to use BGP, is
still experimental. See:

https://metallb.universe.tf/configuration/calico/

for how to configure MetalLB with Calico.

Onload can accelerate MetalLB load-balanced services, but with some
limitations:

1. Onload can only accelerate load-balanced TCP services. UDP services are
   not accelerated.
2. Only one backend per service on a given node is supported.
3. Only "externalTrafficPolicy: Local" is supported. The default "Cluster"
   policy is not accelerated.

Note that in unsupported configurations traffic will still be passed, but
will not be accelerated.

Example of nginx load-balanced service
--------------------------------------

We can expose an onloaded nginx service externally using the following
manifest:

> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: nginx
> spec:
>   selector:
>     matchLabels:
>       app: nginx
>   template:
>     metadata:
>       labels:
>         app: nginx
>     spec:
>       containers:
>       - name: nginx
>         image: nginx:1
>         ports:
>         - name: http
>           containerPort: 80
>         env:
>         - name: LD_PRELOAD
>           value: libonload.so
>         resources:
>           limits:
>             solarflare.com/sfc: 1
>
> ---
> apiVersion: v1
> kind: Service
> metadata:
>   name: nginx
> spec:
>   ports:
>   - name: http
>     port: 80
>     protocol: TCP
>     targetPort: 80
>   selector:
>     app: nginx
>   externalTrafficPolicy: Local
>   type: LoadBalancer

When the above manifest has been applied, Kubernetes will create a
load-balanced service of type LoadBalancer. The MetalLB controller will
automatically notice this service and assign an external IP address from
the pool of addresses that have been assigned to MetalLB. You can check
this by running:

$ kubectl get service
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.1.0.1     <none>        443/TCP        5h
nginx        LoadBalancer   10.1.2.2     198.1.1.1     80:30096/TCP   4h

Onload's OCKA agent will automatically notice services with external IP
addresses and will configure the nodes with pods backing the service to
pass accelerated external traffic to the pods.

In the above example, "externalTrafficPolicy" is set to "Local", which
means that MetalLB will configure the network routers to forward packets
only to nodes with pods backing the service. When traffic arrives at a
node, Kubernetes will forward it only to pods backing the service that are
running on that node.

===========================
Building for custom kernels
===========================

In order to build a docker image containing drivers for a custom kernel
version, you will need three things:

1. A docker image containing the Onload source code, prebuilt userlevel
   libraries, and the tools required to build and install the kernel
   drivers. This image is included as a part of this bundle.
2. A docker image containing the tools required to build the Onload kernel
   drivers.
3. A Dockerfile to build the complete Onload driver image. A template is
   included in this bundle.

Onload library/tools docker image (1)
-------------------------------------

This image is included in the bundle with the name "onload-ul".
Load it with:

# docker load -i onload-ul-<version>.tar

NOTE: This image does not need to be pushed to a registry as it is only
used for the purpose of building other images.

Driver builder docker image (2)
-------------------------------

You will need to build a docker image capable of building drivers for your
kernel. The steps required here will vary between distributions. As a
starting point, here is an example Dockerfile for recent Red Hat kernels
(note that you will need to provide a config file for a yum repo including
these packages):

FROM registry.access.redhat.com/ubi8/ubi:8.0
COPY my-config.repo /etc/yum.repos.d
RUN dnf install -y \
    python2 \
    gcc \
    hostname \
    make \
    dnf-utils \
    elfutils-libelf-devel \
    kernel-4.18.0-80.7.2.el8_0.x86_64 \
    kernel-devel-4.18.0-80.7.2.el8_0.x86_64

Here is an example that would work on Debian 9:

FROM debian:stretch
RUN apt-get update
RUN apt-get install -y \
    gcc \
    python2.7 \
    hostname \
    make \
    elfutils \
    linux-image-4.9.0-9-amd64 \
    linux-headers-4.9.0-9-amd64 \
    python-minimal

NOTE: It is essential that this docker image contains "python2" in PATH.

Onload driver Dockerfile (3)
----------------------------

Xilinx provides a Dockerfile template that can be used to build the Onload
driver image. It is in the same bundle as this readme, named
DriverImageDockerfile.tmpl.

1. Fill in the {{onload_ul_image}} field with the image from (1) above
2. Fill in the {{builder_image}} field with the image from (2) above
3. Fill in the {{kernel_version}} field with your kernel version
4. Build the docker image
5. Tag and push this image to your registry so it is accessible from the
   cluster

Installing the operator with the custom image
---------------------------------------------

Before installing the operator, update the Custom Resource specification
(which is in the 9000_example_cr.yaml manifest if installing manually) to
add the key "onloadDriverImageFqin" to the "spec" section, pointing at the
image you built in (3) above:

apiVersion: solarflare.com/v1beta1
kind: CloudOnloadOperator
metadata:
  name: example
  namespace: cloud-onload-operator
spec:
  version: ...
  onloadDriverImageFqin: "my_registry/custom-driver-image:mytag"
  ...

NOTE: It is still necessary to set the "version" field when using a custom
driver image.

======================
Upgrading the operator
======================

The operator can be upgraded in several ways:

* Existing installations can be updated to use a newer version of Onload.
* Existing installations can be upgraded to support newer kernel versions.
* The operator itself can be upgraded to a new release.

These operations can be done independently, or together, depending on the
requirements of the cluster.

NOTE: Only operator versions from v2.0.0 onwards are capable of being
upgraded to newer releases. The v1.0.0 release of the operator does not
have this functionality and must be uninstalled in order to install v2.0.0.

Upgrading the Onload version used by the operator
-------------------------------------------------

The version of Onload that the operator uses is a parameter on the operator
CR and can be changed. This will cause the operator to deploy new driver
images on every node that it is running on. As with any other change to the
operator CR this can be performed as a rolling update to minimise
disruption; see the "Altering operator settings" section.

If driver images with the new Onload version are not available for the
kernels in use on the cluster then follow the steps in the "Building for
custom kernels" section to build the required images.
Once driver images with the new Onload version and required kernel are
available:

1. Ensure the new driver images are pushed to a Docker registry accessible
   to the cluster.
2. Set up a PodDisruptionBudget (as described in "Altering operator
   settings") if performing a rolling update.
3. Change the Onload version in the operator CR (9000_example_cr.yaml) to
   the new version and re-apply.
4. Wait for the operator to become ready again. It should now be running
   the driver images with the new Onload version.

Upgrading the operator to support new kernels
---------------------------------------------

The driver images that the operator uses are specific to a given kernel. If
updating the kernel version of a node, or adding a node with a new kernel
version to the cluster, a driver image for the current Onload version on
the new kernel is required. If a suitable image is not currently available
then follow the steps in the "Building for custom kernels" section to build
images as required.

Once suitable images are available, make them visible to the cluster as
described in the "Installing the operator" section. The images should be
made available before updating the kernel or adding a new node to the
cluster. The operator will then use the new image automatically when the
node is brought up.

If a node is brought up without a suitable image available then the
operator will not become ready on that node. After the image is built and
made available the operator should pull it and become ready automatically.

Upgrading to new operator releases
----------------------------------

Operator versions from v2.0.0 onwards can be upgraded to a newer release.

To upgrade from v1.0.0 to v2.0.0 onwards, first uninstall the v1.0.0
operator, following the uninstallation instructions for that version, and
then install the v2.0.0 operator following the installation instructions
for that version.

To upgrade from v2.0.0 onwards, follow the installation instructions
provided with the new operator.
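For reference, step 3 of "Upgrading the Onload version used by the
operator" above amounts to editing the version field in the CR and
re-applying it. A minimal sketch follows; the version placeholder and
driver image name are illustrative only and must match images actually
available in your registry:

> apiVersion: solarflare.com/v1beta1
> kind: CloudOnloadOperator
> metadata:
>   name: example
>   namespace: cloud-onload-operator
> spec:
>   version: "<new-onload-version>"    # illustrative placeholder
>   onloadDriverImageFqin: "my_registry/cloud-onload-driver:{{kernel}}.<new-onload-version>.<tag>"    # illustrative

Re-apply with "kubectl apply -f 9000_example_cr.yaml" and monitor the
operator's status phase as described in "Verifying operator installation".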