OpenOnload-201805
=================

This is a major feature release that adds new features to OpenOnload, in
particular support for the new X2 adapters from Solarflare, extensions to
TCPDirect and new Onload scalable filter capabilities.  The support for X2
adapters includes the new cut-through PIO (CTPIO) mechanism, which reduces
latency on the send path.

The notes below describe some of the new features, including CTPIO and how it
is used with OpenOnload, TCPDirect and ef_vi.  See the Onload user guide for
full details of new features and associated configuration options.  See the
accompanying ChangeLog for a list of bugs fixed.


Linux distribution support
--------------------------

This package is supported on:
- Red Hat Enterprise Linux 6.7 - 6.9
- Red Hat Enterprise Linux 7.1 - 7.5
- SuSE Linux Enterprise Server 11 sp4
- SuSE Linux Enterprise Server 12 sp2 and sp3
- Canonical Ubuntu Server LTS 16.04, 18.04
- Canonical Ubuntu Server 17.10
- Debian 8 "Jessie"
- Debian 9 "Stretch"
- Linux kernels 3.0 - 4.16.6


CTPIO
-----

CTPIO reduces send latency by moving packets from the PCIe bus to the network
port with minimal delay.  It can be used in three modes:

1) Cut-through: The frame is transmitted onto the network as it is streamed
   across the PCIe bus.  This mode offers the best latency.

2) Store-and-forward: The frame is buffered on the adapter before being
   transmitted onto the network.

3) Store-and-forward with poison disabled: As for (2), except that it is
   guaranteed that frames are never poisoned.  When this mode is enabled on
   any VI, all VIs are placed into store-and-forward mode.

Underrun, poisoning and fallback:

When using cut-through mode, if the frame is not streamed to the adapter at
line rate or faster, then the frame is likely to be poisoned.  This is most
likely to happen if the application thread is interrupted while writing the
frame to the adapter.  In the underrun case, the frame is terminated with an
invalid FCS -- this is referred to as "poisoning" -- and so will be discarded
by the link partner.  Cut-through mode is currently expected to perform well
only on 10G links.

CTPIO may be aborted for other reasons, including timeout while writing a
frame, contention between threads using CTPIO at the same time, and the CPU
writing data to the adapter in the wrong order.

In all of the above failure cases the adapter falls back to sending via the
traditional DMA mechanism, which incurs a latency penalty.  So a valid copy of
the packet is always transmitted, whether the CTPIO operation succeeds or not.

Normally only an underrun in cut-through mode will result in a poisoned frame
being transmitted.  In rare cases it is also possible for a poisoned frame to
be emitted in store-and-forward mode.  If it is necessary to strictly prevent
poisoned packets from reaching the network, then poisoning can be disabled
globally.


CTPIO with OpenOnload
---------------------

When using a suitable adapter, CTPIO is enabled by default in "sf-np" mode.
The operation of CTPIO can be controlled via the following Onload
configuration variables:

  EF_CTPIO:                0 (disable)
                           1 (enable) (default)
                           2 (enable, fail if not available)

  EF_CTPIO_MODE:           ct    (cut-through mode)
                           sf    (store-and-forward mode)
                           sf-np (store-and-forward with no poisoning)

  EF_CTPIO_MAX_FRAME_LEN:  CTPIO is only used for packets whose length does
                           not exceed this value.  Packets exceeding this
                           length are sent using another mechanism (DMA or
                           PIO).

CTPIO bypasses the main adapter datapath, and as a result packets sent by
CTPIO cannot be looped back in hardware.  Consequently, CTPIO is not enabled
by default on interfaces running full-featured firmware.  This behaviour can
be overridden using the EF_CTPIO_SWITCH_BYPASS configuration variable.
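As an illustration (the application name is a placeholder and the settings
shown are examples rather than recommendations), a latency-sensitive
application can be run under Onload with CTPIO required and forced into
store-and-forward mode as follows:

  # Placeholder application; EF_CTPIO=2 requires CTPIO, and EF_CTPIO_MODE=sf
  # selects store-and-forward mode.
  EF_CTPIO=2 EF_CTPIO_MODE=sf onload ./trading_app

With EF_CTPIO=2, the absence of CTPIO support is treated as a failure rather
than being silently ignored, which makes configuration mistakes visible
immediately.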
CTPIO with TCPDirect
--------------------

When using a suitable adapter, CTPIO is enabled by default in "sf-np" mode.
The operation of CTPIO can be controlled via the following TCPDirect
attributes:

  ctpio:       0 (disable)
               1 (enable) (default)
               2 (enable, warn if not available)
               3 (enable, fail if not available)

  ctpio_mode:  ct    (cut-through mode)
               sf    (store-and-forward mode)
               sf-np (store-and-forward with no poisoning)
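One way to apply these attributes is through the ZF_ATTR environment variable;
for example (the interface name and application are placeholders), to request
cut-through mode and fail if CTPIO is not available:

  # Placeholder interface and application; ctpio=3 fails if CTPIO is
  # unavailable rather than falling back silently.
  ZF_ATTR="interface=eth2;ctpio=3;ctpio_mode=ct" ./zf_app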
CTPIO with ef_vi
----------------

A complete example showing the use of CTPIO with ef_vi is given in
openonload/src/tests/rtt/rtt_efvi.c.

N.B. CTPIO bypasses the main adapter datapath, and as a result does not
support checksum offload: frames sent with CTPIO are placed on the wire
without modification, so checksum fields must be completed in software before
sending.  For the same reason, frames sent with CTPIO cannot be looped back by
the adapter.

Here is the sequence of steps needed:

1) When allocating a VI (ef_vi_alloc_from_pd()), set the EF_VI_TX_CTPIO flag.

2) To initiate a send, form a complete Ethernet frame (including L3/L4
   checksums, but excluding FCS) in host memory.  Initiate the send with
   ef_vi_transmit_ctpio() or ef_vi_transmitv_ctpio().

3) Post a fall-back descriptor using ef_vi_transmit_ctpio_fallback() or
   ef_vi_transmitv_ctpio_fallback().  These calls are used just like the
   standard DMA send calls (ef_vi_transmit() etc.), and so must be provided
   with a copy of the frame in registered memory.

The posting of a fall-back descriptor is not on the latency critical path,
provided the CTPIO operation succeeds.  However, it must be posted before
posting any further sends on the same VI.


CTPIO diagnostics
-----------------

The adapter maintains counters that show whether CTPIO is being used, and any
reasons for CTPIO sends failing.  These can be inspected as follows:

  ethtool -S ethX | grep ctpio

Note that some of these counters are maintained on a per-adapter basis,
whereas others are per network port.


Delegated sends in TCPDirect: technology preview
------------------------------------------------

This release adds new API functions to TCPDirect that allow the user to have
TCPDirect handle the TCP state machine while performing critical sends through
another mechanism (such as ef_vi) to achieve lower latency.  This is analogous
both in principle and in the arrangement of the API to the delegated sends
feature of Onload.

The new API is offered as a technology preview: it is functionally complete
but has had limited testing, and the API may change if we find problems.
Customers are invited to experiment with the feature and to direct feedback to
support@solarflare.com.

Zockets are created through the TCPDirect API as normal.  The user can then
request that TCPDirect delegate sending to their application.  Once the
application has completed a send through (for example) ef_vi, it updates
TCPDirect, and TCPDirect will handle the TCP state machinery, retransmissions,
and so on.

There is an example application at

  /src/tests/trade_sim/trader_tcpdirect_ds_efvi.c

Please see the README file in the same directory for further details.

The delegated sends API should be used carefully, as there are windows of time
(while the send has been delegated to the application) when the TCP sequence
numbers used by TCPDirect (for ACKs) and by the application (for messages)
differ.  The feature has been implemented in a way that should make this
harmless, but it is important to ensure that the API is used correctly, as
incorrect use has the potential to confuse other TCP implementations.


Onload scalable filters
-----------------------

A major overhaul to clustering and scalable filters enables new data center
use cases.  Scalable filters direct all traffic for a given interface to a
single Onload VI, so that the hardware filter table size is not a constraint
on the number of concurrent connections that can be supported efficiently.
This release allows scalable filters to be combined for both passive- and
active-open connections, and with RSS, enabling very high transaction rates
for use cases such as a reverse web proxy.  The extended feature requires
modern CPUs that support the CLMUL instruction.

See the user guide on how to work with the new scalable filter modes using new
and existing Onload configuration options, including:

  EF_SCALABLE_FILTERS: any=rss:active:passive
      This option has been extended to support active-open sockets, and to
      support multiple interfaces.

  EF_SCALABLE_LISTEN_MODE: 1
      This option avoids the performance limitations of kernel listening
      sockets when there is a very large number of listening sockets.

The most effective way to use scalable filters is with a dedicated VI created
on a MACVLAN interface.  The new inject_kernel_gid module option controls the
injection of packets not handled by Onload back to the kernel when the VI is
instead shared with other functions.
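As an illustration (interface names, the address and the application are
placeholders), a dedicated MACVLAN interface can be created with standard
iproute2 commands and the application then run under Onload with scalable
filters enabled on that interface:

  # Create a dedicated MACVLAN interface on top of the Solarflare port (eth2).
  ip link add link eth2 name scale0 type macvlan
  ip addr add 192.0.2.10/24 dev scale0
  ip link set scale0 up
  # Run the (placeholder) application with scalable filters on that interface.
  EF_SCALABLE_FILTERS="scale0=rss:active:passive" onload ./proxy_app

See the user guide for the full EF_SCALABLE_FILTERS syntax and the
combinations of modes that are supported.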
NGINX application support
-------------------------

Onload's clustering capability can support the NGINX application's master
process pattern and hot restart operation by correctly associating Onload
stacks and application worker processes, e.g.:

  EF_SCALABLE_FILTERS_ENABLE: 2
  EF_CLUSTER_HOT_RESTART: 1

See the user guide for full details.  The Onload runtime profile
"nginx_reverse_proxy.opf" has an example set of relevant configuration
settings.  Use of this mode is not compatible with use of the Onload
Extensions stackname API.


Improvements to socket caching and EF_TCP_SHARED_LOCAL_PORTS
------------------------------------------------------------

Applications such as web proxies that create and close large numbers of
sockets can benefit significantly from Onload's lower overhead, compared to
the kernel, in handling the socket lifecycle.  With these use cases in mind,
some configuration options have been added to the EF_TCP_SHARED_LOCAL_PORTS
feature: EF_TCP_SHARED_LOCAL_PORTS_REUSE_FAST allows the pool of shared local
ports to be recycled more rapidly, and EF_TCP_SHARED_LOCAL_PORTS_PER_IP
enables shared local ports to be reserved dynamically for the IP addresses
that require them, avoiding exhaustion of the ephemeral port range.

Similarly, socket caching has been extended to improve the support for such
applications.  Applications using many listening sockets with scalable filters
can now use a common cache of sockets accepted from them, improving
utilization of the cache.  Also, when a listening socket is used by multiple
processes simultaneously, file descriptors can now be cached per-process,
whereas before, accepted sockets were cacheable only in the process that
originated them.  This is of particular benefit to server applications such as
NGINX that support dynamic reconfiguration by spawning a new process reusing
existing listening sockets.  This feature is not compatible with sockets that
require O_CLOEXEC.

The default value for EF_PER_SOCKET_CACHE_MAX has changed from 0 to -1, but
its significance is unchanged: the default is still that there is no
per-listen-socket limit to the number of cached accepted sockets.  A value of
0 now means that no accepted sockets will be cached for any listening sockets;
this allows active-open socket caching to be enabled without also enabling
passive-open socket caching.

Please see the User Guide for full details of these features.


Changes to EF_UL_EPOLL=3
------------------------

EF_UL_EPOLL=3 allows epoll_wait() to perform well on epoll sets that contain a
large number of accelerated sockets.  Previously, each socket could gain this
scalability benefit in at most one epoll set at a time.  Now, this is possible
in up to four epoll sets for any given socket, provided that each epoll set is
in a different process.


Other new Onload configuration options
--------------------------------------

In addition to those already mentioned, this release of Onload introduces the
following new environment variables:

- EF_KERNEL_PACKETS_BATCH_SIZE
- EF_KERNEL_PACKETS_TIMER_USEC
  Control how often packets destined for the kernel are forwarded there when
  kernel packet injection is enabled using the inject_kernel_gid module
  option.

- EF_PERIODIC_TIMER_CPU
  Specifies CPU affinity for periodic kernel tasks.

- EF_PREALLOC_PACKETS
  Allocates the maximum number of packet buffers when the stack is created.

- EF_SCALABLE_ACTIVE_WILDS_NEED_FILTER
  Normally this option is set automatically according to the scalable-filter
  mode, but it may be used manually to override whether TCP shared local
  ports use IP hardware filters.

- EF_SLEEP_SPIN_USEC
  Sleeps periodically during epoll_wait() after the initial spin timeout has
  elapsed.

- EF_SYNC_CPLANE_AT_CREATE
  Controls how Onload queries network interface configuration when creating
  stacks.

- EF_TCP_MIN_CWND
  Sets the minimum size of the congestion window for TCP connections.

- EF_TCP_SHARED_LOCAL_PORTS_NO_FALLBACK
  Causes connect() to fail for TCP sockets if no shared local ports are
  available.

Finally, the default value, but not the default behaviour, of the
EF_PER_SOCKET_CACHE_MAX variable has changed: please see the section above on
socket caching.

Please see the User Guide for more details.


32-bit kernels
--------------

This release drops support for 32-bit kernels.  32-bit userspace applications
are still supported.


CONFIG_KALLSYMS
---------------

The CONFIG_KALLSYMS kernel configuration option is required for Onload to
operate correctly.  The dependency of Onload on this option is not new in this
release, but very recent kernels -- namely, those containing the commit
21d375b6b34ff511a507de27bf316b3dde6938d9, which removes the x86
syscall-dispatch fast path -- will be unable to load the Onload driver when
CONFIG_KALLSYMS is not defined.  All of the Linux distributions listed at the
start of this file set this option in their default kernel configurations and
so are not affected.


Monitoring tools
----------------

The 'orm_json' tool for monitoring Onload stacks and reporting statistics has
been improved to include more comprehensive statistics and further details of
Onload stacks' status.  To ease integration with standard performance
monitoring tools, the new version features an additional, more
machine-friendly output format, as well as the facility to generate metadata
for the reported statistics.
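For example, assuming orm_json has been built and installed on the PATH and
that at least one Onload stack is running, the JSON output can be
pretty-printed with a standard tool for inspection:

  # Dump statistics for the running Onload stacks and pretty-print them.
  orm_json | python -m json.tool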