OpenOnload-201805
=================

This is a major feature release that adds new features to OpenOnload, in
particular support for the new X2 adapters from Solarflare, extensions to
TCPDirect and new Onload scalable filter capabilities.  The support for X2
adapters includes the new cut-through PIO (CTPIO) mechanism, which reduces
latency on the send path.

The notes below describe some of the new features, including CTPIO and how it
is used with OpenOnload, TCPDirect and ef_vi.  See the Onload user guide for
full details of new features and associated configuration options.  See the
accompanying ChangeLog for a list of bugs fixed.


Linux distribution support
--------------------------

This package is supported on:
- Red Hat Enterprise Linux 6.7 - 6.9
- Red Hat Enterprise Linux 7.1 - 7.5
- SuSE Linux Enterprise Server 11 sp4
- SuSE Linux Enterprise Server 12 sp2 and sp3
- Canonical Ubuntu Server LTS 16.04, 18.04
- Canonical Ubuntu Server 17.10
- Debian 8 "Jessie"
- Debian 9 "Stretch"
- Linux kernels 3.0 - 4.16.6


CTPIO
-----

CTPIO reduces send latency by moving packets from the PCIe bus to the network
port with minimal delay.  It can be used in three modes:

1) Cut-through: The frame is transmitted onto the network as it is streamed
   across the PCIe bus.  This mode offers the best latency.

2) Store-and-forward: The frame is buffered on the adapter before being
   transmitted onto the network.

3) Store-and-forward with poison disabled: As for (2), except that it is
   guaranteed that frames are never poisoned.  When this mode is enabled on
   any VI, all VIs are placed into store-and-forward mode.

Underrun, poisoning and fallback:

When using cut-through mode, if the frame is not streamed to the adapter at
line rate or faster, then the frame is likely to be poisoned.  This is most
likely to happen if the application thread is interrupted while writing the
frame to the adapter.  In the underrun case, the frame is terminated with an
invalid FCS -- this is referred to as "poisoning" -- and so will be discarded
by the link partner.  Cut-through mode is currently expected to perform well
only on 10G links.

CTPIO may be aborted for other reasons, including timeout while writing a
frame, contention between threads using CTPIO at the same time, and the CPU
writing data to the adapter in the wrong order.

In all of the above failure cases the adapter falls back to sending via the
traditional DMA mechanism, which incurs a latency penalty.  So a valid copy of
the packet is always transmitted, whether the CTPIO operation succeeds or not.

Normally only an underrun in cut-through mode will result in a poisoned frame
being transmitted.  In rare cases it is also possible for a poisoned frame to
be emitted in store-and-forward mode.  If it is necessary to strictly prevent
poisoned packets from reaching the network, then poisoning can be disabled
globally.


CTPIO with OpenOnload
---------------------

When using a suitable adapter, CTPIO is enabled by default in "sf-np" mode.
The operation of CTPIO can be controlled via the following Onload
configuration variables:

  EF_CTPIO:                0 (disable)
                           1 (enable) (default)
                           2 (enable, fail if not available)

  EF_CTPIO_MODE:           ct    (cut-through mode)
                           sf    (store-and-forward mode)
                           sf-np (store-and-forward with no poisoning)

  EF_CTPIO_MAX_FRAME_LEN:  CTPIO is only used for packets whose length does
                           not exceed this value.  Packets exceeding this
                           length are sent using another mechanism (DMA or
                           PIO).

CTPIO bypasses the main adapter datapath, and as a result packets sent by
CTPIO cannot be looped back in hardware.  Consequently, CTPIO is not enabled
by default on interfaces running full-featured firmware.  This behaviour can
be overridden using the EF_CTPIO_SWITCH_BYPASS configuration variable.
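As an illustration (the application name is a placeholder and the settings
shown are examples rather than recommendations), a latency-sensitive
application can be run under Onload with CTPIO required and forced into
store-and-forward mode as follows:

  # Placeholder application; EF_CTPIO=2 requires CTPIO, and EF_CTPIO_MODE=sf
  # selects store-and-forward mode.
  EF_CTPIO=2 EF_CTPIO_MODE=sf onload ./trading_app

With EF_CTPIO=2, the absence of CTPIO support is treated as a failure rather
than being silently ignored, which makes configuration mistakes visible
immediately.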
CTPIO with TCPDirect
--------------------

When using a suitable adapter, CTPIO is enabled by default in "sf-np" mode.
The operation of CTPIO can be controlled via the following TCPDirect
attributes:

  ctpio:       0 (disable)
               1 (enable) (default)
               2 (enable, warn if not available)
               3 (enable, fail if not available)

  ctpio_mode:  ct    (cut-through mode)
               sf    (store-and-forward mode)
               sf-np (store-and-forward with no poisoning)
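One way to apply these attributes is through the ZF_ATTR environment variable;
for example (the interface name and application are placeholders), to request
cut-through mode and fail if CTPIO is not available:

  # Placeholder interface and application; ctpio=3 fails if CTPIO is
  # unavailable rather than falling back silently.
  ZF_ATTR="interface=eth2;ctpio=3;ctpio_mode=ct" ./zf_app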
CTPIO with ef_vi
----------------

A complete example showing the use of CTPIO with ef_vi is given in
openonload/src/tests/rtt/rtt_efvi.c.

N.B. CTPIO bypasses the main adapter datapath, and as a result does not
support checksum offload: frames sent with CTPIO are placed on the wire
without modification, so checksum fields must be completed in software before
sending.  For the same reason, frames sent with CTPIO cannot be looped back by
the adapter.

Here is the sequence of steps needed:

1) When allocating a VI (ef_vi_alloc_from_pd()), set the EF_VI_TX_CTPIO flag.

2) To initiate a send, form a complete Ethernet frame (including L3/L4
   checksums, but excluding FCS) in host memory.  Initiate the send with
   ef_vi_transmit_ctpio() or ef_vi_transmitv_ctpio().

3) Post a fall-back descriptor using ef_vi_transmit_ctpio_fallback() or
   ef_vi_transmitv_ctpio_fallback().  These calls are used just like the
   standard DMA send calls (ef_vi_transmit() etc.), and so must be provided
   with a copy of the frame in registered memory.

The posting of a fall-back descriptor is not on the latency critical path,
provided the CTPIO operation succeeds.  However, it must be posted before
posting any further sends on the same VI.


CTPIO diagnostics
-----------------

The adapter maintains counters that show whether CTPIO is being used, and any
reasons for CTPIO sends failing.  These can be inspected as follows:

  ethtool -S ethX | grep ctpio

Note that some of these counters are maintained on a per-adapter basis,
whereas others are per network port.


Delegated sends in TCPDirect: technology preview
------------------------------------------------

This release adds new API functions to TCPDirect that allow the user to have
TCPDirect handle the TCP state machine while performing critical sends through
another mechanism (such as ef_vi) to achieve lower latency.  This is analogous
both in principle and in the arrangement of the API to the delegated sends
feature of Onload.

The new API is offered as a technology preview: it is functionally complete
but has had limited testing, and the API may change if we find problems.
Customers are invited to experiment with the feature and to direct feedback to
support@solarflare.com.

Zockets are created through the TCPDirect API as normal.  The user can then
request that TCPDirect delegate sending to their application.  Once the
application has completed a send through (for example) ef_vi, it updates
TCPDirect, and TCPDirect will handle the TCP state machinery, retransmissions,
and so on.

There is an example application at

  /src/tests/trade_sim/trader_tcpdirect_ds_efvi.c

Please see the README file in the same directory for further details.

The delegated sends API should be used carefully, as there are windows of time
(while the send has been delegated to the application) when the TCP sequence
numbers used by TCPDirect (for ACKs) and by the application (for messages)
differ.  The feature has been implemented in a way that should make this
harmless, but it is important to ensure that the API is used correctly, as
incorrect use has the potential to confuse other TCP implementations.


Onload scalable filters
-----------------------

A major overhaul to clustering and scalable filters enables new data center
use cases.  Scalable filters direct all traffic for a given interface to a
single Onload VI, so that the hardware filter table size is not a constraint
on the number of concurrent connections that can be supported efficiently.
This release allows scalable filters to be combined for both passive- and
active-open connections, and with RSS, enabling very high transaction rates
for use cases such as a reverse web proxy.  The extended feature requires
modern CPUs that support the CLMUL instruction.

See the user guide on how to work with the new scalable filter modes using new
and existing Onload configuration options, including:

  EF_SCALABLE_FILTERS: any=rss:active:passive
      This option has been extended to support active-open sockets, and to
      support multiple interfaces.

  EF_SCALABLE_LISTEN_MODE: 1
      This option avoids the performance limitations of kernel listening
      sockets when there is a very large number of listening sockets.

The most effective way to use scalable filters is with a dedicated VI created
on a MACVLAN interface.  The new inject_kernel_gid module option controls the
injection of packets not handled by Onload back to the kernel when the VI is
instead shared with other functions.
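As an illustration (interface names, the address and the application are
placeholders), a dedicated MACVLAN interface can be created with standard
iproute2 commands and the application then run under Onload with scalable
filters enabled on that interface:

  # Create a dedicated MACVLAN interface on top of the Solarflare port (eth2).
  ip link add link eth2 name scale0 type macvlan
  ip addr add 192.0.2.10/24 dev scale0
  ip link set scale0 up
  # Run the (placeholder) application with scalable filters on that interface.
  EF_SCALABLE_FILTERS="scale0=rss:active:passive" onload ./proxy_app

See the user guide for the full EF_SCALABLE_FILTERS syntax and the
combinations of modes that are supported.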
NGINX application support
-------------------------

Onload's clustering capability can support the NGINX application's master
process pattern and hot restart operation by correctly associating Onload
stacks and application worker processes, e.g.:

  EF_SCALABLE_FILTERS_ENABLE: 2
  EF_CLUSTER_HOT_RESTART: 1

See the user guide for full details.  The Onload runtime profile
"nginx_reverse_proxy.opf" has an example set of relevant configuration
settings.  Use of this mode is not compatible with use of the Onload
Extensions stackname API.


Improvements to socket caching and EF_TCP_SHARED_LOCAL_PORTS
------------------------------------------------------------

Applications such as web proxies that create and close large numbers of
sockets can benefit significantly from Onload's lower overhead, compared to
the kernel, in handling the socket lifecycle.  With these use cases in mind,
some configuration options have been added to the EF_TCP_SHARED_LOCAL_PORTS
feature: EF_TCP_SHARED_LOCAL_PORTS_REUSE_FAST allows the pool of shared local
ports to be recycled more rapidly, and EF_TCP_SHARED_LOCAL_PORTS_PER_IP
enables shared local ports to be reserved dynamically for the IP addresses
that require them, avoiding exhaustion of the ephemeral port range.

Similarly, socket caching has been extended to improve the support for such
applications.  Applications using many listening sockets with scalable filters
can now use a common cache of sockets accepted from them, improving
utilization of the cache.  Also, when a listening socket is used by multiple
processes simultaneously, file descriptors can now be cached per-process,
whereas before, accepted sockets were cacheable only in the process that
originated them.  This is of particular benefit to server applications such as
NGINX that support dynamic reconfiguration by spawning a new process reusing
existing listening sockets.  This feature is not compatible with sockets that
require O_CLOEXEC.

The default value for EF_PER_SOCKET_CACHE_MAX has changed from 0 to -1, but
its significance is unchanged: the default is still that there is no
per-listen-socket limit to the number of cached accepted sockets.  A value of
0 now means that no accepted sockets will be cached for any listening sockets;
this allows active-open socket caching to be enabled without also enabling
passive-open socket caching.

Please see the User Guide for full details of these features.


Changes to EF_UL_EPOLL=3
------------------------

EF_UL_EPOLL=3 allows epoll_wait() to perform well on epoll sets that contain a
large number of accelerated sockets.  Previously, each socket could gain this
scalability benefit in at most one epoll set at a time.  Now, this is possible
in up to four epoll sets for any given socket, provided that each epoll set is
in a different process.


Other new Onload configuration options
--------------------------------------

In addition to those already mentioned, this release of Onload introduces the
following new environment variables:

- EF_KERNEL_PACKETS_BATCH_SIZE
- EF_KERNEL_PACKETS_TIMER_USEC
  Control how often packets destined for the kernel are forwarded there when
  kernel packet injection is enabled using the inject_kernel_gid module
  option.

- EF_PERIODIC_TIMER_CPU
  Specifies CPU affinity for periodic kernel tasks.

- EF_PREALLOC_PACKETS
  Allocates the maximum number of packet buffers when the stack is created.

- EF_SCALABLE_ACTIVE_WILDS_NEED_FILTER
  Normally this option is set automatically according to the scalable-filter
  mode, but it may be used manually to override whether TCP shared local
  ports use IP hardware filters.

- EF_SLEEP_SPIN_USEC
  Sleeps periodically during epoll_wait() after the initial spin timeout has
  elapsed.

- EF_SYNC_CPLANE_AT_CREATE
  Controls how Onload queries network interface configuration when creating
  stacks.

- EF_TCP_MIN_CWND
  Sets the minimum size of the congestion window for TCP connections.

- EF_TCP_SHARED_LOCAL_PORTS_NO_FALLBACK
  Causes connect() to fail for TCP sockets if no shared local ports are
  available.

Finally, the default value, but not the default behaviour, of the
EF_PER_SOCKET_CACHE_MAX variable has changed: please see the section above on
socket caching.

Please see the User Guide for more details.


32-bit kernels
--------------

This release drops support for 32-bit kernels.  32-bit userspace applications
are still supported.


CONFIG_KALLSYMS
---------------

The CONFIG_KALLSYMS kernel configuration option is required for Onload to
operate correctly.  The dependency of Onload on this option is not new in this
release, but very recent kernels -- namely, those containing the commit
21d375b6b34ff511a507de27bf316b3dde6938d9, which removes the x86
syscall-dispatch fast path -- will be unable to load the Onload driver when
CONFIG_KALLSYMS is not defined.  All of the Linux distributions listed at the
start of this file set this option in their default kernel configurations and
so are not affected.


Monitoring tools
----------------

The 'orm_json' tool for monitoring Onload stacks and reporting statistics has
been improved to include more comprehensive statistics and further details of
Onload stacks' status.  To ease integration with standard performance
monitoring tools, the new version features an additional, more
machine-friendly output format, as well as the facility to generate metadata
for the reported statistics.
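For example, assuming orm_json has been built and installed on the PATH and
that at least one Onload stack is running, the JSON output can be
pretty-printed with a standard tool for inspection:

  # Dump statistics for the running Onload stacks and pretty-print them.
  orm_json | python -m json.tool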