IPFIX Performance

IPFIX

IPFIX is a standard for recording “flows”: sequences of packets that share the same source IP, destination IP, source port, destination port and protocol. It has a few uses, but probably the most familiar one is tracking usage for billing. The recording is done by a “probe”, which receives the traffic and builds a “flow table”, usually some form of hash map keyed on the flow, with each entry holding additional data such as how many bytes and packets have been transmitted. These flows are then exported by an “exporter”, usually just another process outside the performance-critical “dataplane” (the probe), to a collector which can store and display the information in a useful or pretty manner, such as a web UI. IPFIX usually runs on a copy of the traffic so that it doesn’t hinder the traffic itself, which means we shouldn’t concern ourselves too much with what comes out at the end.
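To make that concrete, here is a minimal sketch in C of what a flow key (the five-tuple) and its flow-table entry might look like; the field names and layout are illustrative rather than taken from either implementation.

#include <stdint.h>

/* The five-tuple that identifies a flow (IPv4 case). */
typedef struct {
  uint32_t src_ip;
  uint32_t dst_ip;
  uint16_t src_port;
  uint16_t dst_port;
  uint8_t  protocol;      /* e.g. 6 = TCP, 17 = UDP */
} flow_key_t;

/* The per-flow state kept in the flow table and updated for every packet
   that matches the key. */
typedef struct {
  uint64_t packets;
  uint64_t bytes;
  uint64_t flow_start;    /* timestamp of the first packet in the flow */
  uint64_t flow_end;      /* timestamp of the most recent packet */
} flow_entry_t;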

So, why am I telling you about IPFIX? My colleagues Asumu and Andy have already implemented it in Snabb, and it already exists as a plugin in VPP. What it does is not too tricky, which makes it a great thing to re-implement as a way of learning VPP: we can test performance and see how things work while hacking on something familiar. There are definitely more features it could have, some of which the stock VPP plugin probably does implement; the Snabb implementation and our VPP plugin are fairly bare-bones, but they do record some basic information and export it.

Measurements!

We’re going to use a server with two network cards connected back to back. This lets us attach a Snabb app we wrote to one end, which sends packets, and have those packets received on the other card by VPP. The network card works with a ring buffer that has a read pointer and a write pointer: we advance the read pointer to consume the next packets, and the card advances the write pointer as it writes newly received packets. The card also keeps a counter of how often it catches up with the read pointer and has to discard incoming packets because we haven’t read the previous ones fast enough (a small sketch of this bookkeeping follows the counter output below). You can see some of the counters, including the rx missed counter showing how many packets we’re dropping by being too slow. This is what it looks like when you ask VPP to show those counters:

TenGigabitEthernet2/0/0            1     up   TenGigabitEthernet2/0/0
  Ethernet address 00:11:22:33:44:55
  Intel 82599
    carrier up full duplex speed 10000 mtu 9216 
    rx queues 1, rx desc 1024, tx queues 2, tx desc 1024
    cpu socket 0

    rx frames ok                                     2398280
    rx bytes ok                                    143896800
    rx missed                                         252816
    extended stats:
      rx good packets                                2398280
      rx good bytes                                143896800
      rx q0packets                                   2398280
      rx q0bytes                                   143896800
      rx size 64 packets                             2658721
      rx total packets                               2658694
      rx total bytes                               159520296
      rx l3 l4 xsum error                            2658762
      rx priority0 dropped                            253924
local0                             0    down  local0
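To make the ring-buffer bookkeeping mentioned above concrete, here is a small, purely illustrative C sketch of a receive ring with a read index, a write index and a missed counter. It is not how the 82599 or its driver is actually implemented; the point is only what the rx missed counter measures.

#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 1024u              /* mirrors "rx desc 1024" above */

struct rx_ring {
  void    *desc[RING_SIZE];          /* received packets */
  uint32_t rd;                       /* advanced by the consumer (VPP) */
  uint32_t wr;                       /* advanced by the NIC as packets arrive */
  uint64_t missed;                   /* what "rx missed" counts: arrivals with
                                        no free slot because we read too slowly */
};

/* NIC side: store a newly received packet, or count it as missed. */
static void ring_write (struct rx_ring *r, void *pkt)
{
  if (r->wr - r->rd == RING_SIZE) {  /* ring full: consumer has fallen behind */
    r->missed++;
    return;
  }
  r->desc[r->wr % RING_SIZE] = pkt;
  r->wr++;
}

/* Consumer side: fetch the next unread packet, if there is one. */
static void *ring_read (struct rx_ring *r)
{
  if (r->rd == r->wr)
    return NULL;                     /* nothing new yet */
  return r->desc[r->rd++ % RING_SIZE];
}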

We need to find the no-drop rate (NDR), which is the highest speed at which we can keep up with the load. When the card shows that we’re dropping packets, we know we’re going too fast and need to back off. We find this speed by performing a binary search: we test the midpoint of the current speed range, and if we drop packets we bisect the lower half, otherwise we bisect the upper half. This is an example of our Snabb app performing that binary search, looking for the NDR:

Applying 5.000000 Gbps of load.
    TX 7440477 packets (7.440477 MPPS), 446428620 bytes (5.000001 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 7440477 packets lost (100.000000%)
Success.
Applying 7.500000 Gbps of load.
    TX 11160650 packets (11.160650 MPPS), 669639000 bytes (7.499957 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 11160650 packets lost (100.000000%)
Failed.
Applying 6.250000 Gbps of load.
    TX 9300576 packets (9.300576 MPPS), 558034560 bytes (6.249987 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 9300576 packets lost (100.000000%)
Failed.
Applying 5.625000 Gbps of load.
    TX 8370519 packets (8.370519 MPPS), 502231140 bytes (5.624989 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8370519 packets lost (100.000000%)
Success.
Applying 5.938000 Gbps of load.
    TX 8836291 packets (8.836291 MPPS), 530177460 bytes (5.937988 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8836291 packets lost (100.000000%)
Failed.
Applying 5.782000 Gbps of load.
    TX 8604158 packets (8.604158 MPPS), 516249480 bytes (5.781994 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8604158 packets lost (100.000000%)
Success.
Applying 5.860000 Gbps of load.
    TX 8720228 packets (8.720228 MPPS), 523213680 bytes (5.859993 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8720228 packets lost (100.000000%)
Success.
Applying 5.899000 Gbps of load.
    TX 8778268 packets (8.778268 MPPS), 526696080 bytes (5.898996 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8778268 packets lost (100.000000%)
Success.
Applying 5.919000 Gbps of load.
    TX 8808026 packets (8.808026 MPPS), 528481560 bytes (5.918993 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8808026 packets lost (100.000000%)
Failed.
Applying 5.909000 Gbps of load.
    TX 8793130 packets (8.793130 MPPS), 527587800 bytes (5.908983 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8793130 packets lost (100.000000%)
Failed.
Applying 5.904000 Gbps of load.
    TX 8785707 packets (8.785707 MPPS), 527142420 bytes (5.903995 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8785707 packets lost (100.000000%)
Success.
Applying 5.907000 Gbps of load.
    TX 8790150 packets (8.790150 MPPS), 527409000 bytes (5.906981 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8790150 packets lost (100.000000%)
Failed.
Applying 5.906000 Gbps of load.                                                                                                                                                                                                   
    TX 8788687 packets (8.788687 MPPS), 527321220 bytes (5.905998 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8788687 packets lost (100.000000%)
Failed.
Applying 5.905000 Gbps of load.
    TX 8787182 packets (8.787182 MPPS), 527230920 bytes (5.904986 Gbps)
    RX 0 packets (0.000000 MPPS), 0 bytes (0.000000 Gbps)
    Loss: 0 ingress drop + 8787182 packets lost (100.000000%)
Success.
5.905

The output shows a specific load being applied, and the success or failed state is based on whether the rx missed counter from VPP shows dropped packets or not. If we apply a load and it fails, we attempt it up to three times; the output of the failed retries has been removed for brevity.
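The bisection logic itself is simple. Here is a minimal, self-contained sketch of it in C; run_at() is a hypothetical stand-in for driving the Snabb load generator and then checking whether VPP’s rx missed counter moved, and the simulated version below just pretends the device keeps up below about 5.9 Gbps.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical measurement: apply `gbps` of load for a fixed period and
   return true if the rx missed counter stayed at zero. The real version
   talks to the Snabb load generator and to VPP; this stub just simulates. */
static bool run_at (double gbps)
{
  return gbps <= 5.905;
}

/* A load level only counts as failed if all three attempts drop packets,
   mirroring the retry behaviour described above. */
static bool passes (double gbps)
{
  for (int attempt = 0; attempt < 3; attempt++)
    if (run_at (gbps))
      return true;
  return false;
}

/* Binary-search the no-drop rate between lo and hi Gbps. */
static double find_ndr (double lo, double hi, double precision)
{
  while (hi - lo > precision) {
    double mid = (lo + hi) / 2.0;
    if (passes (mid))
      lo = mid;        /* kept up at mid: the NDR is at least mid */
    else
      hi = mid;        /* dropped packets: the NDR is below mid */
  }
  return lo;
}

int main (void)
{
  printf ("NDR: %.3f Gbps\n", find_ndr (0.0, 10.0, 0.001));
  return 0;
}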

The sample data is test data which includes a lot of flows; we have six different sample data sets, each of which we test three times so that we can average the results. The sample data sets range from the smallest possible Ethernet frame size, 64 bytes, up to the largest standard Ethernet frame size of 1522 bytes.

Reaching peak performance

The initial results, without any performance tweaking of our VPP plugin, are:

[Figure: initial benchmark results]

The first thing that grabbed my interest was that we weren’t reaching full speed with larger packets, something that shouldn’t be too tricky. As it turns out, we had added some ELOG statements, a lightweight method of logging in VPP. These are pretty useful for getting a good look inside your function so that you can figure out what happens and when. It turns out they’re not as lightweight as I’d have liked: removing them gave us better performance across the board, including reaching line rate (10 Gigabits per second) for frame sizes of 200 bytes and over.
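For context, the logging in question looked roughly like the fragment below, emitted once per packet from inside the node function. This is reconstructed from memory of VPP’s event-logger macros (ELOG_TYPE_DECLARE and ELOG_DATA), so the exact macro usage and where elog_main lives should be treated as approximate rather than authoritative.

/* Once per packet, inside the node function (vm is the vlib_main_t): */
ELOG_TYPE_DECLARE (e) =
{
  .format = "ipfix: flow hit, %d bytes",
  .format_args = "i4",
};
struct { u32 len; } *ed;
ed = ELOG_DATA (&vm->elog_main, e);               /* may live elsewhere in newer VPP */
ed->len = vlib_buffer_length_in_chain (vm, b0);   /* b0 is the current buffer */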

Then my attention turned to the 64- and 100-byte frame sizes. I was wondering how long we were actually taking on average for each packet and how that compared to the stock VPP plugin. It turned out that we were taking around 131 nanoseconds while the stock one takes around 67 nanoseconds. That’s a big difference, and those who read my first networking post will remember I mentioned that you have around 67.2 nanoseconds per packet, so we were definitely exceeding our budget. Taking a closer look, around 103 ns of that went just on looking up in our flow table whether a record already exists or whether it’s a new flow.
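As a reminder of where that 67.2 ns budget comes from: at 10 Gbps, a 64-byte frame plus the 20 bytes of on-wire overhead (preamble, start-of-frame delimiter and inter-frame gap) occupies 672 bit times, which works out to about 14.88 million packets per second.

#include <stdio.h>

int main (void)
{
  double line_rate_bps = 10e9;            /* 10 Gbps */
  double frame_bits    = (64 + 20) * 8;   /* 64-byte frame + preamble/SFD/IFG */
  double pps           = line_rate_bps / frame_bits;

  /* Prints roughly: 14.88 Mpps, 67.2 ns per packet */
  printf ("%.2f Mpps, %.1f ns per packet\n", pps / 1e6, 1e9 / pps);
  return 0;
}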

I decided to look at what the stock plugin does. VPP has a lot of different data types built in, and the stock plugin goes down a different route than us: it uses a vector where the hash of the flow key is the index and the value is an index into a pool. VPP pools are fast blocks of memory for fixed-size objects; you add an object, get back an index, and use that index to fetch it again. We’re using a bihash, a bounded-index hash map, which is pretty handy: the key is the flow key, and the value is the flow entry containing, for example, the number of packets and bytes and the start and end times for the flow. We had a similar problem in Snabb where our hashing algorithm was slow for the large key, and I thought maybe the same thing was occurring here, but in actuality the stock plugin uses the same xxhash algorithm that bihashes use by default and gets good performance with it. So it wasn’t that.
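To illustrate the pool half of that design (the bi-hash is discussed below), here is a hedged sketch of storing flow entries in a VPP pool and addressing them by index. pool_get and pool_elt_at_index are the real vppinfra macros; the flow type and the surrounding code are made up for the example.

#include <string.h>
#include <vppinfra/pool.h>

typedef struct { u64 packets, bytes; } flow_entry_t;   /* simplified */

static flow_entry_t *flow_pool;   /* pool of flow entries */

/* Allocate a fresh entry; the returned index is what the stock plugin keeps
   in its hash-of-flow-key -> index vector. */
static u32 flow_alloc (void)
{
  flow_entry_t *e;
  pool_get (flow_pool, e);        /* grab a free slot (may grow the pool) */
  memset (e, 0, sizeof (*e));
  return e - flow_pool;           /* indices stay valid as the pool grows */
}

/* Update the entry for a packet, given the index found via the hash. */
static void flow_account (u32 index, u32 len)
{
  flow_entry_t *e = pool_elt_at_index (flow_pool, index);
  e->packets += 1;
  e->bytes += len;
}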

The bi-hash works by having a vector of buckets, the number of which is specified when you initialise it. When you add an item, it selects the bucket based on the hash. Each bucket maintains its own cache able to fit a specific number of bi-hash entries (both key and value); the size of the cache is a constant, defined for the various key and value sizes of bi-hash. The more buckets you have, the fewer keys land in each bucket, and thus the larger the share of entries that fit in a cache. The entries themselves are stored on the heap in pages which the bucket manages. When an item isn’t in the bucket’s cache, the bucket performs a linear search over the allocated pages until the item is found. Here’s what the bucket data structure looks like:

typedef struct
{
  union
  {
    struct
    {
      u32 offset;
      u8 linear_search;
      u8 log2_pages;
      u16 cache_lru;
    };
    u64 as_u64;
  };
  BVT (clib_bihash_kv) cache[BIHASH_KVP_CACHE_SIZE];
} BVT (clib_bihash_bucket);

We were only defining 1024 buckets, which meant that we were missing the cache a lot and falling back to slow linear searches. Increasing this value to 32768, we start getting much better performance.
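The bucket count (along with the memory backing the table) is fixed when the bi-hash is initialised, so the change is a one-liner at setup time. Below is a hedged sketch using the 48_8 bi-hash template; the template size, the table name and the memory budget are assumptions for illustration rather than the exact values from our plugin.

#include <vppinfra/bihash_48_8.h>
#include <vppinfra/bihash_template.c>   /* template body, pulled into one .c file */

static clib_bihash_48_8_t flow_table;

static void flow_table_init (void)
{
  /* 32768 buckets instead of 1024: with fewer keys per bucket, far more
     lookups are satisfied from the per-bucket cache instead of falling
     back to searching the backing pages. */
  clib_bihash_init_48_8 (&flow_table, "ipfix flow table",
                         32 << 10 /* buckets */, 64 << 20 /* bytes of memory */);
}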

Now we seem to be outperforming both the stock VPP plugin and Snabb. The improvement over Snabb could be due to a number of things; however, VPP does something rather interesting which may give it the edge. In VPP there is a loop which handles two packets at once: it’s identical to the single-packet loop, but apparently operating on two packets at a time provides performance improvements. In a talk at FOSDEM they claimed it helps, and in some parts of VPP there are loops which handle four packets per iteration. I think that exploring why we’re seeing the VPP plugin perform better than Snabb would be an interesting endeavour for the future.
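For a flavour of that pattern, here is a heavily abridged skeleton of a VPP node’s dispatch loop: a two-packets-per-iteration loop with prefetching of the next pair, followed by a single-packet loop for whatever is left. The structure is the standard idiom, but it is simplified, and process_packet() is a placeholder for the per-packet flow-table work rather than a real VPP function.

/* Inside the node function: from[] holds buffer indices, n_left how many. */
while (n_left >= 4)
  {
    vlib_buffer_t *b0, *b1;

    /* Prefetch the pair after this one so its data is warm next iteration. */
    vlib_prefetch_buffer_with_index (vm, from[2], LOAD);
    vlib_prefetch_buffer_with_index (vm, from[3], LOAD);

    b0 = vlib_get_buffer (vm, from[0]);
    b1 = vlib_get_buffer (vm, from[1]);

    process_packet (b0);    /* placeholder: flow-table lookup and update */
    process_packet (b1);

    from += 2;
    n_left -= 2;
  }

/* Single-packet loop for the remainder. */
while (n_left > 0)
  {
    vlib_buffer_t *b0 = vlib_get_buffer (vm, from[0]);
    process_packet (b0);
    from += 1;
    n_left -= 1;
  }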
