In this article, Mike O’Hara, publisher of The Trading Mesh, talks to Mike Schonberg of Quincy Data, Laurent de Barry and Nicolas Karonis of Enyx, and Henry Young of TS-Associates about how and where FPGA technology is increasingly being used in low-latency trading operations, beyond the traditional areas of market data acquisition and distribution.
FPGA (Field Programmable Gate Array) technology, having been used by latency-sensitive practitioners in the financial markets for well over five years now, can probably be considered fairly mature. The performance and reliability of FPGAs are well proven, particularly around market data handling, as processes such as acquisition, filtering, normalization and distribution of market data can all now be handled at ultra-low latency using a variety of FPGA-based solutions.
But beyond market data, how is the usage of FPGA technology evolving in the world of low-latency trading? What else are firms doing – or trying to do – with FPGAs? And how are they going about it?
One area where FPGA technology is starting to become prevalent is in microwave networking, particularly as more and more such networks are springing up between major trading locations in both the US and Europe.
“You can use FPGAs for implementing novel compression techniques, for compressing Ethernet headers for example, and for having different encodings of data on the network that are more efficient from a bandwidth perspective.”
Mike Schonberg, Director Market Data Technology at Quincy Data
Firms that rely on low-latency connectivity are attracted to microwave networks because they provide the fastest point-to-point path between trading venues. However, there are numerous challenges associated with microwave, not least the fact that bandwidth is heavily constrained compared with fibre. So how can FPGAs help firms make better, more efficient use of that bandwidth?
“You can use FPGAs for implementing novel compression techniques, for compressing Ethernet headers for example, and for having different encodings of data on the network that are more efficient from a bandwidth perspective”, explains Mike Schonberg, Director Market Data Technology at Quincy Data, a specialist provider of ultra-low latency market data services.
“Also, you might potentially need to share this very limited amount of bandwidth in a completely fair way between multiple end-users. Although there are commercial networking hardware solutions available that address this problem, they don’t work particularly well in this environment because they’re not optimised for low latency, which is why we use FPGAs for this task”, he says.
“FPGAs provide the flexibility to implement your own hardware to address this issue. They can also work with the rather unique network topologies associated with microwave and wireless networks”, he adds.
With a range of FPGA-enabled network hardware now available – Arista and Solarflare are two examples of vendors offering switches and network adapters containing FPGAs – Schonberg points out the importance of how and where the FPGA fits into the overall network infrastructure.
“The way the FPGA ties into the architecture is an important factor because if you make the wrong decisions early on, you can potentially restrict what you are able to do with the technology going forward”, he says.
“When you have an FPGA built into a switch, for example, much of the behaviour of the switch can’t necessarily be altered. If a switch is expecting normal Ethernet packets, you couldn’t do something like header compression for example. Whereas if you build an FPGA solution from the ground up, you have complete control over how the data moves through the FPGA and what you do with that data”, says Schonberg.
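As a concrete illustration of the header-compression idea, the sketch below shows per-flow header elision in plain C++. The FlowTable class and the 42-byte header size are assumptions made for the example; in a production link this logic would be expressed in FPGA gates rather than software.

```cpp
// Illustrative sketch of per-flow header compression on a point-to-point link.
// FlowTable and the 42-byte Ethernet/IPv4/UDP header size are assumptions for
// this example; on a real link this logic would live in FPGA gates, and the
// first packet of each flow would be sent uncompressed so both ends learn it.
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t kHeaderLen = 42;            // Ethernet + IPv4 + UDP headers
using Header = std::array<std::uint8_t, kHeaderLen>;

class FlowTable {
public:
    // Sender side: replace a full, repetitive header with a 1-byte flow index.
    std::uint8_t compress(const Header& h) {
        for (std::size_t i = 0; i < flows_.size(); ++i)
            if (flows_[i] == h) return static_cast<std::uint8_t>(i);
        flows_.push_back(h);                      // first packet of a new flow
        return static_cast<std::uint8_t>(flows_.size() - 1);
    }
    // Receiver side: restore the full header from the flow index.
    std::optional<Header> expand(std::uint8_t idx) const {
        if (idx >= flows_.size()) return std::nullopt;
        return flows_[idx];
    }
private:
    std::vector<Header> flows_;
};

int main() {
    FlowTable sender, receiver;
    Header h{};                                   // stand-in for a captured header
    h[0] = 0x01;

    receiver.compress(h);                         // learned from the first, uncompressed packet
    std::uint8_t idx = sender.compress(h);        // 42 header bytes become 1 byte on the wire
    auto restored = receiver.expand(idx);
    return (restored && *restored == h) ? 0 : 1;
}
```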
Using FPGAs to control and allocate bandwidth in this way can benefit both the operator of the microwave network and its customers: the limited bandwidth can be shared across multiple users fairly and transparently, and SLAs offering greater performance can be clearly defined and met.
The key factor that allows FPGAs to offer such massive improvements in performance in electronic trading is that they enable processes traditionally handled by software to run directly in hardware on the chip itself, effectively enabling these processes to run at wire speed. This contrasts with software, where there is typically an operating system (OS) to contend with and an OS kernel that controls access to CPU, memory, disk I/O, and networking. In software, if the OS decides it needs to do something important, it can interrupt running jobs and introduce unpredictable delays. This is one of the reasons why it is very difficult to obtain jitter-free, deterministic performance in software.
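As a simple illustration of that effect, the following sketch (plain, portable C++, nothing FPGA-specific) spins on the clock and records the largest gap between consecutive readings; on a general-purpose core that worst-case gap is typically dominated by kernel preemptions rather than by the loop itself.

```cpp
// Minimal sketch: expose OS-induced jitter by timing a busy loop. On an idle,
// isolated core the gaps between clock reads are tens of nanoseconds; a kernel
// preemption shows up as a gap of many microseconds or more.
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    auto prev = clock::now();
    long long worst_ns = 0;

    for (int i = 0; i < 10'000'000; ++i) {
        auto now = clock::now();
        long long gap =
            std::chrono::duration_cast<std::chrono::nanoseconds>(now - prev).count();
        if (gap > worst_ns) worst_ns = gap;       // remember the largest interruption seen
        prev = now;
    }
    std::printf("worst gap observed: %lld ns\n", worst_ns);
    return 0;
}
```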
“Most of today’s TCP offload solutions remove the operating system kernel from the critical path by putting all the stress of handling the network protocols on the server’s CPU. However, this approach to kernel bypass is not a miracle cure because it remains CPU-intensive.”
Laurent de Barry, Co-Founder & Head of Application Engineering at Enyx
With an FPGA, on the other hand, there is no operating system, so those types of problems do not exist. Wherever the FPGA can be used to bypass the kernel, big improvements in performance – and more importantly, deterministic performance – can be achieved, for example by using an FPGA to handle the TCP/IP stack entirely outside the OS kernel.
Laurent de Barry, Co-Founder & Head of Application Engineering at Enyx, a provider of ultra-low latency solutions based around FPGA technology, explains some of the recent advances in full TCP kernel bypass via FPGA.
“Most of today’s TCP offload solutions remove the operating system kernel from the critical path by putting all the stress of handling the network protocols on the server’s CPU”, he says.
“However, despite the marketing messages you might hear, this approach to kernel bypass is not a miracle cure because it remains CPU-intensive. Some network hardware vendors now use kernel bypass technology in their ‘low-latency’ NICs to try to avoid bottlenecks by taking the whole network stack out of the kernel and into the user space. But the problem with this approach is that the network stack is still running on the CPU and is therefore loading the CPU”, says de Barry. See figure one below.
“Everything you can offload from the CPU helps improve latency and – more importantly – reduce jitter”, continues de Barry. “So our solution is to place the full TCP stack in hardware. That way the CPU doesn’t have to worry about TCP any more, as all of those processes are offloaded to the FPGA”. See figure two below.
“The main advantage with this approach is that we don’t use the CPU at all, it’s all done on the FPGA card”, says de Barry.
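The host-side sketch below shows what that division of labour looks like in principle. The fpga_toe_* functions are invented names for illustration (they are not Enyx’s or any other vendor’s actual API), and the stub bodies simply let the example compile. The point is that the application only hands payloads to the card, while connection state, ACKs, retransmissions and checksums stay in FPGA logic.

```cpp
// Hypothetical host-side view of a full TCP offload engine on an FPGA NIC.
// Every fpga_toe_* name below is invented for illustration only; it is not a
// real vendor API. The stub bodies just let the sketch compile. On real
// hardware they would write descriptors to the card, and all TCP state
// (handshake, ACKs, retransmission, checksums) would live in FPGA logic.
#include <cstddef>
#include <cstdint>

struct ToeSession { int id; };                    // opaque handle in a real product

ToeSession* fpga_toe_connect(const char*, std::uint16_t) { static ToeSession s{1}; return &s; }
bool        fpga_toe_send(ToeSession*, const void*, std::size_t) { return true; }
std::size_t fpga_toe_poll_rx(ToeSession*, void*, std::size_t)    { return 0; }
void        fpga_toe_close(ToeSession*)                          {}

int main() {
    // Session setup: the three-way handshake is driven by the FPGA, not the kernel.
    ToeSession* s = fpga_toe_connect("203.0.113.10", 4001);

    const char order[] = "NEW_ORDER ...";         // application payload only
    fpga_toe_send(s, order, sizeof(order) - 1);   // no syscall, no CPU-side TCP stack

    char buf[2048];
    fpga_toe_poll_rx(s, buf, sizeof(buf));        // returns completed payloads only

    fpga_toe_close(s);
    return 0;
}
```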
Another area where FPGAs are being used effectively is in pre-trade risk. Increasingly in today’s electronic markets, and particularly since the introduction in the US of the SEC’s “market access rule” 15c3-5 a few years ago, orders are required to go through multiple checks to satisfy risk profiles before they are sent on to trading venues. FPGAs provide the ideal architecture for this because dozens of different pre-trade checks on a single order can be computed in parallel, all in less than a microsecond.
Nicolas Karonis, Business Development Director at Enyx, explains how this works.
“When firms originally started doing pre-trade risk checks via FPGA around three or four years ago, they jumped on the SEC regulation 15c3-5, which mandated a simple list of ‘fat finger’ checks based only upon information that was held within the order itself, i.e. quantity, price, total value of the order and so on.”
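A minimal sketch of those order-level checks is shown below. Each check reads only the order and a static limit, which is why an FPGA can evaluate all of them in parallel within the same handful of clock cycles; the field names and limit values are illustrative assumptions.

```cpp
// Minimal sketch of the stateless "fat finger" checks described above. Each
// check depends only on the order itself, so an FPGA can evaluate them all in
// parallel. Field names and limit values are illustrative only.
#include <cstdint>

struct Order {
    std::uint64_t qty;
    std::uint64_t price;        // in ticks
    std::uint64_t notional() const { return qty * price; }
};

struct Limits {
    std::uint64_t max_qty;
    std::uint64_t max_price;
    std::uint64_t max_notional;
};

// Returns a bitmask of failed checks; 0 means the order passes.
std::uint32_t pre_trade_check(const Order& o, const Limits& l) {
    std::uint32_t fail = 0;
    if (o.qty        > l.max_qty)      fail |= 1u << 0;   // order size limit
    if (o.price      > l.max_price)    fail |= 1u << 1;   // price collar
    if (o.notional() > l.max_notional) fail |= 1u << 2;   // total value limit
    return fail;                       // in hardware: an OR of parallel comparators
}

int main() {
    Limits limits{10'000, 500'000, 1'000'000'000};
    Order  order{500, 120'000};        // 500 lots at 120,000 ticks
    return pre_trade_check(order, limits) == 0 ? 0 : 1;
}
```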
“Our approach here is to have the full order book managed by the FPGA, allowing the complex compliance needs requiring calculations on positions, computing with external arrays or cross-correlating between assets, all to be handled pre-trade.”
Nicolas Karonis, Business Development Director at Enyx
“However, it gets more complex when you need more than that. What you can’t check by just looking at the order itself is what other orders are already on the book, what executions you’ve done previously, etc”.
This is the real challenge, according to Karonis.
“Our approach here is to have the full order book managed by the FPGA, allowing the complex compliance needs requiring calculations on positions, computing with external arrays or cross-correlating between assets, all to be handled pre-trade.”
“That’s not straightforward. You need to make sure that all of the information, all of the time, can be accessed within the FPGA without having to go out and look up a database, for example. Now, with properly designed solutions in hardware (with clever integration of Quad Data Rate (QDR) memories that allow simultaneous reads and writes) and properly optimized VHDL code, these problems are overcome, allowing essentially the same flexibility as software solutions for order management”, he says.
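A simple way to picture the stateful case is a running exposure tracker that every order must pass before it leaves the box. The minimal software sketch below uses illustrative names and limits; on the card itself that state would sit in on-board memory (the QDR memory Karonis mentions) rather than in a hash map.

```cpp
// Sketch of a stateful pre-trade check: the running open exposure per
// instrument must live next to the checks themselves, because there is no time
// to query an external database. Container choice and limits are illustrative.
#include <cstdint>
#include <unordered_map>

struct ExposureTracker {
    std::unordered_map<std::uint32_t, std::int64_t> open_notional;  // per instrument id
    std::int64_t max_notional_per_instrument = 50'000'000;

    // Reject the order if resting exposure plus this order would breach the limit.
    bool accept(std::uint32_t instrument, std::int64_t order_notional) {
        std::int64_t projected = open_notional[instrument] + order_notional;
        if (projected > max_notional_per_instrument) return false;
        open_notional[instrument] = projected;       // book the order as working
        return true;
    }

    // On fill or cancel, release the exposure again.
    void release(std::uint32_t instrument, std::int64_t notional) {
        open_notional[instrument] -= notional;
    }
};

int main() {
    ExposureTracker t;
    bool a = t.accept(42, 30'000'000);   // accepted
    bool b = t.accept(42, 30'000'000);   // rejected: 60m would breach the 50m limit
    t.release(42, 30'000'000);
    return (a && !b) ? 0 : 1;
}
```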
With more and more processes moving from the CPU to the FPGA in the race to ever lower and more deterministic latency, one of the challenges that firms now face is how to measure point-to-point latencies in these nanosecond domains.
The answer lies in instrumentation at the FPGA firmware level, according to Henry Young, CEO of TS-Associates, a supplier of precision instrumentation solutions for latency sensitive trading systems.
“In the good old days of single core servers, you could have each functional block in the trade flow – your feed handler, client connectivity gateway, algos, SOR and execution gateway – all on their own dedicated servers”, he says.
“Then along came multicore and shared memory communication. So all of a sudden you lost visibility, because traditional instrumentation techniques are based around network taps or SPAN ports, i.e. looking at packets on network connections. If you don’t have physical network connections between these various components – because they’re doing shared memory communication on the multicore server – you can’t peer inside. That’s why we launched the Application Tap, which accurately timestamps metadata within the applications themselves”, he says.
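The correlation idea behind this kind of instrumentation can be sketched simply: every tap point emits a timestamped event carrying a correlation ID, and matching the same ID at two points yields that hop’s latency. The record layout below is an assumption made for illustration; it is not the Application Tap’s actual format.

```cpp
// Sketch of correlating timestamped tap events to measure per-hop latency:
// each functional block emits (tap point, correlation id, timestamp), and
// matching the same id at two tap points gives that hop's latency. The record
// layout and values here are illustrative only.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct TapEvent {
    std::uint16_t point;      // which tap point emitted the event
    std::uint64_t corr_id;    // e.g. an order id, identical at every tap point
    std::uint64_t ts_ns;      // hardware timestamp in nanoseconds
};

int main() {
    // Events as they might arrive from two tap points for the same order.
    std::vector<TapEvent> events = {
        {1, 7001, 1'000'000'100},     // order seen entering the feed handler
        {2, 7001, 1'000'000'820},     // same order seen leaving the execution gateway
    };

    std::unordered_map<std::uint64_t, std::uint64_t> first_seen;
    for (const auto& e : events) {
        auto it = first_seen.find(e.corr_id);
        if (it == first_seen.end()) {
            first_seen.emplace(e.corr_id, e.ts_ns);          // record the first tap point
        } else {
            std::printf("corr_id %llu: %llu ns point-to-point\n",
                        static_cast<unsigned long long>(e.corr_id),
                        static_cast<unsigned long long>(e.ts_ns - it->second));
        }
    }
    return 0;
}
```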
“The idea is that you can now get instrumentation that zooms into the individual functional sub-components that are all sharing the same FPGA silicon. Which is tremendously exciting.”
Henry Young, CEO of TS-Associates
Young and his team at TS-A are now working with Enyx to embed the Application Tap functionality into the FPGA itself, defining instrumentation hooks in the FPGA firmware to emit time-stamped events that are compatible with the Application Tap instrumentation format. The main advantage of this approach is that it gives clients granular visibility right down to the FPGA level, says Young.
“The idea is that you can now have instrumentation that zooms into the individual functional sub-components that are all sharing the same FPGA silicon. Which is tremendously exciting for people who care about this stuff”, he says.
Ever since FPGAs were first introduced into high frequency and low latency trading environments, it has been the dream of many such firms to have the entire trading platform running on the FPGA, at wire speed.
The last bastion is having the trading algorithm itself running on the FPGA. This has always presented something of a challenge as, for a variety of reasons, FPGAs do not lend themselves to running any kind of complex trading logic.
This could be about to change as FPGA hardware evolves. The latest generation of FPGAs comes equipped with an embedded co-processor – an ARM core – which can be programmed, in an accessible way, to perform a variety of tasks that until now have not been possible on an FPGA.
It will be interesting to see how far firms can innovate utilising this new generation of FPGAs, particularly around the trading algorithms themselves.