Imperial-efx_Credit_Suisse_23Jan13_new

Applications of Computing in Industry:
What is Low Latency All About?
eFX – January 2014
Divyakant Bengani
Undergrad degree in Management and IT from Manchester
Vice President at CS, responsible for eFX Core Technologies
Working in the banking industry since 2003 & CS for ~3 years
EFX - What do we do?
Cash FX Only
Spot, Forwards and Swaps
Continuous Publication of Prices
Streaming Executable Rates
Response to Request for Quotes
Acceptance and Booking of Trades
Key Statistics
~200 Currency Pairs (e.g. EURUSD, GBPJPY)
3 billion prices broadcast a day
60000 trades a day
>200 client connections
Technologies Used
Java
C# for UIs
GWT for Web UIs
Oracle Coherence
Oracle DB
Derby DB
Azul Zing JVM
Low Latency Fix Engine
Protocols
Socket Connections
Asynchronous JMS
Java RMI
HTTP (JSON, HESSIAN)
Payloads
Google Protobuf
Fixed Length Byte Arrays
FIX - Industry Standard
JMS Map Messages
Java Serialization
EFX - Overall Architecture
Service Discovery
Zero Conf
Dynamically add and remove services
Applications do not need to know about each other; they just pick up what’s advertised
Automated Testing
Code Quality Analysis
Continuous Integration
How to Achieve Low Latency
Daniel Nolan-Neylan
Graduated from UCL in 2004
Started working at Credit Suisse in 2006
− First, networking for 4 years
− Now, Application Developer in FX IT
Different projects:
− Distributed caching system for static data
− Simplified credit checking library
− Pricing and trading gateway (now team lead)
Corporate Design, HCBC 1
November 2011
Wait a second!
Reminder:
1 second is:
− 1,000 milliseconds
− 1,000,000 microseconds
− 1,000,000,000 nanoseconds
Latency Numbers Every Programmer Should Know
Operation                             Latency          In ms     Notes
L1 cache reference                    0.5 ns
Branch mispredict                     5 ns
L2 cache reference                    7 ns                       14x L1 cache
Mutex lock/unlock                     25 ns
Main memory reference                 100 ns                     20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy          3,000 ns
Send 1K bytes over 1 Gbps network     10,000 ns        0.01 ms
Read 4K randomly from SSD*            150,000 ns       0.15 ms
Read 1 MB sequentially from memory    250,000 ns       0.25 ms
Round trip within same datacenter     500,000 ns       0.5 ms
Read 1 MB sequentially from SSD*      1,000,000 ns     1 ms      4x memory
Disk seek                             10,000,000 ns    10 ms     20x datacenter round trip
Read 1 MB sequentially from disk      20,000,000 ns    20 ms     80x memory, 20x SSD
Send packet CA->Netherlands->CA       150,000,000 ns   150 ms
By Jeff Dean:
http://research.google.com/people/jeff/
FX Trading – Latency Numbers
250ms – A human responding to price update
30ms – Bank accepting trade
10ms – Credit checking client
9ms – JVM Garbage Collecting
5ms – Persisting a trade to disk
2ms – JMS networking round-trip
1ms – Raw socket networking round-trip
0.5ms – Max wire-to-wire pricing latency
0.05ms – Min pricing latency
0.005ms – Writing price to FIX engine
Optimization Quotes
Michael A. Jackson:
“The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.”
Rob Pike:
“Bottlenecks occur in surprising places, so don't try to second guess
and put in a speed hack until you have proven that's where the
bottleneck is.”
Where to Optimize? Use a Profiler
Measuring Milliseconds and Nanoseconds in Java
Measure time taken for operations and log:
− System.currentTimeMillis()
Good for taking a time/date that can be compared against other
systems. Accuracy depends on OS, but 1ms accuracy achievable on
modern Unix-based OS (Linux)
Bad if more precise measurements are required
− System.nanoTime()
Good for sub-millisecond measurements
Bad if comparable time with other systems required
− Realistically, need to use both
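Using both clocks together might look like the sketch below: the wall-clock timestamp goes in the log so it can be lined up against other systems, while the elapsed time comes from System.nanoTime(). The class and method names (TimedOperation, Timing, time) are made up for illustration and are not from the eFX code base.

```java
import java.util.concurrent.TimeUnit;

/** Sketch: wall-clock timestamp for cross-system logs, nanoTime for the elapsed duration. */
final class TimedOperation {

    /** Result of timing one operation (hypothetical helper type). */
    static final class Timing {
        final long startedAtEpochMillis; // comparable with other systems' logs
        final long elapsedNanos;         // only meaningful inside this JVM

        Timing(long startedAtEpochMillis, long elapsedNanos) {
            this.startedAtEpochMillis = startedAtEpochMillis;
            this.elapsedNanos = elapsedNanos;
        }

        long elapsedMicros() {
            return TimeUnit.NANOSECONDS.toMicros(elapsedNanos);
        }
    }

    static Timing time(Runnable operation) {
        long wallClock = System.currentTimeMillis(); // when it happened
        long start = System.nanoTime();              // how long it took
        operation.run();
        long elapsed = System.nanoTime() - start;
        return new Timing(wallClock, elapsed);
    }

    public static void main(String[] args) {
        Timing t = time(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException("unreachable");
        });
        System.out.println("started=" + t.startedAtEpochMillis
                + "ms elapsed=" + t.elapsedMicros() + "us");
    }
}
```

Only differences between two nanoTime() readings are meaningful; the absolute value has no relation to the time of day.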
Quote Journalling – log latency of every price
Our Soak Test Harness
…and the graphs it can produce
Removing Millisecond Delays
Identify the longest-running tasks
− Usually I/O delays
Disk
– Database activity
– Synchronous logging
– Writing files
Network
– Calling network services
– Remote services far away (e.g. across the Atlantic, ~50ms)
Removing Millisecond Delays (2)
Analyze whether delays can be eliminated
− Disk
Database activity -> Use a cache
Synchronous logging -> Use asynchronous logging
Writing files -> Use buffers and write asynchronously
− Network
Calling network services -> Cache where possible
Remote services far away -> Co-locate in same place
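As one illustration of the "synchronous logging -> asynchronous logging" item above, a minimal sketch is shown here: the calling thread only enqueues the line, and a background thread does the slow write. AsyncLogger and its sink parameter are hypothetical names; a production system would more likely use the asynchronous mode of an existing logging framework, and would bound the queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Sketch: log() only enqueues; a background thread performs the slow write. */
final class AsyncLogger implements AutoCloseable {
    // Unique sentinel instance, compared by reference to signal shutdown.
    private static final String STOP = new String("STOP");

    // Unbounded for simplicity; a real logger would bound this and pick a drop/block policy.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writer;

    AsyncLogger(Consumer<String> sink) {
        writer = new Thread(() -> {
            try {
                for (String line = queue.take(); line != STOP; line = queue.take()) {
                    sink.accept(line); // disk or network I/O happens here, off the hot path
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    /** Returns immediately; the caller never waits for I/O. */
    void log(String line) {
        queue.offer(line);
    }

    /** Flushes everything queued before the call, then stops the writer thread. */
    @Override
    public void close() throws InterruptedException {
        queue.offer(STOP);
        writer.join();
    }

    public static void main(String[] args) throws Exception {
        try (AsyncLogger logger = new AsyncLogger(System.out::println)) {
            logger.log("price published");
            logger.log("trade accepted");
        }
    }
}
```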
FX Trading – RFQ Example
E.g. Incoming request for a price, target response time is 10ms
− Need to:
Validate request parameters
Internally subscribe for prices
Obtain a globally unique transaction ID
Perform a credit check
How to get all this done in just 10ms?
FX Trading – RFQ Example (2)
Credit check
− Old one took 30-200ms
− New one takes 5-10ms
Using Caching and Co-location
Parallelize all validation
Pre-cache prices
− by opening up price streams in advance of being required
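"Parallelize all validation" could be sketched as follows: each check is submitted to a thread pool, and the request passes only if every check succeeds within the deadline. The class name ParallelRfqChecks and the stand-in checks are assumptions for illustration; the slides do not show how the real gateway does this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Sketch: run independent RFQ checks concurrently instead of one after another. */
final class ParallelRfqChecks {

    /** Passes only if every check returns true before the timeout expires. */
    @SafeVarargs
    static boolean allPass(ExecutorService pool, long timeoutMillis, Supplier<Boolean>... checks)
            throws InterruptedException, ExecutionException, TimeoutException {
        List<CompletableFuture<Boolean>> futures = new ArrayList<>();
        for (Supplier<Boolean> check : checks) {
            futures.add(CompletableFuture.supplyAsync(check, pool)); // all start immediately
        }
        // Total wait is roughly the slowest check, not the sum of all checks.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .get(timeoutMillis, TimeUnit.MILLISECONDS);
        for (CompletableFuture<Boolean> f : futures) {
            if (!f.join()) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            boolean ok = allPass(pool, 10,
                    () -> true,   // stand-in: validate request parameters
                    () -> true,   // stand-in: credit check
                    () -> true);  // stand-in: price subscription ready
            System.out.println("RFQ accepted: " + ok);
        } finally {
            pool.shutdown();
        }
    }
}
```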
Don’t Optimize Too Soon
Remember:
− Only optimize what you need to optimize
− Remove longest delays first
No point removing micros if you still have delays of millis or worse
− Always measure your operations carefully
Determine what minimum, maximum, mean, standard deviation, and
other percentiles are (99%, 99.9%, etc)
− Watch for jitter and solve separately
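The statistics listed above can be computed from recorded latency samples along these lines. This is a sketch only: LatencyStats is a made-up name, percentiles use the simple nearest-rank method, and the standard deviation is the population form.

```java
import java.util.Arrays;

/** Sketch: min, max, mean, standard deviation and percentiles over latency samples. */
final class LatencyStats {
    final long min;
    final long max;
    final double mean;
    final double stdDev;
    private final long[] sorted;

    LatencyStats(long[] samplesNanos) {
        if (samplesNanos.length == 0) throw new IllegalArgumentException("no samples");
        sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        min = sorted[0];
        max = sorted[sorted.length - 1];
        double sum = 0;
        for (long s : sorted) sum += s;
        mean = sum / sorted.length;
        double squares = 0;
        for (long s : sorted) squares += (s - mean) * (s - mean);
        stdDev = Math.sqrt(squares / sorted.length); // population standard deviation
    }

    /** Nearest-rank percentile, with p in (0, 100]. */
    long percentile(double p) {
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        LatencyStats stats = new LatencyStats(new long[] {52_000, 48_000, 50_000, 9_000_000});
        System.out.println("mean=" + stats.mean + "ns p99=" + stats.percentile(99) + "ns max=" + stats.max + "ns");
    }
}
```

A single 9ms outlier (for instance, a GC pause) barely shows in a median but dominates the mean and the high percentiles, which is why the tail figures are tracked separately.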
Removing Microsecond Delays
Intra-process delays
− Unbalanced / slow queues
− Slow algorithms
Expensive loops repeated many times
Poor use of object creation / memory allocation
Contended memory controlled with locks
Wasted effort calculating unwanted results
FX Trading – Pricing Example
Achieving wire-to-wire latencies of 50μs
− Google protobuf parsers replaced with low-garbage versions
Each GC pause stops the JVM for 9,000μs (i.e. 9ms)
− LMAX Disruptors used instead of queues
Busy spin consumer threads / single-write principle
− “PriceBigDecimal” class to replace Java BigDecimal class
BigDecimal slow to instantiate and impossible to mutate
− No synchronous logging or network calls
− Pre-cache static data before starting price stream
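The idea behind replacing BigDecimal can be illustrated with a mutable fixed-point value that is reused for every price update, so no new object is created per tick. The sketch below is not the "PriceBigDecimal" class from the slides, whose design is not shown; MutablePrice and its methods are hypothetical, and overflow handling is omitted.

```java
/** Sketch: a reusable fixed-point price, e.g. unscaled 135792 with scale 5 means 1.35792. */
final class MutablePrice {
    private static final long[] POW10 = new long[19];
    static {
        POW10[0] = 1;
        for (int i = 1; i < POW10.length; i++) POW10[i] = POW10[i - 1] * 10;
    }

    private long unscaled;
    private int scale;

    /** Overwrites the value in place instead of allocating a new object. */
    MutablePrice set(long unscaled, int scale) {
        if (scale < 0 || scale >= POW10.length) throw new IllegalArgumentException("scale: " + scale);
        this.unscaled = unscaled;
        this.scale = scale;
        return this;
    }

    /** Adds a delta expressed in the same scale, e.g. 3 at scale 5 is 0.00003. */
    MutablePrice addUnscaled(long delta) {
        unscaled += delta;
        return this;
    }

    /** Writes the decimal text into a caller-supplied buffer without building a String. */
    StringBuilder appendTo(StringBuilder sb) {
        long abs = Math.abs(unscaled); // Long.MIN_VALUE is not handled in this sketch
        long divisor = POW10[scale];
        if (unscaled < 0) sb.append('-');
        sb.append(abs / divisor);
        if (scale > 0) {
            sb.append('.');
            long fraction = abs % divisor;
            for (long d = divisor / 10; d > 0; d /= 10) {
                sb.append((char) ('0' + (fraction / d) % 10)); // keeps leading zeros
            }
        }
        return sb;
    }

    public static void main(String[] args) {
        MutablePrice price = new MutablePrice();
        StringBuilder buffer = new StringBuilder(32);
        price.set(135792, 5).addUnscaled(3);
        System.out.println(price.appendTo(buffer));
    }
}
```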
Disruptor or Blocking Queues?
Java BigDecimal or use Low Latency replacement?
Removing Nanoseconds?
Use specialist hardware (such as FPGA)
Understand low-level CPU interconnectivity with memory, and how CPU
caching works (including cache-lines)
http://mechanical-sympathy.blogspot.com
eFX – No need to pursue this level of performance at the moment
Latency vs Throughput
Latency - time taken (typically mean, percentile or worst case) to
complete a task
Throughput – the number of tasks completed in a given time period
(typically, per second)
Throughput is 1/latency (per pipeline)
Increasing Throughput
Identify delays
− Throughput constrained by latency
− Blocking I/O calls delay unprocessed messages
Data bursts
− What’s the peak throughput required?
− What’s the gap typically between bursts?
Techniques to Increase Throughput
Batching
− Sometimes high-latency calls are unavoidable
− Batching amortizes the overhead of making one call per transaction
− Cost of batching is the delay incurred waiting for new items to add to
batch
− More difficult to accurately measure delay per item when multiple items
are in a batch
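The trade-off described above (bigger batches versus time spent waiting for items) can be sketched with a queue that is drained up to a maximum batch size or until a maximum wait has passed. Batcher and its parameters are illustrative names, not taken from the eFX system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Sketch: collect items into batches bounded by size and by waiting time. */
final class Batcher<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final int maxBatchSize;
    private final long maxWaitNanos;

    Batcher(int maxBatchSize, long maxWait, TimeUnit unit) {
        if (maxBatchSize < 1) throw new IllegalArgumentException("maxBatchSize must be >= 1");
        this.maxBatchSize = maxBatchSize;
        this.maxWaitNanos = unit.toNanos(maxWait);
    }

    void submit(T item) {
        queue.add(item);
    }

    /** Blocks for the first item, then gathers more until the batch is full or maxWait elapses. */
    List<T> nextBatch() throws InterruptedException {
        List<T> batch = new ArrayList<>(maxBatchSize);
        batch.add(queue.take());
        long deadline = System.nanoTime() + maxWaitNanos; // the cost of batching: added delay
        while (batch.size() < maxBatchSize) {
            queue.drainTo(batch, maxBatchSize - batch.size()); // take whatever is already queued
            if (batch.size() >= maxBatchSize) break;
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;
            T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (next == null) break; // timed out waiting for more items
            batch.add(next);
        }
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        Batcher<String> batcher = new Batcher<>(5, 2, TimeUnit.MILLISECONDS);
        for (int i = 1; i <= 7; i++) batcher.submit("trade-" + i);
        System.out.println(batcher.nextBatch());
        System.out.println(batcher.nextBatch());
    }
}
```

Each batch would then be sent in a single remote call, so the round trip is paid once per batch rather than once per item.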
FX Trading – Batching Example
Legacy global server in London
Regional trade acceptance components
Latency between New York and London - 50ms
Per thread: 1/0.05 = 20 trades per second max
How to increase?
− More threads
− Add batching per thread
Now, with a batch size of 5: 100 trades per second per thread.
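The arithmetic in this example generalizes to threads x batch size / round-trip time. The helper below is a sketch under the assumption that the round-trip time stays at 50ms regardless of batch size and that threads do not contend with each other; BatchThroughput is a made-up name.

```java
/** Sketch: upper bound on throughput for synchronous calls over a high-latency link. */
final class BatchThroughput {

    /** Items per second, assuming each thread completes one batch per round trip. */
    static double maxPerSecond(int threads, int batchSize, long roundTripMillis) {
        return threads * batchSize * 1000.0 / roundTripMillis;
    }

    public static void main(String[] args) {
        System.out.println(maxPerSecond(1, 1, 50)); // one thread, no batching
        System.out.println(maxPerSecond(1, 5, 50)); // one thread, batches of 5
        System.out.println(maxPerSecond(4, 5, 50)); // four threads, batches of 5
    }
}
```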
Techniques to Increase Throughput(2)
Use Asynchronous callbacks
− Synchronous calls:
boolean doCall()
Wait for response
Can be delayed for varying time
− Asynchronous calls:
void doCall(Callback callback)
Do not wait and keep processing more events
Can additionally overlay timeouts to improve resilience
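A callback-style doCall with a timeout overlaid might be sketched as follows. The Callback interface, the AsyncCalls class and the use of a thread pool to stand in for the remote service are all assumptions for illustration; a real gateway would typically use non-blocking network I/O rather than a worker thread per call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Sketch: the caller hands over a callback and returns at once; a timer guards against lost replies. */
final class AsyncCalls {

    interface Callback {
        void onResult(boolean accepted);
        void onTimeout();
    }

    // Stand-in for the remote service; hypothetical, for demonstration only.
    private final ExecutorService worker = Executors.newCachedThreadPool();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /** Returns immediately; exactly one of onResult or onTimeout is invoked later. */
    void doCall(BooleanSupplier remoteCall, long timeoutMillis, Callback callback) {
        AtomicBoolean completed = new AtomicBoolean(false);
        ScheduledFuture<?> timeout = timer.schedule(() -> {
            if (completed.compareAndSet(false, true)) {
                callback.onTimeout();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        worker.execute(() -> {
            boolean accepted = remoteCall.getAsBoolean();
            if (completed.compareAndSet(false, true)) { // ignore replies that arrive after the timeout
                timeout.cancel(false);
                callback.onResult(accepted);
            }
        });
    }

    void shutdown() {
        worker.shutdownNow();
        timer.shutdownNow();
    }
}
```

The compare-and-set on the completed flag is what guarantees a single notification when the reply and the timeout race each other.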
FX Trading – Asynchronous Callbacks
Submission of a trade to the price service for verification was originally synchronous
Call blocks for 50ms – max 20 trades per second per thread
After converting to asynchronous callbacks, the only delay is putting packets on the network buffer (μs), so there is effectively no delay and the maximum number of trades is very high!
Q&A
eFX – January 2014