The post GPU Accelerated Deep Learning also Wanted for .NET appeared first on QuantAlea Blog.
Several deep learning libraries already exist, but none of them targets .NET. Alea TK is a new open source project that aims to develop a complete GPU accelerated deep learning stack for .NET. It is built on top of Alea GPU and is designed from the ground up to be GPU accelerated, easy to extend and easy to deploy. The project is still in an early phase, and contributors and developers are welcome! The recording explains why we started the project and what we plan to do with it in the future.
Check out Alea TK on our project web site and on GitHub.
The post Radically Simplified GPU Programming with C# appeared first on QuantAlea Blog.
GPU computing is all about number crunching and performance. Do you have a lot of parallel calculations? Then try running them on the GPU with C#. With the new Alea GPU parallel GPU methods it is as easy as changing a few lines of code to utilize the power of GPUs. No GPU in your box? Don't worry, you can get one from Azure or other cloud providers. I explained how easy it is to run C# code on the GPU, with full debugging support in Visual Studio.
Check out Alea GPU on our product web site.
The post A new Deep Learning Stack for .NET appeared first on QuantAlea Blog.
I gave a talk at GTC Europe 2016 in Amsterdam about our new open source project Alea TK.
Alea TK is a library for general purpose numerical computing and deep learning based on tensors and tensor expressions, supporting imperative calculations as well as symbolic calculations with auto-differentiation. It is designed from the ground up with CUDA acceleration in mind. It is easy to extend, install and deploy, and is perfectly suited for rapid prototyping and new model development. Alea TK is built entirely in .NET and C#. It relies on the Alea GPU compiler and uses NVIDIA's cuDNN library to accelerate many standard deep neural network primitives.
Alea TK is still a young project. I explained the main design principles and presented the framework, with a particular focus on GPU kernel fusion technology.
Check out the slides and our poster.
The post F# [on GPUs] for Quant Finance appeared first on QuantAlea Blog.
On July 14th 2016 I gave a talk at the Swiss FinteCH Meetup on open source technologies in fintech. The Swiss FinteCH Meetup group is a great and growing community interested in technology applied to financial problems. Thanks to Swati for organizing the event.
Check out the slides for more information.
The post Deficiencies of .NET CLR JIT Compilers appeared first on QuantAlea Blog.
I recently gave a talk at an F# meetup hosted by Jet.com about deficiencies of .NET CLR JIT compilers.
We know that C# or F# often does not perform at the level of native C++ because the CLR JIT compiler does not optimize the code well enough. In the worst cases we lose a factor of 2 to 4 against native code. To investigate this problem in more depth, you can check how the .NET CLR JIT compilers compile simple loops and nested loops. It is not enough to just look at the MSIL code; we have to dig deep into the optimized assembly code generated by the CLR JIT compilers. We find that the CLR JIT compilers are not capable of removing array bound checks or optimizing array access patterns of the form a[i*lda + j] in simple nested loops. This is very bad news for performance critical code in .NET.
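To make the problem concrete, here is an illustrative sketch (not the exact benchmark from the talk) of the loop shape in question: every access `a[i * lda + j]` incurs a bound check, and the row offset computation is not hoisted out of the inner loop by the JIT, so the programmer has to do it by hand.

```{.cs}
// Illustrative sketch, not the benchmark code from the talk.
// Row-major matrix-vector product: the CLR JIT neither removes the
// bound check on a[i * lda + j] nor hoists i * lda out of the inner loop.
static void MatVec(double[] a, int lda, double[] x, double[] y, int n)
{
    for (var i = 0; i < n; i++)
    {
        var sum = 0.0;
        for (var j = 0; j < n; j++)
            sum += a[i * lda + j] * x[j];   // bound check + recomputed index
        y[i] = sum;
    }
}

// The manual optimization a C++ compiler would do automatically:
static void MatVecHoisted(double[] a, int lda, double[] x, double[] y, int n)
{
    for (var i = 0; i < n; i++)
    {
        var sum = 0.0;
        var rowOffset = i * lda;            // hoisted once per row
        for (var j = 0; j < n; j++)
            sum += a[rowOffset + j] * x[j];
        y[i] = sum;
    }
}
```

Both versions compute the same result; the difference shows up only in the generated assembly and the resulting running time.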
Fortunately, you can get around these problems by moving performance critical code to the GPU. The Floyd-Warshall all shortest path algorithm serves as an example: an advanced GPU implementation fully written in C# and compiled with Alea GPU gives a significant speedup. It runs at the same speed as a native GPU version coded in C++ and 125 times faster than the initial C# version!
Developing such efficient algorithms is not straightforward at all and requires some experience. We therefore take a step back and show that simpler problems can often be solved efficiently with parallel-for and parallel aggregate patterns running on the GPU, with a dramatic performance increase of a factor of 50 to 100.
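As a brief sketch of what these patterns look like (using the `Gpu.For` and `gpu.Aggregate` APIs described in our other posts), a sum of squares can be written as:

```{.cs}
// Hedged sketch of the parallel-for and parallel aggregate patterns.
var gpu = Gpu.Default;
var data = Enumerable.Range(0, 1000000).Select(i => (double)i).ToArray();
var squared = new double[data.Length];

// Parallel map on the GPU.
gpu.For(0, data.Length, i => squared[i] = data[i] * data[i]);

// Parallel reduction with a binary associative operator.
var sumOfSquares = gpu.Aggregate(squared, (x, y) => x + y);
```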
Here are the slides.
The post GPU Computing on .NET at Speed of CUDA C++ appeared first on QuantAlea Blog.
In the last post we gave a sneak preview of the upcoming Alea GPU version 3. Alea GPU version 3 sets completely new standards for GPU development on .NET. The highlights of the new version are:
Automatic memory management and data transfer, which makes GPU programming easier and eliminates a lot of boilerplate code.
Integration with the .NET type system so that .NET arrays can be used directly in GPU kernels.
CUDA programming model integrated with C# and F#.
In the last post we discussed the new API and its usability. This time we benchmark Alea GPU against multi-threaded .NET and CUDA C++. For the performance study we use an application from quantitative finance that calculates the price of an Asian option with Monte Carlo simulation based on the celebrated Black-Scholes model. Nowadays Monte Carlo simulations are a common practice to calculate the price of complex financial products and risk figures. GPUs significantly reduce the calculation time.
We look at different implementation versions using different features of Alea GPU.
Explicit memory management: we use the Alea GPU parallel-for and parallel aggregate but we explicitly copy the data to the GPU.
Transform-reduce: we combine the parallel-for and parallel aggregate into a single parallel transform-reduce and use implicit memory management.
LINQ work-flow: we express the calculation using Alea GPU LINQ, which allows us to do kernel fusing and in a future version also more aggressive memory transfer optimizations.
Constant memory: we use constant memory to store data that is read-only on the GPU.
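To give a flavor of the first variant, explicit memory management combines the parallel-for and parallel aggregate with manual device allocation. The following is a sketch based on the APIs shown later in this post, not the exact benchmark code:

```{.cs}
// Sketch of the explicit memory management variant: input data is
// copied to the device by hand instead of relying on [GpuManaged].
var dt = gpu.Allocate(dtHost);        // explicit host -> device copies
var rates = gpu.Allocate(ratesHost);
var volas = gpu.Allocate(volasHost);
var gaussian = gpu.Allocate<double>(numSamplesPerBatch * nt); // filled by cuRAND
var prices = gpu.Allocate<double>(numSamplesPerBatch);

gpu.For(0, numSamplesPerBatch,
    i => prices[i] = AsianCall(i, spot0, strike, dt, rates, volas, gaussian));
var batchPrice = gpu.Average(prices);

Gpu.Free(dt); Gpu.Free(rates); Gpu.Free(volas);
Gpu.Free(gaussian); Gpu.Free(prices);
```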
The CPU implementation is fully multi-threaded and also uses the fast CPU random number generators that come with the NVIDIA cuRand library. We run the computations on an Intel i7-4770 CPU at 3.5 GHz and on different GPU configurations. Let us look at the timings in milliseconds:
As expected the Kepler K40 is the fastest GPU thanks to its high number of 2880 CUDA cores and very good double precision support. The version using implicit memory management is pretty close to the explicit one. Using constant memory also slightly improves performance, because it allows data reads to be cached. Now we compare the speed-up relative to a multi-threaded C# implementation:
Already with a rather weak GeForce GT 750M mobile GPU we gain a performance increase of roughly a factor of 4. The high-end K40 GPU boosts performance by a factor of 100. The version using explicit memory management scales best to multiple GPUs, as shown by running it on two GeForce GTX Titan Black GPUs as well as on an AWS instance with four GRID K520 GPUs. In the current version of Alea GPU the implicit memory management uses a global tracker, which is not yet optimized for multiple GPUs. This explains why the implicit memory management version scales poorly to multiple GPUs.
Now let us look at the performance of Alea GPU compared to a CUDA C++ implementation. Again we run the computations on GPUs of different hardware generations. Here are the timings in milliseconds of the Asian option pricing in C# using explicit management and a CUDA C++ implementation:
Investigating the runtime behavior in more detail with the Visual Profiler reveals a few interesting details: the CUDA C++ Asian option pricing kernel uses 37 to 40 registers, depending on the CUDA compute architecture, whereas the .NET version compiled with Alea GPU only uses 30 registers. This has consequences for the number of blocks in flight and the overall occupancy. The .NET version can use 52 thread blocks and achieves 100% occupancy. The CUDA C++ version can only launch 39 thread blocks and achieves 75% occupancy. However, the kernel execution time is almost the same. Regarding the overall execution time, the .NET version is slightly slower because in .NET there is a small overhead to launch a kernel.
For those who are interested in further details, we now sketch the C# as well as the CUDA C++ implementation.
We would like to share the core algorithmic code between the CPU and GPU implementations. We achieve this by encapsulating the main calculation in a delegate:
```{.cs}
public static double AsianCall(int i, double spot0, double strike, double[] dt,
                               double[] rates, double[] volas, double[] gaussian)
{
    var sum = 0.0;
    var spot = spot0;
    for (var k = 0; k < dt.Length; k++)
    {
        var sigma = volas[k];
        var drift = dt[k] * (rates[k] - sigma * sigma / 2);
        spot = spot * DeviceFunction.Exp(drift + DeviceFunction.Sqrt(dt[k]) * sigma * gaussian[k * dt.Length + i]);
        sum += spot;
    }
    return DeviceFunction.Max(sum / dt.Length - strike, 0.0);
}
```
This delegate calculates the payoff along path `i`. The price computation simulates multiple sample batches so that we can scale up the number of simulations. For each batch we generate `numSamplesPerBatch` normally distributed random numbers with the NVIDIA cuRAND library. Then we use the Alea GPU parallel-for to generate the paths and the prices using the `AsianCall` delegate. Finally, we apply the Alea GPU average reduction to calculate the batch mean. The final price is the average of all batch prices.
```{.cs}
[GpuManaged]
public static double PriceGpu(int numBatches, int numSamplesPerBatch, double spot0,
                              double strike, double[] dt, double[] rates, double[] volas)
{
    using (var rng = Generator.CreateGpu(gpu, RngType.PSEUDO_XORWOW))
    {
        var nt = dt.Length;
        var gaussian = gpu.Allocate<double>(numSamplesPerBatch * nt);
        var prices = gpu.Allocate<double>(numSamplesPerBatch);
        var batchPrices = new double[numBatches];
        for (var batch = 0; batch < numBatches; batch++)
        {
            rng.SetGeneratorOffset((ulong)batch * (ulong)(numSamplesPerBatch * nt));
            rng.GenerateNormal(gaussian, 0, 1);
            gpu.For(0, numSamplesPerBatch,
                i => prices[i] = AsianCall(i, spot0, strike, dt, rates, volas, gaussian));
            batchPrices[batch] = gpu.Average(prices);
        }
        Gpu.Free(gaussian);
        Gpu.Free(prices);
        return batchPrices.Average();
    }
}
```
Note that `PriceGpu` has the attribute `[GpuManaged]` because we let Alea GPU do the memory transfer for us. We only create storage for the random numbers and the prices on the device. All the input data, such as `dt`, `rates` and `volas`, is captured in a closure and moved automatically to the GPU. The explicit deallocation of the GPU memory with `Gpu.Free` is optional but good practice: GPU memory is page-locked, so to avoid unnecessary GPU memory limitations, it is good to release unused GPU memory immediately and not wait for the next garbage collection run.
We can very easily program a CPU version:
```{.cs}
public static double PriceCpu(int numBatches, int numSamplesPerBatch, double spot0,
                              double strike, double[] dt, double[] rates, double[] volas)
{
    using (var rng = Generator.CreateCpu(rngType))
    {
        var nt = dt.Length;
        var gaussian = new double[numSamplesPerBatch * nt];
        var prices = new double[numSamplesPerBatch];
        var batchPrices = new double[numBatches];
        for (var batch = 0; batch < numBatches; batch++)
        {
            rng.SetGeneratorOffset((ulong)batch * (ulong)(numSamplesPerBatch * nt));
            rng.GenerateNormal(gaussian, 0, 1);
            Parallel.For(0, numSamplesPerBatch,
                i => prices[i] = AsianCall(i, spot0, strike, dt, rates, volas, gaussian));
            batchPrices[batch] = prices.Average();
        }
        return batchPrices.Average();
    }
}
```
This time we use the .NET `Parallel.For` and reuse the core path generation and payoff logic from `AsianCall`. The other versions also share the same core algorithm `AsianCall`. We skip further code listings.
The CUDA C++ version looks very similar, except that there is no GPU parallel-for, so we program the kernel ourselves. Because CUDA C++ has no array types as in .NET, we have to pass the array lengths as additional arguments to the kernel.
```{.cpp}
__global__ void asianCall(int numSamplesPerBatch, double spot0, double strike, int nt,
                          double* dt, double* rates, double* volas,
                          double* gaussian, double* prices)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int step = gridDim.x * blockDim.x;
    for (unsigned int i = tid; i < numSamplesPerBatch; i += step)
    {
        double sum = 0.0;
        double spot = spot0;
        for (int k = 0; k < nt; k++)
        {
            double sigma = volas[k];
            double drift = dt[k] * (rates[k] - sigma * sigma / 2);
            spot = spot * exp(drift + sqrt(dt[k]) * sigma * gaussian[k * nt + i]);
            sum += spot;
        }
        prices[i] = max(sum / (double)nt - strike, 0.0);
    }
}
```
The orchestration of the calculations is now a bit more involved because there is no automatic memory management and transfer. Also, because `cudaMemcpy` takes the number of bytes to copy, we have to calculate the sizes ourselves. Typecasts further complicate the code. The average price per batch is calculated with the reduction function of CUDA Thrust.
```{.cpp}
double priceGpu(int gridSize, int blockSize, int numBatches, int numSamplesPerBatch,
                double spot0, double strike, int nt,
                double* dtHost, double* ratesHost, double* volasHost)
{
    curandGenerator_t gen;
    double *gaussian, *prices, *batchPrices, *dt, *rates, *volas;
    batchPrices = (double *)calloc(numBatches, sizeof(double));

    CUDA_CALL(cudaMalloc((void **)&gaussian, numSamplesPerBatch*nt*sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&prices, numSamplesPerBatch*sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&dt, nt*sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&rates, nt*sizeof(double)));
    CUDA_CALL(cudaMalloc((void **)&volas, nt*sizeof(double)));
    CUDA_CALL(cudaMemcpy(dt, dtHost, nt*sizeof(double), cudaMemcpyHostToDevice));
    CUDA_CALL(cudaMemcpy(rates, ratesHost, nt*sizeof(double), cudaMemcpyHostToDevice));
    CUDA_CALL(cudaMemcpy(volas, volasHost, nt*sizeof(double), cudaMemcpyHostToDevice));

    CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW));
    CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen, 1234ULL));

    thrust::device_ptr<double> pricesPtr(prices);
    for (int batch = 0; batch < numBatches; batch++)
    {
        CURAND_CALL(curandSetGeneratorOffset(gen,
            (unsigned long long)batch * (unsigned long long)(numSamplesPerBatch * nt)));
        CURAND_CALL(curandGenerateNormalDouble(gen, gaussian, numSamplesPerBatch*nt, 0.0, 1.0));
        asianCall<<<gridSize, blockSize>>>(numSamplesPerBatch, spot0, strike, nt,
                                           dt, rates, volas, gaussian, prices);
        batchPrices[batch] = thrust::reduce(pricesPtr, pricesPtr + numSamplesPerBatch, 0.0)
                             / (double)numSamplesPerBatch;
    }

    double batchSum = 0.0;
    for (int batch = 0; batch < numBatches; batch++)
        batchSum += batchPrices[batch];
    double price = batchSum / (double)numBatches;

    cudaFree(prices);
    cudaFree(gaussian);
    cudaFree(dt);
    cudaFree(rates);
    cudaFree(volas);
    free(batchPrices);
    return price;
}
```
We have seen that Alea GPU is very competitive with CUDA C++. Using our higher level abstractions such as parallel-for and parallel aggregate, C# programmers can implement performance critical algorithms targeting GPUs without explicit knowledge of CUDA C++, but with the same performance benefits as native CUDA C++.
We are interested to hear your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.
If you would like to play around with Alea GPU V3 already, come and join us on April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.
The post “Auto-magic” GPU Programming appeared first on QuantAlea Blog.
We believe this does not need to be so. This blog post is a sneak preview of the upcoming version 3 of Alea GPU, which sets completely new standards for GPU development on managed platforms such as .NET or the JVM. So what is so exciting about version 3?
It makes GPU programming much easier and often “auto-magic” – we manage memory allocation and transfer to and from the GPU in an economic way. It is built for developers who grew up without pointers, malloc and free.
We integrate well with the .NET type system, which means for example that .NET arrays can be used directly in GPU kernels.
Of course for all the hard core CUDA programmers with .NET affinity, we still support CUDA in C# and F# as we did already with our previous versions – even better.
Let’s look at the API from a usability point of view and at the underlying technology and implementation details.
LINQ is a technology that extends languages such as C# with powerful query capabilities for data access and transformation. It can be extended to support virtually any kind of data store, including data and collections that reside on a GPU. Alea GPU LINQ introduces new LINQ extensions to express GPU computations with LINQ expressions that are optimized and compiled to efficient GPU code. The main advantages of coding whole GPU workflows with LINQ expressions are:
GPU LINQ workflows provide many standard operations such as parallel aggregation or parallel map, which makes them more expressive and reduces boilerplate code.
We can apply various kernel optimization techniques, such as GPU kernel fusing, which results in fewer GPU kernel launches and more compact GPU code.
Having the full code as an expression allows us to better optimize memory management and data transfer.
Here is a simple but interesting example, which determines the largest value of an array of values on the GPU in two steps. First we index the sequence, and then we reduce the new array of indexed values with a custom binary operator that compares the values to find the maximal value and its index. A priori this would require two GPU kernels: the first is a parallel map, the second is a parallel reduction. Alea GPU can fuse the two kernels into one.
```{.cs}
var workflow = GpuWorkflow
    .Bind(data)
    .Select((v, i) => new ValueAndIndex { Value = v, Index = i })
    .Aggregate((x, y) => x.Value > y.Value ? x : y);

var argMax = gpu.Run(workflow);
```
A more complex example is the calculation of the fair value of an Asian option with Monte Carlo simulation based on the celebrated Black-Scholes model.
```{.cs}
var workflow = GpuWorkflow.MapReduce(numBatches, batch =>
{
    return GpuWorkflow
        .Bind(gaussian)
        .GenerateRandomNormal((ulong)batch * (ulong)(numSamplesPerBatch * nt), 0, 1)
        .Range(0, numSamplesPerBatch)
        .Select(i =>
        {
            var sum = 0.0;
            var spot = spot0;
            for (var k = 0; k < dt.Length; k++)
            {
                var sigma = volas[k];
                var drift = dt[k] * (rates[k] - sigma * sigma / 2.0);
                spot = spot * DeviceFunction.Exp(drift + DeviceFunction.Sqrt(dt[k]) * sigma * gaussian[i * dt.Length + k]);
                sum += spot;
            }
            return DeviceFunction.Max(sum / dt.Length - strike, 0.0);
        })
        .Aggregate(Double.IterativeMean);
}, Double.IterativeMean);

var result = gpu.Run(workflow);
```
The Monte Carlo simulation runs in multiple batches; each batch consists of `numSamplesPerBatch` samples. Workflows are composable. The outer workflow is a map-reduce, which launches the batch sampling and reduces the batch means to the mean across all batches. The inner workflow does the actual Monte Carlo simulation. It first binds storage to the workflow, which is then populated with normally distributed random numbers. The core algorithm is in the `Select`: for each sample index `i` it generates a path and prices the option along the path. The `Aggregate` method of the inner workflow calculates the batch sample mean with a parallel reduction.
An alternative abstraction is provided with the GPU parallel-for and parallel aggregate pattern. Together with automatic memory management, they allow us to write parallel GPU code as if it were serial CPU code. The usage is very simple. We select a GPU device and pass a delegate to the `gpu.For` method. All the variables used in the delegate are captured in a closure that is then passed to the parallel-for body. The data is automatically moved to the GPU and the results are brought back automatically to the host.
The element-wise sum on the GPU is now as simple as this:
```{.cs}
var gpu = Gpu.Default;
var arg1 = Enumerable.Range(0, Length).ToArray();
var arg2 = Enumerable.Range(0, Length).ToArray();
var result = new int[Length];

gpu.For(0, result.Length, i => result[i] = arg1[i] + arg2[i]);
```
The delegate accesses the data elements `arg1` and `arg2` that are defined outside of the loop body and writes the `result` directly to a .NET array. The runtime system takes care of all the memory management and transfer. Because the delegate does not rely on any GPU specific features such as shared memory, it can execute on the CPU as well as on the GPU. The runtime system also takes care of selecting the thread block size based on the occupancy of the generated kernel.
The parallel aggregate works in the same way. It requires a binary associative operator which is used to reduce the input collection to a single value. Our implementation does not require the operator to be commutative. The following code calculates the sum of the array elements on the GPU:
```{.cs}
var gpu = Gpu.Default;
var arg = Enumerable.Range(0, Length).ToArray();
var result = gpu.Aggregate(arg, (x, y) => x + y);
```
Our automatic memory management system handles memory allocation and data movement between the different memory spaces of the CPU and the GPU without the programmer having to manage this manually. It is efficient: unnecessary copy operations are avoided by analyzing the memory accesses. The implementation is based on code instrumentation, a technique that inserts additional instructions into an existing execution path. Alea GPU modifies the CPU code by inserting instructions that monitor array accesses and perform minimal data transfers between the CPU and GPU. As these runtime checks generate a slight performance overhead, the scope of the analysis is limited to code carrying the attribute `[GpuManaged]`. Leaving out this attribute never means that data will not be copied; it may only result in unnecessary intermediate copies.
To illustrate the automatic memory management in more detail, we look at an example. We iterate a parallel-for loop 100 times, incrementing the input by one in each iteration. First of all, we consider the situation without the `[GpuManaged]` attribute. In this case, the data is still copied automatically, although more frequently than necessary due to the limited scope of the analysis.
```{.cs}
public static void Unmanaged()
{
    var data = new int[Length];

    for (var k = 0; k < 100; k++)
        Gpu.Default.For(0, data.Length, i => data[i] += 1);

    var expected = Enumerable.Repeat(100, Length).ToArray();
    Assert.That(data, Is.EqualTo(expected));
}
```
We check the memory copy operations with the NVIDIA Nsight profiler. As expected, the low level CUDA driver functions `cuLaunchKernel`, `cuMemcpyHtoD_v2` and `cuMemcpyDtoH_v2`, which launch the kernel and perform the memory copies, are called 100 times each. This means that the data is copied in and out for each of the 100 sequential parallel-for launches. Let us add the attribute `[GpuManaged]` to turn on automatic memory management.
```{.cs}
[GpuManaged]
public static void Managed()
{
    var data = new int[Length];

    for (var k = 0; k < 100; k++)
        Gpu.Default.For(0, data.Length, i => data[i] += 1);

    var expected = Enumerable.Repeat(100, Length).ToArray();
    Assert.That(data, Is.EqualTo(expected));
}
```
We see that `cuMemcpyHtoD_v2` and `cuMemcpyDtoH_v2` are now called just once. The reason is that the result data of a preceding GPU parallel-for loop can stay on the GPU for the succeeding parallel-for loop, without needing to copy the intermediate data back and forth to the CPU. Copying is only involved for the input of the first GPU execution and for the output of the last GPU computation.
For a C# developer it would be very convenient to use .NET arrays and other standard .NET types directly in a GPU kernel, with all the memory management and data movement handled automatically. .NET types are either reference types or value types. Value types hold their data in place, whereas a reference type holds a pointer which points to the memory location. Structs are value types and classes are reference types. Blittable types are types that have a common representation in both managed and unmanaged memory; in particular, reference types are always non-blittable. Copying non-blittable types from one memory space to another requires marshalling, which is usually slow.
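To make this concrete, here is a small illustration of our own (not from the Alea GPU documentation): a struct with only primitive fields is blittable and can be pinned in place, so it can be transferred to the device without marshalling.

```{.cs}
using System;
using System.Runtime.InteropServices;

// A struct with only primitive fields is blittable: its managed and
// unmanaged representations are identical, so it can be pinned in place.
public struct Particle
{
    public double X;
    public double Y;
}

public static class BlittableDemo
{
    public static void Main()
    {
        var particles = new Particle[1024];

        // Pinning succeeds because Particle[] is an array of a blittable type;
        // a GPU runtime can copy it with a straight memcpy-style transfer.
        // Pinning an array of a reference type would throw an ArgumentException.
        var handle = GCHandle.Alloc(particles, GCHandleType.Pinned);
        Console.WriteLine(handle.AddrOfPinnedObject() != IntPtr.Zero);
        handle.Free();
    }
}
```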
From the point of view of efficiency, we made the decision to only support .NET arrays with blittable element types, as well as jagged arrays thereof. This is a good compromise between usability and performance. To illustrate the benefits, let's look at how to write an optimized matrix transpose. With Alea GPU version 2 you have to work with device pointers, and all the matrix index calculations have to be done by hand.
```{.cs}
// Alea GPU version 2
public class MatrixTransposeModule<T> : ILGPUModule
{
    public static void TransposeNoBankConflictsAleaV2(int width, int height,
                                                      deviceptr<T> a, deviceptr<T> at)
    {
        var tile = __shared__.Array<T>(TileDim * (TileDim + 1));

        var col = blockIdx.x * TileDim + threadIdx.x;
        var row = blockIdx.y * TileDim + threadIdx.y;
        var index_in = col + row * width;
        for (var i = 0; i < TileDim; i += BlockRows)
        {
            tile[(threadIdx.y + i) * (TileDim + 1) + threadIdx.x] = a[index_in + i * width];
        }

        Intrinsic.__syncthreads();

        var colt = blockIdx.y * TileDim + threadIdx.x;
        var rowt = blockIdx.x * TileDim + threadIdx.y;
        var index_out = colt + rowt * height;
        for (var i = 0; i < TileDim; i += BlockRows)
        {
            at[index_out + i * height] = tile[threadIdx.x * (TileDim + 1) + threadIdx.y + i];
        }
    }

    // ...
}
```
Alea GPU version 2 requires that kernels and other GPU resources are in a class that inherits from `ILGPUModule`. Apart from this, the kernel implementation resembles the CUDA C implementation very closely.
With Alea GPU V3 you don't need to inherit from a base module class anymore. You can work directly with .NET arrays in the kernel, also for the shared memory tile. We save the error-prone matrix element index calculations and only need to map the thread block to the matrix tile.
```{.cs}
// Alea GPU version 3
public class MatrixTranspose<T>
{
    public static void TransposeNoBankConflictsAleaV3(T[,] a, T[,] at)
    {
        var tile = __shared__.Array2D<T>(TileDim, TileDim + 1);

        var col = blockIdx.x * TileDim + threadIdx.x;
        var row = blockIdx.y * TileDim + threadIdx.y;
        for (var k = 0; k < TileDim; k += BlockRows)
        {
            tile[threadIdx.y + k, threadIdx.x] = a[row + k, col];
        }

        DeviceFunction.SyncThreads();

        var colt = blockIdx.y * TileDim + threadIdx.x;
        var rowt = blockIdx.x * TileDim + threadIdx.y;
        for (var k = 0; k < TileDim; k += BlockRows)
        {
            at[rowt + k, colt] = tile[threadIdx.x, threadIdx.y + k];
        }
    }

    // ...
}
```
Alea GPU version 2 requires explicit memory allocation, data copying and calling the kernel with device pointers. An additional inconvenience is that matrices stored in two-dimensional arrays first have to be flattened.
```{.cs}
public class MatrixTransposeModule<T> : ILGPUModule
{
    // ...

    public T[] Run(int width, int height, T[] A)
    {
        using (var dA = GPUWorker.Malloc(A))
        using (var dAt = GPUWorker.Malloc<T>(width * height))
        {
            var lp = LaunchParams(width, height);
            GPULaunch(TransposeNoBankConflictsAleaV2, lp, width, height, dA.Ptr, dAt.Ptr);
            return dAt.Gather();
        }
    }
}
```
Here is the kernel launch code that relies on automatic memory management. The developer allocates a .NET array for the result and passes it, together with the input matrix, directly to the kernel.
```{.cs}
// Alea GPU version 3 - automatic memory management
public static class MatrixTranspose<T>
{
    // ...
    private static readonly Gpu gpu = Gpu.Default;

    public static T[,] Run(T[,] A)
    {
        var At = new T[A.GetLength(1), A.GetLength(0)];
        gpu.Launch(NoBankConflictsKernel, lp, A, At);
        return At;
    }
}
```
Without compromising usability, the programmer can also work with explicit memory management.
```{.cs}
// Alea GPU version 3 - explicit memory management
public static class MatrixTranspose<T>
{
    // ...
    private static readonly Gpu gpu = Gpu.Default;

    public static T[,] Run(T[,] A)
    {
        var a = gpu.Allocate(A);
        var at = gpu.Allocate<T>(A.GetLength(1), A.GetLength(0));
        gpu.Launch(NoBankConflictsKernel, lp, a, at);
        return Gpu.CopyToHost(at);
    }
}
```
Here the arrays `a` and `at` are fake arrays representing arrays on the GPU device, and they can be used in a GPU kernel the same way as ordinary .NET arrays. The only difference is that the programmer is now responsible for copying back the result explicitly with `Gpu.CopyToHost`. Of course the `deviceptr` API is still available and often useful for low level primitives or for writing highly optimized code.
Alea GPU version 3 also has better support for delegates and lambda expressions. Here is a simple generic transform that takes a binary function object as an argument and applies it to arrays of input data:
```{.cs}
public static void Kernel<T>(T[] result, T[] arg1, T[] arg2, Func<T, T, T> op)
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = op(arg1[i], arg2[i]);
    }
}
```

We can launch it with a lambda expression as follows:

```{.cs}
gpu.Launch(TransformGeneric.Kernel, lp, result, arg1, arg2, (x, y) => x + y);
```
The next example defines a struct representing a complex number, which is a blittable value type.
```{.cs}
public struct Complex<T>
{
    public T Real;
    public T Imag;

    public override string ToString()
    {
        return $"({Real}+I{Imag})";
    }
}
```
We define a delegate that adds two complex numbers. It creates the result directly with the default constructor. Note that this delegate is free of any GPU specific code and can be executed on the CPU and GPU alike.
```{.cs}
Func<Complex<double>, Complex<double>, Complex<double>> add =
    (x, y) => new Complex<double> { Real = x.Real + y.Real, Imag = x.Imag + y.Imag };
```
It can be used in the parallel `Gpu.For` to perform an element-wise complex addition

```{.cs}
Gpu.Default.For(0, result.Length, i => result[i] = add(arg1[i], arg2[i]));
```

or in the above generic transform kernel.
From an implementation point of view, a challenge is that delegates are runtime objects. This means we have to JIT compile the delegate code at runtime. Fortunately our compiler has had this feature since its initial version. For a delegate such as `i => result[i] = arg1[i] + arg2[i]`, the C# compiler will generate a closure class with fields and an `Invoke` method:
```{.cs}
class CompilerGenerated
{
    public int[] result;
    public int[] arg1;
    public int[] arg2;

    public void Invoke(int i)
    {
        result[i] = arg1[i] + arg2[i];
    }
}
```
To instantiate the delegate instance, the C# compiler generates code to instantiate the closure class, set its fields, and to create the delegate instance with the closure instance and the method’s function pointer:
```{.cs}
var closure = new CompilerGenerated();
closure.result = result;
closure.arg1 = arg1;
closure.arg2 = arg2;
var fp = ldftn(methodof(CompilerGenerated.Invoke));
var del = new del(closure, fp);
gpu.For(0, result.Length, del);
```
The above code is just illustrative and not legal C# code: both `ldftn` and `methodof` stand in for the real IL instructions that the C# compiler generates.
Whenever the Alea GPU compiler finds this delegate, it translates the closure class into a kernel struct and JIT compiles the GPU kernel code that comes from the `Invoke` method of the compiler generated class. Alea GPU caches the results of JIT compilations in a dictionary using the key `methodof(CompilerGenerated.Invoke)`, so it will not compile delegates with the same method multiple times.
There is one thing that needs to be noted: since we translate the closure class into a struct and pass it to the GPU as a kernel argument, it is not possible to change the values of its fields. For example, a delegate like `i => result = arg1` does not work.
The core component of our automatic memory management system is a memory tracker. It tracks .NET arrays and their counterparts residing in GPU device memory. Every array has a flag that indicates if it is out of date. The tracking of an array starts the first time it is used (implicitly in a closure or explicitly as an argument) in a GPU kernel launch. A weak reference table stores, for every tracked array, the host-out-of-date flag and, for every GPU, the corresponding device memory together with the device-out-of-date flag.
(array, hostOutOfDate) : [(gpu0, deviceMem0, deviceOutOfDate0), (gpu1, deviceMem1, deviceOutOfDate1), ....]
The memory tracker has the following methods:
The default procedure is as follows: If an array is used in a kernel launch on a GPU the tracker makes the array up to date on that GPU by copying it to device memory just before the kernel is launched. After the kernel launch the tracker makes the array again up to date on the host by copying it back to the CPU memory. This very simple strategy always works but often leads to unnecessary memory transfers. The basic idea of our automatic memory management system is to defer the synchronization of a host array with its device counterpart to the point where the host array is actually accessed again. We implement this deferred synchronization strategy with code instrumentation, which inserts additional checks and memory tracker method calls in the right places.
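The bookkeeping behind this deferred strategy can be sketched as a minimal Python model (the names `on_kernel_launch` and `host_up_to_date_for` are invented for this sketch; the real tracker additionally keeps per-GPU entries and uses a weak reference table):

```python
# Minimal model of the tracker's bookkeeping. A host array is copied to the
# "device" just before a launch; the copy back to the host is deferred until
# the host actually reads the array again.
class MemoryTracker:
    def __init__(self):
        self.entries = {}   # id(array) -> {host_stale, device, device_stale}

    def on_kernel_launch(self, array):
        e = self.entries.setdefault(
            id(array), {"host_stale": False, "device": None, "device_stale": True})
        if e["device_stale"]:
            e["device"] = list(array)   # host -> device copy just before launch
            e["device_stale"] = False
        e["host_stale"] = True          # the kernel may write; defer the copy-back
        return e["device"]

    def host_up_to_date_for(self, array):
        e = self.entries.get(id(array))
        if e and e["host_stale"]:
            array[:] = e["device"]      # device -> host copy on first host access
            e["host_stale"] = False

tracker = MemoryTracker()
a = [1, 2, 3]
device_a = tracker.on_kernel_launch(a)  # "launch": a is copied to the device
device_a[0] = 99                        # the kernel writes device memory
tracker.host_up_to_date_for(a)          # deferred sync when the host reads a
```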
Because instrumentation adds additional overhead we narrow down the ranges of instrumentation. A function is either GpuManaged
or GpuUnmanaged
. By default, a function is GpuUnmanaged
, which means that it does not defer memory synchronization and its code is therefore not instrumented. If a function has the GpuManaged
attribute, we insert code and method calls to track array accesses and defer the synchronization. At a minimum, the functions Alea.Gpu.Launch
and Alea.Gpu.For
are GpuManaged
.
Methods with the attribute GpuManaged
are inspected in a post-build process. We check if a function contains IL instructions such as ldelem
, ldelema
, stelem
, call Array.GetItem()
, call Array.SetItem()
, etc. that access a specific array. In this case we extract the array operand and insert code to defer its synchronization. A standard use case is a loop over all the array elements to set or modify them. In such a case we can optimize the tracking by creating local cache flags. Here is an example:
for (var k = 0; k < m; k++)
{
    for (var i = 0; i < n; i++)
        array[i] = i;
    gpu.Launch(kernel, lp, array);
}
Instrumentation produces code that is functionally equivalent to the following source code:
object insync1 = null;
for (var k = 0; k < m; k++)
{
    for (var i = 0; i < n; i++)
    {
        if (insync1 != array)
        {
            MemoryTracker.HostUpToDateFor(array);
            insync1 = array;
        }
        array[i] = i;
    }
    SetGpuManagedSession();
    gpu.Launch(kernel, lp, array);
    insync1 = null;
}
Calling a method like MemoryTracker.HostUpToDateFor()
many times to check whether an array has to be synchronized generates a huge overhead. We use the flag to bypass the call once we know the array is synchronized and reset the flag after kernel launches. At the end of a GpuManaged
method, we insert code to bring all out-of-date implicitly tracked arrays back to the host. A frequent case is calling other functions from a GpuManaged
function. These other functions can be either GpuManaged
or GpuUnmanaged
. We need to notify the callee to defer memory synchronization. We pass the managed session to the callee, so that it does not bring all out-of-date arrays back to the host before the end of the GpuManaged
session.
The implementation relies on Mono.Cecil and Fody. Here is a sketch of the full code instrumentation that is executed in a post-build step:
1. For every GpuManaged function:
   - for every array element access, add a cache flag and a call to HostUpToDateFor()
   - for GpuManaged functions, call SetFlagOnCurrentThread() before and reset all cache flags after
   - for GpuUnmanaged functions, call HostUpToDateForAllArrays() before calling them
2. Add a try/finally block and in the finally clause call HostUpToDateForAll() if the caller is GpuUnmanaged
3. Weave the modified assembly via Fody
We hope that after reading this post you share the same excitement for the new upcoming version 3 of the Alea GPU compiler for .NET as we do.
Of course we are interested to hear all of your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.
The features that we presented here are still in preview and might slightly change until we finally release version 3.
If you would like to play around with Alea GPU V3 already, come and join us on April 4 at GTC 2016 and attend our tutorial on simplified GPU programming with C#.
The post “Auto-magic” GPU Programming appeared first on QuantAlea Blog.
The post GPUs and Domain Specific Languages for Life Insurance Modeling appeared first on QuantAlea Blog.
Calculating the economic value of the liabilities and capturing their dependence on different scenarios, such as movements of the interest rate or changes in mortality, cannot be achieved without detailed models of the underlying contracts and requires a significant computational effort.
The calculations have to be executed for millions of pension and life insurance contracts and have to be performed for thousands of interest rate and mortality scenarios. This is an excellent case for the application of GPUs and GPU clusters.
In addition, variations in the products have to be captured. While implementing separate code for many products is possible, a lot can be gained from abstractions at a higher level.
To solve these problems, we use the following technologies:
Armed with these technologies we can significantly improve the level of abstraction, and increase generality. Our system will allow actuaries to be more productive and to harness the power of GPUs without any GPU coding. The performance gain of GPU computing makes it much more practical and attractive to use proper stochastic models and to experiment with a large and diverse set of risk scenarios.
The Actulus Modeling Language (AML) is a domain specific language for rapid prototyping in which actuaries can describe life-based pension and life insurance products, and computations on them. The idea is to write declarative AML product descriptions and from these automatically generate high-performance calculation kernels to compute reserves and cash flows under given interest rate curves and mortality curves and shocks to these.
AML allows a formalized and declarative description of life insurance and pension products. Its notation is based on actuarial theory and reflects a high-level view of products and reserve calculations. This has multiple benefits:
The AML system is based on continuous-time Markov models for life insurance and pension products. A continuous-time Markov model consists of a finite number of states and transition intensities between these states. The transition intensity $\mu_{ij}(t)$ from state $i$ to state $j$ at time $t$, when integrated over a time interval, gives the transition probability from state $i$ to state $j$ during the time interval. The Markov property states that future transitions depend on the past only through the current state.
Life insurance products are modeled by identifying states in a Markov model and by attaching payment intensities $b_i(t)$ to the states and lump-sum payments $b_{ij}(t)$ to the transitions.
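As a quick numerical illustration of the relationship between intensities and probabilities (an illustrative sketch, not from the post): with a single transition at a constant intensity $\mu$, the probability of remaining in the initial state over $[0, t]$ is $e^{-\mu t}$, which the product of per-step survival probabilities $(1 - \mu\,dt)$ converges to.

```python
import math

# One transition (e.g. alive -> dead) with constant intensity mu: multiplying
# per-step survival probabilities (1 - mu*dt) converges to exp(-mu*t).
def stay_probability(mu, t, steps=100_000):
    dt = t / steps
    p = 1.0
    for _ in range(steps):
        p *= 1.0 - mu * dt
    return p

p = stay_probability(0.02, 10.0)
target = math.exp(-0.02 * 10.0)   # closed-form survival probability
```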
As an example we consider a product that offers disability insurance. The product can be modeled with three states: active labor market participation, disability, and death. There are transitions from active participation to disability and to death, and from disability to death. Another example is a collective spouse annuity product with future expected cashflows represented by a seven-state Markov model as follows:
Additionally, some products may allow for reactivation, where a previously disabled customer begins active labor again. The product pays a temporary life annuity with repeated payments to the policy holder until some expiration date $n$, provided that he or she is alive. The disability sum pays a lump sum when the policy holder is declared unable to work prior to some expiration $m$.
The state-wise reserve $V_j(t)$ is the reserve at time $t$ given that the insured is in state $j$ at that time. It is the expected net present value at time $t$ of future payments of the product, given that the insured is in state $j$ at time $t$. The principle of equivalence states that the reserves at the beginning of the product should be zero, or the expected premiums should equal the expected benefits over the lifetime of the contract.
The state-wise reserves can be computed using Thiele’s differential equation
$$ \frac{d}{dt} V_j(t) = \left(r(t) + \sum_{k, \, k\neq j} \mu_{jk}(t) \right) V_j(t) - \sum_{k, \, k\neq j} \mu_{jk}(t) V_k(t) - b_j(t) - \sum_{k, \, k\neq j} b_{jk}(t) \mu_{jk}(t) $$
where $r(t)$ is the interest rate at time $t$. Note that the parameters can be divided into three categories: those that come from a product ($b_j$ and $b_{jk}$), those that come from a risk model ($\mu_{jk}$) and the market variables ($r$).
Traditionally, it has often been possible to obtain closed-form solutions to Thiele’s differential equations and then use tabulations of the results. With the more flexible products expressible in AML, closed-form solutions are in general not possible. In particular, by allowing reactivation from disability to active labor market participation mentioned above, one obtains a Markov model with a cycle, and in general this precludes closed-form solutions.
Good numerical solutions of Thiele’s differential equations can be obtained using a Runge-Kutta 4 solver. A reserve computation typically starts with the boundary condition that the reserve is zero (no payments or benefits) after the insured’s age is 120 years, when he or she is assumed to be dead. Then the differential equations are solved, and the reserves computed, backwards from age 120 to the insured’s current age in fixed time steps.
Here is a code fragment of the inner loops of a simplistic RK4 solver expressed in C#.
for (int y=a; y>b; y--)
{
    double[] v = result[y-b];
    v = daxpy(1.0, v, bj_ii(y));
    double t = y;
    for (int s=0; s<steps; s++)
    {
        // Integrate backwards over [y,y-1]
        double[] k1 = dax(h, dV(t, v));
        double[] k2 = dax(h, dV(t + h/2, daxpy(0.5, k1, v)));
        double[] k3 = dax(h, dV(t + h/2, daxpy(0.5, k2, v)));
        double[] k4 = dax(h, dV(t + h, daxpy(1.0, k3, v)));
        v = daxpy(1/6.0, k1, daxpy(2/6.0, k2, daxpy(2/6.0, k3, daxpy(1/6.0, k4, v))));
        t += h;
    }
    Array.Copy(v, result[y-1-b], v.Length);
}
It computes and stores the annual reserve backwards from y = a = 120 to y = b = 35, where dV is an array-valued function expressing the right-hand sides of Thiele’s differential equations, and h is the within-year step size, typically between 1 and 0.01.
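As a language-agnostic sanity check of this backward RK4 scheme (a sketch with illustrative names, not the production code), the same recursion can be verified against the closed-form reserve of a very simple term insurance: one "alive" state, constant interest rate $r$ and mortality intensity $\mu$, a death benefit of 1 and no premiums, for which Thiele's equation reduces to $dV/dt = (r+\mu)V - \mu$ with closed form $V(t) = \frac{\mu}{r+\mu}\left(1 - e^{-(r+\mu)(n-t)}\right)$.

```python
import math

# Thiele's equation for the minimal two-state model (alive -> dead) with
# constant r and mu, death benefit 1, no premiums: dV/dt = (r + mu)*V - mu
def thiele_rhs(t, V, r, mu):
    return (r + mu) * V - mu

def reserve_backward_rk4(n, steps_per_year, r, mu):
    h = -1.0 / steps_per_year        # negative step: integrate from t = n down to 0
    V, t = 0.0, float(n)             # boundary condition: V(n) = 0
    for _ in range(n * steps_per_year):
        k1 = h * thiele_rhs(t, V, r, mu)
        k2 = h * thiele_rhs(t + h / 2, V + k1 / 2, r, mu)
        k3 = h * thiele_rhs(t + h / 2, V + k2 / 2, r, mu)
        k4 = h * thiele_rhs(t + h, V + k3, r, mu)
        V += (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return V

r, mu, n = 0.03, 0.01, 30
v0 = reserve_backward_rk4(n, 100, r, mu)
# closed form for this simple model
exact = mu / (r + mu) * (1 - math.exp(-(r + mu) * n))
```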
The computations in the Runge-Kutta code have to be performed sequentially for each contract, consisting of the products relating to a single insured life. However, it is easily parallelized over a portfolio of contracts, of which there are typically hundreds of thousands, one for each customer. Thus, reserve and cash flow computations present an excellent use case for GPUs. Using GPUs for reserve or cash-flow computations is highly relevant in practice, because such computations can take dozens or hundreds of CPU hours for a reasonable portfolio size. Even with cloud computing this results in slow turnaround times; GPU computing could make it much more practical and attractive to use proper stochastic models and to experiment with risk scenarios.
The Runge-Kutta 4 solver fits the GPU architecture very well because it uses fixed step sizes and therefore causes little thread divergence, provided the contracts are sorted suitably before the computations are started. By contrast, adaptive-step solvers such as Runge-Kutta-Fehlberg 4/5 or Dormand-Prince are often faster on CPUs. They are more likely to cause thread divergence on GPUs because different input data will lead to different iteration counts in the inner loop. Moreover, the adaptive-step solvers deal poorly with the frequent discontinuities in the derivatives that appear in typical pension products, which require repeatedly stopping and then restarting the solver to avoid a deterioration of the convergence rate.
In preliminary experiments, we have obtained very good performance on the GPU over a CPU-based implementation. For instance, we can compute ten thousand reserves for even the most complex insurance products in a few minutes. The hardware we use is an NVIDIA Tesla K40 GPU Computing Module (Kepler GK110 architecture). The software is a rather straightforward implementation of the Runge-Kutta fixed-step solver, using double precision (64 bit) floating-point arithmetic. The kernels are written, or in some experiments automatically generated, in the functional language F#, and compiled and run on the GPU using the Alea GPU framework.
The F# language is widely used in the financial industry, along with other functional languages. We use it for several reasons:
F# is an ideal language for writing program generators, such as generating problem-specific GPU kernels from AML product descriptions.
The project’s commercial partner uses the .NET platform for all development work, and F# fits well with that ecosystem.
For these reasons the Actulus project selected QuantAlea's Alea GPU platform to develop our GPU code. We find that the Alea GPU platform offers excellent performance and robustness. An additional benefit of Alea GPU is its cross platform capability: the same assemblies can execute on Windows, Linux and Mac OS X.
The chief performance-related problems in GPU programming are the usual ones: How to lay out data (for instance, time-dependent interest rate curves and mortality rates) in GPU memory for coalesced memory access; whether to pre-interpolate or not in such time-indexed tables; how to balance occupancy, thread count and GPU register usage per thread; and so on. Alea GPU is feature complete so that we can implement all the required optimizations to tune the code for maximal performance.
The following graphic shows the number of products processed per second as a function of the batch size, i.e. the number of products computed at the same time:
The product in question is a collective spouse annuity product with future expected cashflows calculated for a 30-year old insured represented by the seven-state Markov model depicted above. This product is among the most complex to work with. Depending on the modelling details, the current CPU-based production code, running on a single core at 3.4 GHz, can process between 0.75 and 1.03 collective spouse annuity insurance products per second. If we compare this with the GPU throughput of 30 to 50 insurance products per second we arrive at a speed-up factor in the range of 30 to 65.
The computation kernels are implemented in F# using work flows (also known as computation expressions or monads) and code quotations, a feature-complete and flexible way of using the Alea GPU framework. In our experience the resulting performance is clearly competitive with that of raw CUDA C code.
Using F# through Alea GPU permits much higher programmer productivity, both because F#’s concise mathematical notation suits the application area, and because F# has a better compiler-checked type system than C. For instance, the confusion of device pointers and host pointers that may arise in C is avoided entirely in F#. Hence much less time is spent chasing subtle bugs and mistakes, which is especially important for experimentation and exploration of different implementation strategies. The core Runge-Kutta 4 solver looks like this, using code quotations and imperative F# constructs:
let! kernel =
    <@ fun (input:deviceptr) (output:deviceptr) ... ->
        ...
        while iterDir y do
            bj.Invoke (float(y)) (input+inputIndexing.Invoke i boundaryConditionLength) tmp nV
            daxpy.Invoke 1.0 v tmp v nV
            for s = 0 to steps-1 do
                let t = float(y) + (float(s)/float(steps)) * dir
                // k1 = ...
                dV.Invoke steps ... v k1
                dax.Invoke h2 k1 k1 nV
                daxpy.Invoke 0.5 k1 v tmp nV
                // k2 = ...
                dV.Invoke steps ... tmp k2
                dax.Invoke h2 k2 k2 nV
                daxpy.Invoke 0.5 k2 v tmp nV
                // k3 = ...
                dV.Invoke steps ... tmp k3
                dax.Invoke h2 k3 k3 nV
                daxpy.Invoke 1.0 k3 v tmp nV
                // k4 = ...
                dV.Invoke steps ... tmp k4
                dax.Invoke h2 k4 k4 nV
                // v(n+1) = v + k1/6 + k2/3 + k3/3 + k4/6
                daxpy.Invoke (1.0/6.0) k4 v tmp nV
                daxpy.Invoke (2.0/6.0) k3 tmp tmp nV
                daxpy.Invoke (2.0/6.0) k2 tmp tmp nV
                daxpy.Invoke (1.0/6.0) k1 tmp v nV
            output.[(pos+y+int(dir)-a)] <- ...
            y <- ...
        ... @>
At the same time, F#’s code quotations, or more precisely the splicing operators, provide a simple and obvious way to inline a function such as GMMale into multiple kernels without source code duplication:
let stocasticMarriageProbabilityODE_technical =
    <@ fun (x:float) ... (res:deviceptr) ->
        let GMMale = %GMMale
        ... @>
While similar effects can be achieved using C macros, F# code quotations and splice operators do this in a much cleaner way, with better type checking and IDE support. What is more, F# code quotations allow kernels to be parametrized with both “early” (or kernel compile-time) arguments such as map, and “late” (or kernel run-time) arguments such as n and isPremium:
let Reserve_GF810_dV_Technical (map:Map<Funcs,Expr>) =
    <@ fun (n:int) ... (isPremium : int) ->
        let GMMale = %%map.[Funcs.GMFemale]
        ... @>
An additional reason for using F# is that in the longer term we want to automatically generate the GPU kernels that solve Thiele’s differential equations. The input to the code generator is a description of the underlying state models (describing life, death, disability and so on) and the functions and tables that express age-dependent mortalities, time-dependent future interest rates, and so on. As a strongly typed functional language with abstract data types, pattern matching, and higher-order functions, the F# language is supremely suited for such code generation processes. The state models and auxiliary functions are described by recursive data structures (so-called abstract syntax), and code generation proceeds by traversing these data structures using recursive functions.
Also, the F# language supports both functional programming, used to express the process of generating code on the host CPU, and imperative programming, used to express the computations that will be performed on the GPU. In other words, high-level functional code generates low-level imperative code, both within the same language, which even supports scripting of the entire generate-compile-load-run cycle:
let compileAndExecute template (inputArray:float[]) ... (b:float[]) =
    let irm = Compiler.Compile(template).IRModule
    use program = Worker.Default.LoadProgram(irm)
    let resultsWithTime = program.Run inputArray ... b
    resultsWithTime
The code generation approach will help support a wide range of life insurance and pension products. There are of course alternatives to code generation: First, one might hand-write the differential equations for each product, but this is laborious and error-prone and slows down innovation and implementation, or severely limits the range of insurance products supported. Secondly, one might take an interpretive approach, by letting the (GPU) code analyze the abstract syntax of the product description, but this involves executing many conditional statements, for which the GPU hardware is ill-suited as it may lead to branch divergence. Hence code generation is the only way to support generality while maintaining high performance.
This work was done in the context of the Actulus project, a collaboration between Copenhagen University, the company Edlund A/S, and the IT University of Copenhagen, funded in part by the Danish Advanced Technology Foundation contract 017-2010-3. Thanks are due to the many project participants who contributed to AML and in particular due to Christian Gehrs Kuhre and Jonas Kastberg Hinrichsen for their many competent experiments with GPU computations for advanced insurance products. Quantalea graciously provided experimental licenses for Alea GPU and supported us in various GPU related aspects.
Dr. Peter Sestoft is professor of software development at the IT University of Copenhagen. His research focuses on programming language technology, functional programming (since 1985), and parallel programming, in particular via declarative and generative approaches.
Dr. Daniel Egloff is partner at InCube Group and Managing Director of QuantAlea, a Swiss software engineering company specialized in GPU software development. He studied mathematics, theoretical physics and computer science and worked for more than 15 years as a quant in the financial service industry.
Follow @EgloffDaniel and @QuantAlea on Twitter.
The post GPUs and Domain Specific Languages for Life Insurance Modeling appeared first on QuantAlea Blog.
The post Algo Trading with F# and GPUs appeared first on QuantAlea Blog.
QuantAlea has specialized in GPU computing with a strong focus on the financial services industry. You might already have heard of Alea GPU, QuantAlea's professional GPU development environment for .NET. It allows compiling F#, C# or VB code directly to executable GPU code. Alea GPU has some unique features:
Other solutions are far less complete and not as convenient to use as Alea GPU. For instance, CudaFy generates C code under the hood. It requires a post-build step to compile the generated C code with a CUDA C compiler, breaking the integrity of the development process. The library managedCuda is even simpler: it is just a wrapper for the CUDA Driver API and requires that all the GPU code is programmed in C or C++.
Instead of giving an introduction to Alea GPU and describing how to program GPUs with F# in general, I thought it would be more illustrative to show how we use F# and GPUs for our own algorithmic trading at InCube Capital. F# and GPUs are a very good combination to develop sophisticated algorithmic trading strategies and to analyze and optimize their profitability. I'd like to illustrate that with a few examples.
Let me begin by explaining how we develop the core trading logic in F#. For this we assume that our strategy decides to buy and sell based on a signal which has sufficient predictive power to forecast price movements and turning points. The more complex the trading logic becomes, the easier it is to miss a case, which then generates an error during paper trading or, even worse, produces wrong orders in live execution. We extensively rely on exhaustive pattern matching. It is a very powerful tool to improve the correctness and robustness of our trading logic. We first define some active patterns. Here is an active pattern for the signal transition from a previous value to an actual value:
let (|FromZero|ToZero|NoChange|Change|) (f, s) =
    match UnitSignalValue.OfFloat f, UnitSignalValue.OfFloat s with
    | Zero, Pos -> FromZero
    | Zero, Neg -> FromZero
    | Pos , Zero -> ToZero
    | Neg , Zero -> ToZero
    | Neg , Pos -> Change
    | Pos , Neg -> Change
    | Pos , Pos -> NoChange
    | Neg , Neg -> NoChange
    | Zero, Zero -> NoChange
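The same classification can be sketched language-agnostically (an illustrative Python analogue; it lacks F#'s compile-time exhaustiveness check, which is precisely the value of the active pattern):

```python
# Classify the transition between the previous and the new signal value by sign.
def classify(prev, new):
    sp = (prev > 0) - (prev < 0)   # sign of previous value: -1, 0 or +1
    sn = (new > 0) - (new < 0)     # sign of new value
    if sp == 0 and sn != 0:
        return "FromZero"
    if sp != 0 and sn == 0:
        return "ToZero"
    if sp != sn:
        return "Change"            # sign flipped
    return "NoChange"
```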
Yet another active pattern determines the different trading phases during the day. We have three time intervals. We wait until the trading start date, then we trade and after some time we try to liquidate the positions. If we do not manage to get rid of the position during the liquidation time we do a forced liquidation.
let (|Wait|Trade|Liquidate|Stop|) (input:(TimeSpan*TimeSpan*TimeSpan)*DateTime) =
    let (t1, t2, t3), d = input
    let dt = d.TimeOfDay
    if dt < t1 then Wait
    else if t1 <= dt && dt <= t2 then Trade
    else if t2 < dt && dt <= t3 then Liquidate
    else Stop
With these active patterns we can now formulate the core trading logic as follows:
let newSignal = filterStep past filter.Coeffs
match (periods, quote.TimeStamp), pnlTransactions.HasPosition, (signal, newSignal) with
| Wait, _ , _ -> ()
| Trade, _ , ToZero -> clearOpenPosition "on signal to zero" quote.Quote
| Trade, false, FromZero -> ()
| Trade, false, NoChange -> ()
| Trade, false, Change -> enterNewPosition newSignal quote.Quote
| Trade, true , Change -> switchSide newSignal quote.Quote
| Trade, true , NoChange -> ()
| Trade, true , FromZero -> failwithf "inconsistent zero unit signal with position %d" pnlTransactions.Position
| Liquidate, true , ToZero
| Liquidate, true , Change -> clearOpenPosition (sprintf "on signal change after %A" filter.StopTime) quote.Quote
| Liquidate, _ , _ -> ()
| Stop , true , _ -> clearOpenPosition (sprintf "after exit time %A" filter.LiquidationTime) quote.Quote
| Stop , _ , _ -> ()
You can see that we match on the trading periods, the existence of a position and the pair of previous and new signal value. The trading logic is very compact, descriptive and easy to understand: we open a position for the first time if there is a signal change, switch positions on a signal change, liquidate in the liquidation period only on a signal change and clear all remaining positions after the stop date. The crucial point however is that the pattern match is exhaustive, so that we know for sure that all cases are properly handled.
Once the strategy is coded it has to be configured and validated with back-testing. I already blogged on our GPU cluster framework for bootstrapping walk forward optimization of trading strategies and machine learning, which I also presented at the NVIDIA GTC conference early this year.
Instead of going into more details about bootstrap walk forward optimization or machine learning I would like to present another technique to choose suitable strategies from a collection of candidates. Let's assume that we somehow identified 20 to 50 signal configurations. Each signal gives rise to a trading strategy. We can back-test these strategies and deduce various characteristics and statistics such as the daily returns, Sharpe ratio, drawdown, mean, skewness, kurtosis and the daily return correlation between them. The idea is to select out of these 20 to 50 candidates a handful of strategies which have the least amount of correlation, the best overall Sharpe ratio and are as Gaussian as possible.
To keep it simple we just consider the case of selecting a fixed number of strategies which have the least amount of correlation. Unfortunately, there is no known closed-form solution to this problem and the straightforward approach is a search algorithm. As soon as we add additional filters and constraints as mentioned above, a full-blown search is inevitable anyway, so we did not even bother looking for an alternative solution approach. Assuming that we have $n$ candidate strategies and we want to find the $k$ strategies with the least amount of correlation we would need to examine
$$ \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!} $$
different combinations of choosing $k$ elements from a set of $n$ elements. This quickly becomes a prohibitively large number.
$k$ | $\binom{20}{k}$ | $\binom{50}{k}$ |
---|---|---|
2 | 190 | 1225 |
3 | 1140 | 19600 |
4 | 4845 | 230300 |
5 | 15504 | 2118760 |
6 | 38760 | 15890700 |
7 | 77520 | 99884400 |
8 | 125970 | 536878650 |
9 | 167960 | 2505433700 |
10 | 184756 | 10272278170 |
First we have to implement a function to calculate $\binom{n}{k}$. For this we should rely on the product formula
$$ \binom{n}{k} = \prod_{i=1}^{k} \frac{n - k + i}{i} $$
The alternation between multiplication and division prevents temporary values from growing unnecessarily. Of course we have to use
uint64 because the range of
int is not sufficient. The implementation is pretty straightforward. The reason why we add the attribute
[<ReflectedDefinition>] will become clear soon.
[<ReflectedDefinition>]
let choose (n:int) (k:int) =
    if k > n then 0UL
    else if k = 0 || k = n then 1UL
    else if k = 1 || k = n - 1 then uint64 n
    else
        let delta, iMax = if k < n - k then uint64 (n - k), k else uint64 k, n - k
        let mutable res = delta + 1UL
        for i = 2 to iMax do
            res <- res * (delta + uint64 i) / uint64 i
        res
Next we have to find an algorithm to list all the $\binom{n}{k}$ different combinations. A serial algorithm is straightforward. Here is an example with $n = 5$ and $k=3$. We start with the first combination [0; 1; 2] and then increase the last element until it cannot be increased any further. Then the preceding element is increased, always keeping the elements in increasing order.
$m$ | Combination |
---|---|
0 | [0; 1; 2] |
1 | [0; 1; 3] |
2 | [0; 1; 4] |
3 | [0; 2; 3] |
4 | [0; 2; 4] |
5 | [0; 3; 4] |
6 | [1; 2; 3] |
7 | [1; 2; 4] |
8 | [1; 3; 4] |
9 | [2; 3; 4] |
For a parallel implementation we need a way to generate the $m$-th combination directly. Here is an implementation which does the job.
[<ReflectedDefinition>]
let largest a b x =
    let mutable v = a - 1
    while choose v b > x do
        v <- v - 1
    v

let subset (n:int) (k:int) (m:uint64) =
    let sub = Array.zeroCreate k
    let mutable x = (choose n k) - 1UL - m
    let mutable a = n
    let mutable b = k
    for i = 0 to k-1 do
        sub.[i] <- largest a b x
        x <- x - choose sub.[i] b
        a <- sub.[i]
        b <- b - 1
    sub |> Array.map (fun v -> n - 1 - v)
The function
choose did not require any memory allocation or object creation and can run as such on a GPU. This is not the case for
subset. We refactor it as follows:
[<ReflectedDefinition>]
let subset (getSub : int -> int) (setSub : int -> int -> unit) (n:int) (k:int) (m:uint64) =
    let mutable x = (choose n k) - 1UL - m
    let mutable a = n
    let mutable b = k
    for i = 0 to k-1 do
        largest a b x |> setSub i
        x <- x - choose (getSub i) b
        a <- getSub i
        b <- b - 1
    for i = 0 to k-1 do
        setSub i (n - 1 - getSub i)
The functions
getSub : int -> int and
setSub : int -> int -> unit abstract the memory for storing the combination. The attribute
[<ReflectedDefinition>] instructs the F# compiler to generate a quotation expression so that our Alea GPU compiler can transform it to GPU code. The curious reader can find more details about our compilation process in the manual.
Having solved the combinatoric problem we can continue with the calculation of the correlation of a combination. For efficiency reasons we store the correlation matrix in a compact format, which only stores the lower triangular part as an array. We do not store the diagonal as it is always 1, nor the upper triangular part because the matrix is symmetric. Let’s take as an example a 3×3 correlation matrix
r/c | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0000 | 0.9419 | 0.5902 |
1 | 0.9419 | 1.0000 | 0.0321 |
2 | 0.5902 | 0.0321 | 1.0000 |
We only have to store the 3 matrix elements
r/c | 0 | 1 | 2 |
---|---|---|---|
0 | * | * | * |
1 | 0.9419 | * | * |
2 | 0.5902 | 0.0321 | * |
which we flatten into a vector [0.9419; 0.5902; 0.0321]. Here is the F# implementation:
let lowerTriangularPacked (A:float[][]) = A |> Array.mapi (fun i row -> Array.sub row 0 i) |> Array.concat
The length of the packed vector for a matrix of dimension $n$ is
$$ \mathbf{size}(n) = \frac{n(n+1)}{2} - n = \frac{n(n-1)}{2}, $$
because we do not store the diagonal. The offset in the packed vector of an element $(i,j)$ with $i > j$ is calculated as
$$ \mathbf{offset}(i, j) = \frac{i(i-1)}{2} + j = \mathbf{size}(i) + j.$$
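A quick Python sketch (with illustrative names) of the packed storage and the offset arithmetic, checked against the 3×3 example above:

```python
def size(n):
    # number of strictly-lower-triangular elements of an n x n matrix
    return n * (n - 1) // 2

def offset(i, j):
    # position of element (i, j), i > j, in the packed vector
    return size(i) + j

def lower_triangular_packed(A):
    # row-by-row flatten of the strict lower triangle
    return [A[i][j] for i in range(len(A)) for j in range(i)]

C = [[1.0000, 0.9419, 0.5902],
     [0.9419, 1.0000, 0.0321],
     [0.5902, 0.0321, 1.0000]]
packed = lower_triangular_packed(C)
```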
Given a combination $c$ as calculated by the function
subset, we build the correlation sub-matrix, again in packed storage format, as follows:
[<ReflectedDefinition>]
let subMatixPacked (matrix: int -> float) (subMatrix: int -> float -> unit) (selection:int -> int) selLength =
    let mutable l = 0
    for i = 0 to selLength - 1 do
        for j = 0 to i - 1 do
            //printfn "(%d, %d) -> (%d, %d)" i j s.[i] s.[j]
            let offset = offset (selection i) (selection j)
            matrix offset |> subMatrix l
            l <- l + 1
As in the previous code fragment we externalize the correlation matrix access and the selection indices with the functions
matrix: int -> float and
selection: int -> int, and provide a function
subMatrix: int -> float -> unit to set the elements of the sub-matrix. Now we have everything together to calculate the amount of correlation of a sub-matrix determined by a combination. Mathematically, we want to measure how far the sub-matrix is from the $k$-dimensional identity matrix
$$ | C_{sub}(c) - \mathbf{Id}_k |, $$
where $|\cdot|$ is a suitable matrix norm. We decided to choose the Frobenius norm
$$ | C |_{F} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} |C_{ij}|^2 }. $$
There are of course different choices such as the $L_1$ or $L_{\infty}$ norm. In our special case the implementation of the Frobenius distance to the identity is very simple. It is just the square root of the sum of squares of all the elements in the packed vector
[<ReflectedDefinition>]
let distToIdentity n (A: int -> float) =
    let mutable dist = 0.0
    for i = 0 to n-1 do
        dist <- dist + (A i) * (A i)
    sqrt dist
We would like to stress that all of the critical numerical code runs on CPU and GPU, so that we have 100% code reuse. Before we present the final GPU implementation of the least correlated subset problem, let's see how we can use our numerical building blocks to implement the algorithm for CPU execution. All we have to do is manage array creation and call the numerical building blocks with lambda functions to access the data. Here is the code:
let distToIdentity (A: float[]) =
    distToIdentity A.Length (fun i -> A.[i])

let subMatixPacked (A: float[]) (selection: int[]) =
    let subMatrix = Array.zeroCreate (size selection.Length)
    subMatixPacked (fun i -> A.[i]) (fun i v -> subMatrix.[i] <- v)
                   (fun i -> selection.[i]) selection.Length
    subMatrix

let allDistToIdentity (C : float[]) n k =
    let cnk = Binomial.choose n k
    Array.init (int cnk) (fun i ->
        let selection = Binomial.Cpu.subset n k (uint64 i)
        i, subMatixPacked C selection |> distToIdentity)

let leastCorrelated (C : float[]) n k =
    let best = allDistToIdentity C n k |> Array.minBy snd
    Binomial.Cpu.subset n k (fst best |> uint64)
Note that the function allDistToIdentity only works for small values of $n$ and $k$, where $\binom{n}{k}$ is in the range of an int. The function leastCorrelated returns the best combination.
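For illustration, a hypothetical call for $n = 3$ assets and $k = 2$ could look as follows (the packed vector holds the strict lower triangle $C_{21}, C_{31}, C_{32}$; the values are made up):

```fsharp
// Strict lower triangle of a 3x3 correlation matrix in packed storage.
let C = [| 0.9; 0.1; 0.3 |]

// Indices of the two least correlated strategies; here the pair (0, 2)
// with correlation 0.1 should win.
let best = leastCorrelated C 3 2
```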
Let's move on to the final GPU implementation. Up to now no particular GPU knowledge was required apart from adding the attribute [<ReflectedDefinition>]. Now we need to know some basic CUDA terms. For those of you who are GPU freshmen I recommend that you briefly look at this example, where you can learn the basic GPU coding principles. To be able to follow the example you should know that the GPU has a memory space that is different from the CPU and that the GPU runs functions, called kernels, in parallel in many threads. These threads are grouped into blocks of threads. Each thread in a block is identified by its threadIdx and each block by its blockIdx. The size of a block is given by blockDim, whereas the number of blocks can be read from gridDim.
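Before diving into the kernels below, it may help to see the canonical indexing pattern in isolation. Every kernel in this post distributes its work items with a so-called grid-stride loop; here is a generic sketch (not code from the library; numWorkItems stands for the total number of work items):

```fsharp
// Each thread starts at its globally unique index and then jumps
// ahead by the total number of threads in the grid, so any number
// of work items can be covered by a fixed-size launch.
let start = blockIdx.x * blockDim.x + threadIdx.x
let stride = gridDim.x * blockDim.x
let mutable m = start
while m < numWorkItems do
    // ... process work item m ...
    m <- m + stride
```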
We start with a very simple implementation that calculates the distance to the identity for many combinations in parallel. We write a so-called GPU kernel, that is, a function which will be executed on the GPU in many parallel instances.
[<ReflectedDefinition>]
let distToIdentityKernelDevice (n:int) (k:int) (cnk:int) (ns:int)
                               (C:deviceptr<float>) (dist:deviceptr<float>)
                               (selection:deviceptr<int>) (subMatrix:deviceptr<float>) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let selection = selection + start * k
    let subMatrix = subMatrix + start * ns
    let mutable m = start
    while m < cnk do
        subset (fun i -> selection.[i]) (fun i v -> selection.[i] <- v) n k (uint64 m)
        subMatixPacked (fun i -> C.[i]) (fun i v -> subMatrix.[i] <- v)
                       (fun i -> selection.[i]) k
        dist.[m] <- distToIdentity ns (fun i -> subMatrix.[i])
        m <- m + stride
Note that we use the types deviceptr<float> and deviceptr<int> to access data on the GPU. You immediately see that we call our functions subset and subMatixPacked with lambda functions such as fun i -> selection.[i] or fun i v -> subMatrix.[i] <- v. In this implementation the temporary storage for selection and subMatrix is in so-called global device memory, which is the largest memory pool on a GPU, but also the slowest in terms of access latency. How do we now call such a kernel function? For this we have to allocate storage on the GPU, copy data to the GPU, define launch parameters and call the kernel function. Here is the code which does all these steps:
let allDistToIdentityDevice (C : float[]) n k =
    let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
    let cnk = choose n k |> int
    let ns = size k
    let blockSize = 128
    let gridSize = 8 * numSm
    use dC = worker.Malloc(C)
    use dDist = worker.Malloc<float>(cnk)
    use dSelection = worker.Malloc<int>(gridSize * blockSize * k)
    use dSubMatrix = worker.Malloc<float>(gridSize * blockSize * ns)
    let lp = new LaunchParam(gridSize, blockSize)
    worker.Launch <@ distToIdentityKernelDevice @> lp n k cnk ns dC.Ptr dDist.Ptr dSelection.Ptr dSubMatrix.Ptr
    dDist.Gather()
You see from the use binding that all the memory we allocate on the GPU is disposable. The kernel function distToIdentityKernelDevice is passed as a quoted expression to the kernel launch worker.Launch <@ distToIdentityKernelDevice @>. Here worker essentially represents a GPU. For simplicity we choose 128 threads per block and launch 8 times more blocks than there are streaming multiprocessors on the GPU. The last step dDist.Gather() copies the calculated distances back to CPU memory.
Let’s do a first optimization. As already mentioned, global device memory has a high access latency. Registers and thread-local memory, as long as it does not spill over to device memory, are faster. Yet another memory type is shared memory. As suggested by its name, it can be accessed from multiple threads, i.e. it is shared between all the threads of a block. This is very useful for the original correlation matrix $C$. Here is the implementation using shared memory for C as well as for the temporaries selection and subMatrix:
[<ReflectedDefinition>]
let distToIdentityKernelShared (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int)
                               (C:deviceptr<float>) (dist:deviceptr<float>) =
    let sizeC = size n
    let sizeSelection = blockDim.x * k
    let shared = __shared__.ExternArray<float>() |> __array_to_ptr
    let sharedC = shared.Reinterpret<float>()
    let selection = (sharedC + sizeC).Reinterpret<int>()
    let subMatrix = (selection + sizeSelection).Reinterpret<float>()

    // copy C into shared memory in parallel
    let mutable i = threadIdx.x
    while i < sizeC do
        sharedC.[i] <- C.[i]
        i <- i + blockDim.x
    __syncthreads()

    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let selection = selection + threadIdx.x * k
    let subMatrix = subMatrix + threadIdx.x * ns
    let mutable m = start
    while m < partitionSize do
        subset (fun i -> selection.[i]) (fun i v -> selection.[i] <- v) n k (m0 + uint64 m)
        subMatixPacked (fun i -> sharedC.[i]) (fun i v -> subMatrix.[i] <- v)
                       (fun i -> selection.[i]) k
        dist.[m] <- distToIdentity ns (fun i -> subMatrix.[i])
        m <- m + stride
A variable amount of shared memory is allocated as part of the kernel launch; we show how to do that later. In this new kernel we first get the externally allocated shared memory with __shared__.ExternArray(), calculate the offsets for the temporaries, copy the matrix C to sharedC in shared memory in parallel, and then call __syncthreads() to wait for the copy to complete. The rest of the kernel is identical. This version performs much better. Note also the small change that we added an additional argument m0 for the offset where to start the calculations.
So far we just calculated the distance to the identity but did not yet do the minimization. Determining the smallest element in a vector is a classical reduction problem. Alea GPU provides a very efficient and generic reduction implementation. Because we have to find the index of the smallest element we have to create a special type.
[<Struct>]
type IndexAndValue =
    val Index : uint64
    val Value : float

    [<ReflectedDefinition>]
    new(i, v) = { Index = i; Value = v }

    override this.ToString() = sprintf "(%d, %f)" this.Index this.Value

    [<ReflectedDefinition>]
    static member Min (a:IndexAndValue) (b:IndexAndValue) =
        if a.Value < b.Value then a else b
Because instances of this type have to live on the GPU it must be a value type, as enforced by the attribute [<Struct>]. Now we can easily find the index of the minimal value in an array with the parallel reduction from the Alea GPU Unbound library. First we create an instance of DeviceReduceModule<IndexAndValue>:
let minReductionModule = new DeviceReduceModule<IndexAndValue>(GPUModuleTarget.Worker(worker), <@ IndexAndValue.Min @>)
Then we create a reducer for arrays of length up to len and launch the reduction:
use minimum = minReductionModule.Create(len)
use idxAndValues = worker.Malloc<IndexAndValue>(len)
minimum.Reduce(idxAndValues.Ptr, len)
Currently, because of some library limitations you have to set the context type to “threaded” as follows:
Alea.CUDA.Settings.Instance.Worker.DefaultContextType <- "threaded"
Let us now combine this with the kernel to calculate the distance to the identity.
let leastCorrelatedWithTransform (partitionSize:int) (C : float[]) n k =
    let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
    let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK
    let ns = size k
    let cnk = choose n k
    let blockSize = 128
    let gridSize = 8 * numSm
    let sharedSize =
        __sizeof<float>() * size n
        + blockSize * __sizeof<int>() * k
        + blockSize * __sizeof<float>() * ns
    if sharedSize > maxSharedMem then
        failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize
    use dC = worker.Malloc(C)
    use dDist = worker.Malloc<float>(partitionSize)
    use idxAndValues = worker.Malloc<IndexAndValue>(partitionSize)
    use minimum = minReductionModule.Create(partitionSize)
    let lpd = new LaunchParam(gridSize, blockSize, sharedSize)
    let lpr = LaunchParam(gridSize, blockSize)
    let findBest m =
        worker.Launch <@ distToIdentityKernelShared @> lpd m partitionSize n k ns dC.Ptr dDist.Ptr
        worker.Launch <@ transform @> lpr dDist.Ptr idxAndValues.Ptr m partitionSize
        minimum.Reduce(idxAndValues.Ptr, partitionSize)
    let numPartitions = divup cnk (uint64 partitionSize)
    let bestInPartitions =
        [0UL..numPartitions - 1UL]
        |> List.map (fun p -> let m = p * (uint64 partitionSize) in findBest m)
    let best = bestInPartitions |> List.minBy (fun v -> v.Value)
    let bestSelection = Binomial.Cpu.subset n k best.Index
    bestSelection
This implementation of leastCorrelatedWithTransform first calculates the amount of shared memory sharedSize that is required. It then partitions the $\binom{n}{k}$ different combinations into blocks of size partitionSize and loops over all the partitions:
let bestInPartitions = [0UL..numPartitions - 1UL] |> List.map (...)
The final reduction is done on the CPU. To call the reduction kernel we have to transform the vector of distances calculated by distToIdentityKernelShared into a vector of IndexAndValue:
[<ReflectedDefinition>]
let transform (inputs:deviceptr<float>) (outputs:deviceptr<IndexAndValue>) (m0:uint64) (n:int) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let mutable i = start
    while i < n do
        outputs.[i] <- IndexAndValue(m0 + uint64 i, inputs.[i])
        i <- i + stride
We can avoid the transform and directly generate a vector of
IndexAndValue. Here is the code:
[<ReflectedDefinition>]
let distToIdentityKernelSharedIndexAndValue (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int)
                                            (C:deviceptr<float>) (dist:deviceptr<IndexAndValue>) =
    let sizeC = size n
    let sizeSelection = blockDim.x * k
    let shared = __shared__.ExternArray<float>() |> __array_to_ptr
    let sharedC = shared.Reinterpret<float>()
    let selection = (sharedC + sizeC).Reinterpret<int>()
    let subMatrix = (selection + sizeSelection).Reinterpret<float>()

    let mutable i = threadIdx.x
    while i < sizeC do
        sharedC.[i] <- C.[i]
        i <- i + blockDim.x
    __syncthreads()

    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let selection = selection + threadIdx.x * k
    let subMatrix = subMatrix + threadIdx.x * ns
    let mutable m = start
    while m < partitionSize do
        subset (fun i -> selection.[i]) (fun i v -> selection.[i] <- v) n k (m0 + uint64 m)
        subMatixPacked (fun i -> sharedC.[i]) (fun i v -> subMatrix.[i] <- v)
                       (fun i -> selection.[i]) k
        dist.[m] <- IndexAndValue(m0 + uint64 m, distToIdentity ns (fun i -> subMatrix.[i]))
        m <- m + stride

let leastCorrelated (partitionSize:int) (C : float[]) n k =
    let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
    let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK
    let ns = size k
    let cnk = choose n k
    let blockSize = 128
    let gridSize = 8 * numSm
    let sharedSize =
        __sizeof<float>() * size n
        + blockSize * __sizeof<int>() * k
        + blockSize * __sizeof<float>() * ns
    if sharedSize > maxSharedMem then
        failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize
    use dC = worker.Malloc(C)
    use dDist = worker.Malloc<IndexAndValue>(partitionSize)
    use minimum = minReductionModule.Create(partitionSize)
    let lpd = new LaunchParam(gridSize, blockSize, sharedSize)
    let findBest m =
        worker.Launch <@ distToIdentityKernelSharedIndexAndValue @> lpd m partitionSize n k ns dC.Ptr dDist.Ptr
        minimum.Reduce(dDist.Ptr, partitionSize)
    let numPartitions = divup cnk (uint64 partitionSize)
    let bestInPartitions =
        [0UL..numPartitions - 1UL]
        |> List.map (fun p -> let m = p * (uint64 partitionSize) in findBest m)
    let best = bestInPartitions |> List.minBy (fun v -> v.Value)
    let bestSelection = Binomial.Cpu.subset n k best.Index
    bestSelection
Of course the GPU implementation is more difficult than the CPU version. But it can handle much larger problems faster. The important point is that it shares all the core functions with the CPU implementation. Let us have a look at the performance of our implementation.
n | k | CPU | GPU | Speedup |
---|---|---|---|---|
20 | 5 | 33.65 ms | 2.70 ms | 12.40 |
20 | 6 | 69.40 ms | 6.41 ms | 10.81 |
20 | 7 | 169.02 ms | 22.57 ms | 7.49 |
20 | 8 | 334.72 ms | 40.40 ms | 8.28 |
20 | 9 | 513.01 ms | 57.76 ms | 8.88 |
50 | 3 | 25.52 ms | 2.93 ms | 8.70 |
50 | 4 | 369.69 ms | 57.76 ms | 12.32 |
50 | 5 | 5455.00 ms | 329.20 ms | 16.57 |
This is not the end of the optimization; there is still room for improvement. First, because shared memory is always scarce we should try to save it. Because the algorithm only reads from C, we can just as well place it in constant memory, which has a read latency comparable to shared memory. For this we have to use the cuda computational workflow to define the constant memory resource of a GPU module. Note that the constant memory size must be a GPU compile-time constant. This is not a big problem because we can always JIT-compile a new configuration. The next optimization is to store some temporary data, such as the selection indices, in thread-local memory. Note that the size of the local memory also has to be a GPU compile-time constant. Here is the code of the version that uses constant memory and local memory:
let leastCorrelatedModuleUsingConstAndLocalMem maxDim maxK = cuda {
    let sizeC = size maxDim
    let! constC = Compiler.DefineConstantArray<float>(sizeC)

    let! kernel =
        <@ fun (m0:uint64) (partitionSize:int) (n:int) (k:int) (ns:int) (dist:deviceptr<IndexAndValue>) ->
            let subMatrix = __shared__.ExternArray<float>() |> __array_to_ptr
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let selection = __local__.Array<int>(maxK)
            let subMatrix = subMatrix + threadIdx.x * ns
            let mutable m = start
            while m < partitionSize do
                subset (fun i -> selection.[i]) (fun i v -> selection.[i] <- v) n k (m0 + uint64 m)
                subMatixPacked (fun i -> constC.[i]) (fun i v -> subMatrix.[i] <- v)
                               (fun i -> selection.[i]) k
                dist.[m] <- IndexAndValue(m0 + uint64 m, distToIdentity ns (fun i -> subMatrix.[i]))
                m <- m + stride @>
        |> Compiler.DefineKernel

    return Entry(fun (program:Program) ->
        let worker = program.Worker
        let constC = program.Apply(constC)
        let kernel = program.Apply(kernel)
        let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
        let maxSharedMem = worker.Device.Attributes.MAX_SHARED_MEMORY_PER_BLOCK

        let run (partitionSize:int) (C : float[]) n k =
            if size n > sizeC then
                failwithf "dimension %d of C is too large, only dimensions up to %d are supported" n maxDim
            let ns = size k
            let cnk = choose n k
            let blockSize = 128
            let gridSize = 8 * numSm
            let sharedSize = blockSize * __sizeof<float>() * ns
            printfn "block size %d, shared memory size %d (%d), const memory size %d" blockSize sharedSize maxSharedMem sizeC
            if sharedSize > maxSharedMem then
                failwithf "too much shared memory required: max shared mem = %d, required shared memory size = %d" maxSharedMem sharedSize
            constC.Scatter C
            use dDist = worker.Malloc<IndexAndValue>(partitionSize)
            use minimum = minReductionModule.Create(partitionSize)
            let lpd = new LaunchParam(gridSize, blockSize, sharedSize)
            let findBest m =
                kernel.Launch lpd m partitionSize n k ns dDist.Ptr
                minimum.Reduce(dDist.Ptr, partitionSize)
            let numPartitions = divup cnk (uint64 partitionSize)
            let bestInPartitions =
                [0UL..numPartitions - 1UL]
                |> List.map (fun p -> let m = p * (uint64 partitionSize) in findBest m)
            let best = bestInPartitions |> List.minBy (fun v -> v.Value)
            let bestSelection = Binomial.Cpu.subset n k best.Index
            bestSelection

        run ) }
You see that the kernel also becomes slightly shorter. The performance of the smaller problems is pretty much the same, but for larger problems we see an improvement.
n | k | CPU | GPU | Speedup |
---|---|---|---|---|
50 | 3 | 25.52 ms | 2.19 ms | 11.65 |
50 | 4 | 369.69 ms | 10.31 ms | 35.85 |
50 | 5 | 5455.00 ms | 170.89 ms | 31.92 |
There is still more optimization potential which could increase performance by another factor: each single distance calculation can itself be done with a parallel reduction. We leave this for a future improvement.
To enable debugging of JIT compiled code you have to set the JIT compile level:
Alea.CUDA.Settings.Instance.JITCompile.Level <- "Diagnostic"
Then all that is required is to build the code in debug mode and place a breakpoint in the GPU code. Start the NVIDIA debugger from the Visual Studio menu NSIGHT and choose Start CUDA Debugging. Data across different warps is best analyzed in the CUDA warp watch window. Here is an example.
Profiling is crucial to identify further optimization potential. To facilitate the profiling of your GPU code Alea GPU is well integrated with the NVIDIA profiler. For details we refer to the profiling chapter in our tutorial.
We have shown how we use F# and GPUs for our own internal algorithmic trading. F# has a lot of extremely helpful language features, such as active patterns and exhaustive pattern matching, which allow us to develop strategies faster and make them more expressive and more robust.
The usability of GPUs is illustrated with the practically relevant example of finding the subset of least correlated strategies. We went through several optimization and refactoring steps to illustrate how we develop GPU kernels, starting with simple implementations and adding more and more optimizations. The key point is that with the right level of abstraction and by using lambda functions we can share critical code between CPU and GPU in an elegant manner. This shortens development time and reduces maintenance costs.
The post Algo Trading with F# and GPUs appeared first on QuantAlea Blog.
]]>The post F# for Industrial Applications – Worth a Try? appeared first on QuantAlea Blog.
]]>F# is indeed a very compelling technology. But is it also suitable for mainstream and enterprise application development? What are the advantages of F# over C#? When should I use F#, and what would be the business value of adopting it? In this talk we look at F# and its ecosystem from a more strategic point of view. We highlight the strengths and unique features of F# and how they can be used to build sophisticated enterprise applications. We illustrate the business relevance of F# with concrete projects and solutions that we have completed over the last few years.
The post F# for Industrial Applications – Worth a Try? appeared first on QuantAlea Blog.
]]>