How to use Alea.cuBase in Python


Python is often used for scripting and rapid prototyping. In this post we illustrate how we can integrate Alea.cuBase and Python so that we can call GPU algorithms coded with Alea.cuBase conveniently in Python.

In this post we rely on Python for .NET. It provides a nearly seamless integration of Python with the .NET Common Language Runtime (CLR). Note that it does not implement Python as a first-class CLR language, nor does it translate Python code to managed IL code. It is rather an integration of the CPython engine with the .NET runtime.

An alternative approach would be to use IronPython, which is an implementation of the Python programming language targeting the .NET Framework, written entirely in C#. However, because IronPython has some limitations in using very useful Python libraries such as matplotlib, we prefer to work with CPython and Python for .NET.

Setting up the Environment

We suggest that you install the Python Tools for Visual Studio, which turn Visual Studio into a nice Python IDE, supporting both CPython and IronPython.


If you are going to use IronPython, all that is needed is to install IronPython.

Python for .NET

Python for .NET consists of two components:

  1. clr.pyd, a Python module interfacing with the .NET world
  2. Python.Runtime.dll, an assembly used by clr.pyd

We need to compile Python for .NET against the .NET 4.0 framework and the proper Python version. Currently, Python for .NET supports Python versions from 2.3 to 2.7. Check out the source of Python for .NET.

It contains one solution file for VS 2008. Open it with VS 2010; the conversion will succeed without errors. To compile Python for .NET against Python 2.7 and .NET 4.0, the following steps are required:

Right-click on project “Python.Runtime” and select “Properties”, select “Application” tab and change the “Target framework” to “.NET Framework 4”. Then open the file pythonnet\pythonnet\src\runtime\buildclrmodule.bat and change the following command:


Note that it appears twice. Next, open the file and change the lines with the version number in the following piece of code:



To change the Python interpreter version, right-click on the project “Python.Runtime” and select “Properties”. In the “Build” tab, under “Conditional compilation symbols”, change “PYTHON26” to “PYTHON27” to select the Python 2.7 interpreter.

The last step is to patch methodbinder.cs. Replace the method MatchParameters with the following code:

[sourcecode language="csharp"]
private static bool _RetrieveGenericArguments(List<Type> gts, Type pt, Type it)
{
    bool ok = true;
    if (pt.GUID == new Guid())
    {
        // an open generic parameter has an empty GUID: bind it to the actual type
        gts.Add(it);
    }
    else if (pt.IsGenericType && it.IsGenericType && it.GetGenericTypeDefinition().GUID == pt.GUID)
    {
        var pts = pt.GetGenericArguments();
        var its = it.GetGenericArguments();
        for (int i = 0; i < pts.Length; ++i)
            ok &= _RetrieveGenericArguments(gts, pts[i], its[i]);
    }
    else if (!pt.IsGenericType && !it.IsGenericType && pt.GUID == it.GUID)
    {
        // nothing
    }
    else
    {
        ok = false;
    }
    return ok;
}

internal static MethodInfo MatchParameters(MethodInfo[] mis, Type[] its)
{
    foreach (var mi in mis)
    {
        if (!mi.IsGenericMethodDefinition) continue;

        var pts = (from p in mi.GetParameters() select p.ParameterType).ToArray();
        if (pts.Length != its.Length) continue;

        var n = pts.Length;
        var gts = new List<Type>();
        bool ok = true;
        for (int i = 0; i < n; ++i)
            ok &= _RetrieveGenericArguments(gts, pts[i], its[i]);
        if (!ok) continue;
        if (gts.Count != mi.GetGenericArguments().Length) continue;
        return mi.MakeGenericMethod(gts.ToArray());
    }
    return null;
}
Now recompile the project “Python.Runtime”.

After a successful build you can test it with the following simple Python script:

[sourcecode language="python"]
import sys

import clr, System

print System.Environment.

# you can also print out the sys.path
print '-----'
for p in sys.path:
    print p
print '-----'

Note that the path C:\dev\pythonnet\pythonnet\src\runtime\bin\Release has to point to the location of the module clr.pyd and the assembly Python.Runtime.dll.
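Assuming the build output lives in the Release folder mentioned above (adjust the path to your own checkout), the search path can be extended at the top of the script, before importing clr:

```python
import sys

# Assumed location of clr.pyd and Python.Runtime.dll -- adjust to your checkout.
PYTHONNET_BIN = r"C:\dev\pythonnet\pythonnet\src\runtime\bin\Release"
if PYTHONNET_BIN not in sys.path:
    sys.path.append(PYTHONNET_BIN)
# after this, `import clr` will locate the module and its assembly
```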

Interfacing Python and .NET

In order to use a private assembly, use the clr.AddReference() function. For example, to use the assembly "Test.dll", call clr.AddReference("Test") to load it.

We refer to the Python for .NET documentation for how to interoperate with .NET from Python.

Preparing a .NET Assembly with GPU Code

We create an F# library project, referencing Alea.CUDA. Make sure that you set the “Copy Local” property of the Alea.CUDA assembly reference to true. The example below provides a simple kernel adding two arrays and a helper class DeviceWorkerHelper, which exposes some module load functions to get around some limitations of Python for .NET with class extension methods.

[sourcecode language="fsharp"]
module Lib.Test

open Alea.CUDA

let a = [| 1.0; 2.0 |]

let pfunct = cuda {
    let! kernel =
        <@ fun (C:DevicePtr<float>) (A:DevicePtr<float>) (B:DevicePtr<float>) ->
            let tid = threadIdx.x
            C.[tid] <- A.[tid] + B.[tid] @>
        |> defineKernelFunc

    return PFunc(fun (m:Module) (A:float[]) (B:float[]) ->
        let n = A.Length
        use A = m.Worker.Malloc(A)
        use B = m.Worker.Malloc(B)
        use C = m.Worker.Malloc(n)
        let lp = LaunchParam(1, n)
        kernel.Launch m lp C.Ptr A.Ptr B.Ptr
        C.ToHost()) }

type DeviceWorkerHelper(worker:DeviceWorker) =
    member this.LoadPModule(f:PFunc<'T>, m:Builder.PTXModule) = worker.LoadPModule(f, m)
    member this.LoadPModule(fm:PFunc<'T> * Builder.PTXModule) = worker.LoadPModule(fm)
    member this.LoadPModule(f:PFunc<'T>, m:Builder.IRModule) = worker.LoadPModule(f, m)
    member this.LoadPModule(fm:PFunc<'T> * Builder.IRModule) = worker.LoadPModule(fm)
    member this.LoadPModule(t:PTemplate<PFunc<'T>>) = worker.LoadPModule(t)

Calling a GPU Kernel from Python

The following Python script shows how to call the kernel from the Test assembly:

[sourcecode language="python"]
import sys
import clr, System

from Alea.CUDA import Engine, Framework
from Lib import Test

worker = Engine.workers.DefaultWorker
print worker.Name
worker = Test.DeviceWorkerHelper(worker)

A = System.Array[System.Double]([1.0, 2.0, 3.0, 4.0])
B = System.Array[System.Double]([1.5, 2.5, 3.5, 4.5])

def test(pfuncm):
    C = pfuncm.Invoke.Invoke(A).Invoke(B)
    for x in C: print x,
    print ""

print "Loading into worker"
pfuncm = worker.LoadPModule(Test.pfunct)

print "Invoking GPU kernel"
test(pfuncm)

Executing the script produces the following output:

Unfortunately this script cannot be executed in the Python Interactive inside Visual Studio, because the Python REPL process exits with a StackOverflowException at the import of Alea.CUDA.
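The double Invoke in the test function mirrors F# currying: a curried two-argument F# function is exposed to .NET callers as nested single-argument function objects. A rough pure-Python model (the class and names below are illustrative only, not the pythonnet API):

```python
# Illustrative model of a curried .NET function object: each application
# returns another object with a single-argument Invoke.
class Curried(object):
    def __init__(self, f):
        self._f = f
    def Invoke(self, x):
        r = self._f(x)
        # wrap intermediate results so they can be Invoked again
        return Curried(r) if callable(r) else r

# a curried elementwise add, applied one argument at a time
add = Curried(lambda a: lambda b: [x + y for x, y in zip(a, b)])
C = add.Invoke([1.0, 2.0, 3.0, 4.0]).Invoke([1.5, 2.5, 3.5, 4.5])
```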


We have shown how to use Alea.cuBase in Python with a suitable modification of Python for .NET. If you just want to do rapid prototyping together with some simple plotting and visualization, we suggest that you also take a look at F# Interactive and the FSharpChart library.


Install Alea.cuBase License on Amazon Elastic Compute Cloud

Alea.cuBase licenses rely on a fingerprint of the running hardware. For an Amazon EC2 cloud machine, the hardware keeps changing. In this case, in order to install an Alea.cuBase license you have to create a Virtual Private Cloud (VPC) and then use an Elastic Network Interface (ENI) in that VPC, which retains a static MAC address. In this article, I will show you how to set up a VPC and start an EC2 instance in that VPC with an ENI attached. With this setup we can generate a meaningful fingerprint, and you can install an Alea.cuBase license.

STEP 1: Create VPC

Log in to your AWS console and click “VPC”. This will bring you to the VPC dashboard. If no VPC has been created yet, there will be a button “Get started creating a VPC” in your dashboard. Just click it, and a VPC configuration selection window will pop up:

It provides four typical network topologies. In this example, I choose the first one, which is the simplest.

Click “Continue” to do the network settings. We need to set a proper availability zone for the subnet, because some zones don’t support GPU instances. In our example we choose “us-east-1a”:

Click “Create VPC”. The system will create a set of objects: 1 VPC, 1 Subnet, 1 Internet Gateway, 1 Network ACL, 2 Route Tables, and 1 Security Group. We keep all default settings, except for the security group.

For the security group, we need to modify it so that it allows the Remote Desktop Protocol (RDP), so that we can connect to the instance. Navigate to “Security Groups” in the VPC console. We have to add one inbound rule for the RDP protocol. Click on the drop-down list to select the RDP protocol, then add this rule and apply the changes as follows:


We have created the VPC. Go back to the main console.

STEP 2: Allocate EIP

In order to connect to your EC2 machine from the outside, we need to allocate one Elastic IP (EIP). Go back to the main console and select “EC2”, which will bring us to the EC2 console. On the left, navigate to “Elastic IPs” and click “Allocate New Address”. This will pop up a dialog where we can choose the purpose of the EIP. Set it to be used in a VPC:


STEP 3: Create ENI

An EIP is just a resource. We need to create an ENI and attach that EIP to it. On the left, navigate to “Network Interfaces” and click “Create Network Interface”. In the creation dialog, set the subnet to the one created with the VPC:


We have created an ENI, but it has only a private IP address. We need to attach the EIP to it. Right-click the ENI that we have created and a context menu will pop up; choose “Associate Address”. A dialog will pop up to let you select which IP you want to associate with it. Select the EIP that we just created:


STEP 4: Start Instance

Now we have created all necessary objects and we can launch an EC2 instance. Navigate to “Instances” and click “Launch Instance”, choose “Classic Wizard”, then select a proper AMI:


On the next page, the instance type should be “CG1 Cluster GPU” in order to get a GPU machine. We need to set it to be launched in the VPC we have created:


On the next page we need to set the network interface to the one we just created:


We also have to select the security group to be the one we just created:


After clicking “Launch”, it will take some time (10 to 30 minutes) to start the machine.

STEP 5: Install license and test

Now, you can connect to the instance via RDP. Install Alea.cuBase and use License Manager to install a license:


It is strongly recommended that you save your license once it is authenticated.

We run an example script to verify the setup:


If no evaluation window pops up, the license is correctly installed. This license is bound to the ENI, which means you should keep the ENI as long as you want to use the license. If you change the hardware configuration, you have to re-authenticate your license.


Alea.cuBase 1.0.401 released

Alea.cuBase 1.0.401 is released and can be downloaded from our web page.

Alternatively, you can get it as a NuGet package.

In this release we changed the product name from Alea.CUDA to Alea.cuBase in order to conform to existing trademarks. The new name also reflects the fact that Alea.cuBase is a base technology on which you can build your own GPU accelerated .NET applications.

The new release also improves the kernel launch time.


Segmented Warp Scan with Packed Head Flags

Here is a simple test implementation of a segmented scan within a warp, with the head flags packed into the data. The idea is based on a segmented warp scan function.
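Before the GPU version, a small CPU reference (plain Python; it assumes lane 0 starts a segment, as in the test data used later) pins down the intended semantics: an exclusive sum within each segment, plus each lane's distance back to its segment head.

```python
# CPU reference for the segmented exclusive scan: bit 31 of each packed
# input marks a segment head; the values live in the low 31 bits.
def segscan_reference(packed):
    outputs, distances = [], []
    start = 0                       # assumes lane 0 is a segment head
    for tid, p in enumerate(packed):
        if p & 0x80000000:          # head flag: a new segment starts here
            start = tid
        distances.append(tid - start)
        # exclusive sum of the values since the segment head
        outputs.append(sum(q & 0x7fffffff for q in packed[start:tid]))
    return outputs, distances

s = lambda x: x | (1 << 31)         # set the head flag, as in the F# test
outs, dists = segscan_reference([s(3), 0, 3, 3, 0, s(1), 2, 0])
```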

Here is the implementation in F#.

[sourcecode language="fsharp"]
open NUnit.Framework
open Alea.Interop.LLVM
open Alea.CUDA
open Alea.CUDA.Extension

// define aliases
module IRB = Alea.CUDA.IRBuilderUtil
module DF = Alea.CUDA.DeviceFunction

// get the worker
let worker = getDefaultWorker()

let bfi(x:int, y:int, bit:int, numBits:int):int = failwith "Device Only!"
let ``bfi [BUILDER]``(ctx:IRB.LLVMFunctionBuilderContext) =
    let args = ctx.LLVMValueArgs                 // arguments as LLVM values
    let i32t = ctx.LLVMHelper.i32_t              // int LLVM type
    let rett = i32t                              // return type
    let argst = [| i32t; i32t; i32t; i32t |]     // argument type list
    let funct = LLVMFunctionTypeEx(rett, argst, 0)
    let funcp = LLVMConstInlineAsm(funct, "bfi.b32 \t$0, $2, $1, $3, $4;", "=r,r,r,r,r", 0, 0)
    IRB.Value(LLVMBuildCallEx(ctx.Builder, funcp, args, ""))

let segscanWarp (inputs:DevicePtr<int>) (outputs:DevicePtr<int>) (distances:DevicePtr<int>) =
    let tid = threadIdx.x
    let packed = inputs.[tid]

    // the start flag is in the high bit
    let flag = 0x80000000 &&& packed

    // get the start flags for each thread in the warp
    let flags = __ballot(flag)

    // mask out the bits above the current thread
    let flags = flags &&& bfi(0, 0xffffffff, 0, tid + 1)

    // find the distance from the current thread to the thread at the start of
    // the segment
    let distance = DF.__clz(flags) + tid - 31

    let shared = shared<int>(Util.WARP_SIZE).Ptr(0).Volatile()

    let x0 = 0x7fffffff &&& packed
    let mutable x = x0
    shared.[tid] <- x

    // perform the parallel scan. Note the conditional if(offset <= distance)
    // replaces the ordinary scan conditional if(offset <= tid)
    for i = 0 to Util.LOG_WARP_SIZE - 1 do
        let offset = 1 <<< i
        if offset <= distance then x <- x + shared.[tid - offset]
        shared.[tid] <- x

    // turn the inclusive scan into an exclusive scan
    x <- x - x0

    outputs.[tid] <- x
    distances.[tid] <- distance

let test() =
    let blockSize = 256
    let numWarps = blockSize / Util.WARP_SIZE
    let s x = x ||| (1 <<< 31)
    let hInputs =
        [| s 3; 0; 3; 3; 0; s 1; 2; 0; 3; 3; 3; 2; 3; 0; 3; 1;
           0; 0; 2; 3; 2; s 3; 1; 0; 2; 1; 2; 1; 1; 0; 1; s 3 |]

    let pfunct = cuda {
        let! segscanWarp = <@ segscanWarp @> |> defineKernelFunc

        return PFunc(fun (m:Module) ->
            let worker = m.Worker
            let segscanWarp = segscanWarp.Apply m
            pcalc {
                let! dInputs = DArray.scatterInBlob worker hInputs
                let! dOutputs = DArray.createInBlob worker hInputs.Length
                let! dDistances = DArray.createInBlob worker hInputs.Length

                do! PCalc.action (fun hint ->
                    let lp = LaunchParam(1, Util.WARP_SIZE) |> hint.ModifyLaunchParam
                    segscanWarp.Launch lp dInputs.Ptr dOutputs.Ptr dDistances.Ptr)

                let! hOutputs = dOutputs.Gather()
                let! hDistances = dDistances.Gather()

                printfn "outputs:"
                for i = 0 to 15 do printf "%2d; " hOutputs.[i]
                printfn ""
                for i = 16 to 31 do printf "%2d; " hOutputs.[i]
                printfn ""

                printfn "distances:"
                for i = 0 to 15 do printf "%2d; " hDistances.[i]
                printfn ""
                for i = 16 to 31 do printf "%2d; " hDistances.[i]
                printfn "" } ) }

    let calc = worker.LoadPModule(pfunct).Invoke
    calc |> PCalc.run
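The inline PTX bfi above maps to the bfi.b32 bit-field-insert instruction; with the operand order used in the asm string, bfi(x, y, bit, numBits) inserts the low numBits bits of y into x at position bit. The kernel uses it only to build a mask of the low tid + 1 bits. A plain-Python model, handy for checking the bit manipulation on the CPU:

```python
# Plain-Python model of the PTX bfi.b32 used above: insert the low num_bits
# bits of y into x, starting at bit position `bit` (32-bit result).
def bfi(x, y, bit, num_bits):
    mask = (((1 << num_bits) - 1) << bit) & 0xffffffff
    return ((x & ~mask) | ((y << bit) & mask)) & 0xffffffff

# the kernel's call bfi(0, 0xffffffff, 0, tid + 1) builds a mask of the
# low tid + 1 bits, here for lane 4
tid = 4
mask = bfi(0, 0xffffffff, 0, tid + 1)
```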


CUDA Scripting in Excel Part I

In this video, we use the Tsunami IDE to do CUDA scripting inside Excel. We show how to code a CUDA kernel, launch it, and modify it.

Here is the test code:

[sourcecode language="fsharp"]
#r @"C:\Users\Xiang\Documents\FCell\Alea.CUDA.dll"
#r @"C:\Users\Xiang\Documents\FCell\FCell.ManagedXll.dll"

namespace Demo

open Alea.CUDA
open FCell.ManagedXll

module UDF =

    let exc (e:exn) = XlScalar(XlString(e.Message))

    let private worker = Engine.workers.DefaultWorker
    let DeviceName() = worker.Device.Name
    let DeviceMemory() = worker.Device.Memory |> int

    let private pfunct = cuda {
        let! kernel =
            <@ fun (A:DevicePtr<float>) (B:DevicePtr<float>) (C:DevicePtr<float>) ->
                let tid = threadIdx.x
                C.[tid] <- A.[tid] + B.[tid] @>
            |> defineKernelFunc

        return PFunc(fun (m:Module) (A:float[]) (B:float[]) ->
            use A = m.Worker.Malloc(A)
            use B = m.Worker.Malloc(B)
            use C = m.Worker.Malloc(A.Length)
            let lp = LaunchParam(1, A.Length)
            kernel.Launch m lp A.Ptr B.Ptr C.Ptr
            C.ToHost()) }

    let private pfuncm = worker.LoadPModule(pfunct)

    let Test value idx =
        let A = [| 1.1; 2.2; 3.3 |]
        let B = Array.init A.Length (fun _ -> value)
        let C = pfuncm.Invoke A B