Barracuda: An OpenCL Library for Ruby

By Loren Segal on August 08, 2009


Barracuda is my latest experiment. I wanted to toy around with the new features of Snow Leopard, and as far as I’ve seen, there are no CUDA or OpenCL wrapper libraries out there for Ruby.

OpenCL, if you don’t know, is an “open” version of CUDA, which is NVIDIA’s proprietary architecture for GPGPU programming. Okay, acronyms out of the way, this means that OpenCL is a way to run small but heavy computations in parallel on your GPU’s hundred or so cores. There’s a great demo of why this is cool at MacResearch (skip to the end of the video).

I should point out that while the library is called Barracuda (did you get it?), there’s currently no CUDA support. I want to add support for CUDA down the road after I figure out what’s involved, though.

Show and Tell Time

Anyway, long story short, I wrote a very basic wrapper for OpenCL. It currently only supports signed integers and floats, but I hope to add more functionality soon. I know you’re all dying for a demo, so here’s a benchmark of a simple integer-to-float computation in Ruby versus the same computation using Barracuda:

require 'barracuda'
require 'benchmark'

include Barracuda

prog = Program.new <<-eof
  __kernel void sum(__global float *out, __global int *in, int total) {
    int i = get_global_id(0);
    // float literals: double precision is an optional extension in OpenCL
    if (i < total) out[i] = ((float)in[i] + 0.5f) / 3.8f + 2.0f;
  }
eof

arr = (1..3333333).to_a
input = Buffer.new(arr)
output = OutputBuffer.new(:float, arr.size)
 
Benchmark.bmbm do |x|
  x.report("cpu") { arr.map {|n| (n.to_f + 0.5) / 3.8 + 2.0 } }
  x.report("gpu") { prog.sum(output, input, arr.size) }
end
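
If the kernel syntax is unfamiliar, here’s a rough plain-Ruby sketch of what each GPU work-item does. It doesn’t touch Barracuda or OpenCL at all: `sum_kernel` and the tiny 8-element input are made up for illustration. The two extra "work-items" assume that the launch size gets rounded up past the array length, which is what the `i < total` guard protects against. That is an assumption about how launches are typically sized, not something confirmed for Barracuda.

```ruby
# Plain-Ruby emulation of the kernel above; no GPU or Barracuda involved.
# `sum_kernel` is a hypothetical stand-in for a single OpenCL work-item.
def sum_kernel(global_id, out, input, total)
  i = global_id # stands in for get_global_id(0)
  out[i] = (input[i].to_f + 0.5) / 3.8 + 2.0 if i < total
end

input  = (1..8).to_a
output = Array.new(input.size)

# Launch two more work-items than there are elements (assumed rounding);
# the guard inside sum_kernel makes the extras do nothing.
(input.size + 2).times { |id| sum_kernel(id, output, input, input.size) }

puts output.inspect
```

On the GPU those loop iterations would run side by side instead of one after another, which is where the speedup comes from.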

The way OpenCL works is you compile a C-like program for the GPU, then run it there with your input data and read back the output; it’s a standard Pipes and Filters architecture. The program is compiled through LLVM, so it doesn’t require GCC, which makes this different from running RubyInline. We’re also running these programs on something like 200 cores, so there’s that part too. Anyway, “where are the benchmark results?” you must be yelling right now, so here:

Rehearsal ---------------------------------------
cpu   5.510000   0.250000   5.760000 (  5.767632)
gpu   0.630000   0.260000   0.890000 (  0.930966)
------------------------------ total: 6.650000sec

          user     system      total        real
cpu   3.860000   0.020000   3.880000 (  3.931644)
gpu   0.310000   0.040000   0.350000 (  0.366625)

Okay, it’s only about a 10x speed increase. That’s not too impressive. Keep in mind though, (x + 0.5) / 3.8 + 2.0 is a freaking simple computation. I dare someone to hook this up to some Hadoop map-reduce and see some real world results, because they’d probably be much more impressive.
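
For the curious, the "about 10x" figure falls out of the wall-clock ("real") column above. A quick sketch of the arithmetic, using only the numbers from that output (the `timings` hash and its key names are just for illustration):

```ruby
# Wall-clock ("real") seconds copied from the benchmark output above.
timings = {
  rehearsal: { cpu: 5.767632, gpu: 0.930966 },
  measured:  { cpu: 3.931644, gpu: 0.366625 }
}

# Speedup = CPU time divided by GPU time for each pass.
speedups = timings.transform_values { |t| t[:cpu] / t[:gpu] }

speedups.each { |run, factor| puts format("%-9s %.1fx", run, factor) }
# rehearsal 6.2x
# measured  10.7x
```

The second pass is the one to quote, since the rehearsal exists to warm things up before the real measurement.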

Wanna Try It?

The catch: you can install Barracuda if you’re on OS X 10.6 (Snow Leopard) or have a way to get OpenCL on your machine and hack up the Makefiles. I’d like to add support for other environments, but I just finished this build.

If you have the right stuff, you can grab it right from RubyForge:

sudo gem install barracuda

Or visit the git repo: http://github.com/lsegal/barracuda

Questions? Comments? Follow me on Twitter (@lsegal) or email me.