This vignette summarizes the most important considerations for writing efficient programs with {anvl}. Some of the topics are covered in more depth elsewhere, but here we gather everything in one place, if sometimes only briefly.
- Eager vs. JIT – comparison of the two execution modes.
- CUDA – running on a GPU instead of CPU (Linux x86 / WSL2 only).
- Data types – using f32 (the default) instead of f64.
- Compilation cost – padding inputs so more calls hit the same cache entry.
- Asynchronous execution – how to keep the device (e.g. a GPU) busy while R prepares the next call.
- Memory – avoiding unnecessary copies via donation and jit-internal optimization.
Eager vs. JIT
In eager mode we don’t compile the whole program; instead, each primitive operation runs as its own tiny XLA program. This is more convenient for debugging, as you can inspect intermediate results and step through the program with the R debugger, but it comes at a performance cost. Specifically:
- More per-primitive call overhead (such as launching a kernel on the GPU). When jit()ting the whole function, you only pay the per-call overhead once.
- The compiler has less context for optimization, as it compiles many small programs individually instead of one large program.
To get the best performance, you will therefore usually want to
jit() the whole function. There might be some exceptions,
as compiling one large function can sometimes be more costly than
compiling several smaller ones, even though splitting it up gives the
compiler less room for joint optimization.
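As a minimal sketch of the difference (nv_reduce_sum and its dims argument are used here as in the padding example further down):
f <- function(x) nv_reduce_sum(x + x, dims = 1L)
f_jit <- jit(f)

x <- nv_array(rnorm(1e6), dtype = "f32")
f(x)      # eager: each primitive runs as its own small XLA program
f_jit(x)  # jitted: the whole body is compiled once, then reused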
CUDA
Provided your computation is large enough to be worth shipping over to the GPU, running on a CUDA GPU is the single biggest speed-up {anvl} offers – typically 10x to 100x over CPU on linear algebra and large elementwise work.
GPU support currently only works on Linux (amd64/x86-64) or via WSL2
on Windows. See the GPU
Installation section of the installation vignette for setup; once
that’s done, any AnvlArray constructor and
jit() accept device = "cuda":
library(anvl)
x <- nv_array(1:8, dtype = "f32", device = "cuda")
y <- nv_array(9:16, dtype = "f32", device = "cuda")
x + y
# Or pin a jitted function to a specific GPU:
f_cuda <- jit(function(x, y) x + y, device = "cuda:0")
A few things to keep in mind when moving to GPU:
- Copying data between R and the GPU is slow. Each transfer has a fixed overhead plus a per-byte cost. Once you’ve moved an array onto the GPU, keep it there; avoid as_array() (which copies the value back into R) inside loops you want to run fast (see the sketch below and Asynchronous execution further down).
- Each call to the GPU has more overhead than a CPU call. Small operations on small arrays can actually be slower on GPU than on CPU because the per-call overhead dominates the actual work.
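To make the first point concrete, here is a minimal sketch (assuming elementwise + on CUDA arrays, as shown above):
xs <- lapply(1:100, function(i) nv_array(rnorm(1e5), dtype = "f32", device = "cuda"))

# Bad: as_array() inside the loop copies back to R on every iteration.
total_bad <- as_array(xs[[1]])
for (x in xs[-1]) total_bad <- total_bad + as_array(x)

# Better: keep everything on the GPU and copy back once at the end.
total <- xs[[1]]
for (x in xs[-1]) total <- total + x
as_array(total)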
Data types
The data type of an AnvlArray (f32,
f64, i32, i64, …) affects both
how much memory it uses and how fast operations on it run. The default
is to use f32 for R doubles and
i32 for R integers. Currently, we don’t
support other floating-point data types such as f16 or
bf16. There is of course a trade-off between efficiency and
precision here. Speedups from using f32 over
f64 also depend on the specific hardware. Commonly,
f32 is the default data type for floating-point operations
in GPU-accelerated frameworks, so unless you have a specific reason, we
recommend sticking with the defaults.
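If a computation genuinely needs double precision, it can be requested explicitly at construction (a small sketch using the nv_array() constructor shown above):
x64 <- nv_array(runif(8), dtype = "f64")  # twice the memory of the f32 default
dtype(x64)                                # cheap to inspect; no sync required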
Compilation cost
Compiling an {anvl} function can take anywhere from milliseconds to seconds (or even minutes) depending on the size of the function. The compilation cache reuses the compiled function across calls with matching input types – see The compilation cache in the JIT Deep Dive for the details. What follows is the complementary trick of reshaping inputs so that more calls land on the same cache entry.
Padding inputs to avoid recompilation
Every new combination of input shapes triggers a fresh recompile. If the input shapes vary often, this can considerably degrade performance and you might spend more time compiling than computing.
A common workaround is to round each varying dimension up to a fixed bucket size and pad the data to that size. All calls then share one compiled function, at the cost of doing some wasted work on the padded positions.
For example, suppose we want to sum a vector whose length varies between calls:
sum_jit <- jit(function(x) {
cat("compiling for length ", shape(x), "\n", sep = "")
nv_reduce_sum(x, dims = 1L)
})
for (n in c(7, 11, 13, 17)) {
sum_jit(nv_array(seq_len(n), dtype = "f32"))
}
#> compiling for length 7
#> compiling for length 11
#> compiling for length 13
#> compiling for length 17
Each new length triggers a recompile. By padding each input up to the next multiple of 32 with zeros, all four calls share one compiled function:
pad_to_multiple <- function(x, m) {
n <- length(x)
c(x, rep(0, ceiling(n / m) * m - n))
}
for (n in c(7, 11, 13, 17)) {
sum_jit(nv_array(pad_to_multiple(seq_len(n), 32), dtype = "f32"))
}
#> compiling for length 32
Here, we only compile once for input size 32 and then reuse the
compiled executable. One common strategy is to split up the varying
lengths into buckets of size 2^n to balance recompilation
with wasted computation. The example above only works because the padded
zeros are the neutral value of the performed computation
(addition). For computations where the pad value is not neutral –
e.g. the mean, where the padded positions would skew the average – the
padded entries have to be masked out explicitly. See the Static Shape Restriction vignette for the
masking patterns and a table of neutral values for common
reductions.
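In the meantime, a padded mean can be sketched by passing a 0/1 mask alongside the data. This assumes elementwise * and / work on AnvlArrays the way + does:
mean_jit <- jit(function(x, mask) {
  nv_reduce_sum(x * mask, dims = 1L) / nv_reduce_sum(mask, dims = 1L)
})

n <- 7
x <- nv_array(pad_to_multiple(seq_len(n), 32), dtype = "f32")
mask <- nv_array(pad_to_multiple(rep(1, n), 32), dtype = "f32")
mean_jit(x, mask)  # mean of the 7 real values; padded zeros drop out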
Asynchronous execution
One surprising fact about {anvl} is that calls into compiled functions (as long as we have a cache hit) return almost immediately. In R terms, what is actually returned behaves like a promise: the computation keeps running in the background while the R interpreter can do something else. This is especially important when working with a GPU, because we want to use both the CPU and the GPU simultaneously. But it also matters on CPU, as XLA runs many threads in parallel while R itself is single-threaded. You also need to account for this when benchmarking {anvl} functions: unless you force the result, you are only timing the dispatch, not the computation.
The natural question is how we avoid reading wrong results when
functions always return immediately, before the computation has
finished. This works out because {anvl} will await the computation
whenever it actually needs to read the data. This happens, for example,
when we print the array or convert it back to R via
as_array(). However, some operations such as accessing
dtype(), shape() or device() do
not require awaiting the result.
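This awaiting behavior is also what makes naive benchmarks misleading: to time the actual computation, force the result before the timer stops. A small sketch (reusing the constructors above; system.time() is base R):
f_jit <- jit(function(x) nv_reduce_sum(x + x, dims = 1L))
x <- nv_array(rnorm(1e6), dtype = "f32")
f_jit(x)  # warm-up call so compilation is not timed below

# Misleading: the call returns almost immediately, so this times little but dispatch.
system.time(f_jit(x))

# Meaningful: as_array() awaits the computation before the timer stops.
system.time(as_array(f_jit(x)))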
One classic mistake that can degrade GPU performance is to always
force this synchronization. For example, in a typical ML training loop,
we have some batch preparation that runs on the CPU and then some
expensive computation that runs on the GPU. Ideally, the CPU and the GPU
run simultaneously. This is what will happen when you write code like
the one below. Because the GPU call returns immediately, the R loop will
continue preparing the next batch while the GPU computation from the
previous iteration is still running. In the next iteration we will call
into the step() function again and the computation will be
scheduled to run on the GPU. Note that the GPU computation from the
previous step() call might not yet be done at this point.
This is exactly the situation we want to be in, as we don’t want the GPU
to idle.
for (i in seq_len(n_steps)) {
batch <- prepare_batch(i) # CPU-work
weights <- step(batch$X, weights, batch$y) # GPU-work
}
One way to make this slower is to always print the result after each
step():
# Bad: blocks every step to print the loss.
for (i in seq_len(n_steps)) {
batch <- prepare_batch(i) # CPU-work
weights <- step(batch$X, weights, batch$y) # GPU-work
print(loss(batch$X, weights, batch$y))
}
Because print() requires the actual data, we now have to
wait for the GPU to finish its computation so the result can be sent
back to the CPU. But this means that while the next batch is being
prepared, the GPU will be idle. Solutions include logging less often, or
printing the loss from the previous iteration.
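The latter can be sketched as follows: schedule the current loss without forcing it, and only print() the value from the previous iteration, which has had a full step to finish:
prev_loss <- NULL
for (i in seq_len(n_steps)) {
  batch <- prepare_batch(i)                     # CPU-work
  weights <- step(batch$X, weights, batch$y)    # GPU-work, returns immediately
  if (!is.null(prev_loss)) print(prev_loss)     # likely finished; little or no stall
  prev_loss <- loss(batch$X, weights, batch$y)  # scheduled, not awaited yet
}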
Memory
Value semantics and the fact that subset-assignment always copies
were introduced in the Get Started
vignette. This section covers the one extra lever jit()
gives you: telling XLA it can reuse an input buffer for the output.
Donation
Inside a jitted function XLA avoids unnecessary copies on its own.
Across the jit boundary, though, it cannot tell whether the R caller
still needs the input arrays, so by default it has to leave them intact.
If you know a particular input will not be used after the call, you can
pass its name to donate so XLA is free to reuse its memory
for the output:
update <- jit(function(x, delta) x + delta, donate = "x")
x <- nv_array(c(1, 2, 3), dtype = "f32")
x <- update(x, nv_array(c(0.1, 0.1, 0.1), dtype = "f32"))
x
#> AnvlArray
#> 1.1000
#> 2.1000
#> 3.1000
#> [ CPUf32{3} ]
One common scenario for this pattern is weight updates in training loops.
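A sketch of that scenario, sticking to the update() pattern above (the delta here is a stand-in for a real gradient step):
sgd_step <- jit(function(w, delta) w + delta, donate = "w")

w <- nv_array(rnorm(1000), dtype = "f32")
for (i in seq_len(100)) {
  delta <- nv_array(rnorm(1000) * -0.01, dtype = "f32")  # stand-in for -lr * gradient
  w <- sgd_step(w, delta)  # the old w is never used again, so donating is safe
}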
Eager-mode subset-assignment always copies
In plain R, y[i] <- val may happen in place when
y has only one reference. Unfortunately, this is not
possible in {anvl} and evaluating such calls in eager mode (outside of
jit()) will always produce a copy. This is because we are
essentially calling into a jit()ted function and we can’t
always assume it’s okay to donate. In the future, we might add an option
to do this in place.
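To illustrate the copy semantics (assuming subset-assignment on AnvlArrays as described above):
x <- nv_array(c(1, 2, 3), dtype = "f32")
y <- x
y[1] <- 0    # eager subset-assignment: y becomes a fresh copy
as_array(x)  # still 1 2 3; the shared buffer was left intact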