This vignette lists behaviours of {anvl} that differ from base R and are easy to trip over.
Row-major vs column-major ordering
R stores matrices and arrays in column-major order, while
{anvl} (following XLA) uses row-major order. For most
operations, this is an internal implementation detail that does not
change the semantics. However, for reshaping operations such as
nv_flatten() there is a difference.
Consider the 2x2 matrix below:
m <- matrix(1:4, nrow = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
In base R, as.vector() flattens it column-by-column, so
we get 1, 2, 3, 4:
as.vector(m)
## [1] 1 2 3 4
In {anvl}, reshaping to a length-4 vector traverses the data
row-by-row, so we get 1, 3, 2, 4:
nv_flatten(m)
## AnvlArray
## 1
## 3
## 2
## 4
## [ CPUi32?{4} ]
If you need column-major flattening in {anvl}, transpose first:
nv_flatten(t(m))
## AnvlArray
## 1
## 2
## 3
## 4
## [ CPUi32?{4} ]
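To preview in base R what a row-major flatten returns, the dimensions can be reversed before flattening. The sketch below uses only base R; `row_major_flatten()` is a hypothetical helper, not part of {anvl}. For a matrix it reduces to `as.vector(t(m))`; for higher-rank arrays it assumes that row-major traversal means the last index varies fastest.

```r
# Hypothetical helper (not part of {anvl}): flatten a base R array in
# row-major order by reversing its dimensions and then flattening
# column-major, which is what as.vector() does.
row_major_flatten <- function(x) {
  stopifnot(!is.null(dim(x)))
  as.vector(aperm(x, rev(seq_along(dim(x)))))
}

m <- matrix(1:4, nrow = 2)
row_major_flatten(m)
## [1] 1 3 2 4

# Rank-3 example: the last index varies fastest.
a <- array(1:24, dim = c(2, 3, 4))
head(row_major_flatten(a))
## [1]  1  7 13 19  3  9
```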
No recycling
Base R recycles the shorter operand when two vectors of different lengths are combined elementwise:
## [1] 2 4 4 6
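The call that produced this output is not shown. One pair of operands consistent with it is given below; the inputs are an assumption, chosen to match the shapes (4) and (2) that appear in the error message later in this section.

```r
# Assumed inputs: a length-4 and a length-2 integer vector.
x <- 1:4
y <- 1:2

# R recycles y to c(1, 2, 1, 2) before adding.
x + y
## [1] 2 4 4 6
```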
{anvl} only auto-broadcasts scalars (operands with shape
integer()). Adding a scalar to an array works as you would
expect:
nv_array(1:4) + 10L
## AnvlArray
## 11
## 12
## 13
## 14
## [ CPUi32{4} ]
But combining two non-scalar arrays of different shapes errors, even when one shape is a “tile” of the other:
## Error in `nv_broadcast_scalars()`:
## ! All non-scalar arrays must have the same shape, but got (4), (2). Use
## `nv_broadcast_arrays()` for general broadcasting.
When two non-scalar arrays differ only by size-1 dimensions
(numpy-style broadcasting, e.g. shape (2, 3) and
(1, 3)), use nv_broadcast_arrays() to align
them explicitly first:
## [1] 2 3
## [1] 1 3
xs <- nv_broadcast_arrays(a, b)
lapply(xs, shape)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 2 3
xs[[1]] + xs[[2]]
## AnvlArray
## 11 23 35
## 12 24 36
## [ CPUf32{2,3} ]
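The shape rule behind numpy-style broadcasting can be sketched in base R. `broadcast_shape()` below is a hypothetical helper, not an {anvl} function. It assumes numpy's convention of aligning shapes from the trailing dimension and treating missing leading dimensions as size 1; whether {anvl} pads ranks in the same way is not stated here.

```r
# Hypothetical helper: compute the numpy-style broadcast shape of two
# shapes, or fail if they are not compatible.
broadcast_shape <- function(s1, s2) {
  n <- max(length(s1), length(s2))
  # Left-pad the shorter shape with 1s so both have the same rank.
  s1 <- c(rep(1L, n - length(s1)), as.integer(s1))
  s2 <- c(rep(1L, n - length(s2)), as.integer(s2))
  # Each pair of dimensions must be equal, or one of them must be 1.
  ok <- s1 == s2 | s1 == 1L | s2 == 1L
  if (!all(ok)) {
    stop("shapes are not broadcast-compatible")
  }
  pmax(s1, s2)
}

broadcast_shape(c(2, 3), c(1, 3))
## [1] 2 3

# Shapes (4) and (2) are rejected: the sizes differ and neither is 1.
tryCatch(broadcast_shape(4, 2), error = function(e) conditionMessage(e))
## [1] "shapes are not broadcast-compatible"
```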
Note that even nv_broadcast_arrays() cannot replicate
R’s recycling for shapes like (4) and (2) –
the shapes must be broadcast-compatible in the numpy sense.
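If R-style recycling is the intended behaviour, one option is to tile the shorter operand in base R before creating any arrays. A minimal sketch using only base R follows; `recycle_to()` is a hypothetical helper and not part of {anvl}.

```r
# Hypothetical helper: repeat x until it has length n, refusing partial
# recycling (base R would only warn in that case).
recycle_to <- function(x, n) {
  stopifnot(length(x) > 0)
  if (n %% length(x) != 0) {
    stop("length ", n, " is not a multiple of ", length(x))
  }
  rep_len(x, n)
}

long <- 1:4
short <- 1:2

tiled <- recycle_to(short, length(long))
tiled
## [1] 1 2 1 2

long + tiled
## [1] 2 4 4 6
```

Both operands now have the same length, so arrays created from them have matching shapes.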
No NAs
R has a dedicated missing-value marker (NA) for every
atomic type. {anvl} arrays do not – there is no representation of
“missing” at the XLA level, only NaN for floating point
numbers. When you convert R values containing NA into an
AnvlArray, the NAs are silently turned into
NaNs.
nv_array(NA_real_)
## AnvlArray
## nan
## [ CPUf32{1} ]
## AnvlArray
## 1
## nan
## 3
## [ CPUf32{3} ]
Round-tripping back to R is not guaranteed to restore
NA; the result can be NaN instead:
## [1] 1 NaN 3
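Whether this matters depends on how downstream code tests for missingness. In base R, `is.na()` is TRUE for both NA and NaN, while `is.nan()` is TRUE only for NaN. The base R sketch below shows the difference on two hand-written vectors; no {anvl} calls are involved.

```r
x <- c(1, NA, 3)   # what went in
y <- c(1, NaN, 3)  # what may come back

# is.na() does not distinguish the two ...
is.na(x)
## [1] FALSE  TRUE FALSE
is.na(y)
## [1] FALSE  TRUE FALSE

# ... but is.nan() does.
is.nan(x)
## [1] FALSE FALSE FALSE
is.nan(y)
## [1] FALSE  TRUE FALSE
```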
Other data types fare worse. R represents NA_integer_ as the smallest 32-bit integer (-2147483648), so on the {anvl} side the missing value becomes an ordinary number:
nv_scalar(NA_integer_)
## AnvlArray
## -2.1475e+09
## [ CPUi32{} ]
However, when you convert it back, you get a missing value again:
as.integer(nv_scalar(NA_integer_))
## [1] NA
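The round trip works because R's integer type reserves the bit pattern of -2147483648 for NA_integer_, which leaves -2147483647 as the smallest usable integer. This can be checked in base R without {anvl}:

```r
.Machine$integer.max
## [1] 2147483647

# The smallest representable R integer is -integer.max ...
-.Machine$integer.max
## [1] -2147483647

# ... and going one below overflows to NA (R also emits a warning).
suppressWarnings(-.Machine$integer.max - 1L)
## [1] NA

# As a double, -2147483648 is an ordinary, non-missing number.
is.na(-2147483648)
## [1] FALSE
```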
When creating logicals, NA will be interpreted as
TRUE:
nv_scalar(NA)
## AnvlArray
## 1
## [ CPUbool{} ]
as.logical(nv_scalar(NA))
## [1] TRUE
To avoid these pitfalls, array creators such as
nv_array() have a check argument. It is FALSE by default, because the check
needs to scan the complete data.
## Error in `nv_array()`:
## ! Input `data` contains 1 "NA" value, which has no representation at the
## XLA level.
## ℹ Replace or drop missing values before transferring, or set `check = FALSE` to
## skip this check.
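A pre-check on the R side is also possible before any data is transferred. The sketch below uses only base R; `assert_no_na()` is a hypothetical helper and not part of {anvl}. Because `is.na()` is also TRUE for NaN, the helper exempts NaN in double vectors, which does have a representation at the XLA level.

```r
# Hypothetical helper: stop if x contains a true NA (NaN is allowed).
assert_no_na <- function(x) {
  # Only doubles get the NaN exemption; complex vectors are not
  # handled specially in this sketch.
  missing <- if (is.double(x)) is.na(x) & !is.nan(x) else is.na(x)
  if (any(missing)) {
    stop("input contains ", sum(missing), " NA value(s)")
  }
  invisible(x)
}

# Passes silently: NaN is not treated as missing.
assert_no_na(c(1, NaN, 3))

# Fails: an integer vector with one NA.
tryCatch(
  assert_no_na(c(1L, NA, 3L)),
  error = function(e) conditionMessage(e)
)
## [1] "input contains 1 NA value(s)"
```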
The same flag is available for converters like
as_array():
## Error in `tengen::as_array()`:
## ! Materialized <i32> buffer contains a value that R cannot distinguish
## from "NA".
## ℹ "i32" reserves the bit pattern "-2147483648" (`INT_MIN`); "i64" reserves
## "-9223372036854775808" (`INT64_MIN`).
## ℹ Set `check = FALSE` to skip this check.
No unsigned integers
R’s integer type is signed 32-bit (range
-2147483648 to 2147483647). {anvl} also
exposes unsigned integer dtypes (ui8, ui16,
ui32, ui64) backed by XLA, but R has no native
counterpart. For values that fit into R’s signed integer range, the
round-trip works as expected:
## [1] 0 200 255
Because the full ui32 range does not fit into R’s native integer type,
ui32 arrays are converted to bit64::integer64:
## integer64
## [1] 2147483648
ui64 arrays are also converted to
integer64, which is signed and therefore cannot hold values of 2^63 or above.
Such values overflow, but the overflow can be detected via the check
flag:
big <- nv_array(0L, dtype = "ui64") - 1L
big
## AnvlArray
## 1.8447e+19
## [ CPUui64{1} ]
as_array(big)
## integer64
## [1] -1
as_array(big, check = TRUE)
## Error in `tengen::as_array()`:
## ! Materialized <ui64> buffer contains a value `>= 2^63` that wrapped
## through R's signed <integer64>.
## ℹ Exactly `2^63` becomes `NA_integer64_`; larger values become negative
## <integer64>.
## ℹ Set `check = FALSE` to skip this check.
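The wrap-around is the usual two's-complement reinterpretation. An 8-bit analogue in base R makes the pattern visible with small numbers. `as_signed8()` is a hypothetical helper for illustration only and does not reflect how {anvl} performs the conversion internally.

```r
# Reinterpret an unsigned 8-bit value (0..255) as signed 8-bit (-128..127).
as_signed8 <- function(u) {
  stopifnot(all(u >= 0 & u <= 255))
  ifelse(u >= 128, u - 256, u)
}

as_signed8(c(0, 127, 128, 255))
## [1]    0  127 -128   -1
```

The largest unsigned value maps to -1, just as the largest ui64 value does above. The midpoint 2^7 maps to the most negative signed value, which in the integer64 case is the bit pattern reserved for NA.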