LibNC is a C library for tensor manipulation. It supports automatic differentiation and can be used to implement machine learning models such as LSTM and Transformers. It has the following features:
- The library is provided as a DLL for Linux or Windows. It has a C API, so it is easily usable from any application.
- LibNC requires an x86 CPU with AVX2 support.
- CUDA support is currently available only for Linux. CUDA version 11.x must be installed. Only Ampere GPUs are currently supported.
nctest.c provides simple examples and auto differentiation testing code. ncspeed.c and matmul_test.c are benchmarks. dump_coefs.c documents the parameter dumps.
Larger programs using it are NNCP (LSTM and Transformer models for lossless text compression) and GPT2TC (GPT-2 implementation).
The API is defined in the libnc.h file.
NCContext represents an instance of the library. There is usually one per project. It is created with nc_context_init.

NCTensor represents a tensor (multi-dimensional array). It may be created with nc_new_tensor. Each tensor references a tensor buffer (NCTensorBuffer) which contains its raw data (array of bytes). NCTensorBuffer objects reside on a compute device (e.g. CPU or GPU) represented by NCDevice.
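A minimal sketch of how these objects fit together is shown below. The nb_threads argument of nc_context_init and the helpers nc_new_cpu_device() and nc_context_end() are assumptions made for illustration; check libnc.h for the exact prototypes.

#include "libnc.h"

int main(void)
{
    /* One NCContext per project (assumed single integer argument). */
    NCContext *ctx = nc_context_init(1);
    /* Assumed constructor for the CPU compute device. */
    NCDevice *dev = nc_new_cpu_device(ctx);

    /* 2D float32 tensor; dimensions are given smallest first (d0, d1). */
    NCTensor *x = nc_new_tensor_2d(dev, NC_TYPE_F32, 3, 2);

    nc_free_tensor(x);
    /* Assumed cleanup call. */
    nc_context_end(ctx);
    return 0;
}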
NCTensor and NCTensorBuffer objects are reference counted. By default, each function consumes (i.e. decrements) its arguments and returns a new live object. const function parameters indicate that the object is not consumed. Use nc_free_tensor() (resp. nc_dup_tensor()) to decrement (resp. increment) the reference count of a tensor.
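For example, since most functions consume their tensor arguments, an extra reference must be taken when a tensor is needed again afterwards. This is only a sketch; nc_mul() is assumed here to take two non-const arguments.

/* y = x * x: nc_mul() consumes both arguments, so duplicate x
   to keep it alive for later use. */
NCTensor *y = nc_mul(nc_dup_tensor(x), nc_dup_tensor(x));
/* ... use x and y ... */
nc_free_tensor(y);
nc_free_tensor(x);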
The operands of most operations must reside on the same device. Tensors can be moved between devices with nc_tensor_to_device(). When a tensor is on the CPU device, a pointer to its raw data can be obtained with nc_tensor_get_ptr().
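A sketch of moving a tensor to the CPU and accessing its data follows. The argument lists of nc_tensor_to_device() and nc_tensor_get_ptr() shown here are assumptions; consult libnc.h for the real prototypes.

/* nc_tensor_to_device() consumes its tensor argument and returns
   a reference to the tensor on the target device. */
x = nc_tensor_to_device(x, cpu_dev);
/* Raw data access is only valid while the tensor is on the CPU device. */
float *p = (float *)nc_tensor_get_ptr(x);   /* assumed argument list */
p[0] = 1.0f;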
Unlike PyTorch, tensor operations do not perform automatic broadcasting. However, for convenience, nc_add() and nc_mul() broadcast their second argument in some common cases.
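One plausible example of such a broadcast is adding a bias vector to a matrix. The shape combination shown is an assumption, as is the 1D constructor nc_new_tensor_1d() (analogous to nc_new_tensor_2d()).

/* Matrix of shape (d0 = 8, d1 = 4) and bias vector of shape (8). */
NCTensor *m = nc_new_tensor_2d(dev, NC_TYPE_F32, 8, 4);
NCTensor *b = nc_new_tensor_1d(dev, NC_TYPE_F32, 8);   /* assumed 1D constructor */
/* nc_add() broadcasts its second argument, e.g. adding the bias vector
   to each of the d1 slices of the matrix. Both arguments are consumed. */
NCTensor *y = nc_add(m, b);
nc_free_tensor(y);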
In a newly created tensor the elements are contiguous in memory. The offset of an element [a1, a0] in a tensor of shape (d1, d0) is given by (a1 * d0 + a0).
LibNC functions enumerate shapes using the smallest dimension first, i.e. d0 first, then d1:
nc_new_tensor_2d(device, NC_TYPE_F32, d0, d1);
Matrices are stored in the column-first representation, e.g. the matrix:
[ 1 3 5 ]
[ 2 4 6 ]
of 2 rows and 3 columns is represented as a tensor of shape (3, 2). Its memory representation is:
[ 1, 2, 3, 4, 5, 6 ]
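To make the layout concrete, here is the offset computation for the matrix above:

/* Shape (d1, d0) = (3, 2): d0 = 2 rows, d1 = 3 columns, column-first storage. */
int d0 = 2;
/* Element [a1, a0] = [2, 1] (third column, second row) has value 6. */
int a1 = 2, a0 = 1;
size_t offset = (size_t)a1 * d0 + a0;   /* 2 * 2 + 1 = 5 -> data[5] == 6 */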
Similarly to PyTorch, LibNC dynamically builds a computation graph. This graph is used to compute gradients with nc_backward(). More precisely, each NCTensor may have a reference to an NCNode object representing a computation graph node. Operations applied to tensors having an associated node return a tensor associated with a new node.
Tensors newly created by the user do not have an associated node. nc_set_param adds a user-defined node to a tensor. It is used to create function parameters. nc_backward() then calls a callback for each parameter with the computed gradient.

Higher-level APIs such as nc_new_param() are normally used to create parameters. LibNC provides built-in optimizers such as ADAM, but the user is free to provide their own. nc_backward() can also be used to compute higher-order derivatives (e.g. Hessian-vector products; see the example in nctest.c).
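A rough sketch of this flow is given below. The prototypes are heavily assumed: nc_set_param() is shown taking a gradient callback and nc_backward() a single loss tensor, which may differ from the actual declarations in libnc.h.

/* Hypothetical gradient callback: nc_backward() is described as calling
   a callback for each parameter with its gradient (prototype assumed). */
static void grad_cb(void *opaque, NCTensor *grad)
{
    /* An optimizer step (e.g. ADAM) would update the parameter here. */
    nc_free_tensor(grad);
}

static void train_step(NCDevice *dev)
{
    /* Mark a user-created tensor as a function parameter so that it gets
       an associated graph node (assumed argument list). */
    NCTensor *w = nc_new_tensor_2d(dev, NC_TYPE_F32, 4, 4);
    nc_set_param(w, grad_cb, NULL);

    /* Forward pass: operations on 'w' now record graph nodes. */
    NCTensor *loss = nc_mul(nc_dup_tensor(w), nc_dup_tensor(w));

    /* Backward pass: computes the gradients and calls grad_cb() for each
       parameter (assumed argument list). */
    nc_backward(loss, NULL);

    nc_free_tensor(loss);
    nc_free_tensor(w);
}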
As LibNC is used for lossless data compression in NNCP, fully deterministic behavior is required: the same computation on the same system returns the same result on every run. This is guaranteed for both the CPU and GPU backends.
The results do not change when using a different number of threads, CPU brand, or OS. Hence the code does not rely on CPU floating point instructions having implementation-defined behavior and does not use the transcendental functions of the C library.
However, in the current implementation, the CPU and GPU backends do not give exactly the same results, mainly due to the use of the NVIDIA Tensor Cores, which have a device-dependent rounding behavior.
LibNC performs various optimizations on the compute graph, such as matrix product factorization.
Functions are provided to manually optimize the graph in the case of online learning. In this case, evaluation is done sequentially but the model parameters are still updated by training. In the training phase it is beneficial to combine all the steps of the sequential evaluation to make better use of the compute device's parallelism. The function nc_concat_optimization() is provided for this purpose.
NVIDIA CUDA support is optional and fully contained in the libnc_cuda DLL. This DLL depends on the CUDA and cuBLAS libraries. Only Ampere GPUs are currently supported, in order to have hardware bfloat16 support. The LibNC custom CUDA memory allocator allocates memory in chunks of 500 MB.
bfloat16 (IEEE 32-bit floats truncated to 16 bits) is supported on both the CPU and GPU backends. The ADAM optimizer internally keeps the low 16-bit part of the parameters so that no precision is lost during the gradient update.
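Assuming a bfloat16 type constant such as NC_TYPE_BF16 (only NC_TYPE_F32 appears elsewhere in this document, so the exact name is an assumption), a bfloat16 tensor would be created the same way as a float32 one:

/* NC_TYPE_BF16 is an assumed constant name; check libnc.h. */
NCTensor *w = nc_new_tensor_2d(dev, NC_TYPE_BF16, 256, 256);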
The LibNC library is free to use as a binary shared library. Contact the author if access to its source code is required.