As LLAMA tries to stay independent of specific compiler vendors and extensions, C preprocessor macros are used to define some directives for a subset of compilers while presenting a unified interface to the user. Some macros can even be overridden from the outside to enable interoperability with libraries such as alpaka.
We frequently have to deal with dialects of C++ that allow or require specifying the target for which a function is compiled.
To support this use, every method that can be used on offloading devices (e.g. GPUs) uses the
LLAMA_FN_HOST_ACC_INLINE macro as an attribute.
By default, it is defined as:
#define LLAMA_FN_HOST_ACC_INLINE inline
When working with CUDA, it should be globally defined as something like
__host__ __device__ inline.
Please specify this globally as part of your CXX flags.
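The override mechanism can be illustrated with a minimal sketch. It does not include LLAMA itself: the `#ifndef` guard is an assumption about how an externally supplied definition would take precedence over the default, and `mix` is a hypothetical function used only for illustration.

```cpp
#include <cassert>

// Assumption: the default is only applied if nothing was defined externally,
// e.g. on the compiler command line.
#ifndef LLAMA_FN_HOST_ACC_INLINE
#    define LLAMA_FN_HOST_ACC_INLINE inline
#endif

// Hypothetical function that should be callable on the host and,
// with a suitable macro definition, on an offloading device as well.
LLAMA_FN_HOST_ACC_INLINE float mix(float a, float b, float t)
{
    assert(t >= 0.0f && t <= 1.0f); // precondition: t is an interpolation weight
    return a + t * (b - a);
}
```

With a CUDA compiler, a flag along the lines of `-DLLAMA_FN_HOST_ACC_INLINE="__host__ __device__ inline"` would then replace the default for every annotated function; the exact quoting depends on the shell and build system.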
When LLAMA is used in conjunction with alpaka, please define it as
ALPAKA_FN_ACC __forceinline__ (with CUDA) or
Compilers usually cannot assume that two data regions are independent of each other unless the data is fully visible to the compiler
(e.g. a value residing entirely on the stack, or an allocation call that the compiler can observe).
One solution in C is the
restrict keyword, which specifies that the memory pointed to by a pointer is not aliased by anything else.
However, this does not work for more complex data structures containing pointers, and easily fails in other scenarios as well.
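As a rough illustration of the pointer-level approach: ISO C++ has no restrict keyword, but GCC, Clang and MSVC accept `__restrict` as an extension. The sketch below wraps it in a hypothetical macro `MY_RESTRICT` (not part of LLAMA); `scaleInto` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability wrapper around the non-standard __restrict extension.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#    define MY_RESTRICT __restrict
#else
#    define MY_RESTRICT
#endif

// The qualifiers promise the compiler that dst and src never overlap,
// so stores to dst cannot change values later loaded from src.
inline void scaleInto(float* MY_RESTRICT dst, const float* MY_RESTRICT src, float factor, std::size_t n)
{
    assert(dst != src || n == 0); // the promise must actually hold
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * factor;
}
```

The qualifier applies only to the pointer it is written on. If `dst` and `src` were members of a struct passed by reference, there would be no equally direct way to state the same promise, which is the limitation described above.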
Another solution is compiler-specific
#pragmas, which tell the compiler that
each memory access through a pointer inside a loop can be assumed not to interfere with accesses through other pointers.
The usual goal is to allow vectorization.
#pragmas are handy and work with more complex data types, too.
LLAMA provides a macro called
LLAMA_INDEPENDENT_DATA which can be put in front of loops to communicate the independence of memory accesses to the compiler.
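The placement of such a macro can be sketched without LLAMA. The example below defines a hypothetical stand-in, `MY_INDEPENDENT_DATA`, from pragmas that Clang, the classic Intel compiler and GCC are known to accept; it is not LLAMA's actual definition, and `axpy` is an illustrative function. With LLAMA, LLAMA_INDEPENDENT_DATA would go in the same position.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in macro; expands to nothing on unknown compilers.
// Clang is tested first because it also defines __GNUC__.
#if defined(__clang__)
#    define MY_INDEPENDENT_DATA _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__INTEL_COMPILER)
#    define MY_INDEPENDENT_DATA _Pragma("ivdep")
#elif defined(__GNUC__)
#    define MY_INDEPENDENT_DATA _Pragma("GCC ivdep")
#else
#    define MY_INDEPENDENT_DATA
#endif

// Computes y = a * x + y element-wise.
inline void axpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const float* px = x.data();
    float* py = y.data();
    const std::size_t n = y.size();
    // The macro goes directly in front of the loop it applies to.
    MY_INDEPENDENT_DATA
    for (std::size_t i = 0; i < n; ++i)
        py[i] = a * px[i] + py[i];
}
```

The pragma is a promise rather than a check: if the accesses do overlap across iterations, the compiler may still vectorize the loop and produce wrong results.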