numa − NUMA policy library

#include <numa.h>
cc ... −lnuma

void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the thread is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind threads to specific nodes.
NUMA memory allocation policy is a per-thread attribute, but is inherited by children.
To set a specific policy globally for all memory allocations in a process and its children, it is easiest to start the process with the numactl(8) utility. For more fine-grained policy inside an application, this library can be used.
All NUMA memory allocation policies only take effect when a page is actually faulted into the address space of a process by accessing it. The numa_alloc_* functions take care of this automatically.
A node is defined as an area where all memory has the same speed as seen from a particular CPU. A node can contain multiple CPUs. Caches are ignored for this definition.
This library is only concerned with nodes and their memory; it does not deal with individual CPUs inside these nodes (except for numa_node_to_cpus()).
Before any other calls in this library can be used numa_available() must be called. If it returns −1, all other functions in this library are undefined.
numa_max_node() returns the highest node number available on the current system. If a node number or a node mask with a bit set above the value returned by this function is passed to a libnuma function, the result is undefined.
numa_node_size() returns the memory size of a node. If the argument freep is not NULL, it is used to return the amount of free memory on the node. On error it returns −1. numa_node_size64() works the same as numa_node_size() except that it returns values as long long instead of long. This is useful on 32-bit architectures with large nodes.
Some of these functions accept or return a nodemask. A nodemask has type nodemask_t. It is an abstract bitmap type containing a bit set of nodes. The maximum node number depends on the architecture, but is not larger than numa_max_node(). What happens in libnuma calls when bits above numa_max_node() are passed is undefined. A nodemask_t should only be manipulated with the nodemask_zero(), nodemask_clr(), nodemask_isset(), and nodemask_set() functions. nodemask_zero() clears a nodemask_t. nodemask_isset() returns true if node is set in the passed nodemask. nodemask_clr() clears node in nodemask. nodemask_set() sets node in nodemask. The predefined variable numa_all_nodes has all available nodes set; numa_no_nodes is the empty set. nodemask_equal() returns non-zero if its two nodeset arguments are equal.
numa_preferred() returns the preferred node of the current thread. This is the node on which the kernel preferably allocates memory, unless some other policy overrides this.
numa_set_interleave_mask() sets the memory interleave mask for the current thread to nodemask. All new memory allocations are page interleaved over all nodes in the interleave mask. Interleaving can be turned off again by passing an empty mask (numa_no_nodes). The page interleaving only occurs on the actual page fault that puts a new page into the current address space. It is also only a hint: the kernel will fall back to other nodes if no memory is available on the interleave target. This is a low level function, it may be more convenient to use the higher level functions like numa_alloc_interleaved() or numa_alloc_interleaved_subset().
numa_get_interleave_mask() returns the current interleave mask.
numa_bind() binds the current thread and its children to the nodes specified in nodemask. They will only run on the CPUs of the specified nodes and only be able to allocate memory from them. This function is equivalent to calling numa_run_on_node_mask(nodemask) followed by numa_set_membind(nodemask). If threads should be bound to individual CPUs inside nodes consider using numa_node_to_cpus and the sched_setaffinity(2) syscall.
numa_set_preferred() sets the preferred node for the current thread to node. The preferred node is the node on which memory is preferably allocated before falling back to other nodes. The default is to use the node on which the process is currently running (local policy). Passing a −1 argument is equivalent to numa_set_localalloc().
numa_set_localalloc() sets a local memory allocation policy for the calling thread. Memory is preferably allocated on the node on which the thread is currently running.
numa_set_membind() sets the memory allocation mask. The thread will only allocate memory from the nodes set in nodemask. Passing an argument of numa_no_nodes or numa_all_nodes turns off memory binding to specific nodes.
numa_get_membind() returns the mask of nodes from which memory can currently be allocated. If the returned mask is equal to numa_no_nodes or numa_all_nodes, then all nodes are available for memory allocation.
numa_alloc_interleaved() allocates size bytes of memory page interleaved on all nodes. This function is relatively slow and should only be used for large areas consisting of multiple pages. The interleaving works at page level and will only show an effect when the area is large. The allocated memory must be freed with numa_free(). On error, NULL is returned.
numa_alloc_interleaved_subset() is like numa_alloc_interleaved() except that it also accepts a mask of the nodes to interleave on. On error, NULL is returned.
numa_alloc_onnode() allocates memory on a specific node. This function is relatively slow and allocations are rounded up to the system page size. The memory must be freed with numa_free(). On errors NULL is returned.
numa_alloc_local() allocates size bytes of memory on the local node. This function is relatively slow and allocations are rounded up to the system page size. The memory must be freed with numa_free(). On errors NULL is returned.
numa_alloc() allocates size bytes of memory with the current NUMA policy. This function is relatively slow and allocations are rounded up to the system page size. The memory must be freed with numa_free(). On errors NULL is returned.
numa_free() frees size bytes of memory starting at start, allocated by the numa_alloc_* functions above.
numa_run_on_node() runs the current thread and its children on a specific node. They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to numa_run_on_node_mask(). Passing −1 permits the kernel to schedule on all nodes again. On success, 0 is returned; on error −1 is returned, and errno is set to indicate the error.
numa_run_on_node_mask() runs the current thread and its children only on nodes specified in nodemask. They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to numa_run_on_node_mask(). Passing numa_all_nodes permits the kernel to schedule on all nodes again. On success, 0 is returned; on error −1 is returned, and errno is set to indicate the error.
numa_get_run_node_mask() returns the mask of nodes that the current thread is allowed to run on.
numa_interleave_memory() interleaves size bytes of memory page by page from start over the nodes in nodemask. This is a lower level function to interleave memory that has been allocated but not yet faulted in; that is, memory allocated with mmap(2) or shmat(2) that has not yet been accessed by the current process. The memory is page interleaved over all nodes specified in nodemask. Normally numa_alloc_interleaved() should be used for private memory instead, but this function is useful to handle shared memory areas. To be useful the memory area should be at least several megabytes (or tens of megabytes for hugetlbfs mappings). If the numa_set_strict() flag is set, the operation will cause numa_error() to be called if there were already pages in the mapping that do not follow the policy.
numa_tonode_memory() puts memory on a specific node. The constraints described for numa_interleave_memory() apply here too.
numa_tonodemask_memory() puts memory on a specific set of nodes. The constraints described for numa_interleave_memory() apply here too.
numa_setlocal_memory() locates memory on the current node. The constraints described for numa_interleave_memory() apply here too.
numa_police_memory() locates memory with the current NUMA policy. The constraints described for numa_interleave_memory() apply here too.
numa_node_to_cpus() converts a node number to a bitmask of CPUs. The user must pass a long enough buffer. If the buffer is not long enough errno will be set to ERANGE and −1 returned. On success 0 is returned.
numa_set_bind_policy() specifies whether calls that bind memory to a specific node should use the preferred policy or a strict policy. The preferred policy allows the kernel to allocate memory on other nodes when there isn’t enough free on the target node; strict will fail the allocation in that case. Setting the argument to 1 specifies strict, 0 preferred. Note that specifying more than one node with the non-strict policy may only use the first node in some kernel versions.
numa_set_strict() sets a flag that says whether the functions allocating on specific nodes should use a strict policy. Strict means the allocation will fail if the memory cannot be allocated on the target node. Default operation is to fall back to other nodes. This doesn’t apply to interleave and default.
numa_distance() reports the distance in the machine topology between two nodes. The factors are a multiple of 10. It returns 0 when the distance cannot be determined. A node has distance 10 to itself. Reporting the distance requires a Linux kernel version of 2.6.10 or newer.
numa_error() is a weak internal libnuma function that can be overridden by the user program. This function is called with a char * argument when a libnuma function fails. Overriding the weak library definition makes it possible to specify a different error handling strategy when a libnuma function fails. It does not affect numa_available().
The numa_error() function defined in libnuma prints an error on stderr and terminates the program if numa_exit_on_error is set to a non-zero value. The default value of numa_exit_on_error is zero.
numa_warn() is a weak internal libnuma function that can be also overridden by the user program. It is called to warn the user when a libnuma function encounters a non-fatal error. The default implementation prints a warning to stderr.
The first argument is a unique number identifying each warning. After that there is a printf(3)-style format string and a variable number of arguments.
numa_set_bind_policy and numa_exit_on_error are process global. The other calls are thread safe.
Memory policy set for memory areas is shared by all threads of the process. Memory policy is also shared by other processes mapping the same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not shared for disk backed file mappings right now although that may change in the future.
Copyright 2002, 2004, Andi Kleen, SuSE Labs. libnuma is under the GNU Lesser General Public License, v2.1.
get_mempolicy(2), getpagesize(2), mbind(2), mmap(2), set_mempolicy(2), shmat(2), numactl(8), sched_setaffinity(2)