
AML areas represent places where data can be read and written. Typically, an area is an object from which you can request and free memory. AML names these actions mmap() and munmap(), after the Linux system calls that map physical memory into the virtual address space.
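For reference, the Linux calls that give these actions their names can be used directly. The following is a minimal sketch of plain POSIX usage, not AML code; the helper names map_anonymous and unmap_anonymous are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: ask the kernel for `size` bytes of private,
 * zero-filled, anonymous memory. Returns NULL on failure. */
void *map_anonymous(size_t size)
{
	assert(size > 0);
	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* mmap() reports failure with MAP_FAILED, not NULL. */
	return ptr == MAP_FAILED ? NULL : ptr;
}

/* Hypothetical helper: return the range to the kernel.
 * Returns 0 on success and -1 on failure, as munmap() does. */
int unmap_anonymous(void *ptr, size_t size)
{
	return munmap(ptr, size);
}
```

An area plays the same role as this pair of helpers, with the added ability to choose where the physical memory comes from.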

On a NUMA processor, mapping data can be as complex as interleaving chunks of several physical memories and making them appear contiguous in user land. On an accelerator, the physical memory of the device can be mapped to a virtual address range that is also mapped to host physical memory. AML builds areas on top of libnuma and CUDA to reach these levels of customization while keeping memory queries and frees as simple as a function call.

As compute nodes get bigger, building relevant areas to manage locality is likely to improve the performance of memory-bound applications. AML areas provide functions to query memory from specific places materialized as areas. The available area implementations dictate how such places can be arranged and determine their properties. An AML area is a low-level, high-overhead abstraction and is not optimized for small, fragmented allocations. Instead, it is intended as a basic building block for implementing better allocators.

The API of AML Area is broken down into two levels.

  • The high-level API provides generic functions that can be applied to all areas. It also describes the general structure of an area for implementers.

  • Implementation-specific methods, constructors, and static area declarations reside in the second level of headers <aml/area/*.h>.

Use Cases

  • Building custom memory mapping policies: high-bandwidth memory only, interleaving with custom block sizes.

  • Abstracting the memory mapping used by allocators.


You need to include at least two headers.

#include <aml.h> // General high level API
#include <aml/area/linux.h> // One area implementation.

From there, you can already query memory from the processor. The Linux area implementation provides a static declaration of a default area: aml_area_linux.

void *data = aml_area_mmap(&aml_area_linux, 4096, NULL);

Here we have allocated 4096 bytes of data and stored their address in the data variable. This data can later be freed as follows.

aml_area_munmap(&aml_area_linux, data, 4096);

Linux Area

If you are working on a NUMA processor, you will eventually want more control over your memory provider. For instance, you might want your data to be spread across all memories to balance the load. One way to achieve this is to use the interleave Linux policy. This policy can be applied when building a custom Linux area.

struct aml_area *interleave_area;
aml_area_linux_create(&interleave_area, NULL, AML_AREA_LINUX_POLICY_INTERLEAVE);

Now we have an “allocator” of interleaved data.

void *data = aml_area_mmap(interleave_area, 4096*8, NULL);

Here we have allocated 8*4096 bytes of data across system memories.

Hwloc Area

If you compiled AML with hwloc backend support and the hwloc library is available at runtime, you can use AML area features built on top of hwloc.

#include <aml/utils/features.h>
#include <aml/area/hwloc.h>
if (aml_support_backends(AML_BACKEND_HWLOC)) {
  // hwloc-based areas can be used here.
}

The backend provides static areas, e.g., for interleaving data on all NUMA nodes:

void *data = aml_area_mmap(&aml_area_hwloc_interleave, size, NULL);

It also provides area constructors based on hwloc memory binding policies and nodesets. The backend may require that you provide objects from the current system topology. Such a topology may be obtained through the hwloc API.

Additionally, AML provides a performance-based area, aml_area_hwloc_preferred. This type of area allocates data on the available NUMA nodes with the highest performance from the perspective of an initiator. In this context, an initiator is a topology object with a cpuset. aml_area_hwloc_preferred areas are initialized with respect to an initiator and a performance criterion, e.g., HWLOC_DISTANCES_KIND_MEANS_BANDWIDTH, HWLOC_DISTANCES_KIND_FROM_OS.

// The object from which the bandwidth is maximized.
hwloc_obj_t initiator = hwloc_get_obj_by_type(aml_topology, HWLOC_OBJ_CORE, 0);

struct aml_area *area;
aml_area_hwloc_preferred_create(&area, initiator,
                                HWLOC_DISTANCES_KIND_FROM_OS |
                                HWLOC_DISTANCES_KIND_MEANS_BANDWIDTH);
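The idea of "highest performance from an initiator perspective" can be pictured as an arg-max over per-node values measured from one initiator. The sketch below only illustrates that selection rule; it is not how AML or hwloc implement it, and the function name and the bandwidth table it takes are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: among the nodes flagged as available, return the
 * index of the one with the highest bandwidth as seen from a single
 * initiator, or -1 if no node is available. */
int pick_preferred_node(const unsigned long *bandwidth,
			const int *available,
			size_t num_nodes)
{
	int best = -1;

	assert(bandwidth != NULL && available != NULL);
	for (size_t i = 0; i < num_nodes; i++) {
		if (!available[i])
			continue;
		if (best < 0 || bandwidth[i] > bandwidth[best])
			best = (int)i;
	}
	return best;
}
```

With bandwidths of 100, 400, and 250 for nodes 0 to 2, node 1 is selected; if node 1 is unavailable, the choice falls back to node 2.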


CUDA Area

If you compiled AML on a CUDA-capable node, you can use the AML CUDA implementation of its building blocks. AML can allocate CUDA device memory in much the same way as with the Linux implementation.

#include <aml.h> // General high level API
#include <aml/area/cuda.h> // CUDA area implementation.
void *data = aml_area_mmap(&aml_area_cuda, 4096, NULL);

The pointer obtained from this allocation is a device-side pointer. It can’t be directly read and written from a host processor.

Exercise: CUDA Mirror Allocation

As an exercise, dive into the <aml/area/cuda.h> header and create an area that hands out pointers that can be read and written on both the host and the device side. Check that modifications on the host side are mirrored on the device side.

/*
 * Copyright 2019 UChicago Argonne, LLC.
 * This file is part of the AML project.
 * For more info, see
 * SPDX-License-Identifier: BSD-3-Clause
 */

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <aml.h>
#include <aml/area/cuda.h>

void test_default_area(const size_t size, void *host_buf)
{
	void *device_buf;

	// Map data on current device.
	device_buf = aml_area_mmap(&aml_area_cuda, size, NULL);
	if (device_buf == NULL) {
		aml_perror("aml_area_cuda");
		exit(1);
	}
	// Check we can perform a data transfer to mapped device memory.
	assert(cudaMemcpy(device_buf, host_buf,
			  size, cudaMemcpyHostToDevice) == cudaSuccess);

	printf("Default cuda area worked!\n");

	// Cleanup
	aml_area_munmap(&aml_area_cuda, device_buf, size);
}

void test_custom_area(const size_t size, void *host_buf)
{
	int err;
	void *device_buf;
	struct aml_area *area;

	// When calling mmap, we are going to specify a device
	// and host side memory to map to.
	// We want to map host_buf to device_buf.
	struct aml_area_cuda_mmap_options opts = {
		.device = 0,
		.ptr = host_buf
	};

	// Create an area that will map data on device 0, with host memory.
	err = aml_area_cuda_create(&area, 0, AML_AREA_CUDA_FLAG_ALLOC_MAPPED);
	if (err != AML_SUCCESS) {
		fprintf(stderr, "aml_area_cuda_create: %s\n",
			aml_strerror(err));
		exit(1);
	}
	// Map host memory with device memory.
	if (aml_area_mmap(area, size,
			  (struct aml_area_mmap_options *)&opts) == NULL) {
		aml_perror("aml_area_cuda");
		exit(1);
	}
	// Get device memory that is mapped with host memory.
	assert(cudaHostGetDevicePointer(&device_buf, host_buf, 0) ==
	       cudaSuccess);

	// Set data from host.
	memset(host_buf, '#', size);

	// Check that data on device has been set to the same value,
	// i.e., mapping works.
	assert(cudaMemcpy(host_buf, device_buf,
			  size, cudaMemcpyDeviceToHost) == cudaSuccess);
	for (size_t i = 0; i < size; i++)
		assert(((char *)host_buf)[i] == '#');

	printf("Custom cuda area worked!\n");

	// Cleanup
	aml_area_munmap(area, device_buf, size);
}

int main(void)
{
	// Skip tutorial if this is not supported.
	if (!aml_support_backends(AML_BACKEND_CUDA))
		return 77;

	const size_t size = (2 << 16);	// 128 KiB, i.e., 32 pages of 4 KiB
	void *host_buf = malloc(size);
	if (host_buf == NULL)
		return 1;

	test_default_area(size, host_buf);
	test_custom_area(size, host_buf);

	free(host_buf);
	return 0;
}

You can find this solution in doc/tutorials/area/.

Implementing a Custom Area

You might want to use AML building blocks with an area behavior that is not part of AML. This is achievable by implementing the area building block to match the desired behavior. In short, all AML building blocks consist of attributes stored in the data field and methods stored in the ops field. In the case of an area, struct aml_area_ops requires custom mmap, munmap, and fprintf fields to be implemented. Let’s implement an empty area. This area has no attributes, i.e., data is NULL, and each of its operations prints a message. We first implement the area methods.

#include <stdio.h>
#include <aml.h> // General high level API

void* _mmap(const struct aml_area_data *data, size_t size, struct aml_area_mmap_options *opts) {
  (void) data; (void) size; (void) opts; // ignore arguments
  printf("mmap called.\n");
  return NULL;
}

int _munmap(const struct aml_area_data *data, void *ptr, size_t size) {
  (void) data; (void) ptr; (void) size; // ignore arguments
  printf("munmap called.\n");
  return AML_SUCCESS;
}

int _fprintf(const struct aml_area_data *data, FILE *stream, const char *prefix) {
  (void) data; // ignore argument
  fprintf(stream, "%s: fprintf called.\n", prefix);
  return AML_SUCCESS;
}

Now we can declare the area methods and area itself.

// Area methods declaration
struct aml_area_ops _ops = {
  .mmap = _mmap,
  .munmap = _munmap,
  .fprintf = _fprintf,
};

// Area declaration
struct aml_area _area = {
  .data = NULL,
  .ops = &_ops,
};

Let’s try it out:

aml_area_mmap(&_area, 4096, NULL);
// "mmap called."
aml_area_munmap(&_area, NULL, 4096);
// "munmap called."
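The generic calls only forward to the function pointers stored in ops, passing data along as the implementation's private state. The self-contained sketch below mimics that dispatch with made-up toy_* types and a malloc()-backed implementation; it does not reproduce AML's actual definitions or error codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for an area: attributes in `data`, methods in `ops`. */
struct toy_area_ops {
	void *(*mmap)(void *data, size_t size);
	int (*munmap)(void *data, void *ptr, size_t size);
};

struct toy_area {
	void *data;
	const struct toy_area_ops *ops;
};

/* Generic entry point: validate arguments, then forward to the method. */
void *toy_area_mmap(const struct toy_area *area, size_t size)
{
	if (area == NULL || area->ops == NULL || area->ops->mmap == NULL ||
	    size == 0)
		return NULL;
	return area->ops->mmap(area->data, size);
}

int toy_area_munmap(const struct toy_area *area, void *ptr, size_t size)
{
	if (area == NULL || area->ops == NULL || area->ops->munmap == NULL)
		return -1;
	return area->ops->munmap(area->data, ptr, size);
}

/* One implementation: malloc()/free() plus a live-allocation counter
 * kept in the area attributes. */
static void *counting_mmap(void *data, size_t size)
{
	size_t *live = data;
	void *ptr = malloc(size);

	if (ptr != NULL)
		(*live)++;
	return ptr;
}

static int counting_munmap(void *data, void *ptr, size_t size)
{
	size_t *live = data;

	(void)size; /* free() does not need the size */
	assert(*live > 0);
	free(ptr);
	(*live)--;
	return 0;
}

const struct toy_area_ops counting_ops = {
	.mmap = counting_mmap,
	.munmap = counting_munmap,
};
```

Swapping counting_ops for another table of functions changes where memory comes from without touching the callers, which is the property the area abstraction relies on.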

Exercise: interleaving in blocks of 2 pages

Using the mbind() function from libnuma, implement an area that interleaves blocks of 2 pages across the system memories. For instance, assume a system with 4 NUMA nodes and a buffer of 16 pages. Pages must be allocated as follows:

page: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]
NUMA: [ 0, 0, 1, 1, 2, 2, 3, 3, 0, 0,  1,  1,  2,  2,  3,  3 ]
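The table follows a simple rule: divide the page index by the block length, then wrap around the number of nodes. A minimal sketch of that arithmetic, with a hypothetical helper name, is:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: node expected to hold page number `page` when
 * blocks of `pages_per_block` consecutive pages are dealt round-robin
 * over `num_nodes` nodes. */
unsigned expected_node(size_t page, size_t pages_per_block,
		       unsigned num_nodes)
{
	assert(pages_per_block > 0 && num_nodes > 0);
	return (unsigned)((page / pages_per_block) % num_nodes);
}
```

For example, with blocks of 2 pages and 4 nodes, page 10 belongs to block 5, and 5 modulo 4 gives node 1, as in the table.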

You can retrieve the size of a page in the following way:

#include <unistd.h>
long page_size = sysconf(_SC_PAGESIZE);

You can test whether your data is interleaved as requested using the code below.

// Function to get last NUMA node id on which data is allocated.
static int get_node(void *data)
{
        long err;
        int policy;
        unsigned long maxnode = sizeof(unsigned long) * 8;
        unsigned long nmask = 0;
        int node = -1;

        err = get_mempolicy(&policy, &nmask, maxnode, data, MPOL_F_ADDR);
        if (err == -1) {
                perror("get_mempolicy");
                exit(1);
        }

        // The index of the highest bit set in the mask is the node id.
        while (nmask != 0) {
                node++;
                nmask = nmask >> 1;
        }

        return node;
}

// Check if data of size `size` is interleaved on all nodes,
// by chunk of size `page_size`.
static int is_interleaved(void *data, const size_t size, const size_t page_size)
{
        intptr_t start;
        int node, next, num_nodes = 0;

        // `data` is expected to be page-aligned, e.g., returned by mmap().
        start = (intptr_t) data;

        node = get_node((void *)start);
        // no node in policy.
        if (node < 0)
                return 0;

        for (intptr_t page = start + page_size;
             (size_t) (page - start) < size; page += page_size) {
                next = get_node((void *)page);

                // no node in page policy.
                if (next < 0)
                        return 0;
                // not round-robin
                if (next != (node + 1) && next != 0)
                        return 0;
                // cycling on different number of nodes
                if (num_nodes != 0 && next >= num_nodes)
                        return 0;
                // cycling on different number of nodes
                if (num_nodes != 0 && next == 0 && num_nodes != node + 1)
                        return 0;
                // set num_nodes
                if (num_nodes == 0 && next == 0)
                        num_nodes = node + 1;
                node = next;
        }

        return 1;
}
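Rounding an address down to a page boundary is commonly done with a mask rather than with shifts by the page size, provided the page size is a power of two. The sketch below shows that arithmetic in isolation; align_down is a hypothetical helper name and is not part of AML.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round `addr` down to a multiple of `alignment`.
 * `alignment` must be a non-zero power of two, e.g., a page size. */
uintptr_t align_down(uintptr_t addr, size_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	/* Clearing the low bits removes the offset inside the page. */
	return addr & ~((uintptr_t)alignment - 1);
}
```

Addresses returned by mmap() are already aligned to the system page size, so this step only matters for pointers obtained some other way.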


A possible solution is given below.

/*
 * Copyright 2019 UChicago Argonne, LLC.
 * This file is part of the AML project.
 * For more info, see
 * SPDX-License-Identifier: BSD-3-Clause
 */

#define _GNU_SOURCE
#include <aml.h>
#include <numa.h>
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

// Function to get last NUMA node id on which data is allocated.
static int
get_node(void *data)
{
	long err;
	int policy;
	unsigned long maxnode = sizeof(unsigned long) * 8;
	unsigned long nmask   = 0;
	int node	      = -1;

	err = get_mempolicy(&policy, &nmask, maxnode, data, MPOL_F_ADDR);
	if (err == -1) {
		perror("get_mempolicy");
		exit(1);
	}

	// The index of the highest bit set in the mask is the node id.
	while (nmask != 0) {
		node++;
		nmask = nmask >> 1;
	}

	return node;
}

// Check if data of size `size` is interleaved on all nodes,
// by chunk of size `page_size`.
static int
is_interleaved(void *data, const size_t size, const size_t page_size)
{
	intptr_t start;
	int node, next, num_nodes = 0;

	// `data` is expected to be page-aligned, e.g., returned by mmap().
	start = (intptr_t)data;

	node = get_node((void *)start);
	// no node in policy.
	if (node < 0)
		return 0;

	for (intptr_t page = start + page_size; (size_t)(page - start) < size;
	     page += page_size) {
		next = get_node((void *)page);

		// no node in page policy.
		if (next < 0)
			return 0;
		// not round-robin
		if (next != (node + 1) && next != 0)
			return 0;
		// cycling on different number of nodes
		if (num_nodes != 0 && next >= num_nodes)
			return 0;
		// cycling on different number of nodes
		if (num_nodes != 0 && next == 0 && num_nodes != node + 1)
			return 0;
		// set num_nodes
		if (num_nodes == 0 && next == 0)
			num_nodes = node + 1;
		node = next;
	}

	return 1;
}

// Custom area attributes.
struct area_data {
	unsigned long nid;  // One-bit mask selecting the current node.
	unsigned long nmax; // highest node number on this system.
	int page_size;      // Size in bytes of an interleaving block.
};

// Custom area mmap implementation
void *
custom_mmap(const struct aml_area_data *data,
	    size_t size,
	    struct aml_area_mmap_options *opts)
{
	(void)opts;
	intptr_t start;
	struct area_data *area = (struct area_data *)data;
	void *ret	      = mmap(NULL,
				     size,
				     PROT_READ | PROT_WRITE,
				     MAP_PRIVATE | MAP_ANONYMOUS,
				     -1,
				     0);

	if (ret == MAP_FAILED)
		return NULL;

	// mmap() returns a page-aligned address; blocks start there.
	start = (intptr_t)ret;

	for (intptr_t page = start; page < (start + (intptr_t)size);
	     page += area->page_size) {
		// Bind the current block to the node selected by the mask.
		if (mbind((void *)page,
			  area->page_size,
			  MPOL_BIND,
			  &area->nid,
			  8 * sizeof(unsigned long),
			  MPOL_MF_MOVE) == -1) {
			perror("mbind");
			munmap(ret, size);
			return NULL;
		}
		// Move to the next node, wrapping after the last one.
		area->nid = area->nid << 1;
		if (area->nid > (1UL << area->nmax))
			area->nid = 1;
	}

	return ret;
}

// Custom area munmap implementation
int
custom_munmap(const struct aml_area_data *data, void *ptr, size_t size)
{
	(void)data;
	munmap(ptr, size);
	return AML_SUCCESS;
}

// Custom area constructor
struct aml_area *
custom_area_create(void)
{
	struct area_data *data;
	struct aml_area_ops *ops;
	struct aml_area *ret = AML_INNER_MALLOC(
	    struct aml_area, struct aml_area_ops, struct area_data);
	if (ret == NULL)
		return NULL;

	ret->ops = AML_INNER_MALLOC_GET_FIELD(
	    ret, 2, struct aml_area, struct aml_area_ops, struct area_data);
	ops	 = ret->ops;
	ops->mmap   = custom_mmap;
	ops->munmap = custom_munmap;

	ret->data = AML_INNER_MALLOC_GET_FIELD(
	    ret, 3, struct aml_area, struct aml_area_ops, struct area_data);
	data		= (struct area_data *)ret->data;
	data->nid       = 1;
	data->nmax      = numa_max_node();
	data->page_size = 2 * sysconf(_SC_PAGESIZE); // 2 pages allocation

	return ret;
}

// Test that custom area will interleave pages as expected.
void
test_custom_area(const size_t size)
{
	void *buf;
	struct aml_area *interleave_area = custom_area_create();

	if (interleave_area == NULL)
		exit(1);

	// Map buffer in area.
	buf = aml_area_mmap(interleave_area, size, NULL);
	if (buf == NULL) {
		aml_perror("aml_area_mmap");
		exit(1);
	}
	// Check it is indeed interleaved
	if (!is_interleaved(buf, size, 2 * sysconf(_SC_PAGESIZE)))
		exit(1);
	printf("Custom area worked and is interleaved.\n");

	// Cleanup
	aml_area_munmap(interleave_area, buf, size);
}

int
main(void)
{
	const size_t size = (2 << 16); // 128 KiB, i.e., 32 pages of 4 KiB

	test_custom_area(size);
	return 0;
}
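The solution cycles through nodes by keeping a one-bit node mask in nid and shifting it after each block. That rotation can be isolated and checked without any NUMA hardware. The sketch below is an illustration only; next_node_mask is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a one-bit mask selecting the current node,
 * return the mask selecting the next node, wrapping back to node 0
 * after node `max_node`. */
unsigned long next_node_mask(unsigned long mask, unsigned max_node)
{
	assert(mask != 0);
	assert(max_node < 8 * sizeof(unsigned long));

	mask <<= 1;
	/* mask == 0 covers the case where the top bit was shifted out. */
	if (mask == 0 || mask > (1UL << max_node))
		mask = 1;
	return mask;
}
```

On a system whose highest node number is 3, the mask takes the values 1, 2, 4, 8 and then returns to 1, which yields the node sequence 0, 1, 2, 3, 0 used by the exercise.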