DMAs

What is an AML DMA?

In computer science, DMA (Direct Memory Access) is a hardware-accelerated method for moving data across memory regions without the intervention of a compute unit.

Similarly, in AML a DMA is the building block used to move data. Data is generally moved between two areas, i.e., between two virtual memory ranges, each represented by a pointer.

Data is moved from one layout to another. When performing a DMA operation, layout coordinates are walked element by element in post order and matched to translate source coordinates into destination coordinates.
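
For intuition, here is a minimal, purely synchronous sketch of what such a copy amounts to for one-dimensional layouts. It assumes the generic layout accessors aml_layout_dims, aml_layout_deref, and aml_layout_element_size; actual DMA backends generalize this walk to n dimensions and may parallelize or offload it.

#include <string.h>
#include <aml.h>

// Conceptual sketch only: copy a 1-D source layout into a 1-D destination
// layout, element by element, the way a DMA copy operator does internally.
static void sketch_copy_1d(struct aml_layout *dst, struct aml_layout *src)
{
	size_t n;

	aml_layout_dims(src, &n); // both layouts assumed to hold n elements
	for (size_t i = 0; i < n; i++) {
		size_t coords[1] = {i};
		memcpy(aml_layout_deref(dst, coords),
		       aml_layout_deref(src, coords),
		       aml_layout_element_size(src));
	}
}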

Depending on the DMA implementation, this operation can be optimized or offloaded to a DMA accelerator. Data can thus be moved asynchronously by the DMA engine, e.g., pthreads on a CPU and CUDA streams on CUDA accelerators.

The API for using AML DMA is broken down into two levels.

  • The high-level API provides generic functions that can be applied on all DMAs. It also describes the general structure of a DMA for implementers.

  • Implementation-specific methods, constructors, and static DMA declarations reside in the second level of headers, <aml/dma/*.h>.

Examples of AML DMA Use Cases

  • Prefetching:

    Writing an efficient matrix multiplication routine requires many architecture-specific optimizations, such as vectorization and cache blocking. On hardware with a software-managed side memory cache (e.g., MCDRAM on Intel Knights Landing), manually prefetching blocks into the side cache helps improve performance for large problem sizes. AML DMA helps make such code more compact by abstracting asynchronous memory movements. Moreover, AML can also use the prefetch time to reorganize data so that matrix multiplication on the prefetched blocks can be vectorized. See the linked publication for more details.

  • Replication:

    Some applications have a memory access pattern in which all threads access the same data in a read-only, latency-bound fashion. On NUMA computing systems, accessing distant memories incurs a penalty that increases application execution time. In such a scenario (application + NUMA), replicating the data across memories to avoid NUMA penalties can yield significant performance improvements. AML DMA is the natural building block for implementing a simple data-replication interface; a minimal sketch is shown below. (This work has been accepted for publication at the PHYSOR 2020 conference.)
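
As a rough illustration, the sketch below uses only the DMA and layout calls introduced in the Usage section: it issues one asynchronous copy per replica and then waits for all of them. In a real NUMA setting each replica buffer would be allocated on a distinct memory node (e.g., through an AML area), which is left out here.

#include <aml.h>

// Hypothetical replication sketch: copy one source layout into several
// replica layouts. The replica layouts stand in for per-NUMA-node buffers.
#define NREPLICAS 4

int replicate(struct aml_dma *dma, struct aml_layout *src,
              struct aml_layout *replicas[NREPLICAS])
{
	struct aml_dma_request *reqs[NREPLICAS];

	// Issue all copies first so they can progress concurrently.
	for (int i = 0; i < NREPLICAS; i++) {
		int err = aml_dma_async_copy_custom(dma, &reqs[i],
		                                    replicas[i], src,
		                                    NULL, NULL);
		if (err != AML_SUCCESS)
			return err;
	}
	// Wait for every replica before letting threads read from it.
	for (int i = 0; i < NREPLICAS; i++)
		aml_dma_wait(dma, &reqs[i]);
	return AML_SUCCESS;
}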

Usage

First, include the appropriate headers.

#include <aml.h>
#include <aml/dma/linux-par.h> // one DMA implementation.
#include <aml/layout/dense.h> // one layout implementation.

The first header contains the generic DMA API and AML utilities. The second header provides a DMA implementation that performs data transfers in the background with pthreads. The third header is used to describe the source and destination data of the transfer.

In order to perform a DMA request, you will need to set up a DMA, i.e., the engine that performs requests, then perform the request itself.

struct aml_dma *dma;
aml_dma_linux_par_create(&dma, 128, NULL, NULL);

We created a DMA engine with 128 request slots available to handle asynchronous data transfer requests.
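
AML calls return AML_SUCCESS on success and an error code otherwise; aml_strerror (used by the solution at the bottom of this page) turns that code into a message. In real code you would therefore wrap the call above along these lines:

// Error handling around the creation call (needs <stdio.h> and <stdlib.h>).
int err = aml_dma_linux_par_create(&dma, 128, NULL, NULL);
if (err != AML_SUCCESS) {
	fprintf(stderr, "aml_dma_linux_par_create: %s\n", aml_strerror(err));
	exit(1);
}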

Requests operate between two layouts, so you will have to set up layouts as well. Let’s suppose we want to copy src to dst:

double src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
double dst[8] = {0, 0, 0, 0, 0, 0, 0, 0};

The simplest copy requires that both src and dst are one-dimensional layouts of 8 elements.

size_t dims[1] = {8};

For a one-dimensional layout, the dimension order does not matter, so let’s pick AML_LAYOUT_ORDER_COLUMN_MAJOR. Now we can initialize the layouts.

struct aml_layout *src_layout, *dst_layout;
aml_layout_dense_create(&dst_layout, dst, AML_LAYOUT_ORDER_COLUMN_MAJOR, sizeof(*dst), 1, dims, NULL, NULL);
aml_layout_dense_create(&src_layout, src, AML_LAYOUT_ORDER_COLUMN_MAJOR, sizeof(*src), 1, dims, NULL, NULL);

We have created a DMA engine and described our source and destination data. We are all set to schedule a copy DMA request.

struct aml_dma_request *request;
aml_dma_async_copy_custom(dma, &request, dst_layout, src_layout, NULL, NULL);

Now the DMA request is in-flight. When we are ready to access data in dst, we can wait for it.

aml_dma_wait(dma, &request);
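
When the data has been consumed and the handles are no longer needed, release the layouts and the DMA engine. aml_layout_destroy is the generic destructor used in the solution below; aml_dma_linux_par_destroy is assumed here to be the counterpart of aml_dma_linux_par_create.

// Release the layouts and the DMA engine once the transfer is done.
aml_layout_destroy(&src_layout);
aml_layout_destroy(&dst_layout);
aml_dma_linux_par_destroy(&dma);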

Exercise

Let a be a strided vector in which consecutive elements are separated by one unused element. Let b be a strided vector in which consecutive elements are separated by two unused elements. Let ddot be a function that computes the dot product of two contiguous vectors. The goal is to transform a into continuous_a and b into continuous_b in order to perform the dot product on contiguous data.

The definition of ddot is given below:

double ddot(const double *x, const double *y, const size_t n)
{
        double result = 0.0;
        for (size_t i = 0; i < n; i++)
                result += x[i] * y[i];
        return result;
}

A possible value for a and its layout is given here:

double a[8] = {0.534, 6.3424, 65.4543, 4.543e12, 0.0, 1.0, 9.132e2, 23.657};
size_t a_dims[1] = {4}; // a has only 4 elements.
size_t a_stride[1] = {2}; // elements are strided by two.

struct aml_layout *a_layout;
aml_layout_dense_create(&a_layout, a, AML_LAYOUT_ORDER_COLUMN_MAJOR, sizeof(*a), 1, a_dims, a_stride, NULL);

A possible value for b and its layout is given here:

double b[12] = {1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, };
size_t b_dims[1] = {4}; // b has 4 elements as well.
size_t b_stride[1] = {3}; // b elements are strided by three.

struct aml_layout *b_layout;
aml_layout_dense_create(&b_layout, b, AML_LAYOUT_ORDER_COLUMN_MAJOR, sizeof(*b), 1, b_dims, b_stride, NULL);

Solution


/*******************************************************************************
 * Copyright 2019 UChicago Argonne, LLC.
 * (c.f. AUTHORS, LICENSE)
 *
 * This file is part of the AML project.
 * For more info, see https://github.com/anlsys/aml
 *
 * SPDX-License-Identifier: BSD-3-Clause
 ******************************************************************************/

#include <stdio.h>
#include <stdlib.h>

#include <aml.h>

#include <aml/dma/linux.h>
#include <aml/layout/dense.h>

/** Error checking and printing. Should not be triggered **/
static inline void
CHK_ABORT(int err, const char *message)
{
	if (err != AML_SUCCESS) {
		fprintf(stderr, "%s: %s\n", message, aml_strerror(err));
		exit(1);
	}
}

/**
 * The ddot function
 **/
static inline double
ddot(const double *x, const double *y, const size_t n)
{
	double result = 0.0;

	for (size_t i = 0; i < n; i++)
		result += x[i] * y[i];
	return result;
}

int
main(void)
{
	int err;

	//--------------------------- Initialization -------------------------------
	struct aml_dma *dma;

	err = aml_dma_linux_create(&dma, 0);
	CHK_ABORT(err, "aml_dma_linux_create:");

	// Defining 'a' vector: {0.534, 65.4543, 0, 913.2} with a stride of 2.
	double a[8] = {
	    0.534, 6.3424, 65.4543, 4.543e12, 0.0, 1.0, 9.132e2, 23.657};
	size_t a_dims[1]   = {4}; // a has only 4 elements.
	size_t a_stride[1] = {2}; // elements are strided by 2.
	struct aml_layout *a_layout;

	err = aml_layout_dense_create(&a_layout,
				      a,
				      AML_LAYOUT_ORDER_COLUMN_MAJOR,
				      sizeof(*a),
				      1,
				      a_dims,
				      a_stride,
				      NULL);
	CHK_ABORT(err, "aml_layout_dense_create:");

	// Defining 'b' vector { 1.0, 1.0, 1.0, 1.0 } with a stride of 3.
	double b[12] = {
	    1.0,
	    0.0,
	    0.0,
	    1.0,
	    0.0,
	    0.0,
	    1.0,
	    0.0,
	    0.0,
	    1.0,
	    0.0,
	    0.0,
	};
	size_t b_dims[1]   = {4}; // b has 4 elements as well.
	size_t b_stride[1] = {3}; // b elements are strided by 3.
	struct aml_layout *b_layout;

	err = aml_layout_dense_create(&b_layout,
				      b,
				      AML_LAYOUT_ORDER_COLUMN_MAJOR,
				      sizeof(*b),
				      1,
				      b_dims,
				      b_stride,
				      NULL);
	CHK_ABORT(err, "aml_layout_dense_create:");

	//----------------- Defining ddot continuous layouts -----------------------

	double continuous_a[4];
	double continuous_b[4];
	size_t continuous_dims[1] = {4};
	struct aml_layout *a_continuous_layout;
	struct aml_layout *b_continuous_layout;

	err = aml_layout_dense_create(&a_continuous_layout,
				      continuous_a,
				      AML_LAYOUT_ORDER_COLUMN_MAJOR,
				      sizeof(*continuous_a),
				      1,
				      continuous_dims,
				      NULL,
				      NULL);
	CHK_ABORT(err, "aml_layout_dense_create:");

	err = aml_layout_dense_create(&b_continuous_layout,
				      continuous_b,
				      AML_LAYOUT_ORDER_COLUMN_MAJOR,
				      sizeof(*continuous_b),
				      1,
				      continuous_dims,
				      NULL,
				      NULL);
	CHK_ABORT(err, "aml_layout_dense_create:");

	//----------------- Transform 'a' and 'b' to be continuous -----------------

	// Handle to the dma request we are about to issue.
	struct aml_dma_request *a_request;
	struct aml_dma_request *b_request;

	// Schedule requests
	err =
	    aml_dma_async_copy(dma, &a_request, a_continuous_layout, a_layout);
	CHK_ABORT(err, "aml_dma_async_copy_custom:");

	err =
	    aml_dma_async_copy(dma, &b_request, b_continuous_layout, b_layout);
	CHK_ABORT(err, "aml_dma_async_copy_custom:");

	// Wait for the requests to complete
	err = aml_dma_wait(dma, &a_request);
	CHK_ABORT(err, "aml_dma_wait:");
	err = aml_dma_wait(dma, &b_request);
	CHK_ABORT(err, "aml_dma_wait:");

	//----------------- Perform dot product and check result -------------------

	double result = ddot(continuous_a, continuous_b, 4);

	// check results match.
	if (result != 979.1883)
		return 1;

	//------------------------------ Cleanup ------------------------------------

	aml_layout_destroy(&a_layout);
	aml_layout_destroy(&b_layout);
	aml_layout_destroy(&a_continuous_layout);
	aml_layout_destroy(&b_continuous_layout);
	aml_dma_linux_destroy(&dma);

	return 0;
}

You can find this solution in doc/tutorials/dma/.