Deep Dive into Contiguous Memory Allocator

Posted by Michał ‘mina86’ Nazarewicz on 10th of June 2012
Share on Bluesky

This is an extended version of an LWN article on CMA. It contains more detail on how to use CMA and a lot of boring code samples.

Contiguous Memory Allocator (or CMA) has been developed to allow large physically contiguous memory allocations. By initialising early at boot time and with some fairly intrusive changes to Linux memory management, it is able to allocate large memory chunks without a need to grab memory for exclusive use.

Simple in principle, it grew to be a quite complicated system which requires cooperation between boot-time allocator, buddy system, DMA subsystem, and some architecture-specific code. Still, all that complexity is usually hidden away and normal users won’t be exposed to it. Depending on perspective, CMA appears slightly different and there are different things to be done and look for.

Using CMA in device drivers

From device driver author’s point of view, nothing should change. CMA is integrated with DMA subsystem, so the usual calls to the DMA API (such as dma_alloc_coherent) should work as usual.

In fact, device drivers should never need to call the CMA API directly. Most importantly, device drivers operate on kernel mappings and bus addresses whereas CMA operates on pages and PFNs. Furthermore, CMA does not handle cache coherency, which the DMA API was designed to deal with. Lastly, it is more flexible and allows allocations in atomic contexts (e.g. interrupt handler) and creation of memory pools, which are well suited for small allocations.

For a quick example, this is how allocation might look like:

dma_addr_t dma_addr;
void *virt_addr =
	dma_alloc_coherent(dev, 100 << 20, &dma_addr, GFP_KERNEL);
if (!virt_addr)
	return -ENOMEM;

Provided that dev is a pointer to a valid struct device, the above code will allocate 100 MiB of memory. It may or may not be a CMA memory, but it is a portable way to get buffers. The following can be used to free it:

dma_free_coherent(dev, 100 << 20, virt_addr, dma_addr);

Barry Song has posted a simple test driver which uses those two to allocate DMA memory.

More information about the DMA API can be found in Documentation/core-api/dma-api.rst and Documentation/core-api/dma-api-howto.rst. Those two documents describe provided functions and give usage examples.

Integration with architecture code

Obviously, CMA has to be integrated with given architecture’s DMA subsystem beforehand. This is performed in a few, fairly easy steps. The CMA patchset integrates it with x86 and ARM architectures. This section will refer to both patches as well as quote their relevant portions.

Reserving memory

CMA works by reserving memory early at boot time. This memory, called CMA area or CMA context, is later returned to the buddy system so it can be used by regular applications. To make reservation happen, one needs to call:

void dma_contiguous_reserve(phys_addr_t limit);

after memblock is initialised but prior to the buddy allocator setup.

The limit argument, if not zero, specifies physical address above which no memory will be prepared for CMA. Intention is to allow limiting CMA contexts to addresses that DMA can handle. The only real constraint that CMA imposes is that reserved memory must belong to the same zone.

In case of ARM the limit is set to arm_dma_limit or arm_lowmem_limit, whichever is smallest:

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
@@ -364,6 +373,12 @@ void __init arm_memblock_init(struct meminfo *mi, struct machine_desc *mdesc)
    if (mdesc->reserve)
            mdesc->reserve();

+	/*
+	 * reserve memory for DMA contigouos allocations,
+	 * must come from DMA area inside low memory
+	 */
+	dma_contiguous_reserve(min(arm_dma_limit, arm_lowmem_limit));
+
 	arm_memblock_steal_permitted = false;
 	memblock_allow_resize();
 	memblock_dump_all();

On x86 it is called after memblock is set up in setup_arch function with no limit specified:

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
@@ -934,6 +935,7 @@ void __init setup_arch(char **cmdline_p)
 	}
 #endif
 	memblock.current_limit = get_max_mapped();
+	dma_contiguous_reserve(0);

 	/*
 	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.

The amount of reserved memory depends on a few Kconfig options and a cma kernel parameters which will be describe later on.

Architecture specific memory preparations

dma_contiguous_reserve function will reserve memory and prepare it to be used with CMA. On some architectures architecture-specific work may need to be performed as well. To allow that, CMA will call the following function:

void dma_contiguous_early_fixup(
	phys_addr_t base,
	unsigned long size);

It is architecture’s responsibility to provide it along with its declaration in asm/dma-contiguous.h header file. The function will be called quite early, so some of the kernel subsystems — like kmalloc — will not be available. Furthermore, it may be called several times, but no more than MAX_CMA_AREAS times.

If an architecture does not need any special handling, the header file may say:

#ifndef H_ARCH_ASM_DMA_CONTIGUOUS_H
#define H_ARCH_ASM_DMA_CONTIGUOUS_H
#ifdef __KERNEL__

#include <linux/types.h>
#include <asm-generic/dma-contiguous.h>

static inline void
dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
{ /* nop */ }

#endif
#endif

ARM requires some work modifying mappings and so it provides a full definition of this function:

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
[…]
+static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
+
+static int dma_mmu_remap_num __initdata;
+
+void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
+{
+	dma_mmu_remap[dma_mmu_remap_num].base = base;
+	dma_mmu_remap[dma_mmu_remap_num].size = size;
+	dma_mmu_remap_num++;
+}
+
+void __init dma_contiguous_remap(void)
+{
+	int i;
+	for (i = 0; i < dma_mmu_remap_num; i++) {
		[…]
+	}
+}

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
@@ -1114,11 +1122,12 @@ void __init paging_init(struct machine_desc *mdesc)
 {
 	void *zero_page;

-	memblock_set_current_limit(lowmem_limit);
+	memblock_set_current_limit(arm_lowmem_limit);

 	build_mem_type_table();
 	prepare_page_table();
 	map_lowmem();
+	dma_contiguous_remap();
 	devicemaps_init(mdesc);
 	kmap_init();

DMA subsystem integration

Second thing to do is to change architecture’s DMA API to use the whole machinery. To allocate memory from CMA one uses:

struct page *dma_alloc_from_contiguous(
	struct device *dev,
	int count,
	unsigned int align);

Its first argument is the device allocation is performed on behalf of. The second one specifies number of pages (not bytes or order) to allocate.

The third argument is the alignment expressed as a page order. It enables allocation of buffers which are aligned to at least 2^align pages. To avoid fragmentation, if at all possible, pass zero here. It is worth noting that there is a Kconfig option (CONFIG_CMA_ALIGNMENT) which specifies maximal alignment accepted by the function. By default, its value is 8 meaning an alignment of 256 pages.

The return value is the first of a sequence of count allocated pages.

Here’s how allocation looks on x86:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
@@ -99,14 +99,18 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 				 dma_addr_t *dma_addr, gfp_t flag)
 {
	[…]
 again:
-	page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
+	if (!(flag & GFP_ATOMIC))
+		page = dma_alloc_from_contiguous(dev, count, get_order(size));
+	if (!page)
+		page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
 	if (!page)
 		return NULL;

To free allocated buffer, one needs to call:

bool dma_release_from_contiguous(
	struct device *dev,
	struct page *pages,
	int count);

dev and count arguments are the same as before, whereas pages is what dma_alloc_from_contiguous has returned.

If region passed to the function did not come from CMA, the function will return false. Otherwise, it will return true. This removes the need for higher-level functions to track which allocations were made with CMA and which were made using some other method.

Again, here’s how it is used on x86:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
@@ -126,6 +130,16 @@ again:
 	return page_address(page);
 }

+void dma_generic_free_coherent(struct device *dev, size_t size, void *vaddr,
+			       dma_addr_t dma_addr)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct page *page = virt_to_page(vaddr);
+
+	if (!dma_release_from_contiguous(dev, page, count))
+		free_pages((unsigned long)vaddr, get_order(size));
+}
+
 /*
  * See <Documentation/x86/x86_64/boot-options.txt> for the iommu kernel
  * parameter documentation.

Atomic allocations

Beware that dma_alloc_from_contiguous may not be called from atomic context (e.g. when spin lock is hold or in an interrupt). It performs some ‘heavy’ operations such as page migration, direct reclaim, etc., which may take a while. Because of that, to make dma_alloc_coherent and co. work as advertised, architecture needs to have a different method of allocating memory in atomic context.

The simplest solution is to put aside a bit of memory at boot time and perform atomic allocations from it. This is in fact what ARM is doing. Existing architectures most likely have a special path for atomic allocations already.

Special memory requirements

At this point, most of the drivers should ‘just work’. They use the DMA API which calls CMA. Life is beautiful. Except some devices may have special memory requirements. For instance, Samsung’s Multi-format codec (MFC) requires different types of buffers to be located in different memory banks (which allows reading them through two memory channels, thus increasing memory bandwidth). Furthermore, one may want to separate some device’s allocations from others as to limit fragmentation within CMA areas.

As mentioned earlier, CMA operates on contexts describing a portion of system memory to allocate buffers from. One global area is created to be used by devices by default, but if a device needs to use a different area, it can easily be done.

There is a many-to-one mapping between struct device and struct cma (ie. CMA context). This means, that if a single device driver needs to use more than one CMA area, it has to have separate struct device objects. At the same time, several struct device objects may point to the same CMA context.

Assigning CMA area to a single device

To assign a CMA area to a device, all one needs to do is call:

int dma_declare_contiguous(
	struct device *dev,
	unsigned long size,
	phys_addr_t base,
	phys_addr_t limit);

As with dma_contiguous_reserve, this needs to be called after memblock initializes but before too much memory gets grabbed from it. For ARM platforms, a convenient place to put invocation of this function is machine’s reserve callback.

The first argument of the function is the device that the new context is to be assigned to. The second is its size in bytes (not in pages). The third is physical address of the area or zero. The last one has the same meaning as limit argument to dma_contiguous_reserve. The return value is either zero (on success) or a negative error code.

For an example, one can take a look at the code called from Samsung’s S5P platform’s reserve callback. It creates two CMA contexts for the MFC driver:

diff --git a/arch/arm/plat-s5p/dev-mfc.c b/arch/arm/plat-s5p/dev-mfc.c
@@ -22,52 +23,14 @@
 #include <plat/irqs.h>
 #include <plat/mfc.h>

[…]
 void __init s5p_mfc_reserve_mem(phys_addr_t rbase, unsigned int rsize,
 				phys_addr_t lbase, unsigned int lsize)
 {
	[…]
+	if (dma_declare_contiguous(&s5p_device_mfc_r.dev, rsize, rbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       rsize, (unsigned long) rbase);
	[…]
+	if (dma_declare_contiguous(&s5p_device_mfc_l.dev, lsize, lbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       rsize, (unsigned long) rbase);
 }

There is a limit to how many ‘private’ areas can be declared, namely CONFIG_CMA_AREAS. Its default value is seven but it can be safely increased if need arises. Called more times, dma_declare_contiguous function will print an error message and return -ENOSPC.

Assigning CMA area to multiple devices

Things get a bit more complicated if the same (not default) CMA context needs to be used by two or more devices. The current API does not provide a trivial way to do that. What can be done is use dev_get_cma_area to figure out CMA area one device is using, and dev_set_cma_area to set the same context to another device. This sequence must be called no sooner than in postcore_initcall. Here is how it could look like:

static int __init foo_set_up_cma_areas(void)
{
	struct cma *cma;

	cma = dev_get_cma_area(device1);
	dev_set_cma_area(device2, cma);
	return 0;
}
postcore_initcall(foo_set_up_cma_areas);

Of course, device1’s area must be set up with dma_declare_contiguous as described in previous subsection.

Device’s CMA context may be changed any time as long as the device hold no CMA memory — it will be rather tricky to release any allocation after area change.

No default context

As a matter of fact, there is nothing special about the default context that is created by dma_contiguous_reserve function. It is in no way required and system may work without it.

If there is no default context, for devices without assigned areas dma_alloc_from_contiguous will return NULL. dev_get_cma_area can be used distinguish this situation and allocation failure.

Of course, if there is no default area, architecture should provide other means to allocate memory, for devices without assigned CMA context.

Size of the default context

dma_contiguous_reserve does not take a size as an argument, which brings a question of how does it know how much memory should be reserved. There are two sources this information comes from.

First of all, there is a set of Kconfig options, which specify the default size of the reservation. All of those options are located under ‘Device Drivers’ » ‘Generic Driver Options’ » ‘Contiguous Memory Allocator’ in the Kconfig menu. They allow specifying one of four possible ways of calculating the size: it can be an absolute size in megabytes, a percentage of total memory, the lower of the two, or the larger of the two. By default is to allocate 16 MiB of memory.

Second of all, there is a cma kernel command line option. It lets one specify the size of the area at boot time without the need to recompile the kernel. This option specifies the size in bytes and accepts the usual suffixes.