Sunday, March 16, 2014

Making sense of /proc/buddyinfo

/proc/buddyinfo gives you an idea about the free memory fragments on your Linux box. You get to view the free fragments for each available order, for the different zones of each numa node. The typical /proc/buddyinfo looks like this:

This box has a single numa node. Each numa node is an entry in the kernel linked list pgdat_list. Each node is further divided into zones. Here are some example zone types:
  • DMA Zone: Lower 16 MiB of RAM used by legacy devices that cannot address anything beyond the first 16MiB of RAM.
  • DMA32 Zone (only on x86_64): Some devices can't address beyond the first 4GiB of RAM. On x86, this zone would probably be covered by Normal zone
  • Normal Zone: Anything above zone DMA and doesn't require kernel tricks to be addressable. Typically on x86, this is 16MiB to 896MiB. Many kernel operations require that the memory being used be from this zone
  • Highmem Zone (x86 only): Anything above 896MiB.
Each zone is further divided into power of 2 (also known as the order) page sized chunks by the buddy allocator. The buddy allocator attempts to satisfy an allocation request from a zone's free pool. Over time, this free pool will fragment and higher order allocations will fail. The buddyinfo proc file is generated on demand by walking all the free lists.

Say we have just rebooted the machine and we have a free pool of 16MiB (DMA zone). The most sensible thing to do would be to have the this memory split into largest contiguous blocks available. The largest order is defined at compile time to 11 which means that the largest slice the buddy allocator has is 4MiB block (2^10 * page_size). so the 16 MiB DMA zone would initially split into 4 free blocks.

Here's how we'll service an allocation request for 72KiB:
  1. Round up the allocation request to the next power of 2 (128)
  2. Split a 4MiB chunk into two 2MiB chunks
  3. Split  one 2 MiB chunk into two MiB chunks
  4. Continue splitting until we get a 128KiB chunk that we'll allocate.
 Allocation requests will over time split, merge, split... this pool until we get to a point where we might have to fail a request due to the lack of a contiguous memory block.

Here's an example of an allocation failure from a Gentoo bug report.

In such cases, the buddyinfo proc file will allow you to view the current fragmentation state of your memory.
Here's a quick python script that will make this data more digestible.

And sample output for the buddyinfo data pasted earlier on.