Sunday, March 16, 2014

Making sense of /proc/buddyinfo

/proc/buddyinfo gives you an idea about the free memory fragments on your Linux box. You get to view the free fragments for each available order, for the different zones of each numa node. The typical /proc/buddyinfo looks like this:

Node 0, zone DMA 5 4 4 4 4 3 2 0 0 0 2
Node 0, zone DMA32 2 2 3 2 2 1 3 1 2 3 386
Node 0, zone Normal 56332 12573 2212 959 448 179 35 5 0 0 35002
view raw _proc_buddyinfo hosted with ❤ by GitHub

This box has a single numa node. Each numa node is an entry in the kernel linked list pgdat_list. Each node is further divided into zones. Here are some example zone types:
  • DMA Zone: Lower 16 MiB of RAM used by legacy devices that cannot address anything beyond the first 16MiB of RAM.
  • DMA32 Zone (only on x86_64): Some devices can't address beyond the first 4GiB of RAM. On x86, this zone would probably be covered by Normal zone
  • Normal Zone: Anything above zone DMA and doesn't require kernel tricks to be addressable. Typically on x86, this is 16MiB to 896MiB. Many kernel operations require that the memory being used be from this zone
  • Highmem Zone (x86 only): Anything above 896MiB.
Each zone is further divided into power of 2 (also known as the order) page sized chunks by the buddy allocator. The buddy allocator attempts to satisfy an allocation request from a zone's free pool. Over time, this free pool will fragment and higher order allocations will fail. The buddyinfo proc file is generated on demand by walking all the free lists.

Say we have just rebooted the machine and we have a free pool of 16MiB (DMA zone). The most sensible thing to do would be to have the this memory split into largest contiguous blocks available. The largest order is defined at compile time to 11 which means that the largest slice the buddy allocator has is 4MiB block (2^10 * page_size). so the 16 MiB DMA zone would initially split into 4 free blocks.

Here's how we'll service an allocation request for 72KiB:
  1. Round up the allocation request to the next power of 2 (128)
  2. Split a 4MiB chunk into two 2MiB chunks
  3. Split  one 2 MiB chunk into two MiB chunks
  4. Continue splitting until we get a 128KiB chunk that we'll allocate.
 Allocation requests will over time split, merge, split... this pool until we get to a point where we might have to fail a request due to the lack of a contiguous memory block.

Here's an example of an allocation failure from a Gentoo bug report.
swapper: page allocation failure. order:4, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-gentoo-r7 #1
Call Trace:
<IRQ> [<ffffffff8107f082>] 0xffffffff8107f082
[<ffffffff810a29a4>] 0xffffffff810a29a4
[<ffffffff810a2b9f>] 0xffffffff810a2b9f
[<ffffffff810a2d35>] 0xffffffff810a2d35
[<ffffffff810a2de3>] 0xffffffff810a2de3
[<ffffffff810a2e4e>] 0xffffffff810a2e4e
[<ffffffff813ed961>] 0xffffffff813ed961
[<ffffffff813ee21b>] 0xffffffff813ee21b
[<ffffffffa07d8499>] 0xffffffffa07d8499
[<ffffffff813f4a3d>] 0xffffffff813f4a3d
view raw gistfile1.txt hosted with ❤ by GitHub

In such cases, the buddyinfo proc file will allow you to view the current fragmentation state of your memory.
Here's a quick python script that will make this data more digestible.

#!/usr/bin/env python
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 textwidth=79 autoindent
"""
Python source code
Last modified: 15 Feb 2014 - 13:38
Last author: lmwangi at gmail com
Displays the available memory fragments
by querying /proc/buddyinfo
Example:
# python buddyinfo.py
"""
import optparse
import os
import re
from collections import defaultdict
import logging
class Logger:
def __init__(self, log_level):
self.log_level = log_level
def get_formatter(self):
return logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
def get_handler(self):
return logging.StreamHandler()
def get_logger(self):
"""Returns a Logger instance for the specified module_name"""
logger = logging.getLogger('main')
logger.setLevel(self.log_level)
log_handler = self.get_handler()
log_handler.setFormatter(self.get_formatter())
logger.addHandler(log_handler)
return logger
class BuddyInfo(object):
"""BuddyInfo DAO"""
def __init__(self, logger):
super(BuddyInfo, self).__init__()
self.log = logger
self.buddyinfo = self.load_buddyinfo()
def parse_line(self, line):
line = line.strip()
self.log.debug("Parsing line: %s" % line)
parsed_line = re.match("Node\s+(?P<numa_node>\d+).*zone\s+(?P<zone>\w+)\s+(?P<nr_free>.*)", line).groupdict()
self.log.debug("Parsed line: %s" % parsed_line)
return parsed_line
def read_buddyinfo(self):
buddyhash = defaultdict(list)
buddyinfo = open("/proc/buddyinfo").readlines()
for line in map(self.parse_line, buddyinfo):
numa_node = int(line["numa_node"])
zone = line["zone"]
free_fragments = map(int, line["nr_free"].split())
max_order = len(free_fragments)
fragment_sizes = self.get_order_sizes(max_order)
usage_in_bytes = [block[0] * block[1] for block in zip(free_fragments, fragment_sizes)]
buddyhash[numa_node].append({
"zone": zone,
"nr_free": free_fragments,
"sz_fragment": fragment_sizes,
"usage": usage_in_bytes })
return buddyhash
def load_buddyinfo(self):
buddyhash = self.read_buddyinfo()
self.log.info(buddyhash)
return buddyhash
def page_size(self):
return os.sysconf("SC_PAGE_SIZE")
def get_order_sizes(self, max_order):
return [self.page_size() * 2**order for order in range(0, max_order)]
def __str__(self):
ret_string = ""
width = 20
for node in self.buddyinfo:
ret_string += "Node: %s\n" % node
for zoneinfo in self.buddyinfo.get(node):
ret_string += " Zone: %s\n" % zoneinfo.get("zone")
ret_string += " Free KiB in zone: %.2f\n" % (sum(zoneinfo.get("usage")) / (1024.0))
ret_string += '\t{0:{align}{width}} {1:{align}{width}} {2:{align}{width}}\n'.format(
"Fragment size", "Free fragments", "Total available KiB",
width=width,
align="<")
for idx in range(len(zoneinfo.get("sz_fragment"))):
ret_string += '\t{order:{align}{width}} {nr:{align}{width}} {usage:{align}{width}}\n'.format(
width=width,
align="<",
order = zoneinfo.get("sz_fragment")[idx],
nr = zoneinfo.get("nr_free")[idx],
usage = zoneinfo.get("usage")[idx] / 1024.0)
return ret_string
def main():
"""Main function. Called when this file is a shell script"""
usage = "usage: %prog [options]"
parser = optparse.OptionParser(usage)
parser.add_option("-s", "--size", dest="size", choices=["B","K","M"],
action="store", type="choice", help="Return results in bytes, kib, mib")
(options, args) = parser.parse_args()
logger = Logger(logging.DEBUG).get_logger()
logger.info("Starting....")
logger.info("Parsed options: %s" % options)
print logger
buddy = BuddyInfo(logger)
print buddy
if __name__ == '__main__':
main()
view raw buddyinfo.py hosted with ❤ by GitHub
And sample output for the buddyinfo data pasted earlier on.

2014-03-15 23:34:32,352 - main - INFO - Starting....
2014-03-15 23:34:32,352 - main - INFO - Parsed options: {'size': None}
<logging.Logger object at 0x1102cffd0>
2014-03-15 23:34:32,352 - main - DEBUG - Parsing line: Node 0, zone DMA 5 4 4 4 4 3 2 0 0 0 2
2014-03-15 23:34:32,353 - main - DEBUG - Parsed line: {'numa_node': '0', 'zone': 'DMA', 'nr_free': '5 4 4 4 4 3 2 0 0 0 2'}
2014-03-15 23:34:32,353 - main - DEBUG - Parsing line: Node 0, zone DMA32 2 2 3 2 2 1 3 1 2 3 386
2014-03-15 23:34:32,353 - main - DEBUG - Parsed line: {'numa_node': '0', 'zone': 'DMA32', 'nr_free': '2 2 3 2 2 1 3 1 2 3 386'}
2014-03-15 23:34:32,353 - main - DEBUG - Parsing line: Node 0, zone Normal 56332 12573 2212 959 448 179 35 5 0 0 35002
2014-03-15 23:34:32,353 - main - DEBUG - Parsed line: {'numa_node': '0', 'zone': 'Normal', 'nr_free': '56332 12573 2212 959 448 179 35 5 0 0 35002'}
2014-03-15 23:34:32,375 - main - INFO - defaultdict(<type 'list'>, {0: [{'usage': [20480, 32768, 65536, 131072, 262144, 393216, 524288, 0, 0, 0, 8388608], 'nr_free': [5, 4, 4, 4, 4, 3, 2, 0, 0, 0, 2], 'zone': 'DMA', 'sz_fragment': [4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304]}, {'usage': [8192, 16384, 49152, 65536, 131072, 131072, 786432, 524288, 2097152, 6291456, 1619001344], 'nr_free': [2, 2, 3, 2, 2, 1, 3, 1, 2, 3, 386], 'zone': 'DMA32', 'sz_fragment': [4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304]}, {'usage': [230735872, 102998016, 36241408, 31424512, 29360128, 23461888, 9175040, 2621440, 0, 0, 146809028608], 'nr_free': [56332, 12573, 2212, 959, 448, 179, 35, 5, 0, 0, 35002], 'zone': 'Normal', 'sz_fragment': [4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304]}]})
Node: 0
Zone: DMA
Free KiB in zone: 9588.00
Fragment size Free fragments Total available KiB
4096 5 20.0
8192 4 32.0
16384 4 64.0
32768 4 128.0
65536 4 256.0
131072 3 384.0
262144 2 512.0
524288 0 0.0
1048576 0 0.0
2097152 0 0.0
4194304 2 8192.0
Zone: DMA32
Free KiB in zone: 1590920.00
Fragment size Free fragments Total available KiB
4096 2 8.0
8192 2 16.0
16384 3 48.0
32768 2 64.0
65536 2 128.0
131072 1 128.0
262144 3 768.0
524288 1 512.0
1048576 2 2048.0
2097152 3 6144.0
4194304 386 1581056.0
Zone: Normal
Free KiB in zone: 143823288.00
Fragment size Free fragments Total available KiB
4096 56332 225328.0
8192 12573 100584.0
16384 2212 35392.0
32768 959 30688.0
65536 448 28672.0
131072 179 22912.0
262144 35 8960.0
524288 5 2560.0
1048576 0 0.0
2097152 0 0.0
4194304 35002 143368192.0

9 comments:

  1. Hi,

    How to tackle this issue?

    ReplyDelete
  2. Great article, thanks for sharing some code! I rewrite the code to be able to execute it under 2.4 Python interpreter (that's what you got when with old "enterprise"-like Linux). The code is available here:
    https://www.dropbox.com/s/n7n8ru4d4p551zv/buddyinfo.py?dl=0
    Sorry, no fancy Github link for now.
    I believe that one way to tackle the issue is monitor over time the buddyinfo usage in a server and try to fetch a baseline after the applications are running for some time. Any degradation over the baseline might be an indication that somethings goes wrong (and a schedule server reboot is in order).
    I saw that Collectl (http://collectl.sourceforge.net/index.html) might help with that, but I think it is a bit too complicated to figure out how the results are shown.

    ReplyDelete
  3. If we see that the our memory is fragmented, does /proc/buddyinfo actually help? For example, I have a machine with 700 Gig of memory and the "free" command line utility shows that 18 Gig are free. This agrees with the output from your buddyinfo.py. However, I have 573 Gig cached memory. Is not most of that memory available for use, should an application request more memory? So /proc/buddyinfo is cool but it doesn't tell us if our memory is fragmented to the degree that apps are impacted, or not.

    ReplyDelete
  4. Thanks for this great article!

    ReplyDelete
  5. Added support for the parser arguments - diffs against the original posted above as follows:

    $ diff buddyinfo.py.org buddyinfo.py
    7c7,8
    < Last author: lmwangi at gmail com
    ---
    > Last author: lmwangi at gmail com
    > Parser args support added by: mloeser at cisco com
    42c43
    < def __init__(self, logger):
    ---
    > def __init__(self, logger, options):
    44a46,50
    > if options.in_file == None:
    > self.in_file = "/proc/buddyinfo"
    > else:
    > self.in_file = options.in_file
    > self.size = options.size
    56c62
    < buddyinfo = open("/proc/buddyinfo").readlines()
    ---
    > buddyinfo = open(self.in_file).readlines()
    89c95,105
    < ret_string += " Free KiB in zone: %.2f\n" % (sum(zoneinfo.get("usage")) / (1024.0))
    ---
    > if self.size == "B":
    > szstr = "bytes" # Bytes
    > div = 1.0
    > elif self.size == "M":
    > szstr = "mib" # Megabytes
    > div = 1024.0 * 1024.0
    > else:
    > szstr = "kib"
    > div = 1024.0 # Kilobytes
    > ret_string += " Free %s in zone: %.2f\n" % (szstr,
    > sum(zoneinfo.get("usage")) / (div))
    91,93c107,108
    < "Fragment size", "Free fragments", "Total available KiB",
    < width=width,
    < align="<")
    ---
    > "Fragment size", "Free fragments", ("Total available %s" % szstr),
    > width=width, align="<")
    100c115
    < usage = zoneinfo.get("usage")[idx] / 1024.0)
    ---
    > usage = zoneinfo.get("usage")[idx] / div)
    109a125,126
    > parser.add_option("-i", "--input", dest="in_file", action="store", type="string",
    > help="Input file")
    116c133
    < buddy = BuddyInfo(logger)
    ---
    > buddy = BuddyInfo(logger, options)
    120c137
    < main()
    ---
    > main()

    ReplyDelete
  6. Today, while I was at work, my sister stole my iPad and tested to see if it can survive a forty foot drop, just so she can be a youtube sensation. My apple ipad is now broken and she has 83 views. I know this is entirely off topic but I had to share it with someone!

    ReplyDelete
  7. Good info. Lucky me I came across your site by chance (stumbleupon). I've saved it for later!

    ReplyDelete
  8. patch for python3

    --- buddyinfo_python2.py
    +++ buddyinfo.py
    @@ -60,7 +60,7 @@ class BuddyInfo(object):
    for line in map(self.parse_line, buddyinfo):
    numa_node = int(line["numa_node"])
    zone = line["zone"]
    - free_fragments = map(int, line["nr_free"].split())
    + free_fragments = list(map(int, line["nr_free"].split()))
    max_order = len(free_fragments)
    fragment_sizes = self.get_order_sizes(max_order)
    usage_in_bytes = [block[0] * block[1] for block in zip(free_fragments, fragment_sizes)]
    @@ -115,9 +115,9 @@ def main():
    logger = Logger(logging.DEBUG).get_logger()
    logger.info("Starting....")
    logger.info("Parsed options: %s" % options)
    - print logger
    + print(logger)
    buddy = BuddyInfo(logger)
    - print buddy
    + print(buddy)

    if __name__ == '__main__':
    main()

    ReplyDelete