Sunday, March 16, 2014

Making sense of /proc/buddyinfo

/proc/buddyinfo gives you an idea about the free memory fragments on your Linux box. You get to view the free fragments for each available order, for the different zones of each numa node. The typical /proc/buddyinfo looks like this:

This box has a single numa node. Each numa node is an entry in the kernel linked list pgdat_list. Each node is further divided into zones. Here are some example zone types:
  • DMA Zone: Lower 16 MiB of RAM used by legacy devices that cannot address anything beyond the first 16MiB of RAM.
  • DMA32 Zone (only on x86_64): Some devices can't address beyond the first 4GiB of RAM. On x86, this zone would probably be covered by Normal zone
  • Normal Zone: Anything above zone DMA and doesn't require kernel tricks to be addressable. Typically on x86, this is 16MiB to 896MiB. Many kernel operations require that the memory being used be from this zone
  • Highmem Zone (x86 only): Anything above 896MiB.
Each zone is further divided into power of 2 (also known as the order) page sized chunks by the buddy allocator. The buddy allocator attempts to satisfy an allocation request from a zone's free pool. Over time, this free pool will fragment and higher order allocations will fail. The buddyinfo proc file is generated on demand by walking all the free lists.

Say we have just rebooted the machine and we have a free pool of 16MiB (DMA zone). The most sensible thing to do would be to have the this memory split into largest contiguous blocks available. The largest order is defined at compile time to 11 which means that the largest slice the buddy allocator has is 4MiB block (2^10 * page_size). so the 16 MiB DMA zone would initially split into 4 free blocks.

Here's how we'll service an allocation request for 72KiB:
  1. Round up the allocation request to the next power of 2 (128)
  2. Split a 4MiB chunk into two 2MiB chunks
  3. Split  one 2 MiB chunk into two MiB chunks
  4. Continue splitting until we get a 128KiB chunk that we'll allocate.
 Allocation requests will over time split, merge, split... this pool until we get to a point where we might have to fail a request due to the lack of a contiguous memory block.

Here's an example of an allocation failure from a Gentoo bug report.

In such cases, the buddyinfo proc file will allow you to view the current fragmentation state of your memory.
Here's a quick python script that will make this data more digestible.

And sample output for the buddyinfo data pasted earlier on.

Wednesday, January 29, 2014

Fifos and persistent readers

I recently worked on a daemon (call it slurper) that persistently read data from syslog via a FIFO (also known as a named pipe). On startup, slurper would work fine for a couple of hours then stop processing input from the FIFO.  The relevant code in slurper is:

Digging into this mystery revealed that syslogd server was getting EAGAIN errors on the fifo descriptor.  According to man 7 pipe:

      O_NONBLOCK enabled, n <= PIPE_BUF
              If there is room to write n bytes to the pipe, then write(2) succeeds immediately, writing all n bytes; otherwise write(2) fails, with errno set to EAGAIN.

The syslogd daemon was opening the pipe in O_NONBLOCK mode and getting EAGAIN errors which implied that the pipe was full. (man 7 pipe states that the pipe buffer is 64K).
Additionally, a `cat` on the FIFO drains the pipe and allows syslogd to write more content.

All these clues imply that the FIFO has no reader. But how can that be? A check on lsof shows that slurper has an open fd for the named pipe. Digging deeper, an attempt to `cat` slurpers' open fd didn't return any data

cat /proc/$(pgrep slurper)/fd/ # Be careful with this. It will steal data from your pipe/file/socket on a production system

So I decided to whip up a reader that emulates slurper's behaviour

Strace this script to see which syscalls are being invoked

This reveals that a writer closing it's fd will cause readers to read an EOF (and probably exit in the case of the block under the context manager).
So we have two options:
1) Ugly and kludgy:  Wrap the context manager read block within an infinite loop the reopens the file:
2) Super cool trick. Open another dummy writer to the FIFO. The kernel sends an EOF when the last writer closes it's fd. Since our dummy writer never closes the fd, readers will never get an EOF if the real writer closes it's fd.

The actual root cause: The syslog daemon was being restarted and this would cause it to close and reopen it's fds.

Wednesday, January 22, 2014

Macbook pro setup for office use

Funky prompt thanks to powerline and powerline-fonts. Powerline can integrate with vim/ipython/bash/zsh…

I seem to prefer zsh over bash these days (git integration, rvm integration…):
In zshrc: ZSH_THEME=“agnoster”
Plugin support
Theme screenshots.

Vim has a very cool set of plugins thanks to spf13:

If you have a mac, iterm2 rocks:

And finally, I like the solarized theme for my terminal:

Thursday, December 12, 2013

Calculating the start time of a process

A quick script that calculate the start time of a process in Linux:

Monday, November 4, 2013

EXT4 err: couldn't mount because of unsupported optional features

Backward & Forwards compatibility is great! Until it bites you.

The extN branch of filesystem is the pretty much the standard on your average GNU/Linux installation. Migrations tend to work well if N is increasing i.e. from ext2->ext3->ext4. However, things are not so rosy if you want to mount your ext4 disk on your old PC.
Ext4 which made it into mainline in 2.6.27 introduces new features such as extents that are incompatible and unknown by older kernels. Attempting to mount an incompatible extN fs on your old kernel will fail and err out with logs similar to:

EXT3-fs: sda1: couldn't mount because of unsupported optional features (240).

EXT2-fs: sda1: couldn't mount because of unsupported optional features (240).

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1)
The kernel here attempts to probe and mount the using the most recent extN. If that fails it tries the next ext filesystem and so on. In this case, this is a 2.6.16 kernel attempting to mount an ext4 fs with new features.

# dumpe2fs /tmp/img.ext4
Filesystem volume name: /
Last mounted on: /mnt/tmp
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash

 The solution in this case is to use a newer kernel that understands ext4 or recompile your kernel to support ext4.

Monday, October 28, 2013

Linux kernel network backdoor

The ksplice blog has a very nice entry on hosting backdoors in hardware.
The quick summary of this backdoor is:
  1. Register a protocol handler for an unused IP protocol number .
  2. Call usermodhelper to execute the payload of the packet (skb->data).
  3. Remote system now executes any command that you send it as root.
Unfortunately, it looks like the code is either out of date and/or buggy. Attempting to modprobe the backdoor module generates the following kernel call trace:

Further investigations reveal that this is due to us calling a sleepy method from an atomic one... call_usermodhelper will eventually call wait_for_common which sleeps.  You do not want to sleep in an ISR routine.

The fix for this is to use a deferrable; we need to stop working in an interrupt context and schedule the non atomic work for future processing.

One possible solution is to use work queues for deferrable work. Here's an example implementation in github using work queues.

And here's an example session:

Friday, October 11, 2013

Linux: The pagecache and the loop back fs

Linux has a mechanism that allows you to create a block device that is backed by a file.  Most commonly, this is used to provide an encrpyted fs. All this is fine and dandy. However, you have to factor in that the Linux OS (And practically any other OS) will want to cache contents for block device in memory.
The reasoning here is. Accessing the contents of a file from a disk will cost you ~5ms. Rather than incur this cost on future reads, the OS caches the contents of the recently used file in a page cache. Future reads or writes to this file will hit the page cache which is orders of magnitude faster than your average disk.
This means that your writes will linger in memory until the your backing file's contents get evicted from the page cache. Using o_direct on files that are hosted within the loop backed fs won't help. You have to force pagecache evictions. The easiest way to do this is a call to fadvise and a sync to force pdflush to write your changes.

Here's an experiment:
I have 3 windows open.:
  • One running blktrace  which shows VFS activity.
  • One that has a oneshot dd.
  • One that has a bunch of shell commands that poke around the loop mounted fs.
The experiment is executed in the following order:
  1. An fadvise and a sync at the beginning to make sure the pagecache is clean and all writes are on the FS.
  2. We also print out a couple of stats from /proc/meminfo.
  3. We issue a dd call with direct I/O (oflag=direct).
  4. Print stats from /proc/meminfo and take a look at the backing file using debugfs.
  5. Show how much of the backing file is cached in the pagecache by using fincore.
  6. Evict the pagecache using the fadvise command from linux-ftools.
  7. Force a sync which wakes up pdflush to write out the dirty buffers.
  8. Run debugfs to take a peek at the backing fs again.
  9. Print out some more stats from meminfo.
Here's the script:
And the output generated:
The interesting bits to note here is that writes to a loop back filesystems in Linux are not guaranteed to be on disk until you evict the pagecache and force a sync.

If you are interested in further digging, here's debug output from:

Debugfs and dumpe2fs: