Perl, Linux Namespaces, and Pedestrian Problems

At ZipRecruiter we have a problem that I suspect is fairly common. We use cronjobs for various tasks and sometimes a cronjob will fail to clean up after itself and end up filling up a partition. It’s annoying. I solved this by using some simple but poorly supported Linux features.

🔗 The Goal

I want to be able to run a program with /tmp as a bindmount to some other directory, but only for that program. Last year I wrote about unshare, and while that tool helped me know what to do, it’s too coarse to reasonably use. If I were to use it I’d have to break a single tool (create tmpdir, set up namespace, set up bindmount, exec) into two programs. While that would work, especially for this use case, it would be useful to be able to set up the bindmount from within a master daemon that has already preloaded all the libraries for the child worker, for very efficient memory usage. It seemed like I could do better, so I set out to do better!

🔗 False starts

The fundamental system call for creating namespaces is unshare(2). With that in mind I searched CPAN and found Linux::Unshare, which gives a completely reasonable interface to the unshare system call. Unfortunately it fails its tests, and even if you skip them it fails in action the same way. In case anyone is Googling, here’s the error:

Your vendor has not defined Linux::Unshare macro CLONE_NEWNS

I didn’t look a whole lot closer because I knew that I could call any system call with a little bit of effort, no XS or C needed.

🔗 Arbitrary System Calls

There’s a legendary post on the packagecloud blog about system calls. I read it a while ago when it made the rounds but the main takeaway was this:

A system call is just a number; it is not magic, and system calls that don’t already have predefined wrapper functions can be used with fairly minimal effort.

The generic system call interface in Perl is simple. First off, you use the syscall subroutine. Next you need the magic numbers that define syscalls and their various flags. In C you’d load headers; in Perl there is a fairly elegant tool called h2ph which can parse the headers and create Perl files that have all the constants baked in. Here’s how you use it:

cd /usr/include
h2ph -a syscall.h

Chances are you’ll need more than just that header, but that’s the general interface. I initially assumed that h2ph -a /usr/include/syscall.h would work, but the argument needs to be relative to the current directory, so you need to change into the /usr/include directory.
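If a later `require 'syscall.ph'` fails, it can help to see where Perl is actually looking. What follows is a minimal sketch, not part of the original setup: `find_ph` is a hypothetical helper that searches `@INC` for a generated header, and it assumes h2ph wrote to its documented default destination, `$Config{installsitearch}`.

```perl
use strict;
use warnings;
use Config;
use File::Spec;

# Hypothetical helper: return the first place in @INC where a generated
# .ph file (for example 'syscall.ph') exists, or undef if it is missing.
sub find_ph {
    my ($ph) = @_;
    for my $dir (grep { !ref $_ } @INC) {    # skip @INC hooks (references)
        my $candidate = File::Spec->catfile($dir, split m{/}, $ph);
        return $candidate if -f $candidate;
    }
    return;
}

# The three headers used later in this post.
for my $ph (qw(syscall.ph linux/sched.ph sys/mount.ph)) {
    my $found = find_ph($ph);
    printf "%-16s %s\n", $ph,
        $found // "missing (h2ph default target: $Config{installsitearch})";
}
```

Anything reported as missing is a candidate for another `h2ph` run from inside /usr/include.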

Once the above block of code has been run, in your script you pull in the constants and then use them (and some others) like this:

# normally you'd use POSIX::access for this, but it's an easy example
use feature 'say';

require 'syscall.ph';

# these headers are mentioned in access(2), that's the only reason I knew to use
# them
require 'fcntl.ph';
require 'unistd.ph';

my $path = shift;

# syscall returns -1 (and sets $!) when the underlying call fails
say "$path is writeable"
      if syscall(SYS_access(), $path, W_OK()) != -1;

If you were in some weird situation where you couldn’t run h2ph, as long as you are on the same architecture I’m pretty sure it would be ok to hardcode the constants yourself, manually getting them from the C header files, or the .ph files that you have now but won’t have later:

# constants for amd64
use feature 'say';

sub SYS_access () { 21 }
sub W_OK () { 2 }

my $path = shift;
say "$path is writeable"
      if syscall(SYS_access(), $path, W_OK()) != -1;
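The two approaches can also be combined. Below is a rough sketch, assuming Linux on amd64 for the fallback: `access_constants` is a hypothetical helper that prefers the h2ph-generated constants and only falls back to the hardcoded values shown above when the .ph files cannot be loaded.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: return (SYS_access, W_OK), preferring the values
# from the h2ph-generated files and falling back to hardcoded amd64 ones.
sub access_constants {
    my $loaded = eval {
        require 'syscall.ph';
        require 'fcntl.ph';
        require 'unistd.ph';
        1;
    };
    if ($loaded && defined &SYS_access && defined &W_OK) {
        return (SYS_access(), W_OK());
    }

    # The hardcoded numbers are only valid for Linux on amd64.
    my $machine = (POSIX::uname())[4];
    die "no hardcoded constants for $^O on $machine\n"
        unless $^O eq 'linux' && $machine eq 'x86_64';
    return (21, 2);
}

my ($sys_access, $w_ok) = access_constants();
my $path = shift(@ARGV) // '.';
print "$path is writeable\n"
    if syscall($sys_access, $path, $w_ok) != -1;
```

Refusing to guess on other platforms is deliberate: a wrong syscall number silently invokes some other system call.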

Another way to get the values of constants like the one above is to use strace with the -e raw flag. If you already have a working program that makes the call you’re trying to replicate, you’d run something like this:

strace -e raw=access SOME-PROGRAM

The call to access would then be shown with literal numbers (though string arguments show up only as raw pointers, which is just not enough information).
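Going the other way, from a raw number back to flag names, is a manual job. As a small illustration, here is a sketch with a hypothetical `decode_flags` helper. The flag values are the ones from the Linux mount headers, and the input 0x44000 is just an example value, not output copied from a real strace run.

```perl
use strict;
use warnings;

# A few mount(2) flag values as defined in the Linux headers.
my %MS_FLAG = (
    MS_RDONLY  => 1,
    MS_BIND    => 4096,
    MS_REC     => 16384,
    MS_PRIVATE => 1 << 18,
);

# Hypothetical helper: turn a raw numeric flag word into "A|B" form.
# Bits that are not in the table are reported in hex so nothing is hidden.
sub decode_flags {
    my ($value, $table) = @_;
    my @names;
    for my $name (sort { $table->{$a} <=> $table->{$b} } keys %$table) {
        next unless $value & $table->{$name};
        push @names, $name;
        $value &= ~$table->{$name};
    }
    push @names, sprintf('0x%x', $value) if $value;
    return join '|', @names;
}

print decode_flags(0x44000, \%MS_FLAG), "\n";    # prints MS_REC|MS_PRIVATE
```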

🔗 Creating the Mount Namespace

With the tools and techniques explained above we now have everything we need to build the temporary bind mount described at the beginning of this post. The first thing I did was strace unshare(1) so that I could figure out what system calls I’d need to use:

$ sudo strace /usr/bin/unshare --mount ls -lh

[ ... ]
unshare(CLONE_NEWNS)                    = 0
mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL) = 0
[ ... ]

The first system call, unshare, is the creation of the namespace. The second one is probably worth an entire blog post on its own, but for now I’ll just say that you have to do that for the new namespace to actually make any sense in this situation. And that’s it! Here’s the final program:

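use strict;
use warnings;
use File::Temp ();
use File::Path ();

# We start as root and then drop to $user
my $user = shift;
my @command = @ARGV;

# SYS_* system call ids
require 'syscall.ph';

# CLONE_* flags
require 'linux/sched.ph';

# MS_* flags
require 'sys/mount.ph';

# syscall returns -1 and sets $! on failure
syscall(SYS_unshare(), CLONE_NEWNS()) != -1
      or die "unshare: $!";

# syscall can't be passed string literals, hence the variables
my $none = "none";
my $root = "/";
syscall(SYS_mount(), $none, $root, 0, MS_REC() | MS_PRIVATE(), 0) != -1
      or die "mount $root private: $!";

# Keep the File::Temp object alive, but hand syscall a plain string.
# Note: if this directory lives under /tmp, its path is shadowed inside this
# namespace once /tmp is bind mounted below, so the rmtree at the end can only
# find it when TMPDIR points somewhere else.
my $tmpobj = File::Temp->newdir();
my $tmpdir = "$tmpobj";

# bind mount the temp dir, cargo culted from mount --bind
my $tmp = '/tmp';
syscall(SYS_mount(), $tmpdir, $tmp, 0, MS_MGC_VAL() | MS_BIND(), 0) != -1
      or die "bind mount $tmpdir on $tmp: $!";

my (undef, undef, $uid, $gid) = getpwnam $user
      or die "unknown user $user";
chown $uid, $gid, $tmpdir, '/tmp';

# From experience I know that sudo does a lot of things I do not want to
# reimplement.  If we were doing the master/worker pattern described above we'd
# need to go through the effort to drop privileges after a fork instead of during
# exec
system qw(sudo -u), $user, '--',  @command;

File::Path::rmtree( $tmpdir );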
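The master/worker variant mentioned in the comment above is not implemented in this post. A rough sketch of its shape might look like the following, under the assumption that the namespace setup runs in a forked child so the master's own mounts stay untouched. `run_in_child` and its `$setup` callback are hypothetical names. The unshare and mount calls would go inside the callback, and dropping privileges is deliberately left out.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: fork, run $setup only in the child (this is where the
# unshare and mount calls would go), then exec the command. Returns the
# child's exit code; death by signal is not handled in this sketch.
sub run_in_child {
    my ($setup, @command) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: namespace changes made here do not affect the parent.
        unless (eval { $setup->(); 1 }) {
            warn "setup failed: $@";
            POSIX::_exit(1);    # skip END blocks and destructors
        }
        exec { $command[0] } @command
            or warn "exec failed: $!";
        POSIX::_exit(127);
    }

    waitpid $pid, 0;
    return $? >> 8;
}

# Usage with a no-op setup and a trivial command:
my $status = run_in_child(sub { }, $^X, '-e', 'exit 3');
print "child exited with $status\n";    # prints "child exited with 3"
```

Using `POSIX::_exit` in the failure paths keeps the child from running cleanup code, such as temp directory removal, that belongs to the parent.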

That wasn’t so hard, and it didn’t require Docker koolaid, was efficient, and was a single simple script. I am pleased to be able to say that I successfully achieved the goal mentioned at the end of my pid namespaces post. I hope that this made it clear that these seemingly esoteric Linux features are well within reach and generally useful.

Posted Mon, Sep 12, 2016
