Python’s os.fork, subprocess, and the case of the missing output

One of my current projects is helping to improve contributor documentation and tools for Zulip, an open source group chat platform. We recommend that first-time contributors use the Vagrant method of setting up the Zulip development environment.

One current downside of this method, in the context of Zulip, is that many of the helper tools are written to be used with a Zulip development environment setup directly on a host system running Ubuntu. An example of this is the pre-commit hook, which looked like this when I first started working on Zulip:

./tools/lint-all $(git diff --cached --name-only --diff-filter=ACM) || true

This code is intended to be run in the zulip directory, with a Python venv activated and all dependencies installed and configured. So, if I run it in my Mac terminal, I see the following:

ImportError: No module named typing
You need to run the Zulip linters inside a Zulip dev environment.
If you are using Vagrant, you can `vagrant ssh` to enter the Vagrant guest.

That’s a helpful enough message and if I ssh into my zulip dev environment, I can run the linters:

christie@Olamina [~/Work/zulip] $ vagrant ssh vagrant ssh
...
(zulip-venv)vagrant@vagrant-ubuntu-trusty-64:/srv/zulip$ ./tools/lint-all

Which is good, but doesn’t align with how I actually work. Recall that this command is part of a git pre-commit hook. It’s intended to be run by git when I run git commit, which I do on my Mac terminal, not within the Zulip dev environment. I like to use git on my Mac terminal because that’s where I have git configured and where I have all my fancy oh-my-zshell terminal integrations. I could set this up inside the Zulip dev environment, but because that’s designed to be encapsulated and generalized for anyone contributing to Zulip, I’d have to do it every time I re-created or re-provisioned the environment. Not ideal.

Knowing that you can pass commands through vagrant ssh just like you can through regular ssh, I figured it would be easy to update the pre-commit script to work for both types of dev setup. I tried this:

vagrant ssh -c '/srv/zulip/tools/lint-all $(cd /srv/zulip && git diff --cached --name-only --diff-filter=ACM) || true'

The only change I made was to add a cd to the zulip directory before running git diff. This is necessary because ssh’ing doesn’t start you in the /srv/zulip directory.

It seemed to work. I saw some output from the linter:

Fix trailing whitespace at README.prod.md line 4:
This documents the process for installing Zulip in a production environment.

Until this point I’d been testing the hook by invoking the script directly by running .git/hooks/pre-commit in Mac my terminal. Now it was time to try with an actual git commit. The command took a long while to complete, so it seemed as if the linter was running, but I didn’t see any output. WTF?

I vagrant ssh’d into the dev environment and tried a git commit there (after setting the bare minimum config values that git requires). The linter ran and generated output. I was back at square one.

Let’s recap the situation:

I saw linter output when I:

Ran the .git/hooks/pre-commit script directly via Mac terminal
Ran the vagrant ssh -c command directly via Mac terminal
Ran .git/hooks/pre-commit, ./tools/lint-all, or git commit inside vagrant virtual machine

I do not see linter output when I:

Ran git commit via Mac terminal outside of the virtual machine.

Because the hook intentionally returns true, and therefore allows git to continue with your commit, no matter the output of the linter, having output is essential. It’s the only way you know you have code style issues to address.

First I investigated the bash/ssh aspect of the situation. I spend a couple of hours reviewing Bash documentation, including a side tangent about psuedoterminals that was both frustrating and fascinating. This led no where. In fact, all I ended up figuring out was that the Bash conditions under which the linters were evoked were essentially identical. (I did this by putting a shoptat the beginning of .git/hooks/pre-commit and comparing the output from running it on my Mac terminal and inside the Zulip dev environment side-by-side.)

At this point it was well into dinner time and I called it a night.

By the time I returned to my keyboard the next morning, fellow Zulip contributor Steve Howell had been working on the issue and eliminated vagrant as the issue by reproducing the missing output using plain old ssh. This and my work from the previous evening shifted my focus to the ./tools/lint-all python code itself. I dove in, determined the understand each and every line.

This took a while. lint-all is almost 500 lines. It looks like one of those things that started simple, but has grown more complex over time as its parent project grew more complex. It has very few inline comments.

Here’s what I figured out:

lint-all runs several linter scripts in parallel using Python’s os.fork(), subprocess.call(), and subprocess.Popen().
The linter scripts are a combination of other stand alone commands (e.g. jslint, puppet parser, etc.) and inline Python code. Each linter wrapper is its own function and is cataloged by the lint_functions dictionary. The run_parallel() function loops through lint_functions, creating a new child process for each with os.fork() and then executing it. It also checks the status of each of the child processes and returns failed == True if any of them exit with a failed status.
Most of the linter wrappers use subprocess.call() to invoke their respective command line scripts. The exception is the function that runs the pyflakes linter. Pyflakes writes some messages to stdout and others to stderr. So, in order to capture both kinds of messages and then loop through them to determine what to output, subprocess.Popen is used with arguments stdout = subprocess.PIPE and stderr = subprocess.PIPE

(If all of that is really confusing, feel free to take a look at this highly simplified version I wrote to accompany this post.)

While I was learning how lint-all worked, I asked other Zulip contributors as I went along. At one point, Steve Howell asked what the redirected output looked like and I ran this within the virtual environment:

 ./tools/lint-all > out.log 2>&1

And it created a file without any output. Wait, what? > out.log 2>&1 should have redirected both stdout and stderr to the file out.log.

So, when I ran the linter, I saw output on screen. But when trying to redirect that output to a file, I got nothing. Within the virtual environment.

At this point I knew the problem had something to do with these child processes and pipes, but I was getting tired. So I did what any good programmer does when they’re stuck, which is start asking questions on Twitter:

What would cause a python script to create output on the screen, but not when redirected to a file?

— Christie Koehler (@christi3k) August 16, 2016

To which Adam Lowry replies with this hint:

@christi3k That is weird! I wonder if the child is killed bytes in a buffer wouldn't be flushed? Seems unlikely, but a .flush() wouldnt hurt

— Adam Lowry (@robotadam) August 17, 2016

And so I added sys.stderr.flush() and sys.stdout.flush() prior to the call to os._exit() and that fixed it! I could now see complete output when I ran the linter under all conditions.

For whatever reason, output that was in the stdout and stderr buffers was getting lost when child processes were terminated. The python docs for os._exit() does give a hit about this:

os._exit(n)
Exit the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.

Lesson learned? If in doubt, .flush() your output. Also, Zulip’s ./tools/lint-all could use some re-factoring. I haven’t tackled it yet, but I’d try getting rid of that os.fork() and using subprocess methods functions entirely, or another concurrency method.

Again, if you’d like to try or see a simplified example of the problem for yourself, take a look here.

And here’s the final git pre-commit hook:

#!/bin/bash

# This hook runs the Zulip code linter ./tools/lint-all and returns true
# regardless of linter results so that your commit may continue.

# Messages from the linter will be printed out to the screen.
#
# If you are running this one machine hosting a Vagrant guest that
# contains your provisioned Zulip development environment, the linter
# will automatically be run through `vagrant ssh`.

if [ -z "$VIRTUAL_ENV" ] && `which vagrant > /dev/null` && [ -e .vagrant ]; then
    vcmd='/srv/zulip/tools/lint-all $(cd /srv/zulip && git diff --cached --name-only --diff-filter=ACM) || true'
    echo "Running lint-all using vagrant..."
    vagrant ssh -c "$vcmd"
else
    ./tools/lint-all $(git diff --cached --name-only --diff-filter=ACM) || true
fi
exit 0

Update 27 August, 2016 17:10 PDT: Brendan asked me about the use of os._exit() so I took another look at the docs. That led me to wonder about using sys.exit() instead, which led me to this post. Using sys.exit (code) appears to obviate the need for the .flush() statements.