Category: Code Story

August 27, 2016

Python’s os.fork, subprocess, and the case of the missing output

One of my current projects is helping to improve contributor documentation and tools for Zulip, an open source group chat platform. We recommend that first-time contributors use the Vagrant method of setting up the Zulip development environment.

One current downside of this method, in the context of Zulip, is that many of the helper tools are written to be used with a Zulip development environment setup directly on a host system running Ubuntu. An example of this is the pre-commit hook, which looked like this when I first started working on Zulip:

./tools/lint-all $(git diff --cached --name-only --diff-filter=ACM) || true

This code is intended to be run in the zulip directory, with a Python venv activated and all dependencies installed and configured. So, if I run it in my Mac terminal, I see the following:

ImportError: No module named typing
You need to run the Zulip linters inside a Zulip dev environment.
If you are using Vagrant, you can `vagrant ssh` to enter the Vagrant guest.

That’s a helpful enough message and if I ssh into my zulip dev environment, I can run the linters:

christie@Olamina [~/Work/zulip] $ vagrant ssh vagrant ssh
...
(zulip-venv)vagrant@vagrant-ubuntu-trusty-64:/srv/zulip$ ./tools/lint-all

Which is good, but doesn’t align with how I actually work. Recall that this command is part of a git pre-commit hook. It’s intended to be run by git when I run git commit, which I do on my Mac terminal, not within the Zulip dev environment. I like to use git on my Mac terminal because that’s where I have git configured and where I have all my fancy oh-my-zshell terminal integrations. I could set this up inside the Zulip dev environment, but because that’s designed to be encapsulated and generalized for anyone contributing to Zulip, I’d have to do it every time I re-created or re-provisioned the environment. Not ideal.

Knowing that you can pass commands through vagrant ssh just like you can through regular ssh, I figured it would be easy to update the pre-commit script to work for both types of dev setup. I tried this:

vagrant ssh -c '/srv/zulip/tools/lint-all $(cd /srv/zulip && git diff --cached --name-only --diff-filter=ACM) || true'

The only change I made was to add a cd to the zulip directory before running git diff. This is necessary because ssh’ing doesn’t start you in the /srv/zulip directory.

It seemed to work. I saw some output from the linter:

Fix trailing whitespace at README.prod.md line 4:
This documents the process for installing Zulip in a production environment.

Until this point I’d been testing the hook by invoking the script directly by running .git/hooks/pre-commit in Mac my terminal. Now it was time to try with an actual git commit. The command took a long while to complete, so it seemed as if the linter was running, but I didn’t see any output. WTF?

I vagrant ssh’d into the dev environment and tried a git commit there (after setting the bare minimum config values that git requires). The linter ran and generated output. I was back at square one.

Let’s recap the situation:

I saw linter output when I:

Ran the .git/hooks/pre-commit script directly via Mac terminal
Ran the vagrant ssh -c command directly via Mac terminal
Ran .git/hooks/pre-commit, ./tools/lint-all, or git commit inside vagrant virtual machine

I do not see linter output when I:

Ran git commit via Mac terminal outside of the virtual machine.

Because the hook intentionally returns true, and therefore allows git to continue with your commit, no matter the output of the linter, having output is essential. It’s the only way you know you have code style issues to address.

First I investigated the bash/ssh aspect of the situation. I spend a couple of hours reviewing Bash documentation, including a side tangent about psuedoterminals that was both frustrating and fascinating. This led no where. In fact, all I ended up figuring out was that the Bash conditions under which the linters were evoked were essentially identical. (I did this by putting a shoptat the beginning of .git/hooks/pre-commit and comparing the output from running it on my Mac terminal and inside the Zulip dev environment side-by-side.)

At this point it was well into dinner time and I called it a night.

By the time I returned to my keyboard the next morning, fellow Zulip contributor Steve Howell had been working on the issue and eliminated vagrant as the issue by reproducing the missing output using plain old ssh. This and my work from the previous evening shifted my focus to the ./tools/lint-all python code itself. I dove in, determined the understand each and every line.

This took a while. lint-all is almost 500 lines. It looks like one of those things that started simple, but has grown more complex over time as its parent project grew more complex. It has very few inline comments.

Here’s what I figured out:

lint-all runs several linter scripts in parallel using Python’s os.fork(), subprocess.call(), and subprocess.Popen().
The linter scripts are a combination of other stand alone commands (e.g. jslint, puppet parser, etc.) and inline Python code. Each linter wrapper is its own function and is cataloged by the lint_functions dictionary. The run_parallel() function loops through lint_functions, creating a new child process for each with os.fork() and then executing it. It also checks the status of each of the child processes and returns failed == True if any of them exit with a failed status.
Most of the linter wrappers use subprocess.call() to invoke their respective command line scripts. The exception is the function that runs the pyflakes linter. Pyflakes writes some messages to stdout and others to stderr. So, in order to capture both kinds of messages and then loop through them to determine what to output, subprocess.Popen is used with arguments stdout = subprocess.PIPE and stderr = subprocess.PIPE

(If all of that is really confusing, feel free to take a look at this highly simplified version I wrote to accompany this post.)

While I was learning how lint-all worked, I asked other Zulip contributors as I went along. At one point, Steve Howell asked what the redirected output looked like and I ran this within the virtual environment:

 ./tools/lint-all > out.log 2>&1

And it created a file without any output. Wait, what? > out.log 2>&1 should have redirected both stdout and stderr to the file out.log.

So, when I ran the linter, I saw output on screen. But when trying to redirect that output to a file, I got nothing. Within the virtual environment.

At this point I knew the problem had something to do with these child processes and pipes, but I was getting tired. So I did what any good programmer does when they’re stuck, which is start asking questions on Twitter:

What would cause a python script to create output on the screen, but not when redirected to a file?

— Christie Koehler (@christi3k) August 16, 2016

To which Adam Lowry replies with this hint:

@christi3k That is weird! I wonder if the child is killed bytes in a buffer wouldn't be flushed? Seems unlikely, but a .flush() wouldnt hurt

— Adam Lowry (@robotadam) August 17, 2016

And so I added sys.stderr.flush() and sys.stdout.flush() prior to the call to os._exit() and that fixed it! I could now see complete output when I ran the linter under all conditions.

For whatever reason, output that was in the stdout and stderr buffers was getting lost when child processes were terminated. The python docs for os._exit() does give a hit about this:

os._exit(n)
Exit the process with status n, without calling cleanup handlers, flushing stdio buffers, etc.

Lesson learned? If in doubt, .flush() your output. Also, Zulip’s ./tools/lint-all could use some re-factoring. I haven’t tackled it yet, but I’d try getting rid of that os.fork() and using subprocess methods functions entirely, or another concurrency method.

Again, if you’d like to try or see a simplified example of the problem for yourself, take a look here.

And here’s the final git pre-commit hook:

#!/bin/bash

# This hook runs the Zulip code linter ./tools/lint-all and returns true
# regardless of linter results so that your commit may continue.

# Messages from the linter will be printed out to the screen.
#
# If you are running this one machine hosting a Vagrant guest that
# contains your provisioned Zulip development environment, the linter
# will automatically be run through `vagrant ssh`.

if [ -z "$VIRTUAL_ENV" ] && `which vagrant > /dev/null` && [ -e .vagrant ]; then
    vcmd='/srv/zulip/tools/lint-all $(cd /srv/zulip && git diff --cached --name-only --diff-filter=ACM) || true'
    echo "Running lint-all using vagrant..."
    vagrant ssh -c "$vcmd"
else
    ./tools/lint-all $(git diff --cached --name-only --diff-filter=ACM) || true
fi
exit 0

Update 27 August, 2016 17:10 PDT: Brendan asked me about the use of os._exit() so I took another look at the docs. That led me to wonder about using sys.exit() instead, which led me to this post. Using sys.exit (code) appears to obviate the need for the .flush() statements.

February 17, 2015

Fun with git submodules

Git submodules are amazingly useful. Because they provide a way for you to connect external, separate git repositories they can be used to organize your vim scripts, your dotfiles, or even a whole mediawiki deployment.

As incredibly useful as git submodules are, they can also be a bit confusing to use. This goal of this article is to walk you through the most common git submodule tasks: adding, removing and updating. We’ll also review briefly how to make changes to code you have checked out as a submodule.

I’ve created some practice repositories. Fork submodule-practice if you’d like to follow along. We’ll used these test repositories as submodules:

I’ve used version 2.3.0 of the git client for these examples. If you’re seeing something different, check your version with git --version.

Initializing a repository with submodules

First, let’s clone our practice repository:

[skade ;) ~/Work] 
christie$ git clone git@github.com:christi3k/submodule-practice.git
Cloning into 'submodule-practice'...
remote: Counting objects: 63, done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 63 (delta 9), reused 0 (delta 0), pack-reused 47
Receiving objects: 100% (63/63), 6.99 KiB | 0 bytes/s, done.
Resolving deltas: 100% (25/25), done.
Checking connectivity... done.

And then cd into the working directory:

christie$ cd submodule-practice/

Currently, this project has two submodules: furry-octo-nemesis and psychic-avenger.

When we run ls we see directories for these submodules:

[skade ;) ~/Work/submodule-practice (master)] 
christie$ ll
▕ drwxrwxr-x▏christie:christie│4  min │   4K│furry-octo-nemesis
▕ drwxrwxr-x▏christie:christie│4  min │   4K│psychic-avenger
▕ -rw-rw-r--▏christie:christie│4  min │  29B│README.md
▕ -rw-rw-r--▏christie:christie│4  min │ 110B│README.mediawiki

But if we run ls for either submodule directory we see they are empty. This is because the submodules have not yet been initialized or updated.

[skade ;) ~/Work/submodule-practice (master)] 
christie$ git submodule init
Submodule 'furry-octo-nemesis' (git@github.com:christi3k/furry-octo-nemesis.git) registered for path 'furry-octo-nemesis'
Submodule 'psychic-avenger' (git@github.com:christi3k/psychic-avenger.git) registered for path 'psychic-avenger'

git submodule init copies the submodule names, urls and other details from .gitmodules to .git/config, which is where git looks for config details it should apply to your working copy.

git submodule init does not update or otherwise alter information in .git/config. If you have changed .gitmodules for any submodule already initialized, you’ll need to deinit and init the submodule again for changes to be reflected in .git/config.

You can initialize specific submodules by specifying their name:

git submodule init psychich-avenger

At this point you could customized git submodule urls for use in your local checkout by editing them in .git/config before proceeding to git submodule update.

Now let’s actually checkout the submodules with submodule update:

skade ;) ~/Work/submodule-practice (master)] 
christie$ git submodule update --recursive
Cloning into 'furry-octo-nemesis'...
remote: Counting objects: 6, done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 6
Receiving objects: 100% (6/6), done.
Checking connectivity... done.
Submodule path 'furry-octo-nemesis': checked out '1c4b231fa0bcfd5ce8b8a2773c6616689032d353'
Cloning into 'psychic-avenger'...
remote: Counting objects: 25, done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 25 (delta 1), reused 0 (delta 0), pack-reused 15
Receiving objects: 100% (25/25), done.
Resolving deltas: 100% (3/3), done.
Checking connectivity... done.
Submodule path 'psychic-avenger': checked out '169c5c56154f58fd745352c4f30aa0d4a1d7a88e'

Note: The --recursive flag tells git to recurse into submodule directories and run update on any submodules those submodules include. It’s not needed for this example, but I’ve included it here anyway since it’s common for projects to have nested submodules.

Now when we run ls on either directory, we see they now contain our submodule’s files:

[skade ;) ~/Work/submodule-practice (master)] 
christie$ ls furry-octo-nemesis/
▕ -rw-rw-r--▏42 sec │  52B│README.md

[skade ;) ~/Work/submodule-practice (master)] 
christie$ ls psychic-avenger/
▕ -rw-rw-r--▏46 sec │ 133B│README.md
▕ -rw-rw-r--▏46 sec │   0B│other.txt

Note: It’s possible to run init and update in one command with git submodule update --init --recursive

Adding a git submodule

We’ll start in the working directory of submodule-practice.

To add a submodule, use:

git submodule add <git-url>

Let’s try adding sample project scaling-octo-wallhack as a submodule.

[2495][skade ;) ~/Work/submodule-practice (master)] 
christie$ git submodule add git@github.com:christi3k/scaling-octo-wallhack.git
Cloning into 'scaling-octo-wallhack'...
remote: Counting objects: 19, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 19 (delta 1), reused 0 (delta 0), pack-reused 9
Receiving objects: 100% (19/19), done.
Resolving deltas: 100% (3/3), done.
Checking connectivity... done.

Note: If you want the submodule to be cloned into a directory other than ‘scaling-octo-wallhack’ then you need to specify a directory to clone into as you would when cloning any other project. For example, this command will clone psychic-avenger to the subdirectory submodules:

christie$ git submodule add git@github.com:christi3k/scaling-octo-wallhack.git submodules/scaling-octo-wallhack

Let’s see what git status tells us:

[skade ;) ~/Work/submodule-practice (master +)] 
christie$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   .gitmodules
	new file:   scaling-octo-wallhack

And running ls we see that there are files in scaling-octo-wallhack directory:

[skade ;) ~/Work/submodule-practice (master +)] 
christie$ ll scaling-octo-wallhack/
▕ -rw-rw-r--▏christie:christie│<  min │ 180B│README.md
▕ -rw-rw-r--▏christie:christie│<  min │   0B│cutting-edge-changes.txt
▕ -rw-rw-r--▏christie:christie│<  min │   0B│file1.txt
▕ -rw-rw-r--▏christie:christie│<  min │   0B│file2.txt

Specifying a branch

When you add a git submodule, git makes some assumptions for you. It sets up a remote repository to the submodule called ‘origin’ and it checksout the ‘master’ branch for you. In many cases you may no want to use the master branch. Luckily, this is easy to change.

There are two methods to specific which branch of the submodule should be checked out by your project.

Method 1: Specify a branch in .gitmodules

Here’s what the modified section of .gitmodules looks like for scaling-octo-wallhack:

[submodule "scaling-octo-wallhack"]
  path = scaling-octo-wallhack
  url = git@github.com:christi3k/scaling-octo-wallhack.git
  branch  = REL_1

Be sure to save .gitmodules and then run git submodule update --remote:

[skade ;( ~/Work/submodule-practice (master *+)] 
christie$ git submodule update --remote
Submodule path 'psychic-avenger': checked out 'fba086dbb321109e5cd2d9d1bc3b59478dacf6ee'
Submodule path 'scaling-octo-wallhack': checked out '88d66d5ecc58d2ab82fec4fea06ffbfd2c55fd7d'

Method 2: Checkout specific branch in submodule directory

In the submodule directory, checkout the branch you want with git checkout origin/branch-name:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((b49591a...))] 
christie$ git checkout origin/REL_1
Previous HEAD position was b49591a... Cutting-edge changes.
HEAD is now at 88d66d5... Prep Release 1.

Either method will result will yield the following from git status:

[skade ;) ~/Work/submodule-practice (master *+)] 
christie$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   .gitmodules
	new file:   scaling-octo-wallhack

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   scaling-octo-wallhack (new commits)

Now let’s stage and commit the changes:

[skade ;) ~/Work/submodule-practice (master *+)] 
christie$ git add scaling-octo-wallhack

[skade ;) ~/Work/submodule-practice (master +)] 
christie$ git commit -m "Add scaling-octo-wallhack submodule, REL_1."
[master 4a97a6f] Add scaling-octo-wallhack submodule, REL_1.
 2 files changed, 4 insertions(+)
 create mode 160000 scaling-octo-wallhack

And don’t forget to push them to our remote repository so they are available for others:

[skade ;) ~/Work/submodule-practice (master)] 
christie$ git push -n origin master 
To git@github.com:christi3k/submodule-practice.git
   7e6d09e..4a97a6f  master -> master

(Note the -n flag means ‘dry run’, that is ‘do everything except actually send the updates.’ I recommend using this when available with commands that have potentially destructive results, including push and merge.)

Looks good, do it for real now:

[skade ;) ~/Work/submodule-practice (master)] 
christie$ git push origin master 
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 439 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
To git@github.com:christi3k/submodule-practice.git
   7e6d09e..4a97a6f  master -> master

Removing a git submodule

Removing a submodule is a bit trickier than adding one.

Deinitialize

First, deinit the submodule with git submodule deinit :

[skade ;) ~/Work/submodule-practice (master)]
christie$ git submodule deinit psychic-avenger
Cleared directory 'psychic-avenger'
Submodule 'psychic-avenger' (git@github.com:christi3k/psychic-avenger.git) unregistered for path 'psychic-avenger'

This command removes the submodule’s confg entries in .git/config and .gitmodules and it removes files from the submodule’s working directory. This command will delete untracked files, even when they are listed in .gitignore.

Note: You can also use this command if you simply want to prevent having a local checkout of the submodule in your working tree, without actually removing the submodule from your main project.

Let’s check our work:

[skade ;) ~/Work/submodule-practice (master)]
christie$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

This shows no changes because git submodule deinit only makes changes to our local working copy.

Running ls we also see the directories are still present:

[skade ;) ~/Work/submodule-practice (master)]
christie$ ll
▕ drwxrwxr-x▏christie:christie│4 day │ 4K│furry-octo-nemesis
▕ drwxrwxr-x▏christie:christie│16 sec │ 4K│psychic-avenger
▕ -rw-rw-r--▏christie:christie│4 day │ 29B│README.md
▕ -rw-rw-r--▏christie:christie│4 day │ 110B│README.mediawiki

[skade ;) ~/Work/submodule-practice (master)]

Remove with git rm

To actually remove the submodule from your project’s repository, use git rm:

[skade ;) ~/Work/submodule-practice (master)]
christie$ git rm psychic-avenger
rm 'psychic-avenger'

Let’s check our work:

[skade ;) ~/Work/submodule-practice (master +)]
christie$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
(use "git reset HEAD ..." to unstage)

modified: .gitmodules
deleted: psychic-avenger

These changes have been staged by git automatically, so to see what has changed about .gitmodules, use --cached flag or its alias --staged:

[skade ;) ~/Work/submodule-practice (master +)]
christie$ git diff --cached
diff --git a/.gitmodules b/.gitmodules
index dec1204..e531507 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,3 @@
[submodule "furry-octo-nemesis"]
path = furry-octo-nemesis
url = git@github.com:christi3k/furry-octo-nemesis.git
-[submodule "psychic-avenger"]
- path = psychic-avenger
- url = git@github.com:christi3k/psychic-avenger.git
diff --git a/psychic-avenger b/psychic-avenger
deleted file mode 160000
index fdd4b36..0000000
--- a/psychic-avenger
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit fdd4b366458757940d7692b61e22f4d1b21c825a

So we see that in .gitmodules, lines related to psychic-avenger have been removed and that the psychic-avenger directory and commit hash has also been removed.

And a directory listing shows the files are no longer in our working directory:

christie$ ll
▕ drwxrwxr-x▏christie:christie│4 day │ 4K│furry-octo-nemesis
▕ -rw-rw-r--▏christie:christie│4 day │ 29B│README.md
▕ -rw-rw-r--▏christie:christie│4 day │ 110B│README.mediawiki

Removing all reference to the submodule (optional)

For whatever reason, git does not remove all trace of the submodule even after these commands. To completely remove all reference, you need to also delete the .git/modules entry to really have it be gone:

[skade ;) ~/Work/submodule-practice (master)]
christie$ rm -rf .git/modules/psychic-avenger

Note: This probably optional for most use-cases. The only time you might run into trouble if you leave this reference is if you later add a submodule of the same name. In that case, git will complain and ask you to pick a different name or to simply checkout the submodule from the remote source it already knows about.

Also, be careful with rm -rf because it doesn’t prompt you for a confirmation and there’s no dry-run flag.

Commit changes

Now we commit our changes:

[skade ;) ~/Work/submodule-practice (master +)]
christie$ git commit -m "Remove psychic-avenger submodule."
[master 7833c1c] Remove psychic-avenger submodule.
2 files changed, 4 deletions(-)
delete mode 160000 psychic-avenger

Looks good, let’s push our changes:

[skade ;) ~/Work/submodule-practice (master)]
christie$ git push -n origin master
To git@github.com:christi3k/submodule-practice.git
d89b5cb..7833c1c master -&gt; master

Looks good, let’s do it for real:

[skade ;) ~/Work/submodule-practice (master)]
christie$ git push origin master
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 402 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To git@github.com:christi3k/submodule-practice.git
d89b5cb..7833c1c master -&gt; master

Updating submodules within your project

The simplest use case for updating submodules within your project is when you simply want to pull in the most recent version of that submodule or want to change to a different branch.

There are two methods for updating modules.

Method 1: Specify a branch in .gitmodules and use `git submodule update --remote`

Using this method, you first need to ensure that the branch you want to use is specified for each submodule in .gitmodules.

Let’s take a look at the .gitmodules file for our sample project:

[submodule "furry-octo-nemesis"]
  path = furry-octo-nemesis
  url = git@github.com:christi3k/furry-octo-nemesis.git
[submodule "psychic-avenger"]
  path = psychic-avenger
  url = git@github.com:christi3k/psychic-avenger.git
  branch = RELEASE_E
[submodule "scaling-octo-wallhack"]
  path = scaling-octo-wallhack
  url = git@github.com:christi3k/scaling-octo-wallhack.git

The submodule psychic-avenger is set to checkout branch RELEASE_E and both furry-octo-nemesis and scaling-octo-wallhack will checkout master because no branch is specified.

Edit .gitsubmodules

To change the branch that is checked out, update the value of branch:

[submodule "scaling-octo-wallhack"]
  path = scaling-octo-wallhack
  url = git@github.com:christi3k/scaling-octo-wallhack.git
  brach = REL_2

Now scaling-octo-wallhack is set to checkout the REL_2 branch.

Update with git submodule update –remote

[skade ;) ~/Work/submodule-practice (master *)] 
christie$ git submodule update --remote
Submodule path 'scaling-octo-wallhack': checked out 'e845f5431119b527b7cde1ad138a373c5b2d4ec1'

And if we cd into scaling-octo-wallhack and run branch -vva we confirm we’ve checked out the REL_2 branch:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((e845f54...))] 
christie$ git branch -vva
* (detached from e845f54) e845f54 Release 2.
  master                  b49591a [origin/master] Cutting-edge changes.
  remotes/origin/HEAD     -> origin/master
  remotes/origin/REL_1    88d66d5 Prep Release 1.
  remotes/origin/REL_2    e845f54 Release 2.
  remotes/origin/master   b49591a Cutting-edge changes.

Method 2: git fetch and git checkout within submodule

First, change into the directory of the submodule you wish to update.

fetch from remote repository

Then run git fetch origin to grab any new commits:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((b49591a...))] 
christie$ git fetch origin 
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), done.
From github.com:christi3k/scaling-octo-wallhack
   e845f54..1cc1044  REL_2      -> origin/REL_2

Here was can see that the last commit for the REL_2 branch changed from e845f54 to 1cc1044.

Running branch -vva confirms this and that we haven’t changed which commit is checked out yet:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((88d66d5...))] 
christie$ git branch -vva
* (detached from 88d66d5) 88d66d5 Prep Release 1.
  master                  b49591a [origin/master] Cutting-edge changes.
  remotes/origin/HEAD     -> origin/master
  remotes/origin/REL_1    88d66d5 Prep Release 1.
  remotes/origin/REL_2    1cc1044 Hotfix for Release 2 branch.
  remotes/origin/master   b49591a Cutting-edge changes.

Checkout branch

So now we can re-checkout the REL_2 remote branch:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((88d66d5...))] 
christie$ git checkout origin/REL_2
Previous HEAD position was 88d66d5... Prep Release 1.
HEAD is now at 1cc1044... Hotfix for Release 2 branch.

Let’s check our work with branch -vva:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((1cc1044...))] 
christie$ git branch -vva
* (detached from origin/REL_2) 1cc1044 Hotfix for Release 2 branch.
  master                       b49591a [origin/master] Cutting-edge changes.
  remotes/origin/HEAD          -> origin/master
  remotes/origin/REL_1         88d66d5 Prep Release 1.
  remotes/origin/REL_2         1cc1044 Hotfix for Release 2 branch.
  remotes/origin/master        b49591a Cutting-edge changes.

Commiting the changes

Moving back to our main project directory, let’s check our work with git status && git diff:

[skade ;) ~/Work/submodule-practice (master *)] 
christie$ git status && git diff
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   scaling-octo-wallhack (new commits)

no changes added to commit (use "git add" and/or "git commit -a")
diff --git a/scaling-octo-wallhack b/scaling-octo-wallhack
index 88d66d5..1cc1044 160000
--- a/scaling-octo-wallhack
+++ b/scaling-octo-wallhack
@@ -1 +1 @@
-Subproject commit 88d66d5ecc58d2ab82fec4fea06ffbfd2c55fd7d
+Subproject commit 1cc104418a6a24b9a3cc227df4ebaf707ea23b49

Notice that there are no changes to .gitmodules with this method. Instead, we’ve simply changed the commit hash that the super project is pointing to for this submodule.

Now let’s add, commit and push our changes:

[skade ;) ~/Work/submodule-practice (master *)] 
christie$ git add scaling-octo-wallhack

[skade ;) ~/Work/submodule-practice (master +)] 
christie$ git commit -m "Updating to current REL_2."
[master 5ddbe87] Updating to current REL_2.
 1 file changed, 1 insertion(+), 1 deletion(-)

[skade ;) ~/Work/submodule-practice (master)] 
christie$ git push -n origin master
To git@github.com:christi3k/submodule-practice.git
   4a97a6f..5ddbe87  master -> master

[skade ;) ~/Work/submodule-practice (master)] 
christie$ git push origin master
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 261 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
To git@github.com:christi3k/submodule-practice.git
   4a97a6f..5ddbe87  master -> master

what’s the difference between git submodule update and git submodule update –remote?

Note: git submodule update –remote looks at the value you have in .gitmodules for branch. If there isn’t a value there, it assumes master. git submodule update looks at your repository has for the commit of the submodule project and checks that commit out. Both checkout to a detached state by default unless you specify –merge or –rebase.

These two commands have the ability to step on each other. If you have checked out a specific commit in the submodule directory, it’s possible for it to be different than the commit that would be checked out by git submdoule update –remote specificied in the branch value of .gitmodules.
Likewise, simply looking at the branch value in .gitmodules does not guarentee that’s the branch you have checked out for the submodule. When in doubt, cd to the submodule directory and run git branch -vva. git branch -vva is your friend!

When a subbmodule has been removed

When a submodule has been removed from a repository, what’s the best way to update your working directory to reflect this change?

The answer is that it depends on whether or not you have local, untracked files in the submodule directory that you want to keep.

Method 1: deinit and then fetch and merge

Use this method if you want to completely remove the submodule directory even if you have local, untracked files in it.

Note: In the following examples, we’re working in another checkout of our submodule-practice.

First, use git submodule deinit to deinitialize the submodule:

[skade ;) ~/Work/submodule-elsewhere (master *)] 
christie$ git submodule deinit psychic-avenger
error: the following file has local modifications:
    psychic-avenger
(use --cached to keep the file, or -f to force removal)
Submodule work tree 'psychic-avenger' contains local modifications; use '-f' to discard them

We have untracked changes, so we need to use -f to remove them:

[skade ;( ~/Work/submodule-elsewhere (master *)] 
christie$ git submodule deinit -f psychic-avenger
Cleared directory 'psychic-avenger'
Submodule 'psychic-avenger' (git@github.com:christi3k/psychic-avenger.git) unregistered for path 'psychic-avenger'

Now fetch changes from the remote repository and merge them:

[skade ;) ~/Work/submodule-elsewhere (master)] 
christie$ git fetch origin 
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0
Unpacking objects: 100% (3/3), done.
From github.com:christi3k/submodule-practice
   666af5d..6038c72  master     -> origin/master

[skade ;) ~/Work/submodule-elsewhere (master)] 
christie$ git merge origin/master 
Updating 666af5d..6038c72
Fast-forward
 .gitmodules     | 3 ---
 psychic-avenger | 1 -
 2 files changed, 4 deletions(-)
 delete mode 160000 psychic-avenger

Running ls on our project directory shows that the all of psychic-avenger’s files have been removed:

[skade ;) ~/Work/submodule-elsewhere (master)] 
christie$ ll 
▕ drwxrwxr-x▏christie:christie│3  hour│   4K│furry-octo-nemesis
▕ drwxrwxr-x▏christie:christie│5  min │   4K│scaling-octo-wallhack
▕ -rw-rw-r--▏christie:christie│3  hour│  29B│README.md
▕ -rw-rw-r--▏christie:christie│3  hour│ 110B│README.mediawiki

Method 2: fetch and merge and clean up as needed

Use this method if you have local, untracked (and/or ignored) changes that you want to keep, or if you want to remove files manually.

First, fetch changes from the remote repository and merge them with your local branch:

[skade ;) ~/Work/submodule-elsewhere (master)]
christie$ git fetch origin
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), done.
From github.com:christi3k/submodule-practice
d89b5cb..7833c1c master -> origin/master

[skade ;) ~/Work/submodule-elsewhere (master)]
christie$ git merge origin/master
Updating d89b5cb..7833c1c
warning: unable to rmdir psychic-avenger: Directory not empty
Fast-forward
.gitmodules | 3 ---
psychic-avenger | 1 -
2 files changed, 4 deletions(-)
delete mode 160000 psychic-avenger

Note the warning “unable to rm dir…” and let’s check our work:

[skade ;) ~/Work/submodule-elsewhere (master)]
christie$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
(use "git add ..." to include in what will be committed)

psychic-avenger/

nothing added to commit but untracked files present (use "git add" to track)

No uncommited or staged changes, but the directory that was our submodule psychic-avenger is now untracked. Running ls shows that there are still files in the directory, too:

[skade ;( ~/Work/submodule-elsewhere (master)]
christie$ ll psychic-avenger/
▕ -rw-rw-r--▏christie:christie│30 min │ 192B│README.md

Now you can clean up files as you like. In this example we’ll delete the entire psychic-avenger directory:

[skade ;) ~/Work/submodule-elsewhere (master)]
christie$ rm -rf psychic-avenger

Working on projects checked out as submodules

Working on projects checked out as submodules is rather straight-forward, particularly if you are comfortable with git branching and make liberal use of git branch -vva.

Let’s pretend that scaling-octo-wallhack is an extension that I’m developing for my project submodule-practice. I want to work on the project while it’s checked out as a submodule because doing so makes it easy to test the extension within my larger project.

Create a working branch

First switch the the branch that you want to use as the base for your work. I’m going to use local tracking branch master, which I’ll first ensure is up to date with the remote origin/master:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((1cc1044...))] 
christie$ git branch -vva
* (detached from origin/REL_2) 1cc1044 Hotfix for Release 2 branch.
  master                       b49591a [origin/master] Cutting-edge changes.
  remotes/origin/HEAD          -> origin/master
  remotes/origin/REL_1         88d66d5 Prep Release 1.
  remotes/origin/REL_2         1cc1044 Hotfix for Release 2 branch.
  remotes/origin/master        b49591a Cutting-edge changes.

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack ((b49591a...))] 
christie$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.

If master had not been up-to-date with orgin/master, I would have merged.

Next, let’s create a tracking branch for this awesome feature we’re going to work on:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (master)] 
christie$ git checkout -b awesome-feature
Switched to a new branch 'awesome-feature'

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ git branch -vva
* awesome-feature       b49591a Cutting-edge changes.
  master                b49591a [origin/master] Cutting-edge changes.
  remotes/origin/HEAD   -> origin/master
  remotes/origin/REL_1  88d66d5 Prep Release 1.
  remotes/origin/REL_2  1cc1044 Hotfix for Release 2 branch.
  remotes/origin/master b49591a Cutting-edge changes.

Do some work, add and commit changes

No we’ll do some work on the feature, add and commit that work:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ touch awesome_feature.txt

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ git add awesome_feature.txt 

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature +)] 
christie$ git commit -m "first round of work on awesome feature"
[awesome-feature 005994b] first round of work on awesome feature
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 awesome_feature.txt

Push to remote repository

Now we’ll push that to our remost repository so others can contribute:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ git push -n origin awesome-feature 
To git@github.com:christi3k/scaling-octo-wallhack.git
 * [new branch]      awesome-feature -> awesome-feature

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ git push origin awesome-feature 
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 265 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
To git@github.com:christi3k/scaling-octo-wallhack.git
 * [new branch]      awesome-feature -> awesome-feature

Switch back to remote branch, headless checkout

If we’d like to switch back to a remote branch, we can:

[skade ;) ~/Work/submodule-practice/scaling-octo-wallhack (awesome-feature)] 
christie$ git checkout origin/REL_2
Note: checking out 'origin/REL_2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 1cc1044... Hotfix for Release 2 branch.

Using this new branch to collaborate

To try this awesome feature in another checkout, use git fetch:

[skade ;) ~/Work/submodule-elsewhere/scaling-octo-wallhack ((1cc1044...))] 
christie$ git fetch origin 
remote: Counting objects: 2, done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 2 (delta 1), reused 2 (delta 1), pack-reused 0
Unpacking objects: 100% (2/2), done.
From github.com:christi3k/scaling-octo-wallhack
 * [new branch]      awesome-feature -> origin/awesome-feature

If you just want to try the feature, checkout orgin/branch:

[2724][skade ;) ~/Work/submodule-elsewhere/scaling-octo-wallhack ((1cc1044...))] 
christie$ git checkout origin/awesome-feature 
Previous HEAD position was 1cc1044... Hotfix for Release 2 branch.
HEAD is now at 005994b... first round of work on awesome feature

If you plan to work on the feature, create a tracking branch:

[2725][skade ;) ~/Work/submodule-elsewhere/scaling-octo-wallhack ((005994b...))] 
christie$ git checkout -b awesome-feature 
Switched to a new branch 'awesome-feature'

Acknowledgements

Thanks GPHemsley for helping me figure out git submodules within the context of our MozillaWiki work. I couldn’t have written this post without those conversations or the notes I took during them.

Additional Resources

Update 19 Feb: Fixed typos and added ‘Additional Resources’ section.

April 10, 2014

An Explanation of the Heartbleed bug for Regular People

I’ve put this explanation together for those who want to understand the Heartbleed bug, how it fits into the bigger picture of secure internet browsing, and what you can do to mitigate its affects.

HTTPS vs HTTP (padlock vs no padlock)

When you are browsing a site securely, you use https and you see a padlock icon in the url bar. When you are browsing insecurely you use http and you do not see a padlock icon.

Firefox url bar for HTTPS site (above) and non-HTTPS (below).

HTTPS relies on something called SSL/TLS.

Understanding SSL/TLS

SSL stands for Secure Sockets Layer and TLS stands for Transport Layer Security. TLS is the later version of the original, proprietary, SSL protocol developed by Netscape. Today, when people say SSL, they generally mean TLS, the current, standard version of the protocol.

Public and private keys

The TLS protocol relies heavily on public-key or asymmetric cryptography. In this kind of cryptography, two separate but paired keys are required: a public key and a private key. The public key is, as its name suggests, shared with the world and is used to encrypt plain-text data or to verify a digital signature. (A digital signature is a way to authenticate identity.) A matching private key, on the other hand, is used to decrypt data and to generate digital signatures. A private key should be safeguarded and never shared. Many private keys are protected by pass-phrases, but merely having access to the private key means you can likely use it.

Authentication and encryption

The purpose of SSL/TLS is to authenticate and encrypt web traffic.

Authenticate in this case means “verify that I am who I say I am.” This is very important because when you visit your bank’s website in your browser, you want to feel confident that you are visiting the web servers of — and thereby giving your information to — your actual bank and not another server claiming to be your bank. This authentication is achieved using something called certificates that are issued by Certificate Authorities (CA). Wikipedia explains thusly:

The digital certificate certifies the ownership of a public key by the named subject of the certificate. This allows others (relying parties) to rely upon signatures or assertions made by the private key that corresponds to the public key that is certified. In this model of trust relationships, a CA is a trusted third party that is trusted by both the subject (owner) of the certificate and the party relying upon the certificate.

In order to obtain a valid certificate from a CA, website owners must submit, at minimum, their server’s public key and demonstrate that they have access to the website (domain).

Encrypt in this case means “encode data such that only authorized parties may decode it.” Encrypting internet traffic is important for sensitive or otherwise private data because it is trivially easy eavesdrop on internet traffic. Information transmitted not using SSL is usually done so in plain-text and as such clearly readable by anyone. This might be acceptable for general internet broswing. After all, who cares who knows which NY Times article you are reading? But is not acceptable for a range of private data including user names, passwords and private messages.

Behind the scenes of an SSL/TLS connection

When you visit a website with HTTPs enabled, a multi-step process occurs so that a secure connection can be established. During this process, the sever and client (browser) send messages back and forth in order to a) authenticate the server’s (and sometimes the client’s) identity and, b) to negotiate what encryption scheme, including which cipher and which key, they will use for the session. Identities are authenticated using the digital certificates mentioned previously.

When all of that is complete, the secure connection is established and the server and client send traffic back and forth to each other.

All of this happens without you ever knowing about it. Once you see your bank’s login screen the process is complete, assuming you see the padlock icon in your browser’s url bar.

Keepalives and Heartbeats

Even though establishing an ssl connection happens almost imperceptibly to you, it does have an overhead in terms of computer and network resources. To minimize this overhead, network connections are often kept open and active until a given timeout threshold is exceed. When that happens, the connection is closed. If the client and server wish to communicate again, they need to re-negotiate the connection and re-incur the overhead of that negotiation.

One way to forestall a connection being closed is via keepalives. A keepalive message is used to tell a server “Hey, I know I haven’t used this connection in a little while, but I’m still here and I’m planning to use it again really soon.”

Keepalive functionality was added to the TLS protocol specification via the Heartbeat Extension. Instead of “Keepalives,” they’re called “Heartbeats,” but they do basically the same thing.

Specification vs Implementation

Let’s pause for a moment to talk about specifications vs implementations. A protocol is a defined way of doing something. In this case of TLS, that something is encrypted network communications. When a protocol is standardized, it means that a lot of people have agreed upon the exact way that protocol should work and this way is outlined in a specification. The specification for TLS is collaboratively developed, maintained and promoted by the standards body Internet Engineering Task Force (IETF). A specification in and of itself does not do anything. It is a set of documents, not a program. In order for a specifications to do something, they must be implemented by programmers.

OpenSSL implementation of TLS

OpenSSL is one implementation of the TLS protocol. There are others, including the open source GnuTLS as well as proprietary implementations. OpenSSL is a library, meaning that it is not a standalone software package, but one that is used by other software packages. These include the very popular webserver Apache.

The Heartbleed bug only applies to webservers with SSL/TLS enabled, and only those using specific versions of the open source OpenSSL library because the bug relates to an error in the code of that library, specifically the heartbeat extension code. It is not related to any errors in the TLS specification or and in any of the underlying ciper suites.

Usually this would be good news. However, because OpenSSL is so widely used, particularly the affected version, this simple bug has tremendously reach in terms of the number of servers and therefor the number of users it potentially affects.

What the heartbeat extension is supposed to do

The heartbeat extension is supposed to work as follows:

A client sends a heartbeat message to the server.
The message contains two pieces of data: a payload and the size of that payload. The payload can by anything up to 64kb.
When the server receives the heartbeat message, it is to add a bit of extra data to it (padding) and send it right back to the client.

Pretty simple, right? Heartbeat isn’t supposed to do anything other than let the server and client know they are each still there and accepting connections.

What the heartbeat code actually does

In the code for affected versions (1.0.1-1.0.1f) of the OpenSSL heartbeat extension, the programmer(s) made a simple but horrible mistake: They failed to verify the size of the received payload. Instead, they accepted what the client said was the size of the payload and returned this amount of data from memory, thinking it should be returning the same data it had received. Therefore, a client could send a payload of 1KB, say it was 64KB and receive that amount of data back, all from server memory.

If that’s confusing, try this analogy: Imagine you are my bank. I show up and make a deposit. I say the deposit is $64, but you don’t actually verify this amount. Moments later I request a withdrawal of the $64 I say I deposited. In fact, I really only deposited $1, but since you never checked, you have no choice but to give me $64, $63 of which doesn’t actually belong to me.

And, this is exactly how a someone could exploit this vulnerability. What comes back from memory doesn’t belong to the client that sent the heartbeat message, but it’s given a copy of it anyway. The data returned is random, but would be data that the OpenSSL library had been storing in memory. This should be pre-encryption (plain-text) data, including your user names and passwords. It could also technically be your server’s private key (because that is used in the securing process) and/or your server’s certificate (which is also not something you should share).

The ability to retrieve a server’s private key is very bad because that private key could be used to decrypt all past, present and future traffic to the sever. The ability to retreive a server’s certificate is also bad because it gives the ability to impersonate that server.

This, coupled with the widespread use of OpenSSL, is why this bug is so terribly bad. Oh, and it gets worse…

Taking advantage of this vulnerability leaves no trace

What’s worse is that logging isn’t part of the Heartbeat extension. Why would it be? Keepalives happen all the time and generally do not represent transmission of any significant data. There’s no reason to take up value time accessing the physical disk or taking up storage space to record that kind of information.

Because there is no logging, there is no trace left when someone takes advantage of this vulnerability.

The code that introduced this bug has been part of OpenSSl for 2+ years. This means that any data you’ve communicated to servers with this bug since then has the potential to be compromised, but there’s no way to determine definitively if it was.

This is why most of the internet is collectively freaking out.

What do server administrators need to do?

Server (website) administrators need to, if they haven’t already:

Determine whether or not their systems are affected by the bug. (test)
Patch and/or upgrade affected systems. (This will require a restart)
Revoke and reissue keys and certificates for affected systems.

Furthermore, I strongly recommend you enable Perfect forward secrecy to safeguard data in the event that a private key is compromised:

When an encrypted connection uses perfect forward secrecy, that means that the session keys the server generates are truly ephemeral, and even somebody with access to the secret key can’t later derive the relevant session key that would allow her to decrypt any particular HTTPS session. So intercepted encrypted data is protected from prying eyes long into the future, even if the website’s secret key is later compromised.

What do users (like me) need to do?

The most important thing regular users need to do is change your passwords on critical sites that were vulnerable (but only after they’ve been patched). Do you need to change all of your passwords everywhere? Probably not. Read You don’t need to change all your passwords for some good tips.

Additionally, if you’re not already using a password manager, I highly recommend LastPass, which is cross-platform and works on pretty much every device. Yesterday LastPass announced they are helping users to know which passwords they need to update and when it is safe to do so.

If you do end up trying LastPass, checkout my guide for setting it up with two-factor auth.