The architectural and organizational/process advantages of containerization (e.g., via Docker) are commonly known. However, in constructing images, especially those that serve as the base for other images, adding functionality via package installation is a double-edged sword. On one hand, we want our images to be as useful as possible for the purposes they are built for. On the other hand, as images are downloaded, moved around our networks, and live in our production environments, we pay a real price in speed and cost for bloated image sizes. The obvious onus on image creators is to make images as small as practical without sacrificing efficacy and extensibility. This post shows how we shrank our images with a pretty simple trick…

The great impetus towards smaller images manifests in a few places:

  1. OS distros, such as phusion (minimal, Docker-friendly Ubuntu), busybox (intended for embedded systems), and alpine. These provide operating systems that are minimally functional yet can be easily extended.
  2. Programming environments, such as microcontainers from Iron.io.
  3. Shrink wrapping tools, such as skinnywhale, docker export, and strip-docker-image, which work with existing image layers/containers and try to compress them by finding redundancies and commonalities.

When creating Wise.io's open version of the Python data science base image, I found that the OS distro choice does not affect the final image size much, since so many dependencies are required to get a fully functional data science environment up and running. Before turning to post-build shrink wrapping, I wound up looking for ways to shrink the resulting image in the Dockerfile itself.

The essential point is that since each RUN creates a new layer, one needs to condense logically related installation and tear-down steps into a single RUN instruction. You can do this easily by chaining commands with double ampersands (&&) in the shell. If you tear down/clean up in a separate RUN instead, your final image will still carry the bloat from the previous layers. We needed three major installation/clean up steps in our Dockerfile:

1. System level dependencies

http://www.gistfy.com/github/wiseio/datascience-docker/datascience-base/Dockerfile?branch=master&slice=12:16&lang=dockerfile&style=github

Here you’ll notice that in addition to updating the OS, installing new packages, and setting locales, we also purge the cache of apt installation files.
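Since the embedded snippet may not render everywhere, here is a minimal sketch of what such a system-level layer could look like. It is not the original Dockerfile: the base image, the package list, and the locale are assumptions chosen only for illustration.

```dockerfile
# Assumed base image; the base used by the original Dockerfile is not shown in this post.
FROM ubuntu:22.04

# Avoid interactive prompts during the build (applies at build time only).
ARG DEBIAN_FRONTEND=noninteractive

# Update the OS, install packages, set a locale, and purge the apt cache
# in one RUN so the downloaded package files never land in a committed layer.
# The package names and the en_US.UTF-8 locale are placeholders.
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
        build-essential ca-certificates locales wget \
    && locale-gen en_US.UTF-8 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
```

The `rm -rf /var/lib/apt/lists/*` at the end is the part that matters for size: it only helps because it runs in the same RUN as `apt-get update`.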

2. (Python) Conda distro and data science friendly Python packages like jupyter notebook, pandas, numpy, matplotlib, plotly, sklearn, scikit-image, nltk, gensim, psycopg2:

http://www.gistfy.com/github/wiseio/datascience-docker/datascience-base/Dockerfile?branch=master&slice=19:24&lang=dockerfile&style=github

Here we get the latest miniconda from Continuum.io, install our favorite data science packages for Python, and then tidy up. Using “conda clean” in this layer leads to major space savings.
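As a rough illustration of this step, the sketch below downloads the installer, installs packages, and cleans up in one layer. The base image, the `/opt/conda` prefix, the installer URL, and the exact package list are assumptions, not a copy of the original Dockerfile, and the sketch has not been checked against current installer releases.

```dockerfile
# Assumed base image.
FROM ubuntu:22.04

# Tools needed to fetch the installer (placeholder list), cleaned up in the same RUN.
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates wget \
    && rm -rf /var/lib/apt/lists/*

# /opt/conda is an assumed install prefix.
ENV PATH=/opt/conda/bin:$PATH

# Download Miniconda, install it, add packages, and tidy up in a single layer.
# Deleting the installer and running "conda clean" here keeps both out of the image.
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
    && bash /tmp/miniconda.sh -b -p /opt/conda \
    && rm /tmp/miniconda.sh \
    && conda install -y jupyter pandas numpy matplotlib scikit-learn scikit-image nltk psycopg2 \
    && conda clean --all --yes
```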

3. All the Python packages we want that are not in the standard conda distro channel (e.g. gensim, plotly), but are available via pip:

http://www.gistfy.com/github/wiseio/datascience-docker/datascience-base/Dockerfile?branch=master&slice=27:28&lang=dockerfile&style=github

Here we make sure to remove the cache directory after we’re done.
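A minimal sketch of the pip step might look like the following. The base image and the cache path are assumptions: `/root/.cache/pip` is where pip keeps its cache by default when the build runs as root on Linux.

```dockerfile
# Assumed base image that already ships conda and pip.
FROM continuumio/miniconda3

# Install the pip-only packages and remove pip's cache directory in the same RUN,
# so the downloaded wheels and archives are not committed to the layer.
RUN pip install gensim plotly \
    && rm -rf /root/.cache/pip
```

An alternative with a similar effect is pip's `--no-cache-dir` flag, which avoids writing the cache in the first place.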

The “trick” is really just two components:

  1. Put all logically connected installations (e.g. from one package manager) into their own RUN, to produce fewer layers.
  2. Figure out what the tear down/clean up commands are for those installations/package managers and tack them on to the end of the RUN (e.g., conda clean, rm, …).
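To make the difference concrete, here is a small side-by-side sketch. The base image and the package are hypothetical and only serve to show the pattern.

```dockerfile
# Hypothetical base image, for illustration only.
FROM ubuntu:22.04

# Anti-pattern (left commented out): the clean up runs in its own RUN, so the
# layer created by the first RUN still contains the apt lists and the final
# image is no smaller.
# RUN apt-get update && apt-get install -y --no-install-recommends curl
# RUN rm -rf /var/lib/apt/lists/*

# Preferred: install and clean up chained with && in one RUN, so the
# temporary files never appear in any committed layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
```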

All told, we saved about 46% space (475 MB) just by setting up and tearing down in the same RUN.

[Screenshot: Docker Hub tags page for wiseio/datascience-docker]

If you’re a Pythonista/data scientist and would like to give our base image a shot, just run:

   docker pull wiseio/datascience-docker

And get started with jupyter notebooks and more.

We’d love to hear from you if you’ve got any other tricks to shrink down this image.

Thanks to Paul Baines and Henrik Brink for comments on earlier drafts.

1 Comment

  1. If you take a look at the pywren project (http://pywren.io/) from the Berkeley BIDS lab, it does many things to reduce the size of Miniconda + base data science packages even further, so as to fit in an AWS Lambda function. Specifically, see the steps in https://github.com/pywren/runtimes/blob/master/shrinkconda.py. The conda clean step was already mentioned in the blog.

    These same techniques would be applicable to a Docker container, especially if you are looking at just containerizing a working data science app.
