This post is a bit different from the posts I usually publish. In it, I talk about my data science development workflow and how I use Docker in my daily work. I truly believe in great tools that make us more productive and help us focus on the main problem at hand.
I will start with two questions: why would we even want to use Docker? And what's wrong with the usual way data scientists work with notebooks?
The Common Data Science Workflow
To start, let's imagine a situation where you want to work on multiple projects, each with its own dependencies. At the end of your research, you plan to deploy those products/models to third-party servers so they'll be accessible for everyone to use.
You also need to be flexible, since some projects require a large amount of memory and some even need a GPU, but you don't want to limit yourself to specific machines for running your code.
As data scientists, we're often required to work quickly and deliver results fast; the last thing we need is to deal with slow infrastructure or with customizing dev environments.
It is also nice to be able to switch computers without losing the "look and feel" of the environment we work with. If, for example, I start working on some project and suddenly need to switch computers or move to the cloud, I want to have exactly the same environment I'm used to after the change, without it slowing me down.
In this post I'm not going to dive deep into all the functionality that Docker gives you; instead, I will focus on the data science workflow and share some tips you might find useful.
Docker Terminology
I'm not going to explain all the terminology here, but the full Docker glossary can be found here.
I will focus on the terms that are relevant to us:
- Docker Images – A Docker image is made up of multiple layers. A user composes each Docker image to include system libraries, tools, and other files and dependencies for the executable code. Image developers can reuse static image layers for different projects. Reuse saves time, because a user does not have to create everything in an image.
- Docker Containers – A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
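To make the distinction concrete, here are two standard Docker CLI commands you can try locally: one lists the images available on your machine, the other lists the containers created from them.

# list the images available on this machine
docker images

# list containers created from those images (-a also shows stopped ones)
docker ps -a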
Set Up Your Docker Environment
In order to set up your data science environment, you're going to write your first docker-compose.yml file.
This file is going to use an existing Docker image, which I already created, that contains Python 3.6 along with all the common data science libraries, including tensorflow, keras, numpy, flask, sklearn, jupyter-lab and more.
Let's see what the file looks like.
Right after it, I'll discuss the interesting parts one by one.
version: '3'
services:
  datascience:
    image: 'yaronv/ds:latest'
    working_dir: /home/ds/
    entrypoint: ./run_lab.sh
    stdin_open: true
    tty: true
    ports:
      - '8888:8888'
    volumes:
      - './notebooks:/home/ds/notebooks/'
      - './general_data:/home/ds/data/'
Let’s explain the code above
Defining the service
In the code above, we define a docker-compose version 3 file with a single service called "datascience" (the name itself is arbitrary). We tell the service to use an existing Docker image called 'yaronv/ds:latest'.
Defining the entry point
We set the entrypoint to ./run_lab.sh, which means that when the container starts, the run_lab.sh script is executed (this script activates the Python environment and starts jupyter-lab on port 8888). Here is the script:
#!/bin/bash
# activate the Python 3.6 virtual environment and start jupyter-lab on port 8888
cd /home/ds/
source python-envs/env3.6/bin/activate
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Defining port forwarding
Next, in order to open the notebook from our local machine (remember that the service runs inside a Docker container, so it's like a different machine), we need to set up port forwarding for port 8888 from the container to our local environment. This means that jupyter-lab, which listens on port 8888 inside the container, will also be reachable on port 8888 of our local machine.
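The mapping format is 'HOST_PORT:CONTAINER_PORT'. So if, for example, port 8888 is already taken on your machine, you could map the lab to a different local port (the 9999 below is just an arbitrary choice) and browse to http://localhost:9999 instead:

ports:
  - '9999:8888'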
Defining volumes
Since the notebooks are saved on our local machine, and we don't want to publish them together with the Docker image, we define the 'volumes' property. This binds the local folder './notebooks' to the location '/home/ds/notebooks/' inside the container. This way, all changes made to the notebooks are saved locally immediately.
Lastly, we also bind another folder called './general_data' that contains all the extra files I usually use in various NLP tasks, such as word embeddings, taggers, tokenizers, etc.
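As a quick sanity check once the container is up (we'll start it in the next section), you can create a file on the local machine and look for it inside the container. The notebook name below is just a made-up example:

# on the local machine, inside the project folder
touch notebooks/scratch.ipynb

# the same file shows up inside the running container
docker-compose exec datascience ls /home/ds/notebooks/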
Running the Container
In order to run our Docker image inside a container, we're going to use the docker-compose command.
First, make sure you have docker-compose installed.
Second, I strongly recommend using a tool called dockstation. It's available for all platforms (Windows, Mac and Linux), and it really makes life easier by simplifying the process of starting and stopping containers.
In case you choose not to use the dockstation application, you can simply run the following command (from inside the folder that contains the docker-compose.yml file):
docker-compose up -d
This will bring up a Docker container running as a daemon (that's what the -d flag is for), which runs the data science image.
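A few other docker-compose commands are handy with this setup (all run from the folder that contains docker-compose.yml):

# follow the container logs (useful for grabbing the jupyter-lab token / URL)
docker-compose logs -f datascience

# open a shell inside the running container
docker-compose exec datascience bash

# stop and remove the container when you're done
docker-compose down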
Last and Important Note
In order for everything to run without issues, it is better if you set up the folder structure as follows (although you can update the docker-compose.yml file to match your existing structure).
Here's how the folders are organized:
- ds_docker
  - general_data
  - notebooks
  - docker-compose.yml
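If you're starting from scratch, you can create this layout with a single command and then place the docker-compose.yml file inside the ds_docker folder:

mkdir -p ds_docker/general_data ds_docker/notebooks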
Cheers