Prelude
This is a living document as things usually change across software versions.
A lot of workplaces suck in the sense that they don't want to give their developers full control over their computers. Such workplaces also usually choose a stable Linux distribution, like Ubuntu 12.04. The rationales are as diverse as the C++ style guides under the sky, but the fact is you're stuck with missing or highly outdated tools.
This document will go through all the steps necessary to build a scientific computing environment in Python3, with an OpenBLAS (previously GotoBLAS2) or MKL (Intel's BLAS) backed NumPy and SciPy, matplotlib, OpenCV, and IPython with its notebook. Notice that I said build an environment, and by that I mean a virtualenv.
Another reason to follow this guide is if your distro doesn't back NumPy/SciPy with a good BLAS implementation but you need high performance. (If you need very high performance but want to stay dynamic, I recommend going with Julia, but that's for another article.)
First Baby Steps
In this guide, we'll install everything into the ~/inst prefix; it will mirror the structure of /usr but contain everything we installed locally. In addition, the source code of all packages will reside in ~/inst/src and we will compile them in ~/inst/src/build-* as much as possible.
In the following commands, I will often have to write /home/USERNAME instead of ~. In most cases, this is necessary and you should do so too, but, of course, replace USERNAME with your username.
Optionally, you can add ~/inst/bin and friends to your PATH and friends environment variables. This will have the effect that everything you install to ~/inst will always be preferred to the system-installed versions, implicitly. This might or might not be what you want. (It's not what I want.) If it is what you want, add the following to your shell's startup file; for bash, that's ~/.bashrc:
export PATH="/home/USERNAME/inst/bin:$PATH"
export LD_LIBRARY_PATH="/home/USERNAME/inst/lib:$LD_LIBRARY_PATH"
Then, re-source it: . ~/.bashrc
If you're using another shell, such as the excellent fish or the hipster's zsh, I expect you to know how to adapt the above to your shell.
While for the remainder of this article I'll assume you didn't do the above, you don't need to do anything differently if you did.
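One detail worth knowing about path variables: if LD_LIBRARY_PATH was empty before, prepending with a colon leaves an empty trailing entry, which PATH-style lookups commonly treat as the current directory. Here is a small sketch, in Python, of prepending without that side effect. The function name prepend_path is my own invention and the ":" separator assumes a Unix-like system.

```python
import os

SEP = ":"  # assumption: Unix-style separator, as used throughout this guide


def prepend_path(value, directory):
    """Return the colon-separated list `value` with `directory` in front.

    Empty entries and earlier copies of `directory` are dropped, so the
    result never starts or ends with a stray colon.
    """
    entries = [e for e in value.split(SEP) if e and e != directory]
    return SEP.join([directory] + entries)


if __name__ == "__main__":
    inst = os.path.expanduser("~/inst")
    for var, sub in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        new_value = prepend_path(os.environ.get(var, ""), os.path.join(inst, sub))
        # Print lines you could paste into ~/.bashrc
        print('export {}="{}"'.format(var, new_value))
```

Running it only prints the two export lines; it doesn't change your environment.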
If you're a Mac user, replace each wget foo by the corresponding curl foo > filename, or download the files manually.
Static or Shared Python Libraries?
Compiling Python with static libraries makes the whole process easier, but if you plan to have anything embed that python you're compiling (I'm thinking of both OpenCV and Julia's PyCall here), you have to go for the shared library option. I will describe both ways in this article.
Python3
We don't really care about other dependencies of Python, such as tkinter, readline and others since we'll go with IPython as a shell and IPython's excellent notebook for interactive plotting.
Should you want to get everything into python, do install libreadline, liblzma, libgdbm, and tcl/tk into your prefix. (Note: tcl/tk doesn't support out-of-source builds.)
Getting, configuring, compiling and installing Python3 works as follows (but read on first if you go for the shared library option!):
$ mkdir -p ~/inst/src
$ cd ~/inst/src
$ wget http://www.python.org/ftp/python/3.3.3/Python-3.3.3.tgz
$ tar -xzv < Python-3.3.3.tgz
$ mkdir build-Python-3.3.3
$ cd build-Python-3.3.3
$ ../Python-3.3.3/configure --srcdir=../Python-3.3.3 --prefix=/home/USERNAME/inst
$ make
$ make install
$ cd -
Keep the build directory, as it also contains a make uninstall target that might come in handy at some point. If you're short on disk space, you can run make clean in the build directory though.
With Shared Python
The only thing which changes in the above step is that you need to add the --enable-shared option to the configure step above. This makes the python executable link against libpython3.so and thus we'll have to adapt some paths later on.
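If you're unsure later which of the two variants a given interpreter is, you can ask the interpreter itself. The following is a minimal sketch using the standard sysconfig module; the two helper names are mine, and the config variables are the ones I'd expect on a Unix build (they may be missing elsewhere).

```python
import sysconfig


def is_shared_build():
    """True if this interpreter was configured with --enable-shared."""
    return bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))


def libpython_name():
    """File name of the Python library the build produced (may be None)."""
    return sysconfig.get_config_var("LDLIBRARY")


if __name__ == "__main__":
    print("shared build:", is_shared_build())
    print("library file:", libpython_name())
    print("library dir: ", sysconfig.get_config_var("LIBDIR"))
```

Run it with the interpreter you care about, e.g. ~/inst/bin/python3.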
Notes on Tcl/Tk
Installing those two guys from source wasn't possible without too much hassle, and since I don't need them, I didn't try harder. If you are unfortunate enough to really need them, here are a few pointers:
- They don't support out-of-source builds. configure resides in the unix subfolder.
- They don't like to be linked statically only.
- Python's make will need LD_LIBRARY_PATH to hold ~/inst/lib.
OpenBLAS
TODO: It seems this wants gfortran so that it also compiles lapack functions.
OpenBLAS is an open-source continuation of Kazushige Goto's BLAS implementation which completely crushed all available implementations. You might want to use that unless you have a license to use Intel's MKL, in which case you can skip this section.
Unfortunately, I couldn't find any easy way to build OpenBLAS out-of-source, so here we go:
$ cd ~/inst/src
$ git clone https://github.com/xianyi/OpenBLAS
$ cd OpenBLAS
$ make
$ make PREFIX=/home/USERNAME/inst install
$ cd -
If you have some kind of fancy special CPU which is not correctly autodetected, you might want to look into the TARGET=xxx flag described in the TargetList.txt file. Usually, you'll be fine without it.
If you want to compile ALL the algorithms such that the one for the correct CPU will be chosen at runtime (this only makes sense if you will use this on different computers, say because your home directory is in a grid/cluster engine), add the DYNAMIC_ARCH=1 option to both the make and the make install commands above.
NOTE to self: I previously included the option NO_SHARED=1, but in recent versions, NumPy doesn't like the static libs anymore: it doesn't link to -lm, -pthread and -lgfortran, so we get undefined references. Interestingly, NumPy doesn't seem to need an updated LD_LIBRARY_PATH given a correct site.cfg, but SciPy does.
Creating a Virtual Environment
Now that we have Python3 compiled, we want to create a virtual environment based on that and install all of the following things into that environment.
Python >= 3.3
Starting with version 3.3, Python comes with built-in support for virtualenvs.
TODO: Experimental notes.
$ ~/inst/bin/pyvenv sci-env3
This already seems to take the lib from ~/inst/lib, so it might not need the LD_LIBRARY_PATH fixes?
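As far as I know, pyvenv is only a thin wrapper around the standard library's venv module, so the same thing can be done from Python. The sketch below assumes Python 3.4 or newer (the with_pip argument doesn't exist in 3.3); create_env is a name I made up.

```python
import os
import venv


def create_env(path, with_pip=False):
    """Create a virtual environment at `path` and return its pyvenv.cfg path.

    with_pip=False skips the ensurepip step, which is the part that
    fails on some distributions.
    """
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(path)
    return os.path.join(path, "pyvenv.cfg")


if __name__ == "__main__":
    cfg = create_env(os.path.expanduser("~/sci-env3"))
    print("created, config at", cfg)
```

The environment is built from whichever interpreter runs the script, so call it with ~/inst/bin/python3 to get the Python we just compiled.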
Error calling pyvenv
Depending on your distribution (e.g. Ubuntu 14.04) and version of python, you might encounter the following error:
Error: Command '['/home/lucas/sci-env3.4/bin/python3.4', '-Im', 'ensurepip', '--upgrade', '--default-pip']' returned non-zero exit status 1
In this case, you should create the venv without pip and install pip manually into it:
$ pyvenv-3.4 --without-pip sci-env3.4
$ . sci-env3.4/bin/activate
$ curl https://bootstrap.pypa.io/get-pip.py | python
Python < 3.3
For older versions of Python, most prominently version 2.7, you'll need to rely on a 3rd-party virtualenv creation tool, such as virtualenv:
$ cd ~/inst/src
$ wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.2.tar.gz
$ tar -xzv < virtualenv-1.11.2.tar.gz
And now we can create the virtual environment. I'll call it sci-env3 and place it into my home folder. The virtualenv script needs to be called by the python executable you wish to have in the environment, so we'll use the one we just compiled a few minutes ago (see below first if you used --enable-shared):
$ cd virtualenv-1.11.2
$ ~/inst/bin/python3 virtualenv.py /home/USERNAME/sci-env3
$ cd -
All of the following needs to be done with the environment activated! So let's activate it:
$ . ~/sci-env3/bin/activate
(sci-env3)$
Or, for friends of fishes:
$ . ~/sci-env3/bin/activate.fish
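Since forgetting to activate the environment is the most common way to mess up the remaining steps, here's a tiny sketch for checking from within Python whether you're inside one. It relies on sys.base_prefix (set by the built-in venv support) and sys.real_prefix (set by older releases of the 3rd-party virtualenv tool); the function name is mine.

```python
import sys


def in_virtualenv():
    """True if the running interpreter belongs to a virtual environment."""
    # Old virtualenv releases set sys.real_prefix; venv/pyvenv set
    # sys.base_prefix. Outside an environment, both equal sys.prefix
    # or don't exist at all.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("prefix:", sys.prefix)
    print("inside a virtualenv:", in_virtualenv())
```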
With --enable-shared
If you have chosen to compile Python with the --enable-shared flag, you'll need to make sure the libpython3 can be found by the OS. So, instead of the above ~/inst/bin/python3 command, run the following:
$ LD_LIBRARY_PATH=~/inst/lib:$LD_LIBRARY_PATH ~/inst/bin/python3 virtualenv.py /home/USERNAME/sci-env3
In order not to have to type this huge prefix all the time, I recommend you add the following to the virtualenv's activate script in the deactivate function around line 12:
if [ -n "$_OLD_VIRTUAL_LD_PATH" ] ; then
    LD_LIBRARY_PATH="$_OLD_VIRTUAL_LD_PATH"
    export LD_LIBRARY_PATH
    unset _OLD_VIRTUAL_LD_PATH
fi
and, around line 53,
_OLD_VIRTUAL_LD_PATH="$LD_LIBRARY_PATH"
LD_LIBRARY_PATH="$VIRTUAL_ENV/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH
Now, activate the virtualenv using this updated activation script by calling . ~/sci-env3/bin/activate and then copy the shared libraries into the virtualenv, just like is being done with the binary:
(sci-env3)$ cp ~/inst/lib/libpython3{.3m.so.1.0,.so} $VIRTUAL_ENV/lib
(sci-env3)$ ln -s $VIRTUAL_ENV/lib/libpython3.3m.so.1.0 $VIRTUAL_ENV/lib/libpython3.3m.so
Having done this, whenever you activate the virtualenv, the LD_LIBRARY_PATH will be correctly set.
For the fish shell
I'm using the lovely fish shell. In that case, it's:
if test -n "$_OLD_VIRTUAL_LD_PATH"
    set -gx LD_LIBRARY_PATH $_OLD_VIRTUAL_LD_PATH
    set -e _OLD_VIRTUAL_LD_PATH
end
and
set -gx _OLD_VIRTUAL_LD_PATH $LD_LIBRARY_PATH
set -gx LD_LIBRARY_PATH "$VIRTUAL_ENV/lib" $LD_LIBRARY_PATH
(Notice how in fish, the PATH variables are arrays. Nice.) Don't forget to run the last part above, i.e. activating the virtualenv, the copy of the libs and the creation of the symlink!
NumPy
Now we're ready to install NumPy. Backing NumPy with OpenBLAS and/or Intel MKL has become considerably easier with NumPy 1.8 and thus many of the tutorials you'll find on the internet are outdated and overcomplicated. First, get NumPy and its dependency cython:
(sci-env3)$ pip install cython
(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/numpy/numpy
(sci-env3)$ cd numpy
Unless you have a compelling reason, it's a good idea to check out the latest tag, as that one is likely not to be in-between some changes.
(sci-env3)$ git tag
(sci-env3)$ git checkout v1.9.0
Now, we'll have to customize the site.cfg file in order for setup.py to find our installation of either OpenBLAS or MKL.
(sci-env3)$ cp site.cfg.example site.cfg
OpenBLAS
For OpenBLAS, it's as simple as uncommenting and adapting the prepared entries:
[openblas]
libraries = openblas
library_dirs = /home/USERNAME/inst/lib
include_dirs = /home/USERNAME/inst/include
But do notice the warning above these entries in site.cfg.example, of which the TL;DR is that before Python 3.4, multithreaded OpenBLAS and Python's multiprocessing don't work well together!
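A typo in one of these paths is easy to make and only shows up much later as a slow NumPy. As a sanity check, something like the following sketch can read the [openblas] section back and look for the library files. It assumes site.cfg parses as a regular INI file and that multiple directories are separated by ":"; the function names and the SAMPLE text are only for illustration.

```python
import configparser
import glob
import os

# Illustration only: the same entries as shown above.
SAMPLE = """\
[openblas]
libraries = openblas
library_dirs = /home/USERNAME/inst/lib
include_dirs = /home/USERNAME/inst/include
"""


def openblas_settings(site_cfg_text):
    """Return the [openblas] entries as a dict, or None if the section is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(site_cfg_text)
    if not parser.has_section("openblas"):
        return None
    keys = ("libraries", "library_dirs", "include_dirs")
    return {key: parser.get("openblas", key, fallback="") for key in keys}


def find_openblas(library_dirs):
    """List the libopenblas* files found in the given ':'-separated directories."""
    hits = []
    for directory in library_dirs.split(":"):
        hits.extend(glob.glob(os.path.join(directory, "libopenblas*")))
    return sorted(hits)


if __name__ == "__main__":
    with open("site.cfg") as f:
        settings = openblas_settings(f.read())
    print(settings)
    if settings:
        print(find_openblas(settings["library_dirs"]) or "no libopenblas found!")
```

Run it from the numpy source directory, next to the site.cfg you just edited.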
MKL
In case you do have an MKL license, you might want to go for MKL instead of OpenBLAS. There are many different distributions of MKL, the following is what worked for mine, which came bundled with Intel's Composer XE 2013:
library_dirs = /opt/intel_csxe_2013/composer_xe_2013.5.192/mkl/lib/intel64/
lapack_libs = mkl_lapack95_lp64
mkl_libs = mkl_rt
TODO: I don't really want to bother compiling with icc as most online docs do; if you still want to, it doesn't really look difficult.
TODO: Looks like current NumPys (1.7 through 1.9) don't work with my MKL, since all of them segfault the unittest when it reaches dot.
And then
After having correctly configured the site.cfg, we can compile and install NumPy. The output of the first step should hopefully show that the correct BLAS implementation will be used:
(sci-env3)$ python setup.py config
(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -
Testing
This installed NumPy. Let's just make sure everything worked out fine by running NumPy's test suite. We'll need nose for that, so we'll install it using pip first:
(sci-env3)$ pip install nose
(sci-env3)$ python -c 'import numpy; numpy.test()'
A dot represents a successful unittest, an 'S' is a skipped one, a 'K' is a known failure, all of which are no problem. Anything else is problematic.
Finally, when NumPy doesn't find any BLAS implementation, it compiles its own very slow fallback implementation. To check whether NumPy actually found the BLAS implementation you set up, run
(sci-env3)$ python -c 'import numpy; numpy.__config__.show()'
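Reading that output tells you what NumPy was linked against; a crude timing tells you whether it actually pays off. Below is a rough sketch, not a proper benchmark: time_matmul is a name I made up, and the matrix size and repeat count are arbitrary. If NumPy is missing, it simply returns None.

```python
import time


def time_matmul(n=1000, repeats=3):
    """Return the best wall-clock time in seconds for an n-by-n matrix product.

    Returns None if NumPy cannot be imported.
    """
    try:
        import numpy as np
    except ImportError:
        return None
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a.dot(b)  # dispatches to the BLAS gemm routine when one is linked
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    seconds = time_matmul()
    if seconds is None:
        print("NumPy is not installed in this environment")
    else:
        print("best of 3: {:.3f} s for a 1000x1000 product".format(seconds))
```

Comparing the number against a distro-installed NumPy on the same machine gives a feeling for the difference; the absolute value depends entirely on your CPU.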
SciPy
While NumPy wraps BLAS and some more in a convenient interface, SciPy is a higher-level wrapper around LAPACK, UMFPACK and FFTW/DJBFFT. For the sake of this article, I only really cared about LAPACK since both UMFPACK and FFTW are not strictly necessary; they only improve performance of sparse matrix operations and FFTs, respectively. If I ever need them, I'll revisit this article.
(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/scipy/scipy
(sci-env3)$ cd scipy
Again, you might want to pick the most recent tag and check that out. Then,
(sci-env3)$ python setup.py config
It should've picked up the BLAS implementation NumPy is using. If it didn't, either you already failed doing that correctly for NumPy, or maybe you left the virtualenv.
(NOTE/TODO: Recent versions fail because they don't link -lm, -lpthread and -lgfortran which are needed by OpenBLAS which can't link to them itself since it's a static library.)
(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -
OpenBLAS Fixme
For now, because static OpenBLAS doesn't work, you'll also need to fix the LD_LIBRARY_PATH in the virtualenv's activate and deactivate functions as described in the With --enable-shared section and then do
(sci-env3)$ ln -s ~/inst/lib/libopenblas.so.0 $VIRTUAL_ENV/lib/libopenblas.so.0
Testing
Again, we can test the validity of the installation:
(sci-env3)$ python -c 'import scipy; scipy.test()'
Here too, you can check whether it correctly detected your BLAS and LAPACK implementations by running
(sci-env3)$ python -c 'import scipy; scipy.__config__.show()'
Note: For me (v0.14.0.dev), there were 89 errors and 1 failure, though they all happened in the sparse matrix functions, which I don't use yet.
Note2: Dayum, it segfaults with MKL and I don't even know which test does that. No corefile generated (even though ulimit correct O.o) and ltrace doesn't help.
Matplotlib
TODO: Better install using pip install -e . so that dependencies are downloaded?
Matplotlib is the most mature and popular plotting library for Python. It's inspired by Matlab's plotting functionality (hence the name) but it already outgrew it, while keeping its simplicity. Other plotting packages I follow closely are ContinuumIO's bokeh which aims to bring the grammar of graphics (i.e. R's ggplot2) to Python, and Mike Bostock's d3js which, while it is not Python, is simply genius.
Since we installed NumPy from the github master, we'll need to do that for matplotlib too (Same story when checking out tags.):
(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/matplotlib/matplotlib.git
(sci-env3)$ cd matplotlib
(sci-env3)$ python setup.py config
(sci-env3)$ python setup.py build
(sci-env3)$ python setup.py install
(sci-env3)$ cd -
You might try to just pip install matplotlib, but it will very likely not work because of the very recent NumPy/SciPy we just compiled.
Testing
Same story as for the other packages above:
(sci-env3)$ python -c 'import matplotlib; matplotlib.test()'
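The full test suite takes a while. For a quick smoke test, which also works on a machine without a display, rendering a single plot to a file is usually enough. The sketch below assumes the Agg backend was built (it normally is); render_test_plot is my own helper name, and it returns False if matplotlib isn't importable.

```python
import os
import tempfile


def render_test_plot(path):
    """Render a tiny line plot to `path`; return True if a non-empty file results."""
    try:
        import matplotlib
    except ImportError:
        return False
    matplotlib.use("Agg")  # non-interactive backend, needs no display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    fig.savefig(path)
    plt.close(fig)
    return os.path.getsize(path) > 0


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "smoke-test.png")
    print("plot written:", render_test_plot(target), target)
```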
OpenCV
OpenCV is a computer-vision library whose origins lie in Intel's highly optimized IPL. The core of OpenCV is very stable, highly optimized and generally of very high quality. Unfortunately, it grew enormously lately and contains a lot of higher-level but less well-maintained parts.
As it's a C/C++ library at heart and the Python module is a wrapper tacked onto it, we have to install OpenCV into the virtualenv we just created. You'll have to repeat this step for every single virtualenv you create! (I might explore creating a virtualenv which inherits from another one in another essay, which might make this easier.) In addition, OpenCV being a super-modern C++ project (I'm kidding), you'll need the CMake C++ build system. If your workplace doesn't even have this, I pity you.
Another thing to watch out for is that OpenCV is BIG: the git repo weighs in at 381 MiB and the compilation will take ages if you compile with CUDA support. The reason we use the latest master is that none of the 2.x versions of OpenCV support Python3, only the current master (and future version 3.x) does. Anyways, let's get started:
(sci-env3)$ git clone https://github.com/Itseez/opencv.git
(sci-env3)$ mkdir build-opencv
(sci-env3)$ cd build-opencv
(sci-env3)$ ccmake -D WITH_CUDA=OFF -D PYTHON_INCLUDE_DIR=/usr/include/python3.4m ../opencv
You might see errors about java and matlab. If that is the case, keep in mind you had those errors and go on.
This got you into OpenCV's configure dialog. Press the c key once. Now's the point where you can choose which features to compile and which not to. Press the t key to toggle the "advanced" settings. The important parts are the following:
- Set CMAKE_BUILD_TYPE to Release.
- Set CMAKE_INSTALL_PREFIX to /home/USERNAME/inst.
- Set PYTHON3_EXECUTABLE to /home/USERNAME/sci-env3/bin/python3.3.
- Set PYTHON3_INCLUDE_DIR to /home/USERNAME/sci-env3/include/python3.3m.
- Set PYTHON3_LIBRARY to /home/USERNAME/sci-env3/lib/libpython3.3m.so.
- Set PYTHON3_NUMPY_INCLUDE_DIRS to /home/USERNAME/sci-env3/lib/python3.3/site-packages/numpy/core/include.
- Set PYTHON3_PACKAGES_PATH to /home/USERNAME/sci-env3/lib/python3.3/site-packages/.
- If you had the java error, set the BUILD_opencv_java option to OFF.
- If you had the matlab error, set the BUILD_opencv_matlab option to OFF.
NOTE: Older versions of OpenCV called the python-related variables PYTHON instead of PYTHON3, so make sure to check which ones you set.
Another possibility is to try just running the following line:
cmake -D WITH_CUDA=OFF -D PYTHON_INCLUDE_DIR=/usr/include/python3.4m -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/lucas/sci-env3.4 ../opencv/
After having made those changes, press the c key again, then press it again. (Yes, twice. After the first time, new settings have been added which you need to "confirm" by pressing c again.) Then, press g to generate the makefiles which you can use to build and install OpenCV:
(sci-env3)$ make
(sci-env3)$ make install
(sci-env3)$ cd -
Again, I recommend keeping the build folder, as you'll be able to run make uninstall in it, which might come in handy in the future.
Testing
While there is no python test-suite, you can do some poor-man's testing of the python bindings by running the following code snippet:
(sci-env3)$ python -c 'import cv2'
and, of course, you can run the C++ test-suite(s) if you're feeling patient:
(sci-env3)$ build-opencv/bin/opencv_test_core
(sci-env3)$ for f in build-opencv/bin/opencv_test_*; do "$f"; done
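With that many test binaries, it's handy to get a summary of which ones failed instead of scrolling through the output. Here's a minimal sketch of a runner in Python that collects each binary's exit status. The function name and the default glob pattern are my choice, and it assumes an exit status of 0 means success, which is the usual convention.

```python
import glob
import os
import subprocess


def run_test_binaries(bin_dir, pattern="opencv_test_*"):
    """Run every executable matching `pattern` in `bin_dir`.

    Returns a dict mapping the binary's file name to its exit status.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(bin_dir, pattern))):
        if os.path.isfile(path) and os.access(path, os.X_OK):
            results[os.path.basename(path)] = subprocess.call([path])
    return results


if __name__ == "__main__":
    statuses = run_test_binaries("build-opencv/bin")
    for name, status in sorted(statuses.items()):
        print("{:40s} {}".format(name, "ok" if status == 0 else "FAILED ({})".format(status)))
```

Unlike a loop using exec, each binary runs as a child process, so the loop continues after the first one finishes.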
IPython
We'll now install IPython and all the dependencies it needs for its notebook, which is an awesome "IDE" in the browser. This includes, amongst others, pyzmq which will produce an error and say something "blabla unless you interrupt me in the next 10 seconds...". This is OK, don't interrupt it.
(sci-env3)$ pip install "ipython[all]"
You're now able to start IPython by simply running ipython3 (notice the three in the name), or start the IPython notebook server by running ipython3 notebook, which will also open your favorite browser and point it to IPython's interactive notebook. Have fun with it!
Scikit-learn
Scikit-learn is a collection of Machine Learning libraries, or wrappers of such libraries, for Python. The interface is quite well thought-out and the documentation is first-class. Additionally, installing that one is easy:
(sci-env3)$ pip install scikit-learn
You can test the installation by running:
(sci-env3)$ nosetests --exe sklearn
Additional Virtualenvs
Once you've reached here, you've got yourself a nice scientific python base environment. It doesn't end here, though. You might want to work on various projects needing additional, more domain-specific libraries, like Theano, statsmodels, NLTK, MMTK or even BioPython. If it's about trying out something specific, I'd recommend creating a new virtualenv for that, and not installing it into your "main" scientific env. Assuming you didn't remove any of the files created in the previous steps, creating additional scientific base environments is much easier now:
- Create the env as described above or, if you are lucky enough to have a globally installed virtualenv, using
$ virtualenv -p ~/inst/bin/python3.3 env
$ . env/bin/activate
- Install NumPy into it
(env)$ cd ~/inst/src/numpy
(env)$ python setup.py install
(env)$ cd -
(env)$ pip install nose
(env)$ python -c 'import numpy; numpy.test()'
- Install SciPy into it
(env)$ pip install cython
(env)$ cd ~/inst/src/scipy
(env)$ python setup.py install
(env)$ cd -
(env)$ python -c 'import scipy; scipy.test()'
- For the rest, either do as in the above two steps, or just pip install.
Updating a Virtualenv
Do you really want to do (risk) that? Just create a new one but git pull the repos before.
More
Theano
TODO
cuDNN
TODO
PyDot
For viewing Theano functions and expressions as a graph (which is useful for debugging), the pydot bindings to the dot graphing language are required. Unfortunately, the default pydot available in the cheeseshop currently fails to install in Python3 with a very uninformative message:
(sci-env3)$ pip install pydot
Collecting pydot
Using cached pydot-1.0.2.tar.gz
Traceback (most recent call last):
File "<string>", line 20, in <module>
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-e15pmion/pydot
Luckily for us, github user @nlhepler is hosting a Py2/3 compatible version of it, which we can simply install straight from the repo by running:
(sci-env3)$ pip install git+https://github.com/nlhepler/pydot.git
Note that this only works when graphviz is installed, which I haven't had to go through so far.
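Whether graphviz is available can be checked before running into a confusing error in pydot. A tiny sketch, assuming that graphviz being installed means its dot executable is on your PATH (the function name is mine):

```python
import shutil


def graphviz_dot_path():
    """Return the full path of graphviz's `dot` executable, or None if not found."""
    return shutil.which("dot")


if __name__ == "__main__":
    path = graphviz_dot_path()
    print("dot found at " + path if path else "graphviz's dot is not on the PATH")
```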
Julia?
Just as a quickref until I write a full article on that.
Julia is my big hope for the future of scientific programming. Since it's still a child (though not a newborn anymore), it's missing a lot of libraries. Whenever I'm missing something, and it's too much to implement it myself in the time I have, I'm happy to be able to call Python libraries through the excellent PyCall.jl package.
(sci-env3)$ cd ~/inst/src
(sci-env3)$ git clone https://github.com/JuliaLang/julia.git
(sci-env3)$ cd julia
(sci-env3)$ make
Since Julia is such an insanely fast-moving ecosystem, I recommend not actually installing it, but running it from the src directory and repeating the git pull and the make step on a regular (daily?) basis.
If you still want to install it, run:
(sci-env3)$ DESTDIR=/home/USERNAME/inst make install
In order to use the IPython version we installed in the virtualenv, we'll need to have the virtualenv activated when running the following:
(sci-env3)$ ./julia
julia> Pkg.update()
julia> Pkg.add("IJulia")
julia> Pkg.add("PyCall")
julia> Pkg.add("PyPlot")
Leave julia and start the IJulia notebook server (which is actually an IPython notebook server masquerading as IJulia):
(sci-env3)$ ipython3 notebook --profile julia --no-browser
Now point your favorite browser to whatever URL IPython told you, most likely http://127.0.0.1:8998/, and use Julia inside your browser!
Caveat emptor: PyPlot doesn't work with matplotlib 1.4 yet, so there's that.
More about Julia and PyCall in some future article.