How to get up-and-running using Python, the easy_way, for the scientist

Update: I now drink Continuum Analytics’ Kool-Aid! (And I mean this in a really good way!) See this post for new instructions to get started :).

As a graduate student doing biological sequence analysis in Python, I spend quite a bit of time in my IPython HTML notebooks. (My preference, because it’s easier on the eyes than the Terminal.) However, getting set up wasn’t easy, not when there are so many ways just to get started with scientific computing.

If you’re a typical scientist, you’d want to get up and running as fast as possible with few obstructions. If you’re a scientist who takes the long view of things, you’d also want your setup to be “vendor-independent”. With Python, there’s a variety of ways to get the scientific computing packages: manual installation, Anaconda, Canopy… Also, there are different ways of installing Python packages, such as homebrew, pip, conda, easy_install… gosh, it’s a nightmare for the newcomer to Python. This clearly violates a Pythonic principle:

There should be one-- and preferably only one --obvious way to do it.

I’ve gone through the nightmares of managing packages that were installed in disparate directories due to Anaconda, homebrew, pip and easy_install using separate directories, to the point that I wiped out all but the Apple-provided installation of Python, and started from scratch with my packages. I have also seen a fellow scientist struggle with packages breaking due to package updates happening in different directories…

Therefore, in this post, I’d like to provide a guide on how to get up and running using Python in a way that is simple but keeps your installations clear-cut and vendor-independent.

I’ll reveal the answer up-front: My method of choice for maintaining packages is to do it by hand.

The rationale is this: I believe in doing things in a fashion that is maximally-compatible and generalizable, and because I see Python being a part of my graduate career (and hopefully beyond), I want to keep things vendor-independent.

TL;DR

  • To get started, install only the official Python distribution and no other distributions.
  • To install Python packages, use the commands “sudo easy_install package_name” for first installation, and “sudo easy_install -U package_name” for subsequent updates.

Installing Python

There are many ways to get Python. For starters, there’s the official distribution, there’s Anaconda by Continuum Analytics, and there’s Canopy by Enthought. Where does one start? What are the pros and cons of each distribution? Does a scientist even have to care?

So… The multitude of options for installing Python and its packages is an unfriendly situation for the scientist who just wants to get up and running, quickly learn the essentials of Python, and start automating analyses. 

This also violates the Pythonic principle of having only one obvious way of doing things.

My simple response is this:

[Image: “Install Python and Carry On”]

Yeah, basically, grab Python raw from the official website, and don’t worry about the other distributions of Python. As long as you only use the official distribution of Python, you should never have to deal with scary $PATH variables or directory issues.

What will happen if you use Anaconda or Canopy? You get a different directory with Python installed. Good idea if you’re only going to use the packages that come with Anaconda and Canopy, bad idea if you’re exploring and may decide to branch out later on, and bad idea if you’re worried about vendor lock-in.
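
If you’re ever unsure which Python you’re actually running, a quick check along these lines can help. This is only a sketch, assuming Python 3 and nothing beyond the standard library’s sys module; the function name describe_interpreter is my own, made up for illustration.

```python
import sys


def describe_interpreter():
    """Report which Python interpreter is running this code.

    describe_interpreter is a hypothetical helper name, not part of
    any library.
    """
    return {
        # Full path to the python binary that is executing right now.
        "executable": sys.executable,
        # Root directory of this Python installation.
        "prefix": sys.prefix,
        # Version number only, without the build details.
        "version": sys.version.split()[0],
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

If the printed paths point somewhere inside an Anaconda or Canopy directory, you are running that distribution’s Python rather than the official one.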

Packages

When I first started with Python, I needed SciPy in order to get BioPython up and running. However, some people suggested that I use homebrew for some things and pip or easy_install for others, or that I get one of Anaconda or Canopy. I started following the instructions and ended up blindly modifying my $PATH variable while both conda and homebrew were updating my packages. The result was broken code in IPython: one day, I suddenly found myself unable to do any imports at the top of my code.

Again, a very unfriendly scenario for the scientist who is a novice programmer. 
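
When imports break like this, it can help to ask Python where it would load a module from, or whether it can find it at all. The sketch below assumes Python 3 and uses importlib.util.find_spec from the standard library; locate_module is a hypothetical helper name of my own.

```python
import importlib.util


def locate_module(name):
    """Return where a top-level module would be loaded from.

    Returns None if Python cannot find the module on its search path.
    locate_module is a hypothetical helper name for illustration.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # origin is usually a file path; for built-in modules it is a
    # label such as "built-in" rather than a path.
    return spec.origin


if __name__ == "__main__":
    for module_name in ("json", "scipy", "Bio"):
        print(module_name, "->", locate_module(module_name))
```

A result of None means the interpreter you are running cannot see the package, which usually points to it having been installed under a different Python.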

My short response is this: use easy_install and nothing else, and you should never have to run into $PATH variable or directory issues.

If you’re using only what’s provided in the Anaconda or Canopy distributions of Python, then you’ll be safe using the “conda” commands (for Anaconda) for maintaining and updating packages. However, since I’ve advocated for using only the official Python distribution, I will advocate for the most generalizable installation commands for the packages: “easy_install”.

“easy_install” is my command-of-choice when it comes to installing packages. As its command name suggests, easy_install makes it easy to install any package from the Python Package Index (PyPI), also affectionately known as the Cheese Shop. The command for installing packages in the Mac terminal is:

sudo easy_install packagename

For example, if I wanted to install SciPy, I could do it with the command:

sudo easy_install scipy

To update/upgrade a package, at the terminal, one would use the command:

sudo easy_install -U packagename

The -U flag tells easy_install that you want to update/upgrade the package.

If you only use the official distribution of Python, your packages end up installed in the Python framework directory. Again, this ensures that one has to deal neither with separate directories nor with modifying the $PATH variable, which for me is still a scary prospect even after a year’s worth of experience with Python.
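
To see for yourself where packages land, you can ask the interpreter directly. This is a minimal sketch, assuming Python 3 and the standard library’s sysconfig module; install_locations is a name I made up for illustration.

```python
import sysconfig


def install_locations():
    """Return the directories this Python installs packages into.

    install_locations is a hypothetical helper name for illustration.
    """
    paths = sysconfig.get_paths()
    return {
        # Where pure-Python packages are installed.
        "purelib": paths["purelib"],
        # Where packages with compiled extensions are installed.
        "platlib": paths["platlib"],
        # Where command-line scripts from packages are placed.
        "scripts": paths["scripts"],
    }


if __name__ == "__main__":
    for key, value in install_locations().items():
        print(f"{key}: {value}")
```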

I’d recommend keeping a running list of packages installed with Python in some text file. I personally use Evernote, and keep track of the dates on which I update my Python packages.
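
The running list could also be generated rather than typed by hand. The sketch below is one way to do it, assuming Python 3.8 or later, where importlib.metadata is part of the standard library; the function names and the output file name are my own choices for illustration.

```python
from datetime import date
from importlib.metadata import distributions


def package_snapshot():
    """Return sorted 'name==version' lines for installed packages."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_snapshot(path):
    """Write a dated package list to path; return the package count."""
    lines = package_snapshot()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"# Packages recorded on {date.today().isoformat()}\n")
        for line in lines:
            handle.write(line + "\n")
    return len(lines)


if __name__ == "__main__":
    # "python-packages.txt" is an arbitrary file name for this example.
    count = write_snapshot("python-packages.txt")
    print(f"Recorded {count} packages")
```

Running this after each update gives a dated text file that can be pasted into Evernote or kept alongside your analysis code.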

Vendor Lock-In?

So some of you might say, “Enthought and Continuum Analytics (makers of Anaconda) distribute Python for free!”

Yeah, and I totally agree with you that Enthought and Continuum Analytics are fully on board with open-source principles. I’ve even chatted with Travis Oliphant, CEO of Continuum Analytics, at the PyData conference held in Boston, and I am fully assured that he’s committed to keeping their distributions of Python open-source.

I’m also going to give them credit where credit is due. I’ve experienced trying to install scipy and numpy. It’s one heck of a ride. You need to get the Xcode command line tools. You need to get some Fortran-related thingamajigs, and you need to install a whole host of other things as well. And sometimes, the installation fails for reasons that have nothing to do with Python and everything to do with the underlying operating system. In contrast, Continuum Analytics has made getting scipy, numpy and other packages as easy as downloading an installer. Essentially, what they’ve done is packaged everything that a data scientist or analyst might need into one easy-to-install package.

As someone who does analysis, I’m also totally for Continuum Analytics’ mission to get the Python tools into the hands of analysts and data scientists so that they can get up and running quickly. For this, though, I’d much prefer their web-hosted solution, Wakari.io, rather than the Anaconda way. Using the web-based Wakari solution means I don’t have to mess around with my own system. A safer option for the scientist, but it also removes the ability to work offline.

That said, though, I’d only truly be won over if either company made a package manager that made package management easy, worked with the default directory structures, and also worked with PyPI directly. I read their packaging everything into their own directories as a signal that they’re essentially “forking” off as opposed to “building on top of” what’s already present. I’d much prefer that they do things in the “most compatible” way, and work with just “one, and preferably only one, obvious way of managing packages”.

Final Thoughts

Rather than say I’m “against” anything, here’s what I’m “for”:

  1. One, and preferably only one, obvious way of doing things.
  2. Keeping things as compatible as possible.

When it comes to Python package management, I’d like to know that I’m able to maintain knowledge and control over where things are going, without needing to depend on tools that may turn out to be incompatible in the future.

 
