Profiling PyPy vs. Python for Agent-Based Simulation



  1. Introduction
    1. Motivation
    2. Model description
    3. Link to code
  2. Environment Setup
  3. Performance
    1. Python vs. PyPy on one parameter set
    2. Varying the number of hosts
    3. Statistical draws


As part of my PhD dissertation, I wanted to investigate the role of host ecology in the generation of reassortant viruses. Knowing myself to be a fairly algebra-blind person, I decided that an agent-based model (ABM) was going to be much more manageable than writing ODEs. (Actually, the real reason is that I’m modelling discrete states rather than continuous ones, but yes, I will admit that I take longer than your average programmer with algebra.)

Model Description

Starting with our intuition of host-pathogen interactions, I implemented a custom ABM using Python classes – “Hosts” and “Viruses”.


“Viruses” have two segments, representing a segmented virus (like influenza or Lassa virus), each with a color (red or blue), and can infect Hosts (which are likewise red or blue). Viruses of a particular color prefer to infect hosts of the same color, but can still infect hosts of a different color, just with a lower probability. If two viruses are present in the same host, then there is, with some small probability, an opportunity for gene sharing to occur.

One of the virus’ segments determines host immunity; if the virus encounters a host which has immunity against its color, then the probability of infection drastically decreases, and it is likely that the virus will eventually be cleared.


“Hosts” are where viruses replicate. A host gains immunity to the color of a virus’s immunity-determining segment after a set number of days of infection. When a host gains immunity to a particular virus color, it can much more successfully fend off a new infection of that same color. Hosts also interact with one another, and may have a strong preference for hosts of their own color, a.k.a. homophily.
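
To make that structure concrete, here is a minimal, hypothetical sketch of how such classes might be laid out – the names and probabilities are invented for illustration, and the actual implementation in the repository differs:

import random

class Virus(object):
    """A virus with two colored segments; segment 0 determines host immunity."""
    def __init__(self, colors=('red', 'blue')):
        self.segments = list(colors)

    def reassort_with(self, other, p_reassort=0.01):
        # With small probability, swap a segment with a co-infecting virus.
        if random.random() < p_reassort:
            i = random.randrange(len(self.segments))
            self.segments[i], other.segments[i] = other.segments[i], self.segments[i]

class Host(object):
    """A colored host that accumulates immunity and carries infections."""
    def __init__(self, color):
        self.color = color
        self.immune_colors = set()
        self.infections = []

    def challenge(self, virus, p_match=0.9, p_mismatch=0.1):
        # Same-colored viruses infect more readily; prior immunity
        # drastically lowers the infection probability.
        p = p_match if virus.segments[0] == self.color else p_mismatch
        if virus.segments[0] in self.immune_colors:
            p *= 0.01
        if random.random() < p:
            self.infections.append(virus)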


My code for the simulations can be found in this GitHub repository. The details of the simulation are still a work in progress, as these ideas are still early stage. My goal in this blog post is to compare PyPy against CPython on performance. However, I do welcome further comments on the modelling, if you’ve taken the time to read through my code.

Code for the statistical draws can be found in this other GitHub repository.

Environment Setup

My CPython environment is managed by conda. (Highly recommended! Download here. Make sure to get Python 3!)

I installed pypy and pypy3 under my home directory on Ubuntu Linux, and ensured that my bash shell $PATH variable also pointed to ~/pypy[3]/bin.


Let’s take a look at the performance of CPython vs. PyPy on pure-Python code.

Default parameters

I started with 1000 agents in the simulation, running for 150 time steps.

Under these circumstances, on an old Asus U30J (Core i3 at 2.27 GHz, 8 GB RAM, SSD), executing the simulation with PyPy required only 13.4 seconds, while executing with CPython required 110.5 seconds – roughly an 8x speedup.

Varying number of hosts in the model

I wanted to measure the time complexity of the simulation as a function of the number of hosts. Therefore, I varied the number of hosts from 100 to 1600, in steps of 300.
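
The sweep itself is just a loop over host counts with a timer around each run; here is a minimal sketch, with run_simulation standing in as a hypothetical name for the actual simulation entry point:

from time import time

def benchmark(run_simulation, host_counts, n_timesteps=150, n_trials=3):
    # Time run_simulation(n_hosts, n_timesteps) n_trials times per host count.
    timings = {}
    for n_hosts in host_counts:
        timings[n_hosts] = []
        for _ in range(n_trials):
            start = time()
            run_simulation(n_hosts, n_timesteps)
            timings[n_hosts].append(time() - start)
    return timings

Running the same script under both interpreters then gives directly comparable numbers.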

Partial (mostly because of laziness) results are tabulated below; all times are in seconds. (Yes, this degree of laziness would never fly in grad school.)

Agents    PyPy (1)    PyPy (2)    PyPy (3)    CPython (1)    CPython (2)    CPython (3)
1000      13.4        12.8        12.9        110.5          –              –
700       8.63        9.02        8.65        53.7           –              –
400       4.35        4.33        4.66        18.2           18.2           –
100       1.03        1.00        1.17        1.47           1.48           1.45

As we can see, PyPy wins when the number of agents, and hence the number of loop iterations per time step, is large.

Statistical Draws

I use Bernoulli trials (biased coin flips) extensively in the simulation. Yet one thing that is conspicuously unavailable to PyPy users (in an easily installable format) is the scientific Python stack, most of which boils down to numpy. Rather than fiddle with getting numpy, scipy and other packages installed, I re-implemented my own bernoulli class.

from random import random

class bernoulli(object):
    """Pure-Python stand-in for scipy.stats.bernoulli."""
    def __init__(self, p):
        super(bernoulli, self).__init__()
        self.p = p

    def rvs(self, num_draws):
        # Return num_draws coin flips: 1 with probability p, 0 otherwise.
        draws = []
        for _ in range(num_draws):
            draws.append(int(random() < self.p))
        return draws

This is almost a drop-in replacement for scipy.stats.bernoulli. (The API isn’t exactly the same.) I wanted to know whether the bernoulli class I wrote, run under PyPy, performed better than calling the scipy.stats function under CPython. I therefore set up a series of small tests to determine at what scale of function calls it makes more sense to use PyPy vs. CPython.

I then wrote a simple block of code that times the Bernoulli draws. For the PyPy version:

from stats.bernoulli import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)
end = time()

print(end - start)

And for the CPython/scipy version:

from scipy.stats import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)
end = time()

print(end - start)

All times are in seconds.

Bernoulli draws    PyPy + custom (1)    PyPy + custom (2)    PyPy + custom (3)    CPython + SciPy (1)    CPython + SciPy (2)    CPython + SciPy (3)
1,000,000          0.271                0.241                0.206                0.486                  0.513                  0.481
100,000            0.0437               0.0421               0.0473               0.0534                 0.0794                 0.0493
10,000             0.0311               0.0331               0.0345               0.00393                0.00410                0.00387

As we can see, scipy is quite optimized, and outperforms the PyPy version when the number of statistical draws is small. Things only tip in PyPy’s favor as the number of draws increases.


Some things that I’ve learned from this exercise:

  1. For pure-Python code, PyPy can serve as a drop-in replacement for CPython.
  2. Because of the JIT compiler, PyPy is blazing fast when doing iterations!
  3. numpy is not, right now, easily pip-installable under PyPy. Because of this, the rest of the scientific Python stack is also not pip-installable in a PyPy environment. (I will admit to still being a learner here – I wouldn’t be able to articulate why numpy doesn’t work with PyPy out of the box. Experts, chime in please?)

Some things I hope will happen:

  1. Let’s port the scientific Python stack code to make it PyPy compatible! (Yeah, wishful thinking…)
  2. Alternatively, let’s hope the numba project allows JIT compilation when using Python objects instead.

As is the usual case for me, starting a new project gave me the impetus to try out a new thing, as I wouldn’t risk breaking workflows that have worked for existing projects. If you find yourself in the same spot, I would encourage you to try out PyPy, especially for pure-Python code (i.e. code that uses no external libraries).

Predicting HIV Drug Resistance Phenotype from Genotype

Note to Reader: I’d highly suggest reading this blog post on the left half of your screen, and have the Jupyter notebook on the right half of your screen. Makes things a bit easier to follow.

I have recently been writing a proposal to conduct experiments to predict viral RNA polymerase activity, in some standardized unit, from protein genotype. The main application would be the ability to conduct quantitative surveillance in a precise fashion. For example, with HIV, as treatment progresses, the virus accumulates mutations that confer resistance. For a physician treating a patient infected with HIV, would it be possible to determine, from sequence data alone, the predicted degree of viral resistance for that patient’s virus?

Knowing full well that this problem has been tackled many times in the past from multiple angles, I wanted to know how easily I could set up an ML workflow to go from sequence to predicted drug resistance. The goal of this blog post is to document how easy it was for me to get up and running, using really well-written Python packages.

Raw Code

All of my code can be found in my GitHub repository. You can also jump directly to the Jupyter notebook from which I reference code in this article.

Data Source and Preprocessing

I sourced the data from the Stanford HIV Drug Resistance Database. Specifically, I downloaded the high-quality, filtered set of genotype-to-phenotype mappings for protease inhibitors, nucleoside reverse transcriptase inhibitors, and non-nucleoside reverse transcriptase inhibitors. I wrote a few custom functions to preprocess the data, including the following steps (a sketch of two of them follows the list):

  1. Replacing all “-” characters with the consensus sequence. I am guessing that they use the “-” character in place of the consensus residue to help highlight where the mutations are; it’s much more human-readable.
  2. Removing sequences that had more than one mutation present. Mostly a function of laziness more than anything else.
  3. Removing sequences with ambiguous amino acids. These are a bit harder to deal with down the road. From biological background knowledge, it’s unlikely that excluding them would be detrimental.
  4. Dropping all conserved amino acid positions. They add nothing to the analysis.
  5. Binarizing the columns. This transforms the letters of the amino acids into a first-pass feature set, in which the binarized columns indicate whether or not an amino acid is present at a given position.
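
To give a flavor of these steps, here is a hypothetical sketch of the consensus-filling and binarization functions using pandas – the function names are invented, and my actual preprocessing functions differ in their details:

import pandas as pd

def fill_consensus(df, consensus):
    # Replace '-' placeholders with the consensus amino acid at each position;
    # `consensus` is assumed to map column name -> consensus letter.
    for col in df.columns:
        df[col] = df[col].replace('-', consensus[col])
    return df

def binarize(df):
    # Drop conserved positions (only one letter observed), then one-hot
    # encode the rest: one binary column per (position, amino acid) pair.
    variable = [col for col in df.columns if df[col].nunique() > 1]
    return pd.get_dummies(df[variable])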

These are found in my module, which I imported into the Jupyter notebooks. Having futzed around for about a day copying/pasting blocks of code, I refactored the code into separate functions for readability, so that only the “business logic” is shown in the notebook.

Train/Test Split

It is standard practice to split the dataset into a training and a test set, which is quite easy to do using scikit-learn. Given an X and a Y matrix, splitting them into X_train, X_test, Y_train, and Y_test is a single function call:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_binarized, Y)

Model Training

To train models, I used the scikit-learn package. It’s useful to note that scikit-learn has a consistent API – every regressor model has a MODEL.fit() and a MODEL.predict() function. This ‘modular’ style allowed me to wrap the series of function calls into single-line functions, and thus quickly try out a variety of models to see what their out-of-box predictive power would be. Using the random forest regressor as an example, I wrapped up the training and plotting phases:

# Model Training
from sklearn.ensemble import RandomForestRegressor

kwargs = {'n_jobs': -1, 'n_estimators': 1000}
rfr, rfr_preds, rfr_mse, rfr_r2 = cf.train_model(*tts_data, model=RandomForestRegressor, modelargs=kwargs)

# Plotting
cf.scatterplot_results(rfr_preds, Y_test, rfr_mse, rfr_r2, DRUG, 'Rand. Forest', figsize=std)

Here, cf simply refers to the module I wrote to factor out the repetitive boilerplate code.
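
I haven’t reproduced cf here, but a minimal sketch of what such a train_model wrapper might look like (an assumption on my part – the real module differs) is:

from sklearn.metrics import mean_squared_error, r2_score

def train_model(X_train, X_test, Y_train, Y_test, model, modelargs=None):
    # Fit any scikit-learn regressor and score it on the held-out test set.
    m = model(**(modelargs or {}))
    m.fit(X_train, Y_train)
    preds = m.predict(X_test)
    return m, preds, mean_squared_error(Y_test, preds), r2_score(Y_test, preds)

Because every scikit-learn regressor exposes the same .fit()/.predict() pair, the same wrapper works unchanged for any of them.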

The ensemble learners also include feature importances, i.e. an identification of the columns in the data that best predict the outcome of interest. I wrapped the feature importances code to make it easy to plot:

cf.barplot_feature_importances(rfr, DRUG, 'Rand. Forest', figsize=std)
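
cf.barplot_feature_importances isn’t reproduced here either, but at its core it relies on the feature_importances_ attribute that scikit-learn’s ensemble models expose; a minimal sketch might look like this:

import numpy as np
import matplotlib.pyplot as plt

def barplot_feature_importances(model, feature_names, top_n=20):
    # Plot the top_n most important features as a bar chart.
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:top_n]
    plt.bar(range(len(order)), importances[order])
    plt.xticks(range(len(order)), [feature_names[i] for i in order], rotation=90)
    plt.ylabel('Feature importance')
    plt.tight_layout()
    plt.show()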

In particular, I used the ensemble learners, which are known to be pretty powerful for many learning tasks, and for comparison I pitted them against a number of linear models. As you can see in the notebooks, the ensemble learners outperformed the linear models, at least on a binarized amino acid feature set. This makes intuitive sense – the mapping from protein sequence to function is non-linear and highly contextual.

Neural Networks!

One of the things I wanted to highlight here was how Daniel Nouri’s nolearn package made it easy for me to start experimenting with neural networks. By no means am I a deep learning expert – I consider myself too algebra-blind (but by no means code-blind!) to learn the math behind it all. However, I know that my learning style of diving into the deep end and doing a lot of hands-on trials would help me get a fairly good intuitive grasp of how to do it. So after futzing around on a GPU cluster for a few days trying to get it configured right, I got theano, lasagne and nolearn up and running. (Note: A GPU makes light(er) work of training artificial neural nets. CPUs take around 5-10x more time. Highly recommended to use a GPU with neural nets. Shout-out to my PhD thesis committee member, Mark Bathe, for giving me access to his lab’s GPU machine!)

nolearn’s API is, by design, really close to the scikit-learn API. I find this to be a great thing – since neural networks are basically ML models, we get the familiar neural_net.fit() and neural_net.predict() function calls. (The importance of great APIs! Thank you, @dnouri!) The API also makes specifying a neural network architecture quite easy: in the simple feed-forward network that I show in the Jupyter notebook, the layers are specified as a list of named entries, and each layer’s properties are set with keyword parameters that sync up with those names (see the sketch below). The comments in some of the layers in my Jupyter notebook show my (rather simple) efforts at experimenting with different architectures. To note, the original structure of the feed-forward network is directly lifted from Daniel Nouri’s tutorial, so big thanks to him for publishing it!
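
To illustrate the pattern, here is a sketch adapted from Daniel Nouri’s tutorial – the layer sizes and training parameters are illustrative stand-ins, not the values from my notebook:

import numpy as np
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

# theano/nolearn expect float32 arrays
X = np.asarray(X_train, dtype='float32')
y = np.asarray(Y_train, dtype='float32').reshape(-1, 1)

net = NeuralNet(
    layers=[('input', layers.InputLayer),
            ('hidden', layers.DenseLayer),
            ('output', layers.DenseLayer)],
    # each layer's properties are set via '<layername>_<parameter>' keywords
    input_shape=(None, X.shape[1]),
    hidden_num_units=100,
    output_num_units=1,
    output_nonlinearity=None,  # linear output, since this is a regression
    regression=True,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=200,
)
net.fit(X, y)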

FWIW, I also have to thank other experienced Pythonistas (shout-out to Rick Landau!) for teaching me not to use binary features with neural networks, a lesson I happily disregarded for this first pass. But I’ve nonetheless remembered it, and in future iterations I will probably switch to non-binary features.

Summary & Future Work

So to summarize, the scikit-learn API makes it pretty easy to get my hands dirty doing machine learning. By keeping a scikit-learn-like API, nolearn also does a great job of keeping neural nets accessible.

What else do I think can be done? Well, there’s one idea I’ve been thinking about, to improve the regression score beyond what experimental error in the dataset may limit us to. Think convolutional networks (convnets) and their data input requirements. Basically, most image recognition convnets need a 2D image. But what if I used a convnet that can take in a 3D image instead? Calculating a numerical value, such as the electrostatic charge at every grid point in a static, modelled 3D protein structure, may be much more informative than simply using binarized amino acid columns. A recent paper that I read in BMC Genomics (say “yay” for open access!) uses a computationally efficient shortcut to encode structural features, with great success; I would definitely be game to implement their version as a first pass.

Apart from that, it’s insufficient to run on only a single train/test split, since there will always be unavoidable biases in any one trained model. A common practice, which I’ve seen in papers, is to train on k-fold splits k times and report the mean and standard deviation of the scores. I can imagine also repeating this n times, to get a better feel for the generalization error (assuming the data are representative of the population).
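
With scikit-learn this is only a few lines; here is a sketch using the same X_binarized and Y as before (note that older scikit-learn versions keep these helpers in sklearn.cross_validation rather than sklearn.model_selection):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Five-fold cross-validation: fit on four folds, score on the held-out fold,
# rotating through all five, then summarize the spread of the scores.
scores = cross_val_score(RandomForestRegressor(n_estimators=1000, n_jobs=-1),
                         X_binarized, Y, cv=5, scoring='r2')
print(scores.mean(), scores.std())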

Software Engineering Skills for Data Analytics

When you think about software engineering skills, you probably don’t think about the analytics types, or data scientist (DS) teams. This is a reasonable thought. Data scientists aren’t in the business of building software, they’re in the business of using software to analyze data. That said, I think it’s still important for a data scientist (or analytics person, for that matter), to know some basic software engineering skills. Here’s the why, followed by the what.

Continue reading

On the ‘humbleness’ of conference attendees

Conferences are made up of people, just like any other gathering of human beings.

What makes one differ from another really boils down to the people.

I read a tweet recently that described the SciPy 2015 conference as having really ‘humble’ attendees. This was exactly my feeling! It was great to see a community of developers and scientists who, knowing that even domain experts still have much to learn, choose to carry themselves in a really humble way.

I think this was one of the reasons why I really enjoyed SciPy. It was devoid of the ego that plagues other field-specific conferences. Yet another reason to go again!

SciPy 2015 – Done!

The conference is over! I get to go home now, but I will also miss being a part of the community. Hope to go back next year! Some highlights from this year:


  1. It was fun to meet so many new people from a variety of disciplines. This conference was right in the sweet spot of nerdiness and social interaction. I think I lost track of the number of business cards I handed out, but it’s definitely dented my collection!
  2. I learned a lot, particularly about the variety of packages that other people have developed to solve a variety of problems.
  3. Swag! This time it was in the form of a free book by Kurt Smith, on using Cython to speed up Python code. Also got a bunch of stickers. At PyCon, I didn’t know where to stick them, and I was hesitant to stick them to my laptop (I like a clean laptop), so I stuck them to my poster tube instead.
  4. I gave a lightning talk on Testing Data, for data integrity purposes. Later, I was contacted by Greg, who leads the Software Carpentry (SWC) initiative, about providing some material on doing data tests. Looks like it could be fun! And I cannot wait to get my own SWC instructor training – c’mon Greg!
  5. My roommate, Chuan, was a physician from China, who was in the Houston area doing a year of research. I had a great time conversing about code and culture with him, and I learned a lot about contemporary Chinese medicine from him.
  6. Finally, I participated in my first ever coding sprint! It was with the matplotlib team, and it was a great learning experience to take part in the modern git workflows of real software development. I helped with making changes to the documented examples, a task suitable for first-time sprinters (as it doesn’t risk messing up the established code logic). Seeing my first merged PR to a major software project gave me an absolute thrill :). I also got to observe how to do continuous integration with automated testing. Next conference I will most certainly make time for at least part of a coding sprint.
  7. I missed a bunch of talks on the second and third day, because I needed some headspace to finish up thinking about a paper that I am writing. However, thanks to the great efforts of the AV team that Enthought hired, it’s possible to view the talks online after the conference. This also gives those who couldn’t attend the conference a chance to access the conference materials. Kudos to them!

This year’s conference was a really great experience. I learned lots, met many people doing cool stuff with the scientific Python stack, and made new connections with them too. I would highly recommend joining SciPy 2016, and I hope to make it an annual thing along with PyCon!

SciPy 2015: Talks Day 1

Morning Session

  1. Will millennials ever get married?: Survival analysis made simple.
  2. Distarray: A tool to do distributed computing on numpy arrays.
  3. Teaching with Jupyter: nbgrader for autograding of Jupyter notebooks, and JupyterHub to provide a uniform environment for teaching and learning.
  4. Open Source data archive for melanoma screening: How to build a system that helps diagnose melanoma.

Mid-Day Session

  1. Story time with Bokeh: State of Bokeh’s plotting package.
  2. VisPy: Harnessing The GPU For Fast, High-Level Visualization: A very impressive package for doing real-time visualization!
  3. HoloViews: Interactive data visualizations made really easy.

Afternoon Session

  1. Deep learning crash course: A real crash course!
  2. PyStruct (structured prediction): a package for doing structured prediction.

I then left to get a breather, and to prepare a bit for my lightning talk!