Reticulate Evolution and Microbial Ecology

I am happy to announce that, with my advisor’s (Jon Runstadler, MIT) approval, I’ve uploaded my first first-author paper to BioRXiv. It may sound surprising to document and write about the paper before peer review, rather than after it has been formally accepted by a journal. This choice is borne out of two desires.

Firstly, I am fed up with having the editorial process decide whether my work can be sent for peer evaluation, based on some perceived notion of broad impact or editorial alignment. This manuscript has gone through four editorial rejections and (hopefully not) counting… I acknowledge that “4” is a small number, but my patience and tolerance for bureaucratic delays is very low, and I think the editorial process is the big reason the work has been delayed from review. After internal review and public presentation in talks and posters, peer review is the final step in getting my work evaluated in the public domain. Really, it’s about getting feedback on whether what I’ve worked on constitutes sufficiently rigorous scientific work. I’d like to know that earlier rather than later, and I do not believe that editors should be the gatekeepers of this process. Pre-print servers let me release the work to the public domain without tough barriers to break down, which helps me achieve this goal.

Secondly, I’m really, like really, excited about the results in this paper, regardless of what others think about its impact. It’s hard to contain this PhD candidate’s enthusiasm for his work. I’m particularly proud of the work done here, especially the code written from scratch to answer this question, with everything written and designed to be reproducible. Thus, much like an artist has a natural drive towards showcasing work s/he is proud of, I have this desire to get it out there. I hope you enjoy it.

Manuscript in Brief

Title: Reticulate evolution is favoured in microbial niche switching.

Question being answered: Can we measure how important reticulate evolutionary processes that result in gene exchange are in enabling microbes to switch between ecological niches?

tl;dr answer: Yes we can, and when switching ecological niches, the ability to exchange genes is more important the greater the difference between the niches.

I think some definitions are required before I proceed.

  • Reticulate evolution: A non-tree-like evolutionary trajectory. It involves processes like recombination, reassortment, horizontal gene transfer, sexual reproduction etc. In this paper, we detect reassortment amongst influenza A viruses.
  • Ecological niche: Ecology is the study of organisms and their interactions with one another and their environment. Organisms occupy certain niches – such as particular habitats, producing and consuming particular combinations of nutrients. In this paper, an ecological niche is defined as a particular host species of the influenza A virus.
  • Switching: When a microbe switches ecological niches, it has entered a new niche – a different environment, a new set of host species it interacts with, etc. In this paper, we define niche switching as switching between different hosts, as the host is the environment for the influenza A virus.

Brief Background:

In the field of ecology, there’s this idea that gene exchange between two microbes can lead to genetic diversity, and lead to fitness advantages in new environments. To the best of my knowledge, this idea has not been tested explicitly (and I am happy to be corrected on this). In our paper, we sought to test whether, in switching between different host species, gene shuffling between influenza A viruses was more prevalent than clonal transmission. If so, it would allude to a broader principle that reticulate evolutionary events are important when microbes change ecological niches.

Why should we care?

Why should we care about this problem that, seemingly, only microbes have to deal with? Well, we know that microbes and human health intersect, and this is borne out in the scientific literature. Specifically with the influenza virus, every new pandemic virus that has emerged in human populations has been shown to be a reassortant virus. It looks quite clear to me that reticulate evolution intersects with host and microbe ecology, impacting human health.

How did we answer the scientific question at hand?

To do this, we used data from the Influenza Research Database, which houses influenza sequence data augmented with collection date and host species metadata. This makes it a really suitable dataset for answering the broader question, “Is reticulate evolution favoured in ecological niche switches?”. This is for a few reasons. Firstly, it is a densely-sampled dataset with a global scope, which allows us to detect rare reassortment events. Secondly, we can trace source-sink paths through different host species, by using time and genetic data. Finally, influenza can undergo reassortment, which is its reticulate evolutionary process.

To answer the broader question, we first had to figure out which viruses were reassortant, and where they came from. We used a previously published phylogenetic heuristic, which we modified to detect reticulate evolutionary events. We basically ask, “What are the sources of segments for each influenza virus?” We try to make sure that all segments of one “sink” virus isolate are traceable back to one older “source” isolate, such that the two isolates are of the maximal possible similarity across all 8 segments. That sink virus is denoted as a clonally transmitted virus, resulting from “clonal descent”. However, if a “source pair”, i.e. two viruses each donating part of the genome, improves the summed similarity score, then we denote that sink virus as a reassortant virus, resulting from “reassortment descent”.
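The source-search step above can be sketched in a few lines of Python. This is a hypothetical toy version, not the published implementation: `sim` maps each candidate source isolate to its eight per-segment similarity scores against the sink virus, and we check whether any pair of sources beats the best single source.

```python
from itertools import combinations

def classify_descent(sim):
    """sim: dict mapping source isolate name -> list of 8 per-segment
    similarity scores against the sink virus (made-up data shapes).
    Assumes at least two candidate sources."""
    # Best single source: maximise similarity summed over all 8 segments.
    best_single, single_score = max(
        ((src, sum(scores)) for src, scores in sim.items()),
        key=lambda x: x[1])
    # Best source pair: for each segment, take the better of the two donors.
    best_pair, pair_score = max(
        (((a, b), sum(max(sa, sb) for sa, sb in zip(sim[a], sim[b])))
         for a, b in combinations(sim, 2)),
        key=lambda x: x[1])
    # Only call it reassortment if a pair strictly improves the summed score.
    if pair_score > single_score:
        return "reassortment", best_pair, pair_score
    return "clonal", best_single, single_score
```

For example, a sink whose first four segments match source A and last four match source B would be classified as reassortant, while a sink fully traceable to one isolate stays clonal.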

How do we represent the data? We use a directed graph of influenza A virus isolates (nodes) and two types of descent (edges): clonal and reassortment. If there is more than one plausible clonal or reassortment source, the corresponding edges are weighted such that the edge weights into each node sum to 1. This encodes our knowledge that each virus is the result of one possible reassortment event, or one clonal transmission event.
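As a minimal sketch of this representation (with made-up isolate names, and assuming the networkx library), the weighting rule looks like:

```python
import networkx as nx

G = nx.DiGraph()
# Two equally plausible clonal sources for one sink isolate:
# each edge carries half the weight, so the in-weights sum to 1.
G.add_edge("A/duck/1999", "A/swine/2001", kind="clonal", weight=0.5)
G.add_edge("A/duck/2000", "A/swine/2001", kind="clonal", weight=0.5)

in_weight = sum(d["weight"]
                for _, _, d in G.in_edges("A/swine/2001", data=True))
```

Here `in_weight` comes out to 1.0, encoding one transmission event split across two candidate sources.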

What did we find?

We show in our supplementary data that our method and representation together well approximate the phylogenetic relationships calculated by more sophisticated Bayesian methods, and that they capture known patterns of influenza transmission and ‘famous’ reassortant viruses. In some sense, the method we adapted scales better than Bayesian methods, especially when the number of viral taxa is large, and I think it represents reticulate events better than trees do. However, I will state that Bayesian phylogenetic inference is still the gold standard for representing evolution on a single gene/trait.

We then wanted to assess whether reassortment was over-represented when hosts were identical or different. To calculate this, we looked at all edges where the source and sink hosts were identical, and compared them to edges where the source and sink hosts were different. When source and sink hosts were different, the proportion of edges representing reassortment events was over-represented compared to a null model in which the host species labels on the viral nodes were permuted. When the hosts were identical, reassortment was under-represented compared to the null.
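The null model comparison can be sketched as a simple label-permutation test. This is a toy version with assumed data structures (edge tuples and a node-to-host dict), not the paper’s actual code:

```python
import random

def cross_host_reassortment_fraction(edges, labels):
    """Fraction of cross-host edges that are reassortment edges.
    edges: list of (source_node, sink_node, kind) tuples."""
    cross = [kind for u, v, kind in edges if labels[u] != labels[v]]
    if not cross:
        return 0.0
    return sum(k == "reassortment" for k in cross) / len(cross)

def permutation_null(edges, hosts, n_permutations=1000, seed=42):
    """Compare the observed fraction to a null distribution in which
    host labels are shuffled across the viral nodes."""
    rng = random.Random(seed)
    observed = cross_host_reassortment_fraction(edges, hosts)
    nodes = list(hosts)
    null = []
    for _ in range(n_permutations):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        permuted = {n: hosts[s] for n, s in zip(nodes, shuffled)}
        null.append(cross_host_reassortment_fraction(edges, permuted))
    return observed, null
```

An observed fraction sitting far in the upper tail of the null distribution is what “over-represented” means here.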

We then wanted to know whether at different host group interfaces, reassortment was over- or under-represented compared to a null model. We found that once again, reassortment was under-represented compared to the null when the virus jumped between hosts of the same group, and over-represented when jumping between different host groups.

Finally, we wanted to know whether the degree of dissimilarity between two hosts mattered. We used data from the Barcode of Life Database to identify each host’s cytochrome oxidase I gene, and used percentage sequence dissimilarity as a metric of host dissimilarity. Here, we found that as hosts become more different, reassortment features more prominently during those switches.

So what does this all mean?

All-in-all, we find that reticulate evolutionary processes are indeed important in microbe ecology. Scientifically, I think the novelty here lies in quantitatively defining the importance of the ability to shuffle genes when a microbe changes ecological contexts, with the change quantitatively defined as well. From a public health perspective, if I were allowed to put on my “divergent thinking” hat, one direct application of this knowledge would be to quantify the “reticulate potential” of a pathogen – how easily it can acquire and share genes. This may inform which pathogens we would want to perform deeper disease surveillance on.

Where can I view this paper, code, and data?

In line with open science principles, we have uploaded all of the code used in the analysis to Zenodo. Their DOIs are referenced in the manuscript.

The paper is available here on BioRXiv.

The data are publicly available on the Influenza Research Database.

I hope you enjoy reading it!

Profiling PyPy vs. Python for Agent-Based Simulation



  1. Introduction:
    1. Motivation
    2. Model description
    3. Link to code
  2. Environment Setup
  3. Performance
    1. Python vs. PyPy on one parameter set.
    2. Vary number of hosts, record time.


As part of my PhD dissertation, I wanted to investigate the role of host ecology in the generation of reassortant viruses. Knowing myself to be a fairly algebra-blind person, I decided that an agent-based model (ABM) was going to be much more manageable than writing ODEs. (Actually, the real reason is that I’m modelling discrete states, rather than continuous states, but yes, I will admit that I do take longer than your average programmer with algebra.)

Model Description

Starting with our intuition of host-pathogen interactions, I implemented a custom ABM using Python classes – “Hosts” and “Viruses”.


“Viruses” have two segments, representing a segmented virus (like the influenza or Lassa virus), each with a color (red or blue), and can infect Hosts (which are likewise red or blue). Viruses of a particular color prefer to infect hosts of the same color, but can still infect hosts of a different color, just with a lower probability. If two viruses are present in the same host, then there is, at some small probability, the opportunity for gene sharing to occur.

One of the virus’ segments determines host immunity; if the virus encounters a host which has immunity against its color, then the probability of infection drastically decreases, and it is likely that the virus will eventually be cleared.


“Hosts” are where viruses replicate. Hosts gain immunity to one of the segment’s colors, after a set number of days of infection. When a host gains immunity to a particular virus color, it can much more successfully fend off a new infection with that same color. Hosts also interact with one another. They may have a strong preference for a host of the same color, a.k.a. homophily.
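To make the description above concrete, here is a stripped-down sketch of what such classes could look like. All of the class names, probabilities, and method names here are made up for illustration; see the repository for the real model.

```python
import random

class Virus:
    """Two colored segments; segment 0 (in this sketch) governs immunity."""
    def __init__(self, colors=("red", "red")):
        self.segments = list(colors)

class Host:
    def __init__(self, color):
        self.color = color
        self.immunity = set()   # segment colors this host is immune to
        self.viruses = []

    def infection_probability(self, virus, base=0.9, mismatch=0.3, immune=0.05):
        """Assumed probabilities, for illustration only."""
        if virus.segments[0] in self.immunity:
            return immune          # immunity drastically lowers the odds
        if virus.segments[0] == self.color:
            return base            # same-color host is preferred
        return mismatch            # different color: lower probability

    def try_infect(self, virus, rng=random.random):
        if rng() < self.infection_probability(virus):
            self.viruses.append(virus)
            return True
        return False

def reassort(v1, v2, p=0.1, rng=random.random):
    """With small probability, co-infecting viruses swap segment 1."""
    if rng() < p:
        v1.segments[1], v2.segments[1] = v2.segments[1], v1.segments[1]
```

A deterministic `rng` argument makes the state transitions easy to unit-test, which is also why the randomness is injected rather than hard-wired.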


My code for the simulations can be found on this Github repository. The details of the simulation are still a work in progress, as these ideas are still early stage. My goal in this blog post is to compare PyPy against CPython on performance. However, I do welcome further comments on the modelling, if you’ve taken the time to read through my code.

Code for the statistical draws can be found on this other Github repository.

Environment Setup

My CPython environment is managed by conda. (Highly recommended! Download here. Make sure to get Python 3!)

I installed pypy and pypy3 under my home directory on Ubuntu Linux, and ensured that my bash shell $PATH variable also pointed to ~/pypy[3]/bin.
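Concretely, the setup in ~/.bashrc amounts to one line (the install paths here are assumptions; adjust them to wherever you unpacked the tarballs):

```shell
# Put both interpreters on the PATH (install locations are assumptions).
export PATH="$HOME/pypy/bin:$HOME/pypy3/bin:$PATH"
```

After re-sourcing the shell, `pypy` and `pypy3` should resolve from anywhere.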


Let’s take a look at the performance of CPython vs. PyPy using pure-Python code.

Default parameters

I first started with 1000 agents in the simulation, with the simulation running for 150 time steps.

Under these circumstances, on an old Asus U30J (Core i3 2.27GHz, 8GB RAM, SSD), executing the simulation with PyPy required only 13.4 seconds, while executing with CPython required 110.5 seconds. That’s roughly an 8x speedup.

Varying number of hosts in the model

I wanted to measure the time complexity of the simulation as a function of the number of hosts. Therefore, I varied the number of hosts from 100 to 1600, in steps of 300.

Partial (mostly because of laziness) results are tabulated below. (Yes, this degree of laziness would never fly in grad school.)

(Times are in seconds; blank cells are trials I did not run.)

Agents   PyPy Trial 1   PyPy Trial 2   PyPy Trial 3   CPython Trial 1   CPython Trial 2   CPython Trial 3
1000     13.4           12.8           12.9           110.5             -                 -
700      8.63           9.02           8.65           53.7              -                 -
400      4.35           4.33           4.66           18.2              18.2              -
100      1.03           1.00           1.17           1.47              1.48              1.45

As we can see, PyPy wins when the number of iterations is large.

Statistical Draws

I use statistical Bernoulli trials (biased coin flips) extensively in the simulation. Yet, one thing that is conspicuously unavailable to PyPy users (in an easily installable format) is the scientific Python stack. Most of that boils down to numpy. Rather than fiddle with trying to get numpy, scipy, and other packages installed, I re-implemented my own bernoulli function.

from random import random

class bernoulli(object):
    """A Bernoulli random variable with success probability p."""
    def __init__(self, p):
        super(bernoulli, self).__init__()
        self.p = p

    def rvs(self, num_draws):
        # Each draw is 1 with probability p, and 0 otherwise.
        draws = []
        for i in range(num_draws):
            draws.append(int(random() < self.p))
        return draws

This is almost a drop-in replacement for scipy.stats.bernoulli. (The API isn’t exactly the same.) I wanted to know whether the bernoulli function I wrote performed better than the scipy.stats version. I therefore set up a series of small tests to determine at what scale of function calls it makes more sense to use PyPy vs. CPython.

I then wrote a simple block of code that times the Bernoulli draws. For the PyPy version:

from stats.bernoulli import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)
end = time()

print(end - start)

And for the CPython/scipy version:

from scipy.stats import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)   
end = time()

print(end - start)
(Times are in seconds.)

Bernoulli Draws   PyPy + Custom (1)   PyPy + Custom (2)   PyPy + Custom (3)   CPython + SciPy (1)   CPython + SciPy (2)   CPython + SciPy (3)
1000000           0.271               0.241               0.206               0.486                 0.513                 0.481
100000            0.0437              0.0421              0.0473              0.0534                0.0794                0.0493
10000             0.0311              0.0331              0.0345              0.00393               0.00410               0.00387

As we can see, scipy is quite optimized and outperforms the custom function at smaller numbers of draws. Things only get better for PyPy as the number of draws increases.


Some things that I’ve learned from this exercise:

  1. For pure-Python code, PyPy can serve as a drop-in replacement for CPython.
  2. Because of the JIT compiler, PyPy is blazing fast when doing iterations!
  3. numpy is not, right now, easily pip-installable. Because of this, the rest of the Scientific Python stack is also not pip-installable in a PyPy environment. (I will admit to still being a learner here – I wouldn’t be able to articulate why numpy doesn’t work with PyPy out-of-the-box. Experts chime in please?)

Some things I hope will happen:

  1. Let’s port the scientific Python stack code to make it PyPy compatible! (Yeah, wishful thinking…)
  2. Alternatively, let’s hope the numba project allows JIT compilation when using Python objects instead.

As is the usual case for me, starting a new project idea gave me the impetus to try out a new thing, as I wouldn’t have to risk breaking workflows that have worked for existing projects. If you find yourself in the same spot, I would encourage you to try out PyPy, especially for pure-Python code (i.e. no external libraries are used).

Predicting HIV Drug Resistance Phenotype from Genotype

Note to Reader: I’d highly suggest reading this blog post on the left half of your screen, and have the Jupyter notebook on the right half of your screen. Makes things a bit easier to follow.

I recently have been writing a proposal to conduct some experiments to predict viral RNA polymerase activity, in some standardized unit, from protein genotype. The main application of this would be to be able to conduct quantitative surveillance in a precise fashion. For example, with HIV, as treatment progresses, the virus accumulates mutations that confer resistance. For a physician treating a patient infected with HIV, would it be possible to determine, from sequence data alone, what would be the degree of predicted viral resistance for that patient’s virus?

Knowing full well that this problem has been tackled many times in the past from multiple angles, I wanted to know how easily I could set up an ML workflow to go from sequence to predicted drug resistance. The goal of this blog post is to document how easy it was for me to get up and running, using really well-written Python packages.

Raw Code

All of my code can be found on my Github repository. You can also jump directly to the Jupyter notebook that I reference code from in this article.

Data Source and Preprocessing

I sourced the data from the Stanford HIV Drug Resistance Database. Specifically, I downloaded the high quality, filtered set of Genotype-to-Phenotype mappings, for protease inhibitors, nucleoside reverse transcriptase inhibitors, and non-nucleoside reverse transcriptase inhibitors. I wrote a few custom functions to preprocess the data, including the following steps:

  1. Replacing all “-” characters with the consensus sequence. I am guessing that they use the “-” character to highlight where the mutations are; it is much more human-readable.
  2. Removing sequences that had more than one mutation present. This was more a function of laziness than anything else.
  3. Removing sequences with ambiguous amino acids. These are a bit harder to deal with down the road, and from biological background knowledge, it’s unlikely that excluding them would be detrimental.
  4. Dropping all conserved amino acid positions. They add nothing to the analysis.
  5. Binarizing the columns. This transforms the letters of the amino acid into a first-pass feature set, in which the binarized columns indicate whether or not an amino acid is present at a given position or not.
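Steps 4 and 5 are essentially one-liners with pandas. Here is a toy sketch with made-up positions and residues (the real data follows the Stanford HIVDB column layout):

```python
import pandas as pd

# Hypothetical toy alignment: one row per sequence, one column per position.
seqs = pd.DataFrame({"P10": ["L", "L", "I"],
                     "P46": ["M", "I", "M"],
                     "P90": ["L", "L", "L"]})

# Step 4: drop fully conserved positions -- they carry no signal.
variable = seqs.loc[:, seqs.nunique() > 1]

# Step 5: binarize into one indicator column per (position, amino acid).
X_binarized = pd.get_dummies(variable)
```

P90 drops out because every sequence carries “L” there, and each remaining position expands into indicator columns like P10_L and P10_I.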

These are found in my module, which I imported into the Jupyter notebooks. Having futzed around for about a day copying/pasting blocks of code, I refactored the code into separate functions for readability, so that only the “business logic” is shown in the notebook.

Train/Test Split

It is standard practice to split the dataset into a training and a test set, and this is quite easy to do using scikit-learn. Given an X and a Y matrix, to split them into X_train, X_test, Y_train, and Y_test, simply do the function call:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_binarized, Y)

Model Training

To train models, I used the scikit-learn package to help. It’s useful to note that scikit-learn has a consistent API – every regressor model has a MODEL.fit() and a MODEL.predict() function. This ‘modular’ style allowed me to wrap the series of function calls into single-line functions, and thus quickly try out a variety of models to see what the out-of-box predictive power would be. Using the Random Forest Regressor as an example, I wrapped up the training and plotting phases:

# Model Training
kwargs = {'n_jobs':-1, 'n_estimators':1000}
rfr, rfr_preds, rfr_mse, rfr_r2 = cf.train_model(*tts_data, model=RandomForestRegressor, modelargs=kwargs)

# Plotting
cf.scatterplot_results(rfr_preds, Y_test, rfr_mse, rfr_r2, DRUG, 'Rand. Forest', figsize=std)

Here, cf simply refers to the module I wrote to refactor out the repetitive boilerplate code needed.
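For reference, here is a minimal sketch of what a wrapper like train_model might look like. This is my reconstruction for illustration, not the actual cf module:

```python
from sklearn.metrics import mean_squared_error, r2_score

def train_model(X_train, X_test, Y_train, Y_test, model, modelargs=None):
    """Fit any scikit-learn regressor class and score it on the held-out set.

    Returns the fitted model, its test-set predictions, the MSE, and the R^2.
    """
    reg = model(**(modelargs or {}))
    reg.fit(X_train, Y_train)
    preds = reg.predict(X_test)
    return reg, preds, mean_squared_error(Y_test, preds), r2_score(Y_test, preds)
```

Because every scikit-learn regressor exposes the same fit/predict interface, the same four-line body works for random forests, gradient boosting, and the linear models alike.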

The ensemble learners also include feature importances, i.e. an identification of the columns in the data that best predict the outcome of interest. I wrapped the feature importances code to make it easy to plot:

cf.barplot_feature_importances(rfr, DRUG, 'Rand. Forest', figsize=std)

In particular, I used the ensemble learners, which are known to be pretty powerful for learning tasks. And for comparison, I pitted them against a number of linear models as well. As you can see in the notebooks, the ensemble learners outperformed the linear models, at least for a binarized amino acid feature set. This makes intuitive sense – protein sequence to function is non-linear, and highly contextual.

Neural Networks!

One of the things I wanted to highlight here was how Daniel Nouri’s nolearn package made it easy for me to start experimenting with neural networks. By no means am I a deep learning expert – I consider myself too algebra-blind (but by no means code-blind!) to learn the math behind it all. However, I know that my learning style of diving into the deep end and doing a lot of hands-on trials would help me get a fairly good intuitive grasp of how to do it. So after futzing around on a GPU cluster for a few days trying to get it configured right, I got theano, lasagne and nolearn up and running. (Note: A GPU makes light(er) work of training artificial neural nets. CPUs take around 5-10x more time. Highly recommended to use a GPU with neural nets. Shout-out to my PhD thesis committee member, Mark Bathe, for giving me access to his lab’s GPU machine!)

nolearn’s API is, by design, really close to the scikit-learn API. I find this to be a great thing – since neural networks are basically ML models, we get the familiar neural_net.fit() and neural_net.predict() function calls. (The importance of great APIs! Thank you, @dnouri!) The API also makes specifying a neural network architecture quite easy. For example, in the simple feed-forward network architecture that I show in the Jupyter notebook, network layers are specified as a list of parameters, and each layer’s properties can be specified using named parameters that sync up with the layer names. The comments on some of the layers in my Jupyter notebook show my (rather simple) efforts at experimenting with different architectures. To note, the original structure of the feed-forward network is lifted directly from Daniel Nouri’s tutorial, so big thanks to him for publishing it!

FWIW, other experienced Pythonistas (shout out to Rick Landau!) taught me not to use binary features with neural networks – a lesson I happily disregarded for this first pass, but have nonetheless remembered! In future iterations, I will probably switch to non-binary features.

Summary & Future Work

So to summarize, the scikit-learn API makes it pretty easy to get my hands dirty doing machine learning. By keeping a scikit-learn-like API, nolearn also does a great job of keeping neural nets accessible.

What else do I think can be done? Well, there’s one idea I’ve been thinking about, to improve the regression score beyond what experimental error in the dataset may limit us to. Think convolutional networks (convnets) and their data input requirements. Basically, most image recognition convnets need a 2D image. But what if I used a convnet that can take in a 3D image instead? Calculating a numerical value, such as the electrostatic charge at every grid point in a static, modelled 3D protein structure, may be much more informative than simply using binarized amino acid columns. A recent paper that I read in BMC genomics (say “yay” for open access!) uses a computationally efficient shortcut representation to get structural features encoded as well, with great success; I would definitely be game for implementing their version as a first pass as well.

Apart from that, it’s insufficient to evaluate on only a single train/test split, as any one split carries unavoidable biases into the trained model. In the papers I’ve read, the common practice is to train on k-fold splits, k times, and report the mean and standard deviation of the scores. I can imagine also repeating this n times to get a better feel for the generalization error (assuming the data are representative of the population).
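With scikit-learn, that k-fold procedure takes only a few lines. Here is a toy example on synthetic data (the model, data, and numbers are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic near-linear regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ rng.rand(5) + 0.01 * rng.randn(100)

# k = 5: each sample is held out exactly once across the folds.
scores = cross_val_score(Ridge(alpha=0.1), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Reporting the mean and standard deviation over the folds, rather than one split’s score, gives a much more honest picture of generalization.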

Software Engineering Skills for Data Analytics

When you think about software engineering skills, you probably don’t think about the analytics types, or data scientist (DS) teams. This is a reasonable thought. Data scientists aren’t in the business of building software, they’re in the business of using software to analyze data. That said, I think it’s still important for a data scientist (or analytics person, for that matter), to know some basic software engineering skills. Here’s the why, followed by the what.

Continue reading

On the ‘humbleness’ of conference attendees

Conferences are made up of people, just like any other group of human beings grouping together.

What makes one differ from another really boils down to the people.

I read a tweet recently that described the SciPy 2015 conference as having really ‘humble’ attendees. This was exactly my feeling! It was great to see such a community of developers and scientists who, knowing that while they may be domain experts there is still much to learn, choose to carry themselves in a really humble way.

I think this was one of the reasons why I really enjoyed SciPy. It was devoid of the ego that plagues other field-specific conferences. Yet another reason to go again!

SciPy 2015 – Done!

The conference is over! I get to go home now, but I also will miss being a part of the community. Hope to go back next year!


  1. It was fun to meet so many new people from a variety of disciplines. This conference was right in the sweet spot of nerdy-ness and social interaction. I lost track of the number of business cards I handed out, but it definitely dented my collection!
  2. I learned a lot, particularly about the variety of packages that other people have developed to solve a variety of problems.
  3. Swag! This time it was in the form of a free book by Kurt Smith, on using Cython to speed up Python code. Also got a bunch of stickers. At PyCon, I didn’t know where to stick them, and I was hesitant to stick them to my laptop (I like a clean laptop), so I stuck them to my poster tube instead.
  4. I gave a lightning talk on Testing Data, for data integrity purposes. Later, I was contacted by Greg, who leads the Software Carpentry (SWC) initiative, about providing some material on doing data tests. Looks like it could be fun! And I cannot wait to get my own SWC instructor training – c’mon Greg!
  5. My roommate, Chuan, was a physician from China, who was in the Houston area doing a year of research. I had a great time conversing about code and culture with him, and I learned a lot about contemporary Chinese medicine from him.
  6. Finally, I participated in my first ever coding sprint! It was with the matplotlib team. It was a great learning experience, participating in the actual modern git workflows of software development. I helped with making changes to the documented examples, a task suitable to first timer sprinters (as it doesn’t risk messing up the established code logic). Seeing my first merged PR to a major software project gave me an absolute thrill :). I also got to observe how to do Continuous Integration with auto testing. Next conference I will most certainly make time for at least part of a coding sprint.
  7. I missed a bunch of talks on the second and third day, because I needed some headspace to finish up thinking about this paper that I am writing. However, thanks to the great efforts of the AV team that Enthought hired, it’s possible to view them online after the conference. This also gives those who couldn’t attend the conference a chance to access the conference materials. Kudos to them!

This year’s conference was a really great experience. I learned lots, learned about the many people doing cool stuff with the scientific Python stack, and made new connections with them too. I would highly recommend joining SciPy 2016, and I hope to make it an annual thing with PyCon!