Lessons learned from publishing the reassortment paper

The paper can be found here (preprint, freely available; accepted at PNAS and in press).

In no particular order, here are the things I would tell myself to do from the start.

Lesson 1. Computational work requires simulated data.

Creating simulated data is paramount. Just as writing an idea down for an audience forces out the details, creating simulated data for an algorithm forces out the assumptions.
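To make this concrete, here is a minimal sketch of what simulated data for a reassortment problem could look like. It is not the code from the paper. The helper names (`random_genome`, `reassort`, `infer_sources`), the two-parent setup, and the naive identity-based detector are all assumptions chosen for illustration. Notice how many assumptions surface just by writing it down: equal segment lengths, exactly two parents, and no mutation after reassortment.

```python
import random

# The eight influenza A genome segments.
SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def random_genome(rng, seg_len=30):
    """Make a toy genome: one random nucleotide string per segment.

    Assumption: every segment has the same length (real segments do not).
    """
    return {
        seg: "".join(rng.choice("ACGT") for _ in range(seg_len))
        for seg in SEGMENTS
    }


def reassort(parent_a, parent_b, rng):
    """Build a child genome by picking each segment from one of two parents.

    Returns the child and the ground-truth source ("A" or "B") per segment.
    Assumption: no mutation happens after the reassortment event.
    """
    child, truth = {}, {}
    for seg in SEGMENTS:
        source = rng.choice("AB")
        child[seg] = parent_a[seg] if source == "A" else parent_b[seg]
        truth[seg] = source
    return child, truth


def infer_sources(child, parent_a, parent_b):
    """Naive detector: assign each segment to the more similar parent."""

    def identity(x, y):
        # Fraction of matching positions between two equal-length strings.
        return sum(a == b for a, b in zip(x, y)) / len(x)

    return {
        seg: "A"
        if identity(child[seg], parent_a[seg]) >= identity(child[seg], parent_b[seg])
        else "B"
        for seg in SEGMENTS
    }


rng = random.Random(42)  # fixed seed so the simulation is reproducible
parent_a = random_genome(rng)
parent_b = random_genome(rng)
child, truth = reassort(parent_a, parent_b, rng)
inferred = infer_sources(child, parent_a, parent_b)
```

Because the ground truth is known, comparing `inferred` against `truth` gives a direct check on the detector, which is exactly what real data cannot provide.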

Lesson 2. Use pre-submission inquiries!

Don’t waste time formatting papers for journals. Write it up, write the abstract, and use pre-submission inquiries to rapidly iterate over the journals that are likely to accept or reject the paper.

Lesson 3. Use pre-print servers.

Scientists funded by public money should have their work disseminated back to the public in due time. Put written work on pre-print servers; Knowledge Without Barriers is worth it! In fact, it may even pay off to be radical, and write the whole paper in the open.

Lesson 4. Dare to try for a wide audience.

If not for Jon’s encouragement, I might not have had the guts to try for the broad readership journals. In most cases, I know it won’t pay off; in this case, I’m thankful it did.

Lesson 5. Writing with clarity is difficult.

Crafting the scientific narrative around the data was a difficult iterative process. It was difficult taking that step of chopping off a ton of derivative data that did not contribute to the scientific insight being conveyed, but in the end, I think it was necessary.

Lesson 6. The non-linear path in science is real.

Israeli scientist Uri Alon described the non-linear path that a scientist takes. At the outset, I thought of the scientific narrative of this project as being “a better reassortment finding algorithm”, and that’s where and how I focused my efforts (small datasets, simulations). It later changed to “here’s the state of reassortment in the IRD”, and that was reflected in expanded data scope and optimizations (hacks, really) to work with computing clusters and large datasets.

But only later did I realize that the really exciting problem we were solving was “quantifying reticulate evolution importance in the context of ecological niche switching”, a broadly general problem with great basic scientific interest (if nonetheless lacking in public health importance).

Therein lies the tension of all creative work. One needs to convince people of one’s direction early on. Yet, later on, the direction may pivot, and one needs to be ready for that, and to convince stakeholders in it.

Lesson 7. A first-draft template for scientific project management.

I now think of it as a “folder” of stuff that should be kept together, version controlled, and done openly. This is just a first draft, still evolvable.

  1. Data – everything collected and used, in its raw-est form available.
  2. Code – for processing data, and for generating figures.
    1. New software packages should be kept as a sub-folder, isolated from the rest of the work.
    2. Software packages, for the scientist, are essentially code that provides functions for doing similar things over and over.
  3. Protocols – for experimental work.
  4. Manuscript – written openly as the work progresses.
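
As an illustration, the template above could be scaffolded with a few lines of Python. This is only a sketch: the folder names (`data`, `code`, `code/packages`, `protocols`, `manuscript`) and the `scaffold` function are hypothetical choices, not a fixed convention. Each folder gets an empty `.gitkeep` file so that version control tracks it even before any content exists.

```python
from pathlib import Path

# Hypothetical folder names mirroring the four-part template above.
LAYOUT = [
    "data",           # raw data, in its rawest available form
    "code",           # data processing and figure generation
    "code/packages",  # new software packages, isolated from the rest
    "protocols",      # experimental protocols
    "manuscript",     # the paper, written as the work progresses
]


def scaffold(root):
    """Create the project folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        folder = root / rel
        folder.mkdir(parents=True, exist_ok=True)
        # Empty placeholder so git tracks the otherwise empty folder.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

Calling `scaffold("my-project")` and then running `git init` inside that folder would give a version-controlled starting point.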

Lesson 8: The transferrable things a graduate student has to learn.

By this, I mean stuff beyond the “good ol’ do-your-experiments/write-your-code properly”. This is an incomplete list, and I’m hoping to expand it further.

  1. Prioritizing time to get the most important stuff done, not merely being efficient in crossing off todo lists.
  2. Saying “no” to good stuff, to leave time to say “yes” to the best stuff. (same as the above point, really.)
  3. Learning how to craft a narrative that ties data together logically, and connects the data to a specifically important problem.
  4. Leveraging others’ strengths to achieve common goals, and expanding one’s own strengths.
  5. Learning how to learn and apply new things really quickly. (Something I’d like to expand on later.)

Lesson 9: Keep writing.

Ultimately, the end product of our work is a written piece. Therefore, early on, start writing the narrative as an abstract. Get the narrative out there. And then rewrite and expand on the narrative until it is self-coherent, coherent with the data, and connects to an important problem.

Lesson 10: The problem space is infinite.

I don’t believe that scientific competition is healthy. The problem space is infinite; the narrative space is as well. Getting scooped is not something to be worried about. Work openly, and rapidly advance the work at hand.

The back-story to our publication on influenza reassortment.

For the reader of a newly-published article, all that we see is precisely that – the article itself. We rarely get to hear about the back-story of that paper, or the choices that were made, the struggles involved, and the emotional ride taken. I thought I’d take the time to document what the back-story to our manuscript on influenza reassortment was like. Hopefully, it’ll let other junior trainees know that nobody’s alone in the struggle.

Now, where shall I start…?


Life after Science One: A Journey in Computation and Creativity

Writer’s note: This blog post was written for the class of Science One 2015/2016, 9 years on after my own experience in Science One 2006/2007. My classmate, Jacob Bayless, is giving a talk to them titled “Life After Science One”, and reached out to me for some perspectives on learning computation. Here’s my piece, for him, and for this year’s class and beyond.

10 years ago, I joined UBC as a student in the Science One program. That year was a fun year, and one of the best educational experiences I’ve had. During Science One, we learned to think integratively across disciplines. For example, we saw how order of magnitude estimation, a common tool in physics, could be applied to ecology, biochemistry, and thermodynamics. As another example, we learned about the application of ordinary differential equations to ecology and immunology in predator-prey systems. That way of thinking – by bridging disciplines and meshing ideas – is something I’ve re-discovered, re-encountered, and re-applied over and over in my research career to date. It doesn’t stop, even after Science One.

There was something I wish was emphasized back in my year, which you all now have the privilege of learning: computation and computational thinking.

While at UBC, I did some quantitative classes, including multivariable calculus, introductory programming (Java was all the rage back then), and statistics, but nothing more than that. By the end of my undergraduate training, I was thoroughly trained in molecular biology, but utterly helpless with programming ecosystems. I only knew how to transcribe and translate DNA sequences in Java.

Later, I transitioned into doing computational work during my PhD. I was ending my 2nd year in the MIT Biological Engineering department, and in search of a good topic to work on for my thesis. I was a 2-month-old Pythonista (this is what Python programmers call ourselves, so you all are Pythonistas now!) at that point, and I was teaching myself Python to improve cloning workflows by automating PCR primer design. On the recommendation of a friend, I checked out the Boston Python meetup group. There, I met Giles, a software developer with a gene synthesis company, and a former Broadie. I told him this idea I had to classify all of the internal genes of the influenza virus. Looking back, it was a bad idea to attempt this for a thesis project, but Giles and I were both naive enough about the problem that we set about talking through it. I drew him a matrix, he came back with an idea, we Googled things, I came up with an idea, and drew another thing.

Rinse, wash, repeat. It was such an energizing and exciting time! Giles looked at our ideas and said, “I think you need a clustering algorithm. Try… affinity propagation. It’s a relatively new one, but nonetheless has a mature implementation available in Python. Search scikit-learn for it.” In effect, he was asking a 2-month-old Pythonista to do machine learning in Python. Well played, Giles; fast forward a few months, and the next thing I knew, he had effectively kickstarted the groundwork for my thesis, which would eventually evolve into a computational study of influenza’s capacity for reticulate evolution (through reassortment) and its importance in switching hosts (or ecological niches). My advisor Jon, though not a computationally trained person himself, trusted me with the freedom to learn, fail, and create under his mentorship, and I am hoping that in due time, we can reap the fruit of this trust.

We ended up never using machine learning in that paper above, but a few things happened that got me deeper into computation. My experience working with the scikit-learn library piqued my interest in machine learning tools as applied to biology. I learned about network analysis through Allen Downey’s book, “Think Complexity”, and incorporated it as my main modelling tool as well. I went to PyCon 2014 and 2015 (in Montreal), giving tutorials on data analysis and network analysis. While at these conferences, I also learned a ton about how to use the scikit-learn library, and about good practices in the software development world that could be used in the research world. This year, I will be at PyCon 2016 giving a tutorial on statistical network analysis as well, while also continuing the learning journey in the Python and data science worlds. The learning journey continues; I soon discovered for myself that scikit-learn alone wasn’t enough, and I’ve started learning the internals of deep learning from a group up at Harvard, with the goal of applying it to developing highly interpretable models of phenotype from genotype. At the same time, the environment here (MIT & the Broad Institute) helps; there’s a Models, Inferences & Algorithms seminar series, during which we learn both the mathematical underpinnings of computational methods, and their application to biomedical problems.

For the past 9 years, the learning hasn’t stopped. I’ve found that thinking integratively, which is exactly what is taught in Science One, is a crucial element in the creative process. New ideas come from executing on the juxtaposition and composition of old ideas, in ways others have never done before. Computation has also played a huge role in enabling me to tackle problems that would otherwise be beyond my reach; from simple automation to modelling complex problems, it’s a superpower you’ll want to have. Computational thinking has helped me scale the scope of what I could solve. So, to SciOne 15/16, cherish the chance to learn what you have learned so far. As soon as you can, find an outlet, a problem on which to apply what you’ve learned; go solve a problem for the world around you, whether big or small. Find people who will teach you, trust you with resources to finish the task, and pick you up when you’re down. And while you’re at it, have a ton of fun!


Last year, Jon (my advisor) suggested that we should write an R21 proposal on an idea that we’ve been thinking about for a while. The core of the idea was to generate a ton of phenotypic data matched with sequences, and create predictive machine learning models for surveillance purposes. The explicit application we were targeting was prediction of influenza (and other pathogen) risk. The implicit case we were making was that data science could not continue to rely on “found” data, and that data scientists have to begin thinking through how to scalably generate high dimensional data. Now that we’ve got back the study section’s reviewer comments, I thought I’d write a few thoughts on the process, and things I’ve learned along the way.

Writing this proposal was very illuminating. The proposal writing style demands clarity, succinctness, and logic. My old writing style tended towards rambling. Over the writing process, I became more and more sensitive to longwinded sentences and paragraphs that tried to cram two ideas inside them. I also started noticing where I would instinctively write filler phrases. I think I’ve more or less eliminated such phrases, sentences and paragraphs, except when necessary.

R21-type projects have to strike a paradoxical balance. The NIH website states that the proposals should describe innovative, high-risk ideas that have potential for great impact. Layman (psycho)analysis of the NIH tells us that reviewers are going to expect things that work. This tension point is where we focused most of the writing and thinking action.

Before we submitted, we took a step back and tried to evaluate, from the perspective of a reviewer, where the proposal might fall short. Because this was a single-PI submission, we predicted that the reviewers would be most concerned about a lack of machine learning expertise on the research team.

Just as we thought, this was exactly the point where the proposal fell short. I did make a number of logical errors in describing the approach, such as providing preliminary data from artificial neural networks when the approach I had written up used ensemble machine learning methods. But apart from that, the lack of a co-PI who had ML expertise was the biggest shortcoming for the proposal.

Apart from this point, though, the reviewers viewed the proposal as having great potential for impact in infectious disease and beyond. They recognized that real-time genotype-phenotype prediction could have a great impact in new disease outbreaks, and that a dataset was necessary to make this happen. They also liked the public-facing service, made freely accessible and available to anybody to use. They stated that the research environment was highly suited to the work. The experimental methodology was also regarded positively.

All in all, I’m very encouraged by the reviews. Right now, I’m collaborating with ML experts at Harvard on extending graph-based deep learning code to protein graphs, which I think will dovetail really well with this proposal. We’re looking forward to resubmitting, this time with collaborators, probably the guys up at HIPS.


Today, I gave a webinar to the IRD and ViPR technical and advisory board meeting. There were a number of challenges to giving a webinar that I want to reflect on here, as a note to my future self and others who may read this entry. In contrast to my previous entry, I’m choosing to write this down fairly soon after doing the presentation, so that I can remember the details clearly.

In the lead-up to the webinar, I felt pretty well-prepared. I had done at least 3-4 rounds of narration on my own, so I was very familiar with the content, transitions, and flow. I also had a set of slides prepared for the Q&A session at the end. What I did not expect, but now have come to learn about, was the challenge of interacting with an audience that I can only hear but not see.

Visual communication is important. During the webinar, because I could not see my audience, I made the mistake of perceiving audience questions as being communicated in an aggressive fashion, rather than in a neutral or friendly fashion. I went into a defensive posture, and tended to respond to questions with a lot of rambling detail rather than answering the question directly. I think this was a weakness in my presentation this time round. Jon’s feedback at the end also corroborated this.

I also made the mistake of not being thoroughly prepared for the Q&A. The questions that were asked were questions that I had dealt with about 1.5 years ago, when I was in an earlier stage of the work I was presenting. However, because my focus now has shifted to other work, I was not prepared to tackle those questions again. I think at a bigger level, I didn’t do the necessary “mental model” preparation for the talk, in which I should have rehearsed multiple “what-if” scenarios. (I’ve been reading Charles Duhigg’s “Smarter Faster Better: The Secrets of Being Productive in Life and Business”, and the ‘construction of mental models’ is a core concept illustrated in there.)

To get around this, I think the next time I do a webinar, I should make the following preparations:

  1. Construct a mental model of the presentation as being a “dialogue”, rather than a “defense”. (I think it applies to all research presentations, really.)
  2. Keep a pen & paper handy to jot down the question as it’s being asked. It’s a tool to slow my mind down, to prevent myself from making snap-judgment assumptions about what is being asked.
  3. Be prepared to paraphrase the question back to the questioner prior to responding.
  4. Prepare for a broad spectrum of questions to be answered, not just those that were recently asked in my context.
  5. Recognize that it’s totally okay to have problems picked out during a presentation, even one that is more of a “sales-pitch” type.

There’s always a first, and this first webinar was definitely a big educational experience for me. Though I think I didn’t perform top-notch in the Q&A portion, I hope I still left a positive impression on the IRD/ViPR team (distributed between the JCVI and Northrop Grumman).

Paper revisions

We’ve submitted back our reassortment paper post-review. I’ve refrained from writing about this for about a week, mainly so that I can give myself enough emotional distance to reflect on it. On the whole, I’m feeling very thankful to the reviewers for their highly constructive reviews.

During the month-long review period, I was bracing myself emotionally for a thrashing review (esp. from a 3rd reviewer); after all I had seen such a fate befall two of my colleagues and a collaborator, all of whom wrote manuscripts on work that I was a part of. While there were constructive and legitimate criticisms in those reviewer reports, the number of comments that reflected more emotional ranting than reasoned argumentation left me baffled at the peer review process. (Aside: they mostly came from the fabled 3rd… and sometimes a 4th reviewer.)

When I first went over the reviews, I was stunned; in my meeting with my advisor Jon about them, he quipped that his “faith in the review process has been restored.” Firstly, we didn’t have to deal with the oft-fabled “3rd reviewer”; in the life sciences, the 3rd reviewer is often joked to be the one who shoots down a manuscript unreasonably. Secondly, the two reviewers gave us very encouraging and constructive comments. In retrospect, the reviewers’ comments were direct, thorough, and really helped us refine the scope of the claims we were making. (We may have overstepped the reasonable bounds of our claims in some places in the text.) Reviewer #1 was also very helpful in finding places where, because of familiarity with this manuscript, I had forgotten one chunk of text that should have been present. Reviewer #2 was overall very encouraging and supportive of the manuscript, and only raised points of clarification.

So, to the reviewers, thank you for the reviews. If I’m given the chance to do so, I will pass on to the next paper the same qualities of the reviews you gave me (directness, thoroughness, being reasoned). At this point, after all the editorial rejections by other journals, I’m just happy to have the paper reviewed, have its flaws pointed out. My hope is that our response was satisfactory, and I’m hoping we have a fair chance at having the reassortment paper published!


Flask, Jinja2, bootstrap.css – and thoughts on being a maker

I recently built a front-end GUI for one of my projects, which is a primer calculator for doing Gibson assembly with influenza segments.

(small detour) The back-drop to this is that 7 years after the invention of the Gibson assembly method in the synthetic biology world (where I used to be), influenza researchers are still stuck with restriction cloning. If they switched to the Gibson (and other seamless) assembly methods, their time wasted on troubleshooting restriction cloning could be drastically reduced. I wrote this simple utility to help bridge that gap.

I had learned HTML many years back. I think I was still a teenager back then. So coming back to it now, I was both delighted and surprised to see the many new developments in the markup language.

Anyways, I built the FluGibson web interface with two intents. Firstly, for it to be a quick-and-dirty, simple utility that an experimental influenza researcher could use in their day-to-day. The user provides a nucleotide sequence and a name for that DNA part (something that’s convenient for the user to remember), and selects a standard plasmid backbone that the flu community uses. It returns a series of cloning primers and sequencing primers, and a PCR protocol for amplifying the DNA parts using the Phusion polymerase. Secondly, for it to be a prototype of a tool that I hope gets implemented in the Influenza Research Database, where a researcher can do a one-click computation (and maybe even ordering) of the primers needed to clone an influenza segment from cDNA.
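
For readers unfamiliar with how Gibson primers are put together, here is a minimal sketch of the general idea. It is not FluGibson's actual code or API. The function names, the fixed `anneal_len` and `overlap_len` parameters, and the circular ordering of parts are all assumptions for illustration. A real tool would pick annealing lengths based on melting temperature rather than a fixed number of bases. Each primer is an annealing region on its own part, prefixed by a tail copied from the neighbouring part so that adjacent PCR products share overlapping ends.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gibson_primers(parts, anneal_len=20, overlap_len=15):
    """Design one primer pair per part for a circular Gibson assembly.

    `parts` is a list of (name, sequence) tuples in assembly order; the
    last part is assumed to join back to the first, e.g. insert + backbone.
    Assumption: fixed-length annealing regions, no melting temperature check.
    """
    primers = {}
    n = len(parts)
    for i, (name, seq) in enumerate(parts):
        seq = seq.upper()
        prev_seq = parts[i - 1][1].upper()        # wraps around to the last part
        next_seq = parts[(i + 1) % n][1].upper()  # wraps around to the first part
        # Forward primer: tail from the end of the previous part,
        # followed by the start of this part.
        forward = prev_seq[-overlap_len:] + seq[:anneal_len]
        # Reverse primer: end of this part plus the start of the next part,
        # reverse-complemented so that it anneals to the top strand.
        reverse = reverse_complement(seq[-anneal_len:] + next_seq[:overlap_len])
        primers[name] = {"forward": forward, "reverse": reverse}
    return primers
```

A call such as `gibson_primers([("HA", ha_sequence), ("backbone", backbone_sequence)])`, where the two sequence variables are supplied by the user, returns a dictionary with a forward and a reverse primer for each named part.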

The FluGibson front-end is a Flask app, and as such runs in any modern browser. The backend is the FluGibson Python package that I wrote; it’s still a bit incomplete in terms of the examples, but given time, I’ll get those fixed up.

Being a maker, rather than a consumer, is a very empowering thing. Being a maker enables me to build what I need to have built to do what I need to get done. It takes time to learn the skill, but in the end I think it pays off. I’d like to encourage whoever’s reading this blog: go be a maker. Go make stuff that you think has tangible value for the world. It’s fun, it’s emotionally rewarding, and may bring a financial return. 🙂