PyFlatten: A package for flattening nested data structures

Yesterday, I released PyFlatten to PyPI – it’s a utility that can flatten nested data structures (e.g. list of lists; dictionaries of lists of tuples) into a single 1-by-N vector, while also returning an ‘unflattener’ function that can restore the original data structure from the flattened version.
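To give a feel for the idea, here is a minimal pure-Python sketch of the flatten/unflatten pattern. It is not PyFlatten's actual code or API (see the GitHub repository for that). The function name `flatten` and its return shape are assumptions made for illustration, and the sketch only handles lists, tuples and dicts.

```python
def flatten(data):
    """Return (flat, unflatten) for nested lists, tuples and dicts.

    Illustrative sketch only; not the PyFlatten/autograd implementation.
    """
    if isinstance(data, dict):
        keys = list(data)  # remember key order so values can be re-attached
        flat, rebuild_values = flatten([data[k] for k in keys])

        def unflatten(vector):
            return dict(zip(keys, rebuild_values(vector)))

        return flat, unflatten

    if isinstance(data, (list, tuple)):
        container = type(data)  # rebuild a list as a list, a tuple as a tuple
        pieces = [flatten(item) for item in data]
        sizes = [len(piece_flat) for piece_flat, _ in pieces]
        flat = [leaf for piece_flat, _ in pieces for leaf in piece_flat]

        def unflatten(vector):
            items, start = [], 0
            for size, (_, rebuild) in zip(sizes, pieces):
                items.append(rebuild(vector[start:start + size]))
                start += size
            return container(items)

        return flat, unflatten

    # Anything else is treated as a leaf occupying one slot in the vector.
    def unflatten(vector):
        return vector[0]

    return [data], unflatten


nested = {"a": [[1, 2], [3]], "b": [(4.0, 5.0)]}
flat, unflatten = flatten(nested)
restored = unflatten(flat)                      # same structure as `nested`
scaled = unflatten([10 * x for x in flat])      # new values, same structure
```

The useful part is the last line: once you hold the unflattener, any vector of the right length can be poured back into the original shape, which is what makes the pattern handy for optimizers that only understand flat vectors.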

The source code is available on GitHub, where I make clear that I can’t take credit for writing the code – I can only credit myself for factoring it out of autograd for others to use. The real heroes are David Duvenaud, Dougal Maclaurin, and Matt Johnson of the Harvard Intelligent & Probabilistic Systems group. With David’s permission I am releasing it for public use.

Hope it comes in handy for whatever your project is!

Recent Papers

Finally, after ~2 years of learning, working, collaborating and writing, there’s been a slew of papers from our research lab going out. Really happy to have finally contributed to our collective scientific knowledge, both through my own efforts and through working with others.

Here’s the list of papers, and I’m absolutely thrilled to have contributed to the scientific story in each of those:

  • Ma, E. J., Hill, N. J., Zabilansky, J., Yuan, K. & Runstadler, J. A. Reticulate evolution is favored in influenza niche switching. Proc. Natl. Acad. Sci. U.S.A. 201522921 (2016). doi:10.1073/pnas.1522921113 [link]
  • Hill, N. J. et al. Transmission of influenza reflects seasonality of wild birds across the annual cycle. Ecology Letters (2016). [just accepted! not yet available]
  • Bahl, J. et al. Ecosystem Interactions Underlie the Spread of Avian Influenza A Viruses with Pandemic Potential. PLoS Pathogens 12, e1005620 (2016). [link]
  • Hussein, I. T. M. et al. A point mutation in the polymerase protein PB2 allows a reassortant H9N2 influenza isolate of wild-bird origin to replicate in human cells. Infection, Genetics and Evolution 41, 279–288 (2016). [link]
  • Hussein, I. T. M. et al. New England harbor seal H3N8 influenza virus retains avian-like receptor specificity. Scientific Reports 6, 21428 (2016). [link]
  • Bui, V. N. et al. Genetic characterization of a rare H12N3 avian influenza virus isolated from a green-winged teal in Japan. Virus Genes 50, 1–5 (2015). [link]

Okay, so the next question is – are you graduating? To which I will respond:



Jokes aside, I’m initiating the discussions with my advisor & committee now. The end is in sight!

SciPy 2016 Financial Aid Committee

This year, I had the privilege of serving on the SciPy (Scientific Python Conference) 2016 financial aid committee, and I will be headed to Austin, TX to present a tutorial on fundamental and statistical network analysis.

Put in its most basic terms, being on the FinAid committee was really about finding the best ways to spend somebody else’s money, otherwise known as “stewardship”. We will be releasing a document on how we did the selection, at first internally to this year’s organizing committee chairs (Aric & Prabhu), with the final goal of getting their approval to release it publicly, both as documentation and as a suggestion for next year’s committee.

I’ve already received some emails from scholarship recipients expressing their thanks for being selected. As one who was in the same spot last year, I’m really happy to pay it forward this way. Looking forward to meeting them in Austin!

Lessons learned from publishing the reassortment paper

The paper can be found here (preprint, freely available; accepted at PNAS and in press).

In no particular order, here are the things I would tell myself to do from the start.

Lesson 1. Computational work requires simulated data.

Creating simulated data is paramount. Just as writing an idea down for an audience forces out the details, creating simulated data for an algorithm forces out the assumptions.
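As a toy illustration of that point (not the simulations used in the paper), here is a minimal sketch that simulates sequence data with a known per-site mutation rate and checks that a naive estimator recovers it. The function names, the sequence length and the rate of 0.05 are all assumptions picked for the example. Note how writing the simulator forces the assumptions into the open: sites mutate independently, each site mutates at most once, and all substitutions are equally likely.

```python
import random

ALPHABET = "ACGT"


def simulate_descendant(ancestor, mutation_rate, rng):
    """Mutate each site independently with probability `mutation_rate`."""
    out = []
    for base in ancestor:
        if rng.random() < mutation_rate:
            # Always substitute a *different* base, so every mutation is visible.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def estimate_rate(seq_a, seq_b):
    """Naive estimator: fraction of sites that differ between two sequences."""
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


rng = random.Random(42)  # fixed seed so the simulation is reproducible
true_rate = 0.05
ancestor = "".join(rng.choice(ALPHABET) for _ in range(10_000))
descendant = simulate_descendant(ancestor, true_rate, rng)
estimated_rate = estimate_rate(ancestor, descendant)
print(f"true rate: {true_rate}, estimated rate: {estimated_rate:.4f}")
```

If the estimate lands far from the known truth, either the algorithm or one of its assumptions is wrong, and you find that out before touching real data.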

Lesson 2. Use pre-submission inquiries!

Don’t waste time formatting papers for journals. Write it up, write the abstract, and use pre-submission inquiries to rapidly iterate over the journals that are likely to accept or reject the paper.

Lesson 3. Use pre-print servers.

Scientists funded by public money should have their work disseminated back to the public in due time. Put written work on pre-print servers; Knowledge Without Barriers is worth it! In fact, it may even pay off to be radical, and write the whole paper in the open.

Lesson 4. Dare to try for a wide audience.

If not for Jon’s encouragement, I might not have had the guts to try for the broad readership journals. In most cases, I know it won’t pay off; in this case, I’m thankful it did.

Lesson 5. Writing with clarity is difficult.

Crafting the scientific narrative around the data was a difficult iterative process. It was difficult taking that step of chopping off a ton of derivative data that did not contribute to the scientific insight being conveyed, but in the end, I think it was necessary.

Lesson 6. The non-linear path in science is real.

Israeli scientist Uri Alon described the non-linear path that a scientist takes. At the outset, I first thought of the scientific narrative to this project as being “a better reassortment finding algorithm”, and that’s where and how I focused my efforts (small datasets, simulations). It later changed to “here’s the state of reassortment in the IRD”, and that was reflected in expanded data scope and optimizations (hacks, really) to work with computing clusters and large datasets.

But only later did I realize that the really exciting problem we were solving was “quantifying the importance of reticulate evolution in the context of ecological niche switching”, a broadly general problem of great basic scientific interest (if nonetheless lacking in public health importance).

Therein lies the tension of all creative work. One needs to convince people of one’s direction early on. Yet, later on, the direction may pivot, and one needs to be ready for that, and to convince stakeholders in it.

Lesson 7. A first-draft template for scientific project management.

I now think of it as a “folder” of stuff that should be kept together, version controlled, and done openly. This is just a first draft, still evolvable.

  1. Data – everything collected and used, in its raw-est form available.
  2. Code – for processing data, and for generating figures.
    1. New software packages should be kept as a sub-folder, isolated from the rest of the work.
    2. Software packages, for the scientist, are essentially code that provides functions for doing similar things over and over.
  3. Protocols – for experimental work.
  4. Manuscript – written openly as the work progresses.
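For what it’s worth, the folder template above could be bootstrapped with a few lines of Python. This is only a sketch: the sub-folder names in `LAYOUT` and the `create_project` helper are hypothetical choices of mine for illustration, not a fixed convention.

```python
from pathlib import Path

# Hypothetical sub-folder names mirroring the four items in the list above.
LAYOUT = [
    "data/raw",        # everything collected, in its rawest form
    "code/analysis",   # data processing
    "code/figures",    # figure generation
    "code/packages",   # new software packages, isolated from the rest
    "protocols",       # experimental protocols
    "manuscript",      # written openly as the work progresses
]


def create_project(root):
    """Create the folder skeleton under `root` and return the folders made."""
    root = Path(root)
    for sub in LAYOUT:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )
```

Calling `create_project("my-project")` and then running `git init` inside the new folder would cover the “kept together, version controlled” part.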

Lesson 8. The transferable things a graduate student has to learn.

By this, I mean stuff beyond the good ol’ “do your experiments and write your code properly”. This is an incomplete list, and I’m hoping to expand it further.

  1. Prioritizing time to get the most important stuff done, not merely being efficient in crossing off todo lists.
  2. Saying “no” to good stuff, to leave time to say “yes” to the best stuff. (same as the above point, really.)
  3. Learning how to craft a narrative that ties data together logically, and connects the data to a specifically important problem.
  4. Leveraging others’ strengths to achieve common goals, and expanding one’s own strengths.
  5. Learning how to learn and apply new things really quickly. (Something I’d like to expand on later.)

Lesson 9. Keep writing.

Ultimately, the end product of our work is a written piece. Therefore, early on, start writing the narrative as an abstract. Get the narrative out there. And then rewrite and expand on the narrative until it is self-coherent, coherent with the data, and connects to an important problem.

Lesson 10. The problem space is infinite.

I don’t believe that scientific competition is healthy. The problem space is infinite; the narrative space is as well. Getting scooped is not something to be worried about. Work openly, and rapidly advance the work at hand.

The back-story to our publication on influenza reassortment.

For the reader of a newly-published article, all that we see is precisely that – the article itself. We rarely get to hear about the back-story of that paper, or the choices that were made, the struggles involved, and the emotional ride taken. I thought I’d take the time to document what the back-story to our manuscript on influenza reassortment was like. Hopefully, it’ll let other junior trainees know that nobody’s alone in the struggle.

Now, where shall I start…?


Life after Science One: A Journey in Computation and Creativity

Writer’s note: This blog post was written for the class of Science One 2015/2016, 9 years on after my own experience in Science One 2006/2007. My classmate, Jacob Bayless, is giving a talk to them titled “Life After Science One”, and reached out to me for some perspectives on learning computation. Here’s my piece, for him, and for this year’s class and beyond.

10 years ago, I joined UBC as a student in the Science One program. That year was a fun year, and one of the best educational experiences I’ve had. During Science One, we learned to think integratively across disciplines. For example, we saw how order of magnitude estimation, a common tool in physics, could be applied to ecology, biochemistry, and thermodynamics. As another example, we learned about the application of ordinary differential equations to ecology and immunology in predator-prey systems. That way of thinking – by bridging disciplines and meshing ideas – is something I’ve re-discovered, re-encountered, and re-applied over and over in my research career to date. It doesn’t stop, even after Science One.

There was something I wish was emphasized back in my year, which you all now have the privilege of learning: computation and computational thinking.

While at UBC, I did some quantitative classes, including multivariable calculus, introductory programming (Java was all the rage back then), and statistics, but nothing more than that. By the end of my undergraduate training, I was thoroughly trained in molecular biology, but utterly helpless with programming ecosystems. I only knew how to transcribe and translate DNA sequences in Java.

Later, I transitioned into doing computational work during my PhD. I was ending my 2nd year in the MIT Biological Engineering department, and in search of a good topic to work on for my thesis. I was a 2-month-old Pythonista (this is what we Python programmers call ourselves, so you all are Pythonistas now!) at that point, and I was teaching myself Python to improve cloning workflows by automating PCR primer design. On the recommendation of a friend, I checked out the Boston Python meetup group. There, I met Giles, a software developer with a gene synthesis company, and a former Broadie. I told him this idea I had to classify all of the internal genes of the influenza virus. Looking back, it was a bad idea to attempt this for a thesis project, but Giles and I were both naive enough about the problem that we set about talking through it. I drew him a matrix, he came back with an idea, we Googled stuff up, I came up with an idea, and drew another thing.

Rinse, wash, repeat. It was such an energizing and exciting time! Giles looked at our ideas, and said, “I think you need a clustering algorithm. Try… affinity propagation. It’s a relatively new one, but nonetheless has a mature implementation available in Python. Search scikit-learn for it.” In effect, he was asking a 2-month-old Pythonista to do machine learning in Python. Well played, Giles; fast forward a few months, and he had effectively kickstarted the groundwork for my thesis, which would eventually evolve into a computational study of influenza’s capacity for reticulate evolution (through reassortment) and its importance in switching hosts (or ecological niches). My advisor Jon, though not a computationally trained person himself, trusted me with the freedom to learn, fail, and create under his mentorship, and I am hoping that in due time, we can reap the fruit of this trust.

We ended up never using machine learning in that paper, but a few things happened that got me deeper into computation. My experience working with the scikit-learn library piqued my interest in machine learning tools as applied to biology. I learned about network analysis through Allen Downey’s book, “Think Complexity”, and incorporated it as my main modelling tool. I went to PyCon 2014 and 2015 (in Montreal), giving tutorials on data analysis and network analysis. While at those conferences, I also learned a ton about how to use the scikit-learn library, and about good practices in the software development world that could be used in the research world. This year, I will be at PyCon 2016 giving a tutorial on statistical network analysis, while also continuing the learning journey in the Python and data science worlds. The learning journey continues; I soon discovered for myself that scikit-learn alone wasn’t enough, and I’ve started learning the internals of deep learning from a group up at Harvard, with the goal of applying it to developing highly interpretable models of phenotype from genotype. At the same time, the environment here (MIT & the Broad Institute) helps; there’s a Models, Inferences & Algorithms seminar series, during which we learn both the mathematical underpinnings of computational methods, and their application to biomedical problems.

For the past 9 years, the learning hasn’t stopped. I’ve found that thinking integratively, which is exactly what is taught in Science One, is a crucial element in the creative process. New ideas come from executing on the juxtaposition and composition of old ideas, in ways others have never done before. Computation has also played a huge role in enabling me to tackle problems that would otherwise be beyond my reach; from simple automation to modelling complex problems, it’s a superpower you’ll want to have. Computational thinking has helped me scale the scope of what I could solve. So, to SciOne 15/16, cherish the chance to learn what you have learned so far. As soon as you can, find an outlet, a problem on which to apply what you’ve learned; go solve a problem for the world around you, whether big or small. Find people who will teach you, trust you with resources to finish the task, and pick you up when you’re down. And while you’re at it, have a ton of fun!