Last year, Jon (my advisor) suggested that we should write an R21 proposal on an idea that we’ve been thinking about for a while. The core of the idea was to generate a ton of phenotypic data matched with sequences, and create predictive machine learning models for surveillance purposes. The explicit application we were targeting was prediction of influenza (and other pathogen) risk. The implicit case we were making was that data science could not continue to rely on “found” data, and that data scientists have to begin thinking through how to scalably generate high dimensional data. Now that we’ve got back the study section’s reviewer comments, I thought I’d write a few thoughts on the process, and things I’ve learned along the way.

Writing this proposal was very illuminating. The proposal writing style demands clarity, succinctness, and logic. My old writing style tended towards rambling. Over the writing process, I became more and more sensitive to long-winded sentences and paragraphs that tried to cram two ideas inside them. I also started noticing where I would instinctively write filler phrases. I think I’ve more or less eliminated such phrases, sentences and paragraphs, except when necessary.

R21-type projects have to strike a paradoxical balance. The NIH website states that the proposals should describe innovative, high-risk ideas that have potential for great impact. Layman (psycho)analysis of the NIH tells us that reviewers are going to expect things that work. This tension point is where we focused most of our writing and thinking.

Before we submitted, we took a step back and tried to evaluate, from the perspective of a reviewer, where the proposal might fall short. Because this was a single-PI submission, we predicted that the reviewers would be most concerned about a lack of machine learning expertise on the research team.

Just as we thought, this was exactly the point where the proposal fell short. I did make a number of logical errors in describing the approach, such as providing preliminary data using artificial neural networks when the approach I had written up used ensemble machine learning methods. But apart from that, the lack of a co-PI who had ML expertise was the biggest shortcoming of the proposal.

Apart from this point, though, the reviewers viewed the proposal as having great potential for impact in infectious disease and beyond. They recognized that real-time genotype-phenotype prediction could have a great impact in new disease outbreaks, and that a dataset was necessary to make this happen. They also liked the public-facing service, made freely accessible and available to anybody to use. They stated that the research environment was highly suited to the work. The experimental methodology was also regarded positively.

All in all, I’m very encouraged by the reviews. I’m currently collaborating with ML experts at Harvard on extending graph-based deep learning code to protein graphs, which I think will dovetail really well with this proposal. We’re looking forward to resubmitting, this time with collaborators, probably the guys up at HIPS.