Dividing People into Small Groups with Python

In our Bible Study small group, I have found through empirical observation that when the group size is large (>5 people) and homogeneous (all guys/girls, all believers, all Bible study leaders), Bible study tends to be either too flat or too chatty, too boring or too distracted, and all-round just not beneficial for learning. On the other hand, when the group is small (3-5 people) and diverse (guys & girls, baptized + seekers together, counsellors spread out), learning takes place. (Outside of Bible study groups, I find this to be true anyways.)

It’s challenging to do this division by hand though, as there can always be subtle biases that creep in. So I decided to use a bit of information theory and Python to do this division in an unbiased fashion. The result? My own hand-crafted small group web app that keeps track of group members in a larger group and uses a simple genetic algorithm to shuffle them into optimally diverse groups of people.

The data categories used are simple, and by no means do I use them to “categorize” people for privileges; they’re only used for assigning responsibilities in the group. We use gender (M, F), faith status (baptized, believer, seeker, unknown), and role (facilitator, counsellor, none). The algorithm essentially works as follows:

  1. Determine the number of small groups to keep the group size within 3-5 people.
  2. Randomly distribute individuals across the groups, by first distributing the facilitators, and then everybody else.
  3. Until max number of tries has been reached:
    1. Scoring Step: Compute Shannon entropy within each group, and sum up Shannon entropy scores.
    2. Proposal Step: Propose to swap two random individuals.
    3. Comparison Step: Compute new Shannon entropy score under the swap. If it does not decrease Shannon entropy and passes the “exclusion criteria”, accept swap. Else, pass.
  4. Return all the small groups.
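The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the app: the dictionary keys (`name`, `gender`, `faith`, `role`), the helper names (`num_groups`, `divide`), and the choice to score a group by summing the entropy of each of the three attributes are all assumptions made for the example. The exclusion check is left out here for brevity.

```python
import math
import random
from collections import Counter

# Assumed attribute names; the real app may store these differently.
ATTRIBUTES = ("gender", "faith", "role")


def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_score(groups):
    """Sum, over all groups and attributes, of the within-group entropy."""
    return sum(
        shannon_entropy([person[attr] for person in group])
        for group in groups
        for attr in ATTRIBUTES
    )


def num_groups(n_people, max_size=5):
    """Fewest groups such that no group exceeds max_size people."""
    return max(1, math.ceil(n_people / max_size))


def divide(people, max_tries=1000, seed=None):
    rng = random.Random(seed)
    k = num_groups(len(people))

    # Step 2: deal out facilitators first, then everybody else.
    facilitators = [p for p in people if p["role"] == "facilitator"]
    others = [p for p in people if p["role"] != "facilitator"]
    rng.shuffle(facilitators)
    rng.shuffle(others)
    groups = [[] for _ in range(k)]
    for i, person in enumerate(facilitators + others):
        groups[i % k].append(person)

    # Step 3: propose random swaps; keep any that do not lower the score.
    score = total_score(groups)
    for _ in range(max_tries):
        if k < 2:
            break  # nothing to swap between
        a, b = rng.sample(range(k), 2)
        i = rng.randrange(len(groups[a]))
        j = rng.randrange(len(groups[b]))
        groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
        new_score = total_score(groups)
        if new_score >= score:
            score = new_score
        else:
            # Undo the swap if it made things worse.
            groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
    return groups
```

Because a swap exchanges one person from each of two groups, the group sizes set in step 2 never change during the search.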

A note on the comparison step: In other algorithms I’ve seen, a proposed change is accepted only if the score (Shannon entropy) increases, but in this case, not decreasing is ‘good enough’. I have my engineer hat on.

I added a way to include “exclusion criteria”, such as the scenario where it would be inappropriate to put two people in the same group, for example, where there is a simmering conflict in the midst, or where the relationship between the two could be distracting to learning. Right now, that functionality is baked into the back-end, but I am designing an implementation to make it accessible through the front-end.
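One way such a check could look is sketched below. The representation is an assumption for illustration only: each exclusion is a `frozenset` of two names, each person is a dictionary with a `name` key, and `violates_exclusions` is a hypothetical helper, not a function from the app.

```python
def violates_exclusions(groups, exclusions):
    """Return True if any excluded pair ends up in the same group.

    groups: list of groups, each a list of dicts with a "name" key (assumed).
    exclusions: iterable of frozensets, each holding two names (assumed).
    """
    for group in groups:
        names = {person["name"] for person in group}
        # An excluded pair is violated when both names are in this group.
        if any(pair <= names for pair in exclusions):
            return True
    return False
```

In the comparison step, a proposed swap would then be accepted only if it does not decrease the score and `violates_exclusions` returns `False` for the resulting groups.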

The web app is written in Python, and uses only two packages: tinydb and Flask. The front-end uses HTML with Jinja2 templating and Bootstrap CSS. I wrote the GUI using Flask because I didn’t need the fancy stuff that Django offers, and Flask was simple enough for me to run locally. I opted for tinydb only because it was an even simpler, lightweight version of a database (stored as a JSON file), and was sufficient for what I needed too. Of course, I’m quite sure this can be re-implemented in Django/SQLite, and made infinitely more fancy :). The code for small-group is available on GitHub, along with instructions for how to use it. Enjoy!

PyCon Tutorials (Days 1 & 2)

In a flash, the PyCon 2016 tutorials are over!

My session on network analysis was on the first day, in the morning. Overall, things went smoothly, and because of the competency level of the class, I was able to cover all of the material, including the ones that we usually don’t have enough time to get to (computational statistical inference on graphs, and bipartite graphs).

Most of the time, at the end of the workshop, I hear feedback on how to improve specific material, and details on what was new or useful for the participants. However, this time round, there was little of it. I was initially a bit disappointed, as I usually find ways to use the feedback to decide what to tweak for the next iteration. Later, over lunches and coffees (or in other tutorials), some participants did share their thoughts and feedback, and it was overall positive. Last night, I also shared some of these thoughts with David Baumgold (who led a Git tutorial) over a group trip to Powells, and that was a nice cheer-up as well.

I learned a bit about Lektor from David. Seriously thinking about moving my personal site out of WordPress and into Lektor, and off BlueHost and onto DigitalOcean. Speed, cost, and customizability are what I’m really thinking about right now.

Speaking of Powells: that bookstore is big! I had to ask for a bit of help to find the “science” section:

Me: “Hi! I’m looking for books in the sciences. Where should I go?”
Staff: “Hmm, did you mean ‘science fiction’ or the ‘hard sciences’?”
Me: “Ah yes, I meant the ‘hard sciences’. I’m a ‘hard scientist’ myself.”

They had books from conference proceedings, “open problems in computational sciences”, deep physics books… I was wowed, but didn’t buy anything from that section; I ended up getting two books on Minecraft instead. 😛 (They’re not for me, they’re for a colleague’s son.)

I also saw Panic’s sign – you can actually control the colour of their sign through a web app! Totally agree with David – that corner of Portland is one ‘magical’ corner.

On the second day, I decided to help out Prof. Allen Downey with his tutorials. I know Allen through the Boston Python User Group and from being a PyCon tutorial instructor before. His tutorials are always fun, hands-on, entertaining, and most importantly, a chance to learn something new. I like his philosophy too – leveraging the very practical skill of computation to learn more abstract things like statistics. He led two tutorials, one on Bayesian statistics and one on Computational statistics. Highly recommend attending his tutorials at PyCon!

It just so happened that my allergies flared up today as well. Two people, one an attendee and one an AV staff member (Jacob), offered ibuprofen to help deal with the general discomfort. Much kindness shown here.

Looking forward to the next few days of talks. Keeping the learning going!

Portland, OR

Now, my first thoughts on Portland… It’s a lovely city, not unlike Austin but without Austin’s heat and humidity. Very biker friendly, and the public transit beats Boston’s hands-down. The TriMet, as they call it, is modern, clean, efficient, and cost-effective. The residents here did a great job investing in communal infrastructure early on. It’s a sprawling city too; according to some who drive around here, it takes about 30 min to drive from one corner of the city’s quadrants to the opposite corner, and that’s taking the highway. Tattoos, alternative music, and dark symbolism all seem to be in vogue out here. People are laid back and very friendly. There are lots of independent businesses; I don’t see the symptoms of massive commercialization that I see in other big American cities. From talking with locals who are helping with PyCon, there’s a general focus on “the good life”, rather than the focus on “achievement” found on the East Coast.

No doubt it’s an attractive place for many, but I’m not sure that, given my own personal life history, I would be able to stay sane in Portland. Without intending to devalue ‘the good life’, there are some for whom a mission/purpose-driven life matters more than the ‘good life’ that Portland offers. People in Boston give the city a different vibe, one that is nerdier and health-oriented. Knowing that I’m in a place with lots of really smart and articulate people keeps my natural ego in check as well, something I think can only be beneficial in the long run. I’m a person excited by the possibilities offered by ideas, and Boston is brimming with them.

Life’s all about tradeoffs, I guess. 🙂

PyCon 2016

Stage 2 of the conference tour starts today. I am at Logan right now, waiting for the JetBlue flight to Portland, OR, for PyCon 2016. There, I will be delivering a tutorial on network analysis, as well as helping Allen Downey as a TA for his Computational Statistics tutorial (assuming enough people join in).

I hope to see Portland as well; in my mind it’s always been one of those cities whose ‘green’ culture is worth experiencing.

Boston, I’ll be back in a week. ‘Till then!


I had a ton of fun delivering a workshop on network analysis fundamentals at ODSC East yesterday! This is my bullet-point journal version of my thoughts over ODSC East.

  1. Learned a ton from Bang Wong (of the Broad Institute) and Mark Schindler (of GroupVisual) about DataViz & User Experience (DVUX).
  2. Didn’t expect that the workshop would be over-subscribed! I was expecting the topic to be a bit more niche. Lots of kind tweets and feedback. Materials are all available on GitHub.
  3. Invited to contribute content to DataCamp on network analysis. Timeline approximately Fall 2016 or Spring 2017. Strongly considering it.
  4. Talked one-on-one with a manager in the Facebook infrastructure data science team. FB gets a lot of stick for privacy reasons; after this talk, I realize they have bigger, altruistic plans that rarely get talked about. The short story is that there is always some degree of tradeoff, and it sometimes takes a company amassing resources in order to do things that require a big jump rather than incremental improvements.
  5. I like Dask, great talk by Matthew Rocklin (slides). Time to try it out.
  6. Great to see biological applications featured at ODSC, especially on Sunday. Neglected tropical diseases and big microscopy analysis.

the tweets

The tweets are archived here. It’ll serve as my “feel good” memory stack if I ever need to return to it.

New funding from the Broad Next10!

(As has now become somewhat habitual, I’m reporting a week late to get some clarity in thought.)

Really humbling and yet exciting week last week. With my colleagues Tony and Jared (Blainey lab) at the Broad, we won a $40,000 Broad Next10 (Bn10) grant to conduct exploratory and hopefully “catalytic” experiments to develop influenza polymerase phenotyping assays that can be done at scale and at low cost, with the stretch goal of making it plug-and-play for other viral polymerases. We also won another $40,000 Bn10 grant to scale the phenotyping of influenza neuraminidase drug resistance to oseltamivir (a.k.a. Tamiflu).

It’s humbling because finally there’s a team of people who think these ideas are worth taking a risk on, and are willing to take a quantifiable $80,000 (total) gamble on it. It’s also an exciting time, because I have been working on the (cheaper) computational side of things for a while, and I have become convinced that endless optimization of the computation cannot beat simply having better data measured, and this funding enables us to run some experiments towards scalably generating that data. We have one year to accomplish this goal, and we are planning to treat this money as “accelerator” money to get a minimum viable prototype out and ready.