SciPy 2015: Bokeh & Blaze Tutorial

I sat in the Bokeh/Blaze tutorial this afternoon. It was a challenging tutorial, but a good one nonetheless. I could tell that our tutorial leader, Christine, put quite a bit of effort into preparing the tutorial content, so kudos to her on that! The rest of the bokeh team were on hand to support her and act as TAs for us, which was also a plus point for Continuum Analytics.

Tutorial Pace

Initially, it was challenging for most of the crowd. I had futzed around a bit with bokeh before, so the first three exercises were straightforward, but the rest of the package was a bit cryptic, especially for someone like me coming from the matplotlib world. I actually wonder whether the way the objects are organized mirrors matplotlib somehow. For example, in matplotlib, we have the Figure object, the Axes objects, and then all of the other objects that the Axes object contains as the plot is being drawn. It seemed like the bokeh objects are organized in a similar fashion, giving coders quite a bit of flexibility in creating a plot.

Anyway, I digress. The tutorial pace was fast initially, but Christine toned it back a bit, and we still got quite a bit of opportunity to try our hand at coding. I think API familiarity was the biggest issue for those of us in the crowd; I am guessing that if we had been more familiar with the API, we would have more easily figured out the coding patterns being used, and thus had more success in our own attempts at the tutorial content.

Towards the end of the bokeh portion of the tutorial, I found it helpful that Christine pushed her content to Github, so that we could follow along without worrying too much about falling behind on the coding content. I think in the case of plots, I’m of two minds on whether it’s better to modify existing code or to try writing it from scratch – I guess they each have their own pedagogical place. As for the blaze portion, because the blaze API is so similar to the pandas API, I found it very easy to follow along as well.

Tutorial Content

My biggest takeaways were three-fold.

The first takeaway was how the exposed API allows for high-level, medium-level, and low-level construction of web figures.
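To illustrate, here is a minimal sketch using the mid-level bokeh.plotting interface; the data is made up, and (if I remember correctly) bokeh.charts and bokeh.models are the higher- and lower-level counterparts:

from bokeh.plotting import figure, output_file, show

# made-up data purely for illustration
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file('lines.html')  # write the figure to a standalone HTML file

p = figure(title='a minimal bokeh example', x_axis_label='x', y_axis_label='y')
p.line(x, y, line_width=2)  # add a line glyph to the figure

show(p)  # open the figure in a browser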

The second takeaway was how to do dynamically-updating plots, at least at a basic level. It still isn't intuitive just yet; maybe the API design has to mature a bit more for it to be so. That said, I'm stoked to learn more about how to do it!
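As a rough illustration of the idea, a dynamically-updating plot in a bokeh server app might look something like the sketch below; this uses the ColumnDataSource/curdoc pattern from later bokeh releases rather than exactly what we did in the tutorial, so treat the names and workflow as assumptions:

# run with: bokeh serve --show app.py  (workflow assumed, not from the tutorial)
from random import random
from bokeh.plotting import figure, curdoc
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=dict(x=[], y=[]))

p = figure(title='streaming points')
p.circle(x='x', y='y', source=source)

def update():
    # append one random point; the plot redraws automatically
    source.stream(dict(x=[random()], y=[random()]))

curdoc().add_root(p)
curdoc().add_periodic_callback(update, 1000)  # call update() every second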

The final takeaway was how to convert between different data sources. The blaze API is pretty cool – I hope they continue to mirror the pandas API!
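For the conversion part, my recollection is that the odo package (formerly into), which sits alongside blaze, does the heavy lifting. Here is a hedged sketch with made-up file and table names; the Data entry point and odo call signatures are from memory and may not match current releases:

import pandas as pd
from blaze import Data   # blaze's lazy, pandas-like expression API
from odo import odo      # odo converts between data sources

d = Data('accounts.csv')          # wrap a CSV in a blaze expression
expr = d[d.amount > 100]          # pandas-like filtering, evaluated lazily

df = odo('accounts.csv', pd.DataFrame)                # CSV -> in-memory DataFrame
odo('accounts.csv', 'sqlite:///db.sqlite::accounts')  # CSV -> SQLite table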

Other Thoughts

There were a lot of questions raised during the tutorial, particularly on package features. For example, convenience in doing particular operations was a big one. Can it be made easier to do callbacks without the server? Can it be made easier to make different types of plots? Can it be made more intuitive to modify a particular part of the figure (i.e. styling)?

I hope the bokeh team is listening! For our part in the community, I think it's also important to keep identifying these issues and posting them on the Github repository. I will begin to make this a habit myself :).

One thing that I think might help is to have some illustration of which objects are being drawn to the screen, and which of their attributes are modifiable. That would really help with getting familiar with the API. I remember struggling massively through the matplotlib API, simply because I couldn't map object names to figure elements and vice versa. I've put in a Github issue. For my coming projects, I'm thinking of developing a web-based "live figure" to break the mould of academic publishing. Static figures have their place, but live, "dynamic" figures for time-aware studies have their place too, because their most important, time-sensitive insights cannot be adequately summarized by a static version (e.g. the spread of Ebola as the outbreak progresses).

Overall, I'm really glad I took this tutorial, and I'm looking forward to building my next set of data visualizations in Bokeh!

SciPy 2015: Cython Tutorial

This morning, I sat in the Cython tutorial by Kurt Smith, of Enthought, at SciPy 2015. We're only halfway through as I start writing this, but it has been a really useful tutorial!

Tutorial Content

Kurt has designed this tutorial to be an intermediate tutorial, where participants are expected to have had some experience with Python before.

During this tutorial, we covered two practical ways to speed up Python code. One is to declare Python variables, functions or objects as C types using the cdef statement, and the other is to simply wrap existing C code.
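To give a flavour of the first approach, here is a minimal Cython sketch of my own (not one of the tutorial exercises), where cdef declarations let the loop run at C speed:

# fib.pyx -- compile with cythonize (e.g. via a setup.py or pyximport)
def fib(int n):
    """Return the n-th Fibonacci number, using C-typed variables."""
    cdef int i
    cdef double a = 0.0, b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a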

Tutorial Pace

It was a very well-paced tutorial, and Kurt did a great job of structuring the content such that each exercise followed roughly the same pattern. We spent approximately 20-30 minutes on each section, partly because of the really advanced questions (from my point of view) that other participants asked.

Coding Practice

We got four hands-on activities to get used to the techniques. Each practice activity was small but not trivial – we basically had to add the cdef declarations or write the wrapper code, without needing to write complex algorithms.

Reflections

Already in the first half of the tutorial, I think I have gained two superpowers. The first is that I can speed up my own code without needing to write C, if I need to, simply by adding cdef and cpdef statements where needed. The second is that I can use pre-existing C code that others have written if I need to. Both ways, I can easily speed up my computation with minimal effort. Having Python under my belt to do scientific computing was already a superpower; now, faster Python is also in the repertoire!

From an education standpoint, I think it is also true that knowing the "how-to" steps is really important in learning programming. For the majority of scientists, why certain things work on the computer might not be as important as knowing how to make certain things work – as computational scientists, we're most interested in getting answers, and preferably getting them fast. The simple, structured examples helped a lot with this.

Pushing Past Myself

Yesterday, I decided to go for a 5K run on my own. This was a break from my usual routine (which is a routine now that I've done it 4 times at regular intervals) – I solo'd it, rather than running with my running buddy Lin (I didn't ask him ahead of time for this run). I'm no runner – I only got back into it at the age of 27, a whole 9 years (since junior college) after I last ran regularly as part of martial arts training. But after struggling for over a year to get past jogging 2 km, this year I've attempted four 5 km runs, each one better than the last, but none breaking the 40-minute mark.

Yesterday, I did it.

Call me out for waxing philosophical, but I think there's stuff I experienced during this run that reflects some of what I've learned over the past 4 years of graduate school.

Importance of Starting

My initial goal was just to finish the 5K run on my own, which, considering I had started it on my own, was already an accomplishment. I'm usually a self-starter when it comes to brainiac things, but physical exercise is one area where it's a bit tougher to get me started. The real barrier is in the initial stages – just getting out the door. Sometimes it's too cold. Sometimes I'm too full. Sometimes I'm making too much progress on my code. Whatever it is, I usually can find a reason, or sometimes just an excuse, for not going out for a run.

This time round, I just went, “Screw the reasons. Yeah, I might be a bit trippy from a lack of sleep last night (finished The Martian in one sitting), and I’m mentally fatigued from running my colleague’s stats and (re)writing a Python utility to automatically design Gibson assembly primers from FASTA files. Screw it. I’m just going to run.” With that initial commitment, and the execution, soon enough I found myself way too far in to back out.

Importance of Intermediate Goals

Okay, so it's not a marathon, but 5K ain't a 100 m sprint either. It's easy to just give up running and switch to walking if there are no intermediate goals for the run. I found myself jogging at a comfortable pace when I was doing 0.7 km every 5 minutes, so I gave myself goals at 0.7 km intervals. Since RunKeeper gives me reports every 5 minutes, I paid the most attention at 15 minutes, when I was starting to feel the fatigue in my legs the most. When I learned that I was on pace (2.1 km), that gave me a boost to keep going. When I learned that at 30 minutes in I was at 3.8 km – about 400 m off from where I should have been at 0.7 km per 5 minutes – I realized I might just have a chance at breaking the 40-minute mark if I sped up a little. Setting these intermediate goals kept me paced and going.

Graduate school lasts at least 5 years on average, so intermediate goals are hugely important. Intermediate goals are those experimental or computational wins that tell you whether your project is worth pursuing or not. They're also the measurable progress markers once you've determined the feasibility of your project. And once you have those goals and start hitting them, the endorphin rush keeps you going.

Importance of Your Internal Monologue

It's easy to give up the running and just switch to walking, but some chantable mantras really helped. When I got a stitch, I started muttering, "Ok, walk it off, walk it off, walk it off…" When my legs began cramping and I had to walk, I chanted, "Just a bit faster, just a bit faster…"

In my case, it was an externalized internal monologue; others might not resort to chanting it out loud. It was borne out of a belief that I could do it. My internal monologue was that I can do it, and this time might be the time, and (not but) if I fail at it, I still have another chance.

Graduate school really consists of many small wins that accumulate into a thesis. Those small wins only come through struggling through the many small losses along the way. (I've failed spectacularly once while in graduate school, but got a second chance; I've had my submitted manuscript rejected twice editorially, and I've doubted my ability to do good research that is also recognized by others; I've nearly wiped out valuable old data through command-line incompetence as well. So there – we've all been through our own versions of failure.) My own internal monologue has changed from "success-seeking" to "resilience-building". I think resilience, in the long run, is much more valuable than success. And in graduate school, over the 5 years that we'll spend here, it's probably the most valuable thing we'll take away.

Importance of Support

For these past three runs, I've chosen to post them to Facebook, knowing full well that I've got a few friends whose running records would entitle them to laugh off my run as "peanuts". But no, that's not who they are. A few more have run at least one 5K. Some have done Iron Man and Tough Mudder races (I think that's what they're called). I'm not in this running business to run those races. Damn, I'm just trying to get fit, having subjected my body to unfit habits for the past 9 years. The support I got from my friends, at least from those who saw the post and clicked the "Like" button, made all the effort feel that much better, and just totally amplified the euphoria of finally finishing the run sub-40.

It does get tough in grad school. Experiments don't work, and models don't fit the data (or is it the other way around? lol). Some feel the heat from their main thesis advisor. Others feel neglected by an unresponsive advisor. There can be a ton of negativity during graduate school. Having a group of peers in a variety of contexts can keep your spirits up. Share your small wins with them. Keep things positive. Encourage them, just as you are encouraged by them.

Importance of Sprinting

Hitting timing targets and pacing is important. But at times, it may be more important to sprint rather than pace. During my last 5 minutes, when I could see the goal of finishing sub-40 in sight, I picked up the pace – not necessarily sprinting per se, but my stride lengthened. According to the RunKeeper data, I had gone from approximately 8 minutes per km down to 5 minutes per km. I could feel my strides get longer as I tried to push myself to finish within the 40-minute timeframe.

I think in graduate school, that plays out in many different ways. For me, it’s been when I find myself on a coding roll – hitting small win after small win at a faster pace – that I decide to keep the momentum rather than stop. For other things, it may be a grant deadline or an internal manuscript submission deadline, in which sprinting is necessary to get the thing done.

Of course, after the sprint, don’t forget to rest and recover! Which is what this July 4th weekend is going to be about. Happy 4th!

Evaluate on a Spider Plot

Observation:

  1. Humans like to measure things.
  2. Humans also like to evaluate things.
  3. Humans usually like to have only one metric for evaluations.

Example: Academic performance.

  1. Performance is measured using grades: letter, or percentage.
  2. Performance is evaluated on how good or bad the letter/percentage is.
  3. All of that gets condensed to one number: GPA/Average.

It’s easy to evaluate stuff on a single number. But it also creates some problems. Nuances are lost. Subspecialty strengths are not shown. Could we do better? I think so.

Proposal: Rather than evaluate stuff on a single line, I think we should start evaluating people on a multi-dimensional spider plot. Keep the metrics, but start evaluating people on more metrics than one.

Example: Job performance

  1. Peer-reviewed perception of performance.
  2. Company metrics attained.
  3. Cross-departmental engagements.

Basically, measure more things that are valued than just the single thing that seemingly captures everything… but doesn’t.
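To make the idea concrete, here is a minimal sketch of how such a spider (radar) plot could be drawn with matplotlib, with metrics and scores made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# made-up metrics and scores on a 0-5 scale
labels = ['peer perception', 'company metrics', 'cross-dept engagement']
values = [4, 3, 5]

# one angle per metric, then repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
values = values + values[:1]
angles = np.concatenate([angles, angles[:1]])

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
plt.show()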

Thoughts?

Test All Your Data!

On Monday, I gave my first ever lightning talk at the Boston Python meetup. Though I was kinda nervous right before the talk, overall I found it to be really fun! My main message was that we should write tests for our data when doing data analysis.

What are the benefits?

  1. Things that should be true about the data at this point, and in the future, will be formally coded in an automatically runnable script.
  2. If the data ever needs to change, this serves as an automated way of having a sanity check in place prior to running data analysis.
  3. If the data ever changes inadvertently, we have tests to ensure data integrity.

My slides are attached here (Test All The Data!), hopefully they’ll come in handy to someone, someday. Or maybe to you, my reader. :-)

Q&A

From memory, these were some of the questions that I fielded right after the talk.

Q: What happens if your data cannot fit in memory?
A: Try reading in just a random subset of the data, just the head, just the tail, or streaming it line by line or chunk by chunk.
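For instance, with pandas (file and column names made up for illustration), peeking at or streaming a large CSV might look like this sketch:

import pandas as pd

# read only the first 1000 rows (the "head")
head = pd.read_csv('big_data.csv', nrows=1000)

# stream the file in chunks of 10,000 rows, testing each chunk as it comes in
for chunk in pd.read_csv('big_data.csv', chunksize=10000):
    assert chunk['latitude'].dtype == float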

Q: When do you know you have enough tests written before you can begin doing other things, such as modelling?
A: You don’t. Just keep writing them. You’ll never have enough tests. You will discover more assumptions about your data as you go along, which you can code up in your test script.

Q: What’s your general workflow for writing tests?
A: Data analysis occupies a big chunk of my time. If I am thinking about the data, I will realize I have assumptions I’m making about the data. At those moments of realization is the best time to jot down the assumption to be encoded. Sometimes the epiphanies come in the shower or while sitting on the toilet bowl or eating a burrito. Have a notebook handy :).

Prior to embarking on the data analysis journey, I will run py.test. Following data analysis, I will run those tests one more time. After adding a new test, I will also run py.test. It's now become a reflex.

Thoughts on Open Data Science Conference

  1. Great turnout! People from everywhere – Connecticut, Michigan, Boston, Ukraine, etc. Many companies as well.
  2. Hiccups with the workshops.

On Talks

All-round mostly high-quality talks. My favorite: Booz Allen Hamilton's opener (the 4th of the opening talks), in which they encouraged data scientists to go beyond correlations and move into finding causation. It was the best point I've heard made about the data science world.

On Job Booths

Darn! It looks like the Broad is more interested in post-docs than staff computational biologists. :( I hope my impressions are wrong.

On Workshops

I would run the workshops in the style of PyCon tutorials: concentrate them, require participants to pre-register for them, and perhaps have them pay a refundable deposit for their spot. Refundable deposits via EventBrite are not impossible.

Other Suggestions

Try booking a venue much closer to the entrance, perhaps?

Food – okay, at a low registration fee it may not be possible to provide, but catering for 1,000+ people might be more cost-effective than having us buy our own lunches?

How to do Testing as a Practice in Data Analysis

Recently, I wrapped up the data analysis tasks for one of my projects, and have finally finished writing it up to the stage of being ready for submission. I'm still going to keep it under wraps until it finally gets published somewhere – I don't want to count my chickens before they hatch! But the paper isn't the main point of this post; the main point is the importance of doing tests as part of the data analysis workflow, especially for large data analyses.

Why “data testing”?

Data comes to us analytics-minded people, but rarely is it structured, cleaned, and ready for analysis & modelling. Even if it were generated by machines, there may be bugs in the data – and by bugs, I mean things we didn't expect to see in the data. Therefore, I think it's important for us to check our assumptions about the data, prior to and during analysis, to catch anything that is out of whack. I would argue that the best way to do this is by employing an "automated testing" mindset and framework. Doing this builds up a list of data integrity checks that can speed up the catching of bugs during the analysis process and can provide feedback to the data providers/generators. In the event that the data gets updated, we can automatically check the integrity of the datasets that we work with before proceeding.

How to begin data testing?

As I eventually found out through experience, doing data tests isn’t hard at all. The core requirements are:

  1. An automated testing package (e.g. pytest)
  2. A Python script called test_data.py, with functions that express the assumptions one is making about the data.

To get started, ensure that you have Python installed. As usual, I recommend the Anaconda distribution of Python. Once that is installed, install the pytest package by typing in your Terminal: conda install pytest.

Once that is done, in your folder, create a blank script called test_data.py. In this script, you will write functions that express the data integrity tests for your data.

To illustrate some of the basic logic behind testing, I have provided a Github repository of some test data and an example script. The material is available at https://github.com/ericmjl/data-testing-tutorial. To use this 'tutorial' material, you can clone the repo to disk and install the dependencies mentioned on the Github site.

The example data set provided is a Divvy data set, a simple data set on which we can do some tests. There is a clean data set and a corrupt data set, in which one cell of the CSV file has a "hello" and another cell has a "world" in the longitude and latitude columns, defying what one would expect of those two columns. If you inspect the data, you will see the different columns present.

To begin testing the data, we can write the following lines of code in test_data.py:

import pandas as pd

data = pd.read_csv('data/Divvy_Stations_2013.csv', index_col=0)

def test_column_latitude_dtype():
    """
    Checks that the dtype of the 'latitude' column is a float.
    """
    assert data['latitude'].dtype == float

If you fire up a Terminal window, cd into the data-testing-tutorial directory, and execute py.test, you should see the following terminal output: 1 passed in 0.53 seconds.

If you now load the corrupt data set into a data_corrupt DataFrame (read in the same way as the clean one) and change the function such that the assert statement is:

    assert data_corrupt['latitude'].dtype == float

The py.test output should include the error message:

> assert data_corrupt['latitude'].dtype == float
E assert dtype('O') == float

At this point, because the assertion statement failed, you would thus know that the data suffered a corruption.

In this case, there is only one data set of interest. However, if you find that it's important to test more than one file of a similar data set, you can encapsulate the check in a helper function embedded inside the test function, like so:

def test_column_latitude_dtype():
    """
    Checks that the dtype of the 'latitude' column is a float.
    """
    def column_latitude_dtype(df):
        assert df['latitude'].dtype == float
    column_latitude_dtype(data)
    column_latitude_dtype(data_corrupt)

In this way, you would be testing all of the data files together. You can also opt for similar encapsulation, abstraction, etc. if it helps with automating the test cases. As one speaker once said (as far as I can recall), if you use the same block of code twice, encapsulate it in a function.
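As an aside, a hedged alternative to the hand-rolled helper function is pytest's built-in parametrization, which runs the same test once per data set; a minimal sketch:

import pytest

# an alternative spelling of the same check, run once per data set
@pytest.mark.parametrize('df', [data, data_corrupt])
def test_column_latitude_dtype(df):
    assert df['latitude'].dtype == float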

How to continue testing?

You can extend the script by adding more test functions. py.test does automated test discovery by looking for the test_ prefix on function names. Therefore, simply make sure all of your test functions are named with the test_ prefix.

Need some ideas on when to add more tests?

  1. Any time you come up against a bug in your data, add a test.
  2. Any time you think of a new assumption you're making about the data, add a test (a couple of examples are sketched below).
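For instance, assumptions about value ranges and missing values translate directly into new test functions; here is a minimal sketch against the Divvy data, with column names assumed for illustration:

def test_latitude_range():
    # Chicago-area stations should fall within a plausible latitude band
    assert data['latitude'].between(41, 43).all()

def test_no_missing_station_names():
    # station names should never be missing (column name assumed)
    assert data['name'].notnull().all()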

Happy Testing! 😀