Software Engineering Skills for Data Analytics

When you think about software engineering skills, you probably don’t think about analytics teams or data scientist (DS) teams. That’s a reasonable thought: data scientists aren’t in the business of building software; they’re in the business of using software to analyze data. That said, I think it’s still important for a data scientist (or any analytics person, for that matter) to know some basic software engineering skills. Here’s the why, followed by the what.


On the ‘humbleness’ of conference attendees

Conferences are made up of people, just like any other gathering of human beings.

What makes one conference differ from another really boils down to the people.

I read a tweet recently that described the SciPy 2015 conference as having really ‘humble’ attendees. This was exactly my feeling! It was great to see such a community of developers and scientists who, knowing that while they may be domain experts there is still much to learn, choose to carry themselves in a really humble way.

I think this was one of the reasons why I really enjoyed SciPy. It was devoid of the ego that plagues other field-specific conferences. Yet another reason to go again!

SciPy 2015 – Done!

The conference is over! I get to go home now, but I will also miss being part of the community. Hope to go back next year! Here are some highlights:


  1. It was fun to meet so many new people from a variety of disciplines. This conference was right in the sweet spot of nerdiness and social interaction. I lost track of the number of business cards I handed out, but it definitely made a dent in my stash!
  2. I learned a lot, particularly about the variety of packages that other people have developed to solve a variety of problems.
  3. Swag! This time it was in the form of a free book by Kurt Smith, on using Cython to speed up Python code. Also got a bunch of stickers. At PyCon, I didn’t know where to stick them, and I was hesitant to stick them to my laptop (I like a clean laptop), so I stuck them to my poster tube instead.
  4. I gave a lightning talk on Testing Data, for data integrity purposes. Later, I was contacted by Greg, who leads the Software Carpentry (SWC) initiative, about providing some material on doing data tests. Looks like it could be fun! And I cannot wait to get my own SWC instructor training – c’mon, Greg!
  5. My roommate, Chuan, was a physician from China, who was in the Houston area doing a year of research. I had a great time conversing about code and culture with him, and I learned a lot about contemporary Chinese medicine from him.
  6. Finally, I participated in my first ever coding sprint! It was with the matplotlib team, and it was a great learning experience to take part in the modern git workflow of real software development. I helped make changes to the documented examples, a task well suited to first-time sprinters (as it doesn’t risk messing up the established code logic). Seeing my first merged PR to a major software project gave me an absolute thrill :). I also got to observe how Continuous Integration with automated testing is done. At my next conference I will most certainly make time for at least part of a coding sprint.
  7. I missed a bunch of talks on the second and third days, because I needed some headspace to finish thinking through a paper that I am writing. However, thanks to the great efforts of the AV team that Enthought hired, it’s possible to view the talks online after the conference. This also gives those who couldn’t attend a chance to access the conference materials. Kudos to them!
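
The data tests I mentioned in my lightning talk can be as simple as plain assertions on a DataFrame. Here is a minimal sketch, assuming a hypothetical dataset with `age` and `country` columns (the columns and allowed values are made up for illustration):

```python
import pandas as pd

# A hypothetical dataset; in practice this would be loaded from a file
df = pd.DataFrame({
    "age": [34, 29, 41],
    "country": ["CA", "US", "CA"],
})

# Integrity checks: run as plain asserts, or collect them in a pytest file
assert df["age"].notna().all(), "ages must not be missing"
assert df["age"].between(0, 120).all(), "ages must be plausible"
assert df["country"].isin({"CA", "US"}).all(), "countries must be in the allowed set"
```

Checks like these catch silent data corruption (missing values, impossible ranges, unexpected categories) before any analysis code runs on the data.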

This year’s conference was a really great experience. I learned a lot, met many people doing cool stuff with the scientific Python stack, and made new connections with them too. I would highly recommend attending SciPy 2016, and I hope to make it an annual thing along with PyCon!

SciPy 2015: Talks Day 1

Morning Session

  1. Will millennials ever get married?: Survival analysis made simple.
  2. Distarray: A tool to do distributed computing on numpy arrays.
  3. Teaching with Jupyter: nbgrader for autograding of Jupyter notebooks, and JupyterHub to provide a uniform environment for teaching and learning.
  4. Open Source data archive for melanoma screening: How to build a system that helps diagnose melanoma.

Mid-Day Session

  1. Story time with Bokeh: State of Bokeh’s plotting package.
  2. VisPy: Harnessing The GPU For Fast, High-Level Visualization: A very impressive package for doing real-time visualization!
  3. HoloViews: Interactive data visualizations made really easy.

Afternoon Session

  1. Deep learning crash course: A real crash course!
  2. PyStruct (structured prediction): a package for doing structured prediction.

I then left to get a breather, and to prepare a bit for my lightning talk!

SciPy 2015: Computational Statistics II Tutorial

The final tutorial that I sat in today was the intermediate computational statistics tutorial. It was led by Chris Fonnesbeck, a professor at Vanderbilt University, a fellow Vancouverite, and one of the maintainers of the PyMC3 package.

Tutorial Content

In this tutorial, Chris covered:

  1. Data cleaning/preparation – using pandas.
  2. Density estimation – using the numpy and scipy packages; mechanics: method of moments and maximum likelihood estimators.
  3. Fitting regression models.
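
To sketch the density-estimation mechanics with numpy and scipy, here is how the method of moments and a maximum likelihood fit might look for a gamma distribution. The simulated data and parameter values are my own illustrative choices, not the tutorial’s dataset:

```python
import numpy as np
from scipy import stats, optimize

# Simulate data from a gamma distribution with known shape k=2.0, scale theta=1.5
rng = np.random.default_rng(42)
data = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Method of moments: for a gamma, mean = k*theta and var = k*theta^2
mean, var = data.mean(), data.var()
theta_mom = var / mean
k_mom = mean / theta_mom

# Maximum likelihood: minimize the negative log-likelihood numerically
def nll(params):
    k, theta = params
    if k <= 0 or theta <= 0:
        return np.inf
    return -stats.gamma.logpdf(data, a=k, scale=theta).sum()

res = optimize.minimize(nll, x0=[k_mom, theta_mom], method="Nelder-Mead")
k_mle, theta_mle = res.x
print(k_mom, theta_mom, k_mle, theta_mle)
```

Both estimators should land close to the true (k=2.0, theta=1.5) with a thousand samples; the moment estimates also make a good starting point for the likelihood optimization.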

Tutorial Pace

The initial part of the tutorial was heavily pandas-oriented. I think it was useful for the fairly large fraction of the class that was not well versed in pandas. In my own case, however, I skipped forward to the second notebook in order to explore a bit. The time spent on pandas was about 1 hour 45 minutes; we only got to the second topic at 2:45 pm.

The latter parts were quite useful. I think the mechanics of thinking through statistical modelling problems aren’t commonly emphasized in stats classes. As such, just as I mentioned in my review of the first tutorial, the mechanics of “how to do stuff” proved to be really helpful.

Overall Thoughts

This was the tutorial I was particularly anticipating, as I was hoping to learn the mechanics of doing Bayesian statistical analysis in PyMC3. However, the tutorial content was not that, possibly because that material was already covered last year and recorded (for YouTube posterity). Instead, I was pleasantly surprised by the content that was covered. It definitely expanded my thinking.

Two full days of learning has been quite an intellectual adventure! Many thanks to all of the tutorial leaders for their preparation and hard work; count me as one more person who’s learned lots!

SciPy 2015: Geospatial Data Tutorial

This morning, I attended the Geospatial Data tutorial. It wasn’t a filled lecture hall, but that was likely because the topic is a bit more specialized. That said, I think it was a tutorial with great content. Part of my own research work may eventually incorporate working with geospatial data – to predict influenza viral reassortment, in particular. Therefore, I was looking forward to learning more about the packages that are used to read and manipulate geospatial data. This tutorial provided the overview I was looking for.

Tutorial Content

In this tutorial, we were taught how to inspect and manipulate geospatial data in a Pythonic fashion.

The first thing we learned was how to inspect geospatial data using the fiona package. The most important point I learned was that vector geospatial data is often stored in the GeoJSON format.
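
As a rough sketch of the kind of inspection fiona enables, here is a tiny example. The file name and feature contents are made up for illustration; I write a minimal GeoJSON file by hand first so the snippet is self-contained:

```python
import json
import fiona

# Hand-write a minimal GeoJSON FeatureCollection (hypothetical data)
collection = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "origin"},
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    }],
}
with open("points.geojson", "w") as f:
    json.dump(collection, f)

# Open the file with fiona and inspect its metadata and features
with fiona.open("points.geojson") as src:
    driver = src.driver    # the format driver fiona detected
    schema = src.schema    # geometry type and property types
    feats = list(src)      # each feature carries geometry + properties

print(driver, schema["geometry"], feats[0]["properties"]["name"])
```

The `schema` and `driver` attributes are what make fiona handy for quickly checking what a geospatial file actually contains before loading it into anything heavier.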

The second thing we learned was the shapely package, which allows us to draw arbitrary shapes and perform set operations on them. This section went smoothly, and I found the simple examples provided to be quite instructive and informative.
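
A minimal example of the kind of set operations shapely supports; the shapes here are my own trivial choices, in the spirit of the tutorial’s examples:

```python
from shapely.geometry import Point

# Two unit disks whose centres sit one unit apart
a = Point(0, 0).buffer(1.0)
b = Point(1, 0).buffer(1.0)

lens = a.intersection(b)   # the overlapping, lens-shaped region
combined = a.union(b)      # both disks merged into one polygon

print(lens.area, combined.area)
```

Even trivial shapes like these make the set semantics concrete: the union’s area equals the two disk areas minus the overlap, exactly as inclusion-exclusion says it should.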

The third thing we learned was rasterio, where we learned how to load raster images of geographic regions, and combine them with their geographic metadata.

The final thing I picked up was geopandas, arguably the easiest portion to follow. I was pleasantly surprised by how many of the common, intuitive operations I could think of were represented in the API.
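
To illustrate the pandas-style workflow geopandas enables, here is a tiny sketch; the `city` column and the points are hypothetical:

```python
import geopandas as gpd
from shapely.geometry import Point

# A small GeoDataFrame: ordinary columns plus a geometry column
gdf = gpd.GeoDataFrame(
    {"city": ["A", "B", "C"]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 0)],
)

# Familiar pandas-style boolean filtering combined with a spatial operation
near_origin = gdf[gdf.distance(Point(0, 0)) < 1.5]
print(near_origin["city"].tolist())
```

The boolean-mask filtering is exactly the pandas idiom, which is what made this section so easy to follow.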

Tutorial Pace

I hit a snag using the fiona and rasterio packages, so I eventually settled on following the tutorial on the projector screen instead.

Apart from that, it was the geopandas section that was the easiest to follow, as the API was very similar to the pandas API.

Other Thoughts

Overall, I think this tutorial was a great overview of the packages that can be used to read and manipulate geospatial data. I can tell that our tutorial leader, Kelsey, put quite a bit of effort into preparing the variety of data sets and examples. Perhaps a bit more environment testing prior to the tutorial would have helped; I was getting tripped up quite a bit on fiona and rasterio installation. On a separate note, I’ve noticed that most of the installation and usage issues arose because libraries were not being found or linked properly. That may be a burden for the package authors to address, rather than the tutorial leaders.

SciPy 2015: Bokeh & Blaze Tutorial

I sat in the Bokeh/Blaze tutorial this afternoon. It was a challenging tutorial, but a good one nonetheless. I could tell that our tutorial leader, Christine, put in quite a bit of effort in preparing the tutorial content, so kudos to her on that! And the rest of the bokeh team were on hand to support her, and act as TAs for us, so this was also a plus point for Continuum Analytics.

Tutorial Pace

Initially it was challenging for most of the crowd. I had futzed around a bit with bokeh before, so the first three exercises were straightforward, but the rest of the package was a bit cryptic, especially for someone like me coming from the matplotlib world. I actually wonder whether the way the objects are organized mirrors matplotlib somehow. For example, in matplotlib we have the Figure object, the Axes objects, and then all of the other objects that the Axes object contains as the plot is drawn. It seems the bokeh objects are organized in a similar fashion, giving coders quite a bit of flexibility in constructing a plot.

Anyways, I digress. The tutorial pace was fast initially, but Christine toned it back a bit. We got quite a bit of opportunity to try our hand at coding nonetheless. I think API familiarity was the biggest issue for us in the crowd; I am guessing that if we were more familiar with the API, then we would be able to more easily figure out the coding patterns that are used, and thus have more success in our own attempts at the tutorial content.

Towards the end of the bokeh portion of the tutorial, I found it helpful that Christine pushed her content to GitHub, so that we could follow along without worrying too much about falling behind on the coding content. For plots, I’m of two minds on whether it’s better to modify existing code or to write it from scratch – I guess each approach has its own pedagogical place. As for the blaze portion, because the blaze API is so similar to the pandas API, I found it very easy to follow along as well.

Tutorial Content

My biggest takeaways were three-fold.

The first takeaway was how the exposed API allows for high-level, medium-level, and low-level construction of web figures.
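
As an illustration of the mid-level `bokeh.plotting` interface, here is a minimal sketch; the data and title are made up, and this is my own toy example rather than the tutorial’s code:

```python
from bokeh.plotting import figure
from bokeh.embed import file_html
from bokeh.resources import CDN

# Mid-level interface: create a figure, then compose glyphs onto it
p = figure(title="hypothetical demo")
p.line([1, 2, 3, 4], [4, 6, 5, 7], line_width=2)

# Render the figure into a standalone HTML document
html = file_html(p, CDN, "demo")
print(len(html))
```

The higher-level charting interfaces wrap compositions like this, while the low-level `bokeh.models` layer exposes the individual objects that `figure()` and `p.line()` create on your behalf.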

The second takeaway was how to do dynamically-updating plots, at least at a basic level. It still isn’t intuitive just yet; maybe the API design has to mature a bit more for that. Still, I’m stoked to learn more about how to do it!

The final takeaway was how to convert between different data sources. The blaze API is pretty cool – I hope they continue to mirror the pandas API!

Other Thoughts

There were a lot of questions raised during the tutorial, particularly on package features. For example, convenience in doing particular operations was a big one. Can it be made easier to do callbacks without the server? Can it be made easier to make different types of plots? Can it be made more intuitive to modify a particular part of the figure (i.e. styling)?

I hope the bokeh team is listening! For our part in the community, I think it’s important to keep identifying these issues and posting them on the project’s GitHub repository. I will begin to make this a habit myself :).

One thing that I think might help is an illustration of which objects are drawn to the screen, and which of their attributes are modifiable. That would really help with getting familiar with the API. I remember struggling massively through the matplotlib API, simply because I couldn’t map object names to figure elements and vice versa. I’ve put in a GitHub issue. For my coming projects, I’m thinking of developing a web-based “live figure” to break the mould of academic publishing. Static figures have their place, but live, dynamic figures for time-aware studies have their place too; their most important, time-sensitive insights cannot be adequately summarized by a static version – e.g. the spread of Ebola as an outbreak progresses.

Overall, I’m really glad I took this tutorial, and I’m looking forward to building my next set of data visualizations in Bokeh!