Final Bootcamp Blitz: Proposal to Presentation in 4 weeks

In August, I was nearly brought to tears trying to write my first Python function. By November, fellow classmate Matteo Jucker Riva and I presented results from a deep learning computer vision algorithm we had created to help researchers from the University of Zurich and ETH study climate change in Switzerland.

The Propulsion Academy Data Science Bootcamp can be divided into two parts. Part one is a two-month-long cram session that begins with a review of an entire semester’s worth of statistics in six days and ends with us creating artificial neural networks for natural language processing. But employers care not only about what a potential employee studied but also about what that person can do, which is why Propulsion Academy created part two: final projects. Students take raw industry datasets and attempt to create a refined product in under 4 weeks.

The combined herbaria of the University of Zurich and ETH provided Matteo and me with 6000 high-resolution photos of plant specimens in the Brassicaceae family collected in Valais, Switzerland over the last 150 years. Reviewing herbarium specimens for flowers or fruits and combining that with collection dates is an excellent way to assess climate change’s effect on the Swiss environment, but reviewing specimens by hand is time-consuming. Dr. Alessia Guggisberg at the herbarium asked us to:

  1. Classify whether a plant specimen had flowers or fruits, and
  2. Count the number of flowers and fruits.

Raw images were 6000x4000 pixels. This sample is cropped to 1000x1000 pixels. The two plants on the left mostly have fruits (seedpods). The third plant has a few flowers. Often, far fewer than 1% of the pixels in any image were relevant to the model.

Data Preparation (week 1):

The Brassicaceae family is incredibly diverse and includes agricultural crops such as cabbage, radish, mustard, and wasabi. The wild species are as diverse as the agricultural ones, so our first step was to show the computer what it was looking for. In our case that meant days of segmenting the images by outlining flowers and seedpods. We were also lucky enough to find a dataset from a similar study and bring those images and outlines into our workflow.

Segmenting flower heads using datatorch.io

Experimentation (weeks 2 + 3):

Bootcamp education is beautifully broad, but we needed to go deep. We began our image classification challenge by following some of the model frameworks we’d learned in the course but soon figured out that wasn’t going to be enough. Matteo experimented with PyTorch and FasterNet models while I dove into UNet models. Both came close to solving the dual nature of our problem, but we were hitting roadblocks that we didn’t have the time to get around. At the eleventh hour (aka shortly before we were going to be cut off and had to start making our presentation) we discovered the beauty of Mask R-CNN’s ability to answer both questions in one model and went all in.
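
To make the "both questions in one model" idea concrete, here is a minimal sketch of the principle, not the project's actual code. An instance segmentation model returns one prediction per object it finds, so the same list of predictions can answer "are there flowers or fruits?" and "how many?". The label names, the `summarize_detections` helper, the 0.5 score threshold, and the example scores are all assumptions made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical label names; the post does not say how classes were encoded.
CLASSES = ("flower", "fruit")


def summarize_detections(
    detections: List[Tuple[str, float]],
    score_threshold: float = 0.5,  # assumed cutoff, not a value from the project
) -> Dict[str, Dict[str, object]]:
    """Turn per-instance detections into a presence flag and a count per class.

    `detections` holds one (label, confidence) pair per predicted instance,
    which is the kind of output an instance segmentation model produces.
    """
    # Keep only confident predictions of known classes, then count per class.
    counts = Counter(
        label
        for label, score in detections
        if score >= score_threshold and label in CLASSES
    )
    # Presence (the classification question) falls out of the count.
    return {
        name: {"present": counts[name] > 0, "count": counts[name]}
        for name in CLASSES
    }


# Made-up predictions for a single specimen photo.
example = [("fruit", 0.91), ("fruit", 0.84), ("flower", 0.35), ("fruit", 0.62)]
summary = summarize_detections(example)
print(summary)
```

In this toy example the low-confidence flower is dropped by the threshold, so the photo would be reported as having fruits but no flowers. The threshold trades missed objects against false detections.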

Our experimental process looked something like this.

Presentation (week 4):

We correctly classified whether a photo had flowers or fruits 67% of the time and counted 34% of the flowers or fruits on the specimens. On Friday, November 20th, we had the pleasure of presenting our results to a Zoom audience of over 150.

Lessons and Improvements (ongoing):

Throughout the bootcamp we heard repeatedly that data preparation is going to be an enormous part of our new lives as data scientists. We didn’t take the advice to heart until this project. All our ongoing improvements revolve around increasing the quality of our training data: segment more photos, segment a greater diversity of species, and segment those photos at a finer level.

Matteo and I are incredibly grateful to the staff of Propulsion Academy for teaching us everything we needed to tackle such a challenge, Barry Sunderland of the ETH Library Lab for writing the project proposal, and Dr. Alessia Guggisberg for providing the questions and data. We will use every lesson from the last few months as we continue diving into new problems as professional data scientists.

Aspiring data scientist. Previously: ecologist, researcher, and science communicator. Want to work at the intersection of data and communications.