Gym Data

4 minute read

Published:

Prior to the pandemic, my school’s gym would tweet out how many people were in the weight room at a given time. This was useful for an instantaneous understanding of how busy the gym was, but was not useful if you wanted to answer how busy the gym might be later. During my PhD, I managed to create a little python script to scrape some tweets, extract the number of people in the weight room, and model the data with a linear model. The model did OK, but I wanted to take it a step further and estimate the effects of days of weeks, months, special events, and even weather (because, as the gym’s manager told me when I met with her, the weather is a major factor in people going to the gym).

So long story short, I made GyMBRo (short for Gym Monitoring By Robot). GyMBRo is some python code to scrape tweets as far back as 2014, do some lite data engineering, and fit a boosted tree to the data. GyMBRo also tweeted out predictions on the hour and had an MAE of 9 people when testing on 2019 as a hold out year.

Alas, the pandemic has ended GyMBRo’s legacy, but the data remains. I’ve since stopped caring about point predictions because I feel like I can do that satisfactorily. What I’m more interested in now is using the data to learn more about some new modelling techniques and get some reasonable estimates of prediction uncertainty. For example I recently used the data to learn about gamlss with some modest success. Is that model perfect? No. Is it good enough? Absolutely.

But there is always more I can do, so I figure I would open the data up to other people. This blog post is intended to explore the data a little bit and do a bit of knowledge transfer from me to you. Let’s begin.

A Little About The Gym

The gym is split into two areas: Weight Room (WR in the tweets) and the Cardio Mezzanine (CM). The WR is usually busier, and so the way I extract this information from the tweets is to find the largest number in the tweets. It is a heuristic and it seems to work sufficiently well. The gym tries to tweet every half hour, but will be a few minutes early/late because humans are tweeting. The gym may sometimes go hours without a tweet. Very rarely do they not tweet at all, but do have a few days in the data where there are only a couple tweets. Shown below is a plot of the weight room counts versus the time they were made. We can see some banding on the half hours. For example, we see 7 bands between 8 am and 11 am corresponding to: 8 am, 8:30 am, and so on. There is some imprecision in the timing of the tweet as I mentioned earlier, leading to fuzzy bands and not straight lines.

The WR can hold maybe 200 people before it gets uncomfortably crowded. The busiest I’ve ever seen it is about 205. The busiest times are weekdays in January and September. I anticipate this is a “new (academic) year, new me” effect which quickly vanishes come midterm time (October and February respectively). Because the gym is on campus, it primarily serves students. Hence, dates which effect students will also effect the gym. For example, the gym’s activity will slowly decrease as thanksgiving and reading week approach. This is because students start to leave through out the week leaving fewer students on campus, and hence fewer patrons. A similar but opposite effect happens as we lead away from “special days” (e.g. the gym becomes busier in the days leading into the start of the term because more students are coming back to the city) Side note: Including time_to_holiday and time_since_holiday reduced GyMBRo’s MAE from 11 to 9. Very proud of engineering that feature. The summer months (May to August) are noticeably less busy. Shown below is a plot of the median WR count for each month in each year. You can see how midterms result in fewer people going to the gym in order to study or maybe just relax. The start of the academic and calendar year are usually have the highest median count.

The Data

You can download the data here. The two most important columns are created_at which is the time stamp that the tweet was made, and y which is the associated WR count in the tweet. From the time stamp, you can extract things like year, month, time, etc. I’ve also included some weather data, but I don’t think the effects are going to be very big as compared to time/day/month effects.