Demetri Pananos Ph.D - Did You Do Your Homework?

This tweet made its way to my corner of the Bird Site. I’m a glutton for math homework punishment, so I attempted it (and failed), but do not mistake my interest for legitimizing the tweet.

Frankly, I think asking questions found in Wooldridge is not a good litmus test of anything aside from your memory of undergrad. But that isn’t the point of this post. The point is to reply to Senior Powerpoint Engineer’s (SPE for short here, or should I call them @ryxcommar) blog post, which came as a response to the aforementioned tweet.

If I can summarize the post (which you should read), SPE makes the following points:

The question should not be whether you should ask this tricky math question to candidates, but whether you really need to hire a data scientist rather than some other role like a software engineer, analytics engineer, data analyst, ML engineer, etc.

The term “data scientist” signals some sort of advanced knowledge in applied math/stats. Many bootcamps go about teaching modelling material as if that is most of the job (it isn’t) and even then most data scientists don’t actually understand one of the most foundational models well enough to even begin answering a homework question correctly. If they aren’t toiling and they don’t understand the theory, what are they really doing?

Additional anecdotes about how simple heuristics beat some data scientist’s logistic regression.

The post pairs nicely with his thread on why 2023 is a bad year to bootcamp your way to a data science job, which hits on some of the same points.

In brief, I agree with SPE and we’ve all heard it before. Data science is suffering from being a bullshit job, but clearly quantitative thinking is useful in business. What should be taught if not XGBoost APIs and linear algebra theory? What are we really looking for?

Value Proposition

My main contention here is the value proposition of a data scientist. If my summary is fair, then SPE’s point is that data scientists are not in the trenches working with the data; instead they “twiddle with models in Jupyter”. But even then they don’t understand those models deeply enough (as evidenced by their inability to do homework questions). If they aren’t working with data and they can’t do science, what are they doing? What is their value? You probably don’t know, so think about whether you need a data scientist or some other position.

That feels reductive. Those can’t be the only two measures of a good data scientist, though I grant that we need to answer “what do you do here”. I can only speak to my (limited) experience in tech, but many of the problems facing data scientists are not purely tech or math problems, so it doesn’t make sense to value data scientists on their ability to solve just those kinds of problems. In my experience, the majority of problems data scientists face are people problems.

Along the lines of people problems, the value proposition of a data scientist, in my humble opinion, is to get people to think about their problems scientifically. That is a lot harder than it sounds when the people you work with have not spent 4-10 years ostensibly¹ doing science. But it is also more valuable than just writing dbt models and dashboarding.

Here is an example from my work about getting people to think scientifically. Data analysts routinely made dashboards for the A/B tests we ran prior to my ownership of experimentation protocols. Many of the top-of-funnel metrics we measured were defined in terms of one another (e.g. Signup Rate = Signups per Visitor and Activation Rate = Activations per Signup, the latter being the ratio of Activations per Visitor to Signups per Visitor). Fine from a business perspective.

When it comes to inference, I don’t need to tell you that if you measure those two metrics as I’ve written them, then anything that increases signup rate is going to decrease activation rate unless activations per visitor rise just as fast, so it’s going to look like everything is a stalemate. But I did have to share that with most everyone I talked to about experimentation, data analysts included. And it wasn’t just one person or team doing it, it was nearly the entire company. Measuring the right thing (and most importantly, knowing how to explain that to people, because it isn’t as easy as mentioning conditioning on a post-treatment variable) is like 20% math problem and 80% people problem due to all the stakeholder management that has to go into explaining why we are now changing the metrics, and how the numbers don’t match, no we can’t keep doing it like we’ve been doing it, what the source of truth is now, and on and on and on. It seems cheesy for me to say “look how my scientific thinking has changed the company for the better” but I think this is the type of work we should be looking for and training data scientists for. This is the answer to “what do you do here”, and what data scientists should be doing instead of screwing around with ChatGPT.
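To make the stalemate concrete, here is a minimal sketch with made-up numbers. The `Arm` class and every count in it are hypothetical, chosen only for illustration and not taken from any real test. It assumes the treatment pulls in 200 extra signups who activate at a lower rate than the existing ones, which is the usual way conditioning on a post-treatment variable bites.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One arm of a hypothetical A/B test (all counts are invented)."""
    visitors: int
    signups: int
    activations: int

    @property
    def signup_rate(self) -> float:
        # Signups per Visitor
        return self.signups / self.visitors

    @property
    def activation_per_signup(self) -> float:
        # Activations per Signup: conditions on signing up, which the
        # treatment itself influences (a post-treatment variable)
        return self.activations / self.signups

    @property
    def activation_per_visitor(self) -> float:
        # Activations per Visitor: denominator is fixed by randomization
        return self.activations / self.visitors


control = Arm(visitors=10_000, signups=1_000, activations=400)
# Assumed treatment effect: 200 extra signups, 40 of whom activate
treatment = Arm(visitors=10_000, signups=1_200, activations=440)

for name, arm in [("control", control), ("treatment", treatment)]:
    print(
        f"{name:>9}: signup rate={arm.signup_rate:.3f}, "
        f"activation/signup={arm.activation_per_signup:.3f}, "
        f"activation/visitor={arm.activation_per_visitor:.3f}"
    )
```

Under these assumed counts the treatment wins on signups per visitor and on activations per visitor, yet activations per signup falls, so a dashboard built on the two original metrics reports one up and one down. Measuring both metrics per visitor keeps the denominator something the treatment cannot move.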

I think SPE would call this a “job perk”: thought leadership for internal decision making. That assumes you’re in a company that thinks scientifically and you’re free to do higher level stuff. We should all be so lucky. Thought leadership isn’t a perk, it is the job. You’re leading people to think scientifically. This is the disconnect I think SPE mentions. If you buy my rant, the title data scientist can imply an advanced understanding of applied math without its holders being expected to answer homework questions, because remembering tricks from econometrics class isn’t something you use regularly when thinking scientifically. Candidates and interviewers aren’t being measured for/measuring the right thing (the ability to think scientifically and convince/teach others to follow suit) and here we are talking about whether remembering \(E[\hat{\beta}_1] = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}\) is a good barometer for being a data scientist or not.
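For anyone who would rather check the identity than remember it, here is a small simulation sketch. It assumes a simple linear model with standard normal X and independent standard normal noise; the function names, the seed, and the parameter values are all made up for illustration. It computes the slope as sample covariance over sample variance and averages it over many simulated datasets.

```python
import random
from statistics import fmean


def slope_cov_over_var(x, y):
    """Simple-regression slope as sample Cov(x, y) / sample Var(x)."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var  # the 1/(n-1) factors cancel in the ratio


def simulate_slope(rng, beta0, beta1, n):
    """Draw one dataset from y = beta0 + beta1*x + noise and estimate the slope."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
    return slope_cov_over_var(x, y)


rng = random.Random(0)  # fixed seed so the run is reproducible
true_beta1 = 2.0        # hypothetical true slope
estimates = [simulate_slope(rng, 1.0, true_beta1, n=50) for _ in range(2000)]
mean_estimate = fmean(estimates)

print(f"true slope: {true_beta1}, average estimate: {mean_estimate:.3f}")
```

Individual estimates bounce around, but under these assumptions their average should sit close to the true slope, which is all the expectation statement says.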

We Agree

None of this undermines SPE’s points; it isn’t like bootcamps are teaching how to think scientifically. But it isn’t news that data science is overhyped, nor is it news that fresh grads aren’t well equipped for the job, nor is it news that correct linear algebra problem sets are not a good indicator for success on the job. I can’t rebut something we all agree is true, but I can ask for solutions, next steps, jeez, anything other than another “zero interest rate phenomenon” joke.

If I could amend SPE’s blog post, it would be as follows:

But I don’t think “should you ask it?” is the right question. The real question here is: “Should you be hiring a data scientist? If so, are you willing to listen to what they have to say? Are you capable of recognizing scientific ability and probing to further test it?” And the answer to that is probably not. – (Emphasis mine).

Footnotes

¹ I say “ostensibly” because I’m not sure what I did was “science” as much as it was survival.