Hiring anyone is hard, but “data science” poses its own particular challenges:
What are the prerequisites? Are there requirements around programming languages, experience, or domain expertise? Are we looking for someone to build explanatory models or to maximize predictive accuracy with black-box models and large-scale ensembles? Do they need fluency in algebraic topology for novel pre-processing approaches, or fluency with real-time distributed computing environments?
We encourage intelligent and driven university students to apply as summer interns in order to learn and improve their data science skills through an intense hands-on work experience.
AFS is a quickly growing small company with a small team, so we only take on interns that we can treat as full team members. Our interns work on projects appropriate for their skill levels and interests, and those projects go through our full QA process and get deployed to production.
We could exclusively hire interns who have previous experience with machine learning or “data science”…but we’d miss out on great candidates who are smart and driven to learn.
So how can we structure a data science interview for students who may not know data splitting, feature engineering, pre-processing, model building and hyperparameter optimization, model stacking, and withholding set validation?
Many data science interviews take the Kaggle approach: the company supplies some data and potential interns and employees are asked to build a predictive model to be judged on root-mean-square error (RMSE) or mean absolute error (MAE).
This is great for pushing the envelope of predictive techniques, but it covers only a small part of what data scientists actually do in industry.
Data science is central to all of our services at Analytical Flavor Systems. We didn’t build a data science team to optimize our product's marketing spend, sales funnel, or client retention – we built a data science team to build our product.
That means we need data scientists who can understand our clients (food and beverage producers) – data scientists who can take a nebulous business goal, create a set of quantitative decision metrics, and build predictive models to optimize those metrics. That’s a tall order!
The extensive role of data scientists at AFS means that we invest in their education across:
Sensory perception: standard sensory science, so they know what we’re improving and replacing
Tasting experiences: so they appreciate the products we work on and understand how the data is collected
Production knowledge: test batches in our R&D brewery and roastery, so they understand the data they work with and how our predictions impact a client’s process
Data science tear-downs: a meeting focused on a particular project where the team collaboratively attempts to find and fix problems, try new techniques, and debate the philosophical implications of a model’s construction
Interns are full-time members of the team - we invest in them too.
We collect human sensory data (flavor profile reviews), environmental data, and production data at critical control points throughout the production process, giving us hundreds of variables and an unparalleled look at how to model and optimize the creation of beer, coffee, spirits, wine, chocolate, etc. But we often have only a few reviews per critical control point on each batch in production.
We don't have big data (yet). We have Fat Data.
Fat data requires less complexity on the engineering side (no Hadoop, HBase, Hive, or Pig) and a lot more thought on the pre-processing, feature selection, feature engineering, and model building side. Fat data poses problems like multicollinearity, sparse subspace sampling, and observations with unique categorical combinations.
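To make the multicollinearity problem concrete, here is a small, self-contained R sketch on made-up data (the variable names are illustrative): two nearly collinear predictors destabilize an ordinary least squares fit, which is one reason a projection method like partial least squares helps on fat data.

```r
# Synthetic illustration: two nearly collinear predictors.
set.seed(7)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

cor(x1, x2)              # correlation is essentially 1
coef(lm(y ~ x1 + x2))    # large, offsetting x1/x2 coefficients
```

The individual coefficients become unstable and nearly uninterpretable even though their combined effect is well estimated; with hundreds of correlated sensory and production variables, this failure mode is the norm rather than the exception.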
Using only partial least squares regression for generalized linear models (the plsRglm package in R), build an ensemble model to predict the quality score given to each wine from the Vinho Verde region of Portugal (see the data bullet in the Requirements section below to download the datasets). The data consists of chemical tests on wines and an overall assessment score averaged from at least three professional wine tasters. This is interesting data for AFS as the lack of consistent preferences among professional tasters is one of the reasons our company exists.
The rubric for assessment is explained in the Selection Criteria section below. The model should be trained and will be tested on both red and white wines after the two datasets are joined. We will split the supplied data into an 80% training set and a 20% hold-out validation set before running your training script with a random seed, and we will use mean absolute error (MAE) as the performance metric.
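The end-to-end shape of a submission can be sketched as follows. This is a hedged illustration, not a reference solution: it runs on synthetic stand-in data rather than the real red/white CSVs, and the nt = 3 component count and gaussian family are arbitrary choices for demonstration only.

```r
# Hedged sketch of the required pipeline on synthetic stand-in data
# (substitute the joined red/white wine data from the Requirements section).
# nt = 3 and the gaussian family are illustrative, not recommendations.
library(plsRglm)

set.seed(42)
n <- 200
wine <- data.frame(matrix(rnorm(n * 6), n, 6))
names(wine) <- paste0("chem", 1:6)            # stand-ins for the chemical tests
wine$quality <- round(5 + wine$chem1 + rnorm(n))

idx   <- sample(nrow(wine), 0.8 * nrow(wine)) # 80% train / 20% hold-out
train <- wine[idx, ]
test  <- wine[-idx, ]

fit  <- plsRglm(quality ~ ., data = train, nt = 3, modele = "pls-glm-gaussian")
pred <- predict(fit, newdata = test, type = "response")
mean(abs(pred - test$quality))                # MAE, the rubric's metric
```

A real submission would replace the synthetic frame with the joined datasets and justify the number of components and any ensembling in the accompanying write-up.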
You will submit:
Your script for training the model or ensemble in a .R file
A short and informal paper explaining your exploratory analysis, findings, and reasoning for: data splitting, feature engineering, pre-processing, model building, hyperparameter optimization, model stacking, and withholding set validation
Deadline: The deadline for a summer internship is the last day of April. Applicants are accepted on a rolling basis.
Programming Language: R
Model type: Partial least squares regression for generalized linear models (plsRglm), or an ensemble of multiple plsRglm models
Performance Metric: mean absolute error (MAE)
Runtime: We must be able to run the entire ensemble model resulting from your training script through a single call to predict(model, newData = newData). We will split the supplied data into an 80% training set and a 20% hold-out validation set before running your training script with a random seed.
Luck: To avoid penalizing the seriously unlucky with one bad random split, we will train and run your model three times and take the average MAE across those runs.
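One way to satisfy the single-predict-call runtime requirement is to wrap the fitted models in an S3 object with its own predict() method. In this sketch, toy lm() fits stand in for plsRglm models, and simple prediction averaging is an assumed combination rule, not something the requirements mandate.

```r
# Wrap fitted models so the whole ensemble answers one predict() call.
# lm() fits stand in for plsRglm models; averaging is an assumed rule.
make_ensemble <- function(models) {
  structure(list(models = models), class = "avg_ensemble")
}

predict.avg_ensemble <- function(object, newData, ...) {
  preds <- sapply(object$models, function(m) predict(m, newdata = newData))
  rowMeans(as.matrix(preds))
}

set.seed(1)
d   <- data.frame(x = runif(100))
d$y <- 2 * d$x + rnorm(100, sd = 0.1)
model <- make_ensemble(list(lm(y ~ x, data = d[1:50, ]),
                            lm(y ~ x, data = d[51:100, ])))

newData <- data.frame(x = c(0.25, 0.75))
predict(model, newData = newData)   # one call covers the whole ensemble
```

Whatever combination rule you choose, packaging it behind a predict() method keeps your ensemble compatible with our evaluation harness.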
We will assess your technical data science interview submission on the following criteria:
Ability to explain your reasoning and results: 50%
Clean and commented code: 20%
Accuracy (measured as 1 - MAE) – complexity penalty: 30%
We will use MAE + 1% per model in the ensemble as our evaluation metric. This creates diminishing marginal returns: you will be penalized for chasing marginal accuracy gains by stacking tens of models instead of using better modeling techniques.
It almost sounds like a fun idea to set up a leaderboard for submissions and get everyone into that competitive spirit, but we’re not going to do that! Remember: the written explanation of your decisions when building the model is worth almost twice as much as the performance of the model itself.
Let’s keep life simple: just email me at JasonCEO@Gastrograph.com. Put “data science internship submission” in the subject line unless you think of something far wittier. If you’re cold-applying, perhaps introduce yourself in the email…
The following links might be helpful when running the interview gauntlet:
The Kaggle Ensemble Guide: http://mlwave.com/kaggle-ensembling-guide/
An example of prior work on this dataset: https://rpubs.com/Daria/57835
Categorical encoding of dummy variables: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
Applied Predictive Modeling: http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485
Looking for a full-time data science position? We’re hiring! Trying to build your own data science team or hire data science interns? Feel free to steal this interview process! Are you a candidate with a particularly great solution you’d like to share with the world? Post it on your personal blog and we’ll take a look there! If you’ve read this far, you should definitely reach out and say hi.