Data Science Interview Preparation: Introduction
Generally speaking, there are 2 kinds of data scientists at top technology companies, such as the FANG companies (Facebook, Amazon, Netflix, Google). Here is how I would separate them into 2 camps.
1. Data scientists: Analytics/Inference track
These are people whose daily jobs typically involve collecting data, running analytics and experiments, and sharing reports and scorecards on product health, customer churn, and marketing campaign performance.
After data analysis, they are often asked to develop innovative ideas and proposals to improve product features, and to validate those ideas using techniques such as A/B testing.
In summary, these people use data to ‘tell a story’ and to drive business decisions.
Most of them come from statistics, mathematics, economics, psychology, physics, or other quantitative but non-computer-science backgrounds.
The salary range for this track is usually lower than that of a software engineer or a machine learning engineer, typically by 15%–20%, but the advantage of this track is that it is much easier to get into and to find a job in.
It can also be a great stepping stone for those interested in moving to a machine learning track later on.
2. Data scientists: Machine Learning Engineering/Algorithm Development track
People who have been very successful in this track are usually hardcore computer scientists or software engineers. They understand basic or even advanced machine learning theory, implement ideas, and make things happen.
The biggest advantage machine learning engineers and algorithm developers have is that, thanks to their computer engineering background, they can quickly turn ideas into prototypes and write production-level code that efficiently deploys machine learning models to production or external customer-facing environments.
The salary for the machine learning engineer track is at least on par with the software engineer track, if not much higher. The bar to get in is high: it usually requires not only a good understanding of machine learning theory and practice but also solid software engineering skills.
Table 1: Skills comparison for the 2 data scientist tracks: Analytics/Inference vs. Machine Learning/Algorithm
This article will focus on the analytics/inference track and walk you through my 7 steps to prepare for a data scientist job interview.
1. SQL
SQL is a must-know programming language for any data analytics professional.
However, many college graduates and young professionals start their job search without a solid understanding of SQL, or struggle with coding questions, which ultimately costs them their dream jobs.
The SQL interview sometimes goes by other names, such as Technical Analysis. During an interview at a FAANG (*) company, you will be asked to perform a series of SQL operations to extract data and insights, and to answer follow-up questions about the company's products.
(*) FAANG: Facebook, Amazon, Apple, Netflix, and Google
Sample question: Given a table of user sign-up dates and registered countries, write a query that returns the daily number of newly joined users over the last 30 days for our top 2 countries.
1. user_id | BIGINT
2. joined_at | DATE
3. country | VARCHAR
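One possible way to approach this question is sketched below, using Python's built-in sqlite3 module so the query can be run end to end. Several things here are assumptions rather than part of the question: the table name `users`, the reading of "top 2 countries" as the two countries with the most sign-ups inside the 30-day window, the `:today` parameter (passed in explicitly so the result is reproducible), and the sample rows. In an interview, it is worth confirming the definition of "top" with the interviewer before writing anything.

```python
import sqlite3

# Assumption: the table is called "users"; the question only gives the columns.
QUERY = """
WITH recent AS (
    -- sign-ups in the 30 days up to and including :today
    SELECT user_id, joined_at, country
    FROM users
    WHERE joined_at > DATE(:today, '-30 days')
      AND joined_at <= :today
),
top_countries AS (
    -- assumption: "top" means most sign-ups within the same window;
    -- ties are broken alphabetically so the result is deterministic
    SELECT country
    FROM recent
    GROUP BY country
    ORDER BY COUNT(*) DESC, country
    LIMIT 2
)
SELECT joined_at, country, COUNT(*) AS new_users
FROM recent
WHERE country IN (SELECT country FROM top_countries)
GROUP BY joined_at, country
ORDER BY joined_at, country
"""


def daily_new_users(conn, today):
    """Return (joined_at, country, new_users) rows for the top 2 countries."""
    return conn.execute(QUERY, {"today": today}).fetchall()


def demo():
    """Build a tiny in-memory table with made-up rows and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (user_id BIGINT, joined_at DATE, country VARCHAR)"
    )
    rows = [
        (1, "2024-03-01", "US"),
        (2, "2024-03-01", "US"),
        (3, "2024-03-01", "IN"),
        (4, "2024-03-02", "IN"),
        (5, "2024-03-02", "BR"),
        (6, "2024-01-01", "BR"),  # outside the 30-day window
        (7, "2024-01-01", "BR"),  # outside the 30-day window
        (8, "2024-01-02", "BR"),  # outside the 30-day window
    ]
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    return daily_new_users(conn, "2024-03-10")


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Note that days with zero sign-ups do not appear in the output. If the interviewer wants one row per day, a calendar table or generated date series would have to be joined in.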
How to prepare
a. If you are an absolute beginner:
Consider taking an online SQL course to learn the fundamentals, and then jump into coding practice.
A resource to consider: Cracking the SQL Interview for Data Scientists, which takes you step by step from basic SELECT statements to advanced WINDOW functions, with many coding assignments to reinforce your learning.
b. If you are an experienced SQL user:
There is no better way to prepare for a SQL interview than practicing coding exercises.
A resource to consider: sqlpad.io, where you can practice and solve 80 SQL coding interview questions.
The 80 questions range from basic SELECT statements to advanced window functions, which will get you ready to answer a wide range of SQL interview topics.
(Full disclosure: I am the author of both the Cracking the SQL Interview for Data Scientists course and sqlpad.io.)
c. Pay special attention to WINDOW functions.
WINDOW functions are a family of SQL utilities that often come up during a data scientist job interview.
Writing a bug-free WINDOW function query can be quite challenging for any candidate, especially those who are just getting started with SQL. It takes time and practice to master these functions.
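To make the idea concrete, here is a small sketch of two common window functions, a running total with SUM() OVER and a per-day ranking with RANK() OVER, again run through Python's sqlite3 module. The table `daily_signups` and its rows are hypothetical and exist only for illustration. Window functions require SQLite 3.25 or newer, which recent Python builds normally bundle.

```python
import sqlite3

WINDOW_QUERY = """
SELECT
    country,
    day,
    new_users,
    -- running total of sign-ups within each country, ordered by day
    SUM(new_users) OVER (
        PARTITION BY country
        ORDER BY day
    ) AS running_total,
    -- rank of each country on a given day (1 = most sign-ups that day)
    RANK() OVER (
        PARTITION BY day
        ORDER BY new_users DESC
    ) AS rank_on_day
FROM daily_signups
ORDER BY country, day
"""


def window_demo():
    """Run the window query against a tiny, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_signups (country VARCHAR, day DATE, new_users INT)"
    )
    conn.executemany(
        "INSERT INTO daily_signups VALUES (?, ?, ?)",
        [
            ("US", "2024-03-01", 10),
            ("US", "2024-03-02", 5),
            ("IN", "2024-03-01", 7),
            ("IN", "2024-03-02", 9),
        ],
    )
    return conn.execute(WINDOW_QUERY).fetchall()


if __name__ == "__main__":
    for row in window_demo():
        print(row)
```

A typical source of bugs is mixing up PARTITION BY (which rows share a window) with ORDER BY (how rows are sequenced inside it). Another is forgetting that, with ORDER BY and no explicit frame, rows that tie on the ordering column are summed together.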
2. Product Sense
One of a data scientist’s main responsibilities is to extract insights from data and work with product managers and engineering teams to deliver actionable plans to improve the product. Think about how you would measure the success of different parts of the product. Why do you think the text box is placed at that specific location? What can you do to improve it?
The interviewer will evaluate your ability to apply data to a real product problem: how you systematically approach and structure the problem, form hypotheses with reasonable assumptions, design and run A/B tests of those hypotheses, and use data and facts to convince others to adopt your recommendations.
Sample questions:
- If revenue dropped in a given week, which metrics would you look at to understand the drop, and why?
- How would you measure the health of our product search functionality?
How to prepare
- I highly recommend reading Lean Analytics: Use Data to Build a Better Startup Faster (Lean Series), which gives you a good sense of how startup companies use analytics to drive their product decisions. Top technology companies, especially those in Silicon Valley, tend to think of themselves as startups regardless of their size, or at least keep a startup mindset about growing the company.
- If you still have time, consider reading Cracking the PM Interview. If you are short on time, I would go through just these 3 chapters: product, case studies, and behavioral questions.
3. Data processing with Python/R
The interviewer will evaluate your skills in basic operations in Python or R, the 2 most popular programming languages on most data science teams in Silicon Valley.
The bad news is that you will most likely not even get a phone interview if you are not familiar with either of the two languages.
The good news is that you don’t actually need to know both of them. Pick one and become very good at it. Build a project using either R or Python.
A side note: from my observation, it is highly likely that Python will become the dominant player because of its great ecosystem. It is a general-purpose programming language, and it is much easier to productionize and serve a Python model on the internet compared to R.
If you are brand new to both R and Python and need to choose a language to start with, I would pick Python.
I used to be a heavy R user and have presented at useR!, but I completely switched to Python 5 years ago and never regretted it.
In addition to basic data processing, you will very likely be asked to perform a series of analytics, visualization, or modeling tasks on a data set, so the interviewer can confirm that you are hands-on with the tool and get a sense of your experience level.
Sample question: Read a CSV file into Python/R, handle missing data, build and train a classification model, evaluate its performance, prepare a report, and share the Jupyter notebook with the interviewer.
How to prepare
a. For Python people
- For people new to Python: DataCamp has classes that cover pandas, matplotlib, and seaborn, which are good enough to get you started;
- After you have familiarized yourself with basic data processing, you can move on to the scikit-learn library, which has some excellent tutorials covering data processing, feature selection, and modeling with real data: https://scikit-learn.org/stable/tutorial/index.html
b. For R people
- Coursera’s R programming class can help you brush up your skills in a couple of weeks. https://www.coursera.org/learn/r-programming
Finally, if you still have time, I also highly recommend creating a Kaggle account, joining a couple of competitions there, and reading top competitors’ R/Python code. This will significantly help you understand how to solve real-world problems (e.g., normalizing data, handling missing data, using ensemble learning to boost model performance) and become a better data scientist.
4. A/B testing
A/B testing is a statistical framework that helps validate an idea or a theory through data.
For example, a product manager wants to know if changing the color of a buy button from green to blue can encourage more purchases. As a data scientist, it is your job to work with the product manager and, quite often, the engineering team (who can help implement the test settings) to develop a testing plan.
At a minimum, you need to decide how many people will see each button color (sample size), how many days the test will run (usually a multiple of 7 days, so that every day of the week is covered equally), and where it should run (the US only, or a few smaller countries first, so that a failed test group does not have a large negative impact on revenue).
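The sample-size and duration decisions can be sketched numerically. The snippet below uses the standard normal-approximation formula for comparing two proportions. The baseline and target conversion rates, the 5% significance level, the 80% power, and the daily traffic figure are all illustrative assumptions, not numbers from any real product, and the function names are made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test."""
    if p_control == p_treatment:
        raise ValueError("the two conversion rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    p_bar = (p_control + p_treatment) / 2
    term_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_power * sqrt(
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    )
    return ceil((term_null + term_alt) ** 2 / (p_treatment - p_control) ** 2)


def weeks_to_run(n_per_group, eligible_users_per_day, groups=2):
    """Round the test duration up to whole weeks."""
    days = ceil(groups * n_per_group / eligible_users_per_day)
    return ceil(days / 7)


if __name__ == "__main__":
    # Assumed scenario: 10% baseline purchase rate, hoping to detect 12%.
    n = sample_size_per_group(0.10, 0.12)
    print("users per group:", n)
    # Assumed traffic: 1,000 eligible users per day across both groups.
    print("weeks to run:", weeks_to_run(n, 1000))
```

Under these assumptions the test needs a few thousand users per group. Halving the detectable lift roughly quadruples the required sample, which is why the minimum effect worth detecting should be agreed with the product manager up front.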
The key assumption of A/B testing is that the control group and the testing group have to be independent. You will probably be asked several questions around this assumption.
You will also need to understand key concepts such as novelty effect, learning effect, A/A testing, Simpson’s paradox, etc.
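Of these concepts, Simpson’s paradox is the easiest to show with numbers. The counts below are hypothetical, chosen only to make the effect visible: variant A converts better than variant B on both desktop and mobile, yet B looks better overall, because the two variants received very different mixes of desktop and mobile traffic.

```python
from fractions import Fraction

# Hypothetical (conversions, users) per variant and platform.
DATA = {
    "A": {"desktop": (81, 87), "mobile": (192, 263)},
    "B": {"desktop": (234, 270), "mobile": (55, 80)},
}


def segment_rates(data):
    """Conversion rate for every variant within every segment."""
    return {
        variant: {seg: Fraction(c, n) for seg, (c, n) in segments.items()}
        for variant, segments in data.items()
    }


def overall_rates(data):
    """Conversion rate per variant after pooling all segments together."""
    rates = {}
    for variant, segments in data.items():
        conversions = sum(c for c, _ in segments.values())
        users = sum(n for _, n in segments.values())
        rates[variant] = Fraction(conversions, users)
    return rates


if __name__ == "__main__":
    for variant, segs in segment_rates(DATA).items():
        for seg, rate in segs.items():
            print(f"{variant} {seg}: {float(rate):.3f}")
    for variant, rate in overall_rates(DATA).items():
        print(f"{variant} overall: {float(rate):.3f}")
```

The practical takeaway is to check that the control and test groups have a comparable traffic mix before trusting a pooled result.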
Sample question: The engineering team just built a people-you-may-know widget. If it is launched, a user will see suggested friends in the corner of their homepage. How would you design an experiment to decide whether we should launch this feature?
How to prepare
Udacity has a free introductory class taught by practitioners from Google, which I highly recommend. As long as you work through this class, finish the home assignments, and feel comfortable with the key concepts, you should be able to handle most A/B testing questions. https://www.udacity.com/course/ab-testing--ud257
A side note: very often, you will be asked to make recommendations for different scenarios, e.g., what the product marketing team should do if the results are significant, and what it should do if they are not.
To answer this kind of question, always use a framework. For example: if the result is confirmed to be significantly positive, double down on the approach, expand the success story to other markets, and repeat the test.
If the results turn out to be not significant, or significant but negative, come up with new theories and start testing new ideas.
It’s a never-ending cycle of new ideas/proposals => A/B testing => recommendation 😃.
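The framework above can be sketched as a two-proportion z-test followed by a simple decision rule. This is a minimal illustration under assumptions, not a full experimentation pipeline: the conversion counts are made up, the 5% threshold is a common convention rather than a rule, and the `recommend` helper is a hypothetical name for the decision step.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for treatment B versus control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def recommend(z, p_value, alpha=0.05):
    """Map a test result onto the framework described in the text."""
    if p_value < alpha and z > 0:
        return "expand: roll out to more markets and repeat the test"
    # not significant, or significant but negative
    return "iterate: form new hypotheses and test new ideas"


if __name__ == "__main__":
    # Made-up counts: 400/4000 purchases in control, 480/4000 in treatment.
    z, p = two_proportion_z_test(400, 4000, 480, 4000)
    print(f"z = {z:.2f}, p = {p:.4f}")
    print(recommend(z, p))
```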
5. Statistics/Statistical Inference
A data scientist is a statistician who lives in San Francisco.
Jokes aside, as a data scientist you will encounter many real-world situations that call for statistics: handling missing data and unbalanced samples, deciding on a sample size, performing hypothesis tests, forming reasonable assumptions, and explaining to business leaders what a confidence interval means. Statistics skills are therefore necessary to ace a data scientist interview.
Sample questions: What are Type I and Type II errors? How do you explain a p-value to a non-technical person? What are the assumptions of a 2-sample t-test?
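One way to build intuition for the p-value question is a permutation test, which is used here in place of the t-test because it needs fewer assumptions and only the standard library. The idea: if group labels did not matter, shuffling them should produce a difference at least as large as the observed one fairly often, and the p-value estimates how often that happens. The data and function name below are made up for illustration.

```python
import random


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels are arbitrary
        left, right = pooled[: len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            at_least_as_extreme += 1
    # add-one correction keeps the estimate strictly above zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up measurements for two groups.
    treatment = [12, 14, 15, 16, 18, 19, 20, 21]
    control = [1, 2, 3, 4, 5, 6, 7, 8]
    print("p-value:", permutation_p_value(treatment, control))
```

In these terms, a Type I error is rejecting the "labels do not matter" hypothesis when it is actually true, and a Type II error is failing to reject it when there is a real difference.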
How to prepare
You can practice statistics questions on brilliant.org, which I found made it quite easy to brush up my skills quickly when preparing for statistics interview questions.
6. Probability
Probability questions are not the same as statistics questions. You can think of probability questions as being more about math, while statistics questions are more about dealing with real data.
Sample question: Given 2 fair dice numbered 1–6, how many times on average do we have to roll them before the sum of the two dice is greater than 10?
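A short worked sketch of this question follows, assuming "how many times on average" means the expected number of rolls up to and including the first roll whose sum exceeds 10. Only 3 of the 36 equally likely outcomes qualify, (5, 6), (6, 5), and (6, 6), so each roll succeeds with probability 1/12, and the number of rolls follows a geometric distribution with mean 1/p = 12. The code enumerates the outcomes exactly and then checks the answer with a seeded simulation. The function names are illustrative.

```python
import random
from fractions import Fraction
from itertools import product


def prob_sum_greater_than(threshold, sides=6):
    """Exact probability that two fair dice sum to more than `threshold`."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    hits = sum(1 for a, b in outcomes if a + b > threshold)
    return Fraction(hits, len(outcomes))


def expected_rolls(threshold):
    """Mean of a geometric distribution: 1 / p rolls until the first success."""
    return 1 / prob_sum_greater_than(threshold)


def simulated_rolls(threshold, trials=20_000, seed=0):
    """Monte Carlo estimate of the average number of rolls needed."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        rolls = 1
        while rng.randint(1, 6) + rng.randint(1, 6) <= threshold:
            rolls += 1
        total += rolls
    return total / trials


if __name__ == "__main__":
    print("P(sum > 10) =", prob_sum_greater_than(10))
    print("expected rolls =", expected_rolls(10))
    print("simulated average =", simulated_rolls(10))
```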
How to prepare
brilliant.org is a good resource.
7. Behavioral questions
Behavioral questions are probably the easiest part to prepare for and have the highest ROI (return on investment), but many people spend very little time on them and get caught off guard by questions like "Tell me about a time when you disagreed with your boss."
Sample questions:
- Tell me about your biggest failure/success/favorite project.
- Describe an unpopular decision you made with the product team. How did you handle the situation and implement it?
How to prepare
List your past 5 projects with interesting stories, using the SAR framework (situation, action, and results), to demonstrate your leadership, successes, failures/mistakes, and challenges (such as a disagreement with your manager or a coworker).
Find a partner, practice through a mock interview, and get their feedback. The important thing is that your stories have to be ‘meaty’, and that you are prepared when an interviewer dives into the details.
I have written an article with detailed step-by-step instructions on how to prepare them. Feel free to check it out here.
Another resource to consider is Amazon’s leadership principles.
Those are the 7 areas I recommend focusing on when interviewing for analytics/inference track data scientist positions.
It is the same process I use to ace my interviews at some of the top tech companies.
I hope they are useful, and if you have any questions, please feel free to reach out to me.
Whether you are a first-time job seeker or a professional who wants to make a career change, you can find me on Twitter or chat with me online at sqlpad.io.