50 Best Data Science Interview Questions and Answers in 2021




As the name implies, data science is the study of gaining wisdom and knowledge from data through scientific methods and processes in various forms. If you are interested in becoming a data scientist, you may need to impress prospective employers with your skills. To do that, you must be able to pass your next data science interview in one go! In this post, we have compiled a list of the most common data science interview questions you can expect in your next interview!

Data Science has been ranked as one of the hottest professions, and the demand for data practitioners is booming.

Data Science Interview Questions Based on Basic and Technical Concepts

1) What are the feature vectors?

Feature vectors are n-dimensional vectors of numerical features that represent an object. In machine learning, feature vectors are used to represent the numerical or symbolic properties (called features) of an object in a mathematical form that makes it easier to analyze.

2) What are the steps in making a decision tree?

Follow the steps below to make a decision tree:

  • Take the entire data set as input.
  • Look for a split that will maximize the separation of the classes. A split is any test that divides the data into two sets.
  • Apply the split to the input data (divide step).
  • Repeat steps two and three on each of the divided subsets.
  • When you meet any stopping criteria, stop.
  • If you have gone too far with your splits, clean up the tree. We refer to this step as pruning.
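The "look for a split" step can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the steps above: it uses Gini impurity as the separation measure and a single numeric feature, and the names `gini`, `best_split`, `ages` and `bought` are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' test."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data: age of a customer and whether they bought a product.
ages = [22, 25, 30, 35, 40, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)
```

A full tree builder would call `best_split` recursively on the left and right subsets until a stopping criterion is met.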

3) What is Root Cause Analysis?

Root cause analysis was originally developed for analyzing industrial accidents, but the method is now widely used across fields. Using this technique, you can isolate the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from occurring.

4) What is Prior probability and likelihood?

Prior probability is the proportion of each class of the dependent variable in the data set, while the likelihood is the probability of classifying a given observation into a class in the presence of some other variable.

5) What are Recommender Systems?

Recommender systems are a subclass of information filtering systems that predict the rating or preference a user would give a product. Recommendation systems are used in many areas, including movies, news, research articles, products, social tags, and music.

6) Explain cross-validation.

Cross-validation is a model validation technique used to determine how well the statistical analysis results will generalize to a set of independent data. Most often, it is used in situations where one wants to forecast, and an estimation of how accurate the model will be in real-life situations is needed.

Cross-validation involves using a data set to test the model during training (e.g. a validation data set) to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
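As a rough sketch of how the data can be divided, the snippet below generates the index sets for k-fold cross-validation, one common form of the technique. The function name `k_fold_indices` is an assumption made for this example, and the sketch does not shuffle the data, which you would normally do first for non-sequential data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each observation lands in the validation set exactly once, so every fold tests the model on data it did not see during training.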

7) What is Collaborative Filtering?

Collaborative filtering is the filtering process most recommender systems use to identify patterns and information by combining perspectives from many users, numerous data sources, and a variety of agents.

8) Do gradient descent methods always converge to similar points?

They do not. In some cases, they reach a local minimum (or local optimum) rather than the global optimum. The outcome depends on the starting conditions and the data.
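The effect of the starting point can be shown with a small sketch. The function `f` below is a made-up one-dimensional example with two minima, and the learning rate and step count are arbitrary choices for illustration.

```python
def f(x):
    """A toy function with a local minimum near x = 1.13 and a global one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

from_right = gradient_descent(2.0)   # ends in the local minimum
from_left = gradient_descent(-2.0)   # ends in the global minimum
print(from_right, f(from_right))
print(from_left, f(from_left))
```

Both runs use the same algorithm and the same function, yet they stop at different points because they started on different sides of the hump between the two minima.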

9) What is the law of large numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. It forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation all converge to the quantities they are trying to estimate.
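A quick simulation makes the theorem concrete. The sketch below rolls a fair six-sided die, whose expected value is 3.5, and prints the sample mean for increasing sample sizes. The function name and the fixed seed are assumptions for this example.

```python
import random

def sample_mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated fair die rolls (seeded for repeatability)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls

for n in (10, 1000, 100000):
    print(n, sample_mean_of_die_rolls(n))
```

With more rolls, the printed mean should settle closer and closer to 3.5.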

10) What is the goal of A/B Testing?

This is statistical hypothesis testing for randomized experiments with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest.

11) What are the confounding variables?

In a statistical model, these are extraneous variables that correlate directly or inversely with both the dependent and the independent variables. The estimate fails to account for the confounding factor.

12) What is star schema?

This is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and connect to the central fact table through ID fields. These satellite tables are known as lookup tables and are especially useful in real-time applications, as they can save a great deal of memory. Sometimes, star schemas involve several layers of summarization to retrieve information faster.

13) Why is resampling done?

Resampling occurs in any of the following situations:

  • Using subsets of accessible data, or drawing randomly from data points, to estimate the accuracy of sample statistics.
  • Substituting labels on data points when performing significance tests.
  • Validating models by using random subsets (bootstrapping, cross-validation).

14) What are Eigenvalue and Eigenvector?

Eigenvectors are the directions along which a particular linear transformation acts by stretching, compressing, or flipping. Eigenvalues are the factors by which the transformation stretches or compresses along those directions.

Eigenvectors help us understand linear transformations. In data analysis, we commonly calculate the eigenvectors of correlation and covariance matrices.
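To make the idea concrete, here is a minimal sketch that finds the dominant eigenvalue and its eigenvector with power iteration, which is one simple method among many. It assumes the matrix has a single largest eigenvalue and that the starting vector is not orthogonal to its eigenvector. The matrix `A` and the helper names are made up for the example.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def power_iteration(matrix, iterations=50):
    """Approximate the dominant eigenvalue and a unit-length eigenvector."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient; v already has unit length.
    eigenvalue = sum(a * b for a, b in zip(mat_vec(matrix, v), v))
    return eigenvalue, v

A = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(A)
print(eigenvalue, eigenvector)
print(mat_vec(A, eigenvector))  # the same direction, scaled by the eigenvalue
```

Applying `A` to the eigenvector does not change its direction; it only scales it by the eigenvalue.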

15) How regularly must an algorithm be updated?

You should update an algorithm when:

  • You want the model to evolve as data streams through infrastructure.
  • The underlying data source is changing.
  • There is a case of non-stationarity.

16) Explain what Regularisation is and why it is useful.

Regularisation is the process of adding a tuning parameter to a model to induce smoothness and prevent overfitting. Most often, this is achieved by adding a constant multiple of a norm of the weight vector to the loss function. This norm is usually the L1 norm (lasso) or the L2 norm (ridge). The model is then trained to minimize the loss function that includes this penalty.
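As a hedged illustration, the sketch below computes a mean squared error plus an L1 or L2 penalty for a linear model. The function name `penalized_loss`, the toy data, and the value of the tuning parameter `lam` are assumptions for this example, not a reference implementation of lasso or ridge.

```python
def penalized_loss(weights, X, y, lam, penalty="l2"):
    """Mean squared error of a linear model plus lam times an L1 or L2 penalty."""
    predictions = [sum(w * x for w, x in zip(weights, row)) for row in X]
    mse = sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)
    if penalty == "l1":
        reg = sum(abs(w) for w in weights)   # lasso-style penalty
    else:
        reg = sum(w * w for w in weights)    # ridge-style penalty
    return mse + lam * reg

X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 11.0]
weights = [1.0, 2.0]   # fits this toy data exactly, so the MSE is zero

ridge_loss = penalized_loss(weights, X, y, lam=0.1, penalty="l2")
lasso_loss = penalized_loss(weights, X, y, lam=0.1, penalty="l1")
print(ridge_loss, lasso_loss)
```

Even with a perfect fit, the loss is not zero: the penalty term charges the model for large weights, which is what pushes it toward simpler solutions.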

17) What are the types of biases that can occur during sampling?

These are the types of biases that occur during sampling:

  • Selection bias
  • Undercoverage bias
  • Survivorship bias

18) What is Survivorship bias?

Survivorship bias is the logical error of focusing on the aspects that survived a process and overlooking those that did not, because they are less visible. This can lead to incorrect conclusions in numerous ways.

19) What are the different kernel functions in SVM?

There are four types of kernels in SVM.

  1. Linear Kernel
  2. Polynomial kernel
  3. Radial basis kernel
  4. Sigmoid kernel

20) How do you work towards a random forest?

The underlying principle behind this technique is that a strong learner is formed by combining several weak ones. The steps are as follows:

  • Build several decision trees with bootstrapped training samples.
  • For each split, a random sample of m predictors is chosen from all p predictors as split candidates.
  • Rule of thumb: At each split, m ≈ √p.
  • Predictions: By majority vote.

21) What is the significance of the p-value?

p-value ≤ 0.05 (typically)

This indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

p-value > 0.05 (typically)

This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

p-value at the cutoff of 0.05

This is considered marginal, meaning it could go either way.

22) How can you select K for K-means?

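For K-means clustering, we select k using the elbow method. The elbow method runs k-means clustering on the data set for a range of values of k, where k is the number of clusters, and computes the within-cluster sum of squares for each one.

The within-cluster sum of squares (WSS) is the sum of the squared distances between each member of a cluster and its centroid. Plotting WSS against k, we pick the k at the "elbow", the point after which adding more clusters no longer reduces WSS by much.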
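The WSS calculation itself is short. The sketch below assumes the cluster assignments are already known (it does not run k-means) and uses made-up two-dimensional points. The names `centroid` and `wss` are chosen for this example.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wss(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        for point in points:
            total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total

one_cluster = [[(1, 1), (1, 3), (5, 5), (7, 5)]]
two_clusters = [[(1, 1), (1, 3)], [(5, 5), (7, 5)]]

print(wss(one_cluster), wss(two_clusters))
```

Going from one cluster to two cuts WSS sharply on this toy data, which is the kind of drop the elbow method looks for.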

23) What are Exploding Gradients?

Exploding gradients occur when large error gradients accumulate and result in very large weight updates to neural network models during training. In extreme cases, the weight values can become so large that they overflow and result in NaN values. As a result, your model becomes unstable and is unable to learn from the training data.

24) What is the difference between supervised and unsupervised machine learning?

Supervised Machine learning:

Supervised machine learning requires labeled training data. During supervised learning, the algorithm makes predictions on the data iteratively and adjusts toward the correct answer based on its previous predictions. Although supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately.

Unsupervised Machine learning:

Unsupervised machine learning doesn't require labeled data. Instead, unsupervised learning models discover the structure of unlabeled data on their own. Note that some human intervention is still required for validating the output.

25) Explain how a ROC curve works?

A ROC curve plots the true positive rate against the false positive rate at various classification thresholds. It is often used as a proxy for the trade-off between sensitivity (true positive rate) and the false positive rate.

The true positive rate is the proportion of all positive observations that were correctly predicted as positive (TP/(TP + FN)). Likewise, the false positive rate is the proportion of all negative observations that were incorrectly predicted as positive (FP/(TN + FP)).
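The two rates can be computed directly from labels and predicted scores. In this sketch the labels, scores, and thresholds are made-up values, the function name `tpr_fpr` is an assumption, and the data is assumed to contain at least one positive and one negative observation.

```python
def tpr_fpr(y_true, scores, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Each threshold gives one point on the ROC curve.
roc_points = [tpr_fpr(y_true, scores, t) for t in (0.0, 0.5, 1.0)]
print(roc_points)
```

Sweeping the threshold from low to high moves the point from the top-right corner of the plot, where everything is predicted positive, to the bottom-left corner, where nothing is.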

26) Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

  • K-means clustering
  • Linear regression
  • K-NN (k-nearest neighbor)
  • Decision trees

You can use the K-NN (k-nearest neighbor) algorithm because it computes the nearest neighbors, and if an observation is missing a value, it computes the nearest neighbors based on the remaining features.

If you're using K-means clustering or linear regression, you need to handle missing values during preprocessing first; otherwise, they'll fail. Decision trees have the same problem, although some variations can handle it.

27) Explain the SVM machine learning algorithm in detail.

Support vector machine, or SVM, is a supervised machine learning algorithm used for both regression and classification. With n features in your training data set, SVM plots each data point in n-dimensional space, where the value of each feature corresponds to the value of a particular coordinate. Based on the kernel function, SVM separates the classes using hyperplanes.

28) What are Entropy and Information gain in the Decision Tree algorithm?

ID3 is the core algorithm for building a decision tree. To build a decision tree, ID3 uses Entropy and Information Gain.


Entropy

Decision trees are built top-down from the root node and involve partitioning the data into homogeneous subsets. ID3 checks the homogeneity of a sample using entropy. If a sample is completely homogeneous, its entropy is zero, while if it is equally divided between two classes, its entropy is one.

Information Gain

Information gain is the reduction in entropy after a dataset is split on an attribute. Choosing the attributes that provide the most information gain is the key to building a decision tree.
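Both quantities are easy to compute by hand. The following sketch uses base-2 logarithms, so entropy is measured in bits, and made-up "yes"/"no" labels. The helper names `entropy` and `information_gain` are chosen for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves each subset as mixed as the parent gains nothing.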

29) What is pruning in the Decision Tree?

Pruning is the process of removing the sub-nodes of a decision node. It is the opposite of splitting.

30) What is Ensemble Learning?

Ensemble learning is the process of combining a diverse set of learners (individual models) to improve the stability and predictive power of a model. Among the many types of ensemble learning, two of the most popular ones are described below.

Bagging

Bagging trains similar learners on small sample populations and takes the mean of all their predictions. In generalized bagging, you can use different learners on different populations. As you would expect, this helps reduce the variance error.

Boosting

Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, the algorithm increases its weight. In general, boosting reduces bias error and builds strong predictive models. However, it may overfit the training data.

31) What is Random Forest? How does it work?

In terms of machine learning, random forest is an effective method for performing both regression and classification tasks. Additionally, it can be used to reduce dimensionality and to treat missing values and outliers. This is a type of ensemble learning method, in which weak models are combined to form a strong model.

In Random Forest, several trees are grown instead of just one. Each tree provides a classification for an object based on its attributes. The forest chooses the classification that has the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs across trees.

32) What cross-validation technique would you use on a time series data set?

Instead of standard k-fold cross-validation, you should use a technique that respects the fact that time series are not randomly distributed data; they are inherently chronologically ordered.

For time-series data, you should use techniques like forward chaining--where you will model past data before looking at forward-looking data.

fold 1: training[1], test[2]

fold 2: training[1 2], test[3]

fold 3: training[1 2 3], test[4]

fold 4: training[1 2 3 4], test[5]
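The folds listed above can be generated with a short sketch. The function name `forward_chaining_splits` is an assumption for this example, and periods are numbered from 1 to match the listing.

```python
def forward_chaining_splits(n_periods):
    """Yield (training_periods, test_periods) with the training window growing over time."""
    for last_train in range(1, n_periods):
        training = list(range(1, last_train + 1))
        test = [last_train + 1]
        yield training, test

splits = list(forward_chaining_splits(5))
for fold, (training, test) in enumerate(splits, start=1):
    print(f"fold {fold}: training{training}, test{test}")
```

The model is always tested on a period that comes after everything it was trained on, so no future information leaks into training.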

33) What do you understand by the term Normal Distribution?

Data can be distributed in a variety of ways: skewed to the left, skewed to the right, or all jumbled together. However, data may also be distributed around a central value without any bias to the left or right, following a bell-shaped curve. In a normal distribution, the random variable is distributed in the form of a symmetrical, bell-shaped curve.

34) What is Logistic Regression? Also, state an example of using logistic regression.

Logistic regression, also known as the logit model, is used to forecast the binary outcome from a linear combination of predictor variables. Logistic regression is a statistical model that uses the Logistic function to model conditional probability.

For example, suppose you want to predict whether a particular political leader will win an election. The prediction outcome is binary in this case, i.e. 0 or 1 (lose or win). Here, the predictor variables would be the amount of money spent on election campaigning, the amount of time spent campaigning, etc.
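The election example can be sketched as follows. The logistic (sigmoid) function turns a linear combination of the predictors into a probability between 0 and 1. The weights, bias, and input values below are hypothetical numbers chosen for illustration; in practice they would be learned from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def win_probability(money_spent, hours_campaigned, weights, bias):
    """Probability of winning from a linear combination of two predictors."""
    z = bias + weights[0] * money_spent + weights[1] * hours_campaigned
    return sigmoid(z)

# Hypothetical coefficients, not fitted to any real data.
weights = (0.8, 0.05)
bias = -3.0

low = win_probability(1.0, 10.0, weights, bias)
high = win_probability(3.0, 40.0, weights, bias)
print(low, high)
```

A predicted probability above a chosen cutoff, commonly 0.5, is then read as class 1 (win), and anything below it as class 0 (lose).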

35) What is a Box-Cox Transformation?

The Box-Cox transformation is a statistical technique for transforming non-normal dependent variables into a normal shape. Many statistical techniques assume normality, so if your data are not normal, applying a Box-Cox transformation lets you run a broader range of tests.

Developed in 1964, the Box-Cox transformation was named after statisticians George Box and Sir David Roxbee Cox.
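For reference, the one-parameter form of the transformation is (x^λ − 1)/λ when λ ≠ 0 and ln(x) when λ = 0, and it is defined for strictly positive values only. The sketch below applies it for a given λ; the function name `box_cox` and the sample values are assumptions for this example, and choosing the best λ (usually done by maximum likelihood) is not shown.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

values = [1.0, 4.0, 9.0, 16.0]
transformed = [box_cox(v, 0.5) for v in values]
print(transformed)
```

With λ = 0.5 the transform behaves like a shifted and scaled square root, which pulls in the long right tail of skewed data.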

36) What is Deep learning?

Essentially, deep learning is a field of machine learning that uses algorithms inspired by the structure and function of the brain, called artificial neural networks. Deep learning is an extension of neural networks and a subset of machine learning, which also includes algorithms such as linear regression and SVM.

In conventional neural networks, we consider a small number of hidden layers, while in deep learning algorithms, we consider a large number of hidden layers to better capture the input-output relationship.

37) What are Recurrent Neural Networks(RNNs)?

Recurrent neural networks (RNNs) are a powerful technique for processing sequential data and have been used in products such as Apple's Siri and Google's voice search. Because of their internal memory, RNNs can retain information about earlier inputs, which makes them well suited for machine learning problems that involve sequential data. They have contributed to some of the notable advances in deep learning in recent years.

38) What is the difference between machine learning and deep learning?

Machine learning:

It is a field of computer science that enables computers to learn without explicit programming. Machine learning can be categorized into the following three categories.

  1. Supervised machine learning,
  2. Unsupervised machine learning,
  3. Reinforcement learning

Deep Learning

Deep Learning is a subfield of machine learning that uses algorithms inspired by the function and structure of the brain, called artificial neural networks.

39) What is Reinforcement learning?

In reinforcement learning, an agent learns what to do and how to map situations to actions. The goal is to maximize a numerical reward signal. The learner is not told which action to take but instead must discover which actions yield the most reward. Reinforcement learning is based on the reward/penalty principle, which mirrors the way humans learn.

40) What are the drawbacks of the linear model?

Following are the drawbacks of the linear model:

  • It assumes linearity of the errors.
  • It can't be used for count outcomes or binary outcomes.
  • There are overfitting problems that it can't solve.

41) What is TF/IDF vectorization?

tf–idf is short for term frequency-inverse document frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. Information retrieval and text mining often use it as a weighting factor.

The tf-idf value increases with the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently than others in general.
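Here is a minimal sketch of one common tf-idf variant: term frequency as a share of the document's words, and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the term. Libraries often use smoothed versions of these formulas. The tiny corpus and the function names are made up for the example.

```python
import math

def tf(term, doc):
    """Share of the words in one tokenized document that are the given term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
print(tf_idf("dog", docs[1], docs))
```

A word that appears in every document scores zero, and the rarer a word is across the corpus, the more weight it carries in the documents where it does appear.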

42) What is the difference between Regression and classification ML techniques?

Both regression and classification techniques come under supervised machine learning. Supervised machine learning involves training the model on labeled data: we explicitly label the data sets, and the algorithm attempts to learn the patterns that map inputs to outputs.

If our labels are discrete values, then it will be a classification problem, e.g. A, B, etc. However, if our labels are continuous values, then it will be a regression problem, e.g. 1.23, 1.333, etc.

43) What is a p-value?

A p-value can be used to determine the strength of your results when performing a hypothesis test in statistics. A p-value is a number between 0 and 1, and its value reflects how strong the results are. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis. Meanwhile, a high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject it. A p-value right at 0.05 indicates the hypothesis could go either way.

In other words, with a high p-value, your data are consistent with a true null. With a low p-value, your data are unlikely under a true null.

44) What is Naive in a Naive Bayes?

Naive Bayes is based on Bayes' theorem. Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.

The algorithm is naive because it assumes that the features are independent of one another given the class, an assumption that may or may not prove to be correct.

45) What is dimensionality reduction, and what are its benefits?

The concept of dimensionality reduction refers to the process of converting large data sets into data sets with fewer dimensions (fields) to present similar information more concisely.

As a result, data is compressed and storage space is reduced. Furthermore, it reduces computing time, since fewer dimensions require less computation. It also removes redundant features; for example, it makes no sense to store a value in two different units (meters and inches).

46) You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

Here are some ways to deal with missing data values:

If the data set is large, we can simply remove the rows with missing data. This is the quickest approach, and we can use the rest of the data to predict the values.

For smaller data sets, we can use a pandas DataFrame in Python to replace missing values with the mean of the rest of the data. This can be done in different ways, such as computing the mean with df.mean() and filling it in with df.fillna(mean).

47) What are the feature selection methods used to select the right variables?

For feature selection, we have two methods: filter and wrapper.

Filter Methods

This includes:

  • Linear discriminant analysis
  • Chi-Square

For selecting features, the best analogy is "bad data in, bad answer out." Whenever we choose or limit features, it is all about cleaning up the data.

Wrapper Methods

This includes:

  • Forward Selection: We test one feature at a time and keep adding them until we find a good fit.
  • Backward Selection: All features are tested, and we start removing them to see what works.
  • Recursive Feature Elimination: Recursively examines all the features and how they pair with each other.

The wrapper method is extremely labor-intensive, and a high-end computer is required if a lot of data analysis is to be performed.

48) How can you avoid overfitting your model?

In overfitting, the model fits only a very small amount of data and ignores the bigger picture. Overfitting can be avoided in three main ways:

  • Reduce the number of variables in the model, thereby reducing some of the noise in the training data.
  • Employ cross-validation techniques, like k-fold cross-validation.
  • Apply regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting.

49) Differentiate between univariate, bivariate, and multivariate analysis.


Univariate

Univariate data contains only one variable. In univariate analysis, the aim is to describe the data and to identify patterns in it. Example: height of students

The patterns can be analyzed using mean, median, mode, dispersion or range, minimum, maximum, etc.


Bivariate

Two variables are involved in bivariate data. Typically, the analysis of this type of data focuses on causes and on determining the relationship between the two variables. Example: Temperature and ice cream sales in the summer season.

In data like this, temperature and sales tend to rise together: the higher the temperature, the better the sales.


Multivariate

In multivariate data, there are three or more variables. It is similar to bivariate analysis but has more than one dependent variable. Example: data for house price prediction

You can analyze patterns using mean, median, mode, dispersion or range, minimum, maximum, etc. As you describe the data, you can guess what the price of the house will be.

50) What are the differences between supervised and unsupervised learning?

Supervised Learning

  • Uses known and labeled data as input
  • Supervised learning has a feedback mechanism
  • The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machine

Unsupervised Learning

  • Uses unlabeled data as input
  • Unsupervised learning has no feedback mechanism
  • The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and apriori algorithm

If you have made it this far, then certainly you are willing to learn more about data science.

Did you find this article valuable?

Support Yash Tiwari by becoming a sponsor. Any amount is appreciated!