 ### Regression techniques in Machine Learning

Machine learning has become the sexiest and very trendy technology in this world of technologies, Machine learning is used every day in our life such as Virtual assistance, for making future predictions, Videos surveillance, Social media services, spam mail detection, online customer support, search engine resulting prediction, fraud detection, recommendation systems, etc. In machine learning, Regression is the most important topic that needed to be learned. There are different types of Regression techniques in Machine Learning which we will know in this article.

### Introduction:

Regression techniques in Machine Learning such as Linear regression and Logistic regression are the most important algorithms that people learn while they study about Machine learning algorithms. There are numerous forms of regression that are used to perform regression and each has its own specific features, that are applied accordingly. The regression techniques are used to find out the relationship between the dependent and independent variables or features. It is a part of data analysis that is used to analyze the infinite variables and the main aim of this is forecasting, time series analysis, modeling.

### What is Regression?

Regression is a statistical method that mainly used for finance, investing and sales forecasting, and other business disciplines that make attempts to find out the strength and relationship among the variables.

There are two types of the variable into the dataset for apply regression techniques:

1. Dependent Variable that is mainly denoted as Y
2. Independent variable that is denoted as x.

And, There are two types of regression

1. Simple Regression: Only with a single independent feature /variable
2. Multiple Regression: With two or more than two independent features/variables.

Indeed, in all regression studies, mainly seven types of regression techniques are used firmly for complex problems.

• Linear regression
• Logistics regression
• Polynomial regression
• Stepwise Regression
• Ridge Regression
• Lasso Regression

### Linear regression:

It is basically used for predictive analysis, and this is a supervised machine learning algorithm. Linear regression is linear approach to modeling the relationship between scalar response and the parameters or multiple predictor variables. It focuses on the conditional probability distribution. The formula for linear regression is Y = mX+c.

Where Y is the target variable, m is the slope of the line, X is the independent feature, and c is the intercept. #### Additional points on Linear regression:

1. There should be a linear relationship between the variables.
2. It is very sensitive to Outliers and can give a high variance and bias model.
3. The problem of occurring multi colinearity with multiple independent features

### Logistic regression:

It is used for classification problems with a linear dataset. In layman’s term, if the depending or target variable is in the binary form (1 0r 0), true or false, yes or no. It is better to decide whether an occurrence is possibly either success or failure. 1. It is used for classification problems.
2. It does not require any relation between the dependent and independent features.
3. It can after by the outliers and can occur underfitting and overfishing.
4. It needs a large sample size to make the estimation more accurate.
5. It needs to avoid collinearity and multicollinearity.

### Polynomial regression:

The polynomial regression technique is used to execute a model that is suitable for handling non-linear separated data. It gives a curve that is best suited to data points, rather than a straight line.
The polynomial regression suits the least-squares form. The purpose of an analysis of regression to model the expected y value for the independent x of the dependent variable.
The formula for this Y=  β0+ β0x1+e Look particularly for curve towards the ends to see if those shapes to patterns make logical sense. More polynomials can lead to weird extrapolation results.

### Step-wise Regression:

It is used for statistical model fitting regression with predictive models. It is done automatically.
The variable is supplemented or removed from the explanatory variable set at every step. The main approaches for the regression are reverse elimination and bidirectional elimination and step by step approaches.
The formula for this: b = b(sxi/sy)
1. This regression provides two things, the very first one is to add prediction for each steep and remove predictors fro each step.
2. It starts with the most significant predictor into the ML model and then adds features for each step.
3. The backward elimination starts with all the predictors into the model and then removes the least significant variable.

### Ridge Regression:

It is a method that used when the dataset having multicollinearity which means, the independent variables are strongly related to each other. Although the least-squares estimates are unbiased in multicollinearity, So after adding the degree of bias to the regression, ridge regression can reduce the standard errors. 1. In this regression, normality is not to be estimated the same as Least squares regression.
2. In this regression, the value could be varied but doesn’t come to zero.
3. This uses the l2 regularization method as it is also a regularization method.

### Lasso Regression:

Lasso is an abbreviation of the Least Absolute shrinkage and selection operator. This is similar to the ridge regression as it also analyzes the absolute size of the regression coefficients. And the additional features of that are it is capable of reducing the accuracy and variability of the coefficients of the Linear regression models. 1. Lasso regression shrinks the coefficients aero, which will help in feature selection for building a proper ML model.
2. It is also a regularization method that uses l1 regularization.
3. If there are many correlated features, it picks only one of them and shrinks it to the zero.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Regression techniques in Machine Learning,Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM. ### Top 50 interview questions of Machine Learning

51. How to handle categorical variables in KNN?

Ans: Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1).

For example, a categorical variable named “Department” has 5 unique levels/categories. So we will create 5 dummy variables. Each dummy variable has 1 against its department and else 0.

52Can KNN be used for Regression? How to use KNN for Regression?

Ans: Yes, K-nearest neighbour can be used for regression. In other words, the K-nearest neighbour algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbours.

53Discuss the difference between KNN and K Means Algorithms.

Ans: KNN and k-means clustering both are very different algorithms that solve different problems and have their own meanings of what the variable ‘k’ is.  KNN is a supervised classification algorithm that will label new data points based on the ‘k’ number of nearest data points and k-means clustering is an unsupervised clustering algorithm that groups the data into ‘k’ number of clusters.

54. How to reduce the increased variance of the model other than changing k?

Ans: By using bagging-based decision boundaries. If not restricted in the number of times, one can draw samples from the original dataset, a sample variance reduction method would be to sample, many times, and then simply take a majority vote of the kNN models to fit each of these samples to classify each test data point. This variance reduction method is called bagging.

55. What is the effect of sampling on KNN?

Ans: Sampling does several things from the perspective of a single data point since kNN works on a point-by-point basis.

1. The average distance to the k nearest neighbours increases due to increased sparsity in the dataset.
2. Consequently, the area covered by k-nearest neighbours increases in size and covers a larger area of the feature space.
3. The sample variance increases.

A consequence of this change in input is an increase in variance. When we talk of variance, we refer to the variability in the predictions given different samples from the population. Why would the immediate effects of sampling lead to the increased variance of the model?

Notice that now a larger area of the feature space is represented by the same k data points. While our sample size has not grown, the population space that it represents has increased in size. This will result in higher variance in the proportion of classes in the k nearest data points, and consequently a higher variance in the classification of each data point.

56. What happens when we change the value of K in KNN?

Ans: Short Answer: The class boundaries of the predictions become more smooth as k increases.

Long Answer: What really is the significance of these effects? First, it gives hints that a lower k value makes the KNN model more “sensitive.” That is, it is more sensitive to the local changes in the dataset. The “sensitivity” of the model directly translates to its variance.

All of these examples point to an inverse relationship between variance and k. Additionally, consider how KNN operates when k reaches its maximum value, k=n, where n is the number of points in the training set) In this case, the majority class in the training set will always dominate the predictions. It will simply pick the most abundant class in the data, and never deviate, effectively resulting in zero variance. Therefore, it seems to reduce variance, k must be increased.

Final Verdict: In order to offset the increased variance due to sampling, k can be increased to decrease model variance.

57. What is the thumb rule to approach the KNN problem?

Ans:

2. Initialize the value of k
• Calculate the distance between test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. The other metrics that can be used are Chebyshev, cosine, etc.
• Sort the calculated distances in ascending order based on distance values
• Get top k rows from the sorted array
• Get the most frequent class of these rows
• Return the predicted class for getting the predicted class, iterate from 1 to the total number of training data points.

KNN Code Snippet: 58What is SVM Algorithm?

Ans: SVM stands for support vector machine, it is a supervised machine learning algorithm that can be used for both Regression and Classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is a number of features you have) with the value of each feature being the value of a particular coordinate.

For example, if we only had two features like Height and Hair length of an individual, we’d first plot these two variables in two-dimensional space where each point has two coordinates (these co-ordinates are known as Support Vectors) Now, we will find some line that splits the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups will be farthest away. In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on where the testing data lands on either side of the line, that’s what class we can classify the new data as.

59. What are support Vectors?

Ans: A support vector machine attempts to find the line that “best” separates two classes of points. By “best”, we mean the line that results in the largest margin between the two classes. The points that lie on this margin are the support vectors.

The vectors that define the hyperplane are the support vectors.

60. What is the purpose of the Support Vector in SVM?

Ans: A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the distance margin between the two classes. The extreme points in the data sets that define the hyperplane are the support vectors.

61. What are kernels?

Ans: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be of different types.

There are four types of kernels in SVM.

1. Linear Kernel
2. Polynomial kernel
4. Sigmoid kernel

62. What is Kernel Trick?

Ans: Short Answer:  It allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space.

1. For a dataset with n features (~n-dimensional), SVMs find an n-1-dimensional hyperplane to separate it (let us say for classification)
2. Thus, SVMs perform very badly with datasets that are not linearly separable
3. SVM can now do well with datasets that are not linearly separable
4. But, quite often, it’s possible to transform our not-linearly-separable dataset into a higher-dimensional dataset where it becomes linearly separable, so that SVMs can do a good job
5. Unfortunately, quite often, the number of dimensions you have to add (via transformations) depends on the number of dimensions you already have (and not linearly)
1. For datasets with a lot of features, it becomes next to impossible to try out all the interesting transformations
6. Enter the Kernel Trick
• Thankfully, the only thing SVMs need to do in the (higher-dimensional) feature space (while training) is computing the pair-wise dot products
• For a given pair of vectors (in lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (The Kernel Function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first
• We are saved!

63. Why is SVM called as Large Margin Classifier?

Ans: Short Answer: Because it places the decision boundary such that it maximizes the distance between two clusters.

Long Answer: choosing the best hyperplane is to choose one in which the distance from the training points is the maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin is given by: Which is simply the functional margin normalized. So, these intuitions lead to the maximum margin classifier which is a precursor to the SVM.

64What is the difference between Logistics Regression and SVM? When to use which model?

Ans:

1. SVM tries to find the “best” margin (distance between the line and the support vectors) that separates the classes and this reduces the risk of error on the data, while logistic regression does not, instead it can have different decision boundaries with different weights that are near the optimal point.
2. SVM works well with unstructured and semi-structured data like text and images while logistic regression works with already identified independent variables.
3. SVM is based on the geometrical properties of the data while logistic regression is based on statistical approaches.
4. Logistic Regression can’t be applied to a nonlinearly separable dataset whereas SVM can be applied.
5. The risk of overfitting is less in SVM, while Logistic regression is vulnerable to overfitting.

65. When to Use Logistic Regression vs Support Vector Machine?

Ans: Depending on the number of training sets (data)/features that you have, you can choose to use either logistic regression or support vector machine.

Let’s take these as an example where:
n = number of features,
m = number of training examples

1. If n is large (1–10,000) and m is small (10–1000): use logistic regression or SVM with a linear kernel.
2. If n is small (1–1000) and m is intermediate (10–10,000): use SVM with (Gaussian, polynomial, etc) kernel
3. If n is small (1–100), m is large (50,000–1,000,000+): first, manually add more features and then use logistic regression or SVM with a linear kernel

66. What does c and gamma parameter in SVM signify?

Cost and Gamma are the hyper-parameters that decide the performance of an SVM model. There should be a fine balance between Variance and Bias for any ML model. (this is a science and an art – as we call it in empirical studies)

For SVM, a High value of Gamma leads to more accuracy but biased results and vice-versa. Similarly, a large value of Cost parameter (C) indicates poor accuracy but low bias and vice-versa.

Following table summarizes the above explanation –

The art is to choose a model with optimum variance and bias. Therefore, you need to choose the values of C and Gamma accordingly.

Optimum values of C and Gamma can be found by using methods like Grid search.

The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’.

The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.  If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of the high density of any pair of two classes.

• SVM’s are very good when we have no idea about the data.
• Works well with even unstructured and semi-structured data like text, Images, and trees.
• The kernel trick is a real strength of SVM. With an appropriate kernel function, we can solve any complex problem.
• Unlike in neural networks, SVM is not solved for local optima.
• It scales relatively well to high dimensional data.
• SVM models have generalization in practice, the risk of over-fitting is less in SVM.
• SVM is always compared with ANN. When compared to ANN models, SVMs give better results.

• Choosing a “good” kernel function is not easy.
• Long training time for large datasets.
• Difficult to understand and interpret the final model, variable weights, and individual impact.
• Since the final model is not so easy to see, we can not do small calibrations to the model hence its tough to incorporate our business logic.
• The SVM hyperparameters are Cost -C and gamma. It is not that easy to fine-tune these hyper-parameters. It is hard to visualize their impact.

SVM code snippet: 68What is Naïve Bayes Algorithm?

Ans: It is a classification algorithm that predicts the probability of each data point belonging to a class and then classifies the point as the class with the highest probability.

Discuss Bayes Theorem.

Bayes’ Theorem gives us the probability of an event actually happening by combining the conditional probability given some result and the prior knowledge of an event happening.

Conditional probability is the probability that something will happen, given that something has occurred.  In other words, the conditional probability is the probability of X given a test result or P(X|Test).  For example, what is the probability an e-mail is spam given that my spam filter classified it as spam.

The prior probability is based on previous experience or the percentage of previous samples.  For example, what is the probability that any email is spam?

Formally

• P(A|B) = Posterior probability = Probability of A given B happened
• P(B|A) = Conditional probability = Probability of B happening if A is true
• P(A) = Prior probability = Probability of A happening in general
• P(B) = Evidence probability = Probability of getting a positive test

69. Why is Naïve Bayes Naïve?

Ans: In Layman’s Term: The simple meaning of Naive is willing to believe that that life is simple and fair, which is not true. Naive Bayes is naive because it assumes that the features that are going into the model are not related to each other anyhow Change in one variable will not affect the other variable directly.

Long Answer: Naive Bayes (NB) is ‘naive’ because it makes the assumption that features of measurement are independent of each other. This is naive because it is (almost) never true. Here is how it works even then – NB is a very intuitive classification algorithm. It asks the question, “Given these features, does this measurement belong to class A or B?”, and answers it by taking the proportion of all previous measurements with the same features belonging to class A multiplied by the proportion of all measurements in class A. If this number is bigger than the corresponding calculation for class B then we say the measurement belongs in class A.

70. What are feature matrix and response vectors?

Ans: Feature matrix:- The feature matrix contains all the vectors(rows) of the dataset in which each vector consists of the value of dependent features.

Response vectors:- The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix.

71 Applications of Naïve Bayes Classification Algorithms?

Ans: Some of the real-world examples are as given below

• To mark an email as spam, or not spam?
• Classify a news article about technology, politics, or sports?
• Check a piece of text expressing positive emotions, or negative emotions?
• Also used for face recognition software.

72. What are the Advantages and Disadvantages of using the Naïve Bayes Algorithm?

1. Fast
2. Highly scalable.
3. Used for binary and Multiclass Classification.
4. Great Choice for text classification.
5. It can easily train smaller data sets.

Naive Bayes considers that the features are independent of each other. However, in the real-world, features depend on each other.

Naïve Bayes Code Snippet: 73. What is K-Means Clustering? What are the steps for it?

Ans: K-means (Macqueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, original from signal processing, that is popular for cluster analysis in data mining.

If k is given, the K-means algorithm can be executed in the following steps:

• Partition of objects into k non-empty subsets
• Identifying the cluster centroids (mean point) of the current partition.
• Assigning each point to a specific cluster
• Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.
• After re-allotting the points, find the centroid of the new cluster formed.

74. Why is the word “means” associated with the name of the K-Means algorithm?

Ans: The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.

There are k-medoids and k-medians algorithms as well.

k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids choose datapoints as centers (medoids or exemplars).

k-medians is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median.

75. How to find the optimum number of clusters in K-Means? Discuss the elbow curve/elbow method?

Ans: The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.

The Elbow method looks at the total WSS as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t improve much better the total WSS. Notice the elbow at k =3.

The optimal number of clusters can be defined as follow:

1. Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of square (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.

76. What is the difference between K-Means and Hierarchical Clustering? When to use which?

Ans: Hierarchical Clustering and k-means clustering complement each other. In hierarchical clustering, the researcher is not aware of the number of clusters to be made whereas, in k-means clustering, the number of clusters to be made is specified before-hand.
Advice- If unaware of the number of clusters to be formed, use hierarchical clustering to determine the number and then use k-means clustering to make more stable clusters as hierarchical clustering is a single-pass exercise whereas k-means is an iterative process.

1) If variables are huge, then  K-Means most of the times computationally faster than hierarchical clustering, if we keep k smalls.

2) K-Means produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

1) Difficult to predict K-Value.
2) With a global cluster, it didn’t work well.
3) Different initial partitions can result in different final clusters.
4) It does not work well with clusters (in the original data) of Different sizes and Different density.

KNN code snippet: 78. What is Hierarchical Clustering?

Ans: Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabelled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster and then successively merge or agglomerate (bottom-up approach) the pairs of clusters. The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (Top-down approach) the one big cluster into various small clusters

79. What are the steps to perform Agglomerative Hierarchical Clustering?

Ans: Most used and important Hierarchical clustering i.e. agglomerative. The steps to perform the same is as follows −

• Step 1 − Treat each data point as a single cluster. Hence, we will be having, say K clusters at the start. The number of data points will also be K at the start.
• Step 2 − Now, in this step we need to form a big cluster by joining two closet datapoints. This will result in a total of K-1 clusters.
• Step 3 − Now, to form more clusters we need to join two closet clusters. This will result in a total of K-2 clusters.
• Step 4 − Now, to form one big cluster repeat the above three steps until K would become 0 i.e. no more data points left to join.
• Step 5 − At last, after making one single big cluster, dendrograms will be used to divide into multiple clusters depending upon the problem.

80. What is Dendrogram and what is its importance in Hierarchical Clustering?

Ans: A dendrogram is a type of Tree Diagram showing hierarchical clustering — relationships between similar sets of data. They are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.

The role of the dendrogram starts once the big cluster is formed. Dendrogram will be used to split the clusters into multiple clusters of related data points depending upon our problem.

Parts of Dendrogram: Hierarchical Clustering Code Snippet: 81. What is Boosting?

Ans: Boosting is a method of converting weak learners into strong learners. In boosting, each new tree is a fit on a modified version of the original data set.

Purpose of Boosting: It helps the weak learner to be modified to become better.

How it evolved: The first Boosting Algorithm gained popularity was AdaBoost or Adaptive Boosting. Further it evolved and generalized as Gradient Boosting.

Ans: Adaboost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous error, the observations that were incorrectly classified now carry more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problems. 83. What is Gradient Boosting Method (GBM)?

Ans: Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights for every incorrect classified observation at every iteration like AdaBoost, the Gradient Boosting method tries to fit the new predictor to the residual errors made by the previous predictor.

GBM uses Gradient Descent to find the shortcomings in the previous learner’s predictions. The GBM algorithm can be given in the following steps.

Fit a model to the data, F1(x) = y

Create a new model, F2(x) = F1(x) + h1(x)

By combining weak learners after weak learners, our final model is able to account for a lot of the error from the original model and reduces this error over time. 84. What is XGBoost?

Ans: XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow in implementation because of sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance. XGBoost provides:

• Parallelization of tree construction using all of your CPU cores during training.
• Distributed Computing for training very large models using a cluster of machines.
• Out-of-Core Computing for very large datasets that don’t fit into memory.
• Cache Optimization of data structures and algorithm to make the best use of hardware.

XGBoost Code Snippet: 85. What are the basic enhancements done to Gradient Boosting?

Ans: Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

We will look at 4 enhancements to basic gradient boosting:

1. Tree Constraints
2. Shrinkage
3. Random sampling
4. Penalized Learning
1. Tree Constraints: A good general heuristic is that the more constrained tree creation is, the more trees you will need in the model, and the reverse, where less constrained individual trees, the fewer trees that will be required.

Below are some constraints that can be imposed on the construction of decision trees:

• The number of trees, generally adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.
• Tree depth, deeper trees are more complex trees, and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
• The number of nodes or number of leaves, like depth, can constrain the size of the tree but is not constrained to a symmetrical structure if other constraints are used.
• Number of observations per split imposes a minimum constraint on the amount of training data at a training node before a split can be considered
• Minimum improvement to loss is a constraint on the improvement of any split added to a tree.
1. Penalized Gradient Boosting: Additional constraints can be imposed on the parameterized trees in addition to their structure. Classical decision trees like CART are not used as weak learners, instead, a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights in some literature. As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as L1 regularization of weights and L2 regularization of weights. The additional regularization term helps to smooth the final learned weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.
2. Weighted Updates: The predictions of each tree are added together sequentially. The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called a shrinkage or a learning rate.
3. Stochastic Gradient Boosting: A big insight into bagging ensembles and the random forest was allowing trees to be greedily created from subsamples of the training dataset. This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models. This variation of boosting is called stochastic gradient boosting. At each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.

86. What is Dimensionality Reduction? Why is it used?

Ans: Dimensionality reduction refers to the process of converting a set of data. That data needs to having vast dimensions into data with lesser dimensions. Also, it needs to ensure that it conveys similar information concisely.

Although, we use these techniques to solve machine learning problems. And the problem is to obtain better features for a classification or regression task.

87. What are the commonly used Dimensionality Reduction Techniques?

Ans: The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

88. How does PCA work? When to use?

Ans: Short Answer: Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.

High dimensionality means that the dataset has a large number of features. The primary problem associated with high dimensionality in the machine learning field in model overfitting, which reduces the ability to generalize beyond the examples in the training set.

PCA in Layman’s Term: Consider the 2D XY plane.

For the sake of intuition, let us consider variance as the spread of data – the distance between the two farthest points.

Assumption:
Typically, it is believed, that if the variance of data is large, it offers more information, than data that has a small variance. (This may or may not be true). This is the assumption which PCA intends to exploit.

I give you 4 points – {(1,1), (2,2), (3,3), (4,4)}
(all lie on the line X=Y)

What is the variance on X-axis?
Variance(X) = 4-1 = 3

What is the variance on Y-axis?
Variance(Y) = 4-1 = 3

Can we obtain new data with higher variance in some manner?
Rotate your XY system by 45 degrees anticlockwise. What happens? The line X=Y has now become the X(new)-axis. And, X = -Y is now the Y(new)-axis. Let’s compute the variance again (in the form of distance)

Variance(X(new)) = distance ((4,4), (1,1)) = sqrt(18) = 4.24
Variance(Y(new)) =requires some calculations.

89. What did we get by doing this rotation?
Ans: Original data – had the highest variance on any axis as 3. This rotation gave us a variance of 4.24

That was the intuitive explanation of what PCA does. Just for further clarification

Eigenvalues = variance of the data along a particular axis in the new coordinate system. In above example, Eigenvalue(X(new)) = 4.24.

Eigenvectors = the vectors which represent the new coordinate system. In above example, vector [1,1], would be an eigenvector for X(new), and [1,-1] eigenvector for Y(new). Since they are just directions – solvers typically give us unit vectors.

Getting transformed data
Once you have the eigenvectors, a dot product of the eigenvector with the original point will give you the new point in the new coordinate system.

Diagonalization: This is the part where you equate covariance to lambda*I. This is basically trying to find an eigenvector, such that all points would lie on the same line, and thus it will have only elements of variance, and covariance terms would be zero.

Steps of PCA:

1. Calculate the covariance matrix X of data points.
2. Calculate eigenvectors and correspond eigenvalues.
3. Sort eigenvectors accordingly to their given value in decrease order.
4. Choose first k eigenvectors and that will be the new k dimensions.
5. Transform the original n-dimensional data points into k-dimensions

PCA code snippet: 90. How does LDA work? When to use?

Ans: LDA is a way to reduce ‘dimensionality’ while at the same time preserving as much of the class discrimination information as possible.

How does it work?
Basically, LDA helps you find the ‘boundaries’ around clusters of classes. It projects your data points on a line so that your clusters ‘are as separated as possible’, with each cluster having a relative (close) distance to a centroid.

What was that stuff about dimensionality?
Let’s say you have a group of data points in 2 dimensions, and you want to group them into 2 groups. LDA reduces the dimensionality of your settings like so:
K(Groups) = 2. 2-1 = 1.

Why? Because “The K centroids lie in an at most K-1-dimensional affine subspace”. What is the affine subspace? It’s a geometric concept or *structure* that says, “I am going to generalize the affine properties of Euclidean space”. What are those affine properties of the Euclidean space? Basically, it’s the fact that we can represent a point with 3 coordinates in a 3-dimensional space (with a nod toward the fact that there may be more than 3 dimensions that we are ultimately dealing with).

So, we should be able to represent a point with 2 coordinates in 2-dimensional space and represent a point with 1 coordinate in a 1-dimensional space. LDA reduced the dimensionality of our 2-dimension problem down to one dimension. So now we can get down to the serious business of listening to the data. We now have 2 groups, and 2 points in any dimension can be joined by a line. How many dimensions does a line have? 1! Now we are cooking with Crisco!

So we get a bunch of these data points, represented by their 2d representation (x,y). We are going to use LDA to group these points into either group 1 or group 2.

91. What are the Steps for LDA?

Ans: Steps of LDA:

1. 1. Compute the d-dimensional mean vector for the different classes from the dataset.
2. Compute the Scatter matrix (in between class and within the class scatter matrix)
3. Sort the Eigen Vector by decrease Eigen Value and choose k eigenvector with the largest eigenvalue to from a d x k dimensional matrix w (where every column represents an eigenvector)
4. Used d * k eigenvector matrix to transform the sample onto the new subspace.

This can be summarized by the matrix multiplication.

Y = X x W (where X is an n * d dimension matrix representing the n samples and you are transformed n * k dimensional samples in the new subspace.

LDA code snippet: 92. What is GDA?

Ans: When we have a classification problem in which the input features are continuous random variable, we can use GDA, it’s a generative learning algorithm in which we assume p(x|y) is distributed according to a multivariate normal distribution and p(y) is distributed according to Bernoulli.

Gaussian discriminant analysis (GDA) is a generative model for classification where the distribution of each class is modeled as a multivariate Gaussian.

• Dimensionality Reduction helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Dimensionality Reduction helps in data compressing and reducing the storage space required
• It fastens the time required for performing the same computations.
• If there present fewer dimensions then it leads to less computing. Also, dimensions can allow the usage of algorithms unfit for a large number of dimensions.
• It takes care of multicollinearity that improves model performance. It removes redundant features. For example, there is no point in storing a value in two different units (meters and inches).
• Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly.

• Basically, it may lead to some amount of data loss.
• Although, PCA tends to find linear correlations between variables, which is sometimes undesirable.
• Also, PCA fails in cases where mean and covariance are not enough to define datasets.
• Further, we may not know how many principal components to keep- in practice, some thumb rules are applied.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM. ### Top 50 interview question on Statistics

#### Interview question on Statistics

1. What are the different types of Sampling?
Ans: Some of the Common sampling ways are as follows:

• Simple random sample: Every member and set of members have an equal chance of being included in the sample. Technology, random number generators, or some other sort of change process is needed to get a simple random sample.

Example—A teacher puts students’ names in a hat and chooses without looking to get a sample of students.

Why it’s good: Random samples are usually fairly representative since they don’t favor certain members.

• Stratified random sample: The population is first split into groups. The overall sample consists of some members of every group. The members of each group are chosen randomly.

Example—A student council surveys 100100100 students by getting random samples of 252525 freshmen, 252525 sophomores, 252525 juniors, and 252525 seniors.

Why it’s good: A stratified sample guarantees that members from each group will be represented in the sample, so this sampling method is good when we want some members from every group.

• Cluster random sample: The population is first split into groups. The overall sample consists of every member of the group. The groups are selected at random.

Example—An airline company wants to survey its customers one day, so they randomly select 555 flights that day and survey every passenger on those flights.

Why it’s good: A cluster sample gets every member from some of the groups, so it’s good when each group reflects the population as a whole.

• Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every nth member is selected to be in the sample.

Example—A principal takes an alphabetized list of student names and picks a random starting point. Every 20th student is selected to take a survey.

2. What is the confidence interval? What is its significance?

Ans: A confidence interval, in statistics, refers to the probability that a population parameter will fall between two set values for a certain proportion of times. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. A confidence interval can take any number of probabilities, with the most common being a 95% or 99% confidence level.

3. What are the effects of the width of the confidence interval?

• The confidence interval is used for decision making
•  The confidence level increases the width of
• The confidence interval also increases
• As the width of the confidence interval increases, we tend to get useless information also.
• Useless information – wide CI
• High risk – narrow CI

4.  What is the level of significance (Alpha)?

Ans: The significance level also denoted as alpha or α, is a measure of the strength of the evidence that must be present in your sample before you will reject the null hypothesis and conclude that the effect is statistically significant. The researcher determines the significance level before conducting the experiment.

The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.

Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.

5. What are Skewness and Kurtosis? What does it signify?

Ans: Skewness: It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in the data distribution. It differentiates extreme values in one versus the other tail. The asymmetrical distribution will have a skewness of 0.

There are two types of Skewness: Positive and Negative Positive Skewness means when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.

Negative Skewness is when the tail of the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.

So, when is the skewness too much?

The rule of thumb seems to be:

• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
• If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.

Example

Let us take a very common example of house prices. Suppose we have house values ranging from \$100k to \$1,000,000 with the average being \$500,000.

If the peak of the distribution was left of the average value, portraying a positive skewness in the distribution. It would mean that many houses were being sold for less than the average value, i.e. \$500k. This could be for many reasons, but we are not going to interpret those reasons here.

If the peak of the distributed data was right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.

Kurtosis: Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one versus the other tail. It is actually the measure of outliers present in the distribution.

High kurtosis in a data set is an indicator that data has heavy tails or outliers. If there is a high kurtosis, then, we need to investigate why do we have so many outliers. It indicates a lot of things, maybe wrong data entry or other things. Investigate!

Low kurtosis in a data set is an indicator that data has light tails or a lack of outliers. If we get low kurtosis(too good to be true), then also we need to investigate and trim the dataset of unwanted results.

Mesokurtic: This distribution has kurtosis statistics similar to that of the normal distribution. It means that the extreme values of the distribution are similar to that of a normal distribution characteristic. This definition is used so that the standard normal distribution has a kurtosis of three.

Leptokurtic (Kurtosis > 3): Distribution is longer, tails are fatter. The peak is higher and sharper than Mesokurtic, which means that data are heavy-tailed or profusion of outliers.

Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.

Platykurtic: (Kurtosis < 3): Distribution is shorter; tails are thinner than the normal distribution. The peak is lower and broader than Mesokurtic, which means that data are light-tailed or lack of outliers. The reason for this is because the extreme values are less than that of the normal distribution.

6. What are Range and IQR? What does it signify?

Ans: Range: The range of a set of data is the difference between the highest and lowest values in the set.

IQR(Inter Quartile Range): The interquartile range (IQR) is the difference between the first quartile and the third quartile. The formula for this is:

IQR = Q3 – Q1

The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartile is, indicates how to spread out the middle 50% of our set of data is.

7.  What is the difference between Variance and Standard Deviation? What is its significance?

Ans: The central tendency mean gives you the idea of an average of the data points( i.e center location of the distribution) And now you want to know how far are your data points from mean So, here comes the concept of variance to calculate how far are your data points from mean (in simple terms, it is to calculate the variation of your data points from mean) Standard deviation is simply the square root of variance. And the standard deviation is also used to calculate the variation of your data points (And you may be asking, why do we use standard deviation when we have variance. Because in order to maintain the calculations in same units i.e suppose mean is in 𝑐𝑚/𝑚, then the variance is in 𝑐𝑚2/𝑚2, whereas standard deviation is in 𝑐𝑚/𝑚, so we use standard deviation most) 8.  What is selection Bias? Types of Selection Bias?

Ans: Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

• Sampling bias: a biased sample caused by non-random sampling
• Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
• Exposure: includes clinical susceptibility bias, protopathic bias, indication bias. Read more here.
• Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
• Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where those that ‘failed’ are only included
• Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.

9.  What are the ways of handling missing Data?

• Delete rows with missing data
• Mean/Median/Mode imputation
• Assigning a unique value
• Predicting the missing values using Machine Learning Models
• Using an algorithm that supports missing values, like random forests.

10.  What are the different types of the probability distribution? Explain with example?

Ans: The common Probability Distribution is as follows:

1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution

1. Bernoulli Distribution: A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Example: whether it’s going to rain tomorrow or not where rain denotes success and no rain denotes failure and Winning (success) or losing (failure) the game.

2. Uniform Distribution: When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the n number of possible outcomes of a uniform distribution are equally likely.

Example: Rolling a fair dice.

3. Binomial Distribution: A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose and where the probability of success and failure is the same for all the trials is called a Binomial Distribution.

• Each trial is independent.
• There are only two possible outcomes in a trial- either a success or a failure.
• A total number of n identical trials are conducted.
• The probability of success and failure is the same for all trials. (Trials are identical.)

Example: Tossing a coin.

4. Normal Distribution: Normal distribution represents the behavior of most of the situations in the universe (That is why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables often turns out to be normally distributed, contributing to its widespread application. Any distribution is known as Normal distribution if it has the following characteristics:

• The mean, median, and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
• The total area under the curve is 1.
• Exactly half of the values are to the left of the center and the other half to the right.

5. Poisson Distribution: A distribution is called Poisson distribution when the following assumptions are valid:

• Any successful event should not influence the outcome of another successful event.
• The probability of success over a short interval must equal the probability of success over a longer interval.
• The probability of success in an interval approaches zero as the interval becomes smaller.

Example: The number of emergency calls recorded at a hospital in a day.

11. What are the statistical Tests? List Them.

Ans: Statistical tests are used in hypothesis testing. They can be used to:

• determine whether a predictor variable has a statistically significant relationship with an outcome variable.
• estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

Common Tests in Statistics:

1. T-Test/Z-Test
2. ANOVA
3. Chi-Square Test
4. MANOVA 12. How do you calculate the sample size required?

Ans: You can use the margin of error (ME) formula to determine the desired sample size. • t/z = t/z score used to calculate the confidence interval
• ME = the desired margin of error
• S = sample standard deviation

13. What are the different Biases associated when we sample?

Ans: Potential biases include the following:

• Sampling bias: a biased sample caused by non-random sampling
• Under coverage bias: sampling too few observations
• Survivorship bias: error of overlooking observations that did not make it past a form of the selection process.

14.  How to convert normal distribution to standard normal distribution?

Standardized normal distribution has mean = 0 and standard deviation = 1

To convert normal distribution to standard normal distribution we can use the

formula: X (standardized) = (x-µ) / σ

15. How to find the mean length of all fishes in a river?

• Define the confidence level (most common is 95%)
• Take a sample of fishes from the river (to get better results the number of fishes > 30)
• Calculate the mean length and standard deviation of the lengths
• Calculate t-statistics
• Get the confidence interval in which the mean length of all the fishes should be.

16.  What do you mean by the degree of freedom?

• DF is defined as the number of options we have
• DF is used with t-distribution and not with Z-distribution
• For a series, DF = n-1 (where n is the number of observations in the series)

17. What do you think if DF is more than 30?

• As DF increases the t-distribution reaches closer to the normal distribution
• At low DF, we have fat tails
• If DF > 30, then t-distribution is as good as the normal distribution.

18. When to use t distribution and when to use z distribution?

• The following conditions must be satisfied to use Z-distribution
• Do we know the population standard deviation?
• Is the sample size > 30?
• CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
• Else we should use t-distribution
• CI = x (bar) – t*s/√n to x (bar) + t*s/√n

19. What are H0 and H1? What is H0 and H1 for the two-tail test?

• H0 is known as the null hypothesis. It is the normal case/default case.

For one tail test x <= µ

For two-tail test x = µ

• H1 is known as an alternate hypothesis. It is the other case.

For one tail test x > µ

For two-tail test x <> µ

20. What is the Degree of Freedom?

DF is defined as the number of options we have:

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)

21. How to calculate p-Value?

Ans: Calculating p-value:

Using Excel:

1. Go to the Data tab
2. Click on Data Analysis
3. Select Descriptive Statistics
4. Choose the column
5. Select summary statistics and confidence level (0.95)

By Manual Method:

1. Find H0 and H1
2. Find n, x(bar) and s
3. Find DF for t-distribution
4. Find the type of distribution – t or z distribution
5. Find t or z value (using the look-up table)
6. Compute the p-value to the critical value

22. What is ANOVA?

Ans: ANOVA expands to the analysis of variance, is described as a statistical technique used to determine the difference in the means of two or more populations, by examining the amount of variation within the samples corresponding to the amount of variation between the samples. It bifurcates the total amount of variation in the dataset into two parts, i.e. the amount ascribed to chance and the amount ascribed to specific causes.

It is a method of analyzing the factors which are hypothesized or affect the dependent variable. It can also be used to study the variations amongst different categories, within the factors, that consist of numerous possible values. It is of two types:

One way ANOVA: When one factor is used to investigate the difference between different categories, having many possible values.

Two way ANOVA: When two factors are investigated simultaneously to measure the interaction of the two factors influencing the values of a variable.

23.  What is ANCOVA?

Ans: ANCOVA stands for Analysis of Covariance, is an extended form of ANOVA, that eliminates the effect of one or more interval-scaled extraneous variable, from the dependent variable before carrying out research. It is the midpoint between ANOVA and regression analysis, wherein one variable in two or more populations can be compared while considering the variability of other variables.

When in a set of independent variables consist of both factor (categorical independent variable) and covariate (metric independent variable), the technique used is known as ANCOVA. The difference independent variables because of the covariate are taken off by an adjustment of the dependent variable’s mean value within each treatment condition.

This technique is appropriate when the metric independent variable is linearly associated with the dependent variable and not to the other factors. It is based on certain assumptions which are:

• There is some relationship between the dependent and uncontrolled variables.
• The relationship is linear and is identical from one group to another.
• Various treatment groups are picked up at random from the population.
• Groups are homogeneous in variability.

24.  What is the difference between ANOVA and ANCOVA?

Ans: The points given below are substantial so far as the difference between ANOVA and ANCOVA is concerned:

• The technique of identifying the variance among the means of multiple groups for homogeneity is known as Analysis of Variance or ANOVA. A statistical process which is used to take off the impact of one or more metric-scaled undesirable variable from the dependent variable before undertaking research is known as ANCOVA.
• While ANOVA uses both linear and non-linear models. On the contrary, ANCOVA uses only a linear model.
• ANOVA entails only categorical independent variables, i.e. factor. As against this, ANCOVA encompasses a categorical and a metric independent variable.
• A covariate is not taken into account, in ANOVA, but considered in ANCOVA.
• ANOVA characterizes between-group variations, exclusively to treatment. In contrast, ANCOVA divides between-group variations to treatment and covariate.
• ANOVA exhibits within-group variations, particularly individual differences. Unlike ANCOVA, which bifurcates within-group variance in individual differences and covariate.

25.  What are t and z scores? Give Details.

T-Score vs. Z-Score: Overview: A z-score and a t score are both used in hypothesis testing.

T-score vs. z-score: When to use a t score:

The general rule of thumb for when to use a t score is when your sample:

Has a sample size below 30,

Has an unknown population standard deviation.

You must know the standard deviation of the population and your sample size should be above 30 in order for you to be able to use the z-score. Otherwise, use the t-score.

Z-score

Technically, z-scores are a conversion of individual scores into a standard form. The conversion allows you to more easily compare different data. A z-score tells you how many standard deviations from the mean your result is. You can use your knowledge of normal distributions (like the 68 95 and 99.7 rule) or the z-table to determine what percentage of the population will fall below or above your result.

The z-score is calculated using the formula:

• z = (X-μ)/σ

Where:

• σ is the population standard deviation and
• μ is the population mean.
• The z-score formula doesn’t say anything about sample size; The rule of thumb applies that your sample size should be above 30 to use it.

T-score

Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation; You make an estimate by using your sample.

• T = (X – μ) / [ s/√(n) ]

Where:

• s is the standard deviation of the sample.

If you have a larger sample (over 30), the t-distribution and z-distribution look pretty much the same.

To know more about Data Science, Artificial Intelligence, Machine Learning, and Deep Learning programs visit our website www.learnbay.co

Watch our Live Session Recordings to precisely understand statistics, probability, calculus, linear algebra, and other math concepts used in data science. 