### Clustering & Types Of Clustering

Clustering is the process of finding groups of similar items in data; each such group is called a cluster. It puts data instances that are similar to each other into one cluster and data instances that are very different (far away) from each other into different clusters. A cluster is, therefore, a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

The method of identifying similar groups of data in a dataset is called clustering. It is one of the most popular techniques in data science. Entities in each group are comparatively more similar to the other entities of that group than to those of the other groups. In this article, I will take you through the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.

Steps involved in Clustering analysis:

1. Formulate the problem – select variables to be used for clustering.

2. Decide on the clustering procedure: hierarchical or non-hierarchical.

3. Select the measure of similarity or dissimilarity.

4. Choose clustering algorithms.

5. Decide the number of clusters.

6. Interpret the cluster output (profile the clusters).

7. Validate the clusters.

### Types of clustering technique:

Broadly speaking, clustering can be divided into two subgroups:

• Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of the 10 groups.
• Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the same retail store scenario, each customer is assigned a probability of being in each of the 10 clusters.
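The difference between hard and soft assignments can be seen in a minimal sketch. It assumes scikit-learn is available and uses made-up toy points; `KMeans` stands in for a hard method and `GaussianMixture` for a soft one, which is one possible pairing rather than the only one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

# Hard clustering: every point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Soft clustering: every point gets one probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
soft_probs = gmm.predict_proba(points)

print(hard_labels[:5])          # one integer label per point
print(soft_probs[:5].round(3))  # one row of probabilities per point
```

Each row of `soft_probs` sums to 1, so a point near the boundary between two clusters would show a split such as 0.6 / 0.4 instead of a single label.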

Types of clustering are:

k-means clustering:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-Means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can, for example, be found using k-medians and k-medoids.

K-means is an iterative clustering algorithm that converges to a local optimum: each iteration reduces (or leaves unchanged) the within-cluster sum of squares. The algorithm works in these 6 steps:

1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: Let’s assign three points to cluster 1, shown in red, and two points to cluster 2, shown in grey.
3. Compute cluster centroids: The centroid of the data points in the red cluster is shown using a red cross, and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: Note that the data point at the bottom is currently assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: Now, re-compute the centroids of both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until the assignments stop changing. When no data point switches clusters between two successive iterations, the algorithm terminates, unless a maximum number of iterations has been set explicitly. The result is a local optimum, which is not guaranteed to be the global optimum and can depend on the initial assignment.
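The six steps can be sketched in plain NumPy as follows. This is an illustrative implementation, not the one scikit-learn uses; the function name `kmeans`, the five toy points, and the handling of empty clusters are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means following the steps above; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: random initial assignment, arranged so that no cluster starts empty.
    labels = rng.permutation(np.arange(len(points)) % k)
    for _ in range(max_iter):
        # Steps 3 and 5: each centroid is the mean of the points in its cluster.
        # Assumption: an empty cluster is re-seeded with a random data point.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(len(points))]
            for j in range(k)
        ])
        # Step 4: re-assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point switches cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Step 1: k = 2 for five hypothetical points in 2-D space.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On data this well separated, the three points near the origin end up in one cluster and the two distant points in the other, whichever random assignment the run starts from.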

```
from pandas import DataFrame

Data = {
    'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
    'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7],
}

df = DataFrame(Data, columns=['x', 'y'])
print(df)
```

k-means with 3 clusters:

```
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Data = {
    'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
    'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7],
}

df = DataFrame(Data, columns=['x', 'y'])

# Fit k-means with 3 clusters and read off the cluster centres
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# Colour each point by its cluster label and mark the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()
```

Hierarchical Clustering:

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into one cluster, and this step is repeated. The algorithm terminates when there is only a single cluster left.

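The results of hierarchical clustering can be shown using a dendrogram. In a dendrogram, each leaf is a data point, and the height at which two branches join shows the distance between the two clusters being merged.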

Two important things that you should know about hierarchical clustering are:

• The algorithm has been described above using a bottom-up (agglomerative) approach. It is also possible to follow a top-down (divisive) approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned to a separate cluster.
• The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters:
  • Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
  • Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
  • Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
  • Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
  • Mahalanobis distance: $\sqrt{(a-b)^T S^{-1} (a-b)}$, where $S$ is the covariance matrix
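As a quick check of these definitions, the sketch below computes each metric with NumPy for two made-up vectors `a` and `b`. The diagonal covariance matrix `S` is a hypothetical choice that keeps the Mahalanobis arithmetic easy to follow by hand.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
diff = a - b  # [-3, -4, 0]

euclidean = np.sqrt(np.sum(diff ** 2))   # sqrt(9 + 16 + 0) = 5
squared_euclidean = np.sum(diff ** 2)    # 25
manhattan = np.sum(np.abs(diff))         # 3 + 4 + 0 = 7
maximum = np.max(np.abs(diff))           # 4

def mahalanobis(a, b, S):
    """Mahalanobis distance between a and b for covariance matrix S."""
    d = a - b
    return np.sqrt(d @ np.linalg.inv(S) @ d)

# Hypothetical covariance matrix with variances 9, 16 and 1 and no correlation.
S = np.diag([9.0, 16.0, 1.0])
print(euclidean, squared_euclidean, manhattan, maximum)
print(mahalanobis(a, b, S))  # sqrt(9/9 + 16/16 + 0/1) = sqrt(2)
```

With `S` equal to the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.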

```
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[5,3], [10,15], [15,12], [24,10], [30,30],
              [85,70], [71,80], [60,78], [70,55], [80,91]])

# Plot the raw points, each annotated with its index (1 to 10)
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0], X[:,1], label='True Position')

for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label, xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()
```

```
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# Build the hierarchy with single linkage and draw the dendrogram
linked = linkage(X, 'single')
labelList = list(range(1, 11))

plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
```
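A dendrogram shows the full hierarchy, but in practice you usually want flat cluster labels as well. One way to get them, sketched here on the same ten points, is SciPy's `fcluster`; the choice of two clusters is an assumption made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30],
              [85, 70], [71, 80], [60, 78], [70, 55], [80, 91]])

# Same single-linkage hierarchy as in the dendrogram example.
linked = linkage(X, method='single')

# Cut the hierarchy so that at most 2 flat clusters remain.
flat_labels = fcluster(linked, t=2, criterion='maxclust')
print(flat_labels)
```

For this data the cut separates the five points in the lower-left region from the five in the upper-right region, which matches the two main branches of the dendrogram.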


### Introduction to Simple Linear Regression in Machine Learning

Whichever ML course you choose, one of the first topics in its statistics module will be linear regression (LR) or, more precisely, simple linear regression. This widely used ML algorithm is commonly abbreviated as SLR.

In this blog, we’ll look at the foundations of simple linear regression in ML modelling.

#### What is SLR in Machine Learning?

Simple linear regression (SLR) is a technique for examining the relationship between two variables. One of them is the independent variable, also referred to as the ‘explanatory’, ‘stimulus’, or ‘predictor’ variable. The other is the dependent variable, also known as the ‘response’ or ‘outcome’ variable.
Why ‘simple’? The word “simple” refers to the fact that only two variables are used in this regression method. A straight line is used to model the relationship between them.

When you work on machine learning problems and want to arrive at useful outcomes, you often need to find the relationship between a pair of variables of the two types above. This is where simple linear regression comes in.

#### What are the real-life applications of SLR algorithms?

A list of real-life uses of SLR in ML would be practically endless. A few handy real-world examples are as follows.

• Suppose you have decided to train your company’s employees in the basics of data analytics to improve business outcomes. The amount you invest in this training is the independent variable, and the resulting percentage ROI from improved business decisions is the outcome variable.
• Suppose you plan to buy a second-hand car but find it difficult to set a budget based on car performance. To ensure performance and parts availability, you decide to consider cars only up to a certain age. In such a scenario, you can apply SLR to set your budget: the age of the car is the independent variable, and the budget is the outcome variable.
• Suppose you work in marketing at an e-commerce company. A few months back, your company implemented new advertising strategies, and now you want to evaluate how monthly sales relate to monthly advertising cost. Here you can apply SLR for ML modelling.

SLR can be a good solution to many business problems, from moderate to complex. Just keep one thing in mind: make sure the linearity condition holds.

#### What is the linearity condition in SLR?

SLR tries to explain the changes in the value of the dependent variable Y using knowledge of the values of the predictor (independent) variable X.
The expression $\alpha + \beta X_i$ gives the predicted value of $Y_i$ for a given value of $X_i$. In other words, you can consider $\alpha + \beta X_i$ as the conditional expected value of $Y_i$ given the value of $X_i$.
Here $\alpha$ and $\beta$ are the linear regression coefficients.
The most vital thing to remember is that the linearity condition in linear regression refers to linearity in the regression coefficients, not in the explanatory variable.
Therefore, models such as the following, which transform $X$ but remain linear in $\alpha$ and $\beta$, still count as SLR:

$$Y_i = \alpha + \beta X_i^2 + \varepsilon_i \qquad \text{and} \qquad Y_i = \alpha + \beta \ln(X_i) + \varepsilon_i$$
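To illustrate that a model can be non-linear in X yet linear in its coefficients, the sketch below generates synthetic data from the logarithmic form and recovers the coefficients with an ordinary straight-line fit on ln(X). The "true" values 2.0 and 3.0 and the noise level are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data following Y = alpha + beta * ln(X) + noise.
alpha_true, beta_true = 2.0, 3.0
x = np.linspace(1, 50, 200)
y = alpha_true + beta_true * np.log(x) + rng.normal(scale=0.1, size=x.size)

# Regress Y on the transformed variable ln(X). The curve is non-linear in X,
# but the fit is an ordinary straight-line (degree-1) least-squares fit.
beta_hat, alpha_hat = np.polyfit(np.log(x), y, deg=1)

print(alpha_hat, beta_hat)  # should land close to 2.0 and 3.0
```

The same idea applies to the squared form: regress Y on X² instead of on ln(X).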

#### What can simple linear regression tell us that correlation does not tell us?

Although correlation may seem similar to simple linear regression, there are several differences between the two.

Difference 1: Correlation quantifies the degree to which two variables are related. It does not fit a line through the data, so unlike regression, it gives no equation for predicting one variable from the other.

Difference 2: Correlation is typically used when both variables are simply measured. It is rarely appropriate when one variable is something you deliberately control. With simple linear regression, by contrast, the X variable is often something that you manipulate (it may be time, a salary range, a price, etc.), and the Y variable is something that you measure.
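A small numeric sketch makes the contrast concrete. The five (x, y) pairs below are made up; the point is that correlation is a single symmetric number, while regression yields a line whose slope depends on which variable is treated as the predictor.

```python
import numpy as np

# Hypothetical measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Correlation: one number, the same whichever variable comes first.
r = np.corrcoef(x, y)[0, 1]
r_swapped = np.corrcoef(y, x)[0, 1]

# Regression of y on x: a fitted line that can be used for prediction.
slope, intercept = np.polyfit(x, y, deg=1)

# Regression of x on y gives a different line.
slope_swapped, _ = np.polyfit(y, x, deg=1)

print(r, r_swapped)
print(slope, intercept, slope_swapped)
```

The two ideas are linked: the slope of y on x equals r multiplied by the ratio of the standard deviations of y and x.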

#### How does SLR work?

To apply SLR to your identified problem, you need to follow a seven-step process.

Step#1: Visualise the relationship between the identified variables graphically. The standard type of graph used in SLR is a scatter plot.

Step#2: Use the ordinary least squares (OLS) technique to estimate the regression parameters and define the form of the relationship between the variables.

Step#3: Calculate the standard error of the regression estimate.

Step#4: Calculate prediction intervals for the predicted value at a given X, based on the assumption that the errors are normally distributed.

Step#5: Test the significance of the estimated regression parameters.

Step#6: Test the goodness of fit of the model as a whole. Keep in mind that, in SLR, the p-value of the F-test and the p-value of the test on the slope coefficient are identical.

Step#7: Compute the coefficient of determination and the correlation coefficient.
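Steps 2 to 7 can be sketched numerically with NumPy and SciPy. The eight data points and the new value `x0` are made up, and `scipy.stats.linregress` is just one convenient way to get the OLS estimates.

```python
import numpy as np
from scipy import stats

# Hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 4.1, 5.2, 8.4, 9.1, 12.5, 13.2, 16.8])
n = x.size

# Step 2: OLS estimates of the slope and intercept.
res = stats.linregress(x, y)

# Step 3: standard error of the regression (residual standard deviation).
residuals = y - (res.intercept + res.slope * x)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Step 4: 95% prediction interval for a new observation at x0,
# assuming normally distributed errors.
x0 = 6.5
y0 = res.intercept + res.slope * x0
t_crit = stats.t.ppf(0.975, df=n - 2)
se_pred = s_e * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
interval = (y0 - t_crit * se_pred, y0 + t_crit * se_pred)

# Step 5: t-test on the slope (res.pvalue is its two-sided p-value).
t_stat = res.slope / res.stderr

# Step 6: F-test for the whole model; in SLR, F equals t squared.
f_stat = t_stat ** 2
p_f = stats.f.sf(f_stat, 1, n - 2)

# Step 7: correlation coefficient and coefficient of determination.
r = res.rvalue
r_squared = r ** 2

print(res.slope, res.intercept, s_e)
print(interval)
print(res.pvalue, p_f, r_squared)
```

The last line prints the slope p-value and the F-test p-value side by side, which shows the identity mentioned in step 6.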

#### Why use a scatter diagram in SLR?

When you choose SLR as your regression model, the first thing you need to do is assess the relationship between your identified variables.
The best graphical tool for this is the scatter plot, for the following reasons:

• Even before a best-fit line is drawn, the dots (data points of the identified variables) help a lot in visualising the pattern of the relationship between the variables.
• If the variables prove to be related, an equation for the relationship can be estimated. With the help of this estimated equation, you can proceed with your ML modelling.

When simple linear regression is applied to a business problem, the scatter plot of the identified variables usually takes one of the following six forms:

Fig:1
The above plot indicates a direct (positive) linear relationship between the two types of variables (dependent and independent).

Fig:2
The above plot indicates a direct but curvilinear relationship between the two types of variables.

Fig:3
The above plot indicates an inverse (negative) linear relationship between the two types of variables.

Fig:4
The above plot indicates an inverse and curvilinear relationship between the two types of variables.

Fig:5
The above plot indicates an inverse linear relationship between the two types of variables; unlike figure 3, however, the extent of scattering is much higher in this case.

Fig:6
The above plot indicates a non-linear relationship between the variables.

#### How to calculate the SLR in ML modeling?

An SLR model can be built with either Python or R. Here, I will explain the Python variant.

To program an SLR model in Python, six main steps have to be followed carefully:

• #1: Importing the dataset

• #2: Pre-processing the data

• #3: Splitting the data into training and test sets

• #4: Fitting the linear regression model to the training set

• #5: Predicting the test-set results

• #6: Visualising the test-set results

When using Python, steps 1 to 5 remain almost the same from problem to problem. Step 6, however, changes a bit depending on which type of graph or chart you use. The generic Python code for SLR is as follows.

```
# Importing the dataset
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

dataset = pd.read_csv('file name.csv')
dataset.head()
```

```
# Data pre-processing
X = dataset.iloc[:, :-1].values  # X is the array of independent variables
Y = dataset.iloc[:, 1].values    # Y is the vector of the dependent variable
```

```
# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split

# A test size of 1/3 is used here; 20-80 or 30-70 splits are also common.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
```

```
# Fitting the linear regression model to the training set
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)  # estimates the linear equation for this dataset
```

```
# Predicting the test-set results
y_pred = regressor.predict(X_test)
y_pred  # predicted values

Y_test  # actual values, for comparison
```
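Step #6, visualising the test-set results, is not covered by the code above. A minimal sketch of one way to do it follows. Because the original CSV file is not available, it uses synthetic data; the variable values, colours and labels are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the CSV file: one predictor, one outcome.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 1))
Y = 25 + 9 * X[:, 0] + rng.normal(scale=3, size=30)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
regressor = LinearRegression().fit(X_train, Y_train)

# Scatter the held-out test points and overlay the line fitted on the training set.
fig, ax = plt.subplots()
ax.scatter(X_test[:, 0], Y_test, color="red", label="Test data")
ax.plot(X_train[:, 0], regressor.predict(X_train), color="blue", label="Fitted line")
ax.set_xlabel("X (independent variable)")
ax.set_ylabel("Y (dependent variable)")
ax.legend()
```

In an interactive session you would drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end to display the chart.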

#### Where to learn SLR?

If you want to learn more about the application of SLR in ML, you can join IBM certified Learnbay Data science and AI certification courses. The data science course syllabus of Learnbay offers balanced learning scopes on both statistics and programming- the two key pillars of data science career growth. Our AI and ML courses are available for both fresh graduates and working professionals. All of our courses are entitled to real-time industrial projects and live online classes. Our course is available in all the prime cities across India, such as Mumbai, Kolkata, Bengaluru, Hyderabad, Delhi, Lucknow, and Patna.

### Top 50 interview questions on Statistics

#### Interview questions on Statistics

1. What are the different types of Sampling?
Ans: Some of the common sampling methods are as follows:

• Simple random sample: Every member and set of members have an equal chance of being included in the sample. Technology, random number generators, or some other sort of chance process is needed to get a simple random sample.

Example—A teacher puts students’ names in a hat and chooses without looking to get a sample of students.

Why it’s good: Random samples are usually fairly representative since they don’t favor certain members.

• Stratified random sample: The population is first split into groups. The overall sample consists of some members of every group. The members of each group are chosen randomly.

Example: A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.

Why it’s good: A stratified sample guarantees that members from each group will be represented in the sample, so this sampling method is good when we want some members from every group.

• Cluster random sample: The population is first split into groups. The groups are selected at random, and the overall sample consists of every member of the selected groups.

Example: An airline company wants to survey its customers one day, so they randomly select 5 flights that day and survey every passenger on those flights.

Why it’s good: A cluster sample gets every member from some of the groups, so it’s good when each group reflects the population as a whole.

• Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every nth member is selected to be in the sample.

Example—A principal takes an alphabetized list of student names and picks a random starting point. Every 20th student is selected to take a survey.
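The four sampling methods above can be sketched with NumPy. The population of 400 students, the grade labels, and the classrooms of 20 are all hypothetical and chosen to mirror the examples in the answers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 400 students, 100 per grade, in classrooms of 20.
grade_names = ["freshman", "sophomore", "junior", "senior"]
ids = np.arange(400)
grades = np.repeat(grade_names, 100)
classrooms = ids // 20

# Simple random sample: 100 students, each equally likely to be picked.
simple = rng.choice(ids, size=100, replace=False)

# Stratified random sample: 25 students drawn at random from every grade.
stratified = np.concatenate([
    rng.choice(ids[grades == g], size=25, replace=False) for g in grade_names
])

# Cluster random sample: pick 5 classrooms at random and take everyone in them.
chosen_rooms = rng.choice(np.unique(classrooms), size=5, replace=False)
cluster = ids[np.isin(classrooms, chosen_rooms)]

# Systematic random sample: random starting point, then every 20th student.
start = rng.integers(20)
systematic = ids[start::20]

print(len(simple), len(stratified), len(cluster), len(systematic))
```

Note how the stratified sample guarantees 25 students per grade, while the simple random sample only achieves that balance on average.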

2. What is the confidence interval? What is its significance?

Ans: A confidence interval, in statistics, is a range of values, computed from sample data, that is expected to contain a population parameter for a certain proportion of repeated samples. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. A confidence interval can be computed for any confidence level, with the most common being 95% or 99%.

3. What are the effects of the width of the confidence interval?

• The confidence interval is used for decision making.
• As the confidence level increases, the width of the confidence interval also increases.
• As the width of the confidence interval increases, we tend to get useless information as well.
• Useless information – wide CI
• High risk – narrow CI
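The link between confidence level and interval width can be checked with a short sketch. The ten sample values are made up, and the interval is the usual t-based interval for a mean.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of ten measurements.
sample = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6])
n = sample.size
mean = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation

widths = {}
for level in (0.90, 0.95, 0.99):
    # Two-sided critical value of the t-distribution with n - 1 degrees of freedom.
    t_crit = stats.t.ppf((1 + level) / 2, df=n - 1)
    margin = t_crit * s / np.sqrt(n)
    widths[level] = 2 * margin
    print(f"{level:.0%} CI: {mean - margin:.3f} to {mean + margin:.3f}")
```

The printed intervals get wider as the confidence level goes from 90% to 99%: more certainty about containing the true mean comes at the cost of a less precise statement.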

4.  What is the level of significance (Alpha)?

Ans: The significance level, also denoted as alpha or α, is a measure of the strength of the evidence that must be present in your sample before you will reject the null hypothesis and conclude that the effect is statistically significant. The researcher determines the significance level before conducting the experiment.

The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.

Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.
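The comparison of a p-value with the significance level can be sketched as follows. The sample and the hypothesised mean of 5.0 are made up, and a one-sample t-test is used purely as an example of a test that returns a p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the null hypothesis is that the population mean is 5.0.
sample = np.array([5.4, 5.1, 5.6, 5.8, 5.3, 5.7, 5.5, 5.9, 5.2, 5.6])
alpha = 0.05  # significance level, chosen before looking at the data

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# Reject the null hypothesis only if the p-value falls below alpha.
reject_null = bool(p_value < alpha)
print(t_stat, p_value, reject_null)
```

Here the sample mean (5.51) is far from 5.0 relative to the sample's spread, so the p-value falls well below 0.05 and the null hypothesis is rejected.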

5. What are Skewness and Kurtosis? What does it signify?

Ans: Skewness: It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in the data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.

There are two types of Skewness: Positive and Negative

Positive skewness is when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.

Negative skewness is when the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.

So, when is the skewness too much?

The rule of thumb seems to be:

• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
• If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.

Example

Let us take a very common example of house prices. Suppose we have house values ranging from \$100k to \$1,000,000 with the average being \$500,000.

If the peak of the distribution were to the left of the average value, that would indicate positive skewness in the distribution. It would mean that many houses were being sold for less than the average value, i.e. \$500k. This could be for many reasons, but we are not going to interpret those reasons here.

If the peak of the distributed data was right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.

Kurtosis: Kurtosis is all about the tails of the distribution, not its peakedness or flatness. It is used to describe the extreme values in either tail. It is actually a measure of the outliers present in the distribution.

High kurtosis in a data set is an indicator that the data has heavy tails or outliers. If there is high kurtosis, we need to investigate why we have so many outliers. It could indicate many things, such as wrong data entry. Investigate!

Low kurtosis in a data set is an indicator that the data has light tails or a lack of outliers. If we get low kurtosis (too good to be true), we also need to investigate and trim the dataset of unwanted results.

Mesokurtic: This distribution has kurtosis statistics similar to that of the normal distribution. It means that the extreme values of the distribution are similar to that of a normal distribution characteristic. This definition is used so that the standard normal distribution has a kurtosis of three.

Leptokurtic (Kurtosis > 3): The distribution is longer and its tails are fatter. The peak is higher and sharper than that of a mesokurtic distribution, which means that the data are heavy-tailed or have a profusion of outliers.

Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.

Platykurtic (Kurtosis < 3): The distribution is shorter, and its tails are thinner than those of the normal distribution. The peak is lower and broader than that of a mesokurtic distribution, which means that the data are light-tailed or lack outliers. This is because the extreme values are less extreme than those of the normal distribution.
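Skewness and kurtosis can be computed with `scipy.stats`. The sketch below uses two simulated samples, one normal and one exponential; the sample size and seed are arbitrary. Note that `kurtosis` is called with `fisher=False` so that the normal distribution scores 3, matching the convention used above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated samples: symmetric (normal) and right-skewed (exponential).
normal_data = rng.normal(size=100_000)
right_skewed = rng.exponential(size=100_000)

skew_normal = stats.skew(normal_data)
skew_right = stats.skew(right_skewed)

# fisher=False gives the Pearson definition, where a normal distribution has kurtosis 3.
kurt_normal = stats.kurtosis(normal_data, fisher=False)
kurt_right = stats.kurtosis(right_skewed, fisher=False)

print(skew_normal, kurt_normal)  # close to 0 and 3: symmetric, mesokurtic
print(skew_right, kurt_right)    # positive skew, kurtosis above 3: leptokurtic
```

By default `stats.kurtosis` returns excess kurtosis (kurtosis minus 3), so check which definition a tool uses before applying the "greater than 3" rule.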

6. What are Range and IQR? What does it signify?

Ans: Range: The range of a set of data is the difference between the highest and lowest values in the set.

IQR(Inter Quartile Range): The interquartile range (IQR) is the difference between the first quartile and the third quartile. The formula for this is:

IQR = Q3 – Q1

The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartiles are, indicates how spread out the middle 50% of our data set is.
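Both measures are one-liners in NumPy. The nine values below are made up; `np.percentile` uses linear interpolation by default, and other quartile conventions can give slightly different results on small samples.

```python
import numpy as np

# Hypothetical data set.
data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17])

# Range: highest value minus lowest value.
data_range = data.max() - data.min()    # 17 - 1 = 16

# IQR: third quartile minus first quartile.
q1, q3 = np.percentile(data, [25, 75])  # 5 and 13 for this data
iqr = q3 - q1                           # 8

print(data_range, q1, q3, iqr)
```

Replacing the largest value with an outlier such as 170 would inflate the range but leave the IQR unchanged, which is why the IQR is considered the more robust measure of spread.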

7.  What is the difference between Variance and Standard Deviation? What is its significance?

Ans: The mean, as a measure of central tendency, gives you an idea of the average of the data points (i.e. the centre of the distribution). Next you want to know how far your data points are from the mean. This is where variance comes in: it is the average of the squared deviations of the data points from the mean (in simple terms, it measures the variation of your data points around the mean).

Standard deviation is simply the square root of the variance, and it is also used to measure the variation of your data points. You may be asking why we use standard deviation when we have variance. The reason is units: if the data (and hence the mean) are in cm or m, then the variance is in cm² or m², whereas the standard deviation is again in cm or m. Because it is in the same units as the data, the standard deviation is used most often.

8.  What is selection Bias? Types of Selection Bias?

Ans: Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

• Sampling bias: a biased sample caused by non-random sampling
• Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
• Exposure: includes clinical susceptibility bias, protopathic bias, indication bias.
• Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
• Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included.
• Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming that the data is not as spread out as it might actually be.

9.  What are the ways of handling missing Data?

• Delete rows with missing data
• Mean/Median/Mode imputation
• Assigning a unique value
• Predicting the missing values using Machine Learning Models
• Using an algorithm that supports missing values, like random forests.
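The first three approaches can be sketched with pandas. The small table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in both columns.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Mumbai"],
})

# Delete rows with missing data.
dropped = df.dropna()

# Mean imputation for a numeric column, mode imputation for a categorical one.
age_imputed = df["age"].fillna(df["age"].mean())
city_imputed = df["city"].fillna(df["city"].mode()[0])

# Assign a unique value that marks the entry as missing.
city_flagged = df["city"].fillna("Unknown")

print(dropped)
print(age_imputed.tolist())
print(city_imputed.tolist())
print(city_flagged.tolist())
```

Dropping rows keeps only two of the five records here, which shows how quickly deletion can shrink a small dataset compared with imputation.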

10.  What are the different types of probability distributions? Explain with examples.

Ans: The common probability distributions are as follows:

1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution

1. Bernoulli Distribution: A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Example: Whether or not it is going to rain tomorrow, where rain denotes success and no rain denotes failure; or winning (success) versus losing (failure) a game.

2. Uniform Distribution: When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the n number of possible outcomes of a uniform distribution are equally likely.

Example: Rolling a fair die.

3. Binomial Distribution: A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose and where the probability of success and failure is the same for all the trials is called a Binomial Distribution.

• Each trial is independent.
• There are only two possible outcomes in a trial- either a success or a failure.
• A total number of n identical trials are conducted.
• The probability of success and failure is the same for all trials. (Trials are identical.)

Example: Tossing a coin.

4. Normal Distribution: Normal distribution represents the behavior of most of the situations in the universe (That is why it’s called a “normal” distribution. I guess!). The sum of a large number of (small) random variables often turns out to be normally distributed, contributing to its widespread application. A normal distribution has the following characteristics:

• The mean, median, and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
• The total area under the curve is 1.
• Exactly half of the values are to the left of the center and the other half to the right.

5. Poisson Distribution: A distribution is called Poisson distribution when the following assumptions are valid:

• Any successful event should not influence the outcome of another successful event.
• The rate of success is constant, so the probability of success in an interval is proportional to the length of the interval.
• The probability of success in an interval approaches zero as the interval becomes smaller.

Example: The number of emergency calls recorded at a hospital in a day.
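To get a feel for these five distributions, the sketch below draws random samples from each with NumPy. All parameter values (p = 0.3, 10 coin tosses, an average of 4 calls per day, and so on) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of simulated draws per distribution

bernoulli = rng.binomial(n=1, p=0.3, size=n)    # rain (1) or no rain (0), p = 0.3
uniform_die = rng.integers(1, 7, size=n)        # fair die: 1 to 6, equally likely
binomial = rng.binomial(n=10, p=0.5, size=n)    # heads in 10 coin tosses
normal = rng.normal(loc=0, scale=1, size=n)     # standard normal
poisson = rng.poisson(lam=4, size=n)            # calls per day, average of 4

print(bernoulli.mean())              # close to p = 0.3
print(uniform_die.mean())            # close to 3.5
print(binomial.mean())               # close to n * p = 5
print(normal.mean(), normal.std())   # close to 0 and 1
print(poisson.mean(), poisson.var()) # both close to 4
```

The last line reflects a defining property of the Poisson distribution: its mean and variance are equal.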

11. What are the statistical Tests? List Them.

Ans: Statistical tests are used in hypothesis testing. They can be used to:

• determine whether a predictor variable has a statistically significant relationship with an outcome variable.
• estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

Common Tests in Statistics:

1. T-Test/Z-Test
2. ANOVA
3. Chi-Square Test
4. MANOVA

12. How do you calculate the sample size required?

Ans: You can use the margin of error (ME) formula to determine the desired sample size. Since ME = t*S/√n, solving for n gives:

n = (t*S/ME)²

where:

• t/z = t/z score used to calculate the confidence interval
• ME = the desired margin of error
• S = sample standard deviation
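Since the margin of error is the critical score times S divided by √n, the required sample size follows by solving for n. A minimal sketch, where the function name and the example numbers (z = 1.96, S = 15, ME = 3) are hypothetical:

```python
import math

def required_sample_size(score, s, me):
    """Smallest n for which score * s / sqrt(n) does not exceed the margin of error."""
    return math.ceil((score * s / me) ** 2)

# Hypothetical inputs: 95% confidence (z = 1.96), standard deviation 15, margin of error 3.
n_required = required_sample_size(1.96, 15, 3)
print(n_required)  # (1.96 * 15 / 3) ** 2 = 96.04, rounded up to 97
```

The result is rounded up because a sample size must be a whole number and rounding down would leave the margin of error slightly too large.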

13. What are the different Biases associated when we sample?

Ans: Potential biases include the following:

• Sampling bias: a biased sample caused by non-random sampling
• Undercoverage bias: some members of the population are inadequately represented in the sample
• Survivorship bias: error of overlooking observations that did not make it past a form of the selection process.

14.  How to convert normal distribution to standard normal distribution?

A standard normal distribution has mean = 0 and standard deviation = 1.

To convert a normal distribution to the standard normal distribution, we can use the formula: X (standardized) = (x-µ) / σ
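The formula can be applied to a whole list of values, as in this short sketch. The data values are made up, and µ and σ are taken from the data itself as an assumption.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to standardized values: (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

heights = [150, 160, 170, 180, 190]  # made-up values
z_values = standardize(heights)
print(z_values)
print(mean(z_values), pstdev(z_values))  # approximately 0 and 1
```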

15. How to find the mean length of all fishes in a river?

• Define the confidence level (most common is 95%)
• Take a sample of fishes from the river (to get better results the number of fishes > 30)
• Calculate the mean length and standard deviation of the lengths
• Calculate t-statistics
• Get the confidence interval in which the mean length of all the fishes should be.
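The steps above can be sketched as follows. The fish lengths are simulated, not real measurements, and because the sample has more than 30 fish the sketch uses the z critical value as an approximation of the t value. That is a simplifying assumption made so that only the standard library is needed.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
# Hypothetical sample: lengths (cm) of 40 fish, simulated for illustration.
lengths = [random.gauss(25, 4) for _ in range(40)]

confidence = 0.95
n = len(lengths)
x_bar = mean(lengths)
s = stdev(lengths)  # sample standard deviation

# With n > 30 the z critical value is used as an approximation of the t value.
critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = critical * s / math.sqrt(n)
ci_low, ci_high = x_bar - margin, x_bar + margin
print(f"mean = {x_bar:.2f} cm, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```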

16.  What do you mean by the degree of freedom?

• DF is defined as the number of values that are free to vary when a statistic is calculated
• DF is used with t-distribution and not with Z-distribution
• For a series, DF = n-1 (where n is the number of observations in the series)

17. What do you think if DF is more than 30?

• As DF increases the t-distribution reaches closer to the normal distribution
• At low DF, we have fat tails
• If DF > 30, then t-distribution is as good as the normal distribution.

18. When to use t distribution and when to use z distribution?

• Use the Z-distribution when both of the following conditions are satisfied:
• The population standard deviation is known.
• The sample size is > 30.
• In that case, CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
• Otherwise, use the t-distribution:
• CI = x (bar) – t*s/√n to x (bar) + t*s/√n

19. What are H0 and H1? What is H0 and H1 for the two-tail test?

• H0 is known as the null hypothesis. It is the normal case/default case.

For one tail test x <= µ

For two-tail test x = µ

• H1 is known as the alternative hypothesis. It is the other case.

For one tail test x > µ

For two-tail test x ≠ µ

20. What is the Degree of Freedom?

DF is defined as the number of values that are free to vary when a statistic is calculated:

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)

21. How to calculate p-Value?

Ans: Calculating p-value:

Using Excel:

1. Go to the Data tab
2. Click on Data Analysis
3. Select Descriptive Statistics
4. Choose the column
5. Select summary statistics and confidence level (0.95)

By Manual Method:

1. Find H0 and H1
2. Find n, x(bar) and s
3. Find DF for t-distribution
4. Find the type of distribution – t or z distribution
5. Find t or z value (using the look-up table)
6. Compute the p-value and compare it to the significance level
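The manual method can be sketched for a two-tailed z-test. All numbers below (hypothesized mean, sample size, sample mean, known population standard deviation, significance level) are hypothetical. The z case is shown because the normal distribution is available in the standard library.

```python
import math
from statistics import NormalDist

# Hypothetical numbers: H0: mu = 100, H1: mu != 100 (two-tailed test).
mu_0 = 100
n, x_bar, sigma = 36, 103, 9  # sample size, sample mean, known population std dev

# Test statistic: how many standard errors the sample mean is from mu_0.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Two-tailed p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # assumed significance level
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```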

22. What is ANOVA?

Ans: ANOVA, short for analysis of variance, is a statistical technique used to determine whether the means of two or more populations differ, by comparing the amount of variation within the samples to the amount of variation between the samples. It splits the total variation in the dataset into two parts: the amount ascribed to chance and the amount ascribed to specific causes.

It is a method of analyzing the factors that are hypothesized to affect the dependent variable. It can also be used to study the variations among the different categories within a factor, each of which can take numerous possible values. It is of two types:

One way ANOVA: When one factor is used to investigate the difference between different categories, having many possible values.

Two way ANOVA: When two factors are investigated simultaneously to measure the interaction of the two factors influencing the values of a variable.
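To make the between-sample versus within-sample comparison concrete, here is a minimal sketch of the one-way ANOVA F statistic. The three groups are made-up numbers and `one_way_anova_f` is a helper name introduced for this illustration.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic: between-group variance over within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data for three categories
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F value means the group means differ by more than the variation inside the groups would suggest.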

23.  What is ANCOVA?

Ans: ANCOVA, short for analysis of covariance, is an extended form of ANOVA that removes the effect of one or more interval-scaled extraneous variables from the dependent variable before the analysis is carried out. It sits midway between ANOVA and regression analysis: one variable can be compared across two or more populations while accounting for the variability of other variables.

When a set of independent variables consists of both a factor (categorical independent variable) and a covariate (metric independent variable), the technique used is known as ANCOVA. The differences in the dependent variable that are due to the covariate are removed by adjusting the dependent variable’s mean value within each treatment condition.

This technique is appropriate when the metric independent variable is linearly associated with the dependent variable and not with the other factors. It is based on certain assumptions:

• There is some relationship between the dependent and uncontrolled variables.
• The relationship is linear and is identical from one group to another.
• Various treatment groups are picked up at random from the population.
• Groups are homogeneous in variability.

24.  What is the difference between ANOVA and ANCOVA?

Ans: The points given below are substantial so far as the difference between ANOVA and ANCOVA is concerned:

• The technique of identifying the variance among the means of multiple groups for homogeneity is known as Analysis of Variance or ANOVA. A statistical process that is used to remove the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking research is known as ANCOVA.
• ANOVA uses both linear and non-linear models. ANCOVA, on the contrary, uses only a linear model.
• ANOVA entails only categorical independent variables, i.e. factors. As against this, ANCOVA encompasses a categorical and a metric independent variable.
• A covariate is not taken into account in ANOVA, but is considered in ANCOVA.
• ANOVA attributes between-group variation exclusively to treatment. In contrast, ANCOVA divides between-group variation into treatment and covariate.
• ANOVA attributes within-group variation to individual differences, whereas ANCOVA splits within-group variation into individual differences and covariate.

25.  What are t and z scores? Give Details.

T-Score vs. Z-Score: Overview: A z-score and a t score are both used in hypothesis testing.

T-score vs. z-score: When to use a t score:

The general rule of thumb is to use a t score when your sample:

• has a sample size below 30, or
• has an unknown population standard deviation.

You must know the standard deviation of the population and your sample size should be above 30 in order for you to be able to use the z-score. Otherwise, use the t-score.

Z-score

Technically, z-scores are a conversion of individual scores into a standard form. The conversion allows you to more easily compare different data. A z-score tells you how many standard deviations from the mean your result is. You can use your knowledge of normal distributions (like the 68-95-99.7 rule) or the z-table to determine what percentage of the population will fall below or above your result.

The z-score is calculated using the formula:

• z = (X-μ)/σ

Where:

• σ is the population standard deviation and
• μ is the population mean.

The z-score formula doesn’t say anything about sample size; the rule of thumb is that your sample size should be above 30 to use it.

T-score

Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation; you make an estimate by using your sample.

• T = (x (bar) – μ) / [ s/√(n) ]

Where:

• x (bar) is the sample mean,
• s is the standard deviation of the sample and
• n is the sample size.

If you have a larger sample (over 30), the t-distribution and z-distribution look pretty much the same.
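The two formulas can be compared side by side in a short sketch. The scores and the population values are invented for illustration, and the t score is computed for a sample mean (the one-sample case).

```python
import math
from statistics import mean, stdev

def z_score(x, mu, sigma):
    """z = (X - mu) / sigma, with known population mean and standard deviation."""
    return (x - mu) / sigma

def t_score(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)), with s estimated from the sample."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))

# An individual score of 85 when the population has mean 70 and std dev 10.
print(z_score(85, mu=70, sigma=10))

# A made-up sample of 8 scores, tested against a hypothesized mean of 70.
sample = [72, 75, 68, 80, 77, 74, 71, 79]
print(t_score(sample, mu=70))
```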

To know more about Data Science, Artificial Intelligence, Machine Learning, and Deep Learning programs visit our website www.learnbay.co

Watch our Live Session Recordings to precisely understand statistics, probability, calculus, linear algebra, and other math concepts used in data science.

To get updates on Data Science and AI Seminars/Webinars – Follow our Meetup group.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

### Win the COVID-19

If you slightly change your perspective on the lockdown situation, you can find hope that this pandemic will end and look forward to a brighter future than ever. Go for Data Science; it will be worth it.

### What is Supervised, Unsupervised Learning, and Reinforcement Learning in Machine Learning

Supervised learning algorithms are widely used in industry to predict business outcomes and to forecast results on the basis of historical data. The output of any supervised learning model depends on the target variable. Supervised learning can build a machine learning model from numerical, categorical, or discrete data. Because the target variable is known while the model is built, the model can predict the outcome for any new data point that is added to the dataset.

A supervised learning model teaches the machine to predict the result for unseen input. A known (labeled) dataset is used to train the machine and to measure its performance during training. The trained model then predicts the response for test data that is fed to it. Different machine learning models (supervised, unsupervised, and reinforcement learning) are suitable for different kinds of datasets. Supervised algorithms use regression and classification techniques to build predictive models.

For example, suppose you have a bucket containing different types of fruit. You need to separate the fruits according to their features, and you know the name of each fruit along with its corresponding features. The features of the fruits are the independent variables, and the name of the fruit is the dependent variable, that is, our target variable. We can build a predictive model to determine the fruit name.

There are various types of Supervised learning:

1. Linear regression
2. Logistic regression
3. Decision tree
4. Random forest
5. Support vector machine
6. k-Nearest neighbors

Linear regression is used when the target variable is continuous, while logistic regression, despite its name, is used to predict a categorical (often binary) outcome. Linear regression defines the relationship between independent and dependent variables. For example, what would a student's performance percentage be after studying for a given number of hours? The number of hours is the independent feature and the student's performance is the dependent feature. Linear regression is further categorized into simple linear regression, multiple linear regression, and polynomial regression.

Classification algorithms predict categorical values: discrete values, or values that belong to a particular class. Decision trees, random forests, and KNN are all used for categorical targets. Major applications of classification include bank credit scoring, medical imaging, and speech recognition. Handwriting recognition also uses classification to recognize letters and numbers, and classification is used to check whether an email is genuine or spam, to detect whether a tumor is benign or cancerous, and in recommender systems.

The support vector machine is used for both classification and regression problems. For classification, it finds the hyperplane that separates the classes with the largest margin and uses it to assign a category to each data point. For example, SVM can be used in sentiment analysis to determine whether a statement is positive or negative.

Unsupervised learning algorithms

Unsupervised learning is a technique in which we do not need to supervise the model, because we have no target variable or labeled dataset. The model discovers structure in the data on its own and is used for unlabeled datasets. Unsupervised learning algorithms allow you to perform more complex processing tasks than supervised learning, although the results can be more unpredictable than those of other learning methods. It is easier to obtain unlabeled data than labeled data, which needs manual intervention.

For example, we have a bucket of fruits that we need to separate, and there is no target variable available to tell us whether a fruit is an apple, an orange, or a banana. Unsupervised learning groups these fruits by their features, and new data can then be assigned to one of the groups.

Types of unsupervised learning:

1. Hierarchical clustering
2. K-means clustering
3. K-NN (k nearest neighbors)
4. Principal Component Analysis
5. Singular Value Decomposition
6. Independent Component Analysis

Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It begins with every data point assigned to a cluster of its own. At each step, the two closest clusters are merged into one. The algorithm ends when there is only one cluster left.

K-means is an iterative clustering method: in every iteration each point is assigned to its nearest cluster centre and the centres are recomputed. You need to choose the number of clusters k in advance, and a good choice of k matters for the quality of the result. K-nearest neighbours (KNN), despite the similar name, is not a clustering method but one of the simplest machine learning classifiers, and it needs labeled examples. It differs from other machine learning techniques in that it doesn’t produce a model: it simply stores all available cases and classifies new instances based on a similarity measure.
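The difference between the two methods is easier to see in code. The sketch below is a simplified one-dimensional illustration with made-up fruit weights. The function names and starting centroids are assumptions, and real projects would normally use a library implementation.

```python
from collections import Counter

def k_means_1d(points, centroids, iterations=10):
    """Unsupervised: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def knn_predict(train, query, k=3):
    """Supervised: majority label among the k stored examples closest to query."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Made-up fruit weights in grams; k-means needs no labels.
weights = [110, 120, 115, 300, 310, 290]
centroids, clusters = k_means_1d(weights, centroids=[100.0, 200.0])
print(centroids, clusters)

# k-NN needs labeled examples and only compares a new case with stored ones.
labelled = [(110, "apple"), (120, "apple"), (115, "apple"),
            (300, "melon"), (310, "melon"), (290, "melon")]
print(knn_predict(labelled, query=130))
```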

PCA (principal component analysis) is a dimensionality reduction algorithm. For example, suppose you have a dataset with 200 features/columns and need to reduce the number of features for the model. PCA replaces them with a smaller set of components that retains most of the variance in the dataset.

Reinforcement learning is also a type of machine learning. The model learns to take a suitable action in a particular situation so as to maximize the reward. The reward can be positive or negative depending on the behavior of the agent. Reinforcement learning is employed by various software and machines to find the best possible behavior in a situation.

Main points in Reinforcement learning –

• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
• Training: The training is based upon the input. The model returns a state, and the user decides whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
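The points above can be illustrated with a very small reward-driven loop. This is a simplified sketch (an epsilon-greedy strategy on a made-up environment with three actions and fixed rewards), not a full reinforcement learning setup. Every name and number in it is hypothetical.

```python
import random

random.seed(42)

# Hypothetical environment: three actions with fixed rewards (negative = punishment).
REWARDS = [1.0, 5.0, -2.0]

def step(action):
    """Return the reward for taking an action."""
    return REWARDS[action]

estimates = [0.0] * len(REWARDS)   # learned value of each action
counts = [0] * len(REWARDS)
epsilon = 0.1                      # fraction of steps spent exploring

for t in range(200):
    if t < len(REWARDS):
        action = t                                   # try every action once
    elif random.random() < epsilon:
        action = random.randrange(len(REWARDS))      # explore
    else:
        action = estimates.index(max(estimates))     # exploit best known action
    reward = step(action)
    counts[action] += 1
    # Incremental average keeps learning from every new reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

# The best solution is the action with the maximum estimated reward.
best_action = estimates.index(max(estimates))
print(best_action, estimates)
```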


### Everything About the XGBoost Classifier

What is the XGBoost classifier?

The XGBoost classifier is a machine learning algorithm that is applied to structured and tabular data. It is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost stands for extreme gradient boosting, and it is a large machine learning algorithm with many parts. XGBoost works with large, complicated datasets, and the XGBoost classifier is an ensemble modelling technique.

What is ensemble modeling?

XGBoost classifier is an ensemble learning method. Sometimes, it may not be sufficient to rely upon the results of just one machine learning model. Ensemble learning offers a systematic solution to combine the predictive power of multiple learners. The resultant is a single model that gives the aggregated output from several models.

The models that form the ensemble, also known as base learners, could be either from the same learning algorithm or different learning algorithms. Bagging and boosting are two widely used ensemble learners. Though these two techniques can be used with several statistical models, the most predominant usage has been with decision trees.

Unique features of XGBoost classifier:

XGBoost is a popular implementation of gradient boosting. Let’s discuss some features of XGBoost that make it so interesting.

• Regularization: XGBoost classifier has an option to penalize complex models through both L1 and L2 regularization. Regularization helps in preventing overfitting
• Handling sparse data: Missing values or data processing steps like one-hot encoding make data sparse. XGBoost incorporates a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data
• Weighted quantile sketch: Most existing tree-based algorithms can find the split points when the data points are of equal weights (using a quantile sketch algorithm). However, they are not equipped to handle weighted data. XGBoost has a distributed weighted quantile sketch algorithm to effectively handle weighted data
• Block structure for parallel learning: For faster computing, the XGBoost classifier can make use of multiple cores on the CPU. This is possible because of a block structure in its system design. Data is sorted and stored in in-memory units called blocks. Unlike other algorithms, this enables the data layout to be reused by subsequent iterations, instead of computing it again. This feature also serves useful for steps like split finding and column sub-sampling
• Cache awareness: In XGBoost classifier, non-continuous memory access is required to get the gradient statistics by row index. Hence, XGBoost Classifier has been designed to make optimal use of hardware. This is done by allocating internal buffers in each thread, where the gradient statistics can be stored
• Out-of-core computing: This feature optimizes the available disk space and maximizes its usage when handling huge datasets that do not fit into memory.

Solve the XGBoost mathematically:

Here we will use a simple training dataset that has drug dosage on the x-axis and drug effectiveness on the y-axis. Two of the observations (6.5, 7.5) have a relatively large positive value for drug effectiveness, which means that the drug was helpful, and the other two observations (-10.5, -7.5) have a relatively large negative value for drug effectiveness, which means that the drug did more harm than good.

The very 1st step in fitting XGBoost Classifier to the training data is to make an initial prediction. This prediction could be anything but by default, it is 0.5, regardless of whether you are using XGBoost classifier for Regression or Classification.

The prediction 0.5 corresponds to the thick black horizontal line.

Unlike regular (non-extreme) gradient boosting, which typically uses ordinary off-the-shelf regression trees, XGBoost uses a unique regression tree that is called an XGBoost tree.

Now we need to calculate the quality score, or similarity score, for the residuals:

Similarity Score = (Sum of Residuals)² / (Number of Residuals + λ)

Here λ is a regularization parameter.

So we split the observations into two groups, based on whether or not the Dosage<15.

The observation on the left is the only one with Dosage<15. All of the other residuals go to the leaf on the right.

When we calculate the similarity score for the values –10.5, -7.5, 6.5, 7.5 with λ = 0, we get similarity = (–10.5 – 7.5 + 6.5 + 7.5)² / (4 + 0) = 16 / 4 = 4.
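A minimal sketch of the similarity-score calculation, assuming the usual form (sum of residuals)² / (number of residuals + λ). The gain of a split is taken here as left similarity plus right similarity minus the similarity of the unsplit node, which is the usual definition but is not spelled out in the text. Which observation falls on the left of the Dosage<15 split is also an assumption, flagged in the code.

```python
def similarity_score(residuals, lam=0.0):
    """(sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right, lam=0.0):
    """Improvement from a split: left + right similarity minus the root similarity."""
    root = left + right
    return (similarity_score(left, lam) + similarity_score(right, lam)
            - similarity_score(root, lam))

residuals = [-10.5, -7.5, 6.5, 7.5]
print(similarity_score(residuals))  # 4.0 with lambda = 0

# Assumption for illustration: the single observation with Dosage < 15 is the
# one with value -10.5; the text does not say which one it is.
left, right = [-10.5], [-7.5, 6.5, 7.5]
print(gain(left, right))

# A larger lambda shrinks the similarity score (regularization).
print(similarity_score(residuals, lam=1.0))
```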


### What is EDA?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

It is always good to explore and compare a data set with multiple exploratory techniques. After the exploratory data analysis, you will have enough confidence in your data to engage a machine learning algorithm. Another benefit of EDA is that it helps you select the feature variables that will be used later for machine learning.
In this post, we use the Iris dataset to walk through the process of EDA.

Importing libraries:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

Loading the Iris data:

```
iris = pd.read_csv("Iris.csv")
```

Understand the data:

```
>>> iris.shape
(150, 5)
>>> iris["species"].value_counts()
setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64
>>> iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
```

1D scatter plot of the iris data:

```
# One subset per species, plotted along a single axis (y is fixed at zero)
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]
plt.plot(iris_setosa["petal_length"], np.zeros_like(iris_setosa["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"], np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"], np.zeros_like(iris_virginica["petal_length"]), 'o')
plt.grid()
plt.show()
```

2D scatter plot:

```
iris.plot(kind="scatter", x="sepal_length", y="sepal_width")
plt.show()
```

2D scatter plot with the seaborn library:

```
import seaborn as sns
sns.set_style("whitegrid")
# `height` replaces the older `size` argument in recent seaborn versions
sns.FacetGrid(iris, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()
plt.show()
```

Conclusion

• Blue points can be easily separated from red and green by drawing a line.
• But red and green data points cannot be easily separated.
• Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
• Separating Versicolor from Virginica is much harder as they have considerable overlap.

Pair Plot:

A pairs plot allows us to see both the distribution of single variables and relationships between two variables. For example, let’s say we have four features ‘sepal length’, ‘sepal width’, ‘petal length’ and ‘petal width’ in our iris dataset. In that case, we will have 4C2 plots i.e. 6 unique plots. The pairs, in this case, will be :

•  Sepal length, sepal width
• sepal length, petal length
• sepal length, petal width
• sepal width, petal length
• sepal width, petal width
• petal length, petal width

So, instead of trying to visualize four dimensions, which is not possible, we will look at six 2D plots arranged as a matrix and try to understand the 4-dimensional data.

```
sns.set_style("whitegrid")
# `height` replaces the older `size` argument in recent seaborn versions
sns.pairplot(iris, hue="species", height=3)
plt.show()
```

Conclusion:

1. petal length and petal width are the most useful features to identify various flower types.
2. While Setosa can be easily identified (linearly separable), virginica and Versicolor have some overlap (almost linearly separable).
3. We can find “lines” and “if-else” conditions to build a simple model to classify the flower types.

Cumulative distribution function:

```
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Histogram of setosa petal lengths, normalized so the bin values sum to 1
counts, bin_edges = np.histogram(iris_setosa["petal_length"], bins=10, density=True)
pdf = counts / sum(counts)
print(pdf)
# [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]
print(bin_edges)
# [1.   1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]

# The CDF is the running total of the PDF
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.show()
```

Mean, Median, and Std-Dev:

```
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
# Mean after appending an outlier (50) to show how sensitive the mean is
print(np.mean(np.append(iris_setosa["petal_length"], 50)))
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))
```

Output:

```
Means:
1.464
2.4156862745098038
5.5520000000000005
4.26

Std-dev:
0.17176728442867112
0.546347874526844
0.4651881339845203
```


### What is Regression?

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). Regression is a predictive modeling analysis technique. It estimates a relationship between the dependent and an independent variable.

Use of Regression:

• Determine the strength of predictors.
• Forecasting an effect.
• Trend forecasting.

Linear Regression:

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job in predicting an outcome (dependent) variable? (2) Which variables in particular are significant predictors of the outcome variable, and in what way do they (as indicated by the magnitude and sign of the beta estimates) impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.

Linear Regression Selection Criteria:

1. Classification & regression capabilities.
2. Data quality.
3. Computational complexity.
4. Comprehensible & transparent.

When will we use Linear Regression?

• Evaluating trends & sales estimates.
• Analyzing the impact of price changes.
• Assessment of risk in financial services and insurance domain.

For example, a group of creative tech enthusiasts started a company in Silicon Valley. This start-up, called Banana, is so innovative that it has been growing constantly since 2016. You, the wealthy investor, would like to know whether to put your money on Banana’s success in the next year or not. Let’s assume that you don’t want to risk a lot of money, especially since the stakes are high in Silicon Valley. So you decide to buy a few shares, instead of investing in a big portion of the company.

Well, you can definitely see the trend. Banana is growing like crazy, kicking up its stock price from 100 dollars to 500 in just three years. You only care about what the price is going to be in the year 2021, because you want to give your investment some time to blossom along with the company. Optimistically speaking, it looks like you will be growing your money in the upcoming years. The trend is not likely to go through a sudden, drastic change. This leads you to hypothesize that the stock price will fall somewhere above the \$500 indicator.

Here’s an interesting thought. Based on the stock price records of the last couple of years, you were able to predict what the stock price is going to be like. You were able to infer the range of the new stock price (that doesn’t exist on the plot) for a year that we don’t have data for (the year 2021). Well, kind of.

What you just did is use your model (that head of yours) to generalize: to predict the y-value for an x-value that is not even in your knowledge. However, this is not accurate in any way. You couldn’t specify exactly what the stock price is most likely going to be. For all you know, it is probably going to be above 500 dollars.

Here is where Linear Regression (LR) comes into play. The essence of LR is to find the line that best fits the data points on the plot so that we can, more or less, know exactly where the stock price is likely to fall in the year 2021.

Let’s examine the LR-generated line (in red) above, by looking at the importance of it. It looks like, with just a little modification, we were able to realize that Banana’s stock price is likely to be worth a little bit higher than \$600 by the year 2021. Obviously, this is an oversimplified example. However, the process stays the same. Linear Regression as an algorithm relies on the concept of lowering the cost to maximize the performance. We will examine this concept, and how we got the red line on the plot next.

Finding the best fit line:

To check the goodness of fit we use the R-squared method.

What is the R-squared method?

The R-squared value is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (COD), or the coefficient of multiple determination.
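As a sketch of what the R-squared value measures, the snippet below fits the line y = c + b*x by ordinary least squares and computes R² = 1 − SS_res / SS_tot using only the standard library. The helper names `fit_line` and `r_squared` are introduced here for illustration, and the data points are the ones used in the scikit-learn example in this post.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates of c (intercept) and b (slope) in y = c + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    c = y_bar - b * x_bar
    return c, b

def r_squared(x, y, c, b):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(y)
    ss_res = sum((yi - (c + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                     # total
    return 1 - ss_res / ss_tot

x = [5, 15, 25, 35, 45, 55]
y = [5, 20, 14, 32, 22, 38]
c, b = fit_line(x, y)
r_sq = r_squared(x, y, c, b)
print(c, b, r_sq)
```

An R² close to 1 means the line explains most of the variation in y; a value close to 0 means it explains very little.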

### What are overfitting and underfitting?

Overfitting: Good performance on the training data, poor generalization to other data.

Underfitting: Poor performance on the training data & poor generalization to other data.

Linear Regression with python:

1.Importing required libraries:

```
import numpy as np
from sklearn.linear_model import LinearRegression
```

2. Provide data:

```
# x must be two-dimensional (one column per feature), hence the reshape
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
print(x)
print(y)
```

```Output:
>>> print(x)
[[ 5]
[15]
[25]
[35]
[45]
[55]]
>>> print(y)
[ 5 20 14 32 22 38]```

3. Create a model and fit it:

`model = LinearRegression().fit(x, y) `

4. Get Result:

```
>>> r_sq = model.score(x, y)
>>> print('coefficient of determination:', r_sq)
coefficient of determination: 0.715875613747954
```

5. Predict response:

``````>>> y_pred = model.predict(x)
>>> print('predicted response:', y_pred, sep='\n')
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
``````

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.


## Random Forest Model: Introduction

The Random Forest model is a supervised learning algorithm built from a combination of decision trees; it is most often used for classification. As the name suggests, the algorithm creates a forest of several trees. … In a random forest classifier, a higher number of trees in the forest generally gives more accurate and more stable results, although the gains level off as more trees are added.
The random forest model follows an ensemble technique. It constructs multiple decision trees at training time, and its prediction is the mode of the trees’ predictions for classification and their mean for regression. This helps reduce the overfitting of an individual decision tree, which can easily memorize noise in its training data.

### Random Forest Model Algorithm: Working

We can understand the working of the Random Forest algorithm with the help of following steps −

• Step 1 − First, start with the selection of random samples from a given dataset. Do sampling with replacement (bootstrap sampling).

Sampling with replacement means that each training sample is drawn at random from the full training data, and the same record can be drawn more than once, so every sample is about as large as the original data set but leaves some records out. In addition, each tree considers only a random subset of the features at each split; for example, if a data set has 1000 features, a split may look at only about 32 of them (the square root of the number of features is a common choice for classification). The final result combines the predictions of the trees built on all the samples.

• Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
• Step 3 − In this step, voting will be performed for every predicted result.
• Based on ‘n’ samples, ‘n’ trees are built
• Each record is classified by each of the ‘n’ trees
• The final class for each record is decided based on voting
• Step 4 − At last, select the most voted prediction result as the final prediction result.
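The steps above can be sketched by hand with a few lines of scikit-learn. This is only a minimal illustration, assuming a synthetic two-class data set from `make_classification`, 25 trees, and bootstrap samples drawn with replacement; names such as `trees`, `votes` and `forest_pred` are made up for the example. In practice `sklearn.ensemble.RandomForestClassifier` does all of this internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumed for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25

trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, i.e. row indices drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: one decision tree per sample; max_features="sqrt" makes each
    # split look at a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect the vote of every tree for every record
votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)

# Step 4: the most voted class wins (labels are 0/1, so a mean >= 0.5 means class 1)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy on the training data:", (forest_pred == y).mean())
```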

### What is the Out of Bag score in Random Forests?

Out of bag (OOB) score is a way of validating the Random forest model. Below is a simple intuition of how it is calculated, followed by a description of how it differs from the validation score and where it is advantageous.

For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble labeled from 1 to 5. For simplicity, suppose we have a simple original training data set as below.

OOB Error Rate Computation Steps

• Each sample left out (out-of-bag) of the Kth tree is classified using the Kth tree
• For each case, let j be the class that received the most votes from the trees for which that case was out-of-bag
• The proportion of times that j is not equal to the true class, averaged over all cases, is the OOB error rate.
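In scikit-learn the OOB estimate is available directly. The sketch below assumes a synthetic data set and 100 trees; setting `oob_score=True` makes the fitted forest expose `oob_score_`, the accuracy measured on out-of-bag samples, so the OOB error rate is one minus that value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumed for this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True asks the forest to score each record using only
# the trees that did not see that record during training
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_
print("OOB accuracy:", rf.oob_score_)
print("OOB error rate:", oob_error)
```

Because every record is scored only by trees that never trained on it, the OOB score behaves like a validation score without needing a separate hold-out set.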

Variable importance of RF:

Variable importance identifies the features that are most useful to the random forest model, that is, the features that contribute most to its accuracy.

• Random Forest computes two measures of Variable Importance
• Mean Decrease in Accuracy
• Mean Decrease in Gini
• Mean Decrease in Accuracy is based on permutation
• Randomly permute values of a variable for which importance is to be computed in the OOB sample
• Compute the Error Rate with permuted values
• Compute the decrease in accuracy as the difference in OOB error rate (permuted minus not permuted)
• Average the decrease over all the trees
• Mean Decrease in Gini is computed as a “total decrease in node impurities from splitting on the variable averaged over all trees”.
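Both ideas have close counterparts in scikit-learn, shown in the hedged sketch below on synthetic data. `feature_importances_` is the impurity-based (Gini) importance, and `sklearn.inspection.permutation_importance` measures the drop in score after permuting one feature. Note that this function permutes a data set you pass in (a held-out test set here) rather than the OOB samples of each tree, so it is an analogue of Mean Decrease in Accuracy, not the identical computation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative features out of 6 (assumed for this sketch)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importance: mean decrease in Gini, normalized to sum to 1
gini_importance = rf.feature_importances_

# Permutation-based importance: mean drop in accuracy when one feature is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print("feature", i,
          "gini:", round(gini_importance[i], 3),
          "permutation:", round(perm.importances_mean[i], 3))
```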

Finding the optimal values using grid-search cv:

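Grid search with cross-validation tries every combination of candidate hyperparameter values, such as the number of trees in the forest and the maximum depth of each tree, and reports the combination with the best cross-validated score.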
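A minimal sketch with `GridSearchCV` follows. The data set is synthetic and the grid values (`n_estimators` of 10 or 50, `max_depth` of 2, 4 or unlimited) are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumed for this sketch)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values to try; every combination is evaluated (2 x 3 = 6 here)
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` is a forest refitted on all the data with the winning parameters.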

Measuring RF model performance by Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm: it shows, for each true class, how many records were predicted correctly and how many were assigned to another class.
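To make the table concrete, the sketch below builds a 2x2 confusion matrix by hand from two short, made-up label lists and derives accuracy, precision and recall from it. Rows are true classes and columns are predicted classes, which is also the layout used by `sklearn.metrics.confusion_matrix`.

```python
from collections import Counter

# Made-up true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (true, predicted) pair
counts = Counter(zip(y_true, y_pred))
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

# Rows: true class, columns: predicted class
matrix = [[tn, fp],
          [fn, tp]]

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are truly positive
recall = tp / (tp + fn)      # of the true positives, how many were found

print(matrix)
print(accuracy, precision, recall)
```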

Random Forest with python:

Importing the important libraries:

```
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus  # used further down to draw the tree
```

```
# The file is read as a single column because its fields are separated by ';'
dummy_df = pd.read_csv("bank.csv", na_values=['NA'])
temp = dummy_df.columns.values[0]
print(temp)
print(dummy_df)
```

## Data Pre-Processing:

```
# Split the single header string and every row on ';' to recover the columns
columns_name = temp.split(';')
data = dummy_df.values
print(data)
print(data.shape)
contacts = list()
for element in data:
    contact = element[0].split(';')
    contacts.append(contact)

contact_df = pd.DataFrame(contacts, columns=columns_name)
print(contact_df)

def preprocessor(df):
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    # The rest of this function is missing from the source. Assumed completion:
    # label-encode every column so that the tree receives numeric inputs.
    for col in res_df.columns:
        res_df[col] = le.fit_transform(res_df[col])
    return res_df

encoded_df = preprocessor(contact_df)
#encoded_df = preprocessor(contacts)
x = encoded_df.drop(['"y"'], axis=1).values
y = encoded_df['"y"'].values
```

## Split the data into Train-Test

`x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)`

## Build the Decision Tree Model

```
# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2)
model_dt_2.fit(x_train, y_train)
model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ", model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ", model_dt_2_score_test)
#y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1]

# Decision tree with depth = 8, using entropy as the split criterion
model_dt = DecisionTreeClassifier(max_depth=8, criterion="entropy")
model_dt.fit(x_train, y_train)
# Predicted probability of class 1 for every test record
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]
```

## Graphical Representation of Tree

```
# Requires the pydotplus package (import pydotplus) and a Graphviz installation
plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```

## Performance Metrics

```
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)

# Model Accuracy
print(model_dt.score(x_test, y_test))

# True labels of the records that the model predicted as class 1 ("yes")
y_actual_result = y_test[predictions == 1]
```

## Precision

```
# Share of predicted "yes" records whose true label is also "yes".
# This ratio is the precision; recall would divide by the number of true "yes" records.
count = 0
for result in y_actual_result:
    if result == 1:
        count = count + 1
print("true yes|predicted yes:")
print(count / float(len(y_actual_result)))
```

## Area Under the Curve

```
plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green', lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
```

## Confusion Matrix

```
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, normalize=False):
    # This function prints and plots the confusion matrix.
    # Despite its name, `model` holds the predicted labels.
    cm = confusion_matrix(y_test, model, labels=[0, 1])
    classes = ["Success", "Default"]
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
```

```
plt.figure(figsize=(6,6))
plot_confusion_matrix(predictions, normalize=False)
plt.show()
```

### Pruning of the tree

```
from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    # Note: this assumes inner_tree.value holds per-class sample counts; depending on
    # the scikit-learn version it may hold class proportions instead, in which case
    # the threshold has to be adjusted.
    if inner_tree.value[index].min() < threshold:
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are children, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

# number of leaf nodes before pruning
print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
# number of leaf nodes after pruning
print(sum(model_dt.tree_.children_left < 0))
```

It means that the code has created 17 new leaf nodes (by practically removing links to their ancestors). The tree can then be drawn again to compare it with the one plotted before pruning:

```
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```


## Human Activity recognition:

In this case study, we design a model with which a smartphone can detect its owner’s activity. Human activity recognition with a smartphone is a well-known ML project with applications in health and wellness monitoring.

Most smartphones have two motion sensors, an accelerometer and a gyroscope, which can serve as IoT sensors for capturing human activity. The accelerometer measures the linear acceleration of the phone (for example, it detects the switch between portrait and landscape when playing mobile games), and the gyroscope measures rotational movement.

For example, a smartphone can run an Android app that reads the accelerometer and the gyroscope and predicts the current activity of its owner: walking normally, walking upstairs, walking downstairs, laying down or sitting. Fitness applications build on the same readings, often together with other sensors such as heart-rate monitors, to estimate calories burned and how much activity a person has done in a day. This is also an application area of the Internet of Things (IoT).

### Working of Human task project:

1. Human activity recognition: With the help of sensors we collect the data of body movement captured by the smartphone. The movements are everyday activities such as walking, walking upstairs, walking downstairs, lying down, sitting and standing. The recorded data are then used to train a model that predicts the activity.

2. Data set collection of activity: The data was collected from 30 volunteers aged between 19 and 48 who performed the activities mentioned above while wearing a smartphone on the waist. The subjects were video-recorded while performing the activities, and the movement data was labeled manually.

3. Human Activity Recognition Using Smartphones Data Set: The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data. The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low-frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.

• There are “train” and “test” folders containing the split portions of the data for modeling (e.g. 70%/30%).
• There is a “txt” file that contains a detailed technical description of the dataset and the contents of the unzipped files.
• There is a “txt” file that contains a technical description of the engineered features.

The contents of the “train” and “test” folders are similar (e.g. folders and file names), although with differences in the specific data they contain.
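The windowing described for this data set (2.56 s windows at 50 Hz with 50% overlap, giving 128 readings per window) can be sketched as follows. The signal here is a synthetic sine wave standing in for one accelerometer axis, and the helper `sliding_windows` and the two summary features are assumptions for illustration; the real data set ships with 561 precomputed features per window.

```python
import numpy as np

SAMPLING_RATE_HZ = 50
WINDOW_SIZE = 128          # 2.56 s at 50 Hz
STEP = WINDOW_SIZE // 2    # 50% overlap means a new window every 64 readings

def sliding_windows(signal, window_size=WINDOW_SIZE, step=STEP):
    """Split a 1-D signal into overlapping fixed-width windows."""
    n_windows = (len(signal) - window_size) // step + 1
    return np.stack([signal[i * step: i * step + window_size]
                     for i in range(n_windows)])

# Synthetic 10-second signal (500 readings) standing in for one sensor axis
t = np.arange(10 * SAMPLING_RATE_HZ) / SAMPLING_RATE_HZ
signal = np.sin(2 * np.pi * 1.5 * t)

windows = sliding_windows(signal)

# One feature vector per window, here just the mean and standard deviation
features = np.column_stack([windows.mean(axis=1), windows.std(axis=1)])
print(windows.shape, features.shape)
```

Each row of `features` then plays the role of one labeled sample for the classifier.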

Load the data set and process it:

Important libraries to import for data processing

`#start with some necessary imports`
`import numpy as np`
`import pandas as pd`
`from google.colab import files`
`uploaded = files.upload()`

`files.upload()` from `google.colab` opens a file picker in Google Colab and uploads the selected local files to the notebook session.

`train_data = pd.read_csv("train.csv")`
`train_data.head()`

We load the training data set for the modeling.

`train_data.Activity.value_counts()`
`train_data.shape`

`value_counts()` shows how many samples each activity has, and `shape` gives the number of rows and columns of the data set.

`train_data.describe()`

The summary has 8 rows, one per statistic, computed for the numeric columns of the 563-column data set. For numeric data, the result’s index will include `count`, `mean`, `std`, `min`, `max` as well as lower, `50` and upper percentiles. By default the lower percentile is `25` and the upper percentile is `75`. The `50` percentile is the same as the median.

`uploaded = files.upload()`
`test_data = pd.read_csv('test.csv')`
`test_data.head()`

Here we upload and read the test CSV file. `head()` shows the first 5 rows with their respective columns, so here we have 5 rows and 563 columns.

`# shuffling data`
`from sklearn.utils import shuffle`
`# test = shuffle(test)`
`train_data = shuffle(train_data)`

Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you’d shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.

`# separating data inputs and output labels`
`trainData = train_data.drop('Activity', axis=1).values`
`trainLabel = train_data.Activity.values`
`testData = test_data.drop('Activity', axis=1).values`
`testLabel = test_data.Activity.values`
`print(testLabel)`

The code above separates the inputs (the sensor features) from the output (the `Activity` label). The labels are the human activities captured by the device: walking, standing, walking upstairs, walking downstairs, sitting and lying down.

`# encoding labels`
`from sklearn import preprocessing`
`encoder = preprocessing.LabelEncoder()`
`# fit the encoder once on the training labels so both sets share the same mapping`
`encoder.fit(trainLabel)`
`# encoding train labels`
`trainLabelE = encoder.transform(trainLabel)`
`# encoding test labels`
`testLabelE = encoder.transform(testLabel)`

`LabelEncoder` encodes target labels as integers between 0 and n_classes-1, and its `classes_` attribute holds the label for each class. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels. To encode categorical input features rather than labels, use a one-hot or ordinal encoding scheme instead.
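A small, self-contained sketch of the label mapping follows, assuming a handful of made-up activity labels. The encoder is fitted on the training labels only, so the training and test sets are guaranteed to share one mapping, and `inverse_transform` recovers the original names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the Activity column
train_labels = ["WALKING", "SITTING", "LAYING", "WALKING", "STANDING"]
test_labels = ["SITTING", "WALKING"]

# Fit on the training labels; classes_ is sorted alphabetically
encoder = LabelEncoder().fit(train_labels)

train_encoded = encoder.transform(train_labels)
test_encoded = encoder.transform(test_labels)

print(list(encoder.classes_))
print(train_encoded, test_encoded)
print(list(encoder.inverse_transform(test_encoded)))
```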

`# applying supervised neural network using multi-layer perceptron`
`import sklearn.neural_network as nn`
`# one hidden layer of 90 units, trained with stochastic gradient descent`
`mlpSGD = nn.MLPClassifier(hidden_layer_sizes=(90,), max_iter=1000, alpha=1e-4, solver='sgd', verbose=10, tol=1e-19, random_state=1, learning_rate_init=.001)`
`# the same network with the Adam solver (defined here but not trained below)`
`mlpADAM = nn.MLPClassifier(hidden_layer_sizes=(90,), max_iter=1000, alpha=1e-4, solver='adam', verbose=10, tol=1e-19, random_state=1, learning_rate_init=.001)`
`nnModelSGD = mlpSGD.fit(trainData, trainLabelE)`
`y_pred = mlpSGD.predict(testData).reshape(-1, 1)`
`#print(y_pred)`
`from sklearn.metrics import classification_report`
`print(classification_report(testLabelE, y_pred))`

`import matplotlib.pyplot as plt`
`import seaborn as sns`
`# sub_01 is not defined in the code shown here; it is expected to be a DataFrame with an 'Activity' column`
`fig = plt.figure(figsize=(32,24))`
`ax1 = fig.add_subplot(221)`
`ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,0], data=sub_01, jitter=True)`
`ax2 = fig.add_subplot(222)`
`ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,1], data=sub_01, jitter=True)`
`plt.show()`

`fig = plt.figure(figsize=(32,24))`
`ax1 = fig.add_subplot(221)`
`ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,2], data=sub_01, jitter=True)`
`ax2 = fig.add_subplot(222)`
`ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,3], data=sub_01, jitter=True)`
`plt.show()`