### Introduction to Simple Linear Regression in Machine Learning

Whichever ML course you choose, the first learning goal of its statistics modules is likely to be linear regression (LR) or, more precisely, simple linear regression. This widely used ML algorithm is commonly abbreviated as SLR.

In this blog, we’ll evaluate the foundational approach of SLR in ML modelling.

#### What is SLR in Machine Learning?

Simple Linear Regression (SLR) is a technique for reviewing and evaluating the relationship between two variables. One of them is the independent variable, also referred to as the ‘explanatory’, ‘stimulus’, or ‘predictor’ variable. The other is the dependent variable, also known as the ‘response’ or ‘outcome’ variable.
Why ‘simple’? The word refers to the fact that only two variables are used in this regression method. A straight line is used to model the relationship and explain the association between the two variables.

When you work on machine learning problems and want to reach the expected, profitable outcomes, you often need to find the relationship between a pair of variables of the two types described above. This is where SLR is applied in ML.

#### What are the real-life applications of SLR algorithms?

If we sat down to list the real-life instances of SLR in ML, the list would be endless. The handiest real-world examples are as follows.

• Suppose you have decided to train your company’s employees in the basics of data analytics to improve your business outcomes. The amount you invest in this training is the independent variable (the predictor), and the percentage ROI from the resulting improvement in business decisions is the outcome variable.
• Suppose you plan to buy a second-hand car but find it difficult to set your budget based on car performance. To ensure performance and parts availability, you decide to consider cars only up to a certain age. In such a scenario, you can apply SLR to set your budget: the age of the car is the independent variable (the predictor), while the budget is the outcome variable.
• Suppose you work in marketing at an e-commerce company. A few months back, your company implemented new advertising strategies, and now you want to evaluate how the monthly advertising cost relates to monthly sales. Here you can apply SLR for ML modeling.

SLR can solve a wide range of business problems, from moderate to complex. Just keep one thing in mind: make sure you handle the linearity condition correctly.

#### What is the linearity condition in SLR?

SLR tries to explain the variation in the dependent variable $Y$ using the known values of the independent (predictor) variable $X$.
The expression $\alpha + \beta X_i$ gives the predicted value of $Y_i$ for a given value of $X_i$. In other words, it is the conditional expected value of $Y_i$ given $X_i$.
Here $\alpha$ (the intercept) and $\beta$ (the slope) are the linear regression coefficients.
The most important thing to remember is that linearity in linear regression is defined with respect to the regression coefficients, not the explanatory variables in the data.
Therefore, both of the following forms are still valid linear regressions, even though they are non-linear in $X$:

$$Y_i = \alpha + \beta X_i^2 + \varepsilon_i$$

$$Y_i = \alpha + \beta \ln(X_i) + \varepsilon_i$$
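To make the linearity condition concrete, here is a minimal sketch that fits a model of the form Y = α + β ln(X) with ordinary least squares. The helper name `fit_slr` and the data are made up for illustration. The point is only that transforming the predictor first keeps the model linear in its coefficients.

```python
import numpy as np

def fit_slr(x, y):
    """Ordinary least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    beta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Made-up, noise-free data generated from y = 2 + 3 * ln(x).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 2.0 + 3.0 * np.log(x)

# Transform the predictor, then fit a straight line in the transformed variable.
alpha, beta = fit_slr(np.log(x), y)
print(alpha, beta)  # close to the generating values 2 and 3
```

The relationship between `x` and `y` is curved, yet the fit is still a simple linear regression because the coefficients enter the equation linearly.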

#### What can simple linear regression tell us that correlation does not tell us?

Although correlation may seem similar to simple linear regression, there are several differences between the two.

Difference 1: Correlation quantifies the degree to which two variables are related. It does not involve fitting a line through the data set.

Difference 2: Correlation is typically used when you measure both variables. It is rarely appropriate when one variable is something you control. In contrast, with simple linear regression, the X variable is often something you manipulate (it may be time, a salary range, a price, etc.), and the Y variable is something you measure.
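The difference can be sketched numerically. In the snippet below, the advertising and sales figures are invented for illustration. Correlation yields a single unit-free number, while regression yields a fitted line that can be used for prediction. The two are linked through the standard identity slope = r × (sd of Y / sd of X).

```python
import numpy as np

# Made-up monthly figures: advertising cost (x) and sales (y).
ad_cost = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
sales = np.array([110.0, 118.0, 135.0, 150.0, 158.0, 185.0])

# Correlation: one number between -1 and 1 describing the strength of association.
r = np.corrcoef(ad_cost, sales)[0, 1]

# Regression: a fitted line (np.polyfit returns the highest-degree coefficient first).
slope, intercept = np.polyfit(ad_cost, sales, deg=1)

# The slope can be recovered from r and the two sample standard deviations.
slope_from_r = r * sales.std(ddof=1) / ad_cost.std(ddof=1)

# Only the regression line lets us predict sales for a new advertising cost.
predicted_sales = intercept + slope * 22.0
print(r, slope, slope_from_r, predicted_sales)
```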

#### How does SLR work?

To use SLR to solve your identified problem, you need to follow a seven-step mathematical process.

Step#1: Visualise the relationship between the identified variables graphically. The standard type of graph used in SLR is a scatter plot.

Step#2: Use the OLS (ordinary least squares) technique to estimate the regression parameters and define the form of the relationship between the variables.

Step#3: Calculate the standard error of the regression estimate.

Step#4: Calculate prediction intervals for a predicted value at a given X, based on the assumption that the errors are normally distributed.

Step#5: Validate the significance of the regression parameters obtained.

Step#6: Validate the goodness of fit of the overall model. Keep in mind that in SLR, the p-value of the F-test and the p-value of the slope coefficient are identical.

Step#7: Calculate the coefficient of determination and the correlation coefficient.
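The numerical steps above can be sketched with NumPy alone. This is an illustrative outline under stated assumptions, not a full statistical workflow. The function name `slr_summary` and the data are hypothetical, and step 4 (prediction intervals) is left out because it needs a t-distribution quantile. The sketch also shows why the F-test and the slope test agree in SLR: the F statistic equals the square of the slope's t statistic.

```python
import numpy as np

def slr_summary(x, y):
    """OLS estimates and basic fit statistics for y = alpha + beta * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))

    # Step 2: OLS estimates of the regression parameters.
    beta = sxy / sxx
    alpha = y.mean() - beta * x.mean()

    # Step 3: standard error of the regression (residual standard error).
    residuals = y - (alpha + beta * x)
    sse = np.sum(residuals ** 2)
    s = np.sqrt(sse / (n - 2))

    # Step 5: t statistic for the slope.
    se_beta = s / np.sqrt(sxx)
    t_beta = beta / se_beta

    # Step 6: F statistic for the overall model (1 and n - 2 degrees of freedom).
    sst = np.sum((y - y.mean()) ** 2)
    ssr = sst - sse
    f_stat = ssr / (sse / (n - 2))

    # Step 7: coefficient of determination and correlation coefficient.
    r2 = ssr / sst
    r = np.sign(beta) * np.sqrt(r2)

    return {"alpha": alpha, "beta": beta, "s": s, "se_beta": se_beta,
            "t_beta": t_beta, "f_stat": f_stat, "r2": r2, "r": r}

# Made-up data with a little noise around a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
summary = slr_summary(x, y)
print(summary)
```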

#### Why use a scatter diagram in SLR?

When you choose SLR as your regression model, the first thing you need to do is assess the relationship between your identified variables.
The best visualization for this purpose is the scatter plot, for the following reasons:

• Besides showing the best-fit line, the dots (the data points of the identified variables) help a lot in visualizing the hidden pattern of the relationship between the factors.
• If the factors prove to be related, the equation of that relationship can be estimated. With the help of this estimated equation, you can proceed with your ML algorithm modeling.

When simple linear regression is applied to a business problem, the scatter plot of the identified factors usually takes one of the following six forms:

• Fig 1: a direct linear relationship between the two factors (dependent and independent).
• Fig 2: a direct but curvilinear relationship between the two factors.
• Fig 3: an inverse linear relationship between the two factors.
• Fig 4: an inverse and curvilinear relationship between the two factors.
• Fig 5: an inverse linear relationship between the two factors; unlike figure 3, the extent of scattering is much higher.
• Fig 6: a non-linear relationship between the factors.

#### How to calculate the SLR in ML modeling?

An ML model using SLR can be built with either Python or R. Here, I will explain the Python variant.

To program an SLR model in Python, six main steps have to be followed carefully. They are as follows.

#1: Dataset importing
#2: Data pre-processing
#3: Segregation of the train and test sets
#4: Fitting the linear regression model to the training dataset
#5: Predicting the test set results
#6: Visualizing the results

When using Python, steps 1 to 5 remain almost the same. However, step 6 changes a bit depending on which type of graph or chart you use. The generic Python code for SLR is as follows.

```python
# Dataset importing
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

dataset = pd.read_csv('file name.csv')
dataset.head()
```

```python
# Data pre-processing
X = dataset.iloc[:, :-1].values  # X is the array of independent variables (all columns except the last)
Y = dataset.iloc[:, 1].values    # Y is the vector of the dependent variable (second column)
```

```python
# Segregation of the train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
# A test size of 1/3 is used here, close to the common 20-80 or 30-70 splitting conventions.
```

```python
# Fitting the linear regression model to the training dataset
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)  # This step estimates the linear equation that will be used on the dataset.
```

```python
# Predicting the test set results
y_pred = regressor.predict(X_test)
y_pred
```

```python
# Actual test set values, for comparison with the predictions
Y_test
```
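Comparing the predictions with the held-out values by eye only goes so far. As a hedged sketch, the helper below summarizes the comparison with three common metrics (RMSE, MAE, and R²). The function name `regression_metrics` and the demo arrays are made up for illustration. With the code above, you would pass `Y_test` and `y_pred` instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
    mae = np.mean(np.abs(errors))                    # mean absolute error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical stand-ins for Y_test and y_pred.
y_test_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])
metrics = regression_metrics(y_test_demo, y_pred_demo)
print(metrics)
```

A lower RMSE or MAE and an R² closer to 1 indicate that the fitted line generalizes better to the test set.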

#### Where to learn SLR?

If you want to learn more about the application of SLR in ML, you can join the IBM-certified Learnbay Data Science and AI certification courses. The Learnbay data science syllabus offers balanced learning in both statistics and programming, the two key pillars of data science career growth. Our AI and ML courses are available for both fresh graduates and working professionals. All of our courses include real-time industrial projects and live online classes. Our courses are available in all the prime cities across India, such as Mumbai, Kolkata, Bengaluru, Hyderabad, Delhi, Lucknow, and Patna.

### Know The Best Strategy To Find The Right Data Science Job in Delhi

Data science careers are buzzing everywhere, and so are data science courses. It’s true that data science salaries are lucrative and the field offers ample scope for career growth. But the majority of candidates struggle to land the right data science job after completing their data science courses. After Bengaluru, Mumbai, Hyderabad, and Chennai, Delhi will be the next promising destination for data science aspirants. In this blog, I’ll discuss the best strategy for landing the right data science job in Delhi and give a brief overview of data science salary growth in India.

#### Is data science a good career in India?

We always keep a concerned eye on the job markets of first-world countries and keep regretting the lack of opportunities in our own country. In some cases it is a hard truth that our country lacks job opportunities and growth, but when it comes to data science, India is proudly participating in the advancement race.

According to the Analytics Insight survey, India will experience a huge data science job boom by mid-2025. The number of data science and associated job vacancies in India at that time is expected to be around 1,37,630. The Indian job market already experienced massive demand for data scientists in the first phase of 2021. Even with the pandemic effect, 50,000 data science, AI, and ML job vacancies were filled from January 2020 to May 2021. So there is no doubt that data science is a promising, future-proof career option in India.

#### What is the data science salary in India?

According to the data available on Glassdoor (as of June 15, 2021), the average data scientist salary in India has already reached 10,00,000 INR/year, with a lower limit of 4,00,000 INR/year (freshers) and an upper limit of 20,98,000 INR/year (senior level). For other subdomains of data science, such as machine learning engineers, AI experts, and deep learning experts, India’s companies offer even more lucrative packages.

And not only MNCs but also SMEs are stepping forward to invest in sky-high salary packages for data science professionals.

#### Is data science in demand in Delhi?

Now let’s get to our core topic. How strong is the demand for data science skills in Delhi?

According to a LinkedIn job search covering all sub-domains such as ML, AI, and data analytics, around 2,000 data science jobs are now available in Delhi. At the same time, Naukri lists approximately 4,800 additional data science jobs.

If you search for data science salary insights in Delhi, you will find an average yearly salary of 10,10,000 INR, while for senior roles the figure easily reaches 16,31,000 INR (Source: Glassdoor Salary insight).

#### Which companies keep hiring data scientists all year round in Delhi?

Below are the companies that keep hiring data science professionals of different expertise levels throughout the year in Delhi.

These are the top companies in Delhi that offer lucrative salaries and career growth opportunities and keep recruiting data scientists (not in bulk) 365 days a year. Apart from these, there are plenty of other options for data scientists and ML engineers in Delhi.

#### How to find the right data science job in Delhi?

Delhi is indeed growing very rapidly in terms of job opportunities, but compared to the three prime locations, Mumbai, Bangalore, and Hyderabad, digging out the opportunities is a bit harder. That does not mean the capital of India lacks data science job opportunities. Rather, if you follow the right job-search strategy, you can land the best data science opportunities here.
Let’s explore the 6-step data science job searching strategy to grab the first data science job in Delhi.

1. Target the right job title

Typing ‘data science job’ in the job search bar and hitting ‘enter’ is the biggest and most common mistake in a data science job search.

The keyword ‘Data science’ indicates the entire data science domain, but while searching for a job, you need to focus on specific job roles like

• Data scientist
• Data analyst
• Machine learning engineer
• AI expert
• Marketing data analyst

To get an appropriate list of available job opportunities, you need to target your job title first.
Apart from this, to make sure your profile gets shortlisted for an interview, check the job description and required skills sections before applying. Applying randomly doesn’t increase your chances of getting a job. Rather, continuous rejection due to a lack of relevant skills might discourage you.

2. Don’t roam across different domains

The data science job field is highly domain-specific. Even for freshers, it is always recommended to study data science with a specific domain in mind.

At present, about 70% of data science candidates are career switchers, and such candidates are in very high demand. But why?

Well, data science is not a completely new domain. Rather, it is a discipline that has introduced rapid, sky-high advancement across all types of industries, such as BFSI, Health and Social Care, Marketing and Sales, FMCG, and so on.

Hence, every data science job role demands solid domain expertise in terms of

• Core working concept
• Domain-specific business theories and postulates
• Customised working strategies
• Dynamic trends
• Special skills like extremely proficient time management or highly polished communication skills, extraordinary negotiation skills etc.

If you switch domains, you will lack the expertise mentioned above, which can seriously harm the start of your data science career. Hence, stick to your domain and target an associated data science job role.

For example, suppose you have been working in the FMCG industry as a marketing executive. While switching to a data science career, your target should be a marketing data analyst or BI analyst role in FMCG companies only.

3. Invest sufficient time in making your online portfolio and CV

No matter how credible your skill sets are or how unique your capstone project is, the shortlisting of your CV, as well as the visibility of your online portfolio to the right recruiters and talent acquisition teams, itself goes through several layers of data analytics.

Yes. Everything from the chances of your profile being viewed to resume selection involves automated keyword-matching processes. The associated AI-powered data analytics tools select profiles based on keyword matching. Hence, to improve the chances of your profile being visible and your resume being selected, you need to describe your skill sets and domain experience using the exact keywords that recruiters use. While making your online profile and portfolio, keep the following things in mind.

• Keep your profile to the point.
• Mention only those skills that are relevant to your targeted job role and that you actually possess (always be honest in this regard).
• Give more importance to listing your work experience and hands-on achievements than your academic achievements.
• Mention your project briefly in the resume and provide an elaborated (but to-the-point) description of it in your project portfolio.
• For instance, as you are searching for a data scientist job in Delhi, set the preferred location to Delhi only. This will help you find job openings tailored to the Delhi location.

4. Don’t be conventional when choosing job boards

What are the first few names that come to your mind when someone discusses a job search? LinkedIn, Naukri, Glassdoor, Indeed, etc. Right?

No doubt these are the most popular and widely used job searching platforms, but securing the right job through them, especially your first data science job, will be very tough. Because these platforms are so widely used, the competition per job post remains very high. Such platforms are a better option for experienced and senior-level candidates. So, is there no chance for data science newbies like you?
Well, now I am going to tell you the biggest secret that most data science aspirants don’t know.

The field of data science has its own dedicated job boards, where you can find the right job according to your domain, location, and years of working experience. Even the majority of MNCs nowadays have stopped using generic recruiting sites such as LinkedIn and Naukri to fill their data science positions. Instead, they post their vacancies on job boards dedicated to data science. Below are a few examples of such job boards.

• Outer Join
• Analytics Vidhya
• Kaggle Jobs
• Github Jobs

Apart from these sites, you also need to keep an eye on the dedicated career portals of your targeted companies. The best option in this regard is to join the LinkedIn and other social media groups of those companies. You can even find location-specific groups.
Such groups will keep you informed of the present as well as upcoming data science opportunities at the respective companies.

5. Target the designation as per your experience level

Switching to a data science career does not mean restarting your career from scratch. Rather, it is a kind of career upgrade.

So if you are already at the leadership level, don’t target an ordinary BI analyst or marketing analyst role. Rather, target leadership and managerial roles in the data science field too.

At present, data science offers equal opportunities to aspirants with varying levels of working experience. Especially for leadership positions, the data science domain is suffering from a talent shortage. So, to land the job you actually deserve, target a higher or at least a similar-level designation.

But keep in mind that to grab the right job, you need to be very cautious from the initial stage of your data science career transition. The data science course you choose must match your experience level. This is the key to grabbing the right data science job at the earliest.

#### So, what’s next?

If you need personalised career guidance for a data science career switch, you can contact Learnbay. We provide IBM-certified AI, ML, BI analyst, and other data science courses in Delhi.
Each of our course modules is designed according to the work experience and domain experience of the candidates. Instead of providing generalised data science training, we have different courses for candidates with different degrees of working experience. Not only that, all of our courses include a live industrial capstone project carried out directly with product-based MNCs in Delhi.

### Investing 3 lakhs on Data science Certification Course? Is it really worth it?

#### Should a working professional invest 2-3 lakhs on Data Science Courses?

The world of data science comes with endless possibilities. With time, the scope of a data science career is becoming extremely rewarding. Data scientists and artificial intelligence and machine learning engineers are in high demand. Not only freshers but also working professionals are becoming crazy about a data science career transition. The craze has reached such a level that professionals are ready to invest 2-3 lakhs in pursuing data science courses or certifications.
Are you also going to do the same? If so, please hold back your application for a few minutes, read this post, and then decide.
There is nothing wrong with investing in a data science career transformation. Rather, it’s an intelligent decision. The doubt lies in the investment amount: 2 to 3 lakhs. Is this investment really worth it? Certainly ‘no’.

#### Certification is the key to a successful data science career switch: Myths vs Facts

Advertisements for data science certifications and master’s degree programs appear all over professional network sites, social media sites, and roadside hoardings. The sheer volume of data science course promotion makes everyone believe that certification is a must for shifting your domain to data science.
But this is nothing but a myth. As a working professional, certification can never be the entry gate to your data science career. Instead, at this level, ‘hands-on experience’ becomes the key to your data science career.

#### Is a data science course or certification a complete waste?

The answer is ‘yes’ and ‘no’ at the same time.
Getting confused?
Well, let me explain.
Pursuing a data science course is worthwhile if it makes you competitive in the data scientist job market. But it becomes a complete waste of money if it only makes you knowledgeable, not job-ready.
Remember, you are shifting your career toward the data science domain, not starting a new career.
Your goal is to get a hike, not an entry-level job in the data science domain. So, to ensure the maximum possible return on investment, choose a course or certification that makes you a successful competitor in the current data science job market.

#### How to choose the right data science course for you?

To choose the right course, you need to look into the following aspects:

• Course Curriculum: There is no defined, universal module for a data science certification or master’s degree program. Every institution and university builds its own course on the basis of contemporary market demands and upcoming scope. So, you should be very cautious while choosing such a course.
Look for a course that offers in-depth learning of programming languages and analytical tools like Python, R, Java, SAS, and SPSS, mathematical and statistical modules like NumPy, pandas, and Matplotlib, and in-demand algorithms. As you are at the intermediate level of your career, dive deep into programming and algorithms.
Basic data science courses remain limited to entry-level projects and data analysis. So, as a professional, choose a course that includes the k-means algorithm, word-frequency algorithms for NLP sentiment analysis, the ARIMA model in machine learning, and TensorFlow and CNNs in deep learning.
• Timing and class type: As a working professional, you obviously can’t opt for full-time courses, so choose courses that offer flexible timing. Live classes (online/offline) are always best, but if it’s impossible to commit to scheduled classes, choose a flexible course that offers both recorded and live class options. If you enjoy offline learning, choose courses offering weekend classes. But keep in mind that your learning should not hamper your present job.
• Project experience: If your chosen course does not offer any real-time data science project, discard it immediately. Companies only search for candidates with hands-on project experience. As a working professional, experience is everything for your next job. Some institutions let you practice your data science skills only on a few already completed projects, so be cautious in this regard. Before joining any data science course, verify whether the offered projects are real-time or not. Choose only a course where you will get to work on hands-on industry projects, no matter whether the projects are from MNCs or startups. If you can manage the time, choose a course with a part-time internship.
• Throughout assistance: As a dynamic field, data science needs more personalized assistance. As there is no domain limitation in data science, your chosen course must fit your targeted domain. Investing in a generalized course is nothing but a waste of your hard-earned money. A valuable data science course assists you with domain-specific interview questions, mock tests, and interview calls from growing companies.
• Certification/non-certification courses: As mentioned earlier, certificates are only a decorative element on a working professional’s CV. So don’t run after certification courses; rather, you can choose any non-certification course that really benefits your next job application in the field of data science. If you are already working in a core technical domain and have strong skills in Python, R, Java, etc., you can choose a specific course, for example on TensorFlow or a particular machine learning algorithm, that fills the gap between your current job and your targeted data science jobs.

#### How much money should you invest in a data science course?

Here comes the final answer. An investment of up to 80k INR is enough to achieve a promising career transformation. Yes, it’s true, because the main goal of doing a data science course is to upgrade your current experience to a stage that lets you enter the world of data science with a good hike.
You don’t need to master every subdomain of data science; in fact, it’s impossible. Rather, you need to learn and up-skill in the data science subdomain that interests you or that offers huge possibilities given your present experience. And yes, again, the first priority is real-time industry projects.
Fulfilling the above criteria doesn’t need an investment of 2 to 3 lakhs INR. Rather, plenty of promising and reliable online and offline courses are available that can make you highly competitive in the data science and AI job market for an investment of 40k to 90k INR.
You can check the data science and AI courses offered by Learnbay. They offer customized courses for candidates at every level of working experience. Their courses cost between 59,000 INR and 75,000 INR (without taxes). The topmost benefit of their courses is multiple real-time industry projects with IBM, Amazon, Uber, Rapido, etc. You will get a chance to work on your domain-specific projects. They offer both in-class (online/offline) and recorded video sessions.
Best of Luck ☺.

### Data Science for working professionals

To secure a job in any domain, one has to prepare a lot, be trained for the role, and have thorough knowledge of the field; people usually dedicate years to preparing for their desired roles. Shifting from a role you have prepared for to a different domain is not usually easy, and a strong gust of skepticism is sure to haunt you. The process of shifting from one domain to another is hard, and learning data science is even harder for working professionals because they have to prepare for the new job role while maintaining their current one.

Only if you plan the whole process of domain shifting in an organised and rational way can you have a win-win situation.

#### Have a vision and plan your strategy

You must win at both learning and working. For that, you have to strategize so that the time you spend learning data science does not collide with your work life, and vice versa, because both activities are equally important and require immense attention and individual priority.

Let us start from scratch. Here are some possible concerns of a working professional:

1. Time management
2. Balancing the energy between two activities
3. Scheduling
4. Risk of making a wrong move
5. Risk of inefficient or improper execution

As a working professional, you have to manage your responsibilities in a way that gives you control over everything that is going on. With proper planning and the right approach, the concerns mentioned above can be easily tamed.

#### Firmly state your purpose of learning data science

Why do you want to change your domain to data science when you already have a job? Firmly define the purpose. You should know that by shifting to data science everything will change: you will have to develop new skill sets for the role you are targeting, the workflow will be different, and your future job role will have different goals and purpose. Act consciously when you risk giving up the comfort and expertise you have in your current job, and be very sure about your purpose in doing so. This will eliminate skepticism about the risk of getting out of your comfort zone. The effort you put into learning data science will never go in vain, because you will learn about currently trending technologies and tools that will help you survive not only in data science but anywhere in the IT industry.

#### Have a soft target

People think only the role of ‘data scientist’ matters, but in fact there are several other roles in data science that matter significantly in the field. Choose the one role you want to take up and start preparing for it. This is a good approach for starters, because you do not have to be an expert in every tool that has ever been used in the field; smartly target the topics that are essential in data science. When you work specifically towards a targeted role, you have the chance to understand it and its importance in the field completely. This approach is a smart move because you will not be confused about what exactly to study in the vast field of data science, and the field generally prioritizes those who hold deep expertise in a specific area. So be very sure about the role you want to serve in.

#### Plan the execution

To plan the execution well, you first have to design the implementation, and do it wisely and rationally. Review your daily activities and reschedule them to balance learning and working.

Examine how you spend time on everyday things and revise it according to your daily schedule. Make a habit of noting down your tasks every day, plan how much time you will invest in each, and try your best to act as decided. In other words, this way of dealing with things is called discipline; to have a structured day, you have to practice discipline in every possible way. Review your activities, from sleeping habits to break sessions, and reschedule them so that things fall into place by themselves. Set targets, set your own deadlines, and design the way you want things to work.

#### Networking and understanding the field

Get involved with people from the field of data science, and learn the inside story of the field and how it works. Having field knowledge is very much necessary. Remember that when you get into data science you will have to work in teams, so practice your communication skills and confidence. Interact with people by asking them how to get into the field; this way you will build good connections and get great suggestions as well. Start associating with people who belong to data science; you will need to get used to that.

#### A good course

Everything you do and every effort you put in is aimed at learning data science, but if you make the mistake of choosing the wrong course, all your effort will go in vain. Your purpose in learning data science is to shift your domain to data science, and you cannot do this without the help of a good course. The course you choose should not only give you sound knowledge of data science but also help you manage your planned schedule. There are many data science courses built specially for working professionals, and it will greatly help if you choose the right one among them.

Conclusion
With the right approach and proper planning, you can succeed in learning Data Science while maintaining a full-time job. Stick to your plans and preparations, seek help from a good course, practise as much as you can, and start involving yourself with the field. If you manage to execute your plans every day, you will surely reach your destination with ease.

The data science course of Learnbay is specially designed for working professionals, the benefits provided in the course will help you balance your scheduling. Learnbay powered by IBM will help you throughout the journey of learning and experiencing data science.

### Regression techniques in Machine Learning

Machine learning has become one of the sexiest and trendiest technologies in today’s world of technology. It is used every day in our lives: virtual assistants, future predictions, video surveillance, social media services, spam mail detection, online customer support, search engine result prediction, fraud detection, recommendation systems, and so on. In machine learning, regression is one of the most important topics to learn. There are different types of regression techniques, which we will cover in this article.

### Introduction:

Regression algorithms such as linear regression and logistic regression are among the most important algorithms that people learn when they study machine learning. There are numerous forms of regression, each with its own specific features, and each is applied accordingly. Regression techniques are used to find the relationship between the dependent and independent variables (features). Regression is a part of data analysis used to analyze variables that can take infinitely many values, and its main aims are forecasting, time series analysis, and modeling.

### What is Regression?

Regression is a statistical method, mainly used in finance, investing, sales forecasting, and other business disciplines, that attempts to find the strength and nature of the relationship among variables.

There are two types of variables in a dataset to which regression techniques are applied:

1. Dependent Variable that is mainly denoted as Y
2. Independent variable that is denoted as x.

There are also two types of regression:

1. Simple Regression: Only with a single independent feature /variable
2. Multiple Regression: With two or more than two independent features/variables.

In regression studies, the following six types of regression techniques are the ones mainly used for complex problems.

• Linear regression
• Logistic regression
• Polynomial regression
• Stepwise Regression
• Ridge Regression
• Lasso Regression

### Linear regression:

Linear regression is a supervised machine learning algorithm that is mainly used for predictive analysis. It is a linear approach to modeling the relationship between a scalar response and one or more predictor variables. It focuses on the conditional probability distribution of the response given the predictors. The formula for linear regression is Y = mX + c.

Where Y is the target variable, m is the slope of the line, X is the independent feature, and c is the intercept.

#### Additional points on Linear regression:

1. There should be a linear relationship between the variables.
2. It is very sensitive to outliers, which can lead to a model with high variance and bias.
3. Multicollinearity can occur when there are multiple independent features.
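
To make the formula Y = mX + c concrete, here is a minimal sketch that estimates m and c with the ordinary least-squares formulas. The data points and the `predict` helper are made up purely for illustration.

```python
import numpy as np

# Hypothetical data that roughly follows a straight line
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates of the slope m and the intercept c
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
c = Y.mean() - m * X.mean()

def predict(x):
    """Predict Y for a new X using the fitted line Y = mX + c."""
    return m * x + c

print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
print(f"prediction at X = 6: {predict(6.0):.3f}")
```

Because the slope is computed from squared deviations, a single extreme point can move the line noticeably, which is the outlier sensitivity mentioned in the list above.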

### Logistic regression:

It is used for classification problems with a linear dataset. In layman’s terms, it applies when the dependent or target variable is binary (1 or 0, true or false, yes or no). It is well suited to deciding whether an occurrence is a success or a failure.

1. It is used for classification problems.
2. It does not require a linear relationship between the dependent and independent features.
3. It can be affected by outliers, and it can underfit or overfit.
4. It needs a large sample size to make the estimation more accurate.
5. It needs to avoid collinearity and multicollinearity.
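
As a rough sketch of how logistic regression turns a linear score into a binary decision, the snippet below passes mX + c through the sigmoid function and applies a 0.5 threshold. The coefficient values (1.5 and -6.0) and the "hours studied" interpretation are assumptions chosen only for illustration, not fitted values.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, m, c):
    """Probability of the positive class for a single feature x."""
    return sigmoid(m * x + c)

def predict_class(x, m, c, threshold=0.5):
    """Return 1 (success) or 0 (failure) by thresholding the probability."""
    return 1 if predict_proba(x, m, c) >= threshold else 0

# Hypothetical coefficients: x could be "hours studied", the class "passed"
m, c = 1.5, -6.0
for hours in (2, 4, 6):
    print(hours, round(predict_proba(hours, m, c), 3), predict_class(hours, m, c))
```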

### Polynomial regression:

The polynomial regression technique is used to build a model that can handle data with a non-linear relationship. It gives a curve that is best suited to the data points, rather than a straight line.
Polynomial regression is fitted using the least-squares method. The purpose of regression analysis is to model the expected value of the dependent variable y in terms of the independent variable x.
The formula for this is Y = β0 + β1x + β2x² + … + βnxⁿ + e. Look particularly at the curve towards the ends to see whether those shapes and patterns make logical sense. Higher-degree polynomials can lead to weird extrapolation results.
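
A short sketch of a least-squares polynomial fit, using NumPy's `polyfit`. The data is generated from a known quadratic (an assumption made so the recovered coefficients are easy to check), and the degree is chosen by hand.

```python
import numpy as np

# Hypothetical noiseless data generated from y = 1 + 2x + 0.5x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Fit a degree-2 polynomial by least squares.
# polyfit returns coefficients from the highest power down to the constant.
coeffs = np.polyfit(x, y, deg=2)
model = np.poly1d(coeffs)

print("coefficients (x^2, x, 1):", np.round(coeffs, 3))
print("prediction at x = 4:", round(float(model(4.0)), 3))
```

Trying a much higher degree on the same six points is a quick way to see the odd behaviour at the ends of the curve that the text warns about.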

### Step-wise Regression:

It is used for fitting regression models in which the choice of predictive variables is carried out automatically.
At every step, a variable is added to or removed from the set of explanatory variables. The main approaches are forward selection, backward elimination, and bidirectional elimination.
The formula for the standardized coefficient is: b_j(std) = b_j × (s_xj / s_y), where s_xj and s_y are the standard deviations of the predictor and of the dependent variable.
1. Stepwise regression does two things: it adds predictors at each step and removes predictors at each step.
2. Forward selection starts with the most significant predictor in the ML model and then adds a feature at each step.
3. Backward elimination starts with all the predictors in the model and then removes the least significant variable at each step.

### Ridge Regression:

It is a method used when the dataset has multicollinearity, which means that the independent variables are strongly related to each other. Although the least-squares estimates are unbiased under multicollinearity, their variances are large. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

1. The assumptions of this regression are the same as those of least-squares regression, except that normality is not assumed.
2. It shrinks the values of the coefficients, but they never reach zero.
3. It is a regularization method and uses L2 regularization.
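
The effect of the L2 penalty can be sketched with the closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy. The function name `ridge_fit`, the synthetic data, and the choice α = 1 are assumptions for illustration, and the intercept is left out to keep the example short.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical data with two almost identical features (multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, alpha=0.0)    # alpha = 0 is plain least squares
w_ridge = ridge_fit(X, y, alpha=1.0)  # alpha > 0 adds the L2 penalty

print("least squares:", np.round(w_ols, 3))
print("ridge        :", np.round(w_ridge, 3))
```

With nearly duplicate features the least-squares coefficients tend to be unstable, while the ridge coefficients are smaller in overall size but still non-zero.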

### Lasso Regression:

Lasso is an abbreviation of Least Absolute Shrinkage and Selection Operator. It is similar to ridge regression in that it also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models.

1. Lasso regression shrinks coefficients to zero, which helps in feature selection for building a proper ML model.
2. It is also a regularization method, and it uses L1 regularization.
3. If there are many correlated features, it picks only one of them and shrinks the others to zero.
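
Why L1 regularization produces exact zeros can be sketched with the soft-thresholding operator. Under the simplifying assumption of an orthonormal design matrix, the lasso solution is the least-squares solution passed through this operator, while ridge only rescales it. The coefficient values and `lam` below are made up.

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink each coefficient toward zero by lam; small ones become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Hypothetical least-squares coefficients (orthonormal design assumed)
w_ols = np.array([3.0, -0.4, 0.2, -2.5])
lam = 0.5

w_lasso = soft_threshold(w_ols, lam)  # L1: subtract lam, clip at zero
w_ridge = w_ols / (1.0 + lam)         # L2: rescale, never exactly zero

print("lasso:", w_lasso)
print("ridge:", np.round(w_ridge, 3))
print("features kept by lasso:", np.count_nonzero(w_lasso))
```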

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

### Top 50 interview questions of Machine Learning

51. How to handle categorical variables in KNN?

Ans: Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1).

For example, a categorical variable named “Department” has 5 unique levels/categories. So we will create 5 dummy variables. Each dummy variable has 1 against its department and else 0.

52. Can KNN be used for Regression? How to use KNN for Regression?

Ans: Yes, K-nearest neighbour can be used for regression. In other words, the K-nearest neighbour algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbours.
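
A minimal sketch of KNN regression as described in this answer: find the k nearest training points and average their target values. The function name `knn_regress` and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a continuous value as the mean target of the k nearest neighbours."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                    # indices of the k closest rows
    return y_train[nearest].mean()

# Hypothetical one-feature training set
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0, 110.0])

pred = knn_regress(X_train, y_train, np.array([2.2]), k=3)
print("prediction for x = 2.2:", pred)
```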

53. Discuss the difference between KNN and K Means Algorithms.

Ans: KNN and k-means clustering both are very different algorithms that solve different problems and have their own meanings of what the variable ‘k’ is.  KNN is a supervised classification algorithm that will label new data points based on the ‘k’ number of nearest data points and k-means clustering is an unsupervised clustering algorithm that groups the data into ‘k’ number of clusters.

54. How to reduce the increased variance of the model other than changing k?

Ans: By using bagging-based decision boundaries. If there is no restriction on the number of times one can draw samples from the original dataset, a simple variance reduction method is to sample many times, fit a kNN model to each of these samples, and then take a majority vote of the models to classify each test data point. This variance reduction method is called bagging.

55. What is the effect of sampling on KNN?

Ans: Sampling does several things from the perspective of a single data point since kNN works on a point-by-point basis.

1. The average distance to the k nearest neighbours increases due to increased sparsity in the dataset.
2. Consequently, the area covered by k-nearest neighbours increases in size and covers a larger area of the feature space.
3. The sample variance increases.

A consequence of this change in input is an increase in variance. When we talk of variance, we refer to the variability in the predictions given different samples from the population. Why would the immediate effects of sampling lead to the increased variance of the model?

Notice that now a larger area of the feature space is represented by the same k data points. While our sample size has not grown, the population space that it represents has increased in size. This will result in higher variance in the proportion of classes in the k nearest data points, and consequently a higher variance in the classification of each data point.

56. What happens when we change the value of K in KNN?

Ans: Short Answer: The class boundaries of the predictions become more smooth as k increases.

Long Answer: What really is the significance of these effects? First, it gives hints that a lower k value makes the KNN model more “sensitive.” That is, it is more sensitive to the local changes in the dataset. The “sensitivity” of the model directly translates to its variance.

All of these examples point to an inverse relationship between variance and k. Additionally, consider how KNN operates when k reaches its maximum value, k = n (where n is the number of points in the training set). In this case, the majority class in the training set will always dominate the predictions. The model will simply pick the most abundant class in the data and never deviate, effectively resulting in zero variance. Therefore, to reduce variance, k must be increased.

Final Verdict: In order to offset the increased variance due to sampling, k can be increased to decrease model variance.

57. What is the thumb rule to approach the KNN problem?

Ans:

1. Initialize the value of k.
2. To get the predicted class for a test point, iterate from 1 to the total number of training data points:
• Calculate the distance between the test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. Other metrics that can be used include Chebyshev, cosine, etc.
• Sort the calculated distances in ascending order based on distance values.
• Get the top k rows from the sorted array.
• Get the most frequent class of these rows.
• Return the predicted class.
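
The steps above can be sketched in plain Python as follows. The function name `knn_predict` and the toy points are assumptions, and Euclidean distance is used as in the description.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, test_row, k):
    """Classify test_row by majority vote among its k nearest training rows."""
    # Distance between the test row and each row of training data
    distances = [
        (math.dist(row, test_row), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Sort in ascending order of distance and keep the top k rows
    distances.sort(key=lambda pair: pair[0])
    top_k_labels = [label for _, label in distances[:k]]
    # The most frequent class among the k neighbours is the prediction
    return Counter(top_k_labels).most_common(1)[0][0]

# Hypothetical training data with two classes
train_rows = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_rows, train_labels, (2, 2), k=3))
print(knn_predict(train_rows, train_labels, (6, 5), k=3))
```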

58. What is SVM Algorithm?

Ans: SVM stands for support vector machine, it is a supervised machine learning algorithm that can be used for both Regression and Classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is a number of features you have) with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, such as the height and hair length of an individual, we would first plot these two variables in two-dimensional space, where each point has two coordinates (these co-ordinates are known as Support Vectors). Now, we find a line that splits the data between the two differently classified groups. This will be the line such that the distances from the closest point in each of the two groups are as large as possible. In the original example figure, the line that splits the data into the two groups is the black line, since the two closest points are the farthest apart from it. This line is our classifier. The class we assign to new test data then depends on which side of the line it lands.

59. What are support Vectors?

#### 60. What is the purpose of the Support Vector in SVM?

Ans: A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the distance margin between the two classes. The extreme points in the data sets that define the hyperplane are the support vectors.

#### 61. What are kernels?

Ans: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be of different types.

There are four types of kernels in SVM.

1. Linear kernel
2. Polynomial kernel
3. Radial basis function (RBF, or Gaussian) kernel
4. Sigmoid kernel

#### 62. What is Kernel Trick?

Ans: Short Answer: It allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space.

Long Answer:

1. For a dataset with n features (n-dimensional), SVMs find an (n-1)-dimensional hyperplane to separate it (let us say for classification).
2. Thus, SVMs perform very badly with datasets that are not linearly separable.
3. But, quite often, it’s possible to transform our not-linearly-separable dataset into a higher-dimensional dataset where it becomes linearly separable, so that SVMs can do a good job.
4. Unfortunately, quite often, the number of dimensions you have to add (via transformations) depends on the number of dimensions you already have (and not linearly). For datasets with a lot of features, it becomes next to impossible to try out all the interesting transformations.
5. Enter the Kernel Trick:
• Thankfully, the only thing SVMs need to do in the (higher-dimensional) feature space (while training) is compute the pair-wise dot products.
• For a given pair of vectors (in the lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (the kernel function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first.
• We are saved! With the kernel trick, SVMs can do well with datasets that are not linearly separable.
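
A small numerical check of this idea, assuming a degree-2 polynomial kernel K(a, b) = (a·b)² on two-dimensional vectors. The explicit feature map `phi` is written out only to show that the kernel gives the same dot product without ever building the higher-dimensional vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (3-D output)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # transform first, then take the dot product
implicit = poly_kernel(a, b)       # kernel trick: no transformation needed

print(explicit, implicit)
```

Both numbers agree, but the kernel version only ever touches the original two features, which is what keeps SVM training practical when the implicit feature space is very large.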

#### 63. Why is SVM called as Large Margin Classifier?

Ans: Short Answer: Because it places the decision boundary such that it maximizes the distance between two clusters.

Long Answer: Choosing the best hyperplane means choosing the one whose distance from the nearest training points is the maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin of a training point (x, y) with respect to a hyperplane (w, b) is given by γ = y(w·x + b) / ‖w‖, which is simply the functional margin y(w·x + b) normalized by ‖w‖. These intuitions lead to the maximum margin classifier, which is a precursor to the SVM.

#### 64. What is the difference between SVM and Logistic Regression?

Ans:

1. SVM tries to find the “best” margin (distance between the line and the support vectors) that separates the classes and this reduces the risk of error on the data, while logistic regression does not, instead it can have different decision boundaries with different weights that are near the optimal point.
2. SVM works well with unstructured and semi-structured data like text and images while logistic regression works with already identified independent variables.
3. SVM is based on the geometrical properties of the data while logistic regression is based on statistical approaches.
4. Logistic Regression can’t be applied to a nonlinearly separable dataset whereas SVM can be applied.
5. The risk of overfitting is less in SVM, while Logistic regression is vulnerable to overfitting.

#### 65. When to Use Logistic Regression vs Support Vector Machine?

Ans: Depending on the number of training sets (data)/features that you have, you can choose to use either logistic regression or support vector machine.

Let’s take these as an example where:
n = number of features,
m = number of training examples

1. If n is large (1–10,000) and m is small (10–1000): use logistic regression or SVM with a linear kernel.
2. If n is small (1–1000) and m is intermediate (10–10,000): use SVM with (Gaussian, polynomial, etc) kernel
3. If n is small (1–100), m is large (50,000–1,000,000+): first, manually add more features and then use logistic regression or SVM with a linear kernel

#### 66. What do the C and gamma parameters in SVM signify?

Cost and Gamma are the hyper-parameters that decide the performance of an SVM model. There should be a fine balance between Variance and Bias for any ML model. (this is a science and an art – as we call it in empirical studies)

For SVM, a high value of gamma fits the training data more closely, which gives low bias but high variance, and vice versa. Similarly, a large value of the cost parameter (C) gives low bias but high variance, and vice versa.

Following table summarizes the above explanation –

The art is to choose a model with optimum variance and bias. Therefore, you need to choose the values of C and Gamma accordingly.

Optimum values of C and Gamma can be found by using methods like Grid search.

The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’.

The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.  If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of the high density of any pair of two classes.
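
The "reach" of a training example can be illustrated with the RBF kernel, K(a, b) = exp(-gamma · ‖a − b‖²). This is a rough sketch: the two points and the gamma values are assumptions picked to show the two extremes.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel value: close to 1 means strong influence, close to 0 means none."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

support_vector = np.array([0.0, 0.0])
far_point = np.array([3.0, 0.0])  # hypothetical point 3 units away

low = rbf_kernel(support_vector, far_point, gamma=0.01)   # low gamma: influence reaches far
high = rbf_kernel(support_vector, far_point, gamma=10.0)  # high gamma: influence stays close

print("gamma = 0.01:", round(float(low), 3))
print("gamma = 10  :", float(high))
```

With low gamma the distant point is still strongly influenced by the support vector, while with high gamma the influence has effectively vanished at the same distance.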

#### 67. What are the advantages and disadvantages of SVM?

Advantages:

• SVMs are very good when we have no idea about the data.
• They work well even with unstructured and semi-structured data like text, images, and trees.
• The kernel trick is a real strength of SVM. With an appropriate kernel function, we can solve many complex problems.
• Unlike neural networks, SVM training does not get stuck in local optima.
• It scales relatively well to high-dimensional data.
• SVM models generalize well in practice, so the risk of over-fitting is lower.
• SVM is always compared with ANN. When compared to ANN models, SVMs often give better results.

Disadvantages:

• Choosing a “good” kernel function is not easy.
• Long training time for large datasets.
• Difficult to understand and interpret the final model, variable weights, and individual impact.
• Since the final model is not so easy to see, we cannot make small calibrations to the model, so it is tough to incorporate our business logic.
• The SVM hyperparameters are cost (C) and gamma. It is not that easy to fine-tune these hyper-parameters, and it is hard to visualize their impact.

#### 68. What is Naïve Bayes Algorithm?

Ans: It is a classification algorithm that predicts the probability of each data point belonging to a class and then classifies the point as the class with the highest probability.

### Discuss Bayes Theorem.

Bayes’ Theorem gives us the probability of an event actually happening by combining the conditional probability given some result and the prior knowledge of an event happening.

Conditional probability is the probability that something will happen, given that something has occurred.  In other words, the conditional probability is the probability of X given a test result or P(X|Test).  For example, what is the probability an e-mail is spam given that my spam filter classified it as spam.

The prior probability is based on previous experience or the percentage of previous samples.  For example, what is the probability that any email is spam?

Formally, P(A|B) = P(B|A) × P(A) / P(B), where:

• P(A|B) = Posterior probability = Probability of A given B happened
• P(B|A) = Conditional probability = Probability of B happening if A is true
• P(A) = Prior probability = Probability of A happening in general
• P(B) = Evidence probability = Probability of getting a positive test
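
As a worked example of these terms, the sketch below computes the posterior probability that an e-mail is spam given that the filter flagged it. All three input probabilities are made-up numbers, and the evidence P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    # Evidence P(B): flagged spam plus flagged non-spam
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers: 20% of mail is spam, the filter flags 90% of spam
# and wrongly flags 5% of legitimate mail.
p_spam_given_flag = posterior(prior_a=0.2, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(round(p_spam_given_flag, 3))
```

Here the numerator is 0.9 × 0.2 = 0.18 and the evidence is 0.18 + 0.05 × 0.8 = 0.22, so the posterior is 0.18 / 0.22, roughly 0.82.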

#### 69. Why is Naïve Bayes Naïve?

Ans: In Layman’s Terms: The simple meaning of naive is being willing to believe that life is simple and fair, which is not true. Naive Bayes is naive because it assumes that the features going into the model are not related to each other in any way, so a change in one variable will not directly affect another variable.

Long Answer: Naive Bayes (NB) is ‘naive’ because it makes the assumption that features of measurement are independent of each other. This is naive because it is (almost) never true. Here is how it works even then – NB is a very intuitive classification algorithm. It asks the question, “Given these features, does this measurement belong to class A or B?”, and answers it by taking the proportion of all previous measurements with the same features belonging to class A multiplied by the proportion of all measurements in class A. If this number is bigger than the corresponding calculation for class B then we say the measurement belongs in class A.
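
The calculation described above can be sketched with simple counting. The helper names `train_nb` and `predict_nb` and the tiny e-mail dataset are assumptions, and no smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count how often each class and each (class, feature value) occurs."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict_nb(row, class_counts, feature_counts):
    """Score each class as P(class) times the product of P(feature | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # proportion of all measurements in this class
        for i, value in enumerate(row):
            # Features are treated as independent: this is the naive assumption
            score *= feature_counts[(label, i)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical e-mails described by (first word, contains a link?)
rows = [("free", "link"), ("free", "nolink"), ("hello", "nolink"),
        ("hello", "nolink"), ("free", "link")]
labels = ["spam", "spam", "ham", "ham", "spam"]

model = train_nb(rows, labels)
print(predict_nb(("free", "link"), *model))
print(predict_nb(("hello", "nolink"), *model))
```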

#### 71. Applications of Naïve Bayes Classification Algorithms?

Ans: Some of the real-world examples are as given below

• To mark an email as spam, or not spam?
• Classify a news article about technology, politics, or sports?
• Check a piece of text expressing positive emotions, or negative emotions?
• Also used for face recognition software.

#### 72. What are the Advantages and Disadvantages of using the Naïve Bayes Algorithm?

Ans: Advantages:

1. Fast.
2. Highly scalable.
3. Used for binary and multiclass classification.
4. Great choice for text classification.
5. It can be trained easily on small data sets.

Disadvantages:

Naive Bayes considers that the features are independent of each other. However, in the real world, features depend on each other.

#### 73. What is K-Means Clustering? What are the steps for it?

Ans: K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

If k is given, the K-means algorithm can be executed in the following steps:

• Partition the objects into k non-empty subsets.
• Identify the cluster centroids (mean points) of the current partition.
• Compute the distance from each point to each centroid and assign each point to the cluster whose centroid is nearest.
• After re-allotting the points, find the centroids of the new clusters formed.
• Repeat the last two steps until the assignments no longer change.
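
A compact sketch of this loop in NumPy. As a simplification, it starts from k randomly chosen data points as centroids rather than from an initial partition; the function name `kmeans`, the seed, and the toy points are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Alternate between assigning points to centroids and recomputing centroids."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise centroids with k distinct data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)  # nearest centroid for each point
        # New centroid = mean of the points assigned to it (kept as-is if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(np.round(centroids, 2))
```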

#### 74. Why is the word “means” associated with the name of the K-Means algorithm?

Ans: The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.

There are k-medoids and k-medians algorithms as well.

k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids choose datapoints as centers (medoids or exemplars).

k-medians is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median.

#### 75. How to find the optimum number of clusters in K-Means? Discuss the elbow curve/elbow method?

Ans: The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of square (WSS)] is minimized. The total WSS measures the compactness of the clustering and we want it to be as small as possible.

The Elbow method looks at the total WSS as a function of the number of clusters: one should choose the number of clusters such that adding another cluster does not improve the total WSS much. In the accompanying plot, notice the elbow at k = 3.

The optimal number of clusters can be found as follows:

1. Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of square (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
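
The first two steps can be sketched as below; plotting is left out, and the WSS values are simply printed so the bend can be read off. The helper `kmeans_wss`, the random seeds, and the three synthetic blobs are assumptions for illustration.

```python
import numpy as np

def kmeans_wss(points, k, n_iters=50, seed=0):
    """Run a basic k-means and return the total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Squared distance of every point to its nearest final centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

# Hypothetical data: three tight blobs of 30 points each
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])]
points = np.vstack(blobs)

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

Because this basic k-means depends on its random starting centroids, it is common to rerun each k several times and keep the lowest WSS before looking for the elbow.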

#### 76. What is the difference between K-Means and Hierarchical Clustering? When to use which?

Ans: Hierarchical Clustering and k-means clustering complement each other. In hierarchical clustering, the researcher is not aware of the number of clusters to be made whereas, in k-means clustering, the number of clusters to be made is specified before-hand.
Advice- If unaware of the number of clusters to be formed, use hierarchical clustering to determine the number and then use k-means clustering to make more stable clusters as hierarchical clustering is a single-pass exercise whereas k-means is an iterative process.

#### 77. What are the advantages and disadvantages of using K-Means Algorithms?

Ans: Advantages:

1) If the number of variables is huge, K-Means is most of the time computationally faster than hierarchical clustering, provided we keep k small.

2) K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages:

1) It is difficult to predict the value of K.
2) It does not work well with global clusters.
3) Different initial partitions can result in different final clusters.
4) It does not work well with clusters (in the original data) of different sizes and different densities.

#### 78. What is Hierarchical Clustering?

Ans: Hierarchical clustering is another unsupervised learning algorithm that is used to group together the unlabelled data points having similar characteristics. Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is first treated as a single cluster, and then pairs of clusters are successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster and the process of clustering involves dividing (Top-down approach) the one big cluster into various small clusters

#### 79. What are the steps to perform Agglomerative Hierarchical Clustering?

Ans: Agglomerative clustering is the most used and most important form of hierarchical clustering. The steps to perform it are as follows −

• Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.
• Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points. This will result in a total of K-1 clusters.
• Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.
• Step 4 − Now, to form one big cluster, repeat the above steps until K becomes 1, i.e. there are no more clusters left to join.
• Step 5 − At last, after making one single big cluster, dendrograms will be used to divide it into multiple clusters depending upon the problem.
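
A plain-Python sketch of the merging loop. Two simplifications are assumed: cluster distance is measured with single linkage (the closest pair of points), and instead of merging all the way to one cluster and then cutting the dendrogram, the loop stops once `n_clusters` remain, which gives the same grouping as cutting at that level.

```python
import math

def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # Step 1: every point is its own cluster

    def linkage(a, b):
        # Single linkage (an assumption): distance between the closest pair
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two closest clusters
        del clusters[j]
    return clusters

# Hypothetical points forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
result = agglomerative(points, n_clusters=2)
print(result)
```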

#### 80. What is a Dendrogram and what is its importance in Hierarchical Clustering?

Ans: A dendrogram is a type of Tree Diagram showing hierarchical clustering — relationships between similar sets of data. They are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.

The role of the dendrogram starts once the big cluster is formed. Dendrogram will be used to split the clusters into multiple clusters of related data points depending upon our problem.

#### 81. What is Boosting?

Ans: Boosting is a method of converting weak learners into strong learners. In boosting, each new tree is a fit on a modified version of the original data set.

Purpose of Boosting: It helps the weak learner to be modified to become better.

How it evolved: The first boosting algorithm to gain popularity was AdaBoost, or Adaptive Boosting. It was later extended and generalized as Gradient Boosting.

#### 82. What is AdaBoost?

Ans: AdaBoost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous error, the observations that were incorrectly classified then carry more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problems.

#### 83. What is Gradient Boosting Method (GBM)?

Ans: Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights for every incorrect classified observation at every iteration like AdaBoost, the Gradient Boosting method tries to fit the new predictor to the residual errors made by the previous predictor.

GBM uses Gradient Descent to find the shortcomings in the previous learner’s predictions. The GBM algorithm can be given in the following steps.

Fit a model to the data, F1(x) = y

Fit a model to the residuals, h1(x) = y − F1(x)

Create a new model, F2(x) = F1(x) + h1(x)
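
The idea of repeatedly fitting a new learner to the residuals can be sketched with one-split regression stumps. This is a minimal illustration under squared-error loss, with a constant as the first model; the helper `fit_stump`, the sine-wave data, and the number of rounds are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # every split leaves both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

# Hypothetical data: one feature, non-linear target
x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)

prediction = np.full_like(y, y.mean())  # F1: a constant model
errors = [np.mean((y - prediction) ** 2)]
for _ in range(20):
    residual = y - prediction           # what the current model still gets wrong
    h = fit_stump(x, residual)          # weak learner fitted to the residuals
    prediction = prediction + h(x)      # next model = current model + h
    errors.append(np.mean((y - prediction) ** 2))

print("MSE before boosting:", round(float(errors[0]), 4))
print("MSE after 20 rounds:", round(float(errors[-1]), 4))
```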

By combining weak learner after weak learner, our final model is able to account for a lot of the error from the original model and reduces this error over time.

#### 84. What is XGBoost?

Ans: XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow in implementation because of sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance. XGBoost provides:

• Parallelization of tree construction using all of your CPU cores during training.
• Distributed Computing for training very large models using a cluster of machines.
• Out-of-Core Computing for very large datasets that don’t fit into memory.
• Cache Optimization of data structures and algorithm to make the best use of hardware.

#### 85. What are the basic enhancements done to Gradient Boosting?

Ans: Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

We will look at 4 enhancements to basic gradient boosting:

1. Tree Constraints
2. Shrinkage
3. Random sampling
4. Penalized Learning

1. Tree Constraints: A good general heuristic is that the more constrained tree creation is, the more trees you will need in the model. Conversely, the less constrained the individual trees are, the fewer trees will be required.

Below are some constraints that can be imposed on the construction of decision trees:

• Number of trees: adding more trees to the model is generally very slow to cause overfitting. The advice is to keep adding trees until no further improvement is observed.
• Tree depth: deeper trees are more complex trees, and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
• Number of nodes or number of leaves: like depth, this can constrain the size of the tree, but it is not constrained to a symmetrical structure if other constraints are used.
• Number of observations per split: this imposes a minimum constraint on the amount of training data at a training node before a split can be considered.
• Minimum improvement to loss: this is a constraint on the improvement of any split added to a tree.
2. Weighted Updates (Shrinkage): The predictions of each tree are added together sequentially. The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called shrinkage or a learning rate.
3. Stochastic Gradient Boosting (Random Sampling): A big insight into bagging ensembles and the random forest was allowing trees to be greedily created from subsamples of the training dataset. This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models. This variation of boosting is called stochastic gradient boosting. At each iteration, a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.
4. Penalized Gradient Boosting (Penalized Learning): Additional constraints can be imposed on the parameterized trees in addition to their structure. Classical decision trees like CART are not used as weak learners. Instead, a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees are called weights in some literature. As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as L1 regularization of weights and L2 regularization of weights. The additional regularization term helps to smooth the final learned weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.

#### 86. What is Dimensionality Reduction? Why is it used?

Ans: Dimensionality reduction refers to the process of converting a set of data with vast dimensions into data with fewer dimensions, while ensuring that it still conveys similar information concisely.

We use these techniques in machine learning to obtain better features for a classification or regression task.

#### 87. What are the commonly used Dimensionality Reduction Techniques?

Ans: The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

#### 88. How does PCA work? When to use?

Ans: Short Answer: Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.

High dimensionality means that the dataset has a large number of features. The primary problem associated with high dimensionality in the machine learning field is model overfitting, which reduces the ability to generalize beyond the examples in the training set.

PCA in Layman’s Term: Consider the 2D XY plane.

For the sake of intuition, let us consider variance as the spread of data – the distance between the two farthest points.

Assumption:
Typically, it is believed, that if the variance of data is large, it offers more information, than data that has a small variance. (This may or may not be true). This is the assumption which PCA intends to exploit.

I give you 4 points – {(1,1), (2,2), (3,3), (4,4)}
(all lie on the line X=Y)

What is the variance on X-axis?
Variance(X) = 4-1 = 3

What is the variance on Y-axis?
Variance(Y) = 4-1 = 3

Can we obtain new data with higher variance in some manner?
Rotate your XY system by 45 degrees anticlockwise. What happens? The line X=Y has now become the X(new)-axis. And, X = -Y is now the Y(new)-axis. Let’s compute the variance again (in the form of distance)

Variance(X(new)) = distance ((4,4), (1,1)) = sqrt(18) = 4.24
Variance(Y(new)) = 0, because every point lies on the line X=Y and therefore has no spread along the new Y-axis.

#### 89. What did we get by doing this rotation?

Ans: The original data had a highest variance on any axis of 3. This rotation gave us a variance of 4.24.

That was the intuitive explanation of what PCA does. Just for further clarification

Eigenvalues = variance of the data along a particular axis in the new coordinate system. In the above example, with spread standing in for variance, Eigenvalue(X(new)) = 4.24.

Eigenvectors = the vectors which represent the new coordinate system. In above example, vector [1,1], would be an eigenvector for X(new), and [1,-1] eigenvector for Y(new). Since they are just directions – solvers typically give us unit vectors.

Getting transformed data
Once you have the eigenvectors, a dot product of the eigenvector with the original point will give you the new point in the new coordinate system.

Diagonalization: This is the part where you equate covariance to lambda*I. This is basically trying to find an eigenvector, such that all points would lie on the same line, and thus it will have only elements of variance, and covariance terms would be zero.

Steps of PCA:

1. Calculate the covariance matrix X of data points.
2. Calculate the eigenvectors and corresponding eigenvalues.
3. Sort the eigenvectors by their eigenvalues in decreasing order.
4. Choose the first k eigenvectors; these will be the new k dimensions.
5. Transform the original n-dimensional data points into k dimensions.
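The five steps can be followed by hand on the four points from the intuition above. The sketch below is a minimal illustration for two-dimensional data only, using the closed-form eigenvalues of a symmetric 2x2 matrix; `pca_2d` is a hypothetical helper name. It uses the statistical (sample) variance rather than the "distance between farthest points" shortcut, so the eigenvalue differs from 4.24, but the spread of the projected points still comes out as sqrt(18).

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first unit eigenvector, 1-D projection) for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Step 1: sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(cx * cx for cx, _ in centered) / (n - 1)
    syy = sum(cy * cy for _, cy in centered) / (n - 1)
    sxy = sum(cx * cy for cx, cy in centered) / (n - 1)

    # Steps 2-3: eigenvalues in closed form, largest first.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    gap = math.sqrt(max(trace * trace / 4 - det, 0.0))
    eigenvalues = [trace / 2 + gap, trace / 2 - gap]

    # Eigenvector of the largest eigenvalue, scaled to unit length.
    if abs(sxy) > 1e-12:
        vec = (eigenvalues[0] - syy, sxy)
    elif sxx >= syy:
        vec = (1.0, 0.0)
    else:
        vec = (0.0, 1.0)
    norm = math.hypot(vec[0], vec[1])
    vec = (vec[0] / norm, vec[1] / norm)

    # Steps 4-5: keep k = 1 component and project with a dot product.
    projected = [cx * vec[0] + cy * vec[1] for cx, cy in centered]
    return eigenvalues, vec, projected


points = [(1, 1), (2, 2), (3, 3), (4, 4)]
eigenvalues, direction, projected = pca_2d(points)
```

Here `direction` comes out as the unit vector along [1, 1], and the second eigenvalue is zero because the points have no spread along [1, -1].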

#### 90. How does LDA work? When to use?

Ans: LDA is a way to reduce ‘dimensionality’ while at the same time preserving as much of the class discrimination information as possible.

How does it work?
Basically, LDA helps you find the ‘boundaries’ around clusters of classes. It projects your data points on a line so that your clusters ‘are as separated as possible’, with each cluster having a relative (close) distance to a centroid.

What was that stuff about dimensionality?
Let’s say you have a group of data points in 2 dimensions, and you want to group them into 2 groups. LDA reduces the dimensionality of your settings like so:
K(Groups) = 2. 2-1 = 1.

Why? Because “The K centroids lie in an at most K-1-dimensional affine subspace”. What is the affine subspace? It’s a geometric concept or *structure* that says, “I am going to generalize the affine properties of Euclidean space”. What are those affine properties of the Euclidean space? Basically, it’s the fact that we can represent a point with 3 coordinates in a 3-dimensional space (with a nod toward the fact that there may be more than 3 dimensions that we are ultimately dealing with).

So, we should be able to represent a point with 2 coordinates in 2-dimensional space and represent a point with 1 coordinate in a 1-dimensional space. LDA reduced the dimensionality of our 2-dimension problem down to one dimension. So now we can get down to the serious business of listening to the data. We now have 2 groups, and 2 points in any dimension can be joined by a line. How many dimensions does a line have? 1! Now we are cooking with Crisco!

So we get a bunch of these data points, represented by their 2d representation (x,y). We are going to use LDA to group these points into either group 1 or group 2.

#### 91. What are the Steps for LDA?

Ans: Steps of LDA:

1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d x k dimensional matrix W (where every column represents an eigenvector).
5. Use the d x k eigenvector matrix to transform the samples onto the new subspace.

This can be summarized by the matrix multiplication.

Y = X x W (where X is an n x d dimensional matrix representing the n samples, W is the d x k eigenvector matrix, and Y holds the transformed n x k dimensional samples in the new subspace).
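For the two-group, two-dimensional case discussed earlier, the steps collapse to a short calculation: with only two classes, the leading eigenvector is proportional to the inverse within-class scatter matrix times the difference of the class means. The sketch below assumes exactly that special case and a non-singular within-class scatter matrix; `fisher_direction` and `project` are hypothetical helper names and the data is made up for illustration.

```python
import math


def class_mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, mean):
    """Scatter matrix entries (sxx, sxy, syy) of one class around its mean."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to S_W^-1 (mean_a - mean_b); two classes only."""
    mean_a, mean_b = class_mean(class_a), class_mean(class_b)
    sa, sb = scatter(class_a, mean_a), scatter(class_b, mean_b)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]  # within-class scatter
    det = sxx * syy - sxy * sxy  # assumed non-zero
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


def project(points, w):
    """Y = X x W for a single projection direction (k = 1)."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


group_1 = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 5)]
group_2 = [(1, 0), (2, 1), (3, 1), (4, 3), (5, 3)]
w = fisher_direction(group_1, group_2)
projected_1 = project(group_1, w)
projected_2 = project(group_2, w)
```

After projection, every point of the first group lands above every point of the second group on the one-dimensional line, so a single threshold separates them.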

#### 92. What is GDA?

Ans: When we have a classification problem in which the input features are continuous random variables, we can use GDA. It is a generative learning algorithm in which we assume p(x|y) is distributed according to a multivariate normal distribution and p(y) is distributed according to a Bernoulli distribution.

Gaussian discriminant analysis (GDA) is a generative model for classification where the distribution of each class is modeled as a multivariate Gaussian.

Advantages of Dimensionality Reduction:

• It helps in data compression and hence reduces the storage space required.
• It reduces computation time: fewer dimensions mean less computing, and they can also allow the use of algorithms unfit for a large number of dimensions.
• It takes care of multicollinearity, which improves model performance, by removing redundant features. For example, there is no point in storing a value in two different units (meters and inches).
• Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly.

Disadvantages of Dimensionality Reduction:

• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.

### Top 50 interview questions on Statistics

1. What are the different types of Sampling?
Ans: Some of the Common sampling ways are as follows:

• Simple random sample: Every member and every set of members has an equal chance of being included in the sample. Technology, random number generators, or some other sort of chance process is needed to get a simple random sample.

Example—A teacher puts students’ names in a hat and chooses without looking to get a sample of students.

Why it’s good: Random samples are usually fairly representative since they don’t favor certain members.

• Stratified random sample: The population is first split into groups. The overall sample consists of some members of every group. The members of each group are chosen randomly.

Example: A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.

Why it’s good: A stratified sample guarantees that members from each group will be represented in the sample, so this sampling method is good when we want some members from every group.

• Cluster random sample: The population is first split into groups. The overall sample consists of every member from some of the groups. The groups are selected at random.

Example: An airline company wants to survey its customers one day, so they randomly select 5 flights that day and survey every passenger on those flights.

Why it’s good: A cluster sample gets every member from some of the groups, so it’s good when each group reflects the population as a whole.

• Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every nth member is selected to be in the sample.

Example—A principal takes an alphabetized list of student names and picks a random starting point. Every 20th student is selected to take a survey.
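The four sampling schemes can be compared side by side with Python's standard `random` module. This is a small sketch on a made-up school of 400 students; the function names and the data layout (a dict of year group to student list) are assumptions for illustration.

```python
import random


def simple_random_sample(population, k, rng):
    """Every subset of size k is equally likely."""
    return rng.sample(population, k)


def stratified_sample(groups, per_group, rng):
    """Draw the same number of members at random from every group."""
    return {name: rng.sample(members, per_group) for name, members in groups.items()}


def cluster_sample(groups, n_clusters, rng):
    """Pick whole groups at random and keep every member of the chosen groups."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return [member for name in chosen for member in groups[name]]


def systematic_sample(ordered_population, step, rng):
    """Random starting point, then every step-th member."""
    start = rng.randrange(step)
    return ordered_population[start::step]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
years = ["freshmen", "sophomores", "juniors", "seniors"]
groups = {year: [f"{year}_{i}" for i in range(100)] for year in years}
students = sorted(name for members in groups.values() for name in members)

simple = simple_random_sample(students, 10, rng)
stratified = stratified_sample(groups, 25, rng)
cluster = cluster_sample(groups, 2, rng)
systematic = systematic_sample(students, 20, rng)
```

The stratified draw mirrors the student council example (25 from each year), while the cluster draw keeps every student from two randomly chosen years.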

2. What is the confidence interval? What is its significance?

Ans: A confidence interval, in statistics, is a range of values, computed from a sample, that is expected to contain a population parameter for a certain proportion of repeated samples. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. A confidence interval can be built for any confidence level, with the most common being 95% or 99%.

3. What are the effects of the width of the confidence interval?

• The confidence interval is used for decision making.
• As the confidence level increases, the width of the confidence interval also increases.
• As the width of the confidence interval increases, we tend to get useless information as well.
• A wide CI gives useless information.
• A narrow CI carries high risk.

4.  What is the level of significance (Alpha)?

Ans: The significance level, also denoted as alpha or α, is a measure of the strength of the evidence that must be present in your sample before you will reject the null hypothesis and conclude that the effect is statistically significant. The researcher determines the significance level before conducting the experiment.

The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.

Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.

5. What are Skewness and Kurtosis? What does it signify?

Ans: Skewness: It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in the data distribution. It differentiates extreme values in one tail versus the other. A symmetrical distribution will have a skewness of 0.

There are two types of skewness: positive and negative.

Positive skewness is when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.

Negative Skewness is when the tail of the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.

So, when is the skewness too much?

The rule of thumb seems to be:

• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
• If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.
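The rule of thumb is easy to apply once skewness is computed. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second); statistical packages may apply a small-sample correction, so treat the exact numbers as illustrative. The function names are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


def describe_skew(value):
    """Apply the rule-of-thumb thresholds from the list above."""
    if abs(value) <= 0.5:
        return "fairly symmetrical"
    if abs(value) <= 1:
        return "moderately skewed"
    return "highly skewed"


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

symmetric_skew = skewness(symmetric)
right_tailed_skew = skewness(right_tailed)
```

The right-tailed sample gives a positive value above 1, so it falls in the "highly skewed" band.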

Example

Let us take a very common example of house prices. Suppose we have house values ranging from \$100k to \$1,000,000 with the average being \$500,000.

If the peak of the distribution was left of the average value, portraying a positive skewness in the distribution, it would mean that many houses were being sold for less than the average value, i.e. \$500k. This could be for many reasons, but we are not going to interpret those reasons here.

If the peak of the distributed data was right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.

Kurtosis: Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one versus the other tail. It is actually the measure of outliers present in the distribution.

High kurtosis in a data set is an indicator that data has heavy tails or outliers. If there is a high kurtosis, then, we need to investigate why do we have so many outliers. It indicates a lot of things, maybe wrong data entry or other things. Investigate!

Low kurtosis in a data set is an indicator that data has light tails or a lack of outliers. If we get low kurtosis(too good to be true), then also we need to investigate and trim the dataset of unwanted results.

Mesokurtic: This distribution has kurtosis statistics similar to that of the normal distribution. It means that the extreme values of the distribution are similar to that of a normal distribution characteristic. This definition is used so that the standard normal distribution has a kurtosis of three.

Leptokurtic (Kurtosis > 3): Distribution is longer, tails are fatter. The peak is higher and sharper than Mesokurtic, which means that the data are heavy-tailed or have a profusion of outliers.

Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.

Platykurtic (Kurtosis < 3): Distribution is shorter; tails are thinner than the normal distribution. The peak is lower and broader than Mesokurtic, which means that the data are light-tailed or lack outliers. The reason is that the extreme values are less extreme than those of the normal distribution.
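The three labels can be checked numerically. The sketch below computes the plain (non-excess) moment kurtosis, the convention under which the normal distribution scores 3; the function names and the tolerance used for "close to 3" are assumptions.

```python
def kurtosis(data):
    """Moment kurtosis m4 / m2 ** 2 (a normal distribution gives 3)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


def describe_kurtosis(value, tolerance=0.1):
    if abs(value - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if value > 3 else "platykurtic"


flat_sample = [1, 2, 3, 4, 5]                       # no extreme values
outlier_sample = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # two far outliers

flat_kurtosis = kurtosis(flat_sample)
outlier_kurtosis = kurtosis(outlier_sample)
```

The flat sample scores 1.7 (platykurtic) and the sample with two outliers scores 5 (leptokurtic).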

6. What are Range and IQR? What does it signify?

Ans: Range: The range of a set of data is the difference between the highest and lowest values in the set.

IQR(Inter Quartile Range): The interquartile range (IQR) is the difference between the first quartile and the third quartile. The formula for this is:

IQR = Q3 – Q1

The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartiles are, indicates how spread out the middle 50% of our data set is.
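Both measures take a couple of lines with the standard `statistics` module. Note that several quartile conventions exist; this sketch relies on the default method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 for the same data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

data_range = max(data) - min(data)      # highest value minus lowest value
q1, median, q3 = quantiles(data, n=4)   # the three cut points for quartiles
iqr = q3 - q1                           # IQR = Q3 - Q1
```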

7.  What is the difference between Variance and Standard Deviation? What is its significance?

Ans: The mean, a measure of central tendency, gives you an idea of the average of the data points (i.e. the center location of the distribution). Next, you want to know how far your data points are from the mean. Here comes the concept of variance, which measures how far your data points are from the mean (in simple terms, the variation of your data points from the mean).

Standard deviation is simply the square root of variance, and it is also used to measure the variation of your data points. You may be asking why we use standard deviation when we have variance. The reason is to keep the calculations in the same units: if the mean is in cm (or m), then the variance is in cm² (or m²), whereas the standard deviation is in cm (or m). So we use standard deviation most.

8. What is selection Bias? Types of Selection Bias?

Ans: Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

• Sampling bias: a biased sample caused by non-random sampling
• Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
• Exposure: includes clinical susceptibility bias, protopathic bias, indication bias.
• Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
• Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where those that ‘failed’ are only included
• Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you are assuming that the data is not as spread out as it might actually be.

9.  What are the ways of handling missing Data?

• Delete rows with missing data
• Mean/Median/Mode imputation
• Assigning a unique value
• Predicting the missing values using Machine Learning Models
• Using an algorithm that supports missing values, like random forests.
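As a small illustration of the second option, mean imputation can be sketched as follows. The helper name `impute_mean` and the use of `None` to mark a missing value are assumptions for this example.

```python
from statistics import fmean


def impute_mean(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
```

Every gap receives the same value (30.0 here), which is why this method makes the data look less spread out than it may really be.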

10.  What are the different types of the probability distribution? Explain with example?

Ans: The common Probability Distribution is as follows:

1. Bernoulli Distribution
2. Uniform Distribution
3. Binomial Distribution
4. Normal Distribution
5. Poisson Distribution

1. Bernoulli Distribution: A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Example: Whether it is going to rain tomorrow or not, where rain denotes success and no rain denotes failure. Another example is winning (success) or losing (failure) a game.

2. Uniform Distribution: When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the n number of possible outcomes of a uniform distribution are equally likely.

Example: Rolling a fair die.

3. Binomial Distribution: A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose and where the probability of success and failure is the same for all the trials is called a Binomial Distribution.

• Each trial is independent.
• There are only two possible outcomes in a trial- either a success or a failure.
• A total number of n identical trials are conducted.
• The probability of success and failure is the same for all trials. (Trials are identical.)

Example: Tossing a coin.

4. Normal Distribution: Normal distribution represents the behavior of most of the situations in the universe (That is why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables often turns out to be normally distributed, contributing to its widespread application. A normal distribution has the following characteristics:

• The mean, median, and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
• The total area under the curve is 1.
• Exactly half of the values are to the left of the center and the other half to the right.

5. Poisson Distribution: A distribution is called Poisson distribution when the following assumptions are valid:

• Any successful event should not influence the outcome of another successful event.
• The rate of success is constant: the probability of success in an interval depends only on its length, so two intervals of equal length have the same probability of success.
• The probability of success in an interval approaches zero as the interval becomes smaller.

Example: The number of emergency calls recorded at a hospital in a day.
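The discrete distributions above have short probability mass functions, which can be written out directly with the `math` module. The function names below are hypothetical; the formulas are the standard ones.

```python
import math


def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p (k is 0 or 1)."""
    return p if k == 1 else 1 - p


def uniform_pmf(n_outcomes):
    """Each of n equally likely outcomes, e.g. the faces of a fair die."""
    return 1 / n_outcomes


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval with the given average rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


two_heads_in_four_tosses = binomial_pmf(2, 4, 0.5)
no_calls_today = poisson_pmf(0, 3)  # assumed average of 3 emergency calls per day
```

For instance, the chance of exactly two heads in four fair coin tosses is 6/16 = 0.375.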

11. What are the statistical Tests? List Them.

Ans: Statistical tests are used in hypothesis testing. They can be used to:

• determine whether a predictor variable has a statistically significant relationship with an outcome variable.
• estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

Common Tests in Statistics:

1. T-Test/Z-Test
2. ANOVA
3. Chi-Square Test
4. MANOVA

12. How do you calculate the sample size required?

Ans: You can use the margin of error (ME) formula to determine the desired sample size. Since ME = t/z × S/√n, solving for n gives:

n = (t/z × S / ME)²

• t/z = t/z score used to calculate the confidence interval
• ME = the desired margin of error
• S = sample standard deviation
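As a quick worked example, the sample size follows from rearranging the margin of error relation ME = z × S/√n for n. The sketch below assumes a z score (large-sample case) taken from the standard normal distribution; the function name and the example numbers are made up.

```python
import math
from statistics import NormalDist


def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Smallest n with z * std_dev / sqrt(n) <= margin_of_error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.ceil((z * std_dev / margin_of_error) ** 2)


n_needed = required_sample_size(std_dev=15, margin_of_error=3)
```

With S = 15 and ME = 3 at 95% confidence, (1.96 × 15 / 3)² is about 96.04, so 97 observations are needed.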

13. What are the different Biases associated when we sample?

Ans: Potential biases include the following:

• Sampling bias: a biased sample caused by non-random sampling
• Undercoverage bias: sampling too few observations from a segment of the population
• Survivorship bias: error of overlooking observations that did not make it past a form of the selection process.

14.  How to convert normal distribution to standard normal distribution?

Standardized normal distribution has mean = 0 and standard deviation = 1

To convert a normal distribution to the standard normal distribution, we can use the formula:

X (standardized) = (x-µ) / σ
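Applied to a whole dataset, the formula looks like the sketch below. It treats the data as the full population (so it uses the population standard deviation); the function name and the numbers are illustrative.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Apply (x - mu) / sigma to every value."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
z_scores = standardize(measurements)
```

After the transformation, the values have mean 0 and standard deviation 1; the largest measurement, 9, becomes (9 - 5) / 2 = 2.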

15. How to find the mean length of all fishes in a river?

• Define the confidence level (most common is 95%)
• Take a sample of fishes from the river (to get better results the number of fishes > 30)
• Calculate the mean length and standard deviation of the lengths
• Calculate t-statistics
• Get the confidence interval in which the mean length of all the fishes should be.

16.  What do you mean by the degree of freedom?

• DF is defined as the number of options we have
• DF is used with t-distribution and not with Z-distribution
• For a series, DF = n-1 (where n is the number of observations in the series)

17. What do you think if DF is more than 30?

• As DF increases the t-distribution reaches closer to the normal distribution
• At low DF, we have fat tails
• If DF > 30, then t-distribution is as good as the normal distribution.

18. When to use t distribution and when to use z distribution?

Use the Z-distribution only when both of the following conditions are satisfied:

• The population standard deviation is known.
• The sample size is greater than 30.

In that case, CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n

Otherwise, use the t-distribution: CI = x (bar) – t*s/√n to x (bar) + t*s/√n
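The Z-based interval can be computed with the standard library alone. This sketch assumes the population standard deviation is known and the sample has more than 30 observations; the t-based interval would need a t critical value, which the standard library does not provide. The function name and data are made up.

```python
import math
from statistics import NormalDist, fmean


def z_confidence_interval(sample, population_sd, confidence=0.95):
    """Return (low, high) = x_bar -/+ Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    x_bar = fmean(sample)
    half_width = z * population_sd / math.sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width


lengths = [48, 52] * 18  # 36 measurements with mean 50
low, high = z_confidence_interval(lengths, population_sd=6)
```

With n = 36 and σ = 6, the half-width is Z × 6/6, so the 95% interval is roughly 50 ± 1.96.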

19. What are H0 and H1? What is H0 and H1 for the two-tail test?

• H0 is known as the null hypothesis. It is the normal case/default case.

For one tail test x <= µ

For two-tail test x = µ

• H1 is known as an alternate hypothesis. It is the other case.

For one tail test x > µ

For two-tail test x ≠ µ

20. What is the Degree of Freedom?

DF is defined as the number of options we have:

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)

21. How to calculate p-Value?

Ans: Calculating p-value:

Using Excel:

1. Go to the Data tab
2. Click on Data Analysis
3. Select Descriptive Statistics
4. Choose the column
5. Select summary statistics and confidence level (0.95)

By Manual Method:

1. Find H0 and H1
2. Find n, x(bar) and s
3. Find DF for t-distribution
4. Find the type of distribution – t or z distribution
5. Find t or z value (using the look-up table)
6. Compute the p-value and compare it to the significance level (or compare the test statistic to the critical value)
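When the z-distribution applies, the look-up table in the manual method can be replaced by the normal CDF from the standard library. The sketch below covers only a two-tailed z test; the function name, the example z value, and the choice of α = 0.05 are assumptions.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a result at least as extreme as z in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05
p_value = two_sided_p_value(2.5)
reject_null = p_value < alpha
```

A z value of 2.5 gives a p-value of about 0.012, which is below 0.05, so the null hypothesis would be rejected at that level.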

22. What is ANOVA?

Ans: ANOVA, which expands to analysis of variance, is a statistical technique used to determine the difference in the means of two or more populations by examining the amount of variation within the samples relative to the amount of variation between the samples. It bifurcates the total amount of variation in the dataset into two parts, i.e. the amount ascribed to chance and the amount ascribed to specific causes.

It is a method of analyzing the factors which are hypothesized to affect the dependent variable. It can also be used to study the variations amongst different categories, within the factors, that consist of numerous possible values. It is of two types:

One way ANOVA: When one factor is used to investigate the difference between different categories, having many possible values.

Two way ANOVA: When two factors are investigated simultaneously to measure the interaction of the two factors influencing the values of a variable.

23.  What is ANCOVA?

Ans: ANCOVA, which stands for Analysis of Covariance, is an extended form of ANOVA that eliminates the effect of one or more interval-scaled extraneous variables from the dependent variable before carrying out research. It is the midpoint between ANOVA and regression analysis, wherein one variable in two or more populations can be compared while considering the variability of other variables.

When a set of independent variables consists of both a factor (categorical independent variable) and a covariate (metric independent variable), the technique used is known as ANCOVA. The differences in the dependent variable due to the covariate are removed by an adjustment of the dependent variable’s mean value within each treatment condition.

This technique is appropriate when the metric independent variable is linearly associated with the dependent variable and not to the other factors. It is based on certain assumptions which are:

• There is some relationship between the dependent and uncontrolled variables.
• The relationship is linear and is identical from one group to another.
• Various treatment groups are picked up at random from the population.
• Groups are homogeneous in variability.

24.  What is the difference between ANOVA and ANCOVA?

Ans: The points given below are substantial so far as the difference between ANOVA and ANCOVA is concerned:

• The technique of identifying the variance among the means of multiple groups for homogeneity is known as Analysis of Variance or ANOVA. A statistical process which is used to take off the impact of one or more metric-scaled undesirable variable from the dependent variable before undertaking research is known as ANCOVA.
• ANOVA uses both linear and non-linear models, whereas ANCOVA uses only a linear model.
• ANOVA entails only categorical independent variables, i.e. factor. As against this, ANCOVA encompasses a categorical and a metric independent variable.
• A covariate is not taken into account, in ANOVA, but considered in ANCOVA.
• ANOVA characterizes between-group variations, exclusively to treatment. In contrast, ANCOVA divides between-group variations to treatment and covariate.
• ANOVA exhibits within-group variations, particularly individual differences. Unlike ANCOVA, which bifurcates within-group variance in individual differences and covariate.

25.  What are t and z scores? Give Details.

Ans: T-score vs. z-score: Overview: A z-score and a t-score are both used in hypothesis testing.

T-score vs. z-score: When to use a t-score:

The general rule of thumb is to use a t-score when your sample:

• Has a sample size below 30, or
• Has an unknown population standard deviation.

You must know the standard deviation of the population and your sample size should be above 30 in order for you to be able to use the z-score. Otherwise, use the t-score.

Z-score

Technically, z-scores are a conversion of individual scores into a standard form. The conversion lets you compare different data more easily. A z-score tells you how many standard deviations from the mean your result is. You can use your knowledge of normal distributions (such as the 68-95-99.7 rule) or the z-table to determine what percentage of the population will fall below or above your result.

The z-score is calculated using the formula:

• z = (X-μ)/σ

Where:

• X is the individual score,
• σ is the population standard deviation, and
• μ is the population mean.

The z-score formula itself says nothing about sample size; the rule of thumb is that your sample size should be above 30 to use it.
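
As a small worked example of the z-score formula, the sketch below standardizes one score and looks up the share of a normal population falling below it. The numbers (mean 100, standard deviation 15, score 130) are made up for illustration, and `z_score` is a hypothetical helper name; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of population standard deviations that x lies from the mean."""
    return (x - mu) / sigma


# Assumed example values: population mean 100, standard deviation 15.
z = z_score(130, mu=100, sigma=15)

# Share of a normally distributed population that falls below this score.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, share below = {share_below:.4f}")
```

A z of 2 sits two standard deviations above the mean, which matches the 68-95-99.7 rule: roughly 97.7% of the population falls below it.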

T-score

Like z-scores, t-scores are also a conversion of scores into a standard form. However, t-scores are used when you don’t know the population standard deviation; you estimate it from your sample.

• T = (X - μ) / [ s/√(n) ]

Where:

• X is the sample mean,
• μ is the population mean,
• s is the standard deviation of the sample, and
• n is the sample size.

If you have a larger sample (over 30), the t-distribution and z-distribution look pretty much the same.
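
The t-score formula can be checked on a small sample with the standard library. This is a sketch under assumptions: the sample values and the hypothesized population mean of 48 are made up, X in the formula is read as the sample mean, and `t_score` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def t_score(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample std deviation."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - mu) / standard_error


# Made-up sample of five observations, tested against an assumed mean of 48.
sample = [52, 48, 50, 54, 46]
t = t_score(sample, mu=48)

print(f"t = {t:.4f}")
```

Because the sample is small (n = 5) and the population standard deviation is unknown, the resulting value would be compared against a t-distribution with n - 1 = 4 degrees of freedom rather than the z-table.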

To know more about Data Science, Artificial Intelligence, Machine Learning, and Deep Learning programs visit our website www.learnbay.co

Watch our Live Session Recordings to precisely understand statistics, probability, calculus, linear algebra, and other math concepts used in data science.

To get updates on Data Science and AI Seminars/Webinars – Follow our Meetup group.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

### Model vs Algorithm in ML

Machine learning works with “models” and “algorithms”, and both play an important role: the algorithm describes the process, and the model is built by following its rules.

Algorithms were derived by statisticians and mathematicians long ago, and they are now studied and applied by individuals for their business purposes.

A model in machine learning is nothing but a function that takes a certain input, performs on it the operation prescribed by the algorithm, and gives a suitable output.

Some of the machine learning algorithms are:

1. Linear regression
2. Logistic regression
3. Decision tree
4. Random forest
5. K-nearest neighbor
6. K-means clustering

What is an algorithm in Machine learning?

An algorithm is a step-by-step approach, powered by statistics, that guides the machine in its learning process. An algorithm is only one of the several components that constitute a model.

There are several characteristics of machine learning algorithms:

1. Machine learning algorithms can be represented by the use of mathematics and pseudo code.
2. The effectiveness of machine learning algorithms can be measured and represented.
3. With any of the popular programming languages, machine learning algorithms can be implemented.

What is the Model in Machine learning?

A model does not depend on the algorithm alone; it also depends on factors such as feature selection, tuning parameters, and cost functions.

A model is the result of an algorithm: it is what we get when we implement the algorithm in code and train it on real data. A model tells you what your program learned from the data by following the rules of the algorithm. The model is then used to predict future results, based on what the algorithm observed in a small sample of data.

Model = Data + Algorithm
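
The idea that the same algorithm produces different models on different data can be sketched as follows. This is an illustrative example only: it uses closed-form least squares for a single input variable as the "algorithm", the data is made up, and `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Algorithm: ordinary least squares for one input variable.

    Returns the model, a function that predicts y from x using the
    slope m and intercept c learned from the data.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    c = mean_y - m * mean_x

    def model(x):
        return m * x + c

    return model


# Same algorithm, two made-up datasets, two different models.
model_a = fit_line([1, 2, 3], [3, 5, 7])  # data lies on y = 2x + 1
model_b = fit_line([1, 2, 3], [2, 1, 0])  # data lies on y = -x + 3

print(model_a(4), model_b(4))
```

The algorithm (`fit_line`) never changes; only the data does, and each dataset yields its own model with its own slope and intercept.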

A model contains four major steps that are:

1. Data preprocessing
2. Feature engineering
3. Data management
4. Performance measurement

How do the model and algorithm work together in machine learning?

For example:

y = mx + c is the equation of a line, where m is the slope and c is the y-intercept; this is nothing but linear regression with only one variable.
Similarly, the decision tree and random forest rely on measures such as the Gini index, and K-nearest neighbor relies on the Euclidean distance formula.

So take the linear regression algorithm:

2. Initialize the parameters c0, c1, c2 with random values.
3. Choose the learning rate alpha.
4. Repeatedly update each parameter, e.g. c0 = c0 - alpha * (h(x) - y), and likewise for c1 and c2 (each scaled by its own input feature).
5. Repeat this process until convergence.
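
The update loop in these steps can be sketched in Python as batch gradient descent. This is a simplified illustration, not the article's exact procedure: it assumes a single input feature (so only c0 and c1), a mean squared error cost, a fixed number of iterations instead of a convergence check, and made-up data. The name `train_linear_regression` is hypothetical.

```python
import random


def train_linear_regression(xs, ys, alpha=0.05, epochs=5000, seed=0):
    """Fit h(x) = c0 + c1 * x by batch gradient descent (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialize the parameters with random values.
    c0, c1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)

    for _ in range(epochs):
        # Prediction error h(x) - y for every training example.
        errors = [(c0 + c1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error (up to a constant factor).
        grad_c0 = sum(errors) / n
        grad_c1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Move each parameter against its gradient, scaled by alpha.
        c0 -= alpha * grad_c0
        c1 -= alpha * grad_c1

    return c0, c1


# Made-up data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
c0, c1 = train_linear_regression(xs, ys)

print(f"intercept = {c0:.3f}, slope = {c1:.3f}")
```

A learning rate that is too large makes the updates diverge and one that is too small makes them slow; the value 0.05 was chosen here only because it converges comfortably on this small dataset.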

When you employ this algorithm, you follow these exact 5 steps in your model without changing them. Your model is initiated by the algorithm, which treats every dataset the same way.

When you apply the algorithm, the model has to find the values of m and c, which are not known in advance.
Suppose you have three variables, each with its own x and y values: the model will then find m1, m2, m3 and c1, c2, c3 for the three variables.
The model works with these three slopes and three intercepts to fit the dataset and predict future values.

The “algorithm” might be treating all the data the same but it is the “model” that actually solves the problems. An algorithm is something that you use to train the model on the data.

After building a model, data science enthusiasts test it to measure its accuracy and fine-tune it to improve the results.

This article should help you understand the difference between an algorithm and a model in machine learning. In summary, an algorithm is a process or technique that we follow to get a result or find the solution to a problem.
A model is a computation or formula, formed as the output of an algorithm, that takes some input; in other words, you build a model using a given algorithm.

### Win the COVID-19

If you slightly change your perspective on the lockdown situation, you can find hope that this pandemic will end and look forward to a brighter future than ever. Go for Data Science; it will be worth it.

### Text stemming in NLP

Human language remains an unsolved problem: there are more than 6,500 languages worldwide. Tons of data are generated every day as we speak, text, and tweet, including voice-to-text on every social application, and to get insights from this text data we need a technology such as NLP. There are two types of data: structured and unstructured. Structured data is used for machine learning models, and unstructured data is handled by natural language processing. Only 21% of the available data is structured, so you can estimate how much NLP is required to handle unstructured data.

To get insights from an unstructured dataset, we need to extract the important information from it. An important technique for analyzing text data is text mining. Text mining extracts useful information from unstructured data by identifying and exploring large amounts of text. In other words, text mining is used to convert unstructured data into a structured dataset.

Normalization, lemmatization, stemming, and tokenization are NLP techniques for drawing insights from the data.

Now let us see how text stemming works.

Stemming is the process of reducing inflected words to their “root” form, that is, mapping a group of related words to the same stem. Inflected forms are built by adding suffixes and prefixes to the root word to produce its grammatical variants. Stemming is performed by NLP algorithms known as stemming algorithms or stemmers. A stemming algorithm strips these affixes to leave the stem. For example, eats, eating, and eatery are all built from the root word “eat”, so the stemmer removes s, ing, and ery from these words and reveals that the sentence is about eating something. Most such words are simply different inflected forms of the same word. This is the general idea: reduce the different forms of a word to their root word.
Words that are derived from one another can be mapped to a base word or symbol, especially if they have the same meaning.
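
The suffix-stripping idea can be sketched with a toy stemmer. This is a deliberately naive illustration, not a real stemming algorithm such as Porter's: the suffix list and the minimum stem length are assumptions chosen to fit the “eat” example, and `simple_stem` is a hypothetical name.

```python
# Hypothetical suffix list for illustration; real stemmers use far richer rules.
SUFFIXES = ("ery", "ing", "s")


def simple_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched, so the word is already its own stem


stems = [simple_stem(w) for w in ["eats", "eating", "eatery", "eat"]]
print(stems)
```

All four words map to the same stem, which is what lets a downstream step treat them as one term.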

Because stemming cannot guarantee a 100% correct result, there are two types of stemming error: over-stemming and under-stemming.

Over-stemming occurs when too much of a word is cut off.
The result can be a nonsensical stem where the meaning of the word is lost, or words that should have different stems are resolved to the same one.

For example, take the four words university, universities, universal, and universe. A stemmer that resolves all four to “univers” is over-stemming. Universal and universe should be stemmed together, and university and universities should be stemmed together, but all four do not belong under a single stem.

Under-stemming: Under-stemming is the opposite of over-stemming. It occurs when different words that are actually forms of one another do not resolve to the same stem, even though they should.

This can be seen in a stemming algorithm that stems the words data and datum to “dat” and “datu.” You might think: just resolve both to “dat.” But then what do we do with date? And is there a good general rule? This is where under-stemming occurs.
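
Both error types can be reproduced with a toy rule-based stemmer. The sketch below is an assumption-laden illustration: the suffix rules are invented purely so that the examples above come out as described, and `toy_stem` and `group_by_stem` are hypothetical names.

```python
from collections import defaultdict

# Invented suffix rules, ordered longest first; not taken from any real stemmer.
TOY_SUFFIXES = ("ities", "ity", "al", "e", "a", "m")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


# Over-stemming: four words with different meanings collapse into one stem.
over = group_by_stem(["university", "universities", "universal", "universe"])

# Under-stemming: two forms of the same word end up with different stems.
under = group_by_stem(["data", "datum"])

print(over)
print(under)
```

Grouping words by stem makes both problems visible: one group that is too large signals over-stemming, and related words scattered across several groups signal under-stemming.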
