
Category: Machine Learning

How to Build a Rewarding Career As a Healthcare Data Scientist?

Data Science in Healthcare: Know the Hidden Scope

Data science in healthcare isn’t new. Healthcare is one of the industries where data science and analytics are most widely applied, and the global pandemic dramatically raised the demand for, and importance of, healthcare data scientists. We have seen how data science professionals pulled together to work on Covid-19 healthcare data and build AI/ML models to track the outbreak. This data was used for contact tracing, screening, and vaccine development.

Thus, data science has the potential to improve the entire healthcare system.

Maybe you’ve worked in healthcare for a while and want to shift your career path to put your analytical skills to the test. Or perhaps you have strong experience in data analysis and are seeking a field where you can put your knowledge and expertise to use. Or maybe you are unhappy with your career growth in the healthcare industry and dream of a lucrative package like your friends in IT. That is possible too, and the key is data science and AI.

In any of these cases, you are probably wondering whether a career as a healthcare data scientist would be the right fit for you.

This blog will help you familiarize yourself with everything you need to know before or after stepping into the healthcare industry.


  • What role does data play in the healthcare sector?

    The healthcare industry contributes a substantial amount of data to the global data pool.

    Data science and AI have the potential to transform how care is delivered.

    Medical science has advanced rapidly, increasing life expectancy around the world. However, as longevity increases, the healthcare system faces a growing demand for its services, rising expenses, and a workforce struggling to meet the needs of its patients.

  • 5 Major Applications of Data Science in Healthcare

    Data science has already begun to address these issues. As it matures, its applications will become even more valuable to the healthcare business: doctors will get more decision support, while patients will get more personalized care and treatment. Let’s look at some of the essential applications in healthcare:

  • Predictive analysis: A predictive model estimates a patient’s future status. It analyzes correlations between symptoms, behaviors, and diseases, which makes it possible to identify potential risks before they materialize.
  • Medical image analysis: Medical imaging is one of the most common applications of data science in healthcare. Techniques such as X-ray, MRI, and CT scans reveal the inside of the human body. With the advent of deep learning, it is now possible to detect abnormalities in these images and help doctors develop effective treatment options.
  • Drug discovery: Today, pharmaceutical companies rely significantly on data science to provide better drugs for patients. They use patient information such as mutation profiles and metadata to derive insights, which in turn helps in developing models. Historical data can also be used to improve drug discovery procedures.
  • Genomics: Before the development of data analysis techniques, genomic research was a time-consuming task. Today, data science has made it much faster and easier. Researchers now use it to study DNA sequences and find links between features of those sequences and hereditary diseases.
  • Monitoring patient health: Some patients and clinicians use IoT devices as wearable monitors to track heartbeat and temperature. The data these devices collect is then analyzed with data science techniques. Using analytical tools, doctors can monitor a patient’s blood pressure, circadian cycle, and calorie intake.

  • What is the Role of a Data Scientist in Healthcare?

Data scientists in healthcare help explore ways to predict drug behavior and gain a better understanding of human diseases.

The primary role of a healthcare data scientist is to apply data science techniques to healthcare software and applications, drawing meaningful insights from data and building predictive models.

In general, the core responsibilities of a healthcare data scientist are as follows:

  1. Performing data analysis with various analytical tools.
  2. Collecting patient and health data.
  3. Analyzing hospital requirements.
  4. Organizing and sorting data for use. 
  5. Implementing algorithms to extract insights.
  6. Building predictive models with the development team.
  7. Database management including data collection, retrieval, storage, and security.
  8. Converting data into easily digestible chunks for non-technical employees of an organization.
  9. Understanding hospital procedures and systems, as well as utilizing data to aid decision-making.
  10. Performing information-based audits.
  • What skills are required to become a successful data science professional in Healthcare?

The popularity of big data and its potential impact on the healthcare industry has driven the demand for more qualified data scientists. 

There isn’t a “one-size-fits-all” approach to data science. Instead, each role is unique, and because data scientists are in high demand across the healthcare industry, these jobs can require a wide range of talents. 

It is important to build a strong foundation of skills before moving forward based on your interest and strength.

  • Mathematics and statistical skills
  • Programming languages, Python and R
  • Database and data-management tools such as SQL and SAS
  • Machine learning and deep learning concepts
  • Data visualization
  • Quantitative skills and analytical skills
  • Communication and presentation skills

While these technical skills help data scientists analyze huge amounts of data, successful healthcare data scientists are also strong at problem-solving and storytelling and understand their organization’s objectives. They need to discuss how to leverage data and insights with other data professionals, interact with laboratory staff, and engage with patients.


  • Healthcare data scientist and data analyst salaries: What to expect?

Healthcare organizations are investing heavily in data science and analytics because it helps them reduce administrative costs, minimize fraudulent payments, deliver more accurate diagnoses and treatments, and improve overall decision-making.

As in any other industry, the salary of a healthcare data scientist is typically determined by qualifications, skill set, experience, location, and organization.

On average, the annual salary of data scientists in healthcare and life science companies is expected to be around Rs. 40 LPA.

Some of the popular life science companies are as follows:

Healthcare data analyst Salary:

According to Glassdoor, the average income for a Healthcare Data Analyst in India is Rs. 7,61,298.


  • Data Science in Healthcare Projects ideas to level up your portfolio:

Most data science expertise is gained through projects, which build deeper understanding, better retention, and awareness of the real-world problems data scientists face. So, if you want to become a data scientist, your first step should be to learn how to work with data and then to work on data science use cases.

Here are some project ideas you can work on to level up your portfolio:

  1. Medical image segmentation: Medical image segmentation is the process of extracting areas of interest from 3D image data, such as Magnetic Resonance Imaging (MRI) or Computerized Tomography (CT) scans. The main purpose of this project is to identify the areas of anatomy required for a particular investigation.

    Image segmentation dataset by Kaggle
  2. Ultrasound nerve segmentation: It is crucial to accurately identify the nerve structures in ultrasound images before inserting a patient’s pain catheter. In this project, you’ll learn how deep learning techniques are used to develop an end-to-end system in which a user simply feeds an ultrasound image of the region to a deep learning model, which then segments the nerve seen in the image.
    Ultrasound Nerve Segmentation dataset from Kaggle
  3. Heart failure prediction: Heart failure is a common consequence of cardiovascular diseases (CVDs) and contributes to increased mortality. In this project, you’ll build a machine learning model that predicts mortality from heart failure. Along the way, you’ll learn several ML algorithms, including Random Forest and K-NN, as well as data wrangling and filtering techniques.
    Heart failure prediction dataset from Kaggle
  4. Depression, anxiety, and stress prediction: Depression and stress detection is the challenge of identifying signs of depression in individuals. These signs may show up as behavioral changes in a person. They can be predicted by developing a model with AI and ML algorithms such as CNN, support vector machine, KNN classifier, and linear regression.
    Depression analysis dataset by Kaggle
  5. Breast cancer prediction: Breast cancer affects approximately 12% of women worldwide, and its incidence is expected to rise further. This project helps doctors predict whether a patient has breast cancer. You’ll create an ML model that classifies tumors as malignant or benign using a supervised machine learning classifier.
    Breast cancer prediction dataset by Kaggle
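
To make the last project idea concrete, here is a minimal sketch of a supervised classifier that separates malignant from benign tumors. It is only an illustration under stated assumptions: it uses the breast cancer dataset bundled with scikit-learn as a stand-in for the Kaggle dataset, and the function name `train_breast_cancer_classifier` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_breast_cancer_classifier(seed=0):
    """Train a random forest on scikit-learn's bundled breast cancer data."""
    # Assumption: the bundled dataset stands in for the Kaggle dataset.
    X, y = load_breast_cancer(return_X_y=True)
    # Hold out 25% of the samples to estimate accuracy on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


model, accuracy = train_breast_cancer_classifier()
print(f"Held-out accuracy: {accuracy:.3f}")
```

In a real portfolio project you would also look at precision and recall, since a missed malignant tumor is far more costly than a false alarm.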

The first crucial step in becoming a healthcare data scientist is earning a bootcamp certification or a master’s degree that gives you the expertise and skills to succeed.

Most people consider self-study or enrolling in a bootcamp.

Learnbay is a well-known institute for learning data science concepts. They offer affordable, flexible, and beginner-friendly Data science and AI courses in Bangalore and globally as well. You’ll gain knowledge of advanced DS and AI tools and concepts that are effectively used in the Healthcare domain. Their case studies and industry projects in healthcare will help you stand out from the competition.



By now, you’ve seen how data science has transformed the healthcare industry and how healthcare and pharmaceutical companies use data science and AI to improve patients’ lives. I hope this blog helped you understand the value of domain expertise in healthcare.

Also, keep in mind that the pharma and healthcare industry will never become redundant, since it is an integral part of human life. Hence, this is the perfect time to make that career move you have always wanted.

Begin your journey in data science and AI with Learnbay!

For more such content, do check out our site: Learnbay

You can subscribe to our social media channels to get regular Data science and AI updates. 




NLP and Deep Learning for Data Scientists

Deep learning and natural language processing (NLP) are as busy as they’ve ever been. They are among the most in-demand technologies, and advances in both fields are published nearly every day. Even though quarantine regulations in many nations have hampered numerous businesses, the machine learning industry continues to advance.

Although Covid-19 has caused problems for a number of organizations, new-age tech skills such as Machine Learning (ML), Artificial Intelligence (AI), and Natural Language Processing (NLP) are in high demand. In this article, we at Learnbay go over some of the most important recent breakthroughs and must-read publications for budding data scientists.

How Deep Learning can keep you safe

In this article, Nukanj Aggarwal, ML Lead at Citizen, shows how deep learning is being used to produce life-changing (or life-saving) technology.

  • Citizen is a real-time emergency and safety alert app that notifies users of incidents and crimes occurring in their neighborhood.
  • The company analyses first-responder radio frequencies using speech-to-text engines and convolutional neural networks.
  • The company has been able to expand its app to a number of US cities.
  • In the coming years, this NLP-driven technology could bring a significant shift in police and first-responder infrastructure.

The Release of the OpenAI API

The publication of GPT-3 by OpenAI was arguably the most significant development in the field of natural language processing this year. The release of OpenAI’s API, on the other hand, may have gone unnoticed by many. The API allows businesses and individuals to integrate OpenAI’s new AI technologies into their products and services.

  • The API’s goal is to provide users with access to models built by the company, such as GPT-3, as well as future models.
  • The API is general-purpose and can be used on nearly any natural language task; how well it performs generally depends on the task’s complexity.
  • This is significant because it represents a departure from the company’s usual practice of open-sourcing its models (as it did with GPT-2).
  • In the official blog post, the company discusses why it opted to produce a commercial product this time instead of open-sourcing the model, and how it will handle misuse of the API.


IBM will no longer offer, develop, or research facial recognition technology

In a letter to Congress, the CEO of IBM publicly stated that the business would cease developing and offering general-purpose facial recognition technology.

  • Artificial intelligence advancements have substantially enhanced facial recognition software during the last decade.
  • According to the company, IBM will no longer develop or research facial recognition technology.
  • This was a significant step for the organisation, as well as a strong message to the data science community at large.
  • IBM’s decision to prioritise ethics and safety may have influenced other large IT firms (including Microsoft) to follow suit.
  • IBM believes that now is the right time to start a national conversation about whether and how domestic law enforcement organisations should use facial recognition technology.

Conversational AI: Neural Approaches

This paper surveys the neural approaches to conversational AI that have been developed in recent years. It is aimed at audiences interested in natural language processing and information retrieval.

  • The authors group conversational systems into three categories: question answering agents, task-oriented dialogue agents, and chatbots.
  • The paper offers a complete overview of these approaches (question answering, task-oriented, and social bots), as well as a unified view of optimal decision-making.
  • For each category, it gives an overview of state-of-the-art neural techniques, compares them with traditional approaches, and discusses the progress made and the obstacles still faced, using specific systems and models as case studies.

It offers a coherent perspective and a full presentation of the key concepts and insights required to understand and develop modern dialogue agents, which will be critical in making world knowledge and services accessible to millions of people in natural and intuitive ways.

Language Models Are Unsupervised Multitask Learners

Question answering, machine translation, reading comprehension, and summarization are all examples of natural language processing (NLP) problems that are typically tackled using supervised learning on task-specific datasets.

  • The authors showed that, when trained on a new dataset of millions of web pages called WebText, language models begin to learn these tasks without any explicit supervision.
  • The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks.
  • These findings point to a possible avenue for developing language processing systems that learn to perform tasks from naturally occurring demonstrations.


Generative Pre-Training Improves Language Understanding

In this paper published by OpenAI, the researchers discuss natural language understanding and how discriminatively trained models can struggle to perform well when labelled data is scarce.

  • Most deep learning approaches require a large amount of manually labelled data, which limits their usefulness in many domains where annotated resources are scarce.
  • According to the researchers, the approach’s effectiveness was demonstrated on a number of natural language understanding benchmarks.
  • In their setup, the target tasks do not have to be in the same domain as the unlabeled corpus.

They proposed a general task-agnostic model that outperformed discriminatively trained models with architectures crafted for each specific task, improving on the state of the art in 9 of the 12 tasks studied. Their goal is to learn a universal representation that can be transferred to a variety of tasks with minimal adaptation.

Deep Learning Generalization

Deep learning has seen considerable success in many difficult research areas, such as image recognition and natural language processing.

  • Deep learning has had a substantial impact on the conceptual foundations of machine learning and artificial intelligence and has achieved significant practical success.
  • The article on deep learning generalization argues that deep learning is now a strong contender for improving sensing abilities.

The Model Card Toolkit for Easier Model Transparency Reporting

Transparency in machine learning (ML) models is crucial in a range of sectors that affect people’s lives, including healthcare, personal finance, and employment. As larger and more intricate deep learning models are developed, it gets more difficult to convey their intended use cases and other information to downstream consumers.

  • The details that developers need to assess whether a model is appropriate for their use case may vary, as will the information required by downstream users.
  • To help tackle this difficulty, Google researchers developed the “Model Card Toolkit,” which simplifies the creation of model transparency reports.

The Complete Guide to Deep Learning Algorithms

This article, written by Sergios Karagiannakos, the founder of AI Summer, provides a comprehensive guide to deep learning.

  • Deep learning is getting a lot of traction in both the scientific and corporate worlds, and more and more businesses are incorporating it into their regular operations.
  • The guide covers a wide range of topics, from the various types of neural networks to deep learning baselines.

Deepfake Detection Tools and AI-Generated Text

With misinformation spreading widely on social media, the author of this article was alarmed to notice that it had reached their own inner circle. The consequences of deepfakes have been disastrous, with doctored videos of public personalities circulating and putting their reputations at risk. As it has become easier to make deepfakes and manufacture fake articles using AI, the author wanted to help counteract the nefarious use of these technologies.

  • Given the catastrophic consequences of deepfakes, there have been many attempts to develop tools to detect them, with varying degrees of success.
  • A major technology company has also unveiled a new tool that can detect doctored content and assure readers of its authenticity.
  • This article explains a few easy strategies and browser plugins for detecting deepfakes and AI-generated text.
  • Binghamton University and Intel researchers developed a method that goes beyond deepfake detection to identify the generative model behind a doctored video.

GPT-3 Philosophers (updated with replies by GPT-3)

This is a fascinating think piece from Daily Nous in which nine philosophers delve into OpenAI’s GPT-3.

  • Dealing with GPT-3 is not only a matter of correcting the linguistic biases that have arisen from (or were present in) its training data, and it isn’t a case of discovering a technological panacea to eliminate bias.
  • The thought leaders ponder the ethical and moral challenges that the technology may raise, as well as the questions that remain open.

Bridging The Gap Between Training & Inference For Neural Machine Translation

This is one of the top NLP papers published at the premier conference of the Association for Computational Linguistics (ACL). Neural Machine Translation (NMT) generates target words sequentially, predicting the next word conditioned on the context words.

  • At training time the model predicts with the ground-truth words as context, while at inference it must rely on its own previous predictions; the paper discusses the error accumulation this gap causes.
  • The researchers address the problem by sampling context words during training not only from the ground-truth sequence but also from the sequence predicted by the model, where the predicted sequence is selected with a sentence-level optimum.
  • According to the researchers, this approach achieves significant improvements on multiple datasets.
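
The sampling idea can be illustrated with a toy sketch. This is not the paper’s implementation: the function names, the fixed probability `p_gold`, and the example sentences are all assumptions made for illustration (in the paper, how often the model’s own prediction is used changes over the course of training).

```python
import random


def choose_context_word(gold_word, predicted_word, p_gold, rng):
    """Pick the context word fed to the decoder at one training step."""
    # With probability p_gold use the ground-truth word,
    # otherwise use the word the model itself predicted.
    return gold_word if rng.random() < p_gold else predicted_word


def build_context(gold, predicted, p_gold, seed=0):
    """Mix ground-truth and predicted words into one context sequence."""
    rng = random.Random(seed)
    return [choose_context_word(g, p, p_gold, rng)
            for g, p in zip(gold, predicted)]


# Hypothetical target sentence and an imperfect model prediction.
gold = ["the", "cat", "sat", "down"]
predicted = ["the", "dog", "sat", "up"]
print(build_context(gold, predicted, p_gold=0.5))
```

With `p_gold=1.0` this reduces to ordinary teacher forcing, and with `p_gold=0.0` the model is trained entirely on its own predictions, which is the situation it faces at inference time.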

The Matrix Calculus You Need For Deep Learning

This document attempts to teach all of the matrix calculus required to understand deep neural network training. Thanks to the automatic differentiation built into modern deep learning libraries, you can become a capable deep learning practitioner with only a basic understanding of scalar calculus; this document explains the math that those libraries handle for you.

  • The authors assume no math knowledge beyond what you studied in calculus 1 and provide resources to help you refresh your math skills if necessary.

You do not need to understand this material before learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks and want to deepen their understanding of the underlying math.
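
As a small taste of the kind of result matrix calculus gives you, the sketch below checks one standard identity numerically: the gradient of the squared-error loss f(w) = ||Xw - y||² is 2Xᵀ(Xw - y). The example is our own illustration, not taken from the document, and the random data is made up.

```python
import numpy as np


def loss(w, X, y):
    """Squared-error loss f(w) = ||Xw - y||^2."""
    r = X @ w - y
    return float(r @ r)


def analytic_gradient(w, X, y):
    """Gradient obtained with matrix calculus: 2 X^T (Xw - y)."""
    return 2.0 * X.T @ (X @ w - y)


def numeric_gradient(w, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)
    return grad


# Made-up data, only used to compare the two gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
print(analytic_gradient(w, X, y))
print(numeric_gradient(w, X, y))
```

The two printed vectors should agree to several decimal places, which is the same kind of check that automatic differentiation libraries make unnecessary in day-to-day work.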

Final lines

We hope that these articles and guides on NLP and deep learning helped you keep up with some of the major developments in machine learning this year. The increased focus on NLP and deep learning means more online material is available, but a good article is sometimes what you need to gain a solid understanding of such a complicated and multi-faceted subject. Articles can improve your overall data literacy by providing basic background information, such as an introduction to deep learning and natural language processing (NLP), or by clarifying significant ideas with real-world illustrations. Keep growing, my fellow members of the A.I. community.

Clustering & Types Of Clustering

Clustering is the process of finding groups of similar items in data; each such group is called a cluster. It puts data instances that are similar to each other in one cluster and data instances that are very different (far away) from each other into different clusters. A cluster is, therefore, a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Cluster graph

Clustering is one of the most popular techniques in data science. Entities in each group are comparatively more similar to the other entities of that group than to those of the other groups. In this article, I will take you through the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.

Steps involved in Clustering analysis:

1. Formulate the problem – select variables to be used for clustering.

2. Decide the clustering procedure whether it will be Hierarchical or Non-Hierarchical.

3. Select the measure of similarity or dissimilarity.

4. Choose clustering algorithms.

5. Decide the number of clusters.

6. Interpret the cluster output(profile the clusters).

7. Validate the clusters.
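
Steps 5 and 7 can be sketched with one common validation measure, the silhouette score. This is only one possible approach, not the only way to choose the number of clusters; the synthetic data and the helper name `best_k_by_silhouette` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, candidates=range(2, 7), seed=0):
    """Return the candidate k with the highest silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        # Silhouette is in [-1, 1]; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# Synthetic data: three well-separated groups (made up for this sketch).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=0)
best_k, scores = best_k_by_silhouette(X)
print(best_k, scores)
```

On real data the scores are rarely this clear-cut, so the result should be combined with domain knowledge when profiling the clusters.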

Types of clustering technique:

Broadly speaking, clustering can be divided into two subgroups :

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, if a retail store segments its customers into 10 groups, each customer is put into exactly one of the 10 groups.
  • Soft Clustering: In soft clustering, instead of putting each data point into exactly one cluster, each point is assigned a probability or likelihood of belonging to each cluster. In the retail store scenario, each customer is assigned a probability of being in each of the 10 clusters.
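
The difference is easy to see in code. In this minimal sketch, k-means stands in for hard clustering and a Gaussian mixture model for soft clustering; the choice of these two algorithms and the synthetic data are our assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic data: two groups of 50 points each (made up for this sketch).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# Hard clustering: every point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: every point gets a probability for each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)

print(hard_labels[:5])
print(soft_memberships[:5].round(3))
```

Each row of `soft_memberships` sums to 1, so a point lying between the two groups can be, say, 60% in one cluster and 40% in the other rather than being forced into one of them.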

Types of clustering are:

k-means clustering:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-Means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can, for example, be found using k-medians and k-medoids.

K-Means Clustering example

K-means is an iterative clustering algorithm that converges to a local optimum, reducing the within-cluster variance at each iteration. The algorithm works in these 6 steps:

  1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.
  2. Randomly assign each data point to a cluster: Let’s assign three points to cluster 1, shown in red, and two points to cluster 2, shown in grey.
  3. Compute cluster centroids: The centroid of the data points in the red cluster is shown as a red cross and that of the grey cluster as a grey cross.
  4. Re-assign each point to the closest cluster centroid: Note that the data point at the bottom is currently assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
  5. Re-compute cluster centroids: Now, re-compute the centroids for both clusters.
  6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until the algorithm converges, which may be at a local rather than a global optimum. Unless a maximum number of iterations is explicitly specified, the algorithm terminates when no data points switch between clusters for two successive repeats.
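
The six steps above can be sketched from scratch in NumPy. This is a teaching sketch under stated assumptions, not how scikit-learn implements k-means: the function name `kmeans_random_partition`, the handling of empty clusters, and the sample points are all our own choices.

```python
import numpy as np


def kmeans_random_partition(points, k, seed=0, max_iter=100):
    """Toy k-means that starts from a random assignment of points."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 2: randomly assign each point to one of the k clusters.
    labels = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # Steps 3 and 5: the centroid of a cluster is the mean of its points.
        # Assumption: an empty cluster is re-seeded with a random data point.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(0, n)]
            for j in range(k)
        ])
        # Step 4: re-assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no point switches cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids


# Made-up points forming two obvious groups.
points = np.array([[1, 1], [1, 2], [2, 1],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans_random_partition(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start, which is why only the grouping, not the label numbers, is meaningful.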

from pandas import DataFrame

# 30 sample points in 2-D space
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]}
df = DataFrame(Data, columns=['x', 'y'])
print(df)

k-means for cluster=3

from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Same 30 sample points as above
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]}
df = DataFrame(Data, columns=['x', 'y'])

# Fit k-means with 3 clusters; n_init and random_state make the run reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
centroids = kmeans.cluster_centers_

# Plot the points colored by cluster label, with the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()

Hierarchical Clustering:

Hierarchical clustering, as the name suggests is an algorithm that builds the hierarchy of clusters. This algorithm starts with all the data points assigned to a cluster of their own. Then two nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there is only a single cluster left.
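
The merging process described above can be observed directly with SciPy. This is a minimal sketch on four made-up points, using single linkage as one possible choice of closeness; `linkage` records every merge, and `fcluster` cuts the hierarchy into a chosen number of flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four made-up points forming two obvious pairs.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Each row of `merges` describes one merge:
# [cluster index, cluster index, distance between them, size of new cluster].
merges = linkage(points, method='single')
print(merges)

# Cut the hierarchy so that at most two flat clusters remain.
flat = fcluster(merges, t=2, criterion='maxclust')
print(flat)
```

With n points there are always n - 1 merges, and the last row covers all n points, which corresponds to the single remaining cluster at which the algorithm terminates.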

The results of hierarchical clustering can be shown using a dendrogram:

Dendrogram of the hierarchical clustering results

Two important things that you should know about hierarchical clustering are:

  • The algorithm has been described above using a bottom-up approach. It is also possible to follow a top-down approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned to a separate cluster.
  • The decision of merging two clusters is taken on the basis of closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters :
    • Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
    • Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
    • Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
    • Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
    • Mahalanobis distance: $\sqrt{(a-b)^T S^{-1} (a-b)}$, where $S$ is the covariance matrix
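
The distance measures listed above can be computed with a few lines of NumPy. This is a small sketch with made-up vectors; the helper name `distances` is hypothetical, and the identity matrix is used as the covariance matrix only to keep the example easy to check by hand.

```python
import numpy as np


def distances(a, b, cov):
    """Compute the five closeness measures between vectors a and b."""
    d = a - b
    return {
        "euclidean": float(np.sqrt(np.sum(d ** 2))),
        "squared_euclidean": float(np.sum(d ** 2)),
        "manhattan": float(np.sum(np.abs(d))),
        "maximum": float(np.max(np.abs(d))),
        # Mahalanobis distance rescales the difference by the inverse covariance.
        "mahalanobis": float(np.sqrt(d @ np.linalg.inv(cov) @ d)),
    }


# Made-up vectors; the differences are (-3, -4, 0).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
result = distances(a, b, cov=np.eye(3))
print(result)
```

With an identity covariance matrix the Mahalanobis distance equals the Euclidean distance; with a real covariance matrix it down-weights directions in which the data varies a lot.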

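import numpy as np
import matplotlib.pyplot as plt

# NOTE: only the first row, [5, 3], survives in the source listing.
# The other nine rows are illustrative placeholders, not the original data.
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30],
              [85, 70], [71, 80], [60, 78], [70, 55], [80, 91]])

labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], label='True Position')

# Annotate each point with its index (1 to 10)
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(str(label), xy=(x, y), xytext=(-3, 3),
                 textcoords='offset points', ha='right', va='bottom')
plt.show()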


Data point plot

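from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# Single-linkage agglomerative clustering of the points in X
linked = linkage(X, 'single')
labelList = list(range(1, 11))

plt.figure(figsize=(10, 7))
# Draw the merge tree; leaves are labelled with the point indices
dendrogram(linked, orientation='top', labels=labelList,
           distance_sort='descending', show_leaf_counts=True)
plt.show()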

Dendrogram plot

Learnbay provides industry-accredited data science courses in Bangalore. We understand how technologies combine in the field of data science, so we offer courses on machine learning, TensorFlow, IBM Watson, Google Cloud Platform, Tableau, Hadoop, time series, R, and Python, with authentic real-time industry projects. Students receive IBM certification. Hundreds of students have been placed in promising companies in data science roles. By choosing Learnbay, you can reach one of the most sought-after jobs of the present and future.
The Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, and Deep Learning using TensorFlow. These topics are covered and co-developed with IBM.

Introduction to Simple Linear Regression in Machine Learning

No matter which ML course you choose, one of the first topics in the data science statistics modules will be linear regression (LR) or, more precisely, Simple Linear Regression in Machine Learning. This widely used ML algorithm is commonly abbreviated as SLR.

In this blog, we’ll walk through the fundamentals of Simple Linear Regression in ML modelling.

What is SLR in Machine Learning?

Simple Linear Regression in Machine Learning (SLR) is a technique for examining the relationship between two factors. One of them is the independent (self-sufficient) factor, also referred to as the ‘explanatory’, ‘stimulus’, or ‘predictor’ variable. The other is the dependent (subordinate) factor, also known as the ‘response’ or ‘outcome’ variable.
Why ‘simple’? The word refers to the fact that only two factors are used in this regression method: one predictor and one response. ‘Linear’ refers to the straight line that is used to model and explain the association between the factors.

Many machine learning problems require you to find the relationship between these two types of variables before you can reach a useful outcome. This is where Simple Linear Regression in Machine Learning applies.

What are the real-life applications of SLR algorithms?

If we tried to list every real-life instance of SLR in ML, the list would be endless. Some of the handiest real-world examples of SLR are as follows.

  • Suppose you have decided to train your company’s employees in the basics of data analytics to improve your business outcomes. The amount you invest in this training is the independent factor. The percentage of ROI from the resulting improvement in business decisions is the outcome factor.
  • Suppose you plan to buy a second-hand car but find it difficult to set your budget based on car performance. To ensure performance and parts availability, you decide to consider cars only up to a certain age. In such a scenario, you can apply SLR to set your budget. Here the age of the car is the independent factor, while the budget is the outcome factor.
  • Suppose you work in the marketing domain of an e-commerce company. A few months back, your company implemented new advertising strategies. Now you want to evaluate how monthly sales relate to the monthly advertising cost. Here you can apply SLR for ML modelling.

SLR can solve many business problems, from moderate to complex. Just keep one thing in mind: apply the linearity condition correctly.

What is the linearity condition in SLR?

SLR tries to explain the variation in the dependent (subordinate) variable Y using knowledge of the values of the independent (predictor) variable X.
The expression $\alpha + \beta X_i$ gives the predicted value of $Y_i$ for a given value of $X_i$. In other words, you can consider $\alpha + \beta X_i$ as the conditional expected value of $Y_i$ given $X_i$.
Here $\alpha$ and $\beta$ are the linear regression coefficients, and the basic model is

$Y_i = \alpha + \beta X_i + \varepsilon_i$

The most vital thing to remember is that linearity in linear regression refers to linearity in the regression coefficients, not in the explanatory variable.
Therefore, the following forms are also valid SLR models:

$Y_i = \alpha + \beta X_i^2 + \varepsilon_i$
$Y_i = \alpha + \beta \ln(X_i) + \varepsilon_i$
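To see why a model with ln(X) still counts as linear regression, here is a minimal sketch on synthetic data. The true values of alpha and beta, the noise level, and the sample size are assumptions made up for the demo. The predictor is transformed first, and then an ordinary straight line is fitted to the transformed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from Y = alpha + beta * ln(X) + noise
alpha_true, beta_true = 2.0, 3.0          # assumed values for the demo
X = rng.uniform(1.0, 50.0, size=200)
Y = alpha_true + beta_true * np.log(X) + rng.normal(0.0, 0.1, size=200)

# The model is non-linear in X but linear in alpha and beta, so a plain
# least-squares line fitted to (ln X, Y) recovers both coefficients.
Z = np.log(X)
beta_hat, alpha_hat = np.polyfit(Z, Y, deg=1)   # returns [slope, intercept]

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

The same idea works for the squared form: replace `np.log(X)` with `X ** 2`.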

What can simple linear regression tell us that correlation does not tell us?

Although correlation may seem similar to simple linear regression, there are several differences between the two.

Difference 1: Correlation quantifies the degree to which two factors are related. It does not fit a line through the data set.

Difference 2: Correlation is typically used when both factors are measured. It is rarely appropriate when one factor is something that you deliberately control. With simple linear regression, on the contrary, the X factor is often something that you manipulate (it may be time, a range of salaries or prices, etc.), and the Y factor is something that you measure.
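The difference can be sketched in a few lines. In this minimal example (synthetic data, all numbers assumed for illustration), rescaling Y by a factor of ten leaves the correlation coefficient unchanged, while the fitted regression slope changes tenfold. The last line shows how the two are linked: the slope equals the correlation multiplied by the ratio of the standard deviations.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=100)   # synthetic relationship
y_scaled = 10 * y                                   # same data, different unit

# Correlation: strength of the relationship, unit-free
r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x, y_scaled)[0, 1]

# Regression: an actual line, whose slope depends on the units
slope, intercept = np.polyfit(x, y, deg=1)
slope_scaled, _ = np.polyfit(x, y_scaled, deg=1)

# Link between the two: slope = r * (s_y / s_x)
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(r, r_scaled, slope, slope_scaled, slope_from_r)
```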

How does SLR work?

To apply SLR to your identified problem, you need to follow a seven-step mathematical process.

Step#1: Visualise the relationship between the identified factors graphically. The standard type of graph used in SLR is a scatter plot.

Step#2: Use the ordinary least squares (OLS) technique to estimate the regression parameters and define the functional form of the relationship between the variables.

Step#3: Calculate the standard error of the regression estimate.

Step#4: Calculate prediction intervals for the predicted value at a given X, based on the assumption that the errors are normally distributed.

Step#5: Test the significance of the estimated regression parameters.

Step#6: Test the goodness of fit of the overall model. Keep in mind that in SLR, the p-value of the F-test and the p-value of the test on the slope coefficient are identical.

Step#7: Compute the coefficient of determination and the correlation coefficient.
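Steps 2 to 7 can be sketched with SciPy as follows. This is a minimal example on synthetic data: the sample size, true coefficients, noise level, and the point `x0` are assumptions for the demo. `scipy.stats.linregress` provides the OLS estimates and the slope test, and the remaining quantities are computed from their textbook formulas.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, size=n)    # synthetic data

# Step 2: OLS estimates of intercept and slope
res = stats.linregress(x, y)
y_hat = res.intercept + res.slope * x

# Step 3: standard error of the regression (n - 2 degrees of freedom)
sse = np.sum((y - y_hat) ** 2)
s = np.sqrt(sse / (n - 2))

# Step 4: 95% prediction interval for a new observation at x0
x0 = 5.0
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
half_width = t_crit * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
y0 = res.intercept + res.slope * x0
interval = (y0 - half_width, y0 + half_width)

# Step 5: two-sided t-test on the slope
p_t = res.pvalue

# Step 6: F-test for the overall model (1 and n - 2 degrees of freedom)
sst = np.sum((y - y.mean()) ** 2)
f_stat = (sst - sse) / (sse / (n - 2))
p_f = stats.f.sf(f_stat, 1, n - 2)

# Step 7: coefficient of determination and correlation coefficient
r_squared = 1 - sse / sst
r = res.rvalue

print(f"slope={res.slope:.3f}, s={s:.3f}, interval={interval}")
print(f"p (t-test)={p_t:.6f}, p (F-test)={p_f:.6f}, R^2={r_squared:.3f}")
```

The two printed p-values agree, which is the identity mentioned in Step#6.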

Why use a scatter diagram in SLR?

When you choose SLR as your regression model, the first thing you need to do is assess the relationship between your identified factors.
The best graphical visualisation for this purpose is the scatter plot, for the following reasons:

  • Apart from the best-fit line, the dots (data points of the identified variables) help a lot in visualising the hidden pattern of the relationship between the factors.
  • If the factors prove to be related, an equation for the identified relationship can be estimated. With the help of this estimated equation, you can then proceed with your ML modelling.

When simple linear regression is considered for a business problem, the scatter plot of the identified factors usually takes one of the following six forms:

Fig 1: A direct (positive) linear relationship between the two factors (dependent and independent).

Fig 2: A direct but curvilinear relationship between the two factors.

Fig 3: An inverse (negative) linear relationship between the two factors.

Fig 4: An inverse and curvilinear relationship between the two factors.

Fig 5: An inverse linear relationship as in Fig 3, but with a much higher degree of scatter.

Fig 6: A non-linear relationship between the factors.

How to calculate the SLR in ML modeling?

An ML model based on SLR can be built with either Python or R. Here, I will explain the Python variant.

To program an SLR model in Python, six main steps have to be followed carefully:

    • #1: Dataset importing
    • #2: Data pre-processing
    • #3: Segregation of the train and test sets
    • #4: Fitting the linear regression model to the training dataset
    • #5: Predicting the test set results
    • #6: Visualising the test results

With Python, steps 1 to 5 remain almost the same from problem to problem. Step 6, however, changes a bit depending on which type of graph or chart you use. The generic Python code for steps 1 to 5 is as follows.

# Dataset importing
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
dataset = pd.read_csv('file name.csv')

# Data pre-processing
X = dataset.iloc[:, :-1].values  # X is the array of independent (self-sufficient) factors
Y = dataset.iloc[:, 1].values    # Y is the vector of the dependent (subordinate) factor

# Segregation of the train and test sets
from sklearn.model_selection import train_test_split
# A test size of 1/3 is used here; 20-80 and 30-70 splits are also common.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)

# Fitting the linear regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)  # This step estimates the linear equation used on the dataset.

# Predicting the test set results
y_pred = regressor.predict(X_test)
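For step 6, one possible visualisation is a scatter plot of the test points with the fitted line drawn over it. The sketch below is self-contained: since 'file name.csv' is only a placeholder, it generates a small synthetic dataset instead (years of experience versus salary, a purely hypothetical example), repeats steps 3 to 5, and then plots the result.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV file (hypothetical columns)
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 1))                      # years of experience
Y = 25000 + 9000 * X[:, 0] + rng.normal(0, 3000, size=30)  # salary

# Steps 3 to 5: split, fit, predict
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=1/3, random_state=0)
regressor = LinearRegression().fit(X_train, Y_train)
y_pred = regressor.predict(X_test)

# Step 6: test points as dots, fitted regression line on top
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X_test[:, 0], Y_test, color="red", label="Test data")
ax.plot(X_train[:, 0], regressor.predict(X_train), color="blue",
        label="Fitted line")
ax.set_xlabel("Years of experience")
ax.set_ylabel("Salary")
ax.set_title("SLR: test set results")
ax.legend()
# plt.show()  # uncomment when running interactively
```

Swapping `ax.scatter` and `ax.plot` for another chart type is the part of the program that changes from one use case to another.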



Where to learn SLR?

If you want to learn more about the application of SLR in ML, you can join the IBM-certified Learnbay Data Science and AI certification courses. The Learnbay data science course syllabus offers balanced learning on both statistics and programming, the two key pillars of data science career growth. Our AI and ML courses are available for both fresh graduates and working professionals. All of our courses include real-time industrial projects and live online classes. Our courses are available in all the prime cities across India, such as Mumbai, Kolkata, Bengaluru, Hyderabad, Delhi, Lucknow, and Patna.

To learn more about Learnbay Data Science, AI, and ML courses, and to book a telephonic counselling session, click here.

Regression techniques in Machine Learning

Machine learning has become one of the most popular and in-demand technologies today. It is used every day in our lives: virtual assistants, future predictions, video surveillance, social media services, spam mail detection, online customer support, search engine result ranking, fraud detection, recommendation systems, and so on. In machine learning, regression is one of the most important topics to learn. There are different types of regression techniques in Machine Learning, which we will look at in this article.


Regression techniques in Machine Learning such as linear regression and logistic regression are among the first algorithms that people learn when they study machine learning. There are numerous forms of regression, each with its own specific features, and each is applied accordingly. Regression techniques are used to find the relationship between the dependent and independent variables or features. Regression is a part of data analysis used to analyse continuous variables, and its main uses are forecasting, time series analysis, and modelling.

What is Regression?

Regression is a statistical method, mainly used in finance, investing, sales forecasting, and other business disciplines, that attempts to determine the strength and nature of the relationship among variables.

A dataset used for regression contains two types of variable:

  1. Dependent variable, mainly denoted as Y
  2. Independent variable, denoted as X

There are also two types of regression:

  1. Simple Regression: Only with a single independent feature/variable
  2. Multiple Regression: With two or more independent features/variables

In regression studies, six types of regression techniques are mainly used for complex problems:

  • Linear regression
  • Logistic regression
  • Polynomial regression
  • Stepwise Regression
  • Ridge Regression
  • Lasso Regression

Linear regression:

It is a supervised machine learning algorithm, mainly used for predictive analysis. Linear regression is a linear approach to modelling the relationship between a scalar response and one or more predictor variables. It focuses on the conditional probability distribution of the response given the predictors. The formula for linear regression is Y = mX + c.

Where Y is the target variable, m is the slope of the line, X is the independent feature, and c is the intercept.

Additional points on Linear regression:

  1. There should be a linear relationship between the variables.
  2. It is very sensitive to outliers, which can lead to a model with high variance and bias.
  3. Multicollinearity can become a problem when there are multiple independent features.

Logistic regression:

It is used for classification problems where the classes can be separated linearly. In layman’s terms, it applies when the dependent or target variable is in binary form: 1 or 0, true or false, yes or no. It is well suited to deciding whether an occurrence is likely to be a success or a failure.

Additional points:

  1. It is used for classification problems.
  2. It does not require a linear relationship between the dependent and independent features.
  3. It can be affected by outliers and can suffer from underfitting and overfitting.
  4. It needs a large sample size to make the estimation more accurate.
  5. It needs to avoid collinearity and multicollinearity.
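As a quick illustration of a binary target, here is a minimal sketch with scikit-learn. The tiny dataset (hours studied versus pass or fail) is hypothetical and only meant to show the shape of the inputs and outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

new_students = np.array([[1.0], [4.5]])
pred = clf.predict(new_students)                 # predicted class: 0 or 1
proba = clf.predict_proba(new_students)[:, 1]    # estimated probability of passing

print(pred, proba.round(3))
```

The model outputs a probability between 0 and 1, and the class is chosen by comparing that probability with 0.5.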

Polynomial regression:

The polynomial regression technique is used to build a model that is suitable for handling data with a non-linear relationship. It gives a curve that is best suited to the data points, rather than a straight line.
Polynomial regression is fitted with the least-squares method. The purpose of regression analysis is to model the expected value of the dependent variable y in terms of the independent variable x.
The formula for this is $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + e$
Additional features:
Look particularly at the curve towards the ends to see whether those shapes and patterns make logical sense. Higher-degree polynomials can lead to weird extrapolation results.
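Here is a minimal sketch of a degree-2 polynomial fit with NumPy. The data is synthetic, with coefficients assumed for the demo, and `sse` is a small helper written here to compare the curve against a straight-line fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
# Synthetic curved relationship: y = 1 - 2x + 0.5x^2 + noise
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.2, size=x.size)

quad = np.polyfit(x, y, deg=2)   # coefficients, highest power first: [b2, b1, b0]
line = np.polyfit(x, y, deg=1)   # straight line for comparison

def sse(coeffs):
    """Sum of squared errors of a fitted polynomial on the data."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

print("quadratic coefficients:", quad.round(3))
print("SSE line:", round(sse(line), 2), "SSE quadratic:", round(sse(quad), 2))
```

The quadratic fit leaves a much smaller error than the line because the data really is curved. Raising the degree further would keep lowering the training error, which is exactly when the extrapolation warning above applies.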

Step-wise Regression:

It is used for fitting regression models in which the choice of predictive variables is carried out automatically.
At every step, a variable is added to or removed from the set of explanatory variables. The main approaches are forward selection, backward elimination, and bidirectional elimination.
The formula for this: $b_j^{std} = b_j \, (s_{x_j} / s_y)$, where $b_j$ is the coefficient of predictor $x_j$, $s_{x_j}$ is the standard deviation of $x_j$, and $s_y$ is the standard deviation of y.
Additional points:
  1. This regression does two things: it adds predictors step by step and removes predictors step by step.
  2. Forward selection starts with the most significant predictor in the ML model and then adds a feature at each step.
  3. Backward elimination starts with all the predictors in the model and then removes the least significant variable at each step.
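Forward selection and backward elimination can be sketched with scikit-learn's `SequentialFeatureSelector`. Note the assumption here: this selector adds or removes features based on cross-validated score rather than on significance tests, so it is a close relative of classical stepwise regression, not the same procedure. The dataset is synthetic, with only features 0 and 3 actually influencing the target.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only features 0 and 3 drive the target; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.5, size=200)

# Forward: start with no features, add the most useful one at each step
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward").fit(X, y)

# Backward: start with all features, drop the least useful one at each step
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward").fit(X, y)

forward_idx = np.flatnonzero(forward.get_support()).tolist()
backward_idx = np.flatnonzero(backward.get_support()).tolist()
print("forward keeps:", forward_idx, "backward keeps:", backward_idx)
```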

Ridge Regression: 

It is a method used when the dataset has multicollinearity, which means the independent variables are strongly related to each other. Although the least-squares estimates are unbiased under multicollinearity, their variances are large. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Additional points:

  1. The assumptions of this regression are the same as for least-squares regression, except that normality is not assumed.
  2. In this regression, the coefficient values are shrunk but do not reach zero.
  3. It is a regularization method that uses l2 regularization.
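The effect on multicollinear data can be sketched as follows. The two predictors are synthetic and almost identical (an assumption made to force multicollinearity), and `alpha=1.0` is an arbitrary penalty strength chosen for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, size=100)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)        # l2 penalty on the coefficients

print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```

The OLS coefficients can be large and unstable because the two columns are nearly interchangeable. Ridge spreads the effect across both columns, with smaller coefficients that are still not exactly zero.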

Lasso Regression:

Lasso is an abbreviation of Least Absolute Shrinkage and Selection Operator. It is similar to ridge regression in that it also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models.

Additional points:
  1. Lasso regression shrinks coefficients to zero, which helps in feature selection for building a proper ML model.
  2. It is also a regularization method, and it uses l1 regularization.
  3. If there are many correlated features, it picks only one of them and shrinks the others to zero.
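The feature-selection behaviour can be seen in a minimal sketch. The data is synthetic, with only the first two of five features affecting the target, and `alpha=0.5` is an arbitrary penalty strength assumed for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 matter; features 2 to 4 are pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)        # l1 penalty on the coefficients
selected = np.flatnonzero(lasso.coef_).tolist()

print("coefficients:", lasso.coef_.round(2))
print("features kept:", selected)
```

The noise features receive coefficients of exactly zero, and the two real features keep non-zero coefficients that are shrunk towards zero by the penalty.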



Top 50 interview questions of Machine Learning

51. How to handle categorical variables in KNN?

Ans: Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike regression, create k dummies instead of (k-1). 

For example, a categorical variable named “Department” has 5 unique levels/categories. So we will create 5 dummy variables. Each dummy variable has 1 against its department and else 0.
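The “Department” example can be sketched with pandas. The department names and the small DataFrame are hypothetical. `get_dummies` creates one column per level by default, and `drop_first=True` gives the (k-1) encoding normally used for regression.

```python
import pandas as pd

# Hypothetical data with 5 unique departments
df = pd.DataFrame(
    {"Department": ["HR", "IT", "Sales", "Finance", "Ops", "IT"]})

# For KNN: k dummies, one per level
knn_dummies = pd.get_dummies(df["Department"], prefix="Dept", dtype=int)

# For regression: (k-1) dummies, first level dropped
reg_dummies = pd.get_dummies(df["Department"], prefix="Dept",
                             drop_first=True, dtype=int)

print(knn_dummies)
print(reg_dummies.shape)
```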

52. Can KNN be used for Regression? How to use KNN for Regression?

Ans: Yes, K-nearest neighbour can be used for regression. In other words, the K-nearest neighbour algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbours.
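A minimal sketch with scikit-learn's `KNeighborsRegressor` shows the averaging. The six training points are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative one-feature dataset with a continuous target
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# The 3 nearest neighbours of 2.5 are x = 2, 3 and 1,
# so the prediction is the mean of 12, 14 and 10.
pred = knn.predict([[2.5]])[0]
print(pred)
```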

53. Discuss the difference between KNN and K Means Algorithms.

Ans: KNN and k-means clustering both are very different algorithms that solve different problems and have their own meanings of what the variable ‘k’ is.  KNN is a supervised classification algorithm that will label new data points based on the ‘k’ number of nearest data points and k-means clustering is an unsupervised clustering algorithm that groups the data into ‘k’ number of clusters.

54. How to reduce the increased variance of the model other than changing k?

Ans: By using bagging-based decision boundaries. If there is no restriction on how many times one can draw samples from the original dataset, a simple variance reduction method is to sample many times, fit a kNN model to each of these samples, and then take a majority vote of the kNN models to classify each test data point. This variance reduction method is called bagging.
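Bagging a kNN model can be sketched with scikit-learn's `BaggingClassifier`. The dataset (`make_moons`), the number of models, and k=1 are assumptions for the demo. With k=1 each model outputs a hard 0/1 vote, so the averaging done by `BaggingClassifier` amounts to a majority vote.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single high-variance model: 1-nearest-neighbour
single = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# 50 such models, each fitted to a bootstrap sample of the training data
bagged = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=1),
    n_estimators=50, random_state=0).fit(X_train, y_train)

single_acc = single.score(X_test, y_test)
bagged_acc = bagged.score(X_test, y_test)
print("single kNN:", single_acc, "bagged kNN:", bagged_acc)
```

How much the bagged model helps depends on the data and on k, so it is worth comparing both scores on your own dataset.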

55. What is the effect of sampling on KNN?

Ans: Sampling does several things from the perspective of a single data point since kNN works on a point-by-point basis.

  1. The average distance to the k nearest neighbours increases due to increased sparsity in the dataset.
  2. Consequently, the area covered by k-nearest neighbours increases in size and covers a larger area of the feature space.
  3. The sample variance increases.

A consequence of this change in input is an increase in variance. When we talk of variance, we refer to the variability in the predictions given different samples from the population. Why would the immediate effects of sampling lead to the increased variance of the model?

Notice that now a larger area of the feature space is represented by the same k data points. While our sample size has not grown, the population space that it represents has increased in size. This will result in higher variance in the proportion of classes in the k nearest data points, and consequently a higher variance in the classification of each data point.

56. What happens when we change the value of K in KNN?

Ans: Short Answer: The class boundaries of the predictions become more smooth as k increases.

Long Answer: What really is the significance of these effects? First, it gives hints that a lower k value makes the KNN model more “sensitive.” That is, it is more sensitive to the local changes in the dataset. The “sensitivity” of the model directly translates to its variance.

All of these examples point to an inverse relationship between variance and k. Additionally, consider how KNN operates when k reaches its maximum value, k=n (where n is the number of points in the training set). In this case, the majority class in the training set will always dominate the predictions. The model will simply pick the most abundant class in the data and never deviate, effectively resulting in zero variance. Therefore, to reduce variance, k must be increased.

Final Verdict: In order to offset the increased variance due to sampling, k can be increased to decrease model variance.
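The two extremes of k can be checked directly. In this minimal sketch (synthetic two-class data, with class sizes of 30 and 20 assumed for the demo), k=1 reproduces the training labels exactly, while k=n predicts the majority class for every point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# 30 points of class 0 and 20 points of class 1
X = np.vstack([rng.normal(0, 1, size=(30, 2)),
               rng.normal(3, 1, size=(20, 2))])
y = np.array([0] * 30 + [1] * 20)

# k = 1: every training point is its own nearest neighbour
knn_one = KNeighborsClassifier(n_neighbors=1).fit(X, y)
preds_one = knn_one.predict(X)

# k = n: every prediction is a vote over the whole training set
knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
preds_all = knn_all.predict(X)

print("k=1 training accuracy:", (preds_one == y).mean())
print("k=n distinct predictions:", set(preds_all.tolist()))
```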

57. What is the thumb rule to approach the KNN problem?


    Ans:

    1. Load the data
    2. Initialize the value of k
    3. To get the predicted class, iterate from 1 to the total number of training data points:
      • Calculate the distance between the test data and each row of training data. Here we will use Euclidean distance as our distance metric since it’s the most popular method. Other metrics that can be used are Chebyshev, cosine, etc.
      • Sort the calculated distances in ascending order based on distance values
      • Get the top k rows from the sorted array
      • Get the most frequent class of these rows
      • Return the predicted class

KNN Code Snippet:

KNN code snippet
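The original snippet is not available here, so the following is a minimal sketch of the steps above written from scratch with NumPy. The function name `knn_predict` and the six training points are assumptions for illustration, not the original code.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances (ascending sort)
    nearest = np.argsort(distances)[:k]
    # Most frequent class among those k rows
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative training data: two well-separated groups
X_train = np.array([[1, 1], [1, 2], [2, 1],
                    [6, 6], [7, 6], [6, 7]], dtype=float)
y_train = np.array(["A", "A", "A", "B", "B", "B"])

pred_low = knn_predict(X_train, y_train, np.array([1.5, 1.5]))
pred_high = knn_predict(X_train, y_train, np.array([6.5, 6.5]))
print(pred_low, pred_high)
```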

58. What is the SVM Algorithm?

Ans: SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, such as the height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two coordinates. (The points that end up closest to the separating line are known as support vectors.)


Now, we will find a line that splits the data into the two differently classified groups. This will be the line for which the distance to the closest point in each of the two groups is as large as possible.

graph plot

In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on where the testing data lands on either side of the line, that’s what class we can classify the new data as.

59. What are support Vectors? 

Ans: A support vector machine attempts to find the line that “best” separates two classes of points. By “best”, we mean the line that results in the largest margin between the two classes. The points that lie on this margin are the support vectors.

The vectors that define the hyperplane are the support vectors.

60. What is the purpose of the Support Vector in SVM?

Ans: A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the distance margin between the two classes. The extreme points in the data sets that define the hyperplane are the support vectors. 

61. What are kernels? 

Ans: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions. These functions can be of different types.

There are four types of kernels in SVM.

  1. Linear Kernel
  2. Polynomial kernel
  3. Radial basis kernel
  4. Sigmoid kernel

62. What is Kernel Trick?

Ans: Short Answer:  It allows us to operate in the original feature space without computing the coordinates of the data in a higher-dimensional space.

Long Answer:

  1. For a dataset with n features (~n-dimensional), SVMs find an (n-1)-dimensional hyperplane to separate it (let us say for classification)
  2. Thus, on their own, SVMs perform very badly with datasets that are not linearly separable
  3. But, quite often, it’s possible to transform our not-linearly-separable dataset into a higher-dimensional dataset where it becomes linearly separable, so that SVMs can do a good job
  4. Unfortunately, quite often, the number of dimensions you have to add (via transformations) depends on the number of dimensions you already have (and not linearly)
    • For datasets with a lot of features, it becomes next to impossible to try out all the interesting transformations
  5. Enter the Kernel Trick
    • Thankfully, the only thing SVMs need to do in the (higher-dimensional) feature space (while training) is compute the pair-wise dot products
    • For a given pair of vectors (in the lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (the Kernel Function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first
    • We are saved! With the kernel trick, SVMs can do well with datasets that are not linearly separable
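The idea can be verified numerically for one simple case. The sketch below uses the degree-2 polynomial kernel K(u, v) = (u · v)² on 2-D vectors, for which the explicit feature map is known. The function names `phi` and `poly_kernel` and the two vectors are chosen here for illustration.

```python
import numpy as np

def phi(v):
    """Explicit map from 2-D input space to the 3-D feature space."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(u, v):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return float(np.dot(u, v) ** 2)

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = float(np.dot(phi(u), phi(v)))   # transform first, then dot product
via_kernel = poly_kernel(u, v)             # (1*3 + 2*4)^2, no transformation

print(explicit, via_kernel)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors. For kernels such as the radial basis function, the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.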

63. Why is SVM called as Large Margin Classifier?

Ans: Short Answer: Because it places the decision boundary such that it maximizes the distance between two clusters.

Long Answer: Choosing the best hyperplane means choosing the one whose distance from the nearest training points is the maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin of a training point $(x, y)$ with respect to the hyperplane $w^T x + b = 0$ is given by:

$\gamma = \dfrac{y \, (w^T x + b)}{\|w\|}$

This is simply the functional margin $y \, (w^T x + b)$ normalized by $\|w\|$. So, these intuitions lead to the maximum margin classifier, which is a precursor to the SVM.

64. What is the difference between Logistic Regression and SVM? When to use which model?


  1. SVM tries to find the “best” margin (distance between the line and the support vectors) that separates the classes and this reduces the risk of error on the data, while logistic regression does not, instead it can have different decision boundaries with different weights that are near the optimal point.
  2. SVM works well with unstructured and semi-structured data like text and images while logistic regression works with already identified independent variables.
  3. SVM is based on the geometrical properties of the data while logistic regression is based on statistical approaches.
  4. Logistic Regression can’t be applied to a nonlinearly separable dataset whereas SVM can be applied.
  5. The risk of overfitting is less in SVM, while Logistic regression is vulnerable to overfitting.

65. When to Use Logistic Regression vs Support Vector Machine?

Ans: Depending on the number of training sets (data)/features that you have, you can choose to use either logistic regression or support vector machine.

Let’s take these as an example where:
n = number of features,
m = number of training examples

  1. If n is large (1–10,000) and m is small (10–1000): use logistic regression or SVM with a linear kernel.
  2. If n is small (1–1000) and m is intermediate (10–10,000): use SVM with (Gaussian, polynomial, etc) kernel
  3. If n is small (1–100), m is large (50,000–1,000,000+): first, manually add more features and then use logistic regression or SVM with a linear kernel

66. What do the C and gamma parameters in SVM signify?

Ans: Short Answer:

Cost (C) and gamma are the hyperparameters that decide the performance of an SVM model. There should be a fine balance between variance and bias for any ML model. (This is a science and an art, as we call it in empirical studies.)

For SVM, a high value of gamma fits the training data more closely, which lowers bias but raises variance, and vice versa. Similarly, a large value of the cost parameter (C) lowers bias but raises variance, and vice versa.

In summary:

  • High gamma or high C: low bias, high variance (risk of overfitting).
  • Low gamma or low C: high bias, low variance (risk of underfitting).

The art is to choose a model with optimum variance and bias. Therefore, you need to choose the values of C and Gamma accordingly.

Optimum values of C and Gamma can be found by using methods like Grid search.

Long Answer:

The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. 

The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.  If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of the high density of any pair of two classes.
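A grid search over C and gamma can be sketched with scikit-learn's `GridSearchCV`. The dataset (`make_moons`) and the candidate values in `param_grid` are assumptions for the demo, and a real problem would need its own grid.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, non-linearly-separable data
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Candidate values, spaced on a log scale
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation for every (C, gamma) pair
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

Because each pair is scored on held-out folds, combinations that overfit (very high gamma) or underfit (very low gamma or C) tend to score worse than a balanced choice.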

67. What are the Advantages and Disadvantages of SVM?

Ans: SVM Advantages

  • SVM’s are very good when we have no idea about the data.
  • Works well with even unstructured and semi-structured data like text, Images, and trees.
  • The kernel trick is a real strength of SVM. With an appropriate kernel function, we can solve any complex problem.
  • Unlike in neural networks, SVM is not solved for local optima.
  • It scales relatively well to high dimensional data.
  • SVM models have generalization in practice, the risk of over-fitting is less in SVM.
  • SVM is always compared with ANN. When compared to ANN models, SVMs give better results.

SVM Disadvantages

  • Choosing a “good” kernel function is not easy.
  • Long training time for large datasets.
  • Difficult to understand and interpret the final model, variable weights, and individual impact.
  • Since the final model is not so easy to see, we can not do small calibrations to the model hence its tough to incorporate our business logic.
  • The SVM hyperparameters are Cost -C and gamma. It is not that easy to fine-tune these hyper-parameters. It is hard to visualize their impact.

SVM code snippet:

SVM code snippet
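The original snippet is not available here, so the following is a minimal sketch of a linear SVM classifier with scikit-learn. The dataset is synthetic (`make_blobs`), and the parameter values are assumptions for the demo.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The points that define the margin
support_vectors = clf.support_vectors_
train_accuracy = clf.score(X, y)

print("number of support vectors:", len(support_vectors))
print("training accuracy:", train_accuracy)
print("prediction for the first point:", clf.predict(X[:1]))
```

Only a handful of the 50 points end up as support vectors. The rest could be removed without changing the decision boundary.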

68. What is the Naïve Bayes Algorithm?

Ans: It is a classification algorithm that predicts the probability of each data point belonging to a class and then classifies the point as the class with the highest probability.

 Discuss Bayes Theorem.

Bayes’ Theorem gives us the probability of an event actually happening by combining the conditional probability given some result and the prior knowledge of an event happening.

Conditional probability is the probability that something will happen, given that something has occurred.  In other words, the conditional probability is the probability of X given a test result or P(X|Test).  For example, what is the probability an e-mail is spam given that my spam filter classified it as spam.

The prior probability is based on previous experience or the percentage of previous samples.  For example, what is the probability that any email is spam?


  $P(A|B) = \dfrac{P(B|A) \, P(A)}{P(B)}$

  • P(A|B) = Posterior probability = Probability of A given B happened
  • P(B|A) = Conditional probability = Probability of B happening if A is true
  • P(A) = Prior probability = Probability of A happening in general
  • P(B) = Evidence probability = Probability of B happening in general (for example, of getting a positive test)
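The spam-filter example can be worked through in a few lines. All three input numbers below are assumptions invented for the demo: 20% of e-mails are spam, the filter flags 90% of spam, and it wrongly flags 5% of legitimate mail.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from Bayes' theorem, with P(B) expanded over A and not-A."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A = "the e-mail is spam", B = "the filter flags the e-mail"
p_spam_given_flag = posterior(prior=0.20,
                              likelihood=0.90,
                              false_positive_rate=0.05)

# 0.18 / (0.18 + 0.04)
print(round(p_spam_given_flag, 4))
```

Even with a fairly accurate filter, the posterior stays below the 90% detection rate, because the prior probability of spam is low.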

69. Why is Naïve Bayes Naïve?

Ans: In Layman’s Terms: Being naive means being willing to believe that life is simple and fair, which is not true. Naive Bayes is naive because it assumes that the features going into the model are not related to each other in any way: a change in one variable will not directly affect another variable.

Long Answer: Naive Bayes (NB) is ‘naive’ because it makes the assumption that features of measurement are independent of each other. This is naive because it is (almost) never true. Here is how it works even then – NB is a very intuitive classification algorithm. It asks the question, “Given these features, does this measurement belong to class A or B?”, and answers it by taking the proportion of all previous measurements with the same features belonging to class A multiplied by the proportion of all measurements in class A. If this number is bigger than the corresponding calculation for class B then we say the measurement belongs in class A.
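The counting procedure described above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the features are categorical, no smoothing is applied (so an unseen feature value drives a class score to zero), and the tiny email dataset and the helper names `train_naive_bayes` and `predict` are made up.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return the class with the highest (unnormalized) posterior score."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior: share of all measurements in this class
        for i, value in enumerate(row):
            # each feature is treated as independent: the "naive" assumption
            score *= feature_counts[label][i][value] / count
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical emails described by (contains_offer, has_link)
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"),
        ("no", "no"), ("yes", "yes"), ("no", "no")]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
print(predict(("yes", "yes"), class_counts, feature_counts))  # spam
print(predict(("no", "no"), class_counts, feature_counts))    # ham
```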

70. What are feature matrix and response vectors?

Ans: Feature matrix: The feature matrix contains all the vectors (rows) of the dataset, where each vector holds the values of the independent (predictor) features.

Response vector: The response vector contains the value of the class variable (the prediction or output) for each row of the feature matrix.

71. What are the applications of Naïve Bayes classification algorithms?

Ans: Some real-world examples are given below:

  • Marking an email as spam or not spam.
  • Classifying a news article as being about technology, politics, or sports.
  • Checking whether a piece of text expresses positive or negative emotions.
  • Face recognition software.

72. What are the Advantages and Disadvantages of using the Naïve Bayes Algorithm?

Ans: Advantages

  1. Fast.
  2. Highly scalable.
  3. Used for binary and multiclass classification.
  4. A great choice for text classification.
  5. It can be trained easily on small data sets.

Disadvantages

Naive Bayes assumes that the features are independent of each other. However, in the real world, features often depend on each other.


73. What is K-Means Clustering? What are the steps for it?

Ans: K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

If k is given, the K-means algorithm can be executed in the following steps:

  • Partition the objects into k non-empty subsets.
  • Identify the cluster centroids (mean points) of the current partition.
  • Compute the distance from each point to every centroid and assign each point to the cluster whose centroid is nearest.
  • After re-assigning the points, recompute the centroid of each new cluster.
  • Repeat the last two steps until the assignments no longer change.
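The steps above can be sketched from scratch as follows. This is a minimal illustration, not a production implementation: it assumes 2-D points, uses the first k points as initial centroids so the run is deterministic (real libraries usually pick them randomly or with smarter seeding), and the sample points are made up.

```python
import math


def k_means(points, k, max_iter=100):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    # Assumption: the first k points serve as the initial centroids
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```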

74. Why is the word “means” associated with the name of the K-Means algorithm?

Ans: The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid. 

There are k-medoids and k-medians algorithms as well.

k-medoids minimizes the sum of dissimilarities between the points labeled as belonging to a cluster and the point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids or exemplars).

k-medians is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median.

75. How to find the optimum number of clusters in K-Means? Discuss the elbow curve/elbow method?

Ans: The basic idea behind partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation [or total within-cluster sum of squares (WSS)] is minimized. The total WSS measures the compactness of the clustering, and we want it to be as small as possible.

The elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters so that adding another cluster does not improve the total WSS by much.

Figure: The elbow method using a distortion graph. Notice the elbow at k = 3.

The optimal number of clusters can be found as follows:

  1. Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
  2. For each k, calculate the total within-cluster sum of squares (WSS).
  3. Plot the curve of WSS against the number of clusters k.
  4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
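As a rough sketch of the procedure, the snippet below computes the total WSS for k = 1 to 5 on a made-up 1-D dataset with three obvious groups. The helper `k_means_1d` and its evenly spread initial centroids are assumptions made to keep the example short and deterministic; in practice you would plot these values and look for the bend.

```python
def k_means_1d(data, k, max_iter=100):
    """Tiny 1-D k-means; returns the total within-cluster sum of squares (WSS)."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumption: spread the initial centroids evenly over the sorted data
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in ordered:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return sum((x - centroids[i]) ** 2 for i, c in enumerate(clusters) for x in c)


data = [1, 2, 3, 10, 11, 12, 20, 21, 22]  # hypothetical data with three groups
wss = {k: k_means_1d(data, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the WSS drops sharply up to k = 3 and only marginally afterwards, which is the "elbow" the method looks for.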

76. What is the difference between K-Means and Hierarchical Clustering? When to use which?

Ans: Hierarchical clustering and k-means clustering complement each other. In hierarchical clustering, the researcher does not need to know the number of clusters in advance, whereas in k-means clustering, the number of clusters must be specified beforehand.
Advice: If you do not know the number of clusters to form, use hierarchical clustering to determine the number, and then use k-means clustering to make more stable clusters, since hierarchical clustering is a single-pass exercise whereas k-means is an iterative process.

77. What are the advantages and disadvantages of using K-Means Algorithms?

Ans: K-Means Advantages:

1) With a large number of variables, K-Means is usually computationally faster than hierarchical clustering, provided k is kept small.

2) K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

K-Means Disadvantages:

1) It is difficult to predict the value of K.
2) It does not work well with non-globular clusters.
3) Different initial partitions can result in different final clusters.
4) It does not work well when the clusters in the original data have different sizes and different densities.


78. What is Hierarchical Clustering?

Ans: Hierarchical clustering is another unsupervised learning algorithm that is used to group together unlabelled data points that have similar characteristics. Hierarchical clustering algorithms fall into the following two categories.

Agglomerative hierarchical algorithms − Each data point is first treated as a single cluster, and pairs of clusters are then successively merged, or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − All the data points are first treated as one big cluster, and clustering proceeds by dividing (top-down approach) that one big cluster into smaller clusters.

79. What are the steps to perform Agglomerative Hierarchical Clustering?

Ans: The most widely used and important form of hierarchical clustering is agglomerative clustering. The steps to perform it are as follows −

  • Step 1 − Treat each data point as a single cluster. Hence, if there are K data points, we start with K clusters.
  • Step 2 − Form a bigger cluster by joining the two closest data points. This results in a total of K-1 clusters.
  • Step 3 − To form more clusters, join the two closest clusters. This results in a total of K-2 clusters.
  • Step 4 − Repeat Step 3 until only one big cluster remains, i.e. there are no more clusters left to join.
  • Step 5 − Finally, once the single big cluster is formed, use the dendrogram to divide it into multiple clusters, depending on the problem.
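Here is a small sketch of the merge loop on 1-D points. The text does not say how the distance between two clusters is measured, so this example assumes single linkage (distance between the closest members); the function name `agglomerative` and the sample points are hypothetical. The recorded merge distances are exactly what a dendrogram would draw as branch heights.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]  # Step 1: every point is its own cluster
    history = []
    while len(clusters) > 1:  # repeat until one big cluster remains
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return history


history = agglomerative([1, 2, 6, 7, 15])
for left, right, distance in history:
    print(left, right, distance)
```

Five points produce four merges; cutting the hierarchy before the last, largest-distance merge would leave two clusters.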

80. What is Dendrogram and what is its importance in Hierarchical Clustering?

Ans: A dendrogram is a type of Tree Diagram showing hierarchical clustering — relationships between similar sets of data. They are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.

The role of the dendrogram starts once the big cluster is formed. Dendrogram will be used to split the clusters into multiple clusters of related data points depending upon our problem. 

Figure: Parts of a dendrogram.

81. What is Boosting?

Ans: Boosting is a method of converting weak learners into strong learners. In boosting, each new tree is a fit on a modified version of the original data set.

Purpose of Boosting: It helps the weak learner to be modified to become better.

How it evolved: The first boosting algorithm to gain popularity was AdaBoost, or Adaptive Boosting. It was later generalized as Gradient Boosting.

82. What is AdaBoost?

Ans: AdaBoost combines multiple weak learners into a single strong learner. The weak learners in AdaBoost are decision trees with a single split, called decision stumps. When AdaBoost creates its first decision stump, all observations are weighted equally. To correct the previous error, the observations that were incorrectly classified now carry more weight than the observations that were correctly classified. AdaBoost algorithms can be used for both classification and regression problems.
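The re-weighting idea can be illustrated with one round of the classic binary AdaBoost update. This is a sketch under assumptions: labels are coded as -1 and +1, the stump’s predictions are simply given (hypothetical), and the function name `adaboost_reweight` is made up for this example.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights) for labels in {-1, +1}."""
    # Weighted error of the current stump
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # The stump's say in the final vote (assumes 0 < error < 1)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified points are scaled up, correctly classified ones scaled down
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # hypothetical stump that gets the last point wrong
weights = [1 / 5] * 5        # all observations start with equal weight
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights])  # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After one round, the single misclassified observation carries half of the total weight, so the next stump is pushed to get it right.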


83. What is Gradient Boosting Method (GBM)?

Ans: Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of changing the weights of every incorrectly classified observation at every iteration like AdaBoost, the Gradient Boosting method fits the new predictor to the residual errors made by the previous predictor.

GBM uses gradient descent to find the shortcomings in the previous learner’s predictions. The GBM algorithm can be given in the following steps:

Fit a model to the data, F1(x) = y

Fit a model to the residuals, h1(x) = y − F1(x)

Create a new model, F2(x) = F1(x) + h1(x)

By combining weak learner after weak learner, the final model is able to account for much of the error from the original model and reduces this error over time.
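A minimal from-scratch sketch of this residual-fitting loop is shown below. It makes several assumptions for brevity: a 1-D regression problem with squared error, decision stumps as the weak learners, the mean of y as the first model, and a learning rate of 0.5 applied to every stump. The helper names and the data are hypothetical.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Squared-error boosting: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # first model: predict the mean
    learners = []

    def predict(x):
        return base + learning_rate * sum(h(x) for h in learners)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # errors so far
        learners.append(fit_stump(xs, residuals))             # h(x) fit to residuals
    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]  # hypothetical step-like data
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```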


84. What is XGBoost?

Ans: XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting machines are generally very slow in implementation because of sequential model training. Hence, they are not very scalable. Thus, XGBoost is focused on computational speed and model performance. XGBoost provides:

    • Parallelization of tree construction using all of your CPU cores during training.
    • Distributed Computing for training very large models using a cluster of machines.
    • Out-of-Core Computing for very large datasets that don’t fit into memory.
    • Cache Optimization of data structures and algorithm to make the best use of hardware.


85. What are the basic enhancements done to Gradient Boosting?

Ans: Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

We will look at 4 enhancements to basic gradient boosting:

  1. Tree Constraints
  2. Shrinkage
  3. Random sampling
  4. Penalized Learning
  1. Tree Constraints: A good general heuristic is that the more constrained tree creation is, the more trees you will need in the model; conversely, the less constrained the individual trees are, the fewer trees will be required.

         Below are some constraints that can be imposed on the construction of decision trees:

  • The number of trees: adding more trees makes the model overfit only very slowly. The advice is to keep adding trees until no further improvement is observed.
  • Tree depth: deeper trees are more complex, and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
  • The number of nodes or number of leaves: like depth, this can constrain the size of the tree, but it does not force a symmetrical structure if other constraints are used.
  • The number of observations per split: this imposes a minimum on the amount of training data at a node before a split can be considered.
  • Minimum improvement to loss: this is a constraint on the improvement of any split added to a tree.
  2. Weighted Updates (Shrinkage): The predictions of each tree are added together sequentially. The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called shrinkage or a learning rate.
  3. Stochastic Gradient Boosting (Random sampling): A big insight into bagging ensembles and the random forest was allowing trees to be greedily created from subsamples of the training dataset. This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models. This variation of boosting is called stochastic gradient boosting. At each iteration, a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.
  4. Penalized Gradient Boosting (Penalized Learning): Additional constraints can be imposed on the parameterized trees in addition to their structure. Classical decision trees like CART are not used as weak learners. Instead, a modified form called a regression tree is used, which has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees are called weights in some literature. As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as L1 and L2 regularization of weights. The additional regularization term helps to smooth the final learned weights to avoid overfitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.

86. What is Dimensionality Reduction? Why is it used?

Ans: Dimensionality reduction refers to the process of converting a dataset with a large number of dimensions into data with fewer dimensions, while ensuring that it still conveys similar information concisely.

These techniques are used in machine learning to obtain better features for a classification or regression task.

87. What are the commonly used Dimensionality Reduction Techniques?

Ans: The various methods used for dimensionality reduction include:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Generalized Discriminant Analysis (GDA)

88. How does PCA work? When to use? 

Ans: Short Answer: Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.

High dimensionality means that the dataset has a large number of features. The primary problem associated with high dimensionality in machine learning is model overfitting, which reduces the ability to generalize beyond the examples in the training set.

PCA in Layman’s Term: Consider the 2D XY plane.

For the sake of intuition, let us consider variance as the spread of data – the distance between the two farthest points.

Typically, it is believed, that if the variance of data is large, it offers more information, than data that has a small variance. (This may or may not be true). This is the assumption which PCA intends to exploit.

I give you 4 points – {(1,1), (2,2), (3,3), (4,4)}
(all lie on the line X=Y)

What is the variance on X-axis?
Variance(X) = 4-1 = 3

What is the variance on Y-axis?
Variance(Y) = 4-1 = 3

Can we obtain new data with higher variance in some manner?
Rotate your XY system by 45 degrees anticlockwise. What happens? The line X=Y has now become the X(new)-axis. And, X = -Y is now the Y(new)-axis. Let’s compute the variance again (in the form of distance)

Variance(X(new)) = distance ((4,4), (1,1)) = sqrt(18) = 4.24
Variance(Y(new)) = 0, since all four points lie on the X(new)-axis.

89. What did we get by doing this rotation?
Ans: Original data – had the highest variance on any axis as 3. This rotation gave us a variance of 4.24

That was the intuitive explanation of what PCA does. Just for further clarification

Eigenvalues = variance of the data along a particular axis in the new coordinate system. In above example, Eigenvalue(X(new)) = 4.24.

Eigenvectors = the vectors which represent the new coordinate system. In above example, vector [1,1], would be an eigenvector for X(new), and [1,-1] eigenvector for Y(new). Since they are just directions – solvers typically give us unit vectors.

Getting transformed data
Once you have the eigenvectors, a dot product of the eigenvector with the original point will give you the new point in the new coordinate system.

Diagonalization: This is the part where you solve C·v = lambda·v for the covariance matrix C. In the new coordinate system defined by the eigenvectors, the covariance matrix is diagonal: it contains only variance terms, and the covariance terms are zero.

Steps of PCA:

  1. Calculate the covariance matrix of the (mean-centered) data points.
  2. Calculate the eigenvectors and corresponding eigenvalues.
  3. Sort the eigenvectors by their eigenvalues in decreasing order.
  4. Choose the first k eigenvectors; these will be the new k dimensions.
  5. Transform the original n-dimensional data points into k dimensions.
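The five steps can be sketched with NumPy as below, applied to the four points from the intuition section. Treat this as an illustration rather than a reference implementation: the function name `pca` is made up, and note that the eigenvalues it returns are statistical variances (computed by `np.cov`), not the simplified "distance between the farthest points" used in the layman’s explanation.

```python
import numpy as np


def pca(data, k):
    """Project the rows of `data` onto the top-k principal components."""
    centered = data - data.mean(axis=0)              # center each feature
    cov = np.cov(centered, rowvar=False)             # step 1: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # step 2: eigh suits symmetric matrices
    order = np.argsort(eigenvalues)[::-1]            # step 3: sort by decreasing eigenvalue
    components = eigenvectors[:, order[:k]]          # step 4: keep the first k eigenvectors
    return centered @ components, eigenvalues[order]  # step 5: transform the data


points = np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype=float)
projected, eigenvalues = pca(points, k=1)
print(eigenvalues)        # all the variance sits on the first component
print(projected.ravel())  # 1-D coordinates along the line X = Y (sign may flip)
```

Because the four points lie exactly on one line, the second eigenvalue is zero and a single component keeps all the information.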


90. How does LDA work? When to use?

Ans: LDA is a way to reduce ‘dimensionality’ while at the same time preserving as much of the class discrimination information as possible.

How does it work?
Basically, LDA helps you find the ‘boundaries’ around clusters of classes. It projects your data points on a line so that your clusters ‘are as separated as possible’, with each cluster having a relative (close) distance to a centroid.

What was that stuff about dimensionality?
Let’s say you have a group of data points in 2 dimensions, and you want to group them into 2 groups. LDA reduces the dimensionality of your settings like so:
K(Groups) = 2. 2-1 = 1.

Why? Because “The K centroids lie in an at most K-1-dimensional affine subspace”. What is the affine subspace? It’s a geometric concept or *structure* that says, “I am going to generalize the affine properties of Euclidean space”. What are those affine properties of the Euclidean space? Basically, it’s the fact that we can represent a point with 3 coordinates in a 3-dimensional space (with a nod toward the fact that there may be more than 3 dimensions that we are ultimately dealing with).

So, we should be able to represent a point with 2 coordinates in 2-dimensional space and represent a point with 1 coordinate in a 1-dimensional space. LDA reduced the dimensionality of our 2-dimension problem down to one dimension. So now we can get down to the serious business of listening to the data. We now have 2 groups, and 2 points in any dimension can be joined by a line. How many dimensions does a line have? 1! Now we are cooking with Crisco!

So we get a bunch of these data points, represented by their 2d representation (x,y). We are going to use LDA to group these points into either group 1 or group 2.

91. What are the Steps for LDA?

Ans: Steps of LDA:

  1. Compute the d-dimensional mean vectors for the different classes from the dataset.
  2. Compute the scatter matrices (the between-class and within-class scatter matrices).
  3. Compute the eigenvectors and corresponding eigenvalues of the scatter matrices.
  4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d x k dimensional matrix W (where every column represents an eigenvector).
  5. Use this d x k eigenvector matrix to transform the samples onto the new subspace.

This can be summarized by the matrix multiplication:

Y = X x W (where X is an n x d dimensional matrix representing the n samples, and Y holds the transformed n x k dimensional samples in the new subspace).
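As a rough sketch of these steps with NumPy, the function below builds the within-class and between-class scatter matrices and takes the leading eigenvectors of S_w⁻¹ S_b, which is one common way to set up the eigenproblem. The function name `lda`, the variable names, and the six sample points are all hypothetical, and the code assumes the within-class scatter matrix is invertible.

```python
import numpy as np


def lda(X, y, k):
    """Return the d x k projection matrix W found by LDA."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)                   # class mean vector
        S_w += (X_c - mean_c).T @ (X_c - mean_c)    # spread inside the class
        diff = (mean_c - overall_mean).reshape(d, 1)
        S_b += len(X_c) * (diff @ diff.T)           # spread between class means
    # Eigen-decomposition of S_w^-1 S_b (assumes S_w is invertible)
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
    order = np.argsort(eigenvalues.real)[::-1]      # sort by decreasing eigenvalue
    return eigenvectors[:, order[:k]].real          # keep the top-k eigenvectors


X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])
W = lda(X, y, k=1)
Y = X @ W  # n x k samples in the new subspace
print(Y.ravel())
```

With two classes there is only one useful direction (K - 1 = 1), and the projected values of the two groups end up on separate stretches of the line.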


92. What is GDA? 

Ans: When we have a classification problem in which the input features are continuous random variables, we can use GDA. It is a generative learning algorithm in which we assume that p(x|y) follows a multivariate normal distribution and p(y) follows a Bernoulli distribution.

Gaussian discriminant analysis (GDA) is a generative model for classification where the distribution of each class is modeled as a multivariate Gaussian.

93. What are the advantages and disadvantages of Dimensionality Reduction?

Ans: Advantages:

  • Dimensionality reduction helps with data compression, and hence reduces the storage space required.
  • It reduces computation time: fewer dimensions mean less computing, and they can allow the use of algorithms that are unfit for a large number of dimensions.
  • It takes care of multicollinearity, which improves model performance, by removing redundant features. For example, there is no point in storing a value in two different units (meters and inches).
  • Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly.

Disadvantages:

  • It may lead to some amount of data loss.
  • PCA tends to find linear correlations between variables, which is sometimes undesirable.
  • PCA fails in cases where the mean and covariance are not enough to define the dataset.
  • We may not know how many principal components to keep; in practice, some rules of thumb are applied.

Learnbay provides industry-accredited data science courses in Bangalore. We understand how technologies converge in the field of data science, and hence we offer courses covering Machine Learning, TensorFlow, IBM Watson, Google Cloud Platform, Tableau, Hadoop, time series, R, and Python, with authentic real-time industry projects. Students earn an IBM certification. Hundreds of students have been placed in promising companies in data science roles. By choosing Learnbay, you can reach for one of the most sought-after jobs of the present and future.
The Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, and Deep Learning using TensorFlow. These topics are covered in courses co-developed with IBM.

Top 50 Interview Questions on Statistics

1. What are the different types of Sampling?
Ans: Some of the Common sampling ways are as follows:

  • Simple random sample: Every member and set of members has an equal chance of being included in the sample. Technology, random number generators, or some other sort of chance process is needed to get a simple random sample.

Example—A teacher puts students’ names in a hat and chooses without looking to get a sample of students.

Why it’s good: Random samples are usually fairly representative since they don’t favor certain members.

  • Stratified random sample: The population is first split into groups. The overall sample consists of some members of every group. The members of each group are chosen randomly.

Example: A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.

Why it’s good: A stratified sample guarantees that members from each group will be represented in the sample, so this sampling method is good when we want some members from every group.

  • Cluster random sample: The population is first split into groups. The groups are selected at random, and the overall sample consists of every member of the selected groups.

Example: An airline company wants to survey its customers one day, so they randomly select 5 flights that day and survey every passenger on those flights.

Why it’s good: A cluster sample gets every member from some of the groups, so it’s good when each group reflects the population as a whole.

  • Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every nth member is selected to be in the sample.

Example—A principal takes an alphabetized list of student names and picks a random starting point. Every 20th student is selected to take a survey.
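The four sampling schemes can be sketched with Python’s standard `random` module. Everything here is hypothetical: a population of 100 made-up student names, grades assigned in rotation, and classrooms of 10 used as the clusters.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = [f"student_{i}" for i in range(100)]
grade_names = ["freshman", "sophomore", "junior", "senior"]
grades = {name: grade_names[i % 4] for i, name in enumerate(population)}

# Simple random sample: every member has the same chance of selection
simple = random.sample(population, 10)

# Stratified random sample: draw randomly within every group
stratified = []
for grade in grade_names:
    group = [name for name in population if grades[name] == grade]
    stratified += random.sample(group, 5)

# Cluster random sample: pick whole groups at random and keep every member
classrooms = [population[i:i + 10] for i in range(0, 100, 10)]
cluster = [name for room in random.sample(classrooms, 2) for name in room]

# Systematic random sample: random starting point, then every 20th member
step = 20
start = random.randrange(step)
systematic = population[start::step]

print(len(simple), len(stratified), len(cluster), len(systematic))  # 10 20 20 5
```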

2. What is the confidence interval? What is its significance?

Ans: A confidence interval, in statistics, is a range of values, computed from sample data, that is expected to contain a population parameter for a certain proportion of repeated samples. Confidence intervals measure the degree of uncertainty or certainty in a sampling method. A confidence interval can be built for any confidence level, with the most common being 95% or 99%.

3. What are the effects of the width of the confidence interval?

  • The confidence interval is used for decision making.
  • As the confidence level increases, the width of the confidence interval also increases.
  • As the width of the confidence interval increases, we tend to get useless information as well.
  • Wide CI: useless information.
  • Narrow CI: high risk.
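To see how the confidence level affects the width, here is a small sketch that builds a confidence interval for a mean using the normal approximation from Python’s `statistics` module. The sample values are made up, and for a sample this small a t-distribution would usually be preferred; the normal approximation is an assumption made to keep the example dependency-free.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / len(sample) ** 0.5  # z times the standard error
    center = mean(sample)
    return center - margin, center + margin


sample = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48]  # hypothetical measurements
low95, high95 = confidence_interval(sample, 0.95)
low99, high99 = confidence_interval(sample, 0.99)
print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

The 99% interval comes out wider than the 95% interval: more confidence is bought with a less precise statement.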

4.  What is the level of significance (Alpha)?

Ans: The significance level, also denoted as alpha or α, is a measure of the strength of the evidence that must be present in your sample before you will reject the null hypothesis and conclude that the effect is statistically significant. The researcher determines the significance level before conducting the experiment.

The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that you require stronger evidence before you will reject the null hypothesis.

Use significance levels during hypothesis testing to help you determine which hypothesis the data support. Compare your p-value to your significance level. If the p-value is less than your significance level, you can reject the null hypothesis and conclude that the effect is statistically significant. In other words, the evidence in your sample is strong enough to be able to reject the null hypothesis at the population level.
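The comparison of a p-value with alpha can be sketched as follows, using a two-sided one-sample z-test. The choice of test is an assumption made for illustration (it presumes the population standard deviation is known), and the numbers and the function name `two_sided_p_value` are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """p-value of a two-sided one-sample z-test (population sigma assumed known)."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05  # significance level, fixed before looking at the data
p_value = two_sided_p_value(sample_mean=52.0, null_mean=50.0, sigma=5.0, n=36)
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)  # 0.0164 True
```

Here the p-value is below 0.05, so the null hypothesis is rejected at the 5% level; at a stricter alpha of 0.01 it would not be.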

5. What are Skewness and Kurtosis? What does it signify?

Ans: Skewness: It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in the data distribution. It differentiates extreme values in one tail versus the other. A symmetrical distribution will have a skewness of 0.

There are two types of Skewness: Positive and Negative

Skewness graphical representation

Positive skewness is when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.

Negative Skewness is when the tail of the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.

So, when is the skewness too much?

The rule of thumb seems to be:

    • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
    • If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and 1(positively skewed), the data are moderately skewed.
    • If the skewness is less than -1(negatively skewed) or greater than 1(positively skewed), the data are highly skewed.


Let us take a very common example of house prices. Suppose we have house values ranging from $100k to $1,000,000 with the average being $500,000.

If the peak of the distribution were left of the average value, portraying a positive skewness in the distribution, it would mean that many houses were being sold for less than the average value, i.e. $500k. This could be for many reasons, but we are not going to interpret those reasons here.

If the peak of the distribution were right of the average value, that would mean a negative skew. This would mean that many houses were being sold for more than the average value.

Kurtosis: Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It is used to describe the extreme values in the tails, and it is actually a measure of the outliers present in the distribution.

High kurtosis in a data set is an indicator that data has heavy tails or outliers. If there is a high kurtosis, then, we need to investigate why do we have so many outliers. It indicates a lot of things, maybe wrong data entry or other things. Investigate!

Low kurtosis in a data set is an indicator that data has light tails or a lack of outliers. If we get low kurtosis(too good to be true), then also we need to investigate and trim the dataset of unwanted results.

Mesokurtic: This distribution has kurtosis statistics similar to that of the normal distribution. It means that the extreme values of the distribution are similar to that of a normal distribution characteristic. This definition is used so that the standard normal distribution has a kurtosis of three.

Leptokurtic (Kurtosis > 3): The distribution is longer and its tails are fatter. The peak is higher and sharper than in a mesokurtic distribution, which means that the data are heavy-tailed or have a profusion of outliers.

Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic distribution.

Platykurtic (Kurtosis < 3): The distribution is shorter and its tails are thinner than those of the normal distribution. The peak is lower and broader than in a mesokurtic distribution, which means that the data are light-tailed or lack outliers. This is because the extreme values are less extreme than those of the normal distribution.
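Both measures can be computed directly from the central moments of the data. The sketch below uses population moments and the convention in which a normal distribution has a kurtosis of 3; the function name `moments` and the two small datasets are made up for illustration.

```python
def moments(data):
    """Return (skewness, kurtosis) from population moments; normal kurtosis is 3."""
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # variance
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in data) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

skew_sym, kurt_sym = moments(symmetric)
skew_right, kurt_right = moments(right_skewed)
print(round(skew_sym, 3), round(kurt_sym, 3))      # 0.0 1.7
print(round(skew_right, 3), round(kurt_right, 3))
```

The symmetric data has zero skewness, while the second dataset has a skewness above 1, which the rule of thumb above would call highly skewed.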

6. What are Range and IQR? What does it signify?

Ans: Range: The range of a set of data is the difference between the highest and lowest values in the set.

IQR(Inter Quartile Range): The interquartile range (IQR) is the difference between the first quartile and the third quartile. The formula for this is:

IQR = Q3 – Q1

The range gives us a measurement of how spread out the entirety of our data set is. The interquartile range, which tells us how far apart the first and third quartiles are, indicates how spread out the middle 50% of our data set is.
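A quick sketch with the standard library shows how differently the two measures react to an outlier. The data is hypothetical, and note that `statistics.quantiles` uses its default "exclusive" method here; other quartile conventions give slightly different cut points.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]  # hypothetical values with one outlier

data_range = max(data) - min(data)  # stretched by the single outlier (40)
q1, _, q3 = quantiles(data, n=4)    # quartile cut points (default method)
iqr = q3 - q1                       # spread of the middle 50% of the data

print(data_range, iqr)
```

The range is dominated by the value 40, while the IQR only describes the middle half of the data and is barely affected by it.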

7. What is the difference between Variance and Standard Deviation? What is its significance?

Ans: The mean, a measure of central tendency, gives you an idea of the average of the data points (i.e. the center of the distribution). Next, you want to know how far your data points are from the mean. This is where variance comes in: it measures how far the data points are from the mean (in simple terms, the variation of the data points around the mean).

Variance (population): σ² = Σ(xᵢ − μ)² / N, where μ is the mean and N is the number of data points.

Standard deviation is simply the square root of the variance, and it is also used to measure the variation of your data points. You may ask why we use standard deviation when we already have variance. The reason is to keep the calculations in the same units: if the mean is in cm or m, then the variance is in cm² or m², whereas the standard deviation is in cm or m, so standard deviation is used most often.

Standard deviation (population): σ = √( Σ(xᵢ − μ)² / N )
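As a small worked example of the units argument, the snippet below computes both quantities for some made-up heights in centimetres, using the population versions from Python’s `statistics` module (the sample versions, `variance` and `stdev`, divide by N - 1 instead).

```python
from statistics import pvariance, pstdev

heights_cm = [150, 160, 170, 180, 190]  # hypothetical heights in cm

variance = pvariance(heights_cm)  # average squared distance from the mean, in cm²
std_dev = pstdev(heights_cm)      # square root of the variance, back in cm

print(variance)           # 200
print(round(std_dev, 2))  # 14.14
```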

8.  What is selection Bias? Types of Selection Bias?

Ans: Selection bias is the phenomenon of selecting individuals, groups, or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.

Types of selection bias include:

  • Sampling bias: a biased sample caused by non-random sampling
  • Time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
  • Exposure: includes clinical susceptibility bias, protopathic bias, and indication bias.
  • Data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
  • Attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included.
  • Observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it.

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming the data is less spread out than it might actually be.

9.  What are the ways of handling missing Data?

  • Delete rows with missing data
  • Mean/Median/Mode imputation
  • Assigning a unique value
  • Predicting the missing values using Machine Learning Models
  • Using an algorithm that supports missing values, like some implementations of random forests.
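
The first three options can be sketched in plain Python. This is only an illustration: the column values and the sentinel `-1.0` are made up, and in practice a library such as pandas would normally be used. The last lines also show why mean imputation makes the data look less spread out than it may really be.

```python
import statistics

# Hypothetical column with missing entries marked as None.
values = [10.0, None, 14.0, 18.0, None, 22.0]
observed = [v for v in values if v is not None]

# 1. Deletion: keep only the rows that have a value.
deleted = list(observed)

# 2. Mean / median imputation: fill the gaps with a summary of the observed values.
mean_value = statistics.fmean(observed)
median_value = statistics.median(observed)
mean_imputed = [mean_value if v is None else v for v in values]
median_imputed = [median_value if v is None else v for v in values]

# 3. Unique value: mark missing entries with a sentinel (an arbitrary choice here).
SENTINEL = -1.0
flagged = [SENTINEL if v is None else v for v in values]

# Mean imputation adds points exactly at the mean, so the spread shrinks.
spread_observed = statistics.pstdev(observed)
spread_imputed = statistics.pstdev(mean_imputed)
print(f"std before imputation: {spread_observed:.3f}, after: {spread_imputed:.3f}")
```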

10.  What are the different types of probability distributions? Explain with examples.

Ans: The common probability distributions are as follows:

  1. Bernoulli Distribution
  2. Uniform Distribution
  3. Binomial Distribution
  4. Normal Distribution
  5. Poisson Distribution

1. Bernoulli Distribution: A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q or 1-p.

Example: Whether it will rain tomorrow (rain denotes success, no rain denotes failure), or whether a game is won (success) or lost (failure).

2. Uniform Distribution: When you roll a fair die, the outcomes are 1 to 6, and each is equally likely. That is the basis of a uniform distribution: unlike the Bernoulli distribution, all n possible outcomes of a uniform distribution are equally likely.

Example: Rolling a fair die.

3. Binomial Distribution: The binomial distribution describes the number of successes in n trials, where each trial has only two possible outcomes (such as success or failure, gain or loss, win or lose) and the probability of success is the same for all trials. Its assumptions are:

  • Each trial is independent.
  • There are only two possible outcomes in a trial: either a success or a failure.
  • A total number of n identical trials are conducted.
  • The probability of success and failure is the same for all trials. (Trials are identical.)

Example: The number of heads in n tosses of a coin.

4. Normal Distribution: The normal distribution describes many situations encountered in practice. The sum of a large number of small, independent random variables often turns out to be approximately normally distributed, which contributes to its widespread application. A distribution is known as a normal distribution if it has the following characteristics:

  • The mean, median, and mode of the distribution coincide.
  • The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
  • The total area under the curve is 1.
  • Exactly half of the values are to the left of the center and the other half to the right.

5. Poisson Distribution: A distribution is called Poisson distribution when the following assumptions are valid:

  • Any successful event should not influence the outcome of another successful event (events are independent).
  • The probability of success in an interval depends only on the length of the interval: it is the same for all intervals of equal length.
  • The probability of success in an interval approaches zero as the interval becomes smaller.

Example: The number of emergency calls recorded at a hospital in a day.
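
The five distributions above can be written down as short functions. This is a sketch using the standard textbook formulas, not a replacement for a statistics library; the function names and the example parameters are chosen here for illustration.

```python
import math
from statistics import NormalDist

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p (k is 0 or 1)."""
    return p if k == 1 else 1 - p

def uniform_pmf(n):
    """Probability of each of n equally likely outcomes, e.g. n=6 for a fair die."""
    return 1 / n

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Two heads in four fair coin tosses.
two_heads = binomial_pmf(2, 4, 0.5)

# No emergency calls in a day when the (hypothetical) average is 2 per day.
no_calls = poisson_pmf(0, 2.0)

# Height of the standard normal density at its mean.
peak = NormalDist(mu=0, sigma=1).pdf(0)

print(two_heads, no_calls, peak)
```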


11. What are the statistical Tests? List Them.

Ans: Statistical tests are used in hypothesis testing. They can be used to:

  • determine whether a predictor variable has a statistically significant relationship with an outcome variable.
  • estimate the difference between two or more groups.

Statistical tests assume a null hypothesis of no relationship or no difference between groups. Then they determine whether the observed data fall outside of the range of values predicted by the null hypothesis.

Common Tests in Statistics:

    1. T-Test/Z-Test
    2. ANOVA
    3. Chi-Square Test
    4. MANOVA

statistical Tests flowchart


12. How do you calculate the sample size required?

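Ans: You can use the margin of error (ME) formula, ME = t × S / √n, and solve it for n to determine the desired sample size: n = (t × S / ME)², rounded up to the next whole number, where

  • t (or z) = the t or z score used to calculate the confidence interval
  • ME = the desired margin of error
  • S = sample standard deviation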
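
A minimal sketch of this calculation, assuming the usual rearrangement n = (z × S / ME)² and a z score (rather than a t score) for the chosen confidence level. The function name and the example numbers are illustrative only.

```python
import math
from statistics import NormalDist

def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Smallest n for which z * std_dev / sqrt(n) does not exceed the margin of error."""
    # Two-sided z score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

# Hypothetical inputs: S = 15, desired ME = 2, 95% confidence.
n = required_sample_size(std_dev=15, margin_of_error=2)
print(n)
```

Halving the margin of error roughly quadruples the required sample size, because n grows with the square of 1/ME.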


13. What are the different Biases associated when we sample?

Ans: Potential biases include the following:

  • Sampling bias: a biased sample caused by non-random sampling
  • Undercoverage bias: some members of the population are inadequately represented in the sample
  • Survivorship bias: error of overlooking observations that did not make it past a form of the selection process.


14.  How to convert normal distribution to standard normal distribution?

The standard normal distribution has mean = 0 and standard deviation = 1.

To convert a normal distribution to the standard normal distribution, we can use the formula: X (standardized) = (x-µ) / σ
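
Applied to a whole data set, the conversion looks like the sketch below. The heights are made-up values, and µ and σ are estimated from the data itself using the population formulas.

```python
import statistics

def standardize(values):
    """Return (x - mu) / sigma for every x, with mu and sigma computed from the data."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical heights in cm.
heights_cm = [150, 160, 170, 180, 190]
standardized = standardize(heights_cm)

# After standardization the mean is 0 and the standard deviation is 1.
print(standardized)
print(statistics.fmean(standardized), statistics.pstdev(standardized))
```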



15. How do you find the mean length of all fish in a river?

  • Define the confidence level (most common is 95%)
  • Take a sample of fish from the river (to get better results, the number of fish should be > 30)
  • Calculate the mean length and standard deviation of the lengths
  • Calculate the t-statistic for the chosen confidence level and n-1 degrees of freedom
  • Get the confidence interval in which the mean length of all the fish should lie.
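
The steps above can be sketched as follows. The fish lengths are simulated, not real data, and the critical value 2.042 is the two-sided 95% t value for 30 degrees of freedom taken from a standard t-table (an assumption of this sketch; a library such as SciPy could compute it instead).

```python
import math
import random
import statistics

# Step 2: a simulated sample of 31 fish lengths in cm (hypothetical data).
rng = random.Random(42)
lengths_cm = [rng.gauss(25.0, 4.0) for _ in range(31)]
n = len(lengths_cm)

# Step 3: sample mean and sample standard deviation.
mean_length = statistics.fmean(lengths_cm)
s = statistics.stdev(lengths_cm)

# Step 4: critical t value for 95% confidence and n - 1 = 30 degrees of freedom,
# looked up in a t-table.
T_CRIT_95_DF30 = 2.042

# Step 5: confidence interval for the mean length of all fish.
margin = T_CRIT_95_DF30 * s / math.sqrt(n)
ci_low, ci_high = mean_length - margin, mean_length + margin

print(f"mean = {mean_length:.2f} cm, 95% CI = ({ci_low:.2f}, {ci_high:.2f}) cm")
```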


16.  What do you mean by the degree of freedom?

  • DF is the number of values in a calculation that are free to vary
  • DF is used with t-distribution and not with Z-distribution
  • For a series, DF = n-1 (where n is the number of observations in the series)


17. What happens if DF is more than 30?

  • As DF increases, the t-distribution gets closer to the normal distribution
  • At low DF, we have fat tails
  • If DF > 30, the t-distribution is practically as good as the normal distribution.


18. When to use t distribution and when to use z distribution?

  • Use the Z-distribution only if both of the following conditions are satisfied: the population standard deviation is known, and the sample size is > 30.
  • In that case: CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
  • Otherwise, we should use the t-distribution:
  • CI = x (bar) – t*s/√n to x (bar) + t*s/√n


19. What are H0 and H1? What are H0 and H1 for the two-tail test?

  • H0 is known as the null hypothesis. It is the normal case/default case.

        For a one-tail test: x <= µ

        For a two-tail test: x = µ

  • H1 is known as the alternate hypothesis. It is the other case.

        For a one-tail test: x > µ

        For a two-tail test: x ≠ µ


20. What is the Degree of Freedom? 

DF is the number of values in a calculation that are free to vary.

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)


21. How to calculate p-Value?

Ans: Calculating p-value:

Using Excel:

  1. Go to the Data tab
  2. Click on Data Analysis
  3. Select Descriptive Statistics
  4. Choose the column
  5. Select summary statistics and confidence level (0.95)

By Manual Method:

  1. Find H0 and H1
  2. Find n, x(bar) and s
  3. Find DF for t-distribution
  4. Find the type of distribution – t or z distribution
  5. Find t or z value (using the look-up table)
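  6. Compute the p-value from the t or z value and compare it to the significance level (equivalently, compare the t or z value to the critical value)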
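
For the z case, the manual method can be sketched with the standard library alone. The example assumes a two-tailed test with a known population standard deviation; the function name and the numbers are hypothetical. A t-based p-value would need a t-table or a library such as SciPy.

```python
import math
from statistics import NormalDist

def two_tailed_p_value(sample_mean, mu0, sigma, n):
    """z statistic and two-tailed p-value for H0: mean = mu0, with sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a value at least this extreme in either tail.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical study: n = 64, sample mean 52, H0 mean 50, known sigma 8.
z, p = two_tailed_p_value(sample_mean=52.0, mu0=50.0, sigma=8.0, n=64)

alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```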


22. What is ANOVA?

Ans: ANOVA, short for analysis of variance, is a statistical technique used to determine whether the means of two or more populations differ, by examining the amount of variation within the samples relative to the amount of variation between the samples. It splits the total variation in the dataset into two parts: the amount ascribed to chance and the amount ascribed to specific causes.

It is a method of analyzing the factors that are hypothesized to affect the dependent variable. It can also be used to study the variations among the different categories within a factor that has numerous possible values. It is of two types:

One-way ANOVA: used when a single factor, having many possible values (categories), is used to investigate the differences between those categories.

Two-way ANOVA: used when two factors are investigated simultaneously to measure how they, and their interaction, influence the values of a variable.
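
To make the "variation between samples versus variation within samples" idea concrete, here is a small sketch that computes the one-way ANOVA F statistic by hand. The three groups are made-up numbers, and the function name is chosen for this example; a statistics library would normally also return the p-value.

```python
import statistics

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)

    # Variation between the groups (ascribed to the factor).
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation within the groups (ascribed to chance).
    ss_within = sum(
        (x - statistics.fmean(group)) ** 2 for group in groups for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three categories of one factor.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F value means the group means differ by more than the within-group noise would explain.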


23.  What is ANCOVA?

Ans: ANCOVA, which stands for Analysis of Covariance, is an extended form of ANOVA that removes the effect of one or more interval-scaled extraneous variables from the dependent variable before the analysis is carried out. It sits midway between ANOVA and regression analysis: one variable can be compared across two or more populations while accounting for the variability of other variables.

When a set of independent variables consists of both a factor (categorical independent variable) and a covariate (metric independent variable), the technique used is known as ANCOVA. Differences in the dependent variable that are due to the covariate are removed by adjusting the dependent variable’s mean value within each treatment condition.

This technique is appropriate when the metric independent variable is linearly associated with the dependent variable and not with the other factors. It is based on certain assumptions:

  • There is some relationship between the dependent and uncontrolled variables.
  • The relationship is linear and is identical from one group to another.
  • Various treatment groups are picked up at random from the population.
  • Groups are homogeneous in variability.


24.  What is the difference between ANOVA and ANCOVA?

Ans: The main differences between ANOVA and ANCOVA are as follows:

  • The technique of testing the means of multiple groups for homogeneity by analyzing variance is known as Analysis of Variance, or ANOVA. A statistical process used to remove the impact of one or more metric-scaled undesirable variables from the dependent variable before undertaking the analysis is known as ANCOVA.
  • ANOVA uses both linear and non-linear models, whereas ANCOVA uses only a linear model.
  • ANOVA entails only categorical independent variables, i.e. factors. ANCOVA, by contrast, encompasses both a categorical and a metric independent variable.
  • A covariate is not taken into account in ANOVA, but is considered in ANCOVA.
  • ANOVA attributes between-group variation exclusively to treatment. In contrast, ANCOVA divides between-group variation into treatment and covariate.
  • ANOVA attributes within-group variation to individual differences, whereas ANCOVA splits within-group variance into individual differences and covariate.


25.  What are t and z scores? Give Details.

T-Score vs. Z-Score: Overview: A z-score and a t-score are both used in hypothesis testing.

T-score vs. z-score: When to use a t-score:

The general rule of thumb is to use a t-score when your sample:

  • has a sample size below 30, or
  • has an unknown population standard deviation.

You must know the standard deviation of the population, and your sample size should be above 30, in order to use the z-score. Otherwise, use the t-score.


Technically, z-scores are a conversion of individual scores into a standard form. The conversion allows you to more easily compare different data. A z-score tells you how many standard deviations from the mean your result is. You can use your knowledge of normal distributions (like the 68-95-99.7 rule) or the z-table to determine what percentage of the population will fall below or above your result.

The z-score is calculated using the formula:

  • z = (X-μ)/σ

where

  • σ is the population standard deviation and
  • μ is the population mean.
  • The z-score formula doesn’t say anything about sample size; the rule of thumb is that your sample size should be above 30 to use it.


Like z-scores, t-scores are also a conversion of individual scores into a standard form. However, t-scores are used when you don’t know the population standard deviation; you estimate it from your sample.

  • T = (X – μ) / [ s/√(n) ]

where

  • X is the sample mean,
  • s is the standard deviation of the sample, and
  • n is the sample size.

If you have a larger sample (over 30), the t-distribution and z-distribution look pretty much the same. 
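
Both formulas are easy to compute directly. The sketch below uses made-up numbers; the function names are chosen for this example only.

```python
import math
import statistics

def z_score(x, mu, sigma):
    """How many population standard deviations x lies from the population mean."""
    return (x - mu) / sigma

def t_score(sample, mu):
    """Distance of the sample mean from mu, in units of the estimated standard error."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (x_bar - mu) / (s / math.sqrt(n))

# Hypothetical test score with known population mean 100 and standard deviation 15.
z = z_score(x=130, mu=100, sigma=15)

# Hypothetical small sample, population standard deviation unknown.
sample = [102, 98, 105, 110, 95, 104, 99, 107]
t = t_score(sample, mu=100)

print(f"z = {z:.2f}, t = {t:.2f}")
```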

To know more about Data Science, Artificial Intelligence, Machine Learning, and Deep Learning programs visit our website

Watch our Live Session Recordings to precisely understand statistics, probability, calculus, linear algebra, and other math concepts used in data science.


To get updates on Data Science and AI Seminars/Webinars – Follow our Meetup group.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R, and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

Model vs Algorithm in ML

Model vs Algorithm in ML: Introduction

Machine learning works with “models” and “algorithms”, and both play an important role: the algorithm describes the process, and the model is what gets built by following that process. So, let’s look more closely at model vs algorithm in ML (machine learning).

Many of these algorithms were derived by statisticians and mathematicians long ago; today they are studied and applied by practitioners for business purposes.

A model in machine learning is nothing but a function that takes some input, performs the operation it has learned through the algorithm, and gives a suitable output.

Some of the machine learning algorithms are:

  1. Linear regression
  2. Logistic regression
  3. Decision tree
  4. Random forest
  5. K-nearest neighbor
  6. K-means clustering

What is an algorithm in Machine learning?

An algorithm is a step-by-step approach, powered by statistics, that guides the machine in its learning process. An algorithm is just one of the several components that constitute a model.

There are several characteristics of machine learning algorithms:

  1. Machine learning algorithms can be represented by the use of mathematics and pseudo code.
  2. The effectiveness of machine learning algorithms can be measured and represented.
  3. With any of the popular programming languages, machine learning algorithms can be implemented.

What is the Model in Machine learning?

The model depends on factors such as feature selection, tuning parameters, and cost functions, along with the algorithm; it is not fully dependent on the algorithm alone.

A model is the result of an algorithm: it is what we get when we implement the algorithm in code and train it on real data. A model tells you what your program learned from the data by following the rules of the algorithm. The model is then used to predict future results based on the patterns the algorithm found in the training data.

Model = Data + Algorithm 

Building a model involves four major steps:

  1. Data preprocessing
  2. Feature engineering
  3. Data management
  4. Performance measurement

How do the model and the algorithm work together in machine learning?

For example:

y = mx + c is the equation of a line, where m is the slope of the line and c is the y-intercept; this is nothing but linear regression with only one variable.
Similarly, decision trees and random forests use something like the Gini index, and K-nearest neighbors uses the Euclidean distance formula.

So take the linear regression algorithm:

  1. Start with a training set with x1, x2,…, and y.
  2. Initialize the parameters c0, c1, c2 with random values.
  3. Choose the learning rate alpha.
  4. Then repeat updates such as c0 = c0 - alpha × (h(x) - y), and similarly for c1 and c2 (with the error h(x) - y multiplied by the corresponding input x1 or x2).
  5. Repeat this process until it converges.
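
The steps above can be sketched for the single-variable case y = c0 + c1·x. This is a minimal illustration, not a production implementation: it averages the error over the whole training set (batch gradient descent), starts from zeros instead of random values so the run is reproducible, and uses a fixed number of iterations in place of a convergence check. The function name, learning rate, and data are assumptions made for this example.

```python
def fit_linear_regression(xs, ys, alpha=0.05, iterations=5000):
    """The ALGORITHM: gradient descent for y = c0 + c1 * x."""
    c0, c1 = 0.0, 0.0  # starting parameters (zeros here for reproducibility)
    m = len(xs)
    for _ in range(iterations):
        # Error h(x) - y for every training example with the current parameters.
        errors = [(c0 + c1 * x) - y for x, y in zip(xs, ys)]
        # Average gradient of the squared error with respect to c0 and c1.
        grad_c0 = sum(errors) / m
        grad_c1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update step scaled by the learning rate alpha.
        c0 -= alpha * grad_c0
        c1 -= alpha * grad_c1
    return c0, c1

# Hypothetical training data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

intercept, slope = fit_linear_regression(xs, ys)

def model(x):
    """The MODEL: the learned parameters plugged into the line equation."""
    return intercept + slope * x

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}, prediction for x=10: {model(10):.4f}")
```

The same `fit_linear_regression` function would run unchanged on any other data set; only the resulting `intercept` and `slope`, and therefore the model, would differ.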

When you employ this algorithm, you follow exactly these 5 steps without changing them; the model is produced by the algorithm, which treats every dataset the same way.

When you apply that algorithm, the model has to find the values of m and c, which we don’t know in advance. How will it find them?
Suppose you have three input variables along with y. Your model will then find the values of m1, m2, and m3, one slope per variable, together with the intercept c.
The model will work with these three slopes and the intercept to fit the dataset and predict future values.

The “algorithm” might be treating all the data the same, but it is the “model” that actually solves the problem. An algorithm is something that you use to train the model on the data.

After building a model, a data science enthusiast tests it to measure its accuracy and fine-tunes it to improve the results.


This article should help you understand the algorithm and the model (Model Vs Algorithm in ML) in machine learning and the relationship between them. In summary, an algorithm is a process or a technique that we follow to get a result or to find the solution to a problem.
A model is a computation or a formula, formed as the output of an algorithm, that takes some input; so you can say that you are building a model using a given algorithm.


Text Stemming In NLP

Human language remains an unsolved problem: there are more than 6,500 languages worldwide. Tons of data are generated every day as we speak, text, and tweet, including voice-to-text on social applications, and to get insights from this text data we need techniques such as text stemming in NLP. There are two types of data: structured and unstructured. Structured data is typically used for machine learning models, while unstructured data is handled with natural language processing. Only about 21% of the available data is structured, so you can estimate how much techniques like text stemming are needed to handle unstructured data.

To get insights from an unstructured dataset, we need to extract the important information from it. An important technique for analyzing text data is text mining: extracting useful information from unstructured data by identifying and exploring patterns in a large amount of text. In other words, text mining is used to convert unstructured data into a structured dataset.

Normalization, lemmatization, stemming, and tokenization are the techniques used in NLP to extract insights from the data.

Now we will see how text stemming works.

Stemming is the process of reducing inflected words to their “root” form, i.e. mapping a group of related words to the same stem. The inflected forms are produced by adding suffixes and prefixes to the root word, which creates grammatical variants of it. Stemming is performed by NLP algorithms known as stemming algorithms or stemmers. A stemming algorithm removes these affixes from the word to recover the stem. For example, eats, eating, and eatery are all built on the root word “eat”, so the stemmer removes s, ing, and ery from these words, leaving the meaning that the sentence is about eating something. Such words are often just different tense forms of a verb.

Text stemming example

This is the general idea to reduce the different forms of the word to their root word.
Words that are derived from one another can be mapped to a base word or symbol, especially if they have the same meaning.
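
The idea can be sketched with a toy suffix-stripping stemmer. This is a deliberately simplified illustration, not an established algorithm such as Porter's: the suffix list, the function name, and the minimum stem length are all assumptions made for this example.

```python
# Hypothetical suffix list, just enough for the "eat" example.
SUFFIXES = ("ery", "ing", "s")

def toy_stem(word, suffixes=SUFFIXES, min_stem_length=3):
    """Strip the longest matching suffix, as long as a reasonable stem remains."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

stems = {word: toy_stem(word) for word in ["eats", "eating", "eatery", "eat"]}
print(stems)
```

All four words map to the stem "eat", which is what lets later processing treat them as the same term.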

Since we cannot be sure that stemming will give a 100% correct result, there are two types of stemming error: over-stemming and under-stemming.

Over-stemming occurs when too much of a word is cut off.
This can produce nonsensical stems, where the meaning of the word is lost, or stems that cannot be distinguished from one another: different words resolve to the same stem when they should differ.

For example, take the four words university, universities, universal, and universe. A stemmer that resolves all four to “univers” is over-stemming. Universal and universe should be stemmed together, and university and universities should be stemmed together, but all four do not belong under a single stem.

Under-stemming: Under-stemming is the opposite of over-stemming. It occurs when we have different words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not.

This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” You might be thinking: well, just resolve both of these to “dat.” However, then what do we do with the word date? And is there a good general rule? That is how under-stemming occurs.
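
Both kinds of error can be reproduced with a naive suffix stripper. The two rule sets below are hypothetical and were picked only to trigger the errors described above; real stemmers use far more careful rules.

```python
def strip_suffix(word, suffixes):
    """Remove the longest matching suffix from the word, if any."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

# Over-stemming: aggressive rules collapse words with different meanings.
AGGRESSIVE_RULES = ("ities", "ity", "al", "e")
over_stemmed = {
    word: strip_suffix(word, AGGRESSIVE_RULES)
    for word in ["university", "universities", "universal", "universe"]
}

# Under-stemming: rules that strip a single final letter leave related words apart.
SINGLE_LETTER_RULES = ("a", "m")
under_stemmed = {
    word: strip_suffix(word, SINGLE_LETTER_RULES) for word in ["data", "datum"]
}

print(over_stemmed)
print(under_stemmed)
```

With these rules all four "univers..." words share one stem, while data and datum end up with the different stems "dat" and "datu".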

Normal and Gaussian Distribution

Gaussian Distribution

The Gaussian distribution is a bell-shaped curve with an equal share of measurements on the right and left sides of the mean. The mean is situated at the centre of the curve; values to the right of it are greater than the mean and values to the left are smaller. It applies to continuous values. You all know the basic meaning of mean, median, and mode: the mean is the average of the values, the median is the centre value of the distribution, and the mode is the value that occurs most frequently. In the normal distribution, the mean, median, and mode are all the same. If the values show skewness, then they are not normally distributed. The normal distribution is very important in statistics because it fits many phenomena such as heights, blood pressure, measurement error, and many other numerical values.

Histogram for normal distribution

The Gaussian and the normal distribution are the same thing in statistical theory; the Gaussian distribution is simply another name for the normal distribution. The curve is drawn using the probability density function (PDF): f(x) is the PDF and x is a value of the Gaussian random variable. The distribution is often used to represent real-valued random variables whose distribution is unknown.

A property of the Gaussian distribution known as the empirical rule tells us what share of values falls within a given number of standard deviations of the mean. The standard normal distribution has mean 0 and standard deviation 1.

Empirical formula

The empirical rule, also referred to as the three-sigma rule or 68-95-99.7 rule, is a statistical rule which states that for a normal distribution, almost all data falls within three standard deviations (denoted by σ) of the mean (denoted by µ). Broken down, the empirical rule shows that about 68% falls within the first standard deviation (µ ± σ), about 95% within the first two standard deviations (µ ± 2σ), and about 99.7% within the first three standard deviations (µ ± 3σ).
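
The three percentages can be checked numerically with the standard library. This is a small sketch for the standard normal distribution; the helper name is chosen for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```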

Python code for plotting the gaussian graph:

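import math

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Parameters of the standard normal distribution
mu = 0
variance = 1
sigma = math.sqrt(variance)

# 100 evenly spaced points covering three standard deviations on each side of the mean
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)

# Plot the probability density function and display the figure
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()

gaussian graph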

The above code plots the Gaussian distribution with mean 0 and standard deviation 1 over the range µ ± 3σ, which covers about 99.7% of the distribution.
