Call WhatsApp Enquiry

Category: Data Science

Clustering & Types Of Clustering

Clustering is the process of finding similar groups in data, called a cluster. It groups data instances that are similar to each other in one cluster and data instances that are very different(far away) from each other into different clusters. A cluster is, therefore, a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.

The method of identifying similar groups of data in a dataset is called clustering. It is one of the most popular techniques in data science. Entities in each group and is comparatively more similar to entities of that group than those of the other groups. In this article, I will be taking you through the types of clustering, different clustering algorithms and a comparison between two of the most commonly used clustering methods.

Steps involved in Clustering analysis:

1. Formulate the problem – select variables to be used for clustering.

2. Decide the clustering procedure whether it will be Hierarchical or Non-Hierarchical.

3. Select the measure of similarity or dissimilarity.

4. Choose clustering algorithms.

5. Decide the number of clusters.

6. Interpret the cluster output(profile the clusters).

7. Validate the clusters.

Types of clustering technique:

Broadly speaking, clustering can be divided into two subgroups :

  • Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example, each customer is put into one group out of the 10 groups.
  • Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned. For example, from the above scenario, each customer is assigned a probability to be in either of 10 clusters of the retail store.

Types of clustering are:

k-means clustering:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-Means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can, for example, be found using k-medians and k-medoids.

K means is an iterative clustering algorithm that aims to find local maxima in each iteration. This algorithm works in these 5 steps :

  1. Specify the desired number of clusters K : Let us choose k=2 for these 5 data points in 2-D space.
  2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1 shown using red color and two points in cluster 2 shown using grey color.
  3. Compute cluster centroids: The centroid of data points in the red cluster is shown using a red cross and those in a grey cluster using the grey cross.
  4. Re-assign each point to the closest cluster centroid: Note that only the data point at the bottom is assigned to the red cluster even though its closer to the centroid of the grey cluster. Thus, we assign that data point into a grey cluster
  5. Re-compute cluster centroids: Now, re-computing the centroids for both the clusters.
  6. Repeat steps 4 and 5 until no improvements are possible: Similarly, we’ll repeat the 4th and 5th steps until we’ll reach global optima. When there will be no further switching of data points between two clusters for two successive repeats. It will mark the termination of the algorithm if not explicitly mentioned.

from pandas import DataFrame
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7] }
df = DataFrame(Data,columns=['x','y'])
print (df) 

k-means for cluster=3

from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7] }
df = DataFrame(Data,columns=['x','y'])
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
plt.scatter(df['x'], df['y'], c= kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50) K-Means Clustering in Python
Hierarchical Clustering: 

Hierarchical clustering, as the name suggests is an algorithm that builds the hierarchy of clusters. This algorithm starts with all the data points assigned to a cluster of their own. Then two nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using the dendrogram. The dendrogram can be interpreted as:

Two important things that you should know about hierarchical clustering are:

  • This algorithm has been implemented above using a bottom-up approach. It is also possible to follow the top-down approach starting with all data points assigned in the same cluster and recursively performing splits till each data point is assigned a separate cluster.
  • The decision of merging two clusters is taken on the basis of closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters :
    • Euclidean distance: ||a-b||2 = √(Σ(ai-bi))
    • Squared Euclidean distance: ||a-b||22 = Σ((ai-bi)2)
    • Manhattan distance: ||a-b||1 = Σ|ai-bi|
    • Maximum distance:||a-b||INFINITY = maxi|ai-bi|
    • Mahalanobis distance: √((a-b)T S-1 (-b))   {where, s : covariance matrix}

import numpy as np
X = np.array([[5,3],
import matplotlib.pyplot as plt
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.scatter(X[:,0],X[:,1], label='True Position')
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
xy=(x, y), xytext=(-3, 3),
textcoords='offset points', ha='right', va='bottom')


Data point plot

from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

linked = linkage(X, 'single')
labelList = range(1, 11)
plt.figure(figsize=(10, 7))

Dendrogram plot

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.





Know The Best Strategy To Find The Right Data Science Job in Delhi?

Data science careers are buzzing everywhere, and so the data science courses. It’s true that data science salaries are too lucrative and offer sample scopes of career growth. But the majority of candidates struggle a lot to grab the right data science job after competing in their data science courses. After Bengaluru, Mumbai, Hyderabad, and Chennai, Delhi will be the next promising destination for data science aspirants. In this blog, I’ll discuss the best strategy for grabbing the right data science job in Delhi and a brief understanding of the growth orientation of the data science salary in India.

Is data science a good career in India?

We always keep our concerned eyes on the 1st world countries job market and keep regretting the lack of opportunities in our own country. In some cases, this becomes a very hard truth that our country lacks job opportunities and growth, but if it comes to data science, then India is also proudly participating in the data science advancement race.

According to the Analytics Insight survey, by the mid of 2025, India will experience a huge data science job boom. It’s expected that the number of data science and associated job vacancies at that time in India will be around 1,37,630. The Indian job market has already experienced massive demand for a data scientist in the first phase of 2021. Even after the pandemic effect, 50,000 data science, AI, and ML job vacancies have been filled from 2020 January to May 2021. So, there is no confusion that the data science discipline is holding a promising option as a future proof career in India.

What is the data science salary in India?

According to the data available in Glassdoor (as of June 15, 2021), the average data scientist salary in India have already reached the figure of 10,00,000 INR/ year with a lower limit of 4,00,000 INR/ year (freshers) and a higher limit of 20,98,000 INR/ year (for senior-level). In the case of the other subdomains of data science, such as machine learning engineers, AI experts, deep learning experts, India’s companies offer more lucrative packages.

And not only the MNCs but SMEs are also stepping forward to invest in sky-high salary packages for data science professionals.

Is data science in demand in Delhi?

Now let’s enter into our core topic. What is the position of data science skill demand in Delhi?

According to the Linkedin job search, including all sub-domain like ML, AI, data analytics, etc., around 2000, data science jobs are now available in Delhi. At the same time, Naukri has listed an additional 4800 data science job approx.

If you search for the salary insight of data science in Delhi, then you will land on a result that indicates the average yearly salary of 10,10,000 INR. While for senior roles, the figure easily reaches 16,31,000 INR. (Source: Glassdoor Salary insight).

Which companies keep hiring a data scientist around the year in Delhi?

Below are the companies that keep hiring data science professionals of different expertise levels throughout the year in Delhi.

These are the top companies of Delhi location that offer lucrative salaries and career opportunity growth and keep recruiting a data scientist (not in bulk) 365 days a year. Apart from these, there are plenty of other options for data scientists and ML engineers in Delhi.

To find the right data science job in Delhi?

Delhi is indeed growing very rapidly in terms of job opportunities but compared to the three prime locations, Mumbai, Bangalore, and Hyderabad, digging out the opportunities is a bit hard in Delhi. But that does not mean the capital of India lacks data science job opportunities. Rather, if you follow the right strategy of job searching, you can land on the best data science opportunities in this location of India.
Let’s explore the 6-step data science job searching strategy to grab the first data science job in Delhi.

  1. Target the right Job title
  2. Typing ‘data science job’ in the job search bar and hitting ‘enter’ is the biggest and most common mistake related to the data science job search.

    The keyword ‘Data science’ indicates the entire data science domain, but while searching for a job, you need to focus on specific job roles like

    • Data scientist
    • Data analyst
    • Machine learning engineer
    • AI expert
    • Business intelligence analyst
    • Marketing data analyst
    • Database administrator, etc.

    To land on the appropriate list of available job opportunities, you need to target your job title first.
    Apart from this, to make sure your profile gets shortlisted for the interview, check the job description and skills required section before applying. Applying randomly doesn’t increase the chances of getting a job. Rather continuous rejection due to relevant skill lack might discourage you.

  3. Don’t roam across different domains.
  4. The Data science job field is highly domain-specific. Even for freshers candidates, it is always recommended to study data science, keeping a specific domain in mind.

    At present, about 70% of data science candidates remain associated with career switch. Even such candidates are very high on demand. But why so?

    Well, data science is not a completely new domain. Rather, it’s such a discipline that introduced magical, rapid, and sky-kissing advancement across all types of industries like BFSI, Health and Social CareMarketing and sales, FMCG, and so on.

    Hence every data science job roles demand appealing domain expertise in terms of

    • Core working concept
    • Domain-specific business theories and postulates
    • Customised working strategies
    • Dynamic trends
    • Special skills like extremely proficient time management or highly polished communication skills, extraordinary negotiation skills etc.

    In case you switch for the domain, then you will lack in the above-mentioned expertise aspect, which seems too harmful to your data science career initiation. Hence Stick to your domain and target for an associated data science job role.

    For example, you have been working in the FMCG industry as a marketing executive. While switching to a data science career, your target should be securing a marketing data analyst or BI analyst career only in FMCG companies.

  5. Invest sufficient time in making your online portfolio and CV
  6. No matter how credible your skill sets are or how unique your capstone project. The shortlisting for your CV, as well as visibility of your online portfolio to the right recruiters and talent acquisition team itself, undergoes several data analytics.

    Yes. Starting from possibilities of your profile view to resuming selection includes automated keywords matching processes. The associated AI-powered data analytics tools select the profiles based on keyword research. Hence to ensure the higher chances of your profile visibility and resume selection, you need to describe your skill sets and domain experience using the exact keywords that recruiters use. While making the online profile and portfolio, keep the following things in mind.

    • Make your profile to the point.
    • Mention only those skills that are relevant to your targeted job role and you own in reality. (always be loyal in this regard).
    • Keep it more important to list your working experience, hands-on achievements rather than academic achievements.
    • Mention your project in the resume briefly and provide an elaborated (but to the point) description of the same in your project portfolio.
    • Your online resume must contain information about your specific requirements such as location, work-timing, etc.
    • For insane, as you are searching for a data scientist job in Delhi, set the preferred location as Delhi only. This will help you to find a customised job opening based on the Delhi location.

  7. Don’t be conventional regarding job board choosing
  8. What are the first few names that come to your mind while someone discusses a job search? Linkedin, Naukri, Glassdoor, Indeed, etc. Right?

    No doubt these are the most popular and exposed job searching platforms, and securing the right job from such a platform, especially when you are going to grab your first data science job, will be too tough. As mentioned, these platforms are extensively exposed, so the competition per job post remains too high. Such platforms are a better option for the expertise and senior-level candidates. So, are there no chances for data science new bees like you?
    Well. Now I am going to tell you the biggest secret that most data science aspirants don’t know.

    The field of data science has its own dedicated job boards, where you can find the right job as per your domain specifications, locations, and years of working experience. Even the majority of MNCs nowadays have stopped using generic recruiting sites like Linkedin, Naukri for filling up their various data science positions. Rather, they post their vacancies on the job boards dedicated to data science. Below are a few examples of such job boards.

    • Outer Join
    • Analytics Vidhya
    • Kaggle Jobs
    • Github Jobs

    Apart from these sites, parallelly, you need to keep your eyes on the dedicated career portals of your targeted companies. The best options in this regard are to join the Linkedin and other social media groups of those companies. You can even find location-specific groups too.
    Such groups will provide you with the present as well as upcoming data science opportunities of respective companies.

  9. Target the designation as per your experience level
  10. Switching to a data career does not mean initiation of a fresh career restart. Rather, it is a kind of career up-gradation.

    So if you are already at the leadership level, then don’t target for a normal BI analyst, marketing analyst role. Rather target for leadership and managerial level in the data science field too.

    At present, data science is offering equal opportunities to all aspirants from variable working experiences. And especially in the case of leadership positions, the data science domain is suffering from a talent shortage. So to land on the right job that you actually deserve, target the higher or at least the similar level designation.

    But keep in mind to grab the right job, you need to be very cautious from the initial state of your data science career transition trajectory. The data science course you choose must be according to your experience level. This is the key to grab the right data science job at the earliest.

So, what’s next?

If you need personalised career guidance for a data science career switch, you can contact Learbay. We are providing data science IBM certified AI, ML, BI analyst and other data science courses in Delhi.
Each of our course modules is designed according to the work experience and domain experience of the candidates. Instead of providing generalised data science training, we have different courses for candidates with different degrees of working experience. Not only that, all of our courses include a live industrial capstone project that will be done directly from any product based MNCs in Delhi.

To know more, get the latest update about our courses, blogs, and data science tricks and tips, follow us on: LinkedIn, Twitter, Facebook, Youtube, Instagram, Medium.

Investing 3 lakhs on Data science Certification Course? Is it really worth it?

Should a working professional invest 2-3 lakhs on Data science Certification Course?

The world of data science comes with endless possibilities. With the advancement of time the scope of data science career is getting extremely rewarding. Data scientists, artificial intelligence and machine learning engineers are high in demand. Not only the fresher, but also the working professionals are becoming crazy about data science career transition. The craze has reached such a level, where professionals are ready to invest 2-3 lac in pursuing data science courses or its certifications.
Are you also going to do the same? If so, then please hold back your application for a few minutes and read this post, then decide.
Nothing is wrong in investing in data science career transformation. Rather, it’s an intelligent decision but doubt comes with the investment amount. 2 to 3 lakhs. Is this investment really worth it? Certainly, ‘no’.
Certification is the key for a successful career switch to data science career switch: Myths Vs Facts.
Lots of certification, master degree programs on data science advertisements comes throughout the professional network sites, social media sites, and rode-side hoardings. Massiveness of data science course promotions are making everyone believe that certification is must to shift your domain into data science.
But the fact is this is nothing but a myth. Yes, as a working professional, certification can never be the entry gate of your data science career. Instead, at this ‘level ‘hands on experience’ becomes the key to your data science career.
Is a data science course or certification a complete waste?
The answer is ‘yes’ and ‘no’ at the same time.
Getting confused?
Well, let me explain.
Perusing a data science course is too worthy if it makes you competent in the data scientist Job market . But the same becomes a complete waste of money if it makes you only knowledgeable, not job ready.
Remember, you are going to shift your career toward the data since domain, not starting a new career.
Your goal is to get a hike not getting an entry level job in the data science domain. So, to ensure the maximum possible return on investment, choose such a course of certification that makes you a successful competitor of the current data science job market.
How to choose the right data science course for you?
To choose the right course you need look into following aspects:

    • Course Curriculum: There is no defined, universal module for data science certification/ Master degree program. Every institution and universities build up their own course on the basis of contemporary market demands and upcoming scopes. So, you should be very cautious while choosing such a course.
      Check out for the course that offers in-depth learning options for programming languages and analytical tools like python, R, java, SAS, SPSS, mathematical and statistical modules like numpy, pandas, Matplotlib, and algorithms on demands. As you are at the intermediate level of your career, dive deep into the programming and algorithm.
      The basic courses of data science remain limited to the entry level projects and data analysis. So as a professional choose such a course that includes k-means algorithm, word frequency algorithm for NLP sentiment analysis, ARIMA model associated with machine learning, Tensorflow, CNN associated with deep learning.

    • Timing and class type: Being a working professional, it’s obvious that you can’t opt for full time courses. So choose courses that offer flexible timing. Live classes (online/offline) are always best but if it’s impossible to commit for scheduled classes, then choose a flexible one that offers both recorded and live classes options. If you enjoy offline learning choose courses offering weekend classes. But keep in mind, your learning should not hamper your present job.
    • Project experience: If your chosen course is not offering any real-time data science project option, immediately discard it. Companies only search for candidates having hands-on project experience. As a working professional, experience is everything for your next job. Some institutions let you practice your data science skills on a few completed projects. Be cautious in this regard. Before joining any data science course verify the offered projects are real time or not. Choose only that course, where you will get to work on hands-on industry projects. No matters if the projects are from MNC or startups. If you can manage time then choose a course with a part-time internship.
    • Throughout assistance: Being a dynamic field, data science needs more personalized assistance. As there is no domain limitation in data science, your chosen course must fit your targeted domain. Doing an investment on a generalized course is nothing but wasting your hard earned money. A valuable data science course assists you with domain specific interview questions, mock tests, and interview calls from growing companies.
    • Certification/ non certification courses: As mentioned earlier, certificates become only a decorative entity for a working professional’s CV. So don’t run after certification courses, rather you can choose any non certification course that really benefits your next job application in the field of data science. If you are already working in a core technical domain and own an impressive amount of python, R, java, etc, then you can choose a specific course like Tensorflow, a machine learning algorithm that will fill up the gap between your current job and targeted data science jobs.

How much money should you invest in a data science course?
Here comes the final answer. Up to 80k INR investment is fair enough to crack a promising career transformation. Yes, it’s true. Because, the main goal of doing a data science course is to upgrade your current experience to such a stage that will let you enter into the world of data science with a good hike.
You don’t need to master every subdomain of data science, in fact it’s impossible. Rather you need to learn and up-skill yourself in the data science subdomain of your interest or offer huge possibilities with respect to your present experience….and yes, again, the first priority of real-time industry projects.
Fulfilling above criteria doesn’t need investments of 2 to 3 lacs INR. Rather, plenty of promising and reliable online and offline courses are available that can make you highly competent in the data science and AI job market by investing 40k to 90K INR.
You can check Data science and AI courses offered by Learnbay. They offer customized courses for candidates of every working experience level. Their courses cost between 59,000 INR and 75,000 INR (without taxes). The top most benefits for their courses are multiple real-time industry projects with IBM, Amazon, Uber, Rapido, etc. You will get a change to work on your domain specific projects. They offer both in class (online/offline), and recorded session video classes.
Best of Luck ☺.

Win the COVID-19

If you slightly change your perspective towards the lock-down situation you can find hope of this pandemic to end and can hope of a brighter than ever future. Go for Data Science, it will be worth it.

Exploratory Data Analysis on Iris dataset

What is EDA?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

It is always good to explore and compare a data set with multiple exploratory techniques. After the exploratory data analysis, you will get confidence in your data to point where you’re ready to engage a machine learning algorithm and another benefit of EDA is to the selection of feature variables that will be used later for Machine Learning.
In this post, we take Iris Dataset to get the process of EDA.

Importing libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt Loading the Iris data iris_data= pd.read_csv("Iris.csv")  Understand the data: iris_data.shape
setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64 iris_data.columns() Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species'],dtype='object') 1D scatter plot of the iris data: iris_setso = iris.loc[iris["species"] == "setosa"];
iris_virginica = iris.loc[iris["species"] == "virginica"];
iris_versicolor = iris.loc[iris["species"] == "versicolor"];
plt.plot(iris_setso["petal_length"],np.zeros_like(iris_setso["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"],np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"],np.zeros_like(iris_virginica["petal_length"]), 'o')
plt.grid()   2D scatter plot: iris.plot(kind="scatter",x="sepal_length",y="sepal_width")  2D scatter plot with the seaborn library : import seaborn as sns
sns.FacetGrid(iris,hue="species",size=4) \
.map(plt.scatter,"sepal_length","sepal_width") \


  • Blue points can be easily separated from red and green by drawing a line.
  • But red and green data points cannot be easily separated.
  • Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
  • Separating Versicolor from Viginica is much harder as they have considerable overlap.

Pair Plot:

A pairs plot allows us to see both the distribution of single variables and relationships between two variables. For example, let’s say we have four features ‘sepal length’, ‘sepal width’, ‘petal length’ and ‘petal width’ in our iris dataset. In that case, we will have 4C2 plots i.e. 6 unique plots. The pairs, in this case, will be :

  •  Sepal length, sepal width
  • sepal length, petal length
  • sepal length, petal width
  • sepal width, petal length
  • sepal width, petal width
  • petal length, petal width

So, here instead of trying to visualize four dimensions which are not possible. We will look into 6 2D plots and try to understand the 4-dimensional data in the form of a matrix.



  1. petal length and petal width are the most useful features to identify various flower types.
  2. While Setosa can be easily identified (linearly separable), virginica and Versicolor have some overlap (almost linearly separable).
  3. We can find “lines” and “if-else” conditions to build a simple model to classify the flower types.

Cumulative distribution function:

iris_setosa = iris.loc[iris["species"] == "setosa"];
iris_virginica = iris.loc[iris["species"] == "virginica"];
iris_versicolor = iris.loc[iris["species"] == "versicolor"];
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density = True)
pdf = counts/(sum(counts))
>>>[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]
>>>[1.   1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf) 

Mean, Median, and Std-Dev:

print(np.std(iris_versicolor["petal_length"])) OutPut: - Means: 1.464 2.4156862745098038 5.5520000000000005 4.26


Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

Random forest model(RFM)

Random Forest Model:

The random forest model is also a classification model with the combination of the decision tree. The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates the forest with several trees. … In the same way in the random tree classifier, the higher the number of trees in the forest gives the high the accuracy results. If you know the Random forest algorithm is a supervised classification algorithm.
The random forest model follows an ensemble technique. It involves constructing multi decision trees at training time. Its prediction based on mode for classification and mean for regression tree. It helps to reduce the overfitting of the individual decision tree. There are many possibilities for the occurrence of overfitting.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of following steps −

  • Step 1 − First, start with the selection of random samples from a given dataset. Do sampling without replacement.

Flowchart of Working of Random Forest Algorithm

Sampling without replacement stats that the training data split into several small samples and then the result we get is a combination of all the data set. If we have 1000 features in a data set the splitting will happen with 10 features each in a small training data and all split training data contains equal no of features. The result is based on which training data has the highest value.

  • Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
  • Step 3 − In this step, voting will be performed for every predicted result.
    • Based on ‘n’ samples… ‘n’ tree is built
    • Each record is classified based on the n tree
    • The final class for each record is decided based on voting

Step 4 − At last, select the most voted prediction result as the final prediction result.

What is the Out of Bag score in Random Forests?

Out of bag (OOB) score is a way of validating the Random forest model. Below is a simple intuition of how is it calculated followed by a description of how it is different from the validation score and where it is advantageous.

For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble labeled from 1 to 5. For simplicity, suppose we have a simple original training data set as below.

OOB Error Rate Computation Steps

  • Sample left out (out-of-bag) in Kth tree is classified using the Kth tree
  • Assume j cases are misclassified
  • The proportion of time that j is not equal to true class averaged over all cases is the OOB error rate.

Variable importance of RF: 

It stats about the feature that is most useful for the random forest model by which we can get the high accuracy of the model with less error.

  • Random Forest computes two measures of Variable Importance
    • Mean Decrease in Accuracy
    • Mean Decrease in Gini
  • Mean Decrease in Accuracy is based on permutation
    • Randomly permute values of a variable for which importance is to be computed in the OOB sample
    • Compute the Error Rate with permuted values
    • Compute decrease in OOB Error rate (Permuted- Not permuted)
    • Average the decrease overall the trees
  • Mean Decrease in Gini is computed as a “total decrease in node impurities from splitting on the variable averaged over all trees”.

Finding the optimal values using grid-search cv:

It stats the optimal values of the splitting decision tree that how many trees to be split within the model.

Measuring RF model performance by Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. It tells about how many true values are true.

Random Forest with python: 

Importing the important libraries–

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz

Read the data from csv

dummy_df = pd.read_csv("bank.csv", na_values =['NA'])
temp = dummy_df.columns.values[0] temp

Data Pre-Processing:

columns_name = temp.split(';')
data = dummy_df.values
contacts = list()
for element in data:
contact = element[0].split(';')

contact_df = pd.DataFrame(contacts,columns = columns_name)
def preprocessor(df):
res_df = df.copy()
le = preprocessing.LabelEncoder()

 encoded_df = preprocessor(contact_df)
#encoded_df = preprocessor(contacts)
x = encoded_df.drop(['"y"'],axis =1).values
y = encoded_df['"y"'].values

Split the data into Train-Test

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)

Build the Decision Tree Model

# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2), y_train)
model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ",model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ",model_dt_2_score_test)
#y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1] #Decision tree

model_dt = DecisionTreeClassifier(max_depth = 8, criterion ="entropy"), y_train)
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]

Graphical Representation of Tree

dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Performance Metrics

fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)
# Model Accuracy
print (model_dt.score(x_test, y_test))
y_actual_result = y_test[0] for i in range(len(predictions)):
if(predictions[i] == 1):
y_actual_result = np.vstack((y_actual_result, y_test[i]))


y_actual_result = y_actual_result.flatten()
count = 0
for result in y_actual_result:
if(result == 1):
print ("true yes|predicted yes:")
print (count/float(len(y_actual_result)))

Area Under the Curve

lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")

graph of Area Under the Curve

Confusion Matrix

print (confusion_matrix(y_test, predictions))
accuracy_score(y_test, predictions)
import itertools
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(model, normalize=False): # This function prints and plots the confusion matrix.
cm = confusion_matrix(y_test, model, labels=[0, 1])
classes=["Success", "Default"] cmap =
title = "Confusion Matrix"
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] cm = np.around(cm, decimals=3)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
color="white" if cm[i, j] > thresh else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')

plot_confusion_matrix(predictions, normalize=False)

Confusion Matrix

Pruning of the tree

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
if inner_tree.value[index].min() < threshold:
# turn node into a leaf by "unlinking" its children
inner_tree.children_left[index] = TREE_LEAF
inner_tree.children_right[index] = TREE_LEAF
# if there are shildren, visit them as well
if inner_tree.children_left[index] != TREE_LEAF:
prune_index(inner_tree, inner_tree.children_left[index], threshold)
prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
sum(model_dt.tree_.children_left < 0)

#It means that the code has created 17 new leaf nodes
#(by practically removing links to their ancestors). The tree, which has looked before like

from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

Everything About Data Preprocessing

Data Preprocessing:

Introduction to Data Preprocessing:- Before modeling the data we need to clean the data to get a training sample for the modeling. Data preprocessing is a data mining technique that involves transforming the raw data into an understandable format. It provides the technique for cleaning the data from the real world which is often incomplete, inconsistent, lacking accuracy and more likely to contain many errors. Preprocessing provides a clean the data before it gets to the modeling phase.

Preprocessing of data in a stepwise fashion in scikit learn.

1.Introduction to Preprocessing:

  • Learning algorithms have an affinity towards a certain pattern of data.
  • Unscaled or unstandardized data have might have an unacceptable prediction.
  • Learning algorithms understand the only number, converting text image to number is required.
  • Preprocessing refers to transformation before feeding to machine learning.

2. StandardScaler

  • The StandardScaler assumes your data is normally distributed within each feature and will scale them such that the distribution is now centered around 0, with a standard deviation of 1.
  • Calculate – Subtract mean of column & div by the standard deviation
  • If data is not normally distributed, this is not the best scaler to use.

StandardScaler Formula

3. MinMaxScaler

  • Calculate – Subtract min of column & div by the difference between max & min
  • Data shifts between 0 & 1
  • If distribution not suitable for StandardScaler, this scaler works out.
  • Sensitive to outliers.

MinMaxScaler Formula

4. Robust Scaler

  • Suited for data with outliers
  • Calculate by subtracting 1st-quartile & div by difference between 3rd-quartile & 1st-quartile.

Robust Scaler Formula

5. Normalizer

  • Each parameter value is obtained by dividing by magnitude.
  • Enabling you to more easily compare data from different places.

Normalizer Formula

6. Binarization

  • Thresholding numerical values to binary values ( 0 or 1 )
  • A few learning algorithms assume data to be in Bernoulli distribution – Bernoulli’s Naive Bayes

7. Encoding Categorical Value

  • Ordinal Values – Low, Medium & High. Relationship between values
  • LabelEncoding with the right mapping

8. Imputation

  • Missing values cannot be processed by learning algorithms
  • Imputers can be used to infer the value of missing data from existing data

9. Polynomial Features

  • Deriving non-linear feature by converting data into a higher degree
  • Used with linear regression to learn a model of higher degree

10. Custom Transformer

  • Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing.
  • FunctionTransformer is used to create one Transformer
  • validate = False, is required for the string column.

11. Text Processing

  • Perhaps one of the most common information
  • Learning algorithms don’t understand the text but only numbers
  • Below methods convert text to numbers

12. CountVectorizer

  • Each column represents one word, count refers to the frequency of the word
  • A sequence of words is not maintained


  • n_grams – Number of words considered for each column
  • stop_words – words not considered
  • vocabulary – only words considered

13. TfIdfVectorizer

  • Words occurring more frequently in a doc versus entire corpus is considered more important
  • The importance is on the scale of 0 & 1

14. HashingVectorizer

  • All the above techniques convert data into a table where each word is converted to column
  • Learning on data with lakhs of columns is difficult to process
  • HashingVectorizer is a useful technique for out-of-core learning
  • Multiple words are hashed to limited column
  • Limitation – Hashed value to word mapping is not possible

15. Image Processing using skimage

  • skimage doesn’t come with anaconda. install with ‘pip install skimage’
  • Images should be converted from 0-255 scale to 0-1 scale.
  • skimage takes image path & returns numpy array
  • images consist of 3 dimensions.

You could be a pro in Data Science by Self Assisting

Learning Data Science is little tricky but here you may find something important!

Differentiating Data Scientist and Data Analyst

There is a pensive difference between Data Scientist and Data Analyst. It is so much interesting to know about them all.

Customer Experience Enhancement In Banks

Customers are the main asset for banks, their comfort and trust is the foremost essential thing. Know how banks maintain their customers convenience with the help of technology.

#iguru_button_61a10e24b22dc .wgl_button_link { color: rgba(255,255,255,1); }#iguru_button_61a10e24b22dc .wgl_button_link:hover { color: rgba(255,255,255,1); }#iguru_button_61a10e24b22dc .wgl_button_link { border-color: transparent; background-color: rgba(255,149,98,1); }#iguru_button_61a10e24b22dc .wgl_button_link:hover { border-color: rgba(230,95,42,1); background-color: rgba(253,185,0,1); }#iguru_button_61a10e24b378f .wgl_button_link { color: rgba(255,255,255,1); }#iguru_button_61a10e24b378f .wgl_button_link:hover { color: rgba(255,255,255,1); }#iguru_button_61a10e24b378f .wgl_button_link { border-color: rgba(218,0,0,1); background-color: rgba(218,0,0,1); }#iguru_button_61a10e24b378f .wgl_button_link:hover { border-color: rgba(218,0,0,1); background-color: rgba(218,0,0,1); }#iguru_button_61a10e24b7800 .wgl_button_link { color: rgba(241,241,241,1); }#iguru_button_61a10e24b7800 .wgl_button_link:hover { color: rgba(250,249,249,1); }#iguru_button_61a10e24b7800 .wgl_button_link { border-color: rgba(102,75,196,1); background-color: rgba(48,90,169,1); }#iguru_button_61a10e24b7800 .wgl_button_link:hover { border-color: rgba(102,75,196,1); background-color: rgba(57,83,146,1); }#iguru_soc_icon_wrap_61a10e24c9f67 a{ background: transparent; }#iguru_soc_icon_wrap_61a10e24c9f67 a:hover{ background: transparent; border-color: #3aa0e8; }#iguru_soc_icon_wrap_61a10e24c9f67 a{ color: #acacae; }#iguru_soc_icon_wrap_61a10e24c9f67 a:hover{ color: #ffffff; }#iguru_soc_icon_wrap_61a10e24c9f67 { display: inline-block; }.iguru_module_social #soc_icon_61a10e24c9f971{ color: #ffffff; }.iguru_module_social #soc_icon_61a10e24c9f971:hover{ color: #ffffff; }.iguru_module_social #soc_icon_61a10e24c9f971{ background: #44b1e4; }.iguru_module_social #soc_icon_61a10e24c9f971:hover{ background: #44b1e4; }
Get The Learnbay Advantage For Your Career
Note : Our programs are suitable for working professionals(any domain). Fresh graduates are not eligible.
Overlay Image
Note : Our programs are suitable for working professionals(any domain). Fresh graduates are not eligible.