Call WhatsApp Enquiry

Random forest model(RFM)

Random Forest  Model:

The random forest model is also a classification model with the combination of the decision tree. The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates the forest with several trees. … In the same way in the random tree classifier, the higher the number of trees in the forest gives the high the accuracy results. If you know the Random forest algorithm is a supervised classification algorithm.
The random forest model follows an ensemble technique. It involves constructing multi decision trees at training time. Its prediction based on mode for classification and mean for regression tree. It helps to reduce the overfitting of the individual decision tree. There are many possibilities for the occurrence of overfitting.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of following steps −

  • Step 1 − First, start with the selection of random samples from a given dataset. Do sampling without replacement.

Sampling without replacement stats that the training data split into several small samples and then the result we get is a combination of all the data set. If we have 1000 features in a data set the splitting will happen with 10 features each in a small training data and all split training data contains equal no of features. The result is based on which training data has the highest value.

  • Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
  • Step 3 − In this step, voting will be performed for every predicted result.
    • Based on ‘n’ samples… ‘n’ tree is built
    • Each record is classified based on the n tree
    • The final class for each record is decided based on voting

Step 4 − At last, select the most voted prediction result as the final prediction result.

What is the Out of Bag score in Random Forests?

Out of bag (OOB) score is a way of validating the Random forest model. Below is a simple intuition of how is it calculated followed by a description of how it is different from the validation score and where it is advantageous.

For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble labeled from 1 to 5. For simplicity, suppose we have a simple original training data set as below.

OOB Error Rate Computation Steps

  • Sample left out (out-of-bag) in Kth tree is classified using the Kth tree
  • Assume j cases are misclassified
  • The proportion of time that j is not equal to true class averaged over all cases is the OOB error rate.

Variable importance of RF: 

It stats about the feature that is most useful for the random forest model by which we can get the high accuracy of the model with less error.

  • Random Forest computes two measures of Variable Importance
    • Mean Decrease in Accuracy
    • Mean Decrease in Gini
  • Mean Decrease in Accuracy is based on permutation
    • Randomly permute values of a variable for which importance is to be computed in the OOB sample
    • Compute the Error Rate with permuted values
    • Compute decrease in OOB Error rate (Permuted- Not permuted)
    • Average the decrease overall the trees
  • Mean Decrease in Gini is computed as a “total decrease in node impurities from splitting on the variable averaged over all trees”.

Finding the optimal values using grid-search cv:

It stats the optimal values of the splitting decision tree that how many trees to be split within the model.

Measuring RF model performance by Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. It tells about how many true values are true.

Random Forest with python: 

Importing the important libraries–

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz

Read the data from csv

dummy_df = pd.read_csv("bank.csv", na_values =['NA'])
temp = dummy_df.columns.values[0] temp
print(dummy_df)

Data Pre-Processing:

columns_name = temp.split(';')
data = dummy_df.values
print(data)
print(data.shape)
contacts = list()
for element in data:
contact = element[0].split(';')
contacts.append(contact)

contact_df = pd.DataFrame(contacts,columns = columns_name)
print(contact_df)
def preprocessor(df):
res_df = df.copy()
le = preprocessing.LabelEncoder()

 encoded_df = preprocessor(contact_df)
#encoded_df = preprocessor(contacts)
x = encoded_df.drop(['"y"'],axis =1).values
y = encoded_df['"y"'].values

Split the data into Train-Test

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)

Build the Decision Tree Model

# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2)
model_dt_2.fit(x_train, y_train)
model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ",model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ",model_dt_2_score_test)
#y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1] #Decision tree

model_dt = DecisionTreeClassifier(max_depth = 8, criterion ="entropy")
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]

Graphical Representation of Tree

plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Performance Metrics

fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)
# Model Accuracy
print (model_dt.score(x_test, y_test))
y_actual_result = y_test[0] for i in range(len(predictions)):
if(predictions[i] == 1):
y_actual_result = np.vstack((y_actual_result, y_test[i]))

Recall

#Recall
y_actual_result = y_actual_result.flatten()
count = 0
for result in y_actual_result:
if(result == 1):
count=count+1
print ("true yes|predicted yes:")
print (count/float(len(y_actual_result)))

Area Under the Curve

plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()

Confusion Matrix

print (confusion_matrix(y_test, predictions))
accuracy_score(y_test, predictions)
import itertools
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(model, normalize=False): # This function prints and plots the confusion matrix.
cm = confusion_matrix(y_test, model, labels=[0, 1])
classes=["Success", "Default"] cmap = plt.cm.Blues
title = "Confusion Matrix"
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] cm = np.around(cm, decimals=3)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.figure(figsize=(6,6))
plot_confusion_matrix(predictions, normalize=False)
plt.show()

Pruning of the tree

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
if inner_tree.value[index].min() < threshold:
# turn node into a leaf by "unlinking" its children
inner_tree.children_left[index] = TREE_LEAF
inner_tree.children_right[index] = TREE_LEAF
# if there are shildren, visit them as well
if inner_tree.children_left[index] != TREE_LEAF:
prune_index(inner_tree, inner_tree.children_left[index], threshold)
prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
sum(model_dt.tree_.children_left < 0)

#It means that the code has created 17 new leaf nodes
#(by practically removing links to their ancestors). The tree, which has looked before like

from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

Decision Tree

Decision tree:

The decision tree is the classification algorithm in ML(machine learning). A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

To understand the algorithm of the decision tree we need to know about the classification.

What is Classification?

Classification is the process of dividing the datasets into different categories or groups by adding a label. It adds the data point to a particular labeled group on the basis of some condition.

As we see in daily life there are three categories in an email(Spam, Promotions, Personal) they are classified to get the proper information. Here decision tree is used to classify the mail type and fix it the proper one.

Types of classification 

  • DECISION TREE
  • RANDOM FOREST
  • NAIVE BAYES
  • KNN

Decision tree:

  1. Graphical representation of all the possible solutions to a decision.
  2. A decision is based on some conditions.
  3. The decision made can be easily explained.

There are following steps to get a decision with the decision tree

1. Entropy:

Entropy is basically used to create a tree. We find our entropy from attribute or class. A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). ID3 algorithm uses entropy to calculate the homogeneity of a sample.

2.Information Gain:

The information gain is based on the decrease in entropy after a data-set is split on an attribute. Constructing a decision tree is all about finding an attribute that returns the highest information gain.

  • The information gain is based on the decrease in entropy after a dataset is split on an attribute.
  • Constructing a decision tree is all about finding an attribute that returns the highest information gain (i.e., the most homogeneous branches).
  • Gain(S, A) = Entropy(S) – ∑ [ p(S|A) . Entropy(S|A) ]
  • We intend to choose the attribute, splitting by which information gain will be the most
  • Next step is calculating information gain for all attributes
Here the short example of a Decision tree:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
play_data=pd.read_csv('data/tennis.csv.txt')
print(play_data)
play_data=pd.read_csv('data/tennis.csv.txt')
play_data

Output:

outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overcast hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
6 overcast cool normal True yes
7 sunny mild high False no
8 sunny cool normal False yes
9 rainy mild normal False yes
10 sunny mild normal True yes
11 overcast mild high True yes
12 overcast hot normal False yes
13 rainy mild high True no 

Entropy of play:

  • Entropy(play) = – p(Yes) . log2p(Yes) – p(No) . log2p(No)

play_data.play.value_counts()
Entropy_play=-(9/14)*np.log2(9/14)-(5/14)*np.log2(5/14)
print(Entropy_play)

output:
0.94028595867063114

Information Gain on splitting by Outlook

  • Gain(Play, Outlook) = Entropy(Play) – ∑ [ p(Play|Outlook) . Entropy(Play|Outlook) ]
  • Gain(Play, Outlook) = Entropy(Play) – [ p(Play|Outlook=Sunny) . Entropy(Play|Outlook=Sunny) ] – [ p(Play|Outlook=Overcast) . Entropy(Play|Outlook=Overcast) ] – [ p(Play|Outlook=Rain) . Entropy(Play|Outlook=Rain) ]

play_data[play_data.outlook == 'sunny'] 

# Entropy(Play|Outlook=Sunny)
Entropy_Play_Outlook_Sunny =-(3/5)*np.log2(3/5) -(2/5)*np.log2(2/5)
Entropy_Play_Outlook_Sunny
play_data[play_data.outlook == 'overcast'] # Entropy(Play|Outlook=overcast)
# Since, it's a homogenous data entropy will be 0
play_data[play_data.outlook == 'rainy'] # Entropy(Play|Outlook=rainy)
Entropy_Play_Outlook_Rain = -(2/5)*np.log2(2/5) - (3/5)*np.log2(3/5)
print(Entropy_play_Outlook_Rain)
# Entropy(Play_Sunny|)
Entropy_Play_Outlook_Sunny =-(3/5)*np.log2(3/5) -(2/5)*np.log2(2/5)
#Gain(Play, Outlook) = Entropy(Play) – [ p(Play|Outlook=Sunny) . Entropy(Play|Outlook=Sunny) ] –
#[ p(Play|Outlook=Overcast) . Entropy(Play|Outlook=Overcast) ] – [ p(Play|Outlook=Rain) . Entropy(Play|Outlook=Rain) ]

Other gains

  • Gain(Play, Temperature) – 0.029
  • Gain(Play, Humidity) – 0.151
  • Gain(Play, Wind) – 0.048

Conclusion – Outlook is winner & thus becomes root of the tree

Time to find the next splitting criteria

play_data[play_data.outlook == 'overcast'] play_data[play_data.outlook == 'sunny'] # Entropy(Play_Sunny|)
Entropy_Play_Outlook_Sunny =-(3/5)*np.log2(3/5) -(2/5)*np.log2(2/5)
print(Entropy_Play_Outlook_Sunny)
# Entropy(Play_Sunny|)
Entropy_Play_Outlook_Sunny =-(3/5)*np.log2(3/5) -(2/5)*np.log2(2/5)
print(Entropy_Play_Outlook_Sunny)

Information Gain for humidity

#Entropy for attribute high = 0, also entropy for attribute normal = 0
Entropy_Play_Outlook_Sunny - (3/5)*0 - (2/5)*0 

Information Gain for windy

  • False -> 3 -> [1+ 2-]
  • True -> 2 -> [1+ 1-]

Entropy_Wind_False = -(1/3)*np.log2(1/3) - (2/3)*np.log2(2/3)
print(Entropy_Wind_False)
Entropy_Play_Outlook_Sunny - (3/5)* Entropy_Wind_False - (2/5)*1  

Information Gain for temperature

  • hot -> 2 -> [2- 0+]
  • mild -> 2 -> [1+ 1-]
  • cool -> 1 -> [1+ 0-]

Entropy_Play_Outlook_Sunny - (2/5)*0 - (1/5)*0 - (2/5)* 1]

Conclusion : Humidity is the best choice on sunny branch:

play_data[(play_data.outlook == 'sunny') & (play_data.humidity == 'high')] 

Output:

outlook temp humidity windy play
0 sunny hot high False no
1 sunny hot high True no
7 sunny mild high False no 

play_data[(play_data.outlook == 'sunny') & (play_data.humidity == 'normal']

Output:
outlook temp humidity windy play
8 sunny cool normal False yes
10 sunny mild normal True yes

Splitting the rainy branch:

play_data[play_data.outlook == 'rainy'] # Entropy(Play_Rainy|)
Entropy_Play_Outlook_Rainy =-(3/5)*np.log2(3/5) -(2/5)*np.log2(2/5)outlook temp humidity windy play
3 rainy mild high False yes
4 rainy cool normal False yes
5 rainy cool normal True no
9 rainy mild normal False yes
13 rainy mild high True no 

Information Gain for temp

  • mild -> 3 [2+ 1-]
  • cool -> 2 [1+ 1-]

Entropy_Play_Outlook_Rainy - (3/5)*0.918 - (2/5)*1

Output:
0.020150594454668602

Information Gain for Windy:

Entropy_Play_Outlook_Rainy - (2/5)*0 - (3/5)*0

Output:
0.97095059445466858 

Information Gain for Humidity

  • High -> 2 -> [1+ 1-]
  • Normal -> 3 -> [2+ 1-]

Entropy_Play_Outlook_Rainy_Normal = -(1/3)*np.log2(1/3) - (2/3)*np.log2(2/3)
Entropy_Play_Outlook_Rainy_Normal
Entropy_Play_Outlook_Rainy - (2/5)*1 - (3/5)*Entropy_Play_Outlook_Rainy_Normal
Entropy_Play_Outlook_Rainy_Normal
Entropy_Play_Outlook_Rainy_Normal

Output:
0.91829583405448956
0.019973094021974891 

Final tree:

Decision trees are popular among non-statisticians as they produce a model that is very easy to interpret. Each leaf node is presented as an if/then rule. Cases that satisfy the if/then the statement is placed in the node. Are non-parametric and therefore do not require normality assumptions of the data. Parametric models specify the form of the relationship between predictors and response. An example is a linear relationship for regression. In many cases, however, the nature of the relationship is unknown. This is a case in which non-parametric models are useful. Can handle data of different types, including continuous, categorical, ordinal, and binary. Transformations of the data are not required. It can be useful for detecting important variables, interactions, and identifying outliers. It handles missing data by identifying surrogate splits in the modeling process. Surrogate splits are splitting highly associated with the primary split. In other models, records with missing values are omitted by default.

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

#iguru_button_6174864612e18 .wgl_button_link { color: rgba(255,255,255,1); }#iguru_button_6174864612e18 .wgl_button_link:hover { color: rgba(255,255,255,1); }#iguru_button_6174864612e18 .wgl_button_link { border-color: transparent; background-color: rgba(255,149,98,1); }#iguru_button_6174864612e18 .wgl_button_link:hover { border-color: rgba(230,95,42,1); background-color: rgba(253,185,0,1); }#iguru_button_61748646142ba .wgl_button_link { color: rgba(255,255,255,1); }#iguru_button_61748646142ba .wgl_button_link:hover { color: rgba(255,255,255,1); }#iguru_button_61748646142ba .wgl_button_link { border-color: rgba(218,0,0,1); background-color: rgba(218,0,0,1); }#iguru_button_61748646142ba .wgl_button_link:hover { border-color: rgba(218,0,0,1); background-color: rgba(218,0,0,1); }#iguru_button_61748646183a0 .wgl_button_link { color: rgba(241,241,241,1); }#iguru_button_61748646183a0 .wgl_button_link:hover { color: rgba(250,249,249,1); }#iguru_button_61748646183a0 .wgl_button_link { border-color: rgba(102,75,196,1); background-color: rgba(48,90,169,1); }#iguru_button_61748646183a0 .wgl_button_link:hover { border-color: rgba(102,75,196,1); background-color: rgba(57,83,146,1); }#iguru_soc_icon_wrap_6174864621317 a{ background: transparent; }#iguru_soc_icon_wrap_6174864621317 a:hover{ background: transparent; border-color: #3aa0e8; }#iguru_soc_icon_wrap_6174864621317 a{ color: #acacae; }#iguru_soc_icon_wrap_6174864621317 a:hover{ color: #ffffff; }#iguru_soc_icon_wrap_6174864621317 { display: inline-block; }.iguru_module_social #soc_icon_61748646213451{ color: #ffffff; }.iguru_module_social #soc_icon_61748646213451:hover{ color: #ffffff; }.iguru_module_social #soc_icon_61748646213451{ background: #44b1e4; }.iguru_module_social #soc_icon_61748646213451:hover{ background: #44b1e4; }
Get The Learnbay Advantage For Your Career
Note : Our programs are suitable for working professionals(any domain). Fresh graduates are not eligible.
Overlay Image
GET THE LEARNBAY ADVANTAGE FOR YOUR CAREER
Note : Our programs are suitable for working professionals(any domain). Fresh graduates are not eligible.