
Data Science for working professionals

To secure a job in any domain, one has to prepare a lot, be trained for the role, and have solid knowledge of the field; people usually dedicate years to preparing for their desired roles. Shifting from a domain you have prepared for to a different one is rarely easy, and a strong gust of skepticism will surely haunt you. The process of shifting from one domain to another is hard, and it gets harder to learn data science as a working professional, because you have to prepare for the new job role while maintaining your current one.

Only if you plan the whole process of domain shifting in an organised and rational way can you have a win-win situation.

Have a vision and plan your strategy

You must win at both games, learning and working. For that, you will have to strategize so that the time you spend learning data science does not collide with your work life, and vice versa, because both activities are equally important and each demands immense attention.

Let us start from scratch. Here are some possible concerns of a working professional:

  1. Time management
  2. Balancing the energy between two activities
  3. Scheduling
  4. Risk of affording a wrong move
  5. Risk of inefficient or improper execution

As a working professional, you will have to manage your responsibilities so that you stay in control of everything on your plate. With proper planning and the right approach, the concerns mentioned above can be easily tamed.

Firmly state your purpose of learning data science
Why do you want to change your domain to data science while you already have a job? Firmly define the purpose. Know that by shifting to data science everything will change: you will have to develop new skill sets for the role you are targeting, your workflow will be different, and your future job role will have different goals, purposes, and aims. Act consciously when you risk giving up the comfort and expertise you have in your current job, and be very sure about your purpose for doing so. Doing this will eliminate the skepticism about stepping out of your comfort zone. The effort you put into learning data science will never go in vain, because you will learn currently trending technologies and tools that will help you survive not only in data science but anywhere in the IT industry.

Have a soft target
People think only the role of ‘data scientist’ matters, but in fact there are several other roles in data science that matter significantly in the field. Choose one role that you want to grow into and start preparing for it. This is good for starters, because you do not have to be a scholar in every tool that has ever been used in the field; smartly target the topics that are essential in data science. When you work toward one targeted role, you get the chance to know it and its importance in the field completely. This is a smart move because you will not be confused about what exactly to study in the vast field of data science, and the field generally prioritizes those who hold deep expertise in a specific area. So be very sure about the role you want to serve in within data science.

Plan the execution
To plan the execution perfectly, you will first have to design the implementation; do it wisely and rationally. Revise your daily-life activities and reschedule them for the sake of balancing learning and working.

Examine the way you spend time on everyday things and revise it according to your daily schedule. Practice making a note of your tasks every day, plan how much time you will invest in each, and try your best to act as decided. In other words, this way of dealing with things is called discipline; to have a structured day you will have to practice discipline in all possible ways. Revise your activities, from sleeping habits to break sessions, and reschedule them so that things fall into the right place by themselves. Set targets, set your own deadlines, and design the way you want things to work.

Networking and understanding the field
Involve yourself with people who come from the field of data science, and learn the insider story of the field and how it works. Having field knowledge is very necessary; remember that when you get into data science you will have to work in teams, so practice communication skills and confidence. Interact with people by asking them about ways to reach the field; this way you will build good connections and get great suggestions as well. Start associating yourself with people who belong to data science; you will need to get used to that.

A good course
Everything you do and every effort you put in is to learn data science, but if you make the mistake of choosing the wrong course, all of it will go in vain. Your purpose in learning data science is to shift your domain to data science, and you cannot do this without the help of a good course. The course you choose should not only help you gain fine knowledge of data science but also help you manage your planned schedule. There are many data science courses specially built for working professionals; it will greatly help if you choose the right one among them.

Conclusion
With the right approach and proper planning, you can triumph at learning data science while maintaining a full-time job. Stick to your plans and preparations, seek help from a good course, practice as much as you can, and start involving yourself with the field. If you manage to execute your plans every day, you will surely reach your destination with ease.

Learnbay could help you
The data science course of Learnbay is specially designed for working professionals, and the benefits provided in the course will help you balance your schedule. Learnbay, powered by IBM, will help you throughout the journey of learning and experiencing data science.

Regression techniques in Machine Learning

Machine learning has become one of the trendiest technologies in this world of technology. Machine learning is used every day in our lives: virtual assistants, future predictions, video surveillance, social media services, spam mail detection, online customer support, search-engine result prediction, fraud detection, recommendation systems, etc. Within machine learning, regression is one of the most important topics to learn. There are different types of regression techniques, which we will get to know in this article.

Introduction:

Regression algorithms such as linear regression and logistic regression are the most important algorithms that people learn while studying machine learning. There are numerous forms of regression, each with its own specific features that are applied accordingly. Regression techniques are used to find the relationship between the dependent and independent variables or features. Regression is a part of data analysis, and its main aims are forecasting, time series analysis, and modeling.

What is Regression?

Regression is a statistical method, used mainly in finance, investing, sales forecasting, and other business disciplines, that attempts to find the strength of the relationship among variables.

A dataset to which regression techniques are applied contains two types of variables:

  1. The dependent variable, mainly denoted as Y
  2. The independent variable, denoted as X.

There are also two types of regression:

  1. Simple regression: with only a single independent feature/variable
  2. Multiple regression: with two or more independent features/variables.

In regression studies, the following types of regression techniques are the ones most commonly used for complex problems.

  • Linear regression
  • Logistic regression
  • Polynomial regression
  • Stepwise Regression
  • Ridge Regression
  • Lasso Regression

Linear regression:

Linear regression is basically used for predictive analysis, and it is a supervised machine learning algorithm. It is a linear approach to modeling the relationship between a scalar response and one or more predictor variables. It focuses on the conditional probability distribution of the response given the predictors. The formula for linear regression is Y = mX + c,

where Y is the target variable, m is the slope of the line, X is the independent feature, and c is the intercept.

[Figure: Simple linear regression fit line]

Additional points on Linear regression:

  1. There should be a linear relationship between the variables.
  2. It is very sensitive to outliers and can give a model with high variance and bias.
  3. Multicollinearity can arise when there are multiple independent features.

Logistic regression:

Logistic regression is used for classification problems on a linearly separable dataset. In layman’s terms, it applies when the dependent or target variable is in binary form: 1 or 0, true or false, yes or no. It is suited to deciding whether an occurrence is likely to be a success or a failure.

 

[Figure: Logistic regression]

Additional points:

  1. It is used for classification problems.
  2. It does not require a linear relationship between the dependent and independent features.
  3. It can be affected by outliers, and underfitting or overfitting can occur.
  4. It needs a large sample size to make the estimation more accurate.
  5. Collinearity and multicollinearity need to be avoided.
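
To make this concrete, here is a minimal, illustrative sketch of a logistic regression classifier in scikit-learn; the tiny pass/fail dataset is invented purely for demonstration:

# Hypothetical example: predicting pass (1) or fail (0) from hours studied
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)  # hours studied
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])                 # pass/fail labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.5]]))        # predicted class for a new student
print(clf.predict_proba([[4.5]]))  # probability of each class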

Polynomial regression:

The polynomial regression technique is used to fit a model that can handle non-linearly separable data. It gives a curve that best fits the data points, rather than a straight line.
Polynomial regression is fitted in the least-squares sense. The purpose of the regression analysis is to model the expected value of the dependent variable y for an independent value x.
The formula for this is Y = β0 + β1x + β2x² + … + βnxⁿ + e.
[Figure: Polynomial regression curve]
Additional points:
Look particularly at the curve toward the ends to see whether those shapes and patterns make logical sense. Higher-degree polynomials can lead to weird extrapolation results.
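
As a hedged sketch of the idea, a polynomial fit can be assembled in scikit-learn from PolynomialFeatures plus LinearRegression; the toy points below are invented and roughly quadratic:

# Sketch: fitting a degree-2 polynomial (toy data, for illustration only)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([2.1, 4.8, 9.7, 16.3, 25.1, 35.8])  # roughly y = x^2

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[7]]))  # extrapolation beyond the data; treat cautiously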

Step-wise Regression:

Stepwise regression is used for fitting statistical regression models with predictive variables, and it is carried out automatically.
At every step, a variable is added to or removed from the set of explanatory variables. The main approaches are forward selection, backward elimination, and bidirectional elimination.
The formula for the standardized coefficient used here is b = b(s_xi / s_y), where s_xi and s_y are the standard deviations of the predictor and the response.
Additional points:
  1. This regression provides two things: adding predictors at each step (forward selection) and removing predictors at each step (backward elimination).
  2. Forward selection starts with the most significant predictor in the ML model and then adds a feature at each step.
  3. Backward elimination starts with all the predictors in the model and then removes the least significant variable at each step. A sketch of both directions follows this list.
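
Classical p-value-based stepwise routines live in statistics packages rather than scikit-learn, but scikit-learn's SequentialFeatureSelector (available in versions 0.24 and later) performs a comparable forward or backward selection. A sketch on synthetic data:

# Sketch of forward/backward feature selection, a close analogue of
# stepwise regression (assumes scikit-learn >= 0.24; data is synthetic)
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction='forward')  # or 'backward'
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features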

Ridge Regression: 

Ridge regression is a method used when the dataset suffers from multicollinearity, which means the independent variables are strongly correlated with each other. Although the least-squares estimates are unbiased under multicollinearity, their variances are large. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
[Figure: Ridge regression]

Additional points:

  1. Its assumptions are the same as those of least-squares regression, except that normality need not be assumed.
  2. The coefficient values shrink, but they never become exactly zero.
  3. It is a regularization method and uses the L2 penalty.
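
A minimal sketch of ridge regression with scikit-learn, on synthetic data; alpha controls the degree of bias added (the strength of the L2 penalty):

# Sketch: ridge shrinks coefficients toward zero relative to plain OLS
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, LinearRegression

X, y = make_regression(n_samples=100, n_features=20,
                       noise=10.0, random_state=1)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge coefficients are smaller in magnitude but none are exactly zero
print(abs(ols.coef_).sum(), abs(ridge.coef_).sum())
print((ridge.coef_ == 0).sum())  # typically 0: no coefficient reaches zero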

Lasso Regression:

Lasso is an abbreviation of Least Absolute Shrinkage and Selection Operator. It is similar to ridge regression in that it also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models.

[Figure: Lasso regression]

 

Additional points:
  1. Lasso regression shrinks coefficients to zero, which helps with feature selection when building a proper ML model.
  2. It is also a regularization method and uses L1 regularization.
  3. If there are many correlated features, it picks only one of them and shrinks the others to zero, as the sketch below shows.
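
A small sketch on synthetic data shows the selection effect; with an L1 penalty, some coefficients become exactly zero:

# Sketch: lasso drives some coefficients exactly to zero (L1 penalty),
# which acts as built-in feature selection
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=1)

lasso = Lasso(alpha=1.0).fit(X, y)
print((lasso.coef_ == 0).sum(), "of", len(lasso.coef_),
      "coefficients are exactly zero")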

 

Learnbay provides industry-accredited data science courses in Bangalore. We understand the convergence of technologies in the field of data science, hence we offer significant courses like machine learning, TensorFlow, IBM Watson, Google Cloud Platform, Tableau, Hadoop, time series, R, and Python, with authentic real-time industry projects. Students become proficient and are certified by IBM, and hundreds of students have been placed in promising companies for data science roles. By choosing Learnbay you will reach the most aspirational jobs of the present and future.
The Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, and Deep Learning using TensorFlow. These topics are covered and co-developed with IBM.

Model vs Algorithm in ML

Machine learning works with “models” and “algorithms”, and both play an important role in machine learning: the algorithm describes the process, and the model is built by following those rules.

Algorithms were derived by statisticians and mathematicians long ago, and those algorithms are studied and applied by practitioners for their business purposes.

A model in machine learning is nothing but a function that takes a certain input, performs the operations prescribed by the algorithm as best it can on the given input, and gives a suitable output.

Some of the machine learning algorithms are:

  1. Linear regression
  2. Logistic regression
  3. Decision tree
  4. Random forest
  5. K-nearest neighbor
  6. K-means learning

What is an algorithm in Machine learning?

An algorithm is a step-by-step approach, powered by statistics, that guides the machine in its learning process. An algorithm is one of the several components that constitute a model.

There are several characteristics of machine learning algorithms:

  1. Machine learning algorithms can be represented using mathematics and pseudocode.
  2. The effectiveness of machine learning algorithms can be measured and represented.
  3. With any of the popular programming languages, machine learning algorithms can be implemented.

What is the Model in Machine learning?

A model depends on factors such as feature selection, tuning parameters, and cost functions, along with the algorithm itself; it is not fully determined by the algorithm alone.

A model is the result of an algorithm: we get it when we implement the algorithm in code and train it on real data. A model is something that tells you what your program learned from the data by following the rules of the algorithm. The model is then used to predict future results based on what the algorithm observed in the training data.

                Model = Data + Algorithm 

Building a model involves four major steps:

  1. Data preprocessing
  2. Feature engineering
  3. Data management
  4. Performance measurement.

How do the model and the algorithm work together in machine learning?

For example:

y = mx + c is the equation of a line, where m is the slope and c is the y-intercept; this is nothing but linear regression with one variable.
Similarly, decision trees and random forests have the Gini index, and K-nearest neighbors has the Euclidean distance formula.

So take the linear regression algorithm:

  1. Start with a training set of x1, x2, …, and y.
  2. Initialize the parameters c0, c1, c2 with random values.
  3. Choose a learning rate alpha.
  4. Then repeat updates of the form c0 = c0 − α(h(x) − y), and likewise for c1 and c2 (each update multiplied by its corresponding feature value).
  5. Repeat this process until convergence.
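
As an illustrative sketch (not part of the original post), here are those five steps in NumPy for a single feature; the data and learning rate are invented for demonstration:

# Minimal gradient-descent sketch of the five steps above, one feature
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # step 1: training inputs
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # training targets (~ 2x + 1)

c0, c1 = np.random.rand(2)                # step 2: random initial parameters
alpha = 0.01                              # step 3: learning rate

for _ in range(5000):                     # steps 4-5: repeat until converged
    h = c0 + c1 * x                       # current predictions h(x)
    c0 -= alpha * np.mean(h - y)          # update the intercept
    c1 -= alpha * np.mean((h - y) * x)    # update the slope

print(c1, c0)  # learned m and c of y = mx + c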

When you employ this algorithm, you employ these exact five steps in your model without changing them; your model is initiated by the algorithm and treats all of the dataset the same way.

If you want to apply the algorithm in a model, the model has to find the values of m and c that we don’t know. How will it find them?
Suppose you have three variables, each with values of x and y; your model will find the values of m1, m2, m3 and c1, c2, c3 for the three variables.
The model then works with three slopes and three intercepts to produce predictions on future data.

The “algorithm” may treat all the data the same, but it is the “model” that actually solves the problem. An algorithm is something you use to train the model on the data.

After building a model, a data science enthusiast tests it to measure its accuracy and fine-tunes it to improve the results.

This article may help you understand the difference between an algorithm and a model in machine learning. In summary, an algorithm is a process or technique that we follow to get a result or to find the solution to a problem.
A model is a computation or formula formed as the output of an algorithm that takes some input; so you can say that you build a model using a given algorithm.

 


Win the COVID-19

If you slightly change your perspective towards the lockdown situation, you can find hope that this pandemic will end and hope for a brighter-than-ever future. Go for data science; it will be worth it.

Text stemming in NLP

Human language is an unsolved problem: there are more than 6,500 languages worldwide. Tons of data are generated every day as we speak, text, and tweet, and from voice-to-text on every social application, and to get insights from this text data we need a technology like NLP. There are two types of data: structured and unstructured. Structured data is used for machine learning models, while unstructured data is handled with natural language processing. Only about 21% of the available data is structured, so you can estimate how much NLP is required to handle the unstructured rest.

To get insights from a dataset of unstructured data, we need to take the important information out of it. An important technique for analyzing text data is text mining: the technique of extracting useful information from unstructured data by identifying and exploring a large amount of text. In other words, text mining is used to convert unstructured data into a structured dataset.

Normalization, lemmatization, stemming, and tokenization are the techniques NLP uses to draw insights from the data.

Now let us see how text stemming works.

Stemming is the process of reducing inflected words to their “root” form, mapping a group of words onto the same stem. Stemmed words have their suffixes and prefixes removed to leave the root word; stemming produces the grammatical variants of root words. Stemming is performed by NLP algorithms called stemming algorithms, or stemmers. A stemming algorithm removes the affixes from the word. For example, eats, eating, and eatery are made from the root word “eat“; here the stemmer removes s, ing, and ery from these words to take out the meaning that the sentence is about eating something. The words are nothing but different forms and tenses of the verb.

This is the general idea to reduce the different forms of the word to their root word.
Words that are derived from one another can be mapped to a base word or symbol, especially if they have the same meaning.
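
A quick sketch of this idea using NLTK's Porter stemmer (assuming the nltk package is installed; exact outputs can vary between stemmers):

# Sketch: stemming a few word forms with NLTK's Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["eats", "eating", "eatery",
             "university", "universal", "universe"]:
    print(word, "->", stemmer.stem(word))
# eats and eating both reduce to "eat"; the univers- words illustrate
# how aggressive a stemmer can be (see over-stemming below)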

Since we cannot be sure stemming will give a 100% correct result, there are two types of error in stemming: over-stemming and under-stemming.

Over-stemming occurs when too much of a word is cut off.
The result can be nonsensical stems, where the meaning of the word is lost, or a failure to distinguish between two stems that should differ from each other.

For example, take the four words university, universities, universal, and universe. A stemmer that resolves all four to the stem “univers” is over-stemming. Universal and universe should be stemmed together, and university and universities should be stemmed together, but all four do not fit a single stem.

Under-stemming: Under-stemming is the opposite of over-stemming. It occurs when we have different words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately they do not.

This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” You might think: well, just resolve both to “dat.” However, then what do we do with date? And is there a good general rule? That is where under-stemming occurs.


Data Science at Intern level

As an intern, your focus must be on following the patterns of how the work is done and analysing which language will be appropriate to learn, because one’s journey in data science does not end with getting a job; it starts there.

Linear Regression in Machine Learning

What is Regression?

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). Regression is a predictive modeling analysis technique. It estimates a relationship between the dependent and an independent variable.

Use of Regression:

  • Determine the strength of predictors.
  • Forecasting an effect.
  • Trend forecasting.

Linear Regression:

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job of predicting an outcome (dependent) variable? (2) which variables in particular are significant predictors of the outcome variable, and in what way do they (indicated by the magnitude and sign of the beta estimates) impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.

Linear Regression Selection Criteria:

  1. Classification & regression capabilities.
  2. Data quality.
  3. Computational complexity.
  4. Comprehensibility & transparency.

When will we use Linear Regression?

  • Evaluating trends & sales estimates.
  • Analyzing the impact of price changes.
  • Assessment of risk in financial services and insurance domain.

For example, say a group of creative tech enthusiasts started a company in Silicon Valley. This start-up, called Banana, is so innovative that it has been growing constantly since 2016. You, the wealthy investor, would like to know whether to put your money on Banana’s success in the next year or not. Let’s assume that you don’t want to risk a lot of money, especially since the stakes are high in Silicon Valley. So you decide to buy a few shares, instead of investing in a big portion of the company.

Well, you can definitely see the trend. Banana is growing like crazy, kicking up their stock price from 100 dollars to 500 in just three years. You only care about what the price is going to be in the year 2021, because you want to give your investment some time to blossom along with the company. Optimistically speaking, it looks like you will be growing your money in the upcoming years. The trend is likely not to go through a sudden, drastic change. This leads to you hypothesizing that the stock price will fall somewhere above the $500 indicator.

Here’s an interesting thought. Based on the stock price records of the last couple of years, you were able to predict what the stock price will be like. You were able to infer the range of the new stock price (which doesn’t exist on the plot) for a year we don’t have data for (the year 2021). Well, kind of.

What you just did is use your model (that head of yours) to generalize: predict the y-value for an x-value that is not even in your data. However, this is not accurate in any way. You couldn’t specify exactly what the stock price is most likely going to be. For all you know, it is probably going to be above 500 dollars.

Here is where Linear Regression (LR) comes into play. The essence of LR is to find the line that best fits the data points on the plot so that we can, more or less, know exactly where the stock price is likely to fall in the year 2021.

Let’s examine the LR-generated line (shown in red in the original plot) by looking at its importance. It looks like, with just a little modification, we were able to realize that Banana’s stock price is likely to be worth a little over $600 by the year 2021. Obviously, this is an oversimplified example, but the process stays the same. Linear regression as an algorithm relies on the concept of lowering a cost to maximize performance. We will examine this concept, and how we got the red line on the plot, next.

Finding the best fit line:

To check the goodness of fit we use the R-squared method.

What is the R-squared method?

The R-squared value is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (COD) or, for multiple regression, the coefficient of multiple determination.
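
As a small sketch, R-squared can be computed by hand as 1 − (residual sum of squares / total sum of squares) and compared against scikit-learn's score method; the toy data here matches the Python walkthrough below:

# Sketch: computing R-squared by hand and checking it against model.score
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)              # same value as model.score(x, y)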

What are overfitting and underfitting?

Overfitting: Good performance on the training data, poor generalization to other data.

Underfitting: Poor performance on the training data & poor generalization to other data.

Linear Regression with python:

1. Importing required libraries:

import numpy as np
from sklearn.linear_model import LinearRegression

2. Provide data:

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

print(x)
print(y) 

Output:
>>> print(x)
[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]
>>> print(y)
[ 5 20 14 32 22 38]

3. Create a model and fit it:

model = LinearRegression().fit(x, y) 

4. Get Result:

>>> r_sq = model.score(x, y)
>>> print('coefficient of determination:', r_sq)
coefficient of determination: 0.715875613747954 

5. Predict response:

>>> y_pred = model.predict(x)
>>> print('predicted response:', y_pred, sep='\n')
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]



Data Science is important!

Data science is not only required to become a data scientist; it also makes you eligible for other technical jobs outside the field of DS. Know how!

Random forest model (RFM)

Random Forest Model:

The random forest model is a classification model built from a combination of decision trees. The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates a forest with several trees; generally, the more trees in the forest, the higher the accuracy of the results.
The random forest model follows an ensemble technique: it constructs multiple decision trees at training time, and its prediction is based on the mode of the trees’ outputs for classification and on their mean for regression. This helps reduce the overfitting that an individual decision tree is prone to.

Working of Random Forest Algorithm

We can understand the working of the Random Forest algorithm with the help of following steps −

  • Step 1 − First, start with the selection of random samples from a given dataset. In a random forest this is bootstrap sampling (sampling with replacement), and each tree additionally considers only a random subset of the features at each split.

Each tree is trained on its own sample, and the result we get is a combination over all of them. For example, with 1,000 features in a dataset, each split may consider only a few dozen randomly chosen features, and the final result is based on combining the votes of all the trees.

  • Step 2 − Next, the algorithm constructs a decision tree for every sample and gets a prediction result from each decision tree.
  • Step 3 − In this step, voting is performed for every predicted result.
    • Based on ‘n’ samples, ‘n’ trees are built
    • Each record is classified by each of the n trees
    • The final class for each record is decided based on voting

Step 4 − At last, select the most-voted prediction result as the final prediction.

What is the Out of Bag score in Random Forests?

The out-of-bag (OOB) score is a way of validating the random forest model. Below is a simple intuition of how it is calculated, followed by a description of how it differs from the validation score and where it is advantageous.

For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble, labeled 1 to 5, trained on a simple original training data set.

OOB Error Rate Computation Steps

  • The sample left out (out-of-bag) of the Kth tree is classified using the Kth tree
  • Assume j cases are misclassified
  • The proportion of times that the prediction is not equal to the true class, averaged over all cases, is the OOB error rate.
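
In scikit-learn the OOB score can be requested directly when building the forest; a minimal sketch on synthetic data:

# Sketch: enabling the OOB score on scikit-learn's random forest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy measured on the out-of-bag samples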

Variable importance of RF: 

This tells us which features are most useful to the random forest model, allowing us to get high accuracy from the model with less error.

  • Random forest computes two measures of variable importance
    • Mean decrease in accuracy
    • Mean decrease in Gini
  • Mean decrease in accuracy is based on permutation
    • Randomly permute the values of the variable whose importance is to be computed in the OOB sample
    • Compute the error rate with the permuted values
    • Compute the decrease in OOB error rate (permuted − not permuted)
    • Average the decrease over all the trees
  • Mean decrease in Gini is computed as the “total decrease in node impurities from splitting on the variable, averaged over all trees”.
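
In scikit-learn, the impurity-based (Gini) importances are exposed as feature_importances_, and the permutation-based measure is available through sklearn.inspection; a sketch on synthetic data:

# Sketch: both variable-importance measures with scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.feature_importances_)  # mean decrease in impurity (Gini)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)    # permutation-based (decrease in accuracy)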

Finding the optimal values using grid-search cv:

Grid-search CV finds the optimal hyperparameter values for the model, such as how many trees to build and how deep each decision tree may be split.
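
A minimal sketch of such a search with scikit-learn's GridSearchCV, on synthetic data and with an illustrative parameter grid:

# Sketch: grid-searching tree count and depth with cross-validation
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [2, 4, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)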

Measuring RF model performance by Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm: it tells how many predictions match the true values.

Random Forest with python: 

Importing the important libraries:

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO  # on newer scikit-learn use: from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus  # needed later to render the tree graph

Read the data from csv

dummy_df = pd.read_csv("bank.csv", na_values=['NA'])
temp = dummy_df.columns.values[0]  # the file is ';'-separated, so all names land in one header
print(dummy_df)

Data Pre-Processing:

columns_name = temp.split(';')
data = dummy_df.values
print(data)
print(data.shape)
contacts = list()
for element in data:
    contact = element[0].split(';')
    contacts.append(contact)

contact_df = pd.DataFrame(contacts, columns=columns_name)
print(contact_df)

def preprocessor(df):
    # Label-encode every column; the body of this function was truncated in
    # the original post, so this is a reasonable completion of what it set up.
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    for col in res_df.columns:
        res_df[col] = le.fit_transform(res_df[col].astype(str))
    return res_df

encoded_df = preprocessor(contact_df)
#encoded_df = preprocessor(contacts)
x = encoded_df.drop(['"y"'], axis=1).values
y = encoded_df['"y"'].values

Split the data into Train-Test

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)

Build the Decision Tree Model

# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2)
model_dt_2.fit(x_train, y_train)
model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ",model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ",model_dt_2_score_test)
#y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1] #Decision tree

model_dt = DecisionTreeClassifier(max_depth = 8, criterion ="entropy")
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]

Graphical Representation of Tree

plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Performance Metrics

fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)
# Model Accuracy
print(model_dt.score(x_test, y_test))

# collect the true labels of every record predicted as class 1
y_actual_result = y_test[0]
for i in range(len(predictions)):
    if predictions[i] == 1:
        y_actual_result = np.vstack((y_actual_result, y_test[i]))

Recall

# Recall-style check: fraction of true "yes" among the records predicted "yes"
y_actual_result = y_actual_result.flatten()
count = 0
for result in y_actual_result:
    if result == 1:
        count = count + 1
print("true yes|predicted yes:")
print(count / float(len(y_actual_result)))

Area Under the Curve

plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
         lw=lw, label='Decision Tree (AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()

Confusion Matrix

print(confusion_matrix(y_test, predictions))
accuracy_score(y_test, predictions)

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, normalize=False):
    # This function prints and plots the confusion matrix.
    cm = confusion_matrix(y_test, model, labels=[0, 1])
    classes = ["Success", "Default"]
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

plt.figure(figsize=(6,6))
plot_confusion_matrix(predictions, normalize=False)
plt.show()

Pruning of the tree

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    if inner_tree.value[index].min() < threshold:
        # turn this node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are children, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
sum(model_dt.tree_.children_left < 0)

# The increase between the two counts is the number of new leaf nodes created
# (by practically removing the links to their children).

from sklearn.externals.six import StringIO  # on newer scikit-learn use: from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())


Human activity recognition with a smartphone

Human Activity recognition:

In this case study, we design a model by which a smartphone can detect its owner’s activity precisely. Human activity recognition with a smartphone is a very famous ML project; it is a wellness approach for humans and a very exciting project for AI.

Most smartphones have two smart sensors, an accelerometer and a gyroscope, which are IoT sensors. With the help of these IoT devices, the activity of a human can be captured; the data on human activity is collected through the sensors. The accelerometer collects data on the phone’s movement, such as switching between landscape and portrait while playing mobile games, and the gyroscope measures rotational movement.

For example, a smartphone can run an Android app that reads the accelerometer and gyroscope and predicts the human activity: walking normally, walking upstairs, walking downstairs, lying down, or sitting. Some accelerometer- and gyroscope-based devices also measure heart rate, calories burned, etc.; by reading all these human activities, they tell how much work a human has done in a day. This, too, is an area of the Internet of Things (IoT).

Working of Human activity project:

  1. Human activity recognition: With the help of sensors, we collect data on body movement captured by the smartphone. Movements are often indoor activities such as walking, walking upstairs, walking downstairs, lying down, sitting, and standing. The data is recorded so the activity can be predicted.

  2. Data set collection of activity: The data was collected from 30 volunteers aged between 19 and 48, performing the activities mentioned above while wearing a smartphone on the waist. The subjects were recorded performing the activities, and the movement data was labeled manually.

3. Human Activity Recognition Using Smartphones Data Set: The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data. The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low-frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.

4. Download the dataset:

  • There are “train” and “test” folders containing the split portions of the data for modeling (e.g. 70%/30%).
  • There is a “txt” file that contains a detailed technical description of the dataset and the contents of the unzipped files.
  • There is a “txt” file that contains a technical description of the engineered features.

The contents of the “train” and “test” folders are similar (e.g. folders and file names), although with differences in the specific data they contain.

Load the data set and process it:

Important libraries to import for data processing

#start with some necessary imports
import numpy as np
import pandas as pd
from google.colab import files
uploaded = files.upload()

google.colab’s files.upload() is used to upload the data files into the Colab notebook.


train_data = pd.read_csv("train.csv")
train_data.head()

We select the training data set for modeling.

train_data.Activity.value_counts()
train_data.shape

Above, value_counts() shows how many samples each activity has, and shape tells how many rows and columns the dataset has.


train_data.describe()  

describe() summarizes the data (8 summary rows by 563 columns) across all the features. For numeric data, the result’s index includes count, mean, std, min, and max, as well as lower, 50th, and upper percentiles. By default the lower percentile is the 25th and the upper percentile is the 75th; the 50th percentile is the same as the median.


uploaded = files.upload()
test_data = pd.read_csv('test.csv')
test_data.head()

Here we read the csv file to analyze the data set for the operations to be programmed. head()
shows the first 5 rows with their respective columns, so here we have (5 rows and 563 columns).

# shuffling the data
from sklearn.utils import shuffle

# test = shuffle(test)
train_data = shuffle(train_data)

Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you’d shuffle your data is when it is sorted by class/target. Here, you shuffle to make sure that the training/test/validation sets are representative of the overall distribution of the data.

# separating data inputs and output labels
trainData = train_data.drop('Activity' , axis=1).values
trainLabel = train_data.Activity.values

testData = test_data.drop('Activity' , axis=1).values
testLabel = test_data.Activity.values
print(testLabel)

With the above code we separate the inputs from the output labels; the labels are the human activities captured by the IoT device. The activities walking, standing, walking upstairs, walking downstairs, sitting, and lying down are separated out so the result can be optimized.

# encoding labels
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
# encoding test labels
encoder.fit(testLabel)
testLabelE = encoder.transform(testLabel)

# encoding train labels
encoder.fit(trainLabel)
trainLabelE = encoder.transform(trainLabel)

The encoder holds the label for each class and encodes the categorical target using an ordinal scheme. It can be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

# applying a supervised neural network using a multi-layer perceptron
import sklearn.neural_network as nn

mlpSGD = nn.MLPClassifier(hidden_layer_sizes=(90,),
                          max_iter=1000, alpha=1e-4,
                          solver='sgd', verbose=10,
                          tol=1e-19, random_state=1,
                          learning_rate_init=.001)

mlpADAM = nn.MLPClassifier(hidden_layer_sizes=(90,),
                           max_iter=1000, alpha=1e-4,
                           solver='adam', verbose=10,
                           tol=1e-19, random_state=1,
                           learning_rate_init=.001)
nnModelSGD = mlpSGD.fit(trainData , trainLabelE)
y_pred = mlpSGD.predict(testData).reshape(-1,1)
#print(y_pred)
from sklearn.metrics import classification_report
print(classification_report(testLabelE, y_pred))
 

import matplotlib.pyplot as plt
import seaborn as sns

# sub_01 is not defined in the original post; it is presumably the slice of
# the training data for subject 1, e.g.:
# sub_01 = train_data[train_data['subject'] == 1]
fig = plt.figure(figsize=(32,24))
ax1 = fig.add_subplot(221)
ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,0], data=sub_01, jitter=True)
ax2 = fig.add_subplot(222)
ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,1], data=sub_01, jitter=True)
plt.show()

 

fig = plt.figure(figsize=(32,24))
ax1 = fig.add_subplot(221)
ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,2], data=sub_01, jitter=True)
ax2 = fig.add_subplot(222)
ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,3], data=sub_01, jitter=True)
plt.show()

 


