 ### Clustering & Types Of Clustering

Clustering is the process of finding groups of similar instances in data; each such group is called a cluster. It places data instances that are similar to each other in the same cluster and instances that are very different (far away) from each other in different clusters. A cluster is, therefore, a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. Clustering is one of the most popular techniques in data science. In this article, I will take you through the types of clustering, different clustering algorithms, and two of the most commonly used clustering methods.

Steps involved in Clustering analysis:

1. Formulate the problem – select variables to be used for clustering.

2. Decide the clustering procedure whether it will be Hierarchical or Non-Hierarchical.

3. Select the measure of similarity or dissimilarity.

4. Choose clustering algorithms.

5. Decide the number of clusters.

6. Interpret the cluster output(profile the clusters).

7. Validate the clusters.

### Types of clustering technique:

Broadly speaking, clustering can be divided into two subgroups :

• Hard Clustering: In hard clustering, each data point belongs to exactly one cluster. For example, if a retail store groups its customers into 10 clusters, each customer is put into exactly one of the 10 groups.
• Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, each point is assigned a probability or likelihood of belonging to each cluster. In the same retail scenario, each customer is assigned a probability of being in each of the 10 clusters.
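The difference between the two kinds of assignment can be sketched in a few lines. This is only an illustration: the distances are made up, and the soft weights are assumed to be proportional to exp(-d²), whereas real soft-clustering methods (for example Gaussian mixtures) derive them from a fitted model.

```python
import math

def hard_assignment(distances):
    """Return the index of the single closest cluster."""
    return min(range(len(distances)), key=lambda i: distances[i])

def soft_assignment(distances):
    """Turn distances into membership weights that sum to 1.

    Assumption: weights are proportional to exp(-d^2).
    """
    scores = [math.exp(-d * d) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical distances from one customer to three cluster centres
distances = [0.5, 1.0, 2.0]

hard = hard_assignment(distances)   # one cluster index
soft = soft_assignment(distances)   # one weight per cluster

print(hard)
print([round(p, 3) for p in soft])
```

The hard result is a single label, while the soft result keeps a weight for every cluster, so a point lying between two clusters is not forced into one of them.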

Types of clustering are:

k-means clustering:

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can, for example, be found using k-medians and k-medoids.

k-means is an iterative clustering algorithm that improves the clustering in each iteration until it reaches a local optimum. The algorithm works in the following steps:

1. Specify the desired number of clusters K : Let us choose k=2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1 shown using red color and two points in cluster 2 shown using grey color.
3. Compute cluster centroids: The centroid of data points in the red cluster is shown using a red cross and those in a grey cluster using the grey cross.
4. Re-assign each point to the closest cluster centroid: Note that the data point at the bottom is currently assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids: Now, re-compute the centroids for both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until the assignments stop changing. When no data point switches between clusters for two successive iterations, the algorithm has converged (to a local optimum, not necessarily the global one) and terminates, unless a different stopping rule is specified explicitly.
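The loop described in these steps can be sketched in plain Python. This is a minimal illustration, not a production implementation: the five points are made up, and the sketch starts from given initial centroids rather than from a random assignment of points, which is an assumption made here to keep the result reproducible.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until nothing changes."""
    centroids = list(centroids)          # do not modify the caller's list
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no point switched cluster
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                   # keep old centroid if cluster is empty
                centroids[c] = mean_point(members)
    return assignment, centroids

# Five hypothetical 2-D points and k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels, centers = k_means(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centers)
```

Because the result depends on the starting centroids, library implementations usually run the algorithm several times from different starting points and keep the best run.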

```
from pandas import DataFrame

Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

df = DataFrame(Data, columns=['x', 'y'])
print(df)
```

k-means with 3 clusters:

```
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
        'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
       }

df = DataFrame(Data, columns=['x', 'y'])

# fit k-means with 3 clusters and read off the cluster centres
kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)

# colour each point by its cluster label and mark the centroids in red
plt.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
plt.show()
```

Hierarchical Clustering:

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into one cluster, and this step is repeated. The algorithm terminates when only a single cluster is left.
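The merge loop can be sketched in plain Python. This is a minimal illustration with made-up points; it assumes single linkage (the distance between two clusters is the distance between their closest members) and stops at a requested number of clusters instead of always merging down to one.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters = distance between their closest members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, n_clusters=1):
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []                        # history of (cluster_a, cluster_b, distance)
    while len(clusters) > n_clusters:
        # find the pair of clusters that are closest to each other
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        distance = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical 2-D points: two tight pairs and one far-away point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerate(points, n_clusters=2)
print(clusters)
print(len(merges))
```

The recorded merge history (which clusters were joined, and at what distance) is exactly the information a dendrogram draws.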

The results of hierarchical clustering can be shown using a dendrogram.

Two important things that you should know about hierarchical clustering are:

• The algorithm described above uses a bottom-up (agglomerative) approach. It is also possible to follow a top-down (divisive) approach, starting with all data points in the same cluster and recursively performing splits until each data point is in a separate cluster.
• The decision to merge two clusters is based on how close they are. There are multiple metrics for measuring the closeness of two clusters:
  • Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
  • Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
  • Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
  • Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
  • Mahalanobis distance: $\sqrt{(a-b)^T S^{-1} (a-b)}$, where $S$ is the covariance matrix
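The first four metrics are easy to compute by hand, which makes a small sketch useful for checking intuition. The two points below are made up; the Mahalanobis distance is left out because it also needs a covariance matrix estimated from data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum_distance(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0, 0), (3, 4)   # hypothetical points
print(euclidean(a, b))          # 5.0
print(squared_euclidean(a, b))  # 25
print(manhattan(a, b))          # 7
print(maximum_distance(a, b))   # 4
```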

```
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[5,3], [10,15], [15,12], [24,10], [30,30],
              [85,70], [71,80], [60,78], [70,55], [80,91]])

labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0], X[:,1], label='True Position')

# number each point so it can be matched with the dendrogram leaves
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(
        label, xy=(x, y), xytext=(-3, 3),
        textcoords='offset points', ha='right', va='bottom')
plt.show()
```

```
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# single-linkage agglomerative clustering of the points in X
linked = linkage(X, 'single')
labelList = list(range(1, 11))

plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
```

Learnbay provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. Choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.

### What is EDA?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

It is always good to explore and compare a data set with multiple exploratory techniques. After the exploratory data analysis, you will be confident enough in your data to engage a machine learning algorithm. Another benefit of EDA is that it helps with the selection of the feature variables that will be used later for machine learning.
In this post, we use the Iris dataset to walk through the process of EDA.

Importing libraries:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

Loading the Iris data:

```
iris = pd.read_csv("Iris.csv")
```

Understand the data:

```
iris.shape
# (150, 5)

iris['species'].value_counts()
# setosa        50
# virginica     50
# versicolor    50
# Name: species, dtype: int64

iris.columns
# Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
```

1D scatter plot of the iris data:

```
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# plot petal_length of each species along a single axis (y is fixed at 0)
plt.plot(iris_setosa["petal_length"], np.zeros_like(iris_setosa["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"], np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"], np.zeros_like(iris_virginica["petal_length"]), 'o')
plt.grid()
plt.show()
```

2D scatter plot:

```
iris.plot(kind="scatter", x="sepal_length", y="sepal_width")
plt.show()
```

2D scatter plot with the seaborn library:

```
import seaborn as sns

sns.set_style("whitegrid")
# `height` replaces the older `size` argument in current seaborn releases
sns.FacetGrid(iris, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width")
plt.show()
```

Conclusion

• Blue points can be easily separated from red and green by drawing a line.
• But red and green data points cannot be easily separated.
• Using sepal_length and sepal_width features, we can distinguish Setosa flowers from others.
• Separating Versicolor from Virginica is much harder as they have considerable overlap.

Pair Plot:

A pair plot allows us to see both the distribution of single variables and the relationships between two variables. For example, let’s say we have four features, ‘sepal length’, ‘sepal width’, ‘petal length’ and ‘petal width’, in our iris dataset. In that case, we will have $\binom{4}{2} = 6$ unique plots. The pairs, in this case, will be:

•  Sepal length, sepal width
• sepal length, petal length
• sepal length, petal width
• sepal width, petal length
• sepal width, petal width
• petal length, petal width

So, instead of trying to visualize four dimensions at once, which is not possible, we look at six 2D plots arranged as a matrix and use them to understand the 4-dimensional data.
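The count of six can be checked by enumerating the pairs directly. A small sketch, assuming the lowercase column names used in the code of this post:

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# every unordered pair of distinct features: "4 choose 2" of them
pairs = list(combinations(features, 2))
for first, second in pairs:
    print(first, "vs", second)
print(len(pairs))
```

With d features the number of such plots is d(d-1)/2, so pair plots stay readable only for a modest number of features.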

```
sns.set_style("whitegrid")
# `height` replaces the older `size` argument in current seaborn releases
sns.pairplot(iris, hue="species", height=3)
plt.show()
```

Conclusion:

1. petal length and petal width are the most useful features to identify various flower types.
2. While Setosa can be easily identified (linearly separable), virginica and Versicolor have some overlap (almost linearly separable).
3. We can find “lines” and “if-else” conditions to build a simple model to classify the flower types.

Cumulative distribution function:

```
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# histogram of setosa petal lengths, normalised so the bin values sum to 1
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)
pdf = counts / sum(counts)
print(pdf)
# [0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0.   0.04]
print(bin_edges)
# [1.   1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]

# the CDF is the running sum of the PDF
cdf = np.cumsum(pdf)
plt.grid()
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.show()
```

Mean, Median, and Std-Dev:

```
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
# mean after appending a single outlier value of 50
print(np.mean(np.append(iris_setosa["petal_length"], 50)))
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))

print("\nStd-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))
```

Output:

```
Means:
1.464
2.4156862745098038
5.5520000000000005
4.26

Std-dev:
0.17176728442867112
0.546347874526844
0.4651881339845203
```
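The second mean in the listing is computed after appending a single outlier (50), which is why it jumps from about 1.46 to about 2.42. The median does not react in the same way. A small sketch with made-up petal lengths (not the Iris values) shows the contrast:

```python
import statistics

petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4]   # hypothetical sample
with_outlier = petal_lengths + [50]          # one corrupted measurement

mean_clean = statistics.mean(petal_lengths)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(petal_lengths)
median_outlier = statistics.median(with_outlier)

print(mean_clean, mean_outlier)       # the mean is pulled towards the outlier
print(median_clean, median_outlier)   # the median stays where it was
```

This is the usual reason to report the median alongside the mean when a data set may contain outliers.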

## Random Forest Model: Introduction

The Random Forest model is a classification model built from a combination of decision trees. It is a supervised learning algorithm. As the name suggests, the algorithm creates a forest of several trees. In general, a larger number of trees in the forest gives more stable and often more accurate results, although the gains level off beyond a certain point.

The random forest model follows an ensemble technique. It involves constructing multiple decision trees at training time. Its prediction is the mode of the individual trees' predictions for classification and their mean for regression. This helps to reduce the overfitting that an individual decision tree is prone to.

### Random Forest Model Algorithm: Working

We can understand the working of the Random Forest algorithm with the help of following steps −

• Step 1 − First, start with the selection of random samples from the given dataset. Sampling is done with replacement (bootstrap sampling): each sample is drawn from the full training set, so a record can appear more than once in a sample while other records are left out of it. In addition, only a random subset of the features is considered when a tree is split. For example, if a data set has 1000 features, a split might be chosen from a random subset of only 10 of them. The final result combines the predictions of the trees trained on all the samples.

• Step 2 − Next, the algorithm constructs a decision tree for every sample and gets a prediction result from every decision tree.
• Step 3 − In this step, voting is performed over the predicted results:
  • Based on ‘n’ samples, ‘n’ trees are built.
  • Each record is classified by each of the n trees.
  • The final class for each record is decided by voting.
• Step 4 − At last, select the most voted prediction result as the final prediction result.
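The sampling and voting steps can be sketched without any library. This is only an illustration under stated assumptions: it uses the usual bootstrap scheme, in which rows are drawn with replacement, and the tree predictions are made-up values standing in for real trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) row indices with replacement."""
    return [rng.randrange(len(rows)) for _ in rows]

def majority_vote(predictions):
    """Return the most frequent prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the run is repeatable
rows = list(range(10))            # stand-in for 10 training records

sample = bootstrap_sample(rows, rng)
out_of_bag = sorted(set(rows) - set(sample))   # records this tree never sees

# Hypothetical predictions from 5 trees for a single record
tree_predictions = ["yes", "no", "yes", "yes", "no"]
final = majority_vote(tree_predictions)

print(sample)
print(out_of_bag)
print(final)
```

The records that a given sample leaves out are the "out-of-bag" records for the tree trained on that sample.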

### What is the Out of Bag score in Random Forests?

The out-of-bag (OOB) score is a way of validating the random forest model. Below is a simple intuition of how it is calculated, followed by a description of how it differs from the validation score and where it is advantageous.

For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble labeled from 1 to 5. For simplicity, suppose we have a simple original training data set as below.

OOB Error Rate Computation Steps

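• Each sample left out (out-of-bag) of the Kth tree's training sample is classified using the Kth tree.
• For each case, let j be the class that received the most votes across the trees for which that case was out-of-bag.
• The proportion of cases for which j is not equal to the true class, averaged over all cases, is the OOB error rate.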
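A small sketch of this computation, assuming the out-of-bag votes have already been collected per case (the votes and labels below are made up for illustration):

```python
from collections import Counter

def oob_error(oob_votes, true_labels):
    """oob_votes[i] holds the predictions for case i made by the trees
    that did not see case i during training."""
    wrong = 0
    counted = 0
    for votes, truth in zip(oob_votes, true_labels):
        if not votes:        # case was never out-of-bag: nothing to score
            continue
        j = Counter(votes).most_common(1)[0][0]   # most voted class
        counted += 1
        wrong += (j != truth)
    return wrong / counted

# Hypothetical OOB votes for five cases and their true classes
oob_votes = [["a", "a", "b"], ["b"], ["a", "b", "b"], [], ["a"]]
true_labels = ["a", "b", "a", "b", "b"]

error = oob_error(oob_votes, true_labels)
print(error)   # 0.5
```

Because every case is scored only by trees that never trained on it, the OOB error behaves like a validation error without needing a separate hold-out set.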

Variable importance of RF:

Variable importance identifies the features that are most useful to the random forest model, that is, the features that contribute most to high accuracy and low error.

• Random Forest computes two measures of variable importance:
  • Mean Decrease in Accuracy
  • Mean Decrease in Gini
• Mean Decrease in Accuracy is based on permutation:
  • Randomly permute the values of the variable whose importance is to be computed in the OOB sample
  • Compute the error rate with the permuted values
  • Compute the change in OOB error rate (permuted − not permuted)
  • Average this change over all the trees
• Mean Decrease in Gini is computed as the “total decrease in node impurities from splitting on the variable, averaged over all trees”.
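The "decrease in node impurity" for a single split can be worked out by hand. The sketch below computes it for one made-up split; a real forest would sum these decreases over every split that uses the variable and average over the trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = [0, 0, 1, 1]                                # hypothetical node, two classes
perfect = gini_decrease(parent, [0, 0], [1, 1])      # split separates the classes
useless = gini_decrease(parent, [0, 1], [0, 1])      # split separates nothing

print(gini(parent))   # 0.5
print(perfect)        # 0.5
print(useless)        # 0.0
```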

Finding the optimal values using grid-search cv:

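Grid search with cross-validation (grid-search CV) finds the optimal values of the model's settings, such as how many trees to build in the model, by trying each candidate combination and comparing the cross-validated scores.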

Measuring RF model performance by Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm: for each class, it shows how many predictions match the true value and how many do not.
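The four cells of a binary confusion matrix, and the metrics derived from them, can be counted directly. A minimal sketch with made-up labels:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tp, fp, fn, tn)                # 3 1 1 3
print(accuracy, precision, recall)   # 0.75 0.75 0.75
```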

Random Forest with python:

Importing the important libraries–

```
import pandas as pd
import numpy as np
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus  # used later to draw the tree
```

```
dummy_df = pd.read_csv("bank.csv", na_values=['NA'])
temp = dummy_df.columns.values
temp
print(dummy_df)
```

## Data Pre-Processing:

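```
# Assumption: the file is ';'-separated but was read with the default ','
# separator, so the header and every row arrive as one string each.
columns_name = temp[0].split(';')

data = dummy_df.values
print(data)
print(data.shape)

contacts = list()
for element in data:
    contact = element[0].split(';')
    contacts.append(contact)
```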

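```
contact_df = pd.DataFrame(contacts, columns=columns_name)
print(contact_df)

def preprocessor(df):
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    # The original listing breaks off here. Assumed completion:
    # label-encode every column and return the encoded copy.
    for column in res_df.columns:
        res_df[column] = le.fit_transform(res_df[column])
    return res_df

encoded_df = preprocessor(contact_df)
# encoded_df = preprocessor(contacts)

x = encoded_df.drop(['"y"'], axis=1).values
y = encoded_df['"y"'].values
```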

## Split the data into Train-Test

`x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)`

## Build the Decision Tree Model

```
# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2)
model_dt_2.fit(x_train, y_train)

model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ", model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ", model_dt_2_score_test)
# y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1]
```

```
# Decision tree with depth = 8 and the entropy criterion
model_dt = DecisionTreeClassifier(max_depth=8, criterion="entropy")
model_dt.fit(x_train, y_train)
# predicted probability of the positive class, used for the ROC curve
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]
```

## Graphical Representation of Tree

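```
import pydotplus  # needed for graph_from_dot_data

plt.figure(figsize=(6,6))
dot_data = StringIO()
# write the fitted tree in Graphviz DOT format into an in-memory buffer
export_graphviz(model_dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```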

## Performance Metrics

```
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)

# Model accuracy
print(model_dt.score(x_test, y_test))

# true labels of the rows that were predicted as class 1
y_actual_result = y_test[predictions == 1]
```

## Recall

```
# Share of the predicted-yes rows that are truly yes.
# Note: this ratio, P(true yes | predicted yes), is usually called precision.
y_actual_result = y_actual_result.flatten()
count = 0
for result in y_actual_result:
    if result == 1:
        count = count + 1

print("true yes|predicted yes:")
print(count / float(len(y_actual_result)))
```

## Area Under the Curve

```
plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
         lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
# diagonal reference line: a classifier that guesses at random
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
```

## Confusion Matrix

```
print(confusion_matrix(y_test, predictions))
accuracy_score(y_test, predictions)

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(model, normalize=False):
    # This function prints and plots the confusion matrix.
    # `model` is the array of predicted labels.
    cm = confusion_matrix(y_test, model, labels=[0, 1])
    classes = ["Success", "Default"]
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    # cm.shape is a tuple, so iterate over each dimension separately
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
```

```
plt.figure(figsize=(6,6))
plot_confusion_matrix(predictions, normalize=False)
plt.show()
```

### Pruning of the tree

```
from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    if inner_tree.value[index].min() < threshold:
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are children, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)
```

```
# number of leaf nodes before pruning
print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
# number of leaf nodes after pruning
sum(model_dt.tree_.children_left < 0)
```

This means that the code has created 17 new leaf nodes (by removing the links to their children). The pruned tree can be drawn in the same way as before:

```
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
```

## Human Activity recognition:

In this case study, we design a model by which a smartphone can detect its owner’s activity. Human activity recognition with a smartphone is a well-known ML project with applications in health and wellness tracking.

Most smartphones have two motion sensors, an accelerometer and a gyroscope. The accelerometer measures the linear acceleration of the phone, which is used, for example, to switch between landscape and portrait while playing mobile games, and the gyroscope measures rotational movement. The data on human activity is collected through these sensors.

For example, an Android app on a smartphone can read the accelerometer and gyroscope and predict whether the person is walking normally, walking upstairs, walking downstairs, lying down, or sitting. Some wearables and apps use such readings, together with other sensors, to estimate quantities such as heart rate and calories burned, and to report how much activity a person has done in a day. This is also an application area of the Internet of Things (IoT).

### Working of Human task project:

1. Human activity recognition: With the help of sensors, we collect body movement data captured by the smartphone. The movements are everyday activities such as walking, walking upstairs, walking downstairs, lying down, sitting, and standing. The data is recorded and later used for prediction.

2. Data set collection of activity: The data was collected from 30 volunteers aged between 19 and 48, who performed the activities mentioned above while wearing a smartphone on the waist. The subjects were video-recorded while performing the activities, and the movement data was labeled manually.

3. Human Activity Recognition Using Smartphones Data Set: The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data. The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low-frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.

• There are “train” and “test” folders containing the split portions of the data for modeling (e.g. 70%/30%).
• There is a “txt” file that contains a detailed technical description of the dataset and the contents of the unzipped files.
• There is a “txt” file that contains a technical description of the engineered features.

The contents of the “train” and “test” folders are similar (e.g. folders and file names), although with differences in the specific data they contain.
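The windowing described in the dataset description (2.56 s windows at 50 Hz with 50% overlap, that is, 128 readings per window) can be sketched in a few lines. The signal below is a made-up list of sample indices, not real sensor data.

```python
SAMPLING_RATE_HZ = 50
WINDOW_SECONDS = 2.56
WINDOW_SIZE = round(SAMPLING_RATE_HZ * WINDOW_SECONDS)   # 128 readings
STEP = WINDOW_SIZE // 2                                  # 50% overlap -> 64

def sliding_windows(signal, size=WINDOW_SIZE, step=STEP):
    """Cut a 1-D signal into fixed-width, overlapping windows."""
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, step)]

signal = list(range(320))        # hypothetical 6.4 s recording
windows = sliding_windows(signal)

print(WINDOW_SIZE, STEP)         # 128 64
print(len(windows))              # 4
print([w[0] for w in windows])   # [0, 64, 128, 192]
```

Each window is then reduced to one feature vector, which is why one row of the training file corresponds to one window rather than to one raw sensor reading.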

Load the data set and process it:

Important libraries to import for data processing:

```
# start with some necessary imports
import numpy as np
import pandas as pd
from google.colab import files

uploaded = files.upload()
```

`files.upload()` from `google.colab` is used to upload the data files from the local machine into the Colab session.

```
train_data = pd.read_csv("train.csv")
train_data.head()
```

We select the training data set for the modeling.

```
train_data.Activity.value_counts()
train_data.shape
```

`value_counts()` shows how many samples there are for each activity, and `shape` gives the number of rows and columns in the dataset.

```
train_data.describe()
```

`describe()` returns a summary table with 8 rows, one per statistic, covering the numeric columns of the data. For numeric data, the result’s index includes `count`, `mean`, `std`, `min`, and `max`, as well as the lower, `50`, and upper percentiles. By default the lower percentile is `25` and the upper percentile is `75`. The `50` percentile is the same as the median.

```
uploaded = files.upload()
test_data = pd.read_csv('test.csv')
test_data.head()
```

Here we read the test CSV file in the same way. `head()` shows the first 5 rows with their respective columns, so here we have 5 rows and 563 columns.

```
# shuffling data
from sklearn.utils import shuffle

# test = shuffle(test)
train_data = shuffle(train_data)
```

Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.
The obvious case where you’d shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.

```
# separating data inputs and output labels
trainData = train_data.drop('Activity', axis=1).values
trainLabel = train_data.Activity.values

testData = test_data.drop('Activity', axis=1).values
testLabel = test_data.Activity.values
print(testLabel)
```

The code above separates the inputs from the outputs. The output labels are the human activities captured by the device: walking, standing, walking upstairs, walking downstairs, sitting, and lying down.

```
# encoding labels
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()

# encoding test labels
encoder.fit(testLabel)
testLabelE = encoder.transform(testLabel)

# encoding train labels
encoder.fit(trainLabel)
trainLabelE = encoder.transform(trainLabel)
```

`LabelEncoder` holds the label for each class and transforms non-numerical labels (as long as they are hashable and comparable) into numerical labels. To encode categorical input features, a one-hot or ordinal encoding scheme is used instead.
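What label encoding does can be shown with a small stand-alone sketch. It assumes the classes are numbered in sorted order, and the activity names are only sample values; the helper name `fit_label_encoder` is made up for this illustration.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

activities = ["WALKING", "SITTING", "LAYING", "WALKING", "STANDING"]

mapping = fit_label_encoder(activities)
encoded = [mapping[a] for a in activities]

print(mapping)   # {'LAYING': 0, 'SITTING': 1, 'STANDING': 2, 'WALKING': 3}
print(encoded)   # [3, 1, 0, 3, 2]
```

The mapping has to be the same for the training and the test labels; otherwise the same integer could stand for different activities in the two sets.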

```
# applying a supervised neural network using a multi-layer perceptron
import sklearn.neural_network as nn

mlpSGD = nn.MLPClassifier(hidden_layer_sizes=(90,),
                          max_iter=1000, alpha=1e-4,
                          solver='sgd', verbose=10,
                          tol=1e-19, random_state=1,
                          learning_rate_init=.001)

mlpADAM = nn.MLPClassifier(hidden_layer_sizes=(90,),
                           max_iter=1000, alpha=1e-4,
                           solver='adam', verbose=10,
                           tol=1e-19, random_state=1,
                           learning_rate_init=.001)

nnModelSGD = mlpSGD.fit(trainData, trainLabelE)

y_pred = mlpSGD.predict(testData).reshape(-1, 1)
# print(y_pred)
from sklearn.metrics import classification_report
print(classification_report(testLabelE, y_pred))
```

```
import matplotlib.pyplot as plt
import seaborn as sns

# NOTE: `sub_01` is not defined anywhere in this listing; it has to be
# created (as a DataFrame with an 'Activity' column) before running this.
fig = plt.figure(figsize=(32,24))
ax1 = fig.add_subplot(221)
ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,0], data=sub_01, jitter=True)
ax2 = fig.add_subplot(222)
ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,1], data=sub_01, jitter=True)
plt.show()

fig = plt.figure(figsize=(32,24))
ax1 = fig.add_subplot(221)
ax1 = sns.stripplot(x='Activity', y=sub_01.iloc[:,2], data=sub_01, jitter=True)
ax2 = fig.add_subplot(222)
ax2 = sns.stripplot(x='Activity', y=sub_01.iloc[:,3], data=sub_01, jitter=True)
plt.show()
```


## Introduction of Support Vector Machine

Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms for both classification and regression.

SVMs were introduced initially in the 1960s and were later refined in 1990s. However, it is only now that they are becoming extremely popular, owing to their ability to achieve brilliant results. SVMs are implemented uniquely when compared to other machine learning algorithms.

Support vector machine(SVM) is a supervised learning algorithm that is used to classify the data into different classes, now unlike most algorithms SVM makes use of hyperplane which acts as a decision boundary between the various classes. In general, SVM can be used to generate multiple separating the hyperplane so that the data is divided into segments. These segments contain some kind of data. SVM used to classify the data into two different segments depending on the feature of data.

### Features of Support Vector Machine (SVM)

An SVM studies the labeled data and then classifies any new input data based on what it learned in the training phase.

It can be used for both classification and regression problems: SVC stands for support vector classification, and SVR stands for support vector regression. One of the main features of SVM is the kernel function, which lets it handle nonlinear data through the kernel trick. The kernel trick works by effectively transforming the data into another dimension so that we can draw a hyperplane that classifies the data.
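To make the kernel trick a little more concrete, here is a minimal sketch using NumPy only. The helper names `phi` and `poly_kernel` are illustrative assumptions, not part of any library. It checks that a degree-2 polynomial kernel, evaluated in the original 2-D space, gives the same number as an ordinary dot product taken after an explicit mapping into 3-D.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Dot product computed after mapping both points into the 3-D feature space.
explicit = np.dot(phi(a), phi(b))
# Same quantity computed directly in the original 2-D space.
implicit = poly_kernel(a, b)

print(explicit, implicit)
```

The two values agree up to floating-point rounding, which is why an SVM can work in the transformed space without ever building the transformed vectors.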

How does SVM work?

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. A separator between the categories is found, and the data are transformed in such a way that the separator can be drawn as a hyperplane. Following this, the characteristics of new data can be used to predict the group to which a new record should belong.
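As a rough illustration of this idea, the sketch below uses scikit-learn's synthetic `make_circles` data (an assumption for demonstration; it is not the dataset used in this article). It compares a linear kernel with an RBF kernel on data that cannot be split by a straight line, then predicts the group of a new record.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: class 1 is the inner ring, class 0 the outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight-line separator versus a kernel-based separator.
linear_svm = SVC(kernel='linear').fit(X_train, y_train)
rbf_svm = SVC(kernel='rbf').fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)

# Predict the group of a new record located at the centre of the rings.
new_record = [[0.0, 0.0]]
pred = rbf_svm.predict(new_record)
print("predicted class for new record:", pred[0])
```

On data like this, the RBF kernel should score far higher than the linear one, because no single straight line can separate an inner ring from an outer ring.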

Importing Libraries:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

bankdata = pd.read_csv("D:/Datasets/bill_authentication.csv")
```

Exploratory Data Analysis:

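```python
bankdata.shape
bankdata.head()
```

|   | Variance | Skewness | Curtosis | Entropy | Class |
|---|----------|----------|----------|---------|-------|
| 0 | 3.62160 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.54590 | 8.1674 | -2.4586 | -1.46210 | 0 |
| 2 | 3.86600 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.45660 | 9.5228 | -4.0112 | -3.59440 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.98880 | 0 |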

Data preprocessing:

```python
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
```

Training the Algorithm:

```python
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
```

Making predictions:

`y_pred = svclassifier.predict(X_test)`

Evaluating the Algorithm:

```python
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
```

Output:

```
[[152   0]
 [  1 122]]
             precision    recall  f1-score   support

          0       0.99      1.00      1.00       152
          1       1.00      0.99      1.00       123

avg / total       1.00      1.00      1.00       275
```
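To see how the report relates to the confusion matrix, the following sketch recomputes precision, recall, and F1 by hand from the four counts printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = np.array([[152, 0],
               [1, 122]])

correct = cm.diagonal()                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)    # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)       # divide by row totals (true counts)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)                # number of true samples per class
accuracy = cm.trace() / cm.sum()        # overall fraction classified correctly

for label in range(2):
    print(label, round(precision[label], 2), round(recall[label], 2),
          round(f1[label], 2), support[label])
print("accuracy:", round(accuracy, 4))
```

Only one of the 275 test samples is misclassified here (a class 1 sample predicted as class 0), which is why every score rounds to 0.99 or 1.00.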

### SVM Linear Classifier:

In the linear classifier model, we assume that the training examples are plotted as points in space and that these points can be separated by a clear gap. The model finds a straight hyperplane dividing the two classes. The primary focus while drawing the hyperplane is on maximizing the distance from the hyperplane to the nearest data point of either class. The resulting hyperplane is called the maximum-margin hyperplane.
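The sketch below illustrates the maximum-margin idea on six hand-made 2-D points (toy data assumed purely for demonstration). After fitting a linear `SVC`, the weight vector `w` and intercept `b` define the hyperplane `w . x + b = 0`, and the width of the margin is `2 / ||w||`. A very large `C` is used here to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two small, clearly separated groups of points.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Large C discourages margin violations, approximating a hard-margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)
```

The support vectors printed at the end are the training points that sit on the edge of the margin; only these points determine where the hyperplane is placed.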

### SVM Non-Linear Classifier:

In the real world, datasets are generally dispersed to some extent, so separating the classes with a straight linear hyperplane is often not a good choice. To address this, Vapnik suggested creating non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. In non-linear SVM classification, the data points are plotted in a higher-dimensional space.

### Differentiating Data Scientist and Data Analyst

There is a significant difference between a Data Scientist and a Data Analyst, and it is interesting to learn about both roles.
