Exploring the Power of Scikit-Learn: Unleash the Potential with 'sklearn'
Thu Feb 8, 2024
"Scikit-Learn or sklearn is the open-source python library useful in machine learning for predictive data analysis."
As we know, in machine learning, a machine automatically learns from the data and predicts outcomes.
The sklearn library provides simple and efficient tools for data mining and data analysis. It is built on top of other scientific computing libraries such as NumPy, SciPy, and Matplotlib, which makes it easier to work with arrays and machine learning techniques.
sklearn contains a wide range of algorithms for various machine learning tasks, including classification, regression, clustering, and dimensionality reduction.
It also comes with a number of tools for data preprocessing, model selection and model evaluation.
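For example, here is a minimal sketch of one of those preprocessing tools, StandardScaler, which rescales each feature to zero mean and unit variance (the small array below is made up purely for illustration):
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative values only
scaler = StandardScaler()            # one of sklearn's preprocessing tools
scaled = scaler.fit_transform(data)  # each column now has mean 0 and standard deviation 1
print(scaled)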
To install scikit-learn, the command is:
pip install scikit-learn
But if you have installed Anaconda (an open-source platform for machine learning and data science), there is no need to install scikit-learn separately, as it comes preinstalled with Anaconda.
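Either way, we can quickly confirm that scikit-learn is available by printing its version (the exact version shown will depend on your installation):
import sklearn
print(sklearn.__version__)  # prints the installed scikit-learn version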
To use it, we first need to import the required module.
For example:
from sklearn.linear_model import LinearRegression
Here, linear_model is a module of the sklearn library, and LinearRegression is the class.
To implement a linear regression model, we just need to instantiate the LinearRegression class.
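As a minimal sketch, here is what instantiating and using that class looks like on a tiny made-up dataset (the numbers below are purely illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4]])  # made-up feature values
y = np.array([2, 4, 6, 8])          # made-up targets, roughly y = 2 * x
model = LinearRegression()          # instantiate the class
model.fit(X, y)                     # learn the straight line from the data
print(model.predict([[5]]))         # predicts a value close to 10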
Now let’s understand how we can use scikit-learn (sklearn) to implement a linear regression model, which finds the relationship between variables and fits a straight line that can be used to predict future values.
First, we have to import the libraries NumPy, Matplotlib, and Pandas.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Then import the dataset (salaries of employees corresponding to their years of experience).
dataset = pd.read_csv("Data.csv")  # load the data into a DataFrame
X = dataset.iloc[:, :-1]           # matrix of features: every column except the last
Y = dataset.iloc[:, -1]            # dependent variable: the last column (salary)
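Assuming Data.csv has two columns, say YearsExperience and Salary (these column names are only an assumption for illustration), X holds every column except the last one and Y holds the last column. A quick sanity check could look like this:
print(dataset.head())    # first five rows of the raw data
print(X.shape, Y.shape)  # X: all columns except the last, Y: the last column only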
Then we need to split the dataset into a training set and a test set, which is one of the tools of the data preprocessing toolkit.
So how are we going to do this?
We're going to do it with a function from scikit-learn: the library contains a module called model_selection, which itself contains a function called train_test_split.
from sklearn.model_selection import train_test_split
Here, the train_test_split function will create four separate sets: a matrix of features and a dependent variable vector for the training set, and a matrix of features and a dependent variable vector for the test set.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
So we basically get four sets: X_train, the matrix of features of the training set; X_test, the matrix of features of the test set; Y_train, the dependent variable of the training set; and Y_test, the dependent variable of the test set.
And why do we want this format?
It is the format expected by the machine learning models we are going to discuss: all of them expect these inputs.
For training, a model expects X_train and Y_train as inputs to a method called fit(), and for predictions it takes X_test as input to its predict() method.
These four variables are returned by the train_test_split function. The first two arguments of this function are X and Y: the matrix of features and the dependent variable of the dataset.
The next argument is test_size=0.2, which means 20% of the observations go into the test set. We are not going to split the dataset into a training set and a test set of equal size: we need many observations in the training set and only a few in the test set, so as to give the future machine learning model the best chance to understand and learn the correlations in the dataset.
Therefore, the recommended split is 80% of the observations in the training set and 20% in the test set.
The final argument is random_state=1. The observations are split randomly into the training set and the test set, so to make sure we always get the same random split, we fix the random state to one. That means we will get the same split, and therefore the same training set and the same test set, every time the code runs.
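As a small standalone sketch (with made-up data, just to inspect the split), we can see that fixing random_state always reproduces the same split:
from sklearn.model_selection import train_test_split
import numpy as np
data = np.arange(10)  # ten made-up observations
a_train, a_test = train_test_split(data, test_size=0.2, random_state=1)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=1)
print(a_test, b_test)  # identical two-element test sets, because random_state is fixed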
Now, after data preprocessing, we are ready to build and train the simple linear regression model on the training set.
So the first thing we have to do is import the right class with which we are going to build this simple linear regression model.
The library we are going to use is scikit-learn, from which we get access to a module called linear_model.
And from this module, we call a class named LinearRegression; the simple linear regression model we build will be an instance (an object) of this class.
Let's start by importing from the scikit-learn library, which has the code name sklearn, then access linear_model after adding a dot, and from this linear_model module import the LinearRegression class.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
After importing the class, we create a new variable, regressor, which is an instance of the LinearRegression class. The next step is to call the method we use to train our regression model: the fit() method.
For this, we take the object regressor and add the fit() method after a dot.
Inside the parentheses, it expects the training set in the format of, first, the matrix of features X_train and, second, the dependent variable vector Y_train.
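Once fit() has run, the learned line can be inspected through the regressor's coef_ and intercept_ attributes, which hold the slope and the intercept of that straight line:
print(regressor.coef_)       # slope: the salary increase per additional year of experience
print(regressor.intercept_)  # intercept: the predicted salary at zero years of experience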
Now let’s see the code to visualize a training set results.
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'green' )
plt.title('Salary vs. Experience (Training Set)')
plt.xlabel('Experience (in Years)')
plt.ylabel('Salary')
plt.show()
Basically, we're going to have a 2D plot with the x axis being the number of years of experience, from 1 to 10, and the y axis being the salaries, in the range of salaries given in the dataset.
We are going to plot the real salaries as red points and the predicted salaries as a green straight line. And we will do that both for the predictions on the training set and for the predictions on the test set.
The points corresponding to the predicted salaries follow a straight line, and therefore we use the plot method.
Here, the first coordinate of these predicted salaries is X_train, because we are visualizing the results on the training set, so it corresponds to the numbers of years of experience of the employees in the training set. The second coordinate is the result of calling the predict method on X_train, that is, on those numbers of years of experience, which gives us the predicted salaries for the training set.
As we can see, the regression line is the line of predictions that comes as close as possible to the real results, the real salaries.
And that is why this model is called linear regression: it makes its predictions along a straight line fitted to the data.
And the following is the code to visualize the test set results for the given dataset.
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'green' )
plt.title('Salary vs. Experience (Test Set)')
plt.xlabel('Experience (in Years)')
plt.ylabel('Salary')
plt.show()
Here, in the scatter plot, we have replaced X_train with X_test and Y_train with Y_test, as these are the real observations, the numbers of years of experience and the real salaries, of the test set. Note that the regression line is still plotted from X_train: the trained model produces one unique straight line, so plotting its predictions on X_train or on X_test gives the same line.
And to predict the results of the observations in the test set, we have to call the predict() method:
y_pred = regressor.predict(X_test)
print(y_pred)
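To put a number on how good these predictions are, scikit-learn's model evaluation tools can compare y_pred with the real test values Y_test. Here is a short sketch using two common metrics:
from sklearn.metrics import mean_absolute_error, r2_score
print(mean_absolute_error(Y_test, y_pred))  # average absolute error of the predicted salaries
print(r2_score(Y_test, y_pred))             # R-squared score: closer to 1 means a better fit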
Conclusion:
With its emphasis on simplicity, flexibility, and performance, scikit-learn continues to empower researchers, developers, and data enthusiasts worldwide to explore complex datasets, build predictive models, and unlock insights that drive innovation across diverse domains. As the field of machine learning evolves, scikit-learn remains a steadfast companion, enabling individuals and organizations to harness the power of data to solve real-world challenges and shape a brighter future.
Dr. Hesam Akhtar
Educator.