Let’s take an example and walk through it: we will model the percentage of marks a student scores based on the number of hours they study.
Let’s create a sample of 25 students ourselves, where X is the number of hours a student studies per day and Y is the percentage they scored in the exam.
# importing required libraries
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3,
2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9,
6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8]
>>> Y = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85,
62, 41, 42, 17, 95, 30, 24, 67, 69, 30, 54,
35, 76, 86]
>>> X = np.array(X)
>>> Y = np.array(Y)
>>> plt.scatter(X,Y)
>>> plt.xlabel("No. of Hours Studied")
>>> plt.ylabel("Exam Percentage")
>>> plt.grid()
>>> plt.ylim(0,100)
>>> plt.xlim(0,10)
>>> plt.show()
There is no straight line that passes through every one of these points exactly. A curve that touched each and every point would be wildly complicated and nearly impossible to model. So instead we try to find the straight line that minimizes the squared error.
As we know, the equation of a straight line is y = mx + c.
Let’s assume, for simplicity, that the line passes through (0,0). So,
y = mx
or
y = x.m
In matrix notation, this problem is formulated using the normal equation,
X^{T}.X.m = X^{T}.y
This can be rearranged in order to specify the solution for m as,
m = (X^{T}.X)^{-1} . X^{T}. y
>>> X = X.reshape(-1,1)
>>> m = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
>>> m
array([10.17425707])
So, the equation we get after solving for m using the normal equation is
y = 10.17425707x
This equation models the actual data reasonably well, as we can see from the graph below.
>>> yadj = X.dot(m)
>>> plt.plot(X,yadj)
>>> plt.scatter(X,Y)
>>> plt.grid()
>>> plt.ylim(0,100)
>>> plt.xlim(0,10)
>>> plt.show()
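As a sanity check, the same slope can be recovered without writing the matrix inverse by hand. The sketch below (NumPy only; the data is copied from above) compares the closed-form value against np.linalg.lstsq, which solves the same least-squares problem numerically:

```python
import numpy as np

# Same data as above
X = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3,
              2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9,
              6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8])
Y = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85,
              62, 41, 42, 17, 95, 30, 24, 67, 69, 30, 54,
              35, 76, 86])

# For a single column, (X^T X)^{-1} X^T y reduces to sum(x*y) / sum(x*x)
m_closed = X.dot(Y) / X.dot(X)

# np.linalg.lstsq solves the same least-squares problem
m_lstsq, *_ = np.linalg.lstsq(X.reshape(-1, 1), Y, rcond=None)
print(m_closed, m_lstsq[0])  # both approximately 10.174
```

Both routes give the same slope, which is reassuring when moving on to larger design matrices where inverting X^{T}.X directly becomes numerically fragile.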
Let’s take another example,
>>> X = np.array([1,2,3,4])
>>> Y = np.array([5,2,3,1])
>>> X = X.reshape(-1,1)
What happens in this case, according to our solution?
>>> m = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y)
>>> m
array([0.73333333])
>>> yadj = X.dot(m)
>>> plt.plot(X,yadj)
>>> plt.scatter(X,Y)
>>> plt.grid()
>>> plt.ylim(0,6)
>>> plt.xlim(0,5)
>>> plt.show()
Since we omitted c from our equation earlier, we get a straight line passing through the origin. But it nevertheless tries to minimize the squared error.
Now, how do we handle the case with c?
5 = m*1 + c*1
2 = m*2 + c*1
3 = m*3 + c*1
1 = m*4 + c*1
[5]   [1 1]
[2] = [2 1] . [m]
[3]   [3 1]   [c]
[1]   [4 1]
Y = X . b
where X now includes a column of 1s and b consists of the elements m and c.
So,
>>> X_ = np.hstack((X,np.array([[1],[1],[1],[1]])))
>>> X_
array([[1, 1],
[2, 1],
[3, 1],
[4, 1]])
>>> b = np.linalg.inv(X_.T.dot(X_)).dot(X_.T).dot(Y)
>>> print(b)
[-1.1  5.5]
>>> print(f"m = {b[0]}, c = {b[1]}")
m = -1.1000000000000003, c = 5.499999999999999
>>> yadj = X.dot(b[0]) + b[1] # y = mx + c
>>> print(yadj)
[[4.4]
[3.3]
[2.2]
[1.1]]
>>> plt.plot(X,yadj)
>>> plt.scatter(X,Y)
>>> plt.grid()
>>> plt.ylim(0,6)
>>> plt.xlim(0,5)
>>> plt.show()
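As a quick cross-check, np.polyfit with degree 1 fits the same y = mx + c by least squares, so it should reproduce the values we got from the normal equation:

```python
import numpy as np

X = np.array([1, 2, 3, 4])
Y = np.array([5, 2, 3, 1])

# np.polyfit returns coefficients highest degree first: [m, c]
m, c = np.polyfit(X, Y, 1)
print(m, c)  # -1.1 and 5.5, matching the normal-equation result
```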
Now, we can see that the fitted line correctly models the data points.
It is a field of study that gives computers the ability to learn without being explicitly programmed.
But how is a computer supposed to learn? Tom Mitchell says that
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
For example: playing checkers. (By the way, Arthur Samuel’s checkers-playing program was among the world’s first successful self-learning programs.)
Here,
E = the experience of playing many games of checkers
T = the task of playing checkers
P = the probability that the program will win the next game.
Broadly, there are three types of Machine Learning Algorithms:
Supervised Learning: Here, we are the teachers. We tell the system what the correct answer is and train it until it begins to predict with an acceptable level of accuracy. We can further group it into Regression and Classification.
Regression: Suppose, you have a dataset of students’ reading habits and their performance in the exam. Then you plot ‘Hours Studied’ vs ‘Test Grade’.
You can clearly see that studying more tends to yield better grades. So the system devises a function Y = f(x) [or ŷ = wx + b, where w is the slope, x is hours studied, and b is the y-intercept]. We train the system to find the function with the minimum possible error. Then, if a new student comes and claims that he/she studies 5 hours a day, we can predict their test grade.
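A minimal sketch of this idea, using made-up hours/grade numbers (not a real dataset), and fitting ŷ = wx + b by least squares:

```python
import numpy as np

# Hypothetical (invented) data: hours studied per day vs. test grade
hours = np.array([1, 2, 3, 4, 6, 8])
grade = np.array([52, 58, 63, 70, 81, 94])

# Fit y_hat = w*x + b by least squares
w, b = np.polyfit(hours, grade, 1)

# Predict the grade of a new student who studies 5 hours a day
new_student = w * 5 + b
print(round(float(new_student), 1))
```

The learned slope w is positive, capturing the "more study, better grade" trend, and the prediction for 5 hours lands between the grades observed at 4 and 6 hours.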
Classification: Let’s suppose we have a dataset for breast cancer and we have features like Age and Tumor Size.
Based on our data, the algorithm will create a mapping function Y = f(x). Then, when we provide new data, it will predict whether breast cancer is malignant or benign based on Tumor Size and Age. Generally, we will have many more features (Clump thickness, Uniformity of Cell Size, Uniformity of Cell Shape, etc) to determine whether 1(Malignant) or 0(Benign) [Binary Classification].
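As an illustration only (the ages, tumor sizes, and labels below are invented, and a real system would use something like logistic regression rather than this simple rule), a nearest-centroid classifier captures the idea of learning a mapping from features to 1(Malignant)/0(Benign):

```python
import numpy as np

# Hypothetical toy data: [age, tumor size]; 1 = malignant, 0 = benign
features = np.array([[45, 1.2], [52, 1.0], [60, 3.8],
                     [68, 4.1], [39, 0.8], [71, 3.5]])
labels = np.array([0, 0, 1, 1, 0, 1])

# "Training": compute the mean feature vector (centroid) of each class
centroid_benign = features[labels == 0].mean(axis=0)
centroid_malignant = features[labels == 1].mean(axis=0)

def predict(x):
    """Predict the class whose centroid is closest to x."""
    x = np.asarray(x, dtype=float)
    d0 = np.linalg.norm(x - centroid_benign)
    d1 = np.linalg.norm(x - centroid_malignant)
    return int(d1 < d0)  # 1 = malignant, 0 = benign

print(predict([42, 1.1]))  # 0: close to the benign centroid
print(predict([65, 3.9]))  # 1: close to the malignant centroid
```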
Unsupervised Learning: Here, we are not given the right answer. Here’s the data; do what you want with it. The algorithms are left on their own to discover interesting structure in the data. We can further group it into Clustering and Association algorithms.
Clustering Algorithm: It groups similar things together. We don’t provide any labels, but the system understands the data and separates it based on features it finds. One example where clustering is used is Google News, which groups similar stories from different sites into cohesive clusters.
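A minimal clustering sketch, assuming two invented blobs of unlabeled 2-D points and a hand-rolled k-means loop (a real system would use a library implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unlabeled data: two well-separated blobs in 2-D
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
data = np.vstack([blob_a, blob_b])

# A minimal k-means loop (k = 2): repeatedly assign each point to its
# nearest centroid, then move each centroid to the mean of its points
centroids = data[[0, -1]].copy()  # initialize from two data points
for _ in range(10):
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    centroids = np.array([data[assign == k].mean(axis=0) for k in range(2)])

print(np.round(centroids, 1))  # one centroid near (0, 0), one near (5, 5)
```

No labels are ever given; the loop recovers the two groups purely from the geometry of the data.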
Association Algorithm: Here is the data; find an association.
At a basic level, it analyzes data for patterns or co-occurrences in a database and identifies if-then relationships, called association rules. In a grocery store, if we know that certain items are frequently bought together and they are placed on the same shelf, buyers of one item are prompted to buy the other. Promotional discounts and advertisements can then be targeted at the customer accordingly.
Let’s consider itemset1 = {bread, milk} and itemset2 = {bread, shampoo}. You can guess that itemset1 will have higher support (the fraction of all transactions that contain the itemset) than itemset2. And if a cart already has {bread} in it, {milk} has higher confidence (the likelihood of an itemset occurring given the items already in the cart) than {shampoo}.
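These two measures are easy to compute directly. A small sketch with invented transactions:

```python
# Hypothetical transactions, each a set of items in one cart
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "shampoo"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Likelihood of consequent given the cart already has antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))       # 0.6
print(support({"bread", "shampoo"}))    # 0.2
print(confidence({"bread"}, {"milk"}))     # 0.75
print(confidence({"bread"}, {"shampoo"}))  # 0.25
```

As guessed above, {bread, milk} has higher support than {bread, shampoo}, and given {bread}, {milk} has higher confidence than {shampoo}.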
Reinforcement Learning: Here, the machine is exposed to an environment where it trains itself continually by trial and error, learning from past experience to capture the best possible knowledge and make accurate decisions. E.g.: you have an agent and a reward, with many hurdles in between. The goal of the robot is to reach the reward (the diamond) while avoiding the hurdles (fire). The robot learns by trying all the possible paths and choosing the path that reaches the reward with the fewest hurdles. Each right step earns the robot a reward and each wrong step subtracts from it; the total reward is tallied when it reaches the diamond.
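A toy sketch of this trial-and-error loop, assuming a made-up 1-D corridor with fire at one end and a diamond at the other, using tabular Q-learning (one standard reinforcement-learning algorithm, not necessarily the setup described above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 1-D corridor: states 0..4, fire at state 0, diamond at state 4
# Actions: 0 = move left, 1 = move right
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    if nxt == n_states - 1:
        return nxt, 10, True    # reached the diamond: reward
    if nxt == 0:
        return nxt, -10, True   # stepped into the fire: penalty
    return nxt, -1, False       # small cost per step

for _ in range(500):            # trial-and-error episodes
    state, done = 2, False
    while not done:
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        action = rng.integers(n_actions) if rng.random() < epsilon else Q[state].argmax()
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge Q toward reward plus discounted future value
        Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
        state = nxt

# After training, the greedy action from every middle state is "move right"
print([int(Q[s].argmax()) for s in range(1, n_states - 1)])
```

The agent is never told the right path; it discovers that moving right maximizes total reward purely from the reward signal accumulated over many episodes.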