Neural Network Back Propagation from scratch!
This post is inspired by Andrej Karpathy; I would highly recommend going through the playlist below, because it is probably the most step-by-step, spelled-out explanation of back propagation and the training of neural networks.

  • Back propagation is the method of calculating the gradient of the loss function with respect to the weights of the neural network.
  • It is a way of computing gradients of expressions through recursive application of the chain rule.
  • What does it do? It is used for fine-tuning the weights of neural nets, which in turn decreases the loss.
  • The concept of the derivative is super important in back propagation.
  • It signifies the slope of a function with respect to some variable.
  • In layman's terms, the derivative measures the effect of bumping up a variable by a very small value (e.g. 0.001), leading to an increase or decrease in the function value.
  • If the slope is positive, the variable has a positive (increasing) effect on the resultant function.
  • If the slope is negative, the variable has a negative (decreasing) effect on the resultant function.
  • In the case of back propagation, the variables are the weights and biases, and the function is the loss function.
  • In short, the derivative captures the influence of the weights on the loss function, which in turn helps in fine-tuning the weights so that the loss decreases.
def f(x):
    return 3*(x**2) - (4*x) + 5

# Numerical derivative with respect to x, for a small step h
h = 0.001
x = 3.0
df_dx = (f(x+h) - f(x)) / h
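As a quick sanity check (my own addition, not from the original), the analytic derivative of this f is f'(x) = 6x - 4 by the power rule, and the numerical estimate should land very close to it:

```python
def f(x):
    return 3*(x**2) - (4*x) + 5

def df_analytic(x):
    # f'(x) = 6x - 4 by the power rule
    return 6*x - 4

x, h = 3.0, 0.001
df_numeric = (f(x + h) - f(x)) / h  # forward-difference estimate

# at x = 3: analytic slope is 14, numerical estimate is about 14.003
```

The small gap (about 3*h) is the error of the forward-difference approximation; it shrinks as h shrinks.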

  • A perceptron is a neural network unit that performs certain computations to detect features.
  • The perceptron is inspired by biological neurons, modelled as mathematical operations.
perceptron
  • x: inputs; w: weights; b: bias; sigma: activation function.
  • Now we are going to replicate the job of a perceptron step by step.
  • We will also see what the Value class contains in the next section.
w1 = Value(2.13, label='w1')
x1 = Value(3, label='x1')

w2 = Value(3, label='w2')
x2 = Value(1, label='x2')

b = Value(5, label='b')

## modeling the perceptron
w1x1 = w1*x1; w1x1.label = 'w1*x1'
w2x2 = w2*x2; w2x2.label = 'w2*x2'
w1x1w2x2 = w1x1 + w2x2; w1x1w2x2.label = '(w1*x1)+(w2*x2)'
z = w1x1w2x2 + b; z.label = 'z'
y = z.relu(); y.label = 'y'
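To see the numbers behind the graph, the same forward pass can be replayed with plain floats (my own sketch, using the values above):

```python
# plain-float replay of the perceptron forward pass above
w1, x1 = 2.13, 3.0
w2, x2 = 3.0, 1.0
b = 5.0

z = w1*x1 + w2*x2 + b   # 6.39 + 3.0 + 5.0 = 14.39
y = max(0.0, z)         # ReLU leaves a positive z unchanged
```

This 14.39 is exactly the first predicted value printed by the training loop later in the post.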

  • Visual Representation of the above code.
perceptron operations
  • Now we are going to back propagate through the neuron.
  • In a typical neural network setting we mainly care about the derivatives with respect to the weights, because we change the weights as part of optimization.
  • For the sake of simplicity we are taking a single neuron, but in a neural net we would have many neurons interconnected with one another.
  • At the end there is a loss function which measures the error of the neural net, and we back propagate with respect to the loss function in order to reduce it.
  • The crux of back propagation is the chain rule of derivatives.
  • Example of the chain rule: if a car travels twice as fast as a bicycle, and the bicycle travels 4 times as fast as a walking man, how do we compare the speed of the car to the speed of the walking man? It is 4 × 2 = 8 times faster than the walking man.
  • The chain rule in code can be seen below:
# set some inputs
x = -2; y = 5; z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12

# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so the gradient on z becomes 3
dfdq = z # df/dq = z, so the gradient on q becomes -4
# now backprop through q = x + y
dqdx = 1.0
dqdy = 1.0
dfdx = dfdq * dqdx # the multiplication here is the chain rule!
dfdy = dfdq * dqdy
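These chain-rule results can be double-checked numerically with finite differences (a small sketch of my own, not in the original):

```python
def f(x, y, z):
    return (x + y) * z

x, y, z = -2.0, 5.0, -4.0
h = 1e-6

# finite-difference estimates of the same gradients
dfdx = (f(x + h, y, z) - f(x, y, z)) / h  # should be close to z = -4
dfdy = (f(x, y + h, z) - f(x, y, z)) / h  # should be close to z = -4
dfdz = (f(x, y, z + h) - f(x, y, z)) / h  # should be close to q = 3
```

Numerical gradient checking like this is a standard way to verify a hand-derived backward pass.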

  • Now we are going to create a class called Value which stores a single scalar value and its gradient (the slope with respect to the loss function).
class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0
        # internal variables used for autograd graph construction
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op # the op that produced this node, for graphviz / debugging / etc
        self.label = label

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data - other.data, (self, other), '-')

        def _backward():
            self.grad += out.grad
            other.grad -= out.grad  # d(self - other)/d(other) = -1
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def backward(self):

        # topologically order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo):
            v._backward()

  • The Value class also keeps track of its children nodes, over which we perform a reversed topological ordering.
  • Following this reversed topological order of children nodes, we back propagate and calculate each node's gradient with respect to the loss function.
  • On top of this we add nodes for mathematical operations using magic methods.
  • Each of these mathematical operation nodes has its own pattern for calculating and updating the gradients of its children nodes.
  • The add gate always takes the gradient on its output and distributes it equally to all of its inputs, regardless of what their values were during the forward pass.
self.grad += out.grad
other.grad += out.grad
# out.grad is the output node's gradient w.r.t. the loss
# self and other are the inputs whose slope must be calculated w.r.t. the loss
  • The multiply gate's local gradients are the input values (but switched), and they are multiplied by the gradient on its output during the chain rule.
self.grad += other.data * out.grad
other.grad += self.data * out.grad
# here we can observe that the data of the input values are switched
  • The ReLU gate routes the gradient. Unlike the add gate, which distributes the gradient unchanged to all of its inputs, the ReLU gate passes the gradient through (unchanged) only if the output of that node is greater than 0.
self.grad += (out.data > 0) * out.grad
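The three routing behaviours can be seen side by side with plain numbers (my own illustrative sketch, with a made-up upstream gradient g):

```python
g = 5.0  # a made-up upstream gradient arriving from the loss

# add gate: out = a + b -> g is distributed unchanged to both inputs
da_add, db_add = 1.0 * g, 1.0 * g

# mul gate: out = a * b -> local gradients are the switched input values
a, b = 2.0, -3.0
da_mul, db_mul = b * g, a * g          # -15.0 and 10.0

# ReLU gate: g passes through only where the forward output was positive
relu_grad_pos = (max(0.0, 2.0) > 0) * g   # output was 2.0 -> gradient 5.0
relu_grad_neg = (max(0.0, -2.0) > 0) * g  # output was 0.0 -> gradient 0.0
```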
  • Now we know how to back propagate and calculate the gradient with respect to the loss function.
  • We are going to create a function which calculates the loss of the predicted value with respect to the actual value.
def cal_loss(actual_val):
    """ Forward pass (w1, w2 and b are the global Values defined above) """
    x1 = Value(3, label='x1')
    x2 = Value(1, label='x2')
    w1x1 = w1*x1; w1x1.label = 'w1*x1'
    w2x2 = w2*x2; w2x2.label = 'w2*x2'
    w1x1w2x2 = w1x1 + w2x2; w1x1w2x2.label = '(w1*x1)+(w2*x2)'
    z = w1x1w2x2 + b; z.label = 'z'
    ypred = z.relu(); ypred.label = 'y'

    # Calculating the loss
    l = (ypred - actual_val)**2; l.label = 'loss'
    print(ypred.data)
    return l
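The gradient that backward() will produce for w1 can be predicted by hand: while the ReLU is active, dL/dw1 = 2*(ypred - actual)*x1 by the chain rule. A plain-float sketch (my own, mirroring the Value graph above) confirms it against a finite-difference estimate:

```python
def loss(w1, w2, b, x1, x2, actual):
    ypred = max(0.0, w1*x1 + w2*x2 + b)   # forward pass with plain floats
    return (ypred - actual)**2

w1, w2, b = 2.13, 3.0, 5.0
x1, x2, actual = 3.0, 1.0, 1000.0

# when the ReLU is active, dL/dw1 = 2*(ypred - actual)*x1 by the chain rule
ypred = max(0.0, w1*x1 + w2*x2 + b)   # 14.39
dl_dw1_analytic = 2*(ypred - actual)*x1   # 2*(-985.61)*3 = -5913.66

h = 1e-6
dl_dw1_numeric = (loss(w1 + h, w2, b, x1, x2, actual)
                  - loss(w1, w2, b, x1, x2, actual)) / h
```

The large negative gradient says that increasing w1 decreases the loss, which is exactly why the training loop below pushes the weights up.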

  • To calculate the loss, we first have to perform the forward pass.
  • After the forward pass we get a predicted value, from which we calculate the loss.
  • After calculating the loss we back propagate and calculate how much influence every weight has on the loss function.
actual_val = 1000
for i in range(10): # number of iterations
    w1.grad = 0; w2.grad = 0 # reset the gradients, otherwise they accumulate across iterations
    l = cal_loss(actual_val) # recalculate the loss with the latest weights
    l.backward() # calculate the gradients w.r.t. the loss function with the latest weights
    w1.data = w1.data - 0.01*w1.grad # update the weights to reduce the loss
    w2.data = w2.data - 0.01*w2.grad

#>>> ypred gets updated on every iteration of forward pass and backward pass
14.39
211.512
369.16960000000006
495.29568000000006
596.196544
676.9172352000002
741.4937881600001
793.1550305280001
834.4840244224
867.54721953792

  • We update the weights so that the loss reduces and the predicted value approaches the actual value.
  • We can observe that the y prediction gets closer and closer to the actual value after each iteration.
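The whole loop can be condensed into a plain-float sketch (my own, with the same hyperparameters; since this single neuron's ReLU stays active, the gradients reduce to simple products):

```python
# same hyperparameters as the Value-based loop above
w1, w2, b = 2.13, 3.0, 5.0
x1, x2, actual = 3.0, 1.0, 1000.0
lr = 0.01

for i in range(10):
    # forward pass
    ypred = max(0.0, w1*x1 + w2*x2 + b)

    # backward pass (the ReLU stays active here, so its local gradient is 1)
    dl_dypred = 2*(ypred - actual)
    w1 -= lr * dl_dypred * x1
    w2 -= lr * dl_dypred * x2
```

After ten iterations ypred has climbed from 14.39 to the high 800s, approaching the target geometrically; the exact values may differ slightly from the printed run due to rounding.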

😁 Did you like the story? 💬 Let me know in the comments and give it a 👏!! Share it with friends 👯!! These things take a lot of time and effort, so feedback is very much appreciated! ❤️
