Recurrent Neural Networks (RNNs)
Lately I have been involved in many conversations at my current company about Transformers, and what I feel is that people love to understand what Transformers are, but they do not want to understand the tradeoffs. Maybe the problem we are solving could have been solved by RNNs or LSTMs, but they go with the fancy stuff anyway, one of the reasons being that they never learned why Transformers came into existence in the first place.
History
Recurrent Neural Networks (RNNs) are a kind of Neural Network whose architecture works well on Sequential Data (any data that can be converted to a sequence of numbers). Prior to RNNs, other algorithms were used for such data, but some of them required very heavy preprocessing and some had an unstable training phase. Other Neural Networks were also used before RNNs, but they treated the inputs as independent of one another, which is not a good assumption for Sequential Data. Later, to address the shortcomings of RNNs, researchers came up with LSTMs and GRUs, which follow pretty much the same architecture but cover the negatives of a plain RNN.
What are RNNs though?
As defined above, RNNs are Neural Networks designed to work well with Sequential Data. Let's dive a little deeper and understand how these RNNs work. For this explanation, we can consider a small example from the Stock Market. In general, to predict a company's stock price for the next day, we need to understand the pattern of that stock from day one, and based on that pattern, the prediction for tomorrow is made.
Note: The real-life Stock Market is way more complicated than what we will be discussing in this example, so we make a small assumption: if the Stock Market has been low for two days in a row, more often than not it's going to be low the next day as well. If it were low, then medium the next day, then it's going to be high the day after, and you can guess the rest of the pattern.
Now, let's see how an RNN predicts the stock price for the next day. For this, we will take a short detour to see what its Architecture looks like. (Take a look at the picture by clicking on the link.)
Now, the picture above is a simplified version of what RNNs actually look like, but it's good for our initial understanding. The structure isn't very different from a very simple Neural Network, but if you look carefully, there is a loop present in this image, and yes, that's what makes it recurrent. The values are fed back after each iteration, which is what makes it a 'Recurrent Neural Network'. An even better (clearer) representation of RNNs can be seen in the picture below.
Let's say we have the data for yesterday and today, and we need to predict the value for tomorrow. The first network here takes yesterday's value as input and gives out today's value. Now, using the recurrent concept, we can feed the output of the first network, after it goes through the activation function, into the second network, which helps us predict the value for tomorrow using today's value as input. We can unroll this Recurrent Neural Network as many times as we would like to (provided we also have the data). Let's say we had three input values, the day before yesterday, yesterday and today; then we'd have to unroll it thrice to predict tomorrow's value. I've mostly kept this article on the theoretical side and haven't dived into the mathematical section, since I want it to be easy to grasp for newcomers.
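To make the unrolling a bit more concrete, here is a minimal sketch in NumPy of a single recurrent cell unrolled over three toy input values. The numbers in `prices` and the weights `W_x` and `W_h` are made-up placeholders for illustration, not a real stock model.

```python
import numpy as np

np.random.seed(0)

prices = [0.2, 0.5, 0.9]   # day before yesterday, yesterday, today (toy values)

W_x = np.random.randn(1)   # weight applied to the current day's input
W_h = np.random.randn(1)   # weight applied to the value fed back from the previous step
b   = np.zeros(1)          # bias
h   = np.zeros(1)          # the "loop": the value carried between steps

# Unroll the same cell once per day: the output of one step is fed back
# into the next step together with the next day's input.
for x in prices:
    h = np.tanh(W_x * x + W_h * h + b)

print("prediction for tomorrow (toy scale):", h)
```

The key point is that the same weights are reused at every step, and only the fed-back value `h` changes as the sequence is processed.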
So, if RNNs do the job, then why not use them every time we need to deal with Sequential Data? There are two primary reasons why we don't: the first is known as Vanishing Gradients and the second as Exploding Gradients. Let's understand what these two are.
To understand these concepts, a little bit of background in Backpropagation and Gradient Descent helps, but I'll try to break them down in a simpler way so that even people with no background in these topics will be able to follow.
Vanishing Gradients
RNNs are trained using Backpropagation Through Time, where the error signal travels from the final output back to the very first step. If this signal becomes weaker and weaker as it traverses backwards, we call the problem Vanishing Gradients. For example, imagine sending battery power through 50 extension cables where each cable loses a bit of energy. By the time the power reaches the first cable, there's almost no electricity left.
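To put rough numbers on that analogy, here is a tiny sketch where the backward-travelling signal is multiplied by an arbitrary per-step factor of 0.9 over 50 steps (the 0.9 is just a stand-in for whatever shrinkage happens at each step):

```python
# Each backward step multiplies the signal by a factor smaller than 1,
# so almost nothing is left by the time it reaches the first step.
grad = 1.0
for step in range(50):
    grad *= 0.9

print(grad)   # roughly 0.005 -- the early steps barely learn anything
```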
Exploding Gradients
If you understood Vanishing Gradients, understanding Exploding Gradients is just learning the opposite case. Here, instead of shrinking, the error grows stronger as it travels backward.
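The same sketch with an arbitrary per-step factor of 1.1 shows the opposite behaviour:

```python
# Each backward step now multiplies the signal by a factor larger than 1,
# so it blows up instead of fading away.
grad = 1.0
for step in range(50):
    grad *= 1.1

print(grad)   # roughly 117 -- huge updates that destabilise training
```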
In either of these cases, the Neural Network is not able to learn the patterns properly, which leads to subpar predictions and an inefficient product. It was due to these issues that RNNs' successors, Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs), came in.
In the next article, we will go through the mathematical half of RNNs to develop a better intuition for them. It would help if you understand the basic math behind a basic Neural Network's operations, which will make the math of RNNs easier to follow.