charRNN
This post is mainly based on Karpathy's code, min-char-rnn.py, from his GitHub gist. It is best to open the code in another browser window and read it alongside this post.
Data Preparation
This is character-level training. The program reads data from the file 'input.txt' and then collects all the unique characters. In a long text, the number of unique characters is usually more than 26, covering a-z plus other special characters, including capital letters. Karpathy uses a one-hot vector representation for characters, so the dimension of the character space is the number of unique characters. He builds two mapping tables so that characters and their one-hot indices can be converted into each other.
In the later training, the code reads a fixed number of characters at a time, encodes them into one-hot indices through the dictionary char_to_ix, and decodes indices back into characters through ix_to_char.
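As a concrete illustration, here is a minimal sketch of this data-preparation step, following the structure of min-char-rnn.py; the later sketches in this post reuse these names (data, chars, vocab_size, char_to_ix, ix_to_char):

```python
import numpy as np   # used by the later sketches as well

data = open('input.txt', 'r').read()    # the whole training text as one string
chars = list(set(data))                 # all unique characters in the text
data_size, vocab_size = len(data), len(chars)

# Two mapping tables: character -> one-hot index, and index -> character.
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}
```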
Main Loop
Before the main loop, Karpathy sets n = 0 as the loop counter and p = 0 as the character pointer into the data. Unlike n, which increases by 1 at each iteration, p increases by seq_length each time, so the character pointer moves seq_length characters per loop. You can treat seq_length as a character window: the program encodes seq_length characters as inputs at a time. targets is the neighbour of inputs, shifted by one position (targets[i] is the character that follows inputs[i]), because Karpathy trains the model parameters to predict the next character from the current and previous characters.
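For illustration, here is a minimal sketch of how one window of training data is encoded each iteration, reusing char_to_ix from the data-preparation sketch (p is the character pointer and seq_length the window size):

```python
# Encode seq_length consecutive characters as inputs; targets is the same
# window shifted by one character, i.e. the "next character" at each position.
inputs  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
```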
With well-trained model parameters, Karpathy can generate text at the character level. In the code, this is demonstrated through the function sample(h, seed_ix, n), which will be discussed later.
Forward and backward propagation are done in lossFunc(inputs, targets, hprev). We will also discuss this function later. For now, assume that we have already displayed the sampled output and obtained the gradients and the cross-entropy loss from these two functions.
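Putting these pieces together, one iteration of the main loop looks roughly like the sketch below. It assumes inputs and targets were built as in the window sketch above, that lossFunc returns the loss, the five parameter gradients and the last hidden state, and that text is sampled every 100 iterations as in the gist:

```python
if n % 100 == 0:
    # Occasionally show what the current model generates,
    # seeded with the first character of the current window.
    sample_ix = sample(hprev, inputs[0], 200)
    print(''.join(ix_to_char[ix] for ix in sample_ix))

# Forward and backward pass over the current window.
loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFunc(inputs, targets, hprev)

# ... smooth_loss and Adagrad parameter update go here (see below) ...

p += seq_length   # move the character pointer by one window
n += 1            # count the iteration
```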
Update in Main Loop
Use smooth_loss instead of the raw loss. It is initialized as smooth_loss = -np.log(1.0/vocab_size) * seq_length, the expected loss at iteration 0, and then updated each iteration as smooth_loss = smooth_loss * 0.999 + loss * 0.001. The benefit of using smooth_loss is that the reported loss decreases smoothly instead of jumping from window to window.
Update the model parameters with Adagrad. You can read this blog to learn about Adagrad, or choose other learning materials. We quote directly from that blog:
(adagrad)It adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.
Accordingly, Karpathy uses memory variables for Adagrad and updates the parameters in this way:
param += -learning_rate * dparam / np.sqrt(mem + 1e-8)
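Below is a sketch of the full Adagrad update loop this line comes from. It assumes the gradients (dWxh, ..., dby) were returned by lossFunc and that one memory variable per parameter (mWxh, ..., mby) was allocated as zeros of the same shape:

```python
for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                              [dWxh, dWhh, dWhy, dbh, dby],
                              [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam                                   # accumulate squared gradients
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)   # per-parameter adaptive step
```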
sample(h, seed_ix, n)
h carries the previous hidden state, seed_ix is the index of the seed character (the first one-hot character vector), and n is the number of characters to generate.
The way Karpathy builds a one-hot character vector is:
x = np.zeros((vocab_size, 1))
x[seed_ix] = 1
In this way, an integer index gives us a vector for later use. The list ixes = [] records the indices of the generated characters.
One step of forward propagation then gives us the softmax probabilities.
This softmax distribution is the predicted distribution over the next character: the higher p[ix], the more likely the next character vector is one-hot at ix. Instead of always taking the most likely index, the code samples one from the distribution with ix = np.random.choice(range(vocab_size), p=p.ravel()). Repeating this n times generates n indices, starting from the seed character inputs[0].
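Here is a minimal sketch of the whole sampling loop, assuming the same parameter names (Wxh, Whh, Why, bh, by) as in the rest of the post; x starts as the one-hot vector built from seed_ix, and h is the hidden state passed in:

```python
for t in range(n):
    # Forward step: new hidden state, scores, and softmax probabilities.
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))

    # Sample the next character index from the distribution (not argmax).
    ix = np.random.choice(range(vocab_size), p=p.ravel())

    # Feed the sampled character back in as the next one-hot input.
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)

# Decode the generated indices back into text.
txt = ''.join(ix_to_char[ix] for ix in ixes)
```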
lossFunc(inputs, targets, hprev)
For each character in inputs, we run forward propagation and store the intermediate results. As shown in the flow chart, we keep xs, hs, ys, ps = {}, {}, {}, {}. Since we store these values for every input position, hs[-1] can be set to hprev. Remember to accumulate the cross-entropy loss along the way.
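A sketch of this forward pass, using the notation of the post (it assumes inputs and targets are lists of character indices, as built in the main loop):

```python
xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)            # hidden state carried over from the previous window
loss = 0
for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))
    xs[t][inputs[t]] = 1                                               # one-hot input character
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh)  # hidden state
    ys[t] = np.dot(Why, hs[t]) + by                                    # unnormalized scores
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))                      # softmax probabilities
    loss += -np.log(ps[t][targets[t], 0])                              # accumulate cross-entropy
```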
For back propagation, we do some math here. Note that H is short for hidden_size, V is short for vocab_size, and L is short for seq_length.
forward propagation:
$$ h_t \in \mathbb{R}^{H \times 1} = \tanh ( W_{xh} \in \mathbb{R}^{H \times V} \times x_t \in \mathbb{R}^{V \times 1} + W_{hh} \in \mathbb{R}^{H \times H} \times h_{t-1} \in \mathbb{R}^{H \times 1} + b_h \in \mathbb{R} ^{H \times 1}) $$
$$ y_t \in \mathbb{R} ^ {V \times 1}= W_{hy} \in \mathbb{R}^{V \times H} \times h_t \in \mathbb{R}^{H \times 1} + b_y \in \mathbb{R}^{V \times 1} $$
$$ p_t \in \mathbb{R} ^ {V \times 1} = softmax(y_t) = \frac{ e^{y_t} \in \mathbb{R}^{V \times 1}}{\sum_{i = 1}^{V} e^{y_{t,i}}} $$
$$ CE \in \mathbb{R} = - \sum_{t=1}^{L} \log \left( targets[t]^T \in \mathbb{R}^{1 \times V} \times p_t \in \mathbb{R} ^ {V \times 1} \right) $$
In one-hot representation, cross entropy can be rewritten as:
$$ CE = - \sum_{t=1}^{L} \log p_t [targets[t], 0] $$
back propagation:
$$ CE = - \sum_{t=1}^{L} \log \frac{ e^{y_{t, targets[t]}} }{\sum_{i = 1}^{V} e^{y_{t,i}}} = \sum_{t=1}^{L} \left( \log \sum_{i = 1}^{V} e^{y_{t,i}} - y_{t, targets[t]} \right) $$
$$ \therefore \frac{ \partial CE}{ \partial y_{t,i} } = \begin{cases} p_{t,i} &\mbox{if $i \neq$ targets[t]} \\ p_{t,i} - 1 &\mbox{if $i =$ targets[t]} \end{cases} $$
From here we can see that using softmax as the output probability makes the derivative easy to calculate.
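In code this derivative becomes a two-line operation; the sketch below assumes ps[t] and targets are the values stored in the forward pass above:

```python
# Gradient of the cross-entropy with respect to the scores y_t:
# it equals the softmax output, minus 1 at the target index.
dy = np.copy(ps[t])
dy[targets[t]] -= 1
```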
$$ \frac{\partial y_t \in \mathbb{R}^{V \times 1}}{\partial W_{hy} \in \mathbb{R}^{V \times H}} = h_t^T \in \mathbb{R}^{1 \times H}$$
$$ \frac{\partial y_t \in \mathbb{R}^{V \times 1}}{\partial b_y \in \mathbb{R}^{V \times 1}} = 1 \in \mathbb{R}$$
Because y_t depends on h_t in the same linear way at every time step:
$$ \frac{ \partial y_t \in \mathbb{R}^{V \times 1}}{ \partial h_t \in \mathbb{R}^{H \times 1}} = W_{hy} \in \mathbb{R}^{V \times H} $$
So when back-propagating, the gradient flowing from y_t into h_t is obtained by multiplying with the transpose:
$$ \frac{ \partial CE}{ \partial h_t } \in \mathbb{R}^{H \times 1} = W_{hy}^T \in \mathbb{R}^{H \times V} \times \frac{ \partial CE}{ \partial y_t } \in \mathbb{R}^{V \times 1} + (\text{gradient arriving from } h_{t+1}) $$
For the tanh nonlinearity:
$$ h_t = \tanh( \bullet ), \quad \frac{d}{d( \bullet )} \tanh ( \bullet ) = 1 - \tanh( \bullet )^2 $$
$$ \frac{\partial h_t}{\partial W_{xh}} = (1 - h_t^2) \times x_t^T $$
$$ \frac{\partial h_t}{\partial W_{hh}} = (1 - h_t^2) \times h_{t-1}^T $$
$$ \frac{\partial h_t}{\partial b_h} = (1 - h_t^2) \times 1 $$
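Putting these pieces together, the backward pass for a single time step t looks roughly like the sketch below (it assumes dy was computed as above, the gradient accumulators dWxh, dWhh, dWhy, dbh, dby start at zero, and dhnext carries the gradient coming back from time step t+1):

```python
dWhy += np.dot(dy, hs[t].T)               # gradient for Why
dby  += dy                                # gradient for by
dh = np.dot(Why.T, dy) + dhnext           # backprop into h_t, plus gradient from t+1
dhraw = (1 - hs[t] * hs[t]) * dh          # backprop through the tanh nonlinearity
dbh  += dhraw                             # gradient for bh
dWxh += np.dot(dhraw, xs[t].T)            # gradient for Wxh
dWhh += np.dot(dhraw, hs[t - 1].T)        # gradient for Whh
dhnext = np.dot(Whh.T, dhraw)             # to be passed to time step t-1
```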
In this way, we can compute all the gradients and perform back propagation. In addition, Karpathy uses np.clip() to limit exploding gradients.
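A minimal sketch of that clipping step, applied to every gradient after the backward pass (the gist clips to the range [-5, 5]):

```python
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)   # clip each gradient in place to [-5, 5]
```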