# SRU++ Model Speeds Up Transformer with Simple Recurrent Unit

Below are my notes on SRU and SRU++, with thanks to the paper authors and the discussions at Yannic's Discord meetup.

## Abstract

### Attention and Recurrence

• attention vs recurrence = graph vs sequence
• attention connects the entire sequence as a fully connected graph
• recurrence keeps information from previous states in a state vector
• a truly recurrent LSTM is less parallelizable than a Transformer
• future steps in an LSTM depend on the past, so they cannot be parallelized

### How SRU Helps Parallelization

• while the state computation of SRU is time-dependent, each state dimension is independent
• time step \( t \), input vector \( x_t \), (internal) state \( c_t \)
• (internal) forget gate \( f_t := \sigma(W_f x_t + V_f c_{t-1} + b_f) \)
• problem: each \( c_t, f_t \) depends on all dimensions of \( c_{t-1} \)
• as a result of the matrix multiplication \( V_f c_{t-1} \)
• solution: point-wise (Hadamard) multiplication \( v_f \odot c_{t-1} \)
• allows computing \( c_t, f_t \) in parallel across dimensions
• state \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot W x_t \)
• all \( W, v, b \) are learned
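As a sketch of the point above (plain Python with toy shapes of my own choosing, not the authors' implementation), the per-dimension loop below shows that with \( v_f \odot c_{t-1} \) each state dimension reads only its own previous value, so all dimensions can be computed in parallel:

```python
# Minimal sketch of the SRU state update with a point-wise forget gate.
# Each dimension i depends only on dimension i of the previous state.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_state_step(Wfx, Wx, v_f, b_f, c_prev):
    """One time step; Wfx = W_f x_t and Wx = W x_t are precomputed."""
    c_t = []
    for i in range(len(c_prev)):  # independent per dimension -> parallelizable
        f_i = sigmoid(Wfx[i] + v_f[i] * c_prev[i] + b_f[i])
        c_t.append(f_i * c_prev[i] + (1.0 - f_i) * Wx[i])
    return c_t

c = sru_state_step(Wfx=[0.5, -1.0], Wx=[1.0, 2.0],
                   v_f=[0.1, 0.2], b_f=[0.0, 0.0], c_prev=[0.0, 0.0])
```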

### Highway Network Component

• a highway network is more dynamic than a plain skip connection
• the reset gate weights the output skip connection
• defined as \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
• combines the state with the input
• then used for the output \( h_t \), which enables gradient flow
• output (hidden) vector: \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)

### All Equations

• \( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f ) \)
• \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
• \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \)
• \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)
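Putting the four equations together, here is a minimal runnable sketch of one SRU cell step in plain Python (the toy dimensions and identity weights are my own assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, x):
    # dense product, used only for the input projections W x_t
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def sru_cell(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
    Wx, Wfx, Wrx = matvec(W, x_t), matvec(W_f, x_t), matvec(W_r, x_t)
    # forget and reset gates: only element-wise use of the previous state
    f = [sigmoid(a + v * c + b) for a, v, c, b in zip(Wfx, v_f, c_prev, b_f)]
    r = [sigmoid(a + v * c + b) for a, v, c, b in zip(Wrx, v_r, c_prev, b_r)]
    c_t = [fi * ci + (1 - fi) * wx for fi, ci, wx in zip(f, c_prev, Wx)]
    # highway output: the reset gate mixes the new state with the raw input
    h_t = [ri * ci + (1 - ri) * xi for ri, ci, xi in zip(r, c_t, x_t)]
    return h_t, c_t

I = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
h_t, c_t = sru_cell([1.0, 0.0], zeros, I, I, I, zeros, zeros, zeros, zeros)
```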

• \( \mathrm{Way}(a, b, g, W) := g \odot a + (1 - g) \odot (W b) \)
• \( \mathrm{Gate}(a, b, W, v, w) := \sigma(W b + v \odot a + w) \)

### Similarity to LSTM

• the equations are reminiscent of the LSTM
• however, the output gate and input gate are replaced with a reset gate
• SRU equations:
• \( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f ) \)
• \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
• \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \)
• \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)
• LSTM equations:
• \( f_t = \sigma_g (W_f x_t + U_f c_{t-1} + b_f ) \)
• \( i_t = \sigma_g (W_i x_t + U_i c_{t-1} + b_i ) \)
• \( o_t = \sigma_g (W_o x_t + U_o c_{t-1} + b_o ) \)
• \( c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c (W_c x_t + b_c) \)
• \( h_t = o_t \odot \sigma_h(c_t) \)
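To make the contrast concrete, here is a plain-Python sketch (toy sizes, my own illustration) of the peephole-style LSTM step listed above. The key difference: every LSTM gate needs a full matrix product \( U c_{t-1} \), so each gate dimension depends on all previous state dimensions, while SRU only needs the element-wise \( v \odot c_{t-1} \):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def lstm_step(x_t, c_prev, W_f, U_f, b_f, W_i, U_i, b_i,
              W_o, U_o, b_o, W_c, b_c):
    # each gate: full matrix product with c_prev -> sequential bottleneck
    f = [sigmoid(a + u + b) for a, u, b in
         zip(matvec(W_f, x_t), matvec(U_f, c_prev), b_f)]
    i = [sigmoid(a + u + b) for a, u, b in
         zip(matvec(W_i, x_t), matvec(U_i, c_prev), b_i)]
    o = [sigmoid(a + u + b) for a, u, b in
         zip(matvec(W_o, x_t), matvec(U_o, c_prev), b_o)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(W_c, x_t), b_c)]
    c_t = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c_prev, i, cand)]
    h_t = [oi * math.tanh(ci) for oi, ci in zip(o, c_t)]
    return h_t, c_t

Z = [[0.0]]  # 1x1 zero weight matrices, so every gate evaluates to 0.5
h_t, c_t = lstm_step([1.0], [0.0], Z, Z, [0.0], Z, Z, [0.0],
                     Z, Z, [0.0], [[1.0]], [0.0])
```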

### GPU vs CPU

### CUDA Kernels

• CUDA kernels are C++ functions executed N times in parallel by N CUDA threads, as in this vector-add example from the CUDA Programming Guide:

```cpp
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
```

### Parallel Implementation

• a single matrix multiplication \( U = (W, W_f, W_r) x_t \) computes all three projections
• the point-wise operations run in a single fused CUDA kernel
• and parallelize across each hidden state dimension
• the computation is still sequential in the time dimension
• complexity \( O(L \cdot B \cdot d) \) for sequence length \( L \), batch size \( B \), hidden dimension \( d \)
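A plain-Python sketch of this strategy (toy sizes of my choosing; only the state path is shown, the reset/output path is omitted): one batched matrix product over the whole sequence up front, then only element-wise work, sequential in time, standing in for the fused CUDA kernel:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_states(xs, W, W_f, v_f, b_f):
    d = len(xs[0])
    # step 1: all matrix products for every time step at once (parallel over L)
    U = [([sum(W[i][j] * x[j] for j in range(d)) for i in range(d)],
          [sum(W_f[i][j] * x[j] for j in range(d)) for i in range(d)])
         for x in xs]
    # step 2: element-wise recurrence, the only part sequential in time
    c = [0.0] * d
    cs = []
    for Wx, Wfx in U:
        f = [sigmoid(Wfx[i] + v_f[i] * c[i] + b_f[i]) for i in range(d)]
        c = [f[i] * c[i] + (1 - f[i]) * Wx[i] for i in range(d)]
        cs.append(c)
    return cs

I1 = [[1.0]]
cs = sru_states([[1.0], [1.0]], I1, I1, v_f=[0.0], b_f=[0.0])
```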

### SRU Results

• on its own, SRU only slightly outperforms QRNN (Quasi-RNN)
• SRU "replaces convolutions" of QRNN and KNN with more recurrent connections
• SRU and QRNN have similar speed
• 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets

## SRU++: Attention with SRU

### SRU++ Layer

• SRU++ is SRU with self-attention in place of the projection \( (W, W_f, W_r) x \)
• attention:
• no positional encodings
• operates on dimension 512 instead of 2048 (the "projection trick")
• residual connections on both the attention and the SRU
• layer normalization after the attention block
• attention helps a great deal
• but it is needed only in every k-th layer, e.g. every fifth

### Datasets

#### ENWIK8 (Hutter, 2006)

• a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia
• the vocabulary size of this dataset is about 200
• BPC = bits per character

#### WIKI-103 (Merity et al., 2017)

• a word-level language modeling dataset
• 100M tokens extracted from Wikipedia
• vocabulary of 260K tokens

### Results

• PPL = perplexity
• attention helps the most in the final layers
• perhaps the first layers learn local features
• which attention then makes use of
• outperforms the Transformer-XL baseline by about 3% lower BPC
• with a larger context, BPC drops even lower

#### Fair Comparison to Transformer-XL

#### How Often to Include Attention?

• 1 attention-SRU layer every 10 layers

#### Max Performance on Enwik8

• maximum performance comparison
• base model d = 3072, larger model d = 4096
• context length: 1024 for training, 3072 for evaluation
• SoTA on enwik8, but not on Wiki-103

#### Max Performance on Wiki-103

• on par with the Compressive Transformer's memory; worse than kNN-LM and Routing Transformer

#### Speed Comparison

## Terraformer

• makes use of SRU, but not covered here

WRITTEN BY

## Vanic

“Simplicity, patience, compassion.
These three are your greatest treasures.
Simple in actions and thoughts, you return to the source of being.
Patient with both friends and enemies,
you accord with the way things are.
Compassionate toward yourself,
you reconcile all beings in the world.”
― Lao Tzu, Tao Te Ching