Below are my notes on SRU, with thanks to the paper authors and Yannic's Discord meetup discussions.

## Abstract

- Language modelling:
- input: text; output: masked token or next token

- Contribution: increased speed makes the model more accessible to low-resource use-cases
- Authors:
- Tao Lei: ASAPP Inc.
- Google Brain, Princeton, Cornell

- SRU
- Simple Recurrent Units for Highly Parallelizable Recurrence, OpenReview
- is an RNN, 10x faster than LSTM
- simple and parallelizable

- SRU++
- combines self-attention and SRU
- 3x – 10x faster training
- competitive with Transformer on enwik8

- Terraformer
- Sparse is Enough in Scaling Transformers
- is SRU + sparsity + many tricks
- 37x faster decoding speed than Transformer

### Attention and Recurrence

- attention vs recurrence = graph vs sequence
- attention connects across the entire sequence like a fully connected graph
- a dependency parse is a syntactic graph over the word sequence

- recurrence keeps info from previous states in a state vector
- a truly recurrent LSTM is less parallelizable than a Transformer
- future steps in an LSTM depend on the past, so it cannot be parallelized over time

### How Does SRU Help Parallelization?

- while the state computation of SRU is time-dependent, each state dimension is independent
- time step: \( t \), input vector: \( x_t \), (inner) state \( c_t \)
- (inner) forget gate \( f_t := \sigma(W_f x_t + V_f c_{t-1} + b_f) \)
- problem: each \( c_t, f_t \) depends on all dimensions of \( c_{t-1} \)
- as a result of the matrix multiplication \( V_f c_{t-1} \)
- solution: pointwise (Hadamard) multiplication \( v_f \odot c_{t-1} \)
- gives parallel computation of \( c_t, f_t \)

- state \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot W x_t \)
- all weights \( W, v, b \) are trained
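A minimal pure-Python sketch (one hidden dimension, illustrative names not from the paper's code) of why the elementwise form parallelizes: with \( v_f \odot c_{t-1} \) instead of \( V_f c_{t-1} \), dimension \( d \) of the state never reads any other dimension, so each dimension's whole time loop can run independently:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_state_one_dim(wf_x, w_x, v_f, b_f, c0=0.0):
    # wf_x, w_x: per-timestep scalars (W_f x_t)_d and (W x_t)_d for
    # one dimension d -- the matrix products are time-independent,
    # so they can be computed for all t in parallel up front.
    c, states = c0, []
    for a, u in zip(wf_x, w_x):
        f = sigmoid(a + v_f * c + b_f)  # f_t = sigma(W_f x_t + v_f * c_{t-1} + b_f)
        c = f * c + (1.0 - f) * u       # c_t = f_t * c_{t-1} + (1 - f_t) * (W x_t)
        states.append(c)
    return states
```

Running one such loop per dimension (e.g. one per CUDA thread) recovers the full state without any cross-dimension communication.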

### Highway Network Component

- a highway network is more dynamic than a skip connection
- offers regulated gradient flow

- the reset gate weights the output skip connection
- defined as \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
- combines the state with the input
- then used for the output \( h_t \), which enables gradient flow

- output (hidden) vector: \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)

### All Equations

- \( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f ) \)
- \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
- \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \)
- \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)

Can also decompose into primitives:

- \( \mathrm{Highway}(a, b, g, W) := g \odot a + (1 - g) \odot (W b) \)
- \( \mathrm{Gate}(a, b, W, v, w) := \sigma(W b + v \odot a + w) \)
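A scalar (one hidden dimension) pure-Python sketch of composing one SRU step from the two primitives above; the matrix products \( W b \) collapse to precomputed scalar arguments, and all names here are illustrative, not from any official implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gate(a, b, W, v, w) = sigma(W b + v * a + w); `Wb` stands in for
# the precomputed product W b (a scalar in this 1-d sketch).
def gate(a, Wb, v, w):
    return sigmoid(Wb + v * a + w)

# Highway(a, b, g, W) = g * a + (1 - g) * (W b)
def highway(a, Wb, g):
    return g * a + (1.0 - g) * Wb

def sru_step(c_prev, x, w_x, wf_x, wr_x, v_f, b_f, v_r, b_r):
    f = gate(c_prev, wf_x, v_f, b_f)   # forget gate f_t
    r = gate(c_prev, wr_x, v_r, b_r)   # reset gate r_t
    c = highway(c_prev, w_x, f)        # c_t: highway over (c_{t-1}, W x_t)
    h = highway(c, x, r)               # h_t: highway over (c_t, x_t), W = I
    return c, h
```

Note how all four SRU equations are instances of the same two primitives, with the output highway using the identity map on \( x_t \).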

### Similarity to LSTM

- equations are reminiscent of LSTM
- but the output gate and input gate are replaced with a reset gate
- and a highway network

- SRU equations:
- \( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f ) \)
- \( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) \)
- \( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \)
- \( h_t := r_t \odot c_t + (1 - r_t) \odot x_t \)

- LSTM equations:
- \( f_t = \sigma_g (W_f x_t + U_f c_{t-1} + b_f ) \)
- \( i_t = \sigma_g (W_i x_t + U_i c_{t-1} + b_i ) \)
- \( o_t = \sigma_g (W_o x_t + U_o c_{t-1} + b_o ) \)
- \( c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c (W_c x_t + b_c) \)
- \( h_t = o_t \odot \sigma_h(c_t) \)

### GPU vs CPU

- Tesla T4 has 2,560 CUDA cores
- NVIDIA V100 Tensor Core has 640 tensor cores (specialized AI cores)
- Comparison of GPU and CPU from NVIDIA documentation.

### CUDA kernels

- CUDA kernels are C++ functions executed N times by N CUDA threads

```
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
```

### Parallel Implementation

- single matrix multiplication \( U = (W, W_f, W_r) x_t \)
- point-wise operations are in a single fused CUDA kernel
- and parallelized across each hidden state dimension
- computation is still sequential in the time dimension
- complexity \( O(L \cdot B \cdot d) \)
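A pure-Python sketch of that structure (illustrative names; batch dimension omitted; assumes input and hidden dimensions are equal so the highway term on \( x_t \) type-checks): one big time-parallel matmul for \( U \), then a loop that is sequential in \( t \) but elementwise in \( d \) — the part the fused CUDA kernel parallelizes across hidden dimensions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(X, W):
    # (L x d_in) @ (d_in x 3d)
    return [[sum(x_i * W[i][j] for i, x_i in enumerate(row))
             for j in range(len(W[0]))] for row in X]

def sru_forward(X, W_all, v_f, b_f, v_r, b_r):
    # W_all stacks (W, W_f, W_r) into one d_in x 3d matrix, so all
    # time-independent projections become a single matmul, fully
    # parallel over timesteps.
    d = len(v_f)
    U = matmul(X, W_all)
    c = [0.0] * d
    H = []
    for t, x in enumerate(X):          # sequential in time
        u, uf, ur = U[t][:d], U[t][d:2 * d], U[t][2 * d:]
        h = [0.0] * d
        for j in range(d):             # independent across dimensions
            f = sigmoid(uf[j] + v_f[j] * c[j] + b_f[j])
            r = sigmoid(ur[j] + v_r[j] * c[j] + b_r[j])
            c[j] = f * c[j] + (1.0 - f) * u[j]
            h[j] = r * c[j] + (1.0 - r) * x[j]
        H.append(h)
    return H
```

The inner `for j` loop is exactly what the fused kernel distributes over threads, giving the \( O(L \cdot B \cdot d) \) total work with only \( L \) sequential steps.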

### SRU Results

- on its own, SRU slightly outperforms QRNN (Quasi-RNN)
- SRU "replaces convolutions" in QRNN and KNN with more recurrent connections

- SRU and QRNN have similar speed
- 5 – 9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets

## SRU++: Attention with SRU

- When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
- combines self-attention and SRU
- competitive on the enwik8, Wiki-103, and Billion Word datasets
- far fewer attention blocks needed
- 3x – 10x faster training than Transformer-XL
- 1.6 days on an 8-GPU machine

### SRU++ Layer

- SRU++ is SRU with self-attention instead of \( (W, W_f, W_r) x \)
- attention
- no positional encodings
- operates on dimension 512 instead of 2048 ("projection trick")
- residual connections on both attention and SRU
- layer normalization after the attention block

- attention helps a great deal
- but is needed only in every k-th layer, e.g. every fifth
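A toy pure-Python sketch of the substitution: the projections fed into the SRU recurrence come from self-attention over the whole sequence instead of a fixed per-timestep linear map. Everything here is deliberately simplified (identity Q/K/V projections, no learned weights, no dimension-512 projection trick); it only shows the data flow:

```python
import math

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    # Plain scaled dot-product attention with identity Q/K/V
    # projections and no positional encodings (in SRU++ the
    # recurrence itself carries order information).
    d = len(X[0])
    out = []
    for q in X:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in X])
        out.append([sum(a * x[j] for a, x in zip(scores, X))
                    for j in range(d)])
    return out

def srupp_projection(X):
    # SRU++: the input to the recurrence is attention over the whole
    # sequence rather than (W, W_f, W_r) x_t applied per timestep;
    # a residual connection keeps the direct input path.
    A = self_attention(X)
    return [[x_j + a_j for x_j, a_j in zip(x, a)] for x, a in zip(X, A)]
```

The output of `srupp_projection` would then be split into the \( u, u_f, u_r \) streams and run through the same elementwise SRU recurrence as before.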

### Datasets

#### ENWIK8 (Hutter, 2006)

- a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia
- the vocabulary size of this dataset is about 200
- BPC = bits per character

#### WIKI-103 (Merity et al., 2017)

- a word-level language modeling dataset
- 100M tokens extracted from Wikipedia
- vocabulary of 260K tokens

### Results

- PPL = perplexity
- attention helps the most in the final layers
- maybe the first layers learn local features
- which attention then makes use of

- outperforms the Transformer-XL baseline by ~3% lower BPC
- with increased context length, even lower BPC

#### Fair Comparison to Transformer-XL

#### How Often To Include Attention?

- 1 attention-SRU every 10 layers

#### Max Performance Enwik8

- maximum performance comparison
- base model d = 3072, larger model d = 4096
- context length: train = 1024, eval = 3072
- SoTA on enwik8, but not on Wiki-103

#### Max Performance Wiki-103

- on par with Compressive Memory, worse than kNN-LM and Routing Transformer

#### Speed Comparison

## Terraformer

- uses SRU, but not covered here