SRU++ Model Speeds Up Transformer with Simple Recurrent Unit


Below are my notes on SRU, with thanks to the paper authors and Yannic's Discord meetup discussions.


Attention and Recurrence

  • attention vs recurrence = graph vs sequence
  • attention connects everything across the entire sequence as a fully connected graph
  • recurrence keeps info from previous steps in a state vector
  • a truly recurrent LSTM is less parallelizable than a Transformer
    • future steps in an LSTM depend on the past, so it cannot be parallelized across time

Dependency parsing and sequence from Pseudo-Projective Dependency Parsing paper

Dependency parsing and sequence from Stanford's Speech and Language Processing by Daniel Jurafsky & James H. Martin

How does SRU help parallelization?

  • while the state computation of SRU is time-dependent, each state dimension is independent
  • time step ( t ), input vector ( x_t ), (internal) state ( c_t )
  • (internal) forget gate ( f_t := \sigma(W_f x_t + V_f c_{t-1} + b_f) )
    • problem: each ( c_t, f_t ) depends on all dimensions of ( c_{t-1} )
    • as a result of the matrix multiplication ( V_f c_{t-1} )
    • solution: point-wise (Hadamard) multiplication ( v_f \odot c_{t-1} )
    • gives parallel computation of ( c_t, f_t )
  • state ( c_t := f_t \odot c_{t-1} + (1 - f_t) \odot W x_t )
  • all ( W, v, b ) are learned parameters
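A minimal numpy sketch of the point above (illustrative shapes and random weights, not the paper's code): with the Hadamard term ( v_f \odot c_{t-1} ), dimension i of the forget gate depends only on dimension i of the previous state, so the gate can be computed independently per dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # hidden size (illustrative)
x_t = rng.standard_normal(d)           # input at step t
c_prev = rng.standard_normal(d)        # previous state c_{t-1}
W_f = rng.standard_normal((d, d))      # input projection (no time dependency)
v_f = rng.standard_normal(d)           # vector, not matrix: the SRU trick
b_f = np.zeros(d)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# f_t = sigma(W_f x_t + v_f ⊙ c_{t-1} + b_f)
f_t = sigmoid(W_f @ x_t + v_f * c_prev + b_f)

# independence check: perturbing c_{t-1}[0] leaves f_t[1:] unchanged,
# which would not hold with a full matrix V_f c_{t-1}
c_perturbed = c_prev.copy()
c_perturbed[0] += 1.0
f_t2 = sigmoid(W_f @ x_t + v_f * c_perturbed + b_f)
assert np.allclose(f_t[1:], f_t2[1:])
```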

Highway Network Component

  • a highway network is more dynamic than a skip connection
    • offers regulated gradient flow
  • the reset gate weights the output skip connection
    • defined as ( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) )
    • combines the state with the input
    • then used for the output ( h_t ), which enables gradient flow
  • output (hidden) vector: ( h_t := r_t \odot c_t + (1 - r_t) \odot x_t )

All Equations

  • ( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f) )
  • ( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) )
  • ( c_t := f_t \odot c_{t-1} + (1-f_t) \odot (W x_t) )
  • ( h_t := r_t \odot c_t + (1-r_t) \odot x_t )
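The four equations above can be sketched as a naive sequential forward pass (numpy, random small weights for illustration; square matrices so input and hidden size coincide, which the real SRU does not require):

```python
import numpy as np

def sru_forward(x, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """Naive SRU forward pass over a sequence x of shape (T, d)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    T, d = x.shape
    c = np.zeros(d)
    hs = []
    for t in range(T):
        f = sigmoid(W_f @ x[t] + v_f * c + b_f)   # forget gate
        r = sigmoid(W_r @ x[t] + v_r * c + b_r)   # reset gate
        c = f * c + (1 - f) * (W @ x[t])          # state update
        h = r * c + (1 - r) * x[t]                # highway output
        hs.append(h)
    return np.stack(hs), c

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.standard_normal((T, d))
W, W_f, W_r = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
v_f, v_r = rng.standard_normal(d), rng.standard_normal(d)
h, c_T = sru_forward(x, W, W_f, W_r, v_f, v_r, np.zeros(d), np.zeros(d))
```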

Can also decompose into primitives:

  • ( \mathrm{Way}(a, b, g, W) := g \odot a + (1 - g) \odot (W b) )
  • ( \mathrm{Gate}(a, b, W, v, w) := \sigma(W b + v \odot a + w) )
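A quick numpy check of this decomposition (primitive names taken from the notes, weights random): one SRU step written purely in terms of Way and Gate; note the highway output is just Way with the identity matrix.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def Way(a, b, g, W):
    """Way(a, b, g, W) := g ⊙ a + (1 - g) ⊙ (W b)"""
    return g * a + (1 - g) * (W @ b)

def Gate(a, b, W, v, w):
    """Gate(a, b, W, v, w) := sigma(W b + v ⊙ a + w)"""
    return sigmoid(W @ b + v * a + w)

# one SRU step composed from the two primitives
rng = np.random.default_rng(1)
d = 6
x_t, c_prev = rng.standard_normal(d), rng.standard_normal(d)
W, W_f, W_r = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
v_f, v_r, b_f, b_r = (rng.standard_normal(d) for _ in range(4))

f_t = Gate(c_prev, x_t, W_f, v_f, b_f)
r_t = Gate(c_prev, x_t, W_r, v_r, b_r)
c_t = Way(c_prev, x_t, f_t, W)
h_t = Way(c_t, x_t, r_t, np.eye(d))   # highway output = Way with W = I
```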

Simple Recurrent Unit diagram

Similarity to LSTM

  • the equations are reminiscent of the LSTM
  • however, the output gate and input gate are replaced with a reset gate
  • SRU equations:
    • ( f_t := \sigma( W_f x_t + v_f \odot c_{t-1} + b_f) )
    • ( r_t := \sigma( W_r x_t + v_r \odot c_{t-1} + b_r ) )
    • ( c_t := f_t \odot c_{t-1} + (1-f_t) \odot (W x_t) )
    • ( h_t := r_t \odot c_t + (1-r_t) \odot x_t )
  • LSTM equations:
    • ( f_t = \sigma_g (W_f x_t + U_f h_{t-1} + b_f ) )
    • ( i_t = \sigma_g (W_i x_t + U_i h_{t-1} + b_i ) )
    • ( o_t = \sigma_g (W_o x_t + U_o h_{t-1} + b_o ) )
    • ( c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c (W_c x_t + U_c h_{t-1} + b_c) )
    • ( h_t = o_t \odot \sigma_h(c_t) )


From Nvidia: GPU vs CPU in CUDA documentation

CUDA kernels

  • CUDA kernels are C++ functions executed N times by N CUDA threads
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

Parallel Implementation

  • a single matrix multiplication ( U = (W, W_f, W_r) x_t ) for all three projections
  • point-wise operations are in a single fused CUDA kernel
  • and parallelized across each hidden state dimension
  • computation is still sequential in the time dimension
  • complexity ( O(L \cdot B \cdot d) ) for sequence length L, batch size B, hidden size d
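A numpy sketch of this scheme (shapes illustrative, random weights): the three input projections for the entire sequence are batched into one big matmul up front, so the sequential loop over time contains only cheap element-wise operations.

```python
import numpy as np

def sru_forward_batched(x, W_all, v_f, v_r, b_f, b_r):
    """x: (T, B, d); W_all: (d, 3d) stacking (W, W_f, W_r)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    T, B, d = x.shape
    # one matrix multiplication for U = (W, W_f, W_r) x_t, all t and B at once
    U = (x.reshape(T * B, d) @ W_all).reshape(T, B, 3, d)
    c = np.zeros((B, d))
    hs = np.empty((T, B, d))
    for t in range(T):                      # sequential only in time
        Wx, Wfx, Wrx = U[t, :, 0], U[t, :, 1], U[t, :, 2]
        f = sigmoid(Wfx + v_f * c + b_f)    # element-wise, parallel over B and d
        r = sigmoid(Wrx + v_r * c + b_r)
        c = f * c + (1 - f) * Wx
        hs[t] = r * c + (1 - r) * x[t]
    return hs, c

rng = np.random.default_rng(0)
T, B, d = 7, 3, 8
x = rng.standard_normal((T, B, d))
W_all = 0.1 * rng.standard_normal((d, 3 * d))
hs, c = sru_forward_batched(x, W_all, rng.standard_normal(d),
                            rng.standard_normal(d), np.zeros(d), np.zeros(d))
```

In the real implementation the loop body is the single fused CUDA kernel; here it is just plain numpy.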

SRU Results

  • on its own, SRU only slightly outperforms QRNN (Quasi-RNN)
    • SRU "replaces convolutions" in QRNN and KNN with more recurrent connections
  • SRU and QRNN have similar speed
  • 5–9x speed-up over the cuDNN-optimized LSTM on classification and question answering datasets

SRU results on enwik8

SRU++: Attention with SRU

SRU++ Simple Recurrent Unit on Enwik8 bits per character

SRU++ Layer

  • SRU++ is SRU with self-attention instead of ( (W, W_f, W_r) x )
  • attention
    • no positional encodings
    • operates on dimension 512 instead of 2048 ("projection trick")
    • residual connection on both attention and SRU
    • layer normalization after the attention block
  • attention helps a great deal
    • but is needed only in every k-th layer, e.g. every fifth
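A single-head numpy sketch of the bullets above, under my own assumptions (function and weight names Wq/Wk/Wv/Wo are illustrative, not the paper's; head structure simplified): the input is down-projected to a small attention dimension, attended without positional encodings, then up-projected to the 3d-wide U consumed by the SRU gates.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def srupp_projection(x, Wq, Wk, Wv, Wo):
    """x: (T, d) -> U: (T, 3d), replacing U = (W, W_f, W_r) x."""
    q = x @ Wq                                    # down-project, e.g. 512 instead of 2048
    k, v = q @ Wk, q @ Wv                         # keys/values from the projected input
    att = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v  # no positional encodings
    return (q + att) @ Wo                         # residual, then up-project to 3d

rng = np.random.default_rng(2)
T, d, d_attn = 4, 16, 8
x = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, d_attn))
Wk = rng.standard_normal((d_attn, d_attn))
Wv = rng.standard_normal((d_attn, d_attn))
Wo = rng.standard_normal((d_attn, 3 * d))
U = srupp_projection(x, Wq, Wk, Wv, Wo)
```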

SRU++ diagram - Simple Recurrent Unit with attention


ENWIK8 (Hutter, 2006)

  • is a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia.
  • The vocabulary size of this dataset is about 200.
  • BPC stands for bits per character
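Since training losses are usually reported in nats, BPC is just the per-character cross-entropy converted to base 2 (a one-line helper of my own, not from the paper):

```python
import math

def nats_to_bpc(loss_nats: float) -> float:
    """Bits per character from a per-character cross-entropy in nats."""
    return loss_nats / math.log(2)

# a loss of ln 2 nats per character is exactly 1 bit per character
one_bit = nats_to_bpc(math.log(2))
```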

WIKI-103 (Merity et al., 2017)

  • is a word-level language modeling dataset.
  • 100M tokens extracted from Wikipedia
  • vocabulary of 260K tokens


  • PPL = perplexity
  • attention helps the most in the final layers
    • perhaps the first layers learn local features
    • which attention then makes use of
  • outperforms the Transformer-XL baseline by ~3% lower BPC
  • with a larger context, BPC drops even lower

Fair Comparison with Transformer-XL

SRU++ comparison to Trans-XL

How Often To Include Attention?

  • 1 attention-SRU every 10 layers

SRU++ attention every k layers

Max Performance on Enwik8

  • maximum performance comparison
  • model dimensions d=3072 (base) and 4096 (larger)
  • context length: train 1024, eval 3072
  • SoTA on enwik8, but not on Wiki-103

Comparison with top-performing models on the enwik8 dataset

Max Performance on Wiki-103

  • on par with the Compressive Transformer (compressive memory), worse than kNN-LM and Routing Transformer

SRU++ WIKI-103 results Routing Transformer

Speed Comparison

SRU++ inference speed


  • uses SRU, but not covered here

Read More



“Simplicity, patience, compassion.
These three are your greatest treasures.
Simple in actions and thoughts, you return to the source of being.
Patient with both friends and enemies,
you accord with the way things are.
Compassionate toward yourself,
you reconcile all beings in the world.”
― Lao Tzu, Tao Te Ching