Songlin Yang

Songlin (松琳) is a Member of Technical Staff at Thinking Machines Lab, working on language model architectures. She earned her PhD from MIT, where she was advised by Prof. Yoon Kim.


Flash Linear Attention: efficient attention implementations in Triton
FLA Discord: community for Flash Linear Attention
ASAP Seminar: Advances in Sequence Modeling from Algorithmic Perspectives

selected publications

  1. ICML
    Gated Linear Attention Transformers with Hardware-Efficient Training
    Songlin Yang*, Bailin Wang*, Yikang Shen, Rameswar Panda, and Yoon Kim
    2024
  2. ICLR
    Gated Delta Networks: Improving Mamba2 with Delta Rule
    Songlin Yang, Jan Kautz, and Ali Hatamizadeh
    2025
  3. NeurIPS
    Parallelizing Linear Transformers with the Delta Rule over Sequence Length
    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim
    2024
  4. NeurIPS
    PaTH Attention: Position Encoding via Accumulating Householder Transformations
    2025