Songlin Yang


Songlin is currently a first-year PhD student at MIT CSAIL, advised by Prof. Yoon Kim.

Previously, she obtained her bachelor's degree from SUSTech in 2020 and her master's degree from ShanghaiTech in 2023, where she was advised by Prof. Kewei Tu.

Her research centers on the intersection of machine learning systems and large language models, with a specific focus on hardware-aware algorithm design for efficient sequence modeling.

news

Apr 12, 2024 Introducing HGRN2 :sparkles: :sparkles:, a minimalist linear attention model with strong performance. Code is available here.
Jan 1, 2024 Introducing our open-source project flash-linear-attention :rocket: :rocket: :rocket:. Join the Discord if you are interested in linear attention/RNNs!
Dec 14, 2023 Presenting our linear RNN paper (HGRN) at NeurIPS :sparkles: :sparkles:. Code is available here.
Dec 12, 2023 Announcing Gated Linear Attention Transformers (GLA)! :sparkles: :smile: Code is available here.

selected publications

  1. arXiv
    HGRN2: Gated Linear RNNs with State Expansion
    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong
    arXiv preprint, 2024
  2. arXiv
    Gated Linear Attention Transformers with Hardware-Efficient Training
    Songlin Yang*, Bailin Wang*, Yikang Shen, Rameswar Panda, and Yoon Kim
    arXiv preprint, 2023
  3. NeurIPS spotlight
    Hierarchically Gated Recurrent Neural Network for Sequence Modeling
    Zhen Qin*, Songlin Yang*, and Yiran Zhong
    In Thirty-seventh Conference on Neural Information Processing Systems, 2023