Biao's Blog
ML Systems
Notes on LLM systems, training infrastructure, and the machinery underneath.
Latest blog
How JAX Allocates Memory
A deep dive into JAX GPU memory allocation, BFCAllocator preallocation, and how it differs from PyTorch's caching allocator.
Blogs
Technical notes and implementation writeups.
2025
Efficient RL Training - Optimizing Weight Sync in slime
A look at how slime synchronizes weights between training and rollout engines, and where the main performance wins come from.
Efficient RL Training - Optimizing Memory Usage in verl
A systems deep dive into memory pressure in RL training and the techniques that make larger policy rollouts feasible.
Implement Flash Attention Backend in SGLang - Basics and KV Cache
Notes on SGLang attention backend internals, from Flash Attention basics to KV cache layout and execution flow.
What is Flash Attention?
A visual explanation of Flash Attention and how IO-aware tiling reduces memory traffic for modern attention kernels.
2024
How to Calculate LLM Model Parameter Size - MoE Model
A worked guide to counting parameters in MoE language models, with Qwen-style expert layers as the running example.
How to Calculate LLM Model Parameter Size - Dense Model
A practical walkthrough for estimating dense LLM parameter counts from architecture details and model code.
2023
Model Distillation using TensorFlow, PyTorch and Google JAX
An introduction to model distillation, where a smaller student network learns from a larger teacher model.
Template for a blog post
A compact typography and Markdown sample for checking how the blog theme renders common writing elements.