Hi there! I am a machine learning engineer with interests across the AI spectrum.
I am also very interested in ML infra / systems topics, as they play an increasingly critical role in foundation-model-based applications.
I expect to graduate soon with a master's degree in CS @UWMadison. Previously, I worked as an AI / machine learning engineer at JD.com and at Meizhai (a startup using AI for interior design).
Mastery Tier certificate from UC Berkeley's LLM Agents course.
National Undergraduate Scholarship
Outstanding Undergraduate of Sichuan Province
2nd Prize in the school's ACM coding contest
flash-attn-economical-gpu implements Triton-based FlashAttention kernels that match or outperform popular implementations such as PyTorch's FlashAttention SDPA on economical GPU architectures, including Turing and Ampere. Highlights: the Triton MMA kernels outperform torch.matmul by up to 80% on Turing and 25% on Ampere; the optimized Triton SDPA leads the tested PyTorch SDPA backends on Turing for D=128 and D=256; the sliced Triton SDPA reaches near-PyTorch-FlashAttention performance on Ampere; and in LLM TTFT benchmarks on Qwen2.5 and Qwen3.5, the sliced Triton kernel stays close to PyTorch FlashAttention while remaining clearly faster than PyTorch's efficient SDPA.
Time To First Token on Qwen3.5 (D=256) and Qwen2.5 (D=128)
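For reference, a backend-vs-backend attention comparison like the one above ultimately comes down to timing the same call under different SDPA backends. Below is a minimal sketch using PyTorch's public `sdpa_kernel` API (PyTorch >= 2.3); the shapes and timing loop are illustrative assumptions, not the repo's actual benchmark harness, and a backend unsupported on a given GPU will raise an error rather than silently fall back:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Illustrative shapes only (batch 1, 32 heads, 4096 tokens, head dim 128);
# the repo's real benchmark configs may differ.
q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def time_backend(backend: SDPBackend, iters: int = 10) -> float:
    """Time scaled_dot_product_attention restricted to one backend (ms/iter)."""
    with sdpa_kernel([backend]):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)  # warmup
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH):
    print(backend.name, f"{time_backend(backend):.2f} ms")
```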
Task: Fine-tune two multi-modal LLMs with RL under different reward designs, such that one reward hacks and the other does not. Then train a VLM plus fixed-length soft tokens for reward-hacking detection based on multi-vector embeddings.
Implementation: RL framework development based on OpenR1 + trl; the other components were implemented from scratch.
Algorithms: GRPO (a minimal sketch follows this section), Selective Sample Replay (similar to Prioritized Replay), Matryoshka Embedding, MetaEmbed, LoRA.
Models: Qwen3-VL (8B/4B/2B), Qwen3 (8B/4B/1.7B), GPT-OSS-20B, LLaVA-OneVision-1.5 (8B/4B).
GPUs: 8*A100
Dataset: Started from a math dataset and kept only difficult problems (pass@K < 0.1 before RL). After RL, the reward-hacking detection dataset was synthesized from data generated by both models. The multi-modal reward-hacking dataset is openly accessible here, and a text-only version here.
Training Results: the RL training curve (w/o reward hacking) and the reward-hacking detection training curves under different vector lengths.
Evaluation Results: AUC > 0.74 with a Qwen3-VL-8B backbone; AUC scales well with backbone parameter count.
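As referenced in the algorithms list, here is a minimal sketch of GRPO's core step, group-relative advantage normalization: each prompt gets a group of sampled completions, and every completion is scored against its own group's statistics, which removes the need for a learned value network. The full trl implementation adds the clipped policy-ratio loss and KL regularization on top; this shows only the advantage computation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: (num_prompts, group_size) scalar rewards, one row per prompt,
    one column per sampled completion of that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to its own group's mean and spread.
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 4 completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```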
Task: Lower the probability of certain keywords appearing, using reinforcement learning.
Implementation: I implemented NanoGPTLMActorCriticPolicy from scratch, a policy network initialized from pretrained NanoGPT parameters. I first trained the reward model, then a value network similar to the reward model but with a different output head. PPO is used for RL training (an illustrative reward sketch follows the results below).
Results: The expected cumulative reward increases during training. No divergence (KL) penalty is used here, so early stopping is adopted. After training, only 9% of the answers contain keywords, a 60% drop compared to the model before RL, with no human-noticeable degradation in text quality.
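The project trains a learned reward model rather than hand-coding the reward; purely to illustrate the objective that model approximates, a rule-based version of the keyword penalty could look like the following (the keyword list is a placeholder, not the actual set used):

```python
import re

# Hypothetical keyword list; the project's actual keywords are not listed here.
KEYWORDS = ["foo", "bar"]

def keyword_penalty_reward(text: str, penalty: float = 1.0) -> float:
    """Rule-based stand-in for the learned reward model: penalize each
    keyword occurrence so that maximizing expected reward suppresses them."""
    hits = sum(
        len(re.findall(rf"\b{re.escape(kw)}\b", text, flags=re.IGNORECASE))
        for kw in KEYWORDS
    )
    return -penalty * hits

print(keyword_penalty_reward("foo walked into the bar"))  # -2.0
```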
Fine-tuning different LLM base models on the same dataset.
Dataset: wiki_bio.
GPU: 1*A100
Results: validation loss + human evaluation (a minimal LoRA setup sketch follows the table).
| Model | Training Hours | Validation Loss | Human Eval |
|---|---|---|---|
| distilgpt2 | 2 | 0.096747 | Bad accuracy and diversity |
| gpt2-large | 4 | 0.026526 | Good accuracy, bad diversity |
| LLaMA-7B + LoRA | 6 | 0.052786 | Good accuracy and diversity |
| Vicuna-7B + LoRA | 6 | 0.047751 | Good accuracy and diversity |
| LLaMA-13B + LoRA | 10 | 0.046142 | Best accuracy and diversity |
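For the + LoRA rows, the setup boils down to wrapping the base model with low-rank adapters via peft. A minimal sketch, with illustrative hyperparameters and an assumed checkpoint name rather than the exact configuration used:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed checkpoint for illustration; any LLaMA-7B weights on the Hub work here.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype="auto", device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (illustrative)
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections, classic LoRA targets
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically <1% of the 7B base weights
```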
recsys-retailrocket builds a recommendation pipeline on the Kaggle RetailRocket e-commerce dataset, learning item and user representations from product-side properties and interaction history to predict whether a user will view an item after a time cutoff. The content-based design combines item metadata, historical user events bucketed by event type and recency, and a factorization-machine-style prediction layer (sketched after the highlights), making it practical for cold-start item scoring whenever metadata is available.
Highlights: AUC > 0.93 on the full test set, AUC > 0.77 on the cold-start test set, and an optimized data pipeline that reduced training time per epoch from 30 hours to 20 minutes.
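The factorization-machine-style layer scores all pairwise interactions between feature-field embeddings in O(fields x dim) instead of O(fields^2 x dim), using the standard FM identity. A minimal sketch of that interaction term (my own illustration, not the repo's exact code):

```python
import torch

def fm_pairwise_interaction(embeddings: torch.Tensor) -> torch.Tensor:
    """Second-order factorization-machine interaction.

    embeddings: (batch, num_fields, dim) field embeddings (user features,
    item metadata, event-history buckets, etc.).
    Uses the identity sum_{i<j} <v_i, v_j> = 0.5 * ((sum_i v_i)^2 - sum_i v_i^2),
    computed per embedding dimension and then summed.
    """
    sum_sq = embeddings.sum(dim=1).pow(2)        # (batch, dim): square of sum
    sq_sum = embeddings.pow(2).sum(dim=1)        # (batch, dim): sum of squares
    return 0.5 * (sum_sq - sq_sum).sum(dim=1)    # (batch,): interaction score

# Example: batch of 2 users x 5 feature fields x 16-dim embeddings.
scores = fm_pairwise_interaction(torch.randn(2, 5, 16))
print(scores.shape)  # torch.Size([2])
```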
A real-world CTR model that achieved over a 10% increase in all major metrics.
An AI-powered interior design system for wardrobes.
(AI design starts at 0:50)
Reach out to me on LinkedIn.