Joseph (Ling) Zhong

AI | Machine Learning Engineer

About Me

Hi there! I am a machine learning engineer. My interests across the AI spectrum include:

  • Foundation models – LLM post-training (SFT/RL/knowledge distillation), multi-modality, etc.
  • Reasoning, reinforcement learning, and decision-making problems.
  • Representation learning with single- and multi-vector embeddings.
  • High-performance computing for machine learning, such as GPU kernel programming.
  • Deep continual learning techniques.
  • Large-scale recommender systems – both CTR-based and non-CTR-based.
  • Problems at the intersections of the topics above.

I am also very interested in ML infrastructure and systems topics, as they play an increasingly critical role in foundation-model-based applications.

[Photo: Joseph Zhong – "for all mankind"]

Experience

I expect to graduate soon with a master's degree in CS from UW–Madison. Previously, I worked as an AI/machine learning engineer at JD.com and at Meizhai, a startup using AI for interior design.

Recent Writings

Publications

Selected Certificates

Mastery Tier certificate from UC Berkeley's LLM Agents course.

Selected Honors

National Undergraduate Scholarship

Outstanding Undergraduate of Sichuan Province

2nd Prize in the school's ACM programming contest

Selected Projects

1. FlashAttention on Economical GPUs

flash-attn-economical-gpu implements Triton-based FlashAttention kernels that match or beat popular implementations such as PyTorch's FlashAttention SDPA backend on economical GPU architectures, including Turing and Ampere:

  • Our Triton MMA kernels outperform torch.matmul by up to 80% on Turing and 25% on Ampere.
  • The optimized Triton SDPA leads all tested PyTorch SDPA backends on Turing for D=128 and D=256.
  • The sliced Triton SDPA reaches near-FlashAttention performance on Ampere.
  • In LLM time-to-first-token (TTFT) benchmarks on Qwen2.5 and Qwen3.5, the sliced Triton kernel stays close to PyTorch FlashAttention while being clearly faster than PyTorch's efficient SDPA.

[Figures: kernel performance on Turing and on Ampere]

[Figure: time to first token on Qwen3.5 (D=256) and Qwen2.5 (D=128)]
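
As a rough guide to how the PyTorch SDPA baselines can be timed, here is a minimal benchmarking sketch, assuming PyTorch ≥ 2.3 and a CUDA GPU; the shapes, dtype, and iteration count are illustrative assumptions rather than the repo's actual harness, and backend availability varies by architecture.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

B, H, N, D = 1, 16, 4096, 128  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
           for _ in range(3))

def bench_ms(backend, iters=50):
    """Average milliseconds per SDPA call with a forced backend."""
    with sdpa_kernel(backend):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)  # warm-up
        torch.cuda.synchronize()
        start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
        start.record()
        for _ in range(iters):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION):
    print(backend.name, f"{bench_ms(backend):.3f} ms")
```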

2. LLM (VLM) RLVR and Reward Hacking Detection

Task: Fine-tune two multi-modal LLMs with RL under different reward designs, such that one model reward-hacks and the other does not. Then train a VLM plus a set of fixed-length soft tokens to detect reward hacking based on multi-vector embeddings.

Implementation: RL framework development based on OpenR1 + trl; the other components were implemented from scratch.

Algorithms: GRPO, Selective Sample Replay (similar to Prioritized Replay), Matryoshka Embedding, MetaEmbed, LoRA.
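
For reference on the first algorithm, here is a minimal sketch of GRPO's group-relative advantage computation. It is not the project's implementation (which builds on OpenR1 + trl), and the shapes and example rewards are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled
    completion. GRPO normalizes rewards within each prompt's group, so no
    learned value network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, 0/1 verifiable rewards.
r = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(r))
```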

Models: Qwen3-VL (8B/4B/2B), Qwen3 (8B/4B/1.7B), GPT-OSS-20B, LLaVA-OneVision-1.5 (8B/4B).

GPUs: 8×A100

Dataset: Used a math dataset as a base and kept only difficult problems (pass@K < 0.1 before RL). After RL, the reward hacking detection dataset was synthesized from data generated by both models. The multi-modal reward hacking dataset is openly accessible here, and a text-only version here.
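
The difficulty filter can be sketched with the standard unbiased pass@k estimator. The 0.1 cutoff comes from the description above; the sample counts and the choice k=8 here are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given c correct out of n sampled completions."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical per-problem sample counts (n, c); keep only hard problems.
samples = {"p1": (16, 0), "p2": (16, 8), "p3": (16, 1)}
hard = {p for p, (n, c) in samples.items() if pass_at_k(n, c, 8) < 0.1}
print(hard)  # {'p1'}: p3 has pass@8 = 0.5 and p2 ~ 1.0, so both are dropped
```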

Training Results: the RL training curve (without reward hacking) and the reward hacking detection training curves under different vector lengths.

[Figures: RL training curve and reward hacking detection training curves]

Evaluation Results: AUC > 0.74 with a Qwen3-VL-8B backbone; AUC scales well with backbone parameter size.

[Figures: detection AUC across backbone sizes]
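
On the multi-vector side, here is a minimal sketch of Matryoshka-style scoring, where one full-length embedding is evaluated at several nested prefix lengths; the dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

emb = torch.randn(4, 1024)               # full-length embeddings (assumed dim)
for d in (64, 128, 256, 1024):           # nested Matryoshka prefix lengths
    e = F.normalize(emb[:, :d], dim=-1)  # truncate, then re-normalize
    sim = e @ e.T                        # cosine similarities at this length
    print(d, sim.shape)
```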

3. LLM RLHF with NanoGPT+PPO

Task: Use reinforcement learning to lower the probability that certain keywords appear in the model's outputs.

Implementation: I implemented NanoGPTLMActorCriticPolicy from scratch: a policy network initialized from pretrained NanoGPT parameters. The reward model was trained first; the value network is similar to the reward model but uses a different language-head layer. PPO is used for RL training.
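
To make the reward signal concrete, here is a hedged sketch of the kind of keyword-penalty reward this setup targets; the keyword set and scaling are purely illustrative assumptions, not the actual trained reward model.

```python
KEYWORDS = {"foo", "bar"}  # hypothetical banned keywords

def keyword_reward(text: str) -> float:
    """+1 when no banned keyword appears, -1 per occurrence otherwise,
    so PPO pushes the policy away from these keywords."""
    hits = sum(text.lower().count(kw) for kw in KEYWORDS)
    return 1.0 if hits == 0 else -float(hits)

print(keyword_reward("a clean sentence"))      # 1.0
print(keyword_reward("foo appears with bar"))  # -2.0
```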

Results: The expected cumulative reward increases during training. No divergence (KL) penalty is used here, so early stopping is adopted instead. After training, only 9% of the answers contain keywords, a 60% relative drop compared to the pre-RL model. No human-noticeable degradation in text quality was found.

[Figures: reward model training and RL training results]

4. Supervised Finetuning of LLMs

Finetuning different LLM base models using the same dataset.

Dataset: wiki_bio.

GPU: 1×A100

Results: Validation Loss + Human evaluation

| Model | Training Hours | Validation Loss | Human Eval |
| --- | --- | --- | --- |
| distilgpt2 | 2 | 0.096747 | Bad accuracy and diversity |
| gpt2-large | 4 | 0.026526 | Good accuracy and bad diversity |
| LLaMA-7b + LoRA | 6 | 0.052786 | Good accuracy and diversity |
| Vicuna-7b + LoRA | 6 | 0.047751 | Good accuracy and diversity |
| LLaMA-13b + LoRA | 10 | 0.046142 | Best accuracy and diversity |
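
For reference, here is a minimal sketch of the LoRA setup behind the 7B/13B rows, assuming Hugging Face transformers + peft; the rank and target modules are illustrative assumptions rather than the exact configuration used.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The base model name is illustrative; the table's other bases work similarly.
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```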

5. CTR Models in RecSys

  1. recsys-retailrocket builds a recommendation pipeline on the Kaggle RetailRocket e-commerce dataset. It learns item and user representations from product-side properties and interaction history to predict whether a user will view an item after a time cutoff. The content-based design uses item metadata, historical user events by event type and recency bucket, and a factorization-machine style prediction layer (see the sketch after this list), making it practical for cold-start item scoring when metadata is available.

    Highlights: AUC > 0.93 on the full test set, AUC > 0.77 on the cold-start test set, and an optimized data pipeline that reduced training time per epoch from 30 hours to 20 minutes.

    [Figures: RetailRocket full-test and cold-start AUC results]
  2. A real-world CTR model achieving an over-10% increase across all major metrics.

    [Figure: A/B test results]
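
As referenced in item 1, here is a minimal sketch of the factorization-machine style second-order interaction term, using the standard FM identity; the field counts and dimensions are assumptions.

```python
import torch

def fm_second_order(emb: torch.Tensor) -> torch.Tensor:
    """emb: (batch, num_fields, dim) feature embeddings. Returns the FM
    pairwise-interaction score 0.5 * ((sum_i e_i)^2 - sum_i e_i^2),
    summed over the embedding dimension."""
    sum_sq = emb.sum(dim=1).pow(2)   # (batch, dim)
    sq_sum = emb.pow(2).sum(dim=1)   # (batch, dim)
    return 0.5 * (sum_sq - sq_sum).sum(dim=1)

emb = torch.randn(32, 10, 16)        # 32 examples, 10 feature fields, dim 16
print(fm_second_order(emb).shape)    # torch.Size([32])
```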

6. Wardrobe AI System

An AI-powered interior design system for wardrobes.

(AI design starts at 0:50)

[Video: Wardrobe AI design result]

Contact

Reach out to me on LinkedIn.