Sharing session on DeepSeek V3 – deep dive into its inner workings

Hello, this is Cheng. I did two sharing sessions on DeepSeek V3 – a deep dive into its inner workings covering Mixture of Experts, Multi-Head Latent Attention, and Multi-Token Prediction. It was my first time presenting, so the first few minutes are a bit rough, but if you stick with it, the content is solid. If you enjoy it, please give it a thumbs up and share it. Thanks.

Session 1 – Mixture of Experts and Multi-Head Latent Attention

- Introduction
- MoE – Intro (Mixture of Experts)
- MoE – DeepSeek MoE
- MoE – Auxiliary-loss-free load balancing
- MoE – High-level flow
- MLA – Intro
- MLA – Key, value, and query (memory reduction) formulas
- MLA – High-level flow
- MLA – KV cache storage requirement comparison
- MLA – Matrix associativity to improve performance
- Transformer – Simplified source code
- MoE – Simplified source code (a minimal routing sketch follows after this list)
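For readers who want a feel for the routing idea before watching, here is a minimal top-k MoE routing sketch in PyTorch. It is illustration only, not the session's simplified source code and not DeepSeek V3's actual router (which adds shared experts, sigmoid scoring, and the load-balancing bias); the names `SimpleMoE`, `n_experts`, and `top_k` are assumptions made for this example.

```python
# Minimal sketch of top-k MoE routing with softmax gating (illustration only,
# not DeepSeek V3's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)   # router producing expert scores
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Score every expert, keep only the top-k per token.
        scores = F.softmax(self.gate(x), dim=-1)             # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)       # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out


tokens = torch.randn(16, 64)
print(SimpleMoE(64)(tokens).shape)  # torch.Size([16, 64])
```

For the MLA KV-cache comparison, the rough arithmetic (assuming the published DeepSeek V3 configuration of 128 heads with head dimension 128, a 512-dimensional KV latent, and a 64-dimensional decoupled RoPE key) is 2 × 128 × 128 = 32,768 cached values per token per layer for standard multi-head attention versus 512 + 64 = 576 for MLA, roughly a 57× reduction.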

Session 2 – Multi-Head Latent Attention and Multi-Token Prediction

- Auxiliary-loss-free load balancing: step size implementation explained (my own version; see the bias-update sketch after this list)
- MLA: Naive source code implementation (modified from DeepSeek V3)
- MLA: Associative source code implementation (modified from DeepSeek V3)
- MLA: Matrix absorption concepts and implementation (my own version)
- MTP: High-level flow and concepts
- MTP: Source code implementation (my own version)
- Auxiliary loss derivation
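As a rough companion to the step-size discussion, here is a minimal sketch of the bias update behind auxiliary-loss-free load balancing, written from the high-level description in the DeepSeek V3 paper rather than from the session's code: each expert carries a routing bias used only for top-k selection, and after each step the bias of overloaded experts is decreased and that of underloaded experts increased by a fixed step size. The function name `update_routing_bias` and the default `gamma` value are assumptions for this example.

```python
# Minimal sketch of the bias update used for auxiliary-loss-free load balancing
# (illustration of the idea only; names and the gamma value are assumptions).
import torch


def update_routing_bias(bias: torch.Tensor, expert_load: torch.Tensor,
                        gamma: float = 0.001) -> torch.Tensor:
    """Nudge per-expert routing biases toward a balanced token load.

    bias:        (n_experts,) additive bias applied only when selecting top-k experts.
    expert_load: (n_experts,) number of tokens each expert received in the last step.
    gamma:       bias update step size.
    """
    mean_load = expert_load.float().mean()
    overloaded = expert_load > mean_load        # experts that received too many tokens
    underloaded = expert_load < mean_load       # experts that received too few tokens
    return bias - gamma * overloaded.float() + gamma * underloaded.float()


bias = torch.zeros(8)
load = torch.tensor([40, 10, 5, 30, 20, 20, 25, 10])
print(update_routing_bias(bias, load))
```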

submitted by /u/AdKey5091 to r/learnmachinelearning

