
Large Language Models (LLMs) Explained

Axis Lab

December 22, 2025 · 20 slides

This video introduces Large Language Models (LLMs), exploring their architecture, training, and key characteristics. Learn about the different types of LLMs, including decoder-only models, and the concept of Mixture of Experts (MoE).

Key highlights:

  • LLM definition and characteristics: size, data, compute.
  • Decoder-only architecture and its prevalence in modern LLMs.
  • Introduction to Mixture of Experts (MoE) and its benefits.
  • Examples of popular LLMs like GPT, Llama, and Gemma.

Overview

This lecture introduces Large Language Models (LLMs), defining them as language models scaled up in terms of model size (billions of parameters), training data (hundreds of billions to trillions of tokens), and computational resources. The lecture distinguishes LLMs from earlier models like BERT, emphasizing their decoder-only architecture and text-to-text capabilities. A key focus is on Mixture of Experts (MoE), a technique used to improve the efficiency of LLMs by activating only a subset of the model's parameters during inference. The lecture also covers decoding strategies, including maximum probability selection and beam search, highlighting their trade-offs in terms of diversity and global optimality. The target audience includes students and professionals interested in the architecture, training, and application of LLMs.
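The sparse-MoE idea mentioned above — route each token to only a few experts so that compute scales with the number of *selected* experts rather than the total — can be sketched in a few lines. This is an illustrative NumPy toy, not the lecture's implementation: the dimensions, the linear "experts", and the router are all invented for demonstration.

```python
import numpy as np

def sparse_moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Toy sparse Mixture-of-Experts layer (illustrative sketch).

    x:              (d_model,) token representation
    expert_weights: list of (d_model, d_model) matrices, one per "expert"
    gate_weights:   (num_experts, d_model) router matrix
    """
    logits = gate_weights @ x                      # router score for each expert
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                       # softmax over the selected experts only
    # Only the chosen experts run, so FLOPs scale with top_k, not num_experts.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((n_experts, d))
y = sparse_moe_layer(x, experts, router, top_k=2)
print(y.shape)  # (8,)
```

In a real MoE layer the experts are feedforward networks rather than single matrices, and training adds a load-balancing term to the loss so the router does not collapse onto a few experts, as the lecture discusses.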

Key Points

  1. LLMs are characterized by their large size, extensive training data, and significant computational requirements, distinguishing them from smaller language models.
  2. Modern LLMs predominantly use a decoder-only architecture, leveraging masked self-attention and feedforward neural networks for text generation.
  3. Mixture of Experts (MoE) enhances LLM efficiency by activating only a subset of parameters during inference, reducing computational costs.
  4. Sparse MoEs select a limited number of top experts for computation, further optimizing resource utilization and FLOPS.
  5. Routing collapse, where only a few experts are consistently activated, is a challenge in MoE training, mitigated by modified loss functions.
  6. Decoding strategies like maximum probability selection offer simplicity but lack diversity, while beam search aims for global optimality at the cost of increased computation.
  7. Beam search maintains multiple probable paths during decoding, using a length normalization term to counteract the bias towards shorter sequences.
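The beam-search idea above — keep the most probable partial sequences at each step, then score finished hypotheses with length normalization to offset the bias toward shorter outputs — can be sketched with a toy next-token table. All probabilities, tokens, and parameter values below are invented for illustration; a real decoder would supply the log-probabilities from the model.

```python
import math

def beam_search(step_logprobs, beam_width=2, length_penalty=0.7, max_len=4):
    """Toy beam search (illustrative sketch, not a production decoder).

    step_logprobs(prefix) -> dict mapping next token -> log-probability.
    Finished hypotheses (ending in "<eos>") are ranked by the
    length-normalized score  logprob / len(tokens)**length_penalty.
    """
    beams = [([], 0.0)]               # (token list, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, lp in beams:
            for tok, tok_lp in step_logprobs(tuple(tokens)).items():
                candidates.append((tokens + [tok], lp + tok_lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, lp in candidates[:beam_width]:   # keep only the top beams
            if tokens[-1] == "<eos>":
                finished.append((tokens, lp / len(tokens) ** length_penalty))
            else:
                beams.append((tokens, lp))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0] if finished else beams[0][0]

# Hand-made conditional distributions (made-up numbers):
table = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"<eos>": math.log(0.5), "c": math.log(0.5)},
    ("b",): {"c": math.log(0.9), "<eos>": math.log(0.1)},
    ("b", "c"): {"<eos>": math.log(1.0)},
}
best = beam_search(lambda p: table.get(p, {"<eos>": 0.0}))
print(best)  # ['b', 'c', '<eos>']
```

Note that greedy decoding would commit to "a" (probability 0.6) at the first step; the beam keeps the "b" path alive, and after length normalization the longer hypothesis "b c" wins, showing both the global-optimality and short-sequence-bias points above.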

Walkthrough

Slide Contents

Slide 1: Announcements

Slides are posted on the website before class so they can be annotated. New slides are published every Thursday evening.

Slide 2: Recap Last Week's Episodes

Lectures 1 and 2 introduced self-attention and transformers. The last lecture covered the model types built on the transformer architecture: encoder-decoder (e.g., T5), encoder-only (e.g., BERT), and decoder-only (e.g., GPT). BERT's embeddings are meaningful and expressive, which makes them useful for tasks like classification and sentiment extraction.

Slide 3: Recap Last Week's Episodes

Encoder-decoder models (text-in, text-out) use both encoder and decoder transformers. Encoder-only models (e.g., BERT) provide meaningful embeddings. Decoder-only models (text-in, text-out) like GPT are now the most common type of LLM.

Slide 4: LLM

LLM stands for Large Language Model. A language model assigns probabilities to token sequences, predicting the next token. LLMs are large in model size (at least billions of parameters), training data (hundreds of billions to trillions of tokens), and compute requirements (multiple GPUs).
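The definition above — a language model assigns probabilities to token sequences by predicting one token at a time — is just the chain rule: P(sequence) is the product of each token's probability given the tokens before it. The tiny conditional table below is invented for illustration; a real LLM produces these conditionals with a neural network over a large vocabulary.

```python
import math

# Made-up next-token distributions, keyed by the preceding context:
cond_probs = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"cat": 0.6, "dog": 0.4},
    ("<s>", "the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def sequence_logprob(tokens):
    """Sum log P(token_t | tokens_<t) over the sequence (chain rule)."""
    lp = 0.0
    for t in range(1, len(tokens)):
        lp += math.log(cond_probs[tuple(tokens[:t])][tokens[t]])
    return lp

prob = math.exp(sequence_logprob(["<s>", "the", "cat", "sat"]))
print(prob)  # 0.5 * 0.6 * 0.7 = 0.21
```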

Slide 5: LLM

The term LLM is relatively new. In current usage it refers to text-to-text models that are large in model size, data, and compute. These models are decoder-only: the encoder is removed, and each block keeps masked self-attention, a feedforward neural network, and the residual addition and normalization steps.
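The components just listed can be put together in a minimal single-head sketch of one decoder block. This is a simplified toy, not any model's actual implementation: the weights are random placeholders, and real LLMs use multi-head attention, learned normalization parameters, and often different norm placement.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def masked_self_attention(X, Wq, Wk, Wv):
    """Single-head causal attention: token t attends only to tokens <= t."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)        # mask out future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # row-wise softmax
    return weights @ V

def decoder_block(X, Wq, Wk, Wv, W1, W2):
    """Masked self-attention + FFN, each with residual addition and norm."""
    X = layer_norm(X + masked_self_attention(X, Wq, Wk, Wv))
    ffn = np.maximum(X @ W1, 0) @ W2               # two-layer ReLU feedforward
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
d, seq = 8, 5
X = rng.standard_normal((seq, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1
out = decoder_block(X, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (5, 8)
```

A decoder-only LLM stacks many such blocks and adds a final projection from the last hidden state to next-token probabilities.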

Slide 6: LLM

Examples of decoder-only LLMs include LLaMA, Gemma, DeepSeek, Mistral, and Qwen. Over 90% of modern LLMs are decoder-only. These models are large, requiring significant compute for training and inference.
