How Large Language Models Work

An interactive journey through the core mechanics behind AI's most powerful text generation tools

1. Data Collection
2. Tokenization
3. Training
4. Inference
5. Prediction
6. Summary

STEP 1: DATA COLLECTION

LLMs learn from vast amounts of text data from across the internet and published works

  • Books & Articles
  • Websites
  • Code
  • Scientific Data

Trillions of words are collected to train modern LLMs.

STEP 2: TOKENIZATION

Text is broken down into smaller pieces called "tokens"

A tokenizer converts text into numerical values the model can understand:

Artificial intelligence is transforming how we work.
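To make this concrete, here is a minimal sketch using the open-source tiktoken library (an assumption; each LLM ships its own tokenizer, but the principle is the same): the example sentence becomes a short list of integer IDs.

```python
# Tokenization sketch using the tiktoken library (assumed installed);
# real LLMs use their own vocabularies, but the idea is identical.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")   # a common byte-pair-encoding vocabulary

text = "Artificial intelligence is transforming how we work."
token_ids = encoding.encode(text)                        # text -> list of integer IDs
pieces = [encoding.decode([tid]) for tid in token_ids]   # each ID back to its text piece

print(token_ids)   # a short list of integers, one per token
print(pieces)      # the sentence split into sub-word pieces
```

Note that tokens often do not line up with whole words; long or unusual words are split into several sub-word pieces.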

STEP 3: TRAINING

The model learns patterns by predicting the next token in sequences
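A minimal sketch of that objective in PyTorch (an assumption; the tiny model and random token IDs below are toy placeholders for a real transformer and a real corpus): the model is scored with cross-entropy on how well it predicts each next token, and the gradients from that score update its weights.

```python
# Toy next-token-prediction training step in PyTorch (assumed installed).
# The model and data are placeholders; real LLMs have billions of parameters.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),      # token IDs -> vectors
    nn.Linear(embed_dim, vocab_size),         # vectors -> a score for every vocabulary token
)

tokens = torch.randint(0, vocab_size, (1, 16))    # a fake training sequence of 16 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from token t

logits = model(inputs)                            # shape: (batch, positions, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()     # gradients nudge the weights toward better next-token guesses
print(loss.item())  # lower loss means better predictions
```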

STEP 4: INFERENCE

When given a prompt, the model generates text one token at a time

User prompt:

Write a short tagline for AI Mindset's course.
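A sketch of the generation loop in plain Python (the `predict_next_token` function is a hypothetical stand-in for a real trained model): each new token is appended to the context and the model is called again, until a stopping token appears or a length limit is reached.

```python
# Autoregressive generation loop; `predict_next_token` is a hypothetical
# placeholder for a trained model that returns one token ID given the context.

def generate(prompt_tokens, predict_next_token, max_new_tokens=30, eos_id=0):
    """Grow the token sequence one prediction at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_id = predict_next_token(tokens)   # the model sees ALL tokens so far
        tokens.append(next_id)                 # the new token joins the context
        if next_id == eos_id:                  # a special end-of-sequence token stops the loop
            break
    return tokens
```

In practice the prompt above would first be tokenized, this loop would run on the resulting IDs, and the finished ID list would be decoded back into text.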

STEP 5: TOKEN PREDICTION

The model predicts the next token based on probability distributions

For the prompt "AI will help us...", the model calculates the probability of each possible next token:
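The sketch below illustrates the idea with made-up scores for a few candidate continuations; the words and numbers are illustrative only, not real model outputs. Softmax turns raw scores into probabilities, and the next token can then be sampled from that distribution.

```python
# Illustrative next-token probabilities for "AI will help us ..." (made-up numbers).
import math
import random

candidates = ["solve", "create", "understand", "automate", "learn"]
logits = [2.1, 1.7, 1.4, 0.9, 0.3]             # raw scores a model might assign

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]           # softmax: scores -> probabilities summing to 1

for word, p in zip(candidates, probs):
    print(f"{word:>12}: {p:.1%}")

next_word = random.choices(candidates, weights=probs)[0]   # sample rather than always take the top choice
print("sampled next token:", next_word)
```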

HOW LLMS WORK: SUMMARY

Key concepts behind large language models

  • LLMs learn patterns from massive datasets of text collected from books, websites, and more
  • Text is broken into tokens that the model can process mathematically
  • During training, the model learns to predict what comes next in a sequence
  • When you provide a prompt, the model generates one token at a time
  • Each new token is predicted based on the context of all previous tokens
  • The model has no true understanding—it predicts patterns based on statistical relationships