Building an LLM from scratch: how tokens become vectors (with actual code)

Computers don't understand words. They understand electricity. Zeros and ones. This creates an architecture problem: how do you teach a neural network to read when its native language is pure mathematics?

The obvious solution is a lookup table. 'Apple' equals 1, 'Ball' equals 2, 'Cat' equals 3. It works. The computer can parse text. But it loses semantics. To the machine, 'Apple' and 'Ball' are just adjacent integers with no actual relationship.
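The naive lookup table fits in one line of Python. A sketch to make the limitation concrete (the words and IDs are illustrative):

```python
# A naive word-to-ID lookup table (illustrative words and IDs).
vocab = {"Apple": 1, "Ball": 2, "Cat": 3}

# Encoding works fine...
ids = [vocab[w] for w in ["Apple", "Cat"]]  # [1, 3]

# ...but the IDs carry no meaning: 'Apple' (1) is numerically
# adjacent to 'Ball' (2), yet the two words are unrelated.
print(ids)
```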

The gap between text and tensors

Everything in the real world (text, audio, video) needs a dense numerical representation before it can enter a neural network. We don't just want numbers. We want embeddings.

An embedding isn't a single value. It's a vector: a list of numbers that functions like GPS coordinates. 'King' lives at one point in space. 'Queen' lives nearby. 'Apple' lives across the map, close to 'Banana' but far from royalty.

This is vector space. When you train an LLM, it learns to organise this map itself. It discovers that 'dog' and 'cat' share characteristics, so it places them geometrically close. Mathematics captures semantics.
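"Geometrically close" has a standard measure: cosine similarity. A toy sketch with hand-picked 3-dimensional vectors (real models learn hundreds of dimensions, and learn the values during training; these numbers are made up for illustration):

```python
import math

# Hand-picked toy embeddings; real models learn these during training.
emb = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# 'dog' sits much closer to 'cat' than to 'car'.
print(cosine(emb["dog"], emb["cat"]))  # ~0.996
print(cosine(emb["dog"], emb["car"]))  # ~0.29
```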

Tokenization: breaking text into pieces

LLMs don't actually read whole words. If the model memorised every dictionary entry (plus slang, names, typos), the vocabulary would be infinite.

The solution is tokenization. We break text into chunks called tokens:

  1. Parse the input text
  2. Identify unique substrings
  3. Build a numerical vocabulary

This creates a two-way bridge. Encoding transforms 'Hello World' into [245, 981]. Decoding takes [245, 981] back to 'Hello World'. It's the interface between humans and machines.
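A minimal word-level tokenizer makes the bridge concrete. (The IDs [245, 981] in the text are illustrative; this sketch assigns IDs by alphabetical order of the vocabulary.)

```python
class SimpleTokenizer:
    """Word-level tokenizer: build a vocab, then encode/decode."""

    def __init__(self, text):
        words = sorted(set(text.split()))
        self.str_to_id = {w: i for i, w in enumerate(words)}
        self.id_to_str = {i: w for w, i in self.str_to_id.items()}

    def encode(self, text):
        return [self.str_to_id[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_str[i] for i in ids)

tok = SimpleTokenizer("Hello World this is a tiny corpus")
ids = tok.encode("Hello World")
assert tok.decode(ids) == "Hello World"  # the round trip holds
```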

BPE: the Lego blocks of language

What happens when the model sees a word it's never encountered? Like 'Supercalifragilisticexpialidocious'?

With whole-word tokens, you hit an out-of-vocabulary (OOV) error. Game over.

Modern tokenizers use Byte Pair Encoding (BPE). If the model doesn't recognise the full word, it breaks it into subword pieces it does know. 'Unfortunately' becomes: Un + fortun + ate + ly.

With this technique, models read and write any word in any language using just a limited set of Lego pieces. No infinite vocabularies required.
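The core of BPE training is a simple loop: count adjacent symbol pairs, merge the most frequent pair into a new token, repeat. A minimal sketch on a tiny corpus (real tokenizers such as GPT-2's operate on bytes and run tens of thousands of merges):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (each starts as characters)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        merged = best[0] + best[1]
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges, every word in the corpus starts with the single subword `low`, which is exactly the Lego-block behaviour described above.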

Training with sliding windows

How do you prepare this data for training? An LLM learns to predict the future by looking at the past.

We take text and create sliding windows:

  • Input: 'The cat' → Target: 'climbed'
  • Input: 'The cat climbed' → Target: 'the'

This generates millions of training examples from a single book.
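In practice, each window's target is the whole input shifted one token to the right, so every position contributes a prediction. A sketch on raw token IDs (the IDs and the context length of 4 are arbitrary):

```python
def sliding_windows(token_ids, context_len):
    """Yield (input, target) pairs; the target is the input shifted by one."""
    pairs = []
    for i in range(len(token_ids) - context_len):
        x = token_ids[i : i + context_len]
        y = token_ids[i + 1 : i + context_len + 1]  # next-token targets
        pairs.append((x, y))
    return pairs

ids = [10, 20, 30, 40, 50, 60]
for x, y in sliding_windows(ids, context_len=4):
    print(x, "->", y)
# [10, 20, 30, 40] -> [20, 30, 40, 50]
# [20, 30, 40, 50] -> [30, 40, 50, 60]
```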

The complete pipeline

Putting it together, here's the data assembly line:

Raw text → Tokenization → Token IDs → Word embeddings + Positional embeddings → Model input

Each token embedding is learned during training. The model adjusts these vectors so that semantically similar tokens end up close together in high-dimensional space. This is how transformers capture meaning.
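In PyTorch, both lookups are `nn.Embedding` layers. A minimal sketch of the last pipeline step (the vocab size, context length, embedding dimension, and token IDs are arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, context_len, emb_dim = 1000, 8, 16

token_emb = nn.Embedding(vocab_size, emb_dim)  # one learned vector per token ID
pos_emb = nn.Embedding(context_len, emb_dim)   # one learned vector per position

token_ids = torch.tensor([[245, 981, 3, 7]])   # batch of 1, 4 tokens
positions = torch.arange(token_ids.shape[1])   # [0, 1, 2, 3]

# Model input = token embedding + positional embedding at each position.
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 4, 16])
```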

Recent developments: beyond basic embeddings

The landscape keeps evolving. Value Aggregation (VA) and AlignedWVA methods (2026) generate sentence embeddings by aggregating attention vectors from LLMs, outperforming traditional hidden state pooling. These approaches work without additional training.

For production use, open-source models dominate. Sentence-BERT and EmbeddingGemma excel at semantic similarity and RAG tasks. The dev community rates them highly for efficiency and quality.

Actually building it

This isn't just theory. The accompanying notebook implements:

  • A tokenizer from scratch
  • The BPE merge algorithm
  • PyTorch embedding layers

You can turn Shakespeare into mathematical tensors today.

Colab notebook: Run the code

GitHub repo: vongrossi/fazendo-um-llm-do-zero

Next up: attention mechanisms. The part that makes language models actually understand context.

Written by TheVibeish Editorial