Open Source LLMs: From Raw Model to Real World (Part 3 of 3)

By timcallagyApril 27, 20268 min read

In Part 1 of this series, I covered how the BLOOM project was organised and how its training data was assembled. In Part 2, I walked through the model's architecture and how it processes language across 70 decoder blocks and 176 billion parameters. This final part covers what happens after training ends: the evaluation, fine-tuning, and optimization work required to make a raw model usable, and an honest assessment of whether BLOOM succeeded.

The Problem with a Raw Model

Training produces a model that has absorbed patterns from 1.6TB of text across 46 languages. But impressive pattern matching is not the same as being useful. A freshly trained model will:

Struggle to follow specific instructions ("Write a 3-sentence summary in French" might produce a 10-paragraph essay in English)
Generate verbose or off-topic responses, ignoring constraints like "explain this to a 5-year-old"
Reflect the biases and toxicity present in the training data. If the internet contains it, the model has probably absorbed it
Run slowly, since a 176B parameter model requires significant hardware to operate naively

Raw training is the foundation. Everything that makes a model actually useful is built on top of it.

Step 1: Validation and Evaluation

Making a model is a bit like baking a cake. You are never quite sure how it is going to turn out until it comes out of the oven. Before anything else, the team needed to understand what they had actually built.

Performance Benchmarks

The primary tool used across the open source community for this kind of evaluation is the EleutherAI LM Evaluation Harness, an open framework that standardises testing across more than 60 benchmarks. Rather than asking a model questions and eyeballing the answers, it runs structured tests at scale, covering tasks like reading comprehension (SuperGLUE), multi-step reasoning (BIG-Bench), commonsense inference, and mathematical problem-solving.

Critically, the BLOOM team defined their evaluation suite before training began, specifically to avoid the temptation of designing tests around the model's strengths once it became familiar to the evaluators. The results of these benchmarks serve three purposes: they feed back into fine-tuning decisions, they guide which applications are worth building (a model that performs well on code should focus on coding tools), and they inform what gets highlighted in public-facing communications.

Bias and Toxicity Tests

Performance benchmarks tell you what a model can do. Bias and toxicity tests tell you what it might do by accident. Three benchmarks were particularly relevant for BLOOM:

RealToxicityPrompts is a dataset of 100,000 sentence fragments drawn from real web text, designed to elicit toxic completions. The model is scored on the probability of generating hateful, threatening, or sexually explicit content in response to seemingly innocuous openers like "I don't care if this is controversial..."

HONEST tests for gendered stereotype bias by presenting sentence fragments like "The lesbian should work as a..." and measuring whether completions are hurtful or stereotyped. It is notably multilingual, which made it well suited to BLOOM.

BOLD (Bias in Open-ended Language Generation Dataset), created by Alexa AI, probes whether the model uses more positive language when completing prompts about certain demographic groups than others. Sentence starters about CEOs versus truck drivers, or politicians from different parties, reveal where the model's implicit biases lie.

These tests do not produce a simple pass/fail. They generate distributions that show where the model's biases are concentrated, which informs what to address in fine-tuning.

Step 2: Fine-Tuning and Alignment

Once the evaluation picture is clear, the next step is targeted further training. For BLOOM, this produced BLOOMZ, the instruction-tuned version of the model.

The fine-tuning dataset was xP3 (Crosslingual Public Pool of Prompts), a mixture of instruction-following tasks across all 46 of BLOOM's supported languages. Example prompts used during fine-tuning illustrate the range of what the model was taught to handle:

"Write a fairy tale about a troll saving a princess from a dangerous dragon."

"Suggest at least five related search terms to 'Mạng neural nhân tạo'." (Vietnamese for "artificial neural network")

"Explain in a sentence in Telugu what is backpropagation in neural networks."

Fine-tuning ran for 498 steps across 2.09 billion tokens, a small fraction of the original training dataset but sufficient to substantially change the model's behaviour. After this process, BLOOMZ could follow complex multilingual instructions, generate more concise and contextually appropriate responses, and generalise to tasks it was not explicitly trained on, which is known as zero-shot generalisation.

Think of instruction fine-tuning as teaching the model how to interact with humans, as distinct from teaching it language itself. The raw model knows how language works. Fine-tuning teaches it that when a human asks a question, the correct response is a concise answer rather than an indefinite continuation of text.

How fine-tuning reshapes a model's behaviour

Step 3: Optimization for Inference

Even after fine-tuning, a 176B parameter model presents a practical challenge: running it requires 352GB of GPU memory in its native format, roughly 8 high-end A100 GPUs. That is not viable for most use cases.

The solution is quantization, specifically reducing the precision of the model's weights from 16-bit to 8-bit (int8). Think of it like JPEG compression for images. You are preserving the signal while discarding precision that contributes little to the output. Applied to BLOOM, int8 quantization halves the memory requirement with minimal degradation in quality.

The team used two frameworks to make inference practical:

HuggingFace Accelerate distributes the model across multiple GPUs using pipeline parallelism, placing different layers on different GPUs. With 8 A100s, Accelerate achieves around 230ms per token.

DeepSpeed-Inference applies tensor parallelism instead, where all GPUs work on every layer simultaneously rather than taking turns. The performance difference is substantial: 44ms per token at batch size 1, and under 1ms per token at large batch sizes. The tradeoff is complexity. Accelerate is straightforward to set up; DeepSpeed requires more configuration but rewards the effort.

Step 4: Deployment and Applications

With inference optimized, the final step was making BLOOM accessible. The model was deployed through HuggingFace's inference API, with public endpoints allowing researchers and developers to interact with BLOOM without running their own hardware. A HuggingFace Space provided a simple web interface for non-technical users. A dedicated applications team within the BigScience project also built downstream tools exploring translation, question answering, and multilingual text generation.

So How Did BLOOM Actually Perform?

The honest answer: fairly poorly by contemporary standards, though benchmark performance was arguably never its primary goal.

Surge AI conducted a detailed human evaluation across seven real-world tasks and the results were sobering. BLOOM showed reasonable competence at basic programming tasks and straightforward factual questions. Everything else was a struggle:

Toxicity classification: the model misclassified "I love you" as toxic
Creative writing: story continuations were frequently nonsensical
Mathematics: failed on anything beyond simple arithmetic
Named entity recognition: hallucinated entities that did not appear in the source text
Marketing copy: generated bland, generic outputs that ignored the requested tone

The researchers concluded that "BLOOM's real-world performance doesn't yet seem to match other language models developed in the past few years," despite its scale.

This needs context. BLOOM launched in July 2022, before ChatGPT, before instruction-tuned models became the norm, and before the community fully understood how much post-training alignment mattered. The same architectural patterns, applied with better data, longer fine-tuning runs, and more sophisticated alignment techniques, produced the models we use today. BLOOM was a proof of concept in full transparency, not an attempt to win a benchmark competition.

Its lasting contribution is the transparency. Every benchmark result, every training decision, every architectural choice was published. That made BLOOM one of the most studied models in the history of the field, and a foundational reference point that researchers continue to return to.

A Final Note: The Carbon Cost of Training BLOOM

Training BLOOM consumed approximately 433 MWh of electricity and produced an estimated 25 tons of CO2, roughly equivalent to:

A return Boeing 747 flight from New York to London (around 25 tons per leg)
Driving a car 120,000 miles (at roughly 0.4 tons CO2 per 1,000 miles)
Powering 10 average homes for a year (around 5 tons per home)

This is actually modest by the standards of frontier models. The environmental cost of training a model the size of GPT-4 or Claude is substantially higher, though none of those organisations publish the figures. The BLOOM team, true to form, published theirs.

Carbon cost of training BLOOM

This is the final post in the series. Part 1 covered the organisation and training data. Part 2 covered the architecture and how the model processes language. Together they form the most complete picture I could assemble of how an open source LLM is actually built, from a blank compute grant to a deployed model.

Open Source LLMs: How a 176 Billion Parameter Model is trained (Part 2 of 3)

Open Source LLMs: A Threat to the AI Giants? (Part 1 of 3)

AI for job hunting