Model Information

We are excited to announce the release of Cerebras DocChat, our first iteration of models designed for document-based conversational question answering. This series includes two models: Cerebras Llama3-DocChat, a large language model (LLM), and Cerebras Dragon-DocChat, a multi-turn retriever model.

This model – Cerebras Llama3-DocChat 1.0 8B – was built on top of the Llama 3 base model using insights from the latest research on document-based Q&A, most notably Nvidia's ChatQA model series. As part of this work, we leveraged our experience in LLM training and dataset curation to overcome the gaps in ChatQA's released datasets and training recipes. Additionally, we employed synthetic data generation to address limitations that couldn't be fully resolved with the available real data. Llama3-DocChat 8B was trained in a few hours on a single Cerebras System.

You can find more information about DocChat at the following locations:


| ChatRAG Benchmark | Llama3 Instruct 8B | Command-R-Plus | Nvidia Llama3-ChatQA 1.5 8B | GPT-4-Turbo-2024-04-09 | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- | --- | --- |
| Doc2Dial | 31.33 | 33.51 | 39.33 | 35.35 | 39.19 |
| QuAC | 32.64 | 34.16 | 39.73 | 40.1 | 36 |
| QReCC | 43.4 | 49.77 | 49.03 | 51.46 | 50.27 |
| CoQA | 73.25 | 69.71 | 76.46 | 77.73 | 79.56 |
| DoQA | 30.34 | 40.67 | 49.6 | 41.6 | 48.77 |
| ConvFinQA | 53.15 | 71.21 | 78.46 | 84.16 | 80.13 |
| SQA | 36.6 | 74.07 | 73.28 | 79.98 | 74.19 |
| TopiOCQA | 34.64 | 53.77 | 49.96 | 48.32 | 52.13 |
| HybriDial* | 40.77 | 46.7 | 65.76 | 47.86 | 64 |
| INSCIT | 32.09 | 35.76 | 30.1 | 33.75 | 32.88 |
| Average (all) | 40.82 | 50.93 | 55.17 | 54.03 | 55.71 |
| Average (exclude HybriDial) | 40.83 | 51.4 | 53.99 | 54.72 | 54.79 |

| Eleuther Eval Harness Benchmark | Llama3 Instruct 8B | Nvidia Llama3-ChatQA 1.5 8B | Cerebras Llama3-DocChat 1.0 8B |
| --- | --- | --- | --- |
| hellaswag | 57.68 | 61.37 | 61.68 |
| winogrande | 71.98 | 73.95 | 74.11 |
| truthfulqa_mc1 | 36.23 | 28.52 | 29.25 |
| truthfulqa_mc2 | 51.65 | 43.56 | 45.14 |
| mmlu | 63.84 | 60.68 | 62.86 |
| gsm8k | 76.12 | 13.72 | 55.57 |
| arc_easy | 81.61 | 80.56 | 82.03 |
| arc_challenge | 52.99 | 51.02 | 53.92 |
| Average | 61.51 | 51.67 | 58.07 |
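
The two ChatRAG averages reported above can be reproduced from the per-dataset scores. The short sketch below does this for the Cerebras Llama3-DocChat 1.0 8B column; the variable names are illustrative only, and the scores are copied from the ChatRAG Benchmark rows.

```python
# Per-dataset ChatRAG scores for Cerebras Llama3-DocChat 1.0 8B,
# copied from the benchmark rows above.
docchat_scores = {
    "Doc2Dial": 39.19,
    "QuAC": 36.0,
    "QReCC": 50.27,
    "CoQA": 79.56,
    "DoQA": 48.77,
    "ConvFinQA": 80.13,
    "SQA": 74.19,
    "TopiOCQA": 52.13,
    "HybriDial": 64.0,
    "INSCIT": 32.88,
}

# "Average (all)": plain mean over all ten datasets.
avg_all = round(sum(docchat_scores.values()) / len(docchat_scores), 2)

# "Average (exclude HybriDial)": mean over the remaining nine datasets.
without_hybridial = [v for k, v in docchat_scores.items() if k != "HybriDial"]
avg_excl = round(sum(without_hybridial) / len(without_hybridial), 2)

print(avg_all, avg_excl)
```

Both values match the averages listed for this model, which suggests the averages are unweighted means over datasets.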

Prompt Format

DocChat supports the standard Llama3 Instruct chat template – no fancy formatting functions required! When providing a context document to the model, simply prepend the user turn with <context> {put your document here} </context>. You may also provide an “instruction” before the user input to better align the model’s response with the desired behavior. Examples include:

  • Please give a full and complete answer for the question.
  • Answer the following question with a short span

We use the same system prompt as ChatQA: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
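
As a minimal sketch of the format described above, the helper below assembles the system and user messages for a single turn. The function name `build_messages` is hypothetical (it is not part of any library), and the exact placement of newlines around the `<context>` tags is an assumption; only the tag names, the instruction-before-question order, and the system prompt come from the text above.

```python
# System prompt quoted from the section above (same as ChatQA).
SYSTEM_PROMPT = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions based on the context. The assistant should also indicate when "
    "the answer cannot be found in the context."
)


def build_messages(document, question, instruction=""):
    """Hypothetical helper: wrap the document in <context> tags and
    place the optional instruction directly before the question."""
    prompt = f"{instruction} {question}".strip()
    # Newline layout around the tags is an assumption.
    user_turn = f"<context>\n{document}\n</context>\n{prompt}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_turn},
    ]


messages = build_messages(
    document="Condor Galaxy 3 has 64 CS-3 systems.",
    question="How many systems does Condor Galaxy 3 have?",
    instruction="Answer the following question with a short span",
)
print(messages[1]["content"])
```

The resulting list can be passed to the tokenizer's chat template in the same way as any other Llama3 Instruct conversation.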

Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "cerebras/Llama3-DocChat-1.0-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

system = "This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
instruction = "Please give a full and complete answer for the question."

document = """
# Cerebras Wafer-Scale Cluster

Exa-scale performance, single device simplicity

## AI Supercomputers

Condor Galaxy (CG), the supercomputer built by G42 and Cerebras, is the simplest and fastest way to build AI models in the cloud. With over 16 ExaFLOPs of AI compute, Condor Galaxy trains the most demanding models in hours rather than days. The terabyte scale MemoryX system natively accommodates 100 billion+ parameter models, making large scale training simple and efficient.

| Cluster  | ExaFLOPs | Systems  | Memory |
| -------- | -------- | -------- | ------ |
| CG1      | 4        | 64 CS-2s | 82 TB  |
| CG2      | 4        | 64 CS-2s | 82 TB  |
| CG3      | 8        | 64 CS-3s | 108 TB |

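"""

question = "How many total CS systems does Condor Galaxy 1, 2, and 3 have combined, and how many flops does this correspond to?"

# Wrap the document in <context> tags, then append the instruction and question.
user_turn = f"""<context>
{document}
</context>
{instruction} {question}"""

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user_turn}
]

# NOTE: the arguments of the three calls below were truncated in this copy of
# the example. The values shown follow the usual Llama 3 Instruct usage pattern
# and are assumptions, not confirmed settings.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,  # assumed value
    eos_token_id=terminators,
)
# Keep only the newly generated tokens (drop the prompt).
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))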
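
As a quick sanity check on whatever the model answers for the example question, the expected totals can be computed directly from the table in the example document. This is only an illustrative sketch: the parsing below assumes the simple pipe-table layout shown in the example and is not part of DocChat itself.

```python
# Table copied from the example document above.
table = """\
| Cluster  | ExaFLOPs | Systems  | Memory |
| -------- | -------- | -------- | ------ |
| CG1      | 4        | 64 CS-2s | 82 TB  |
| CG2      | 4        | 64 CS-2s | 82 TB  |
| CG3      | 8        | 64 CS-3s | 108 TB |
"""

total_systems = 0
total_exaflops = 0
for line in table.strip().splitlines()[2:]:  # skip header and separator rows
    cells = [cell.strip() for cell in line.strip("|").split("|")]
    cluster, exaflops, systems, memory = cells
    total_exaflops += int(exaflops)
    total_systems += int(systems.split()[0])  # "64 CS-2s" -> 64

print(total_systems, total_exaflops)  # prints "192 16"
```

A correct answer to the example question should therefore mention 192 CS systems and 16 ExaFLOPs in total.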


License

This model was trained from Llama 3 8B base and is therefore subject to the META LLAMA 3 COMMUNITY LICENSE AGREEMENT. Furthermore, it was trained on ChatQA's synthetic conversational QA dataset, which was generated using GPT-4. As a result, this model can be used for non-commercial purposes only and is subject to the Terms of Use of the data generated by OpenAI. Additionally, please see the licensing information of the individual datasets.


Acknowledgements

DocChat was built on top of a large body of ML work, spanning training datasets, recipes, and evaluation. We want to thank each of these resources.

