Text Generation

Generate text based on a prompt.

If you are interested in a Chat Completion task, which generates a response based on a list of messages, check out the chat-completion task.

For more details about the text-generation task, check out its dedicated page! You will find examples and related materials.

Recommended models

google/gemma-2-2b-it: A text-generation model trained to follow instructions.
bigcode/starcoder: A code generation model that can generate code in 80+ languages.
meta-llama/Meta-Llama-3.1-8B-Instruct: Very powerful text generation model trained to follow instructions.
microsoft/Phi-3-mini-4k-instruct: Small yet powerful text generation model.
HuggingFaceH4/starchat2-15b-v0.1: Strong coding assistant model.
mistralai/Mistral-Nemo-Instruct-2407: Very strong open-source large language model.

This is only a subset of the supported models. Find the model that suits you best here.

Using the API

Python
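
The snippet below is a minimal sketch of a text-generation request using only the Python standard library. The endpoint URL pattern and the helper names `build_request` and `generate` are assumptions made for illustration and are not taken from this page; the `inputs` field and the `Authorization` header follow the specification further down.

```python
import json
import urllib.request

# Assumed endpoint pattern for the hosted Inference API; adjust it to your deployment.
API_BASE = "https://api-inference.huggingface.co/models"


def build_request(model_id: str, prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the text-generation task."""
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{model_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(model_id: str, prompt: str, token: str):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(model_id, prompt, token)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires network access and a valid access token):
# generate("google/gemma-2-2b-it", "Once upon a time", "hf_**")
```

Separating request construction from sending keeps the payload and headers easy to inspect before any network call is made.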

API specification

Request

Payload
inputs*	string
parameters	object
  adapter_id	string	LoRA adapter ID.
  best_of	integer	Generate best_of sequences and return the one with the highest token logprobs.
  decoder_input_details	boolean	Whether to return decoder input token logprobs and ids.
  details	boolean	Whether to return generation details.
  do_sample	boolean	Activate logits sampling.
  frequency_penalty	number	The parameter for frequency penalty. 1.0 means no penalty. Penalizes new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood of repeating the same line verbatim.
  grammar	unknown	One of the following:
    (#1)	object
      type*	enum	Possible values: json.
      value*	unknown	A string that represents a JSON Schema. JSON Schema is a declarative language for annotating JSON documents with types and descriptions.
    (#2)	object
      type*	enum	Possible values: regex.
      value*	string
  max_new_tokens	integer	Maximum number of tokens to generate.
  repetition_penalty	number	The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
  return_full_text	boolean	Whether to prepend the prompt to the generated text.
  seed	integer	Random sampling seed.
  stop	string[]	Stop generating tokens if a member of `stop` is generated.
  temperature	number	The value used to modulate the logits distribution.
  top_k	integer	The number of highest probability vocabulary tokens to keep for top-k-filtering.
  top_n_tokens	integer	The number of highest probability vocabulary tokens to keep for top-n-filtering.
  top_p	number	Top-p value for nucleus sampling.
  truncate	integer	Truncate input tokens to the given size.
  typical_p	number	Typical decoding mass. See Typical Decoding for Natural Language Generation for more information.
  watermark	boolean	Whether to watermark the output, using the method from A Watermark for Large Language Models.
stream	boolean
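
As an illustration of how the payload fields fit together, here is a hedged sketch that assembles a request body with a few sampling parameters and an optional JSON grammar. The helper name `build_payload` and its default values are hypothetical and chosen only for the example. The table types `grammar.value` as unknown, so passing the schema as a Python dict (rather than a serialized string) is an assumption.

```python
import json


def build_payload(prompt, *, max_new_tokens=64, temperature=0.7, top_p=0.95,
                  stop=None, seed=None, json_schema=None, stream=False):
    """Assemble a request body from fields listed in the payload table."""
    parameters = {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,            # sampling must be active for temperature/top_p to matter
        "temperature": temperature,
        "top_p": top_p,
        "return_full_text": False,    # do not prepend the prompt to the output
    }
    if stop:
        parameters["stop"] = list(stop)
    if seed is not None:
        parameters["seed"] = seed
    if json_schema is not None:
        # Variant (#1) of `grammar`: constrain the output with a JSON Schema.
        # Assumption: the schema is sent as a JSON object.
        parameters["grammar"] = {"type": "json", "value": json_schema}
    return {"inputs": prompt, "parameters": parameters, "stream": stream}


# Illustrative schema; the field name "city" is made up for this example.
schema = {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
payload = build_payload("Name a city in France as JSON.", seed=42, json_schema=schema)
print(json.dumps(payload, indent=2))
```

Optional fields are only added when set, so the server-side defaults apply to everything the caller leaves out.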

Some options can be configured by passing headers to the Inference API. Here are the available headers:

Headers
authorization	string	Authentication header in the form `'Bearer hf_**'`, where `hf_**` is a personal user access token with Inference API permission. You can generate one from your settings page.
x-use-cache	boolean, default to `true`	The Inference API has a cache layer that speeds up requests it has already seen. Most models can use cached results because they are deterministic (the outputs will be the same anyway). However, if you use a nondeterministic model, you can set this header to `false` to bypass the cache and force a real new query. Read more about caching here.
x-wait-for-model	boolean, default to `false`	If the model is not ready, wait for it instead of receiving a 503 error. This limits the number of requests required to get your inference done. It is advised to set this flag to `true` only after receiving a 503 error, as this confines hanging in your application to known places. Read more about model availability here.

For more information about Inference API headers, check out the parameters guide.
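
The header advice above (set `x-wait-for-model` only after a 503) can be sketched as follows. The helper names and the `send` callable are hypothetical stand-ins for whatever HTTP client you use, and sending the boolean header values as the strings `"true"` and `"false"` is an assumption.

```python
def build_headers(token, *, use_cache=True, wait_for_model=False):
    """Return request headers; optional headers are added only when they differ from the defaults."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    if not use_cache:
        headers["x-use-cache"] = "false"       # bypass the cache layer
    if wait_for_model:
        headers["x-wait-for-model"] = "true"   # block until the model is loaded
    return headers


def post_with_wait(send, token, payload, *, use_cache=True):
    """Try once without waiting; on a 503, retry once with x-wait-for-model set.

    `send(headers, payload)` is a hypothetical callable returning (status_code, body).
    """
    status, body = send(build_headers(token, use_cache=use_cache), payload)
    if status == 503:
        retry_headers = build_headers(token, use_cache=use_cache, wait_for_model=True)
        status, body = send(retry_headers, payload)
    return status, body
```

Retrying only after a 503 means the potentially long wait happens in one known place in the application instead of on every request.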

Response

Output type depends on the stream input parameter. If stream is false (default), the response will be a JSON object with the following fields:

Body
details	object
  best_of_sequences	object[]
    finish_reason	enum	Possible values: length, eos_token, stop_sequence.
    generated_text	string
    generated_tokens	integer
    prefill	object[]
      id	integer
      logprob	number
      text	string
    seed	integer
    tokens	object[]
      id	integer
      logprob	number
      special	boolean
      text	string
    top_tokens	array[]
      id	integer
      logprob	number
      special	boolean
      text	string
  finish_reason	enum	Possible values: length, eos_token, stop_sequence.
  generated_tokens	integer
  prefill	object[]
    id	integer
    logprob	number
    text	string
  seed	integer
  tokens	object[]
    id	integer
    logprob	number
    special	boolean
    text	string
  top_tokens	array[]
    id	integer
    logprob	number
    special	boolean
    text	string
generated_text	string
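
To show how the non-streaming body might be consumed, here is a small sketch that pulls out the generated text and a few fields from `details`. The function name `summarize_response` and the sample values are hypothetical. Accepting a one-element list around the object is an assumption for robustness, not something this page states.

```python
def summarize_response(response):
    """Extract the generated text and selected generation details from a response body."""
    # Assumption: tolerate a response wrapped in a one-element list.
    if isinstance(response, list):
        response = response[0]
    # `details` is only present when it was requested, so default to empty.
    details = response.get("details") or {}
    tokens = details.get("tokens") or []
    total_logprob = None
    if tokens:
        # Sum log-probabilities of non-special tokens.
        total_logprob = sum(t["logprob"] for t in tokens if not t.get("special"))
    return {
        "text": response["generated_text"],
        "finish_reason": details.get("finish_reason"),
        "generated_tokens": details.get("generated_tokens"),
        "total_logprob": total_logprob,
    }


# Made-up sample body shaped like the table above.
sample = {
    "generated_text": " world",
    "details": {
        "finish_reason": "length",
        "generated_tokens": 2,
        "tokens": [
            {"id": 1, "logprob": -0.5, "special": False, "text": " wor"},
            {"id": 2, "logprob": -0.25, "special": False, "text": "ld"},
        ],
    },
}
print(summarize_response(sample))
```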

If stream is true, generated tokens are returned as a stream, using Server-Sent Events (SSE). For more information about streaming, check out this guide.

Body
details	object
  finish_reason	enum	Possible values: length, eos_token, stop_sequence.
  generated_tokens	integer
  input_length	integer
  seed	integer
generated_text	string
index	integer
token	object
  id	integer
  logprob	number
  special	boolean
  text	string
top_tokens	object[]
  id	integer
  logprob	number
  special	boolean
  text	string
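
A streamed response can be consumed by decoding each Server-Sent Event as it arrives. The sketch below assumes that every event carries a single `data:` line holding one JSON object shaped like the table above, and that the last event fills in `generated_text` and `details`. The helper names and the sample lines are hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        yield json.loads(line[len("data:"):].strip())


def collect_stream(lines):
    """Accumulate token text; return (text, details) once the stream is exhausted."""
    pieces = []
    final_text = None
    details = None
    for event in iter_sse_events(lines):
        token = event.get("token") or {}
        if not token.get("special"):
            pieces.append(token.get("text", ""))
        if event.get("generated_text") is not None:
            # Assumed to be set only on the final event.
            final_text = event["generated_text"]
            details = event.get("details")
    text = final_text if final_text is not None else "".join(pieces)
    return text, details


# Made-up sample events for illustration.
sample_lines = [
    b'data:{"index":1,"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},"generated_text":null,"details":null}',
    b"",
    'data: {"index":2,"token":{"id":2,"text":" world","logprob":-0.2,"special":false},'
    '"generated_text":"Hello world","details":{"finish_reason":"length","generated_tokens":2,"input_length":3,"seed":null}}',
]
print(collect_stream(sample_lines))
```

In a real client, `lines` would be the line iterator of the HTTP response, and each `token.text` could be displayed as soon as it is decoded rather than collected.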
