Inference Endpoints (dedicated) documentation

Deploying a llama.cpp Container

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Deploying a llama.cpp Container

You can deploy any llama.cpp compatible GGUF on the Hugging Face Endpoints. When you create an endpoint with a GGUF model, a llama.cpp container is automatically selected using the latest image built from the master branch of the llama.cpp repository. Upon successful deployment, a server with an OpenAI-compatible endpoint becomes available.

Llama.cpp supports multiple endpoints like /tokenize, /health, /embedding and many more. For a comprehensive list of available endpoints, please refer to the API documentation.

Deployment Steps

To deploy an endpoint with a llama.cpp container, follow these steps:

  1. Create a new endpoint and select a repository containing a GGUF model. The llama.cpp container will be automatically selected.
Select model
  1. Choose the desired GGUF file, noting that memory requirements will vary depending on the selected file. For example, an F16 model requires more memory than a Q4_K_M model.
Select GGUF file
  1. Select your desired hardware configuration.
Select hardware
  1. Optionally, you can customize the container’s configuration settings like Max Tokens, Number of Concurrent Requests. For more information on those, please refer to the Configurations section below.

  2. Click the Create Endpoint button to complete the deployment.

Alternatively, you can follow the video tutorial below for a step-by-step guide on deploying an endpoint with a llama.cpp container:

Configurations

The llama.cpp container offers several configuration options that can be adjusted. After deployment, you can modify these settings by accessing the Settings tab on the endpoint details page.

Basic Configurations

  • Max Tokens (per Request): The maximum number of tokens that can be sent in a single request.
  • Max Concurrent Requests: The maximum number of concurrent requests allowed for this deployment. Increasing this limit requires additional memory allocation. For instance, setting this value to 4 requests with 1024 tokens maximum per request requires memory capacity for 4096 tokens in total.

Advanced Configurations

In addition to the basic configurations, you can also modify specific settings by setting environment variables. A list of available environment variables can be found in the API documentation.

Please note that the following environment variables are reserved by the system and cannot be modified:

  • LLAMA_ARG_MODEL
  • LLAMA_ARG_HTTP_THREADS
  • LLAMA_ARG_N_GPU_LAYERS
  • LLAMA_ARG_EMBEDDINGS
  • LLAMA_ARG_HOST
  • LLAMA_ARG_PORT
  • LLAMA_ARG_NO_MMAP
  • LLAMA_ARG_CTX_SIZE
  • LLAMA_ARG_N_PARALLEL
  • LLAMA_ARG_ENDPOINT_METRICS

Troubleshooting

In case the deployment fails, please watch the log output for any error messages.

You can access the logs by clicking on the Logs tab on the endpoint details page. To learn more, refer to the Logs documentation.

  • Malloc failed: out of memory
    If you see this error message in the log:

    ggml_backend_cuda_buffer_type_alloc_buffer: allocating 67200.00 MiB on device 0: cuda
    Malloc failed: out of memory
    llama_kv_cache_init: failed to allocate buffer for kv cache
    llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
    ...

    That means the selected hardware configuration does not have enough memory to accommodate the selected GGUF model. You can try to:

    • Lower the number of maximum tokens per request
    • Lower the number of concurrent requests
    • Select a smaller GGUF model
    • Select a larger hardware configuration
  • Workload evicted, storage limit exceeded
    This error message indicates that the hardware has too little memory to accommodate the selected GGUF model. Try selecting a smaller model or select a larger hardware configuration.

  • Other problems
    For other problems, please refer to the llama.cpp issues page. In case you want to create a new issue, please also include the full log output in your bug report.

< > Update on GitHub