
Published · 6 min read

Running Your Private Local LLM Step by Step


Quick Intro

In this post, we’ll walk through the steps of running a Large Language Model (LLM) locally with Docker and a ChatGPT-style web interface. We’ll use Llama 3.1, the latest Llama model available as of August 12, 2024, as an example.

Key Terms and Concepts

Q1. Why Docker?

Docker is a popular containerization platform that allows developers to package, ship, and run applications in containers. Containers are lightweight and portable environments for running applications, which provide isolation and consistency across different computing systems.

Running your LLM stack with Docker offers several advantages. To begin with, it gives you a convenient, self-contained way to run the web interface for your model, which makes the whole setup easier to manage and maintain.

With Docker, you can easily switch between different models, configurations, or even versions of the same model without affecting your system’s stability.

By the end of this guide, you will have: 1. Ollama installed, 2. a model downloaded, and 3. Docker installed on your local device.

One caveat before we start: while the most up-to-date open models such as Llama 3.1 are impressive in their own right, the free tier of GPT-4o may still offer better results for certain tasks, largely because it is a far larger model served from dedicated datacenter hardware. If privacy, offline use, or cost still make a local model appealing to you, please continue reading.

Q2. How to Read a Model Name?

When selecting a model to implement, it’s not uncommon to feel overwhelmed by the various options available on platforms such as Hugging Face. To make this process easier, let’s break down some common words you might encounter in model names and learn what they typically refer to.

This knowledge will help you navigate the selection process with more confidence and choose a model that suits your needs.

   e.g. "Meta Llama 3.1 8B Instruct 8bit"

Meta: ‘Meta’ is the model maintainer, who is responsible for developing, updating and maintaining the model. Look for models with clear maintenance schedules and regular updates from reputable maintainers.

8B: The model has 8 billion parameters. Larger parameter counts generally mean better accuracy but higher computational requirements.

Instruct: This model is fine-tuned specifically for instruction-following tasks (e.g., generating text based on a prompt or set of instructions). It has been trained to follow directions, making it well suited for tasks that require a clear understanding of the input. You may also encounter a Base model, which has general-purpose language capabilities but is not optimized for any particular task or use case.

8bit: This refers to the number of bits used to represent each model weight. Larger bit widths (e.g., 32-bit) offer more precision, while smaller ones (e.g., 8-bit) may sacrifice some accuracy but result in faster computation and lower memory use.

Running a large language model (LLM) locally using a smaller bit size, such as 3, 4, or 6 bits, is primarily tied to the technique known as Quantization. This method is especially useful when there are constraints related to hardware resources, such as memory (VRAM) or processing power (CPU/GPU capabilities).

Typically, the parameters of neural networks are stored as 32-bit floating-point numbers. Quantization involves converting these 32-bit floats into lower precision formats, such as 16-bit floats or even lower precision integers (like 8-bit, 6-bit, or 4-bit).

A general rule of thumb is:

High parameter or bit size = Accuracy-focused

Low parameter or bit size = Performance-focused

Keep in mind that these are rough guidelines and may not apply universally. Always consult the documentation for specific models and experiment with different configurations to find the optimal trade-off between accuracy and performance.
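In practice, with Ollama the parameter count and quantization level are usually encoded right in the model tag, so trying a different trade-off is just a matter of pulling a different tag. The tags below are illustrative only; check the Ollama library page for the variants that are actually published:

    ollama pull llama3.1:8b-instruct-fp16   # accuracy-focused: 16-bit weights, largest download
    ollama pull llama3.1:8b-instruct-q4_0   # performance-focused: 4-bit quantized, much smaller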

Choosing the Right Model

When selecting a model size, the most important factor is the GPU VRAM size. This is because larger models require more memory to process, which can impact performance on lower-end hardware.

However, even if you don’t have access to a GPU or prefer not to use one, it’s still possible to run Llama 3.1 using the technique called Quantization. Please note that this method may come with a trade-off in terms of output quality, so be prepared for potential degradation in results.

Finding the hardware requirements for your model: Please refer to this resource: Hardware Requirements, where you can apply filters to find the optimal model configuration for your hardware.

If you want to set up Llama 3.1 along with this guide, first check your hardware configuration. Verify that your machine meets the minimum requirements specified by the model maintainer, in this case Meta AI.

You can find these details on their website: Llama 3.1 Requirements.

To summarize, to run an LLM on your local device you mainly need:

  • Enough disk space: Make sure you have sufficient storage capacity on your local hard drive to accommodate the model and any related files.

  • GPU VRAM size (if applicable): If you’re planning to use a GPU, ensure that it has enough VRAM to hold your model. If you are not using a GPU, consider a smaller parameter count or a lower bit width so the model fits in system RAM. A rough sizing estimate is sketched below.
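As a back-of-the-envelope check (weights only, ignoring the extra memory needed for context and activations), you can estimate the memory a model needs as parameters × bits per weight ÷ 8:

    # params (in billions) * bits per weight / 8 ≈ gigabytes of weights
    echo $((8 * 4 / 8))    # an 8B model at 4-bit ≈ 4 GB, plus overhead
    echo $((8 * 16 / 8))   # the same model at 16-bit ≈ 16 GB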

Installing & Running Llama 3.1 Locally Step by Step

1. Download & Install Ollama: Get the latest version of Ollama from their website: Download Ollama Here

Follow the instructions provided by Ollama to install it on your system.

[Screenshots: downloading and installing Ollama]
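Once the installer finishes, you can confirm Ollama is available from a terminal (the version number you see will differ):

    ollama --version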

2. (Optional) Change the model storage directory: If your drive is running out of space, or you want the models stored on a different drive, you can change the default storage location for downloaded models by following these instructions: How to change model saving directory
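For reference, this is typically done by setting the OLLAMA_MODELS environment variable before the Ollama service starts; the path below is just an example, and on Windows you would set it as a system environment variable instead:

    # store downloaded models on a different drive (example path)
    export OLLAMA_MODELS=/mnt/bigdrive/ollama/models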

3. Download Llama 3.1 model: Visit the Llama model library at Llama 3.1 Library to find and download the desired model. Follow the instructions below:

[Screenshot: Llama 3.1 page in the Ollama model library]

Open Terminal on your device

[Screenshot: pulling the model in the terminal]
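The library page shows the exact command for each tag; for the 8B model it typically looks like this, and ollama list will confirm what you have downloaded:

    ollama pull llama3.1:8b    # downloads the model (several GB)
    ollama list                # verify the model shows up locally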

4. Install Docker: Download and install the latest version of Docker from their website: Download Docker Here
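After installation, make sure Docker is actually running before the next step; a quick check from the terminal:

    docker --version
    docker info      # errors here usually mean the Docker daemon isn't running yet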

5. Install Open WebUI:

Follow this link to the Open WebUI website, find the Installation with Default Configuration section on the page, then copy the command shown there and paste it into your terminal.

[Screenshot: installation command on the Open WebUI page]

Open Terminal on your device and paste the command.

[Screenshot: running the Open WebUI installation command]

Wait until the installation completes.
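For reference, at the time of writing the default-configuration command (for Ollama running on the same machine) looks roughly like the one below; always check the Open WebUI documentation for the current version, since the image name, port, and flags may change:

    # runs Open WebUI in the background, maps the UI to http://localhost:3000,
    # and persists its data in the "open-webui" Docker volume
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:main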

6. Reopen Docker: If everything installed correctly, your Docker homepage should look like this:

[Screenshot: the open-webui container running in Docker Desktop]

Click the Port URL and open the UI in your browser

[Screenshot: the mapped port link in Docker Desktop]
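If the link doesn’t open automatically, you can browse to the mapped host port directly (3000 in the example command above; adjust if you changed the -p flag):

    xdg-open http://localhost:3000   # Linux; use 'open' on macOS, or just type the URL into a browser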

Note: If this is your first time using Open WebUI, you will be prompted to register an account. Don’t worry: the account is stored by your local Open WebUI instance, and you can still run your model offline.

7. Select the model you just downloaded. In this case, choose ‘llama3.1:8b’.

[Screenshot: selecting llama3.1:8b in Open WebUI]

Enjoy your private Chatbot!🤖
