Skip to content

Instantly share code, notes, and snippets.

@Francesco149
Last active April 4, 2026 12:31
Show Gist options
  • Select an option

  • Save Francesco149/354ae20985b0852c230ac36c4cf05185 to your computer and use it in GitHub Desktop.

Select an option

Save Francesco149/354ae20985b0852c230ac36c4cf05185 to your computer and use it in GitHub Desktop.

running the bonsai 1-bit models on AnythingLLM on linux

First, you need to build the llama.cpp fork for your particular GPU.

Expand one of the sections below according to your GPU and follow it.

NVIDIA

Click to see NVIDIA instructions

Tested on a RTX 5060 8GB

install dependencies

These are for arch linux but you can adapt to other distros.

sudo pacman -S --needed \
        base-devel \
        git \
        cmake \
        ninja \
        python \
        cuda \
        cudnn \
        openblas \
        fmt \
        clang \
        lldb

clone and compile the llama fork with CUDA support

git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp
cd prism-ml-llama.cpp

cmake \
    -B build \
    -G Ninja \
    -DGGML_CUDA=ON \
    -DCUDAToolkit_ROOT=/opt/cuda \
    -DCMAKE_BUILD_TYPE=Release
        
cmake --build build -j

AMD

Click to see AMD instructions

Tested on a 7800XT 16GB

verify driver

Verify that you're on amdgpu. If not, sort out your drivers

lspci -k | grep -A 3 VGA

install ROCm stack

sudo pacman -S base-devel git cmake clang rocm-hip-sdk rocblas hipblas rocm-opencl-runtime

clone and build the llama.cpp fork with HIPBLAS support

Replace gfx1101 with the version given by rocminfo | grep gfx .

git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp
cd prism-ml-llama.cpp

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -- -j $(nproc)

download the quantized model

You can run this in parallel to the compile to save time.

wget https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf -O Bonsai-8B.gguf

run the llama server

# open port
sudo ufw allow 8080/tcp

./build/bin/llama-server \
    -m Bonsai-8B.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99 \
    --ctx-size 65536

If you don't have enough VRAM, you can trade off speed or context. lower -ngl to trade speed for context, lower --ctx-size to trade context size for maintaining full speed

set up AnythingLLM

Install AnythingLLM desktop if you haven't or adapt instructions to whatever setup you have.

Edit .config/anythingllm-desktop/storage/.env and add:

PROVIDER_SUPPORTS_NATIVE_TOOL_CALLING='generic-openai'

This is very important as the model will be very dumb about using tools otherwise.

Restart your AnythingLLM.

Wrench bottom left -> AI Providers -> LLM -> LLM provider -> set to Generic OpenAI.

  • Base URL: http://127.0.0.1:8080/v1
  • API Key: sk-dummy
  • Chat Model Name: openai
  • Model context window: match what you had on the llama server
  • Max Tokens: 1024

Click "Save Changes" top right.

You should now be able to chat with the model. Removing the default prompt for the workspace might help the model be a bit more chatty, play around with that and the temperature.

using the docker container

Click to see docker instructions

If you want to run AnythingLLM as a docker container, there's a few things to note.

Add yourself to the docker group and re-login

sudo usermod -aG docker youruser
su -
groups # verify you are in docker

Restart docker

sudo systemctl restart docker

Set up the container:

mkdir ~/anythingllm
cd ~/anythingllm

now create a run.sh file in your ~/anythingllm folder using your favorite text editor

#!/bin/sh

export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run --rm -it -p 3001:3001 \
--cap-add SYS_ADMIN \
--name anythingllm \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm

Make it executable and run it:

chmod +x ./run.sh
./run.sh

Make sure to open the necessary ports

sudo ufw allow 3001/tcp # for anythingllm
sudo ufw allow 8080/tcp # for llama

Now go to http://localhost:3001 and confirm it's working. You can also access it over the lan if you opened the port above.

After the first run, stop it and edit the .env file in ~/anythingllm to add:

PROVIDER_SUPPORTS_NATIVE_TOOL_CALLING='generic-openai'

Restart it, and set up the OpenAI endpoint. Remember, 127.0.0.1 won't work in the container since it just refers to the container itself.

Use ip addr to find your lan ip. Likely something like 192.168.1.x or 10.0.10.x .

Wrench bottom left -> AI Providers -> LLM -> LLM provider -> set to Generic OpenAI.

  • Base URL: http://your-lan-ip:8080/v1
  • API Key: sk-dummy
  • Chat Model Name: openai
  • Model context window: match what you had on the llama server
  • Max Tokens: 1024

Click "Save Changes" top right. Test out a chat with the model

Once you have confirmed it's working, you can remove the --rm -it from the container script so it runs in the background and you don't need to keep the terminal open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment