First, you need to build the llama.cpp fork for your particular GPU.
Expand one of the sections below according to your GPU and follow it.
Click to see NVIDIA instructions
Tested on a RTX 5060 8GB
These are for arch linux but you can adapt to other distros.
sudo pacman -S --needed \
base-devel \
git \
cmake \
ninja \
python \
cuda \
cudnn \
openblas \
fmt \
clang \
lldbgit clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp
cd prism-ml-llama.cpp
cmake \
-B build \
-G Ninja \
-DGGML_CUDA=ON \
-DCUDAToolkit_ROOT=/opt/cuda \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -jClick to see AMD instructions
Tested on a 7800XT 16GB
Verify that you're on amdgpu. If not, sort out your drivers
lspci -k | grep -A 3 VGAsudo pacman -S base-devel git cmake clang rocm-hip-sdk rocblas hipblas rocm-opencl-runtimeReplace gfx1101 with the version given by rocminfo | grep gfx .
git clone https://github.com/Mintplex-Labs/prism-ml-llama.cpp
cd prism-ml-llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j $(nproc)You can run this in parallel to the compile to save time.
wget https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf -O Bonsai-8B.gguf# open port
sudo ufw allow 8080/tcp
./build/bin/llama-server \
-m Bonsai-8B.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
--ctx-size 65536If you don't have enough VRAM, you can trade off speed or context. lower -ngl to trade
speed for context, lower --ctx-size to trade context size for maintaining full speed
Install AnythingLLM desktop if you haven't or adapt instructions to whatever setup you have.
Edit .config/anythingllm-desktop/storage/.env and add:
PROVIDER_SUPPORTS_NATIVE_TOOL_CALLING='generic-openai'This is very important as the model will be very dumb about using tools otherwise.
Restart your AnythingLLM.
Wrench bottom left -> AI Providers -> LLM -> LLM provider -> set to Generic OpenAI.
- Base URL:
http://127.0.0.1:8080/v1 - API Key:
sk-dummy - Chat Model Name:
openai - Model context window: match what you had on the llama server
- Max Tokens: 1024
Click "Save Changes" top right.
You should now be able to chat with the model. Removing the default prompt for the workspace might help the model be a bit more chatty, play around with that and the temperature.
Click to see docker instructions
If you want to run AnythingLLM as a docker container, there's a few things to note.
Add yourself to the docker group and re-login
sudo usermod -aG docker youruser
su -
groups # verify you are in dockerRestart docker
sudo systemctl restart dockerSet up the container:
mkdir ~/anythingllm
cd ~/anythingllmnow create a run.sh file in your ~/anythingllm folder using your favorite
text editor
#!/bin/sh
export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run --rm -it -p 3001:3001 \
--cap-add SYS_ADMIN \
--name anythingllm \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllmMake it executable and run it:
chmod +x ./run.sh
./run.shMake sure to open the necessary ports
sudo ufw allow 3001/tcp # for anythingllm
sudo ufw allow 8080/tcp # for llamaNow go to http://localhost:3001 and confirm it's working. You can also access it over the lan if you opened the port above.
After the first run, stop it and edit the .env file in ~/anythingllm to add:
PROVIDER_SUPPORTS_NATIVE_TOOL_CALLING='generic-openai'Restart it, and set up the OpenAI endpoint. Remember, 127.0.0.1 won't work
in the container since it just refers to the container itself.
Use ip addr to find your lan ip. Likely something like 192.168.1.x or
10.0.10.x .
Wrench bottom left -> AI Providers -> LLM -> LLM provider -> set to Generic OpenAI.
- Base URL:
http://your-lan-ip:8080/v1 - API Key:
sk-dummy - Chat Model Name:
openai - Model context window: match what you had on the llama server
- Max Tokens: 1024
Click "Save Changes" top right. Test out a chat with the model
Once you have confirmed it's working, you can remove the --rm -it from the
container script so it runs in the background and you don't need to keep the
terminal open.