Run Qwen 3.5 9B Uncensored Locally on a 3090 or 4090 with OpenCode
A Local Model That Will Actually Do The Work
If you want a local model that runs on a single RTX 3090 or 4090, works with OpenCode, and is far less likely to refuse your prompts, this stack is a lot of fun. It is the kind of model you can realistically run at home, without needing some ridiculous multi-GPU box, and it is a strong candidate if you are searching for an uncensored local LLM for a 24GB card.
I do not know, or want to know, what you will do with it. I just know it is a very capable model, and this one is much less interested in telling you no.
I have been testing Qwen3.5-9B-Uncensored-HauhauCS-Aggressive with llama-server from llama.cpp, then using it through OpenCode. It works.
This setup was inspired by the excellent llama.cpp discussion here:
That thread is worth reading. It covers the broader pattern of using local OpenAI-compatible servers with coding agents. This post is the focused version for one specific stack: uncensored Qwen 3.5 9B, llama-server, and OpenCode.
Why This Model Stands Out
The model card for Qwen3.5-9B-Uncensored-HauhauCS-Aggressive makes the pitch pretty clear:
- 9B parameters
- 262K native context
- GGUF downloads including BF16, Q8, Q6, and Q4 variants
- an uncensored aggressive variant intended to remove refusals without nerfing capability
The BF16 file is large, but a 4090 can handle it. If you are looking for a local LLM that fits in 24GB of VRAM, this is exactly why the setup is interesting. In my case, I was able to run the BF16 GGUF fully on GPU with llama-server and a 131072-token context window.
That also puts it in the category of models people actually care about: stuff you can run on a single RTX 3090 or RTX 4090 at home.
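A quick way to sanity-check whether a given quant fits in 24GB is simple arithmetic: file size is roughly parameter count times bytes per weight. A minimal sketch (the helper name is mine, and the bytes-per-weight figures for the quantized variants are rough approximations, not exact GGUF sizes):

```python
def approx_weight_gb(n_params: float, bytes_per_weight: float) -> float:
    """Rough model-file size: parameter count times storage width per weight."""
    return n_params * bytes_per_weight / 1e9

# Approximate bytes per weight for the GGUF variants on the model card.
for name, bpw in [("BF16", 2.0), ("Q8", 1.0), ("Q6", 0.75), ("Q4", 0.5)]:
    print(f"{name}: ~{approx_weight_gb(9e9, bpw):.1f} GB")
```

For a 9B model, BF16 lands around 18 GB of weights, which is why it squeezes onto a 24GB card at all, with the remainder going to the KV cache and overhead.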
If you want the exact model page, it is here:
The Exact llama-server Script
I keep a tiny helper script called qwen.sh in my working directory. This starts llama-server, waits for the health check, and also serves a simple local chat UI.
#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
MODEL="/home/matt/models/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
SERVER="/home/matt/llama.cpp/build/bin/llama-server"
API_PORT=8080
CHAT_PORT=8081
PID_FILE="/tmp/qwen-server.pid"
CHAT_PID_FILE="/tmp/qwen-chat.pid"
start() {
if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
echo "Already running (PID $(cat "$PID_FILE"))"
else
echo "Starting Qwen3.5-9B on port $API_PORT..."
nohup "$SERVER" \
--model "$MODEL" \
--host 0.0.0.0 \
--port "$API_PORT" \
--n-gpu-layers 999 \
--ctx-size 131072 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
> /tmp/qwen-server.log 2>&1 &
echo $! > "$PID_FILE"
echo "PID $! - waiting for model to load..."
until curl -s "http://localhost:$API_PORT/health" | grep -q '"ok"'; do sleep 2; done
echo "API ready at http://localhost:$API_PORT/v1"
fi
if [ -f "$CHAT_PID_FILE" ] && kill -0 "$(cat "$CHAT_PID_FILE")" 2>/dev/null; then
echo "Chat UI already running (PID $(cat "$CHAT_PID_FILE"))"
else
nohup python3 -m http.server "$CHAT_PORT" --directory "$SCRIPT_DIR" --bind 0.0.0.0 \
> /tmp/qwen-chat.log 2>&1 &
echo $! > "$CHAT_PID_FILE"
echo "Chat UI at http://localhost:$CHAT_PORT/chat.html"
fi
}
stop() {
for name_pid in "API server:$PID_FILE" "Chat UI:$CHAT_PID_FILE"; do
name="${name_pid%%:*}"
pf="${name_pid#*:}"
if [ -f "$pf" ] && kill -0 "$(cat "$pf")" 2>/dev/null; then
kill "$(cat "$pf")"
rm -f "$pf"
echo "$name stopped."
else
echo "$name not running."
rm -f "$pf"
fi
done
}
status() {
if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
echo "API: running (PID $(cat "$PID_FILE")) - http://localhost:$API_PORT/v1"
curl -s "http://localhost:$API_PORT/health" 2>/dev/null || echo " (not responding)"
else
echo "API: not running"
fi
if [ -f "$CHAT_PID_FILE" ] && kill -0 "$(cat "$CHAT_PID_FILE")" 2>/dev/null; then
echo "Chat UI: running (PID $(cat "$CHAT_PID_FILE")) - http://localhost:$CHAT_PORT/chat.html"
else
echo "Chat UI: not running"
fi
}
case "${1:-}" in
start) start ;;
stop) stop ;;
restart) stop; sleep 2; start ;;
status) status ;;
log) tail -f /tmp/qwen-server.log ;;
*) echo "Usage: $0 {start|stop|restart|status|log}" ;;
esac
The important parts are:
- `--ctx-size 131072` to keep a large context window
- `--n-gpu-layers 999` so the model fully offloads to GPU if it fits
- sampling settings from the model card's recommended thinking defaults
You do need a recent llama.cpp build. Qwen 3.5 support landed very recently.
Quick API Smoke Test
Once the server is up, verify that it is healthy and exposing an OpenAI-compatible API:
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
You should see your model in the /v1/models response, something like:
Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf
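If you would rather script this check than eyeball curl output, the /v1/models endpoint returns the standard OpenAI-style list shape, so pulling out the ids is trivial. A small sketch using only the standard library (the function names and base URL are my own; the parsing assumes the OpenAI-compatible schema llama-server serves):

```python
import json
import urllib.request

def model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models list response."""
    return [m["id"] for m in models_response["data"]]

def list_local_models(base_url: str = "http://localhost:8080/v1") -> list[str]:
    """Fetch /v1/models from the running llama-server and return its ids."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

With the server up, `list_local_models()` should return a list containing the GGUF filename shown above.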
And if you want to test a chat completion directly:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf",
"messages": [{"role": "user", "content": "Reply with exactly: pong"}],
"max_tokens": 128,
"temperature": 0
}'
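The same completion call works from Python if you want it in a script. Here is a minimal standard-library sketch (the function names are mine; the response parsing assumes the standard OpenAI chat-completions shape that llama-server emits):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server's OpenAI-compatible API

def extract_text(response: dict) -> str:
    """Pull the assistant message out of an OpenAI-style completion response."""
    return response["choices"][0]["message"]["content"]

def chat(prompt: str,
         model: str = "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf") -> str:
    """Send one chat-completion request and return the assistant's text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))

# With the server running: chat("Reply with exactly: pong")
```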
Add It To OpenCode Globally
OpenCode can talk to any OpenAI-compatible provider, which is exactly what llama-server gives us.
Add this to your global OpenCode config at ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llama.cpp": {
"name": "llama-server (local)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://127.0.0.1:8080/v1"
},
"models": {
"qwen3.5-9b-uncensored-local": {
"id": "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf",
"name": "Qwen3.5-9B Uncensored (local)",
"reasoning": true,
"tool_call": true,
"temperature": true,
"interleaved": {
"field": "reasoning_content"
},
"limit": {
"context": 131072,
"output": 8192
},
"modalities": {
"input": ["text"],
"output": ["text"]
}
}
}
}
}
}
Two notes here:
- The `id` needs to match the model name exposed by `llama-server`.
- The `interleaved.reasoning_content` bit helps OpenCode deal with Qwen's reasoning output more cleanly.
This OpenCode wiring follows the same general pattern discussed in the llama.cpp guide, just adapted for this exact model.
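Since a mismatched `id` is the most common failure mode in this wiring, it is easy to check the config against the live server. A small sketch (the helper names and comparison logic are mine; the key paths match the config above and the OpenAI-style /v1/models schema):

```python
import json
import pathlib
import urllib.request

CONFIG = pathlib.Path.home() / ".config/opencode/opencode.json"

def configured_ids(config: dict) -> set[str]:
    """Collect every model id declared under each OpenCode provider."""
    return {
        model["id"]
        for provider in config.get("provider", {}).values()
        for model in provider.get("models", {}).values()
    }

def served_ids(base_url: str = "http://127.0.0.1:8080/v1") -> set[str]:
    """Ask llama-server which model ids it actually exposes."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return {m["id"] for m in json.load(resp)["data"]}

# With the server running:
#   missing = configured_ids(json.loads(CONFIG.read_text())) - served_ids()
#   if missing: print("configured but not served:", missing)
```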
Make opencodeqwen Just Work
Once the provider is configured, you can add a helper function to ~/.bashrc so opencodeqwen starts the server if needed and launches OpenCode on the local uncensored model.
opencodeqwen() {
local model="llama.cpp/qwen3.5-9b-uncensored-local"
if ! curl -fsS http://127.0.0.1:8080/health >/dev/null 2>&1; then
/home/matt/working/qwen.sh start || return $?
fi
command opencode --model "$model" "$@"
}
Then reload your shell:
source ~/.bashrc
And now this works:
opencodeqwen
Or for a one-shot run:
opencodeqwen run "summarize this repo"
Why I Like This Setup
This combination hits a useful sweet spot:
- fully local
- OpenAI-compatible API
- works with OpenCode
- runs on a single RTX 3090 or RTX 4090
- much less likely to refuse prompts than mainstream aligned models
That last point matters more than people admit. I got into repeated fights with GPT-5.2 on OpenClaw over things it was sure were security issues, including passwords in an .env file. It would flat-out refuse, no matter how hard I pushed. Sometimes you want a model that will simply attempt the task instead of moralizing, stalling, or collapsing into canned refusals.
Running this stuff yourself matters. It means you are not boxed into whatever Sam Altman or Dario decide you should be allowed to do that week.
A Few Practical Caveats
This is not magic. There are still tradeoffs.
You Need A Recent llama.cpp Build
Qwen 3.5 is new. If your local llama-server build is old, update it first.
Bigger Context Costs Memory
I like keeping 131072 context available because it is generally enough, but it has a real VRAM cost. If you are short on memory, lower --ctx-size.
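To see where that cost comes from, the fp16 KV cache grows linearly with context: two tensors (K and V) per layer, each holding ctx × kv_heads × head_dim values. A back-of-the-envelope sketch (the layer and head counts below are illustrative placeholders for a 9B-class model, not this model's actual config):

```python
def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """fp16 KV cache size: K and V tensors per layer, each ctx*heads*head_dim."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
gib = kv_cache_bytes(131072, 32, 8, 128) / 2**30
print(f"~{gib:.0f} GiB at 131072 context")  # prints "~16 GiB at 131072 context"
```

Halving --ctx-size halves this number, and grouped-query models with fewer KV heads pay proportionally less, which is part of why a big context can still fit next to the weights.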
Uncensored Does Not Mean Perfect
The model card notes that it may still occasionally append a mild disclaimer, but it is not refusing the task. In practice, the important part is that it keeps generating the actual answer.
You Are Responsible For How You Use It
If you intentionally choose an uncensored model, you are picking up a power tool. Use good judgment and use it responsibly.
Also, to be clear, this is not a coding model. It is capable, weirdly useful, and much more willing than the heavily aligned stuff, but it is not especially good at coding compared to stronger frontier models. The win here is freedom and local control, not raw coding quality.
Final Links
If you want the original references and downloads, start here:
- Qwen3.5-9B-Uncensored-HauhauCS-Aggressive on Hugging Face
- llama.cpp offline agentic coding guide
- OpenCode
- llama.cpp
If you have a 3090 or 4090 sitting around and want an uncensored local model that will actually take a swing at the job, this is one of the more interesting setups I have tried lately. It is hackery in the best way. Rough edges, full control, and a lot fewer lectures.