Run Qwen 3.5 9B Uncensored Locally on a 3090 or 4090 with OpenCode

A Local Model That Will Actually Do The Work

If you want an uncensored local LLM that runs on a single RTX 3090 or RTX 4090, works with OpenCode, and is far less likely to refuse your prompts, this stack is a lot of fun. No ridiculous multi-GPU box required: one 24GB consumer card is enough.

I do not know, or want to know, what you will do with it. I just know it is a very capable model, and this one is much less interested in telling you no.

I have been testing Qwen3.5-9B-Uncensored-HauhauCS-Aggressive with llama-server from llama.cpp, then using it through OpenCode. It works.

This setup was inspired by the excellent llama.cpp discussion here:

That thread is worth reading. It covers the broader pattern of using local OpenAI-compatible servers with coding agents. This post is the focused version for one specific stack: uncensored Qwen 3.5 9B, llama-server, and OpenCode.

Why This Model Stands Out

The model card for Qwen3.5-9B-Uncensored-HauhauCS-Aggressive makes the pitch pretty clear:

  • 9B parameters
  • 262K native context
  • GGUF downloads including BF16, Q8, Q6, and Q4 variants
  • an uncensored aggressive variant intended to remove refusals without nerfing capability

The BF16 file is large, but a 4090 can handle it, which is exactly what makes this interesting as a local LLM for 24GB of VRAM. In my case, I was able to run the BF16 GGUF fully on GPU with llama-server and a 131072-token context window.

That also puts it in the category of models people actually care about: stuff you can run on a single RTX 3090 or RTX 4090 at home.
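The back-of-envelope math checks out. As a sketch (weights only; the KV cache and CUDA overhead come on top of this):

```shell
# Rough weight-memory estimate: BF16 stores 2 bytes per parameter.
# Illustrative only; real VRAM usage adds KV cache and runtime overhead.
params=9000000000
bytes_per_param=2
gb=$(( params * bytes_per_param / 1000000000 ))
echo "~${gb} GB for weights alone"   # → ~18 GB for weights alone
```

That leaves headroom on a 24GB card, which is why the context size becomes the knob you actually tune.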

If you want the exact model page, it is here:

The Exact llama-server Script

I keep a tiny helper script called qwen.sh in my working directory. This starts llama-server, waits for the health check, and also serves a simple local chat UI.

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
MODEL="/home/matt/models/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
SERVER="/home/matt/llama.cpp/build/bin/llama-server"
API_PORT=8080
CHAT_PORT=8081
PID_FILE="/tmp/qwen-server.pid"
CHAT_PID_FILE="/tmp/qwen-chat.pid"

start() {
  if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "Already running (PID $(cat "$PID_FILE"))"
  else
    echo "Starting Qwen3.5-9B on port $API_PORT..."
    nohup "$SERVER" \
      --model "$MODEL" \
      --host 0.0.0.0 \
      --port "$API_PORT" \
      --n-gpu-layers 999 \
      --ctx-size 131072 \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      > /tmp/qwen-server.log 2>&1 &
    echo $! > "$PID_FILE"
    echo "PID $! - waiting for model to load..."
    until curl -s "http://localhost:$API_PORT/health" | grep -q '"ok"'; do sleep 2; done
    echo "API ready at http://localhost:$API_PORT/v1"
  fi

  if [ -f "$CHAT_PID_FILE" ] && kill -0 "$(cat "$CHAT_PID_FILE")" 2>/dev/null; then
    echo "Chat UI already running (PID $(cat "$CHAT_PID_FILE"))"
  else
    nohup python3 -m http.server "$CHAT_PORT" --directory "$SCRIPT_DIR" --bind 0.0.0.0 \
      > /tmp/qwen-chat.log 2>&1 &
    echo $! > "$CHAT_PID_FILE"
    echo "Chat UI at http://localhost:$CHAT_PORT/chat.html"
  fi
}

stop() {
  for name_pid in "API server:$PID_FILE" "Chat UI:$CHAT_PID_FILE"; do
    name="${name_pid%%:*}"
    pf="${name_pid#*:}"
    if [ -f "$pf" ] && kill -0 "$(cat "$pf")" 2>/dev/null; then
      kill "$(cat "$pf")"
      rm -f "$pf"
      echo "$name stopped."
    else
      echo "$name not running."
      rm -f "$pf"
    fi
  done
}

status() {
  if [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "API: running (PID $(cat "$PID_FILE")) - http://localhost:$API_PORT/v1"
    curl -s "http://localhost:$API_PORT/health" 2>/dev/null || echo "  (not responding)"
  else
    echo "API: not running"
  fi
  if [ -f "$CHAT_PID_FILE" ] && kill -0 "$(cat "$CHAT_PID_FILE")" 2>/dev/null; then
    echo "Chat UI: running (PID $(cat "$CHAT_PID_FILE")) - http://localhost:$CHAT_PORT/chat.html"
  else
    echo "Chat UI: not running"
  fi
}

case "${1:-}" in
  start)   start ;;
  stop)    stop ;;
  restart) stop; sleep 2; start ;;
  status)  status ;;
  log)     tail -f /tmp/qwen-server.log ;;
  *)       echo "Usage: $0 {start|stop|restart|status|log}" ;;
esac

The important parts are:

  • --ctx-size 131072 to keep a large context window
  • --n-gpu-layers 999 so the model fully offloads to GPU if it fits
  • sampling settings from the model card's recommended thinking defaults
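To confirm the --n-gpu-layers 999 offload actually happened, check the server log for llama.cpp's offload report. The exact wording and layer count vary by build and model, so the line below is a representative sample, not this model's actual output; in practice you would grep /tmp/qwen-server.log directly.

```shell
# Representative offload line (wording varies across llama.cpp builds);
# in practice: grep -i "offloaded" /tmp/qwen-server.log
line='load_tensors: offloaded 37/37 layers to GPU'
echo "$line" | grep -o 'offloaded [0-9]*/[0-9]* layers'
```

If the two numbers do not match (say, 30/37), the model is partially on CPU and generation will be much slower.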

You do need a recent llama.cpp build. Qwen 3.5 support landed very recently.
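Updating is quick. Roughly, assuming a git checkout at the path used in the script and a CUDA toolchain already installed (check llama.cpp's own build docs if your setup differs):

```shell
# Pull the latest llama.cpp and rebuild llama-server with CUDA support.
cd /home/matt/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```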

Quick API Smoke Test

Once the server is up, verify that it is healthy and exposing an OpenAI-compatible API:

curl http://localhost:8080/health
curl http://localhost:8080/v1/models

You should see your model in the /v1/models response, something like:

Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf

And if you want to test a chat completion directly:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf",
    "messages": [{"role": "user", "content": "Reply with exactly: pong"}],
    "max_tokens": 128,
    "temperature": 0
  }'
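The raw JSON is noisy. To pull out just the assistant text, pipe the curl output through jq if you have it, or a plain sed one-liner if you do not. A sketch against a trimmed sample response (the real payload has more fields, but the content path is the same):

```shell
# Trimmed sample of a /v1/chat/completions response; in practice, pipe
# the curl command above into the same sed expression.
response='{"choices":[{"message":{"role":"assistant","content":"pong"}}]}'
echo "$response" | sed -n 's/.*"content":"\([^"]*\)".*/\1/p'   # → pong
```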

Add It To OpenCode Globally

OpenCode can talk to any OpenAI-compatible provider, which is exactly what llama-server gives us.

Add this to your global OpenCode config at ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "name": "llama-server (local)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "qwen3.5-9b-uncensored-local": {
          "id": "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf",
          "name": "Qwen3.5-9B Uncensored (local)",
          "reasoning": true,
          "tool_call": true,
          "temperature": true,
          "interleaved": {
            "field": "reasoning_content"
          },
          "limit": {
            "context": 131072,
            "output": 8192
          },
          "modalities": {
            "input": ["text"],
            "output": ["text"]
          }
        }
      }
    }
  }
}

Two notes here:

  1. The id needs to match the model name exposed by llama-server.
  2. The interleaved block, with its field set to reasoning_content, helps OpenCode handle Qwen's reasoning output more cleanly.

This OpenCode wiring follows the same general pattern discussed in the llama.cpp guide, just adapted for this exact model.
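On the first note: if OpenCode errors out on model selection, it is worth sanity-checking that the id in opencode.json matches what llama-server actually serves. A sketch against a trimmed /v1/models payload (in practice, substitute the output of curl -s http://127.0.0.1:8080/v1/models):

```shell
# Trimmed sample of a /v1/models response; extract the served id and
# compare it against the id configured in opencode.json.
models='{"data":[{"id":"Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"}]}'
configured="Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
served=$(echo "$models" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
[ "$served" = "$configured" ] && echo "id matches" || echo "id mismatch"
```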

Make opencodeqwen Just Work

Once the provider is configured, you can add a helper function to ~/.bashrc so opencodeqwen starts the server if needed and launches OpenCode on the local uncensored model.

opencodeqwen() {
    local model="llama.cpp/qwen3.5-9b-uncensored-local"

    if ! curl -fsS http://127.0.0.1:8080/health >/dev/null 2>&1; then
        /home/matt/working/qwen.sh start || return $?
    fi

    command opencode --model "$model" "$@"
}

Then reload your shell:

source ~/.bashrc

And now this works:

opencodeqwen

Or for a one-shot run:

opencodeqwen run "summarize this repo"

Why I Like This Setup

This combination hits a useful sweet spot:

  • fully local
  • OpenAI-compatible API
  • works with OpenCode
  • runs on a single RTX 3090 or RTX 4090
  • much less likely to refuse prompts than mainstream aligned models

That last point matters more than people admit. I got into repeated fights with GPT-5.2 on OpenClaw over things it was sure were security issues, including passwords in an .env file. It would just flat out refuse, no matter how hard I pushed. Sometimes you want a model that will simply attempt the task instead of moralizing, stalling, or collapsing into canned refusals.

Running this stuff yourself matters. It means you are not boxed into whatever Sam Altman or Dario decide you should be allowed to do that week.

A Few Practical Caveats

This is not magic. There are still tradeoffs.

You Need A Recent llama.cpp Build

Qwen 3.5 is new. If your local llama-server build is old, update it first.

Bigger Context Costs Memory

I like keeping 131072 context available because it is generally enough, but a window that large has a real VRAM cost. If memory is tight, drop --ctx-size.
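As a rough sense of scale, here is the standard fp16 KV-cache estimate. The layer, head, and dimension numbers below are placeholders for illustration, not this model's actual architecture; llama-server logs the real KV buffer size at load time, which is the number to trust.

```shell
# Back-of-envelope fp16 KV-cache size: 2 tensors (K and V) per layer,
# each ctx x kv_heads x head_dim at 2 bytes per element.
# NOTE: layers/kv_heads/head_dim are PLACEHOLDER values, not this
# model's real config; check the llama-server startup log instead.
layers=36; kv_heads=4; head_dim=128; ctx=131072; bytes=2
kv=$(( 2 * layers * ctx * kv_heads * head_dim * bytes ))
echo "$(( kv / 1073741824 )) GiB of KV cache at full context"
```

Even a single-digit-GiB KV cache matters when the weights already occupy most of a 24GB card, which is why halving --ctx-size is the first lever to pull.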

Uncensored Does Not Mean Perfect

The model card notes that the model may still occasionally append a mild disclaimer. In practice, the important part is that it keeps generating the actual answer rather than refusing the task.

You Are Responsible For How You Use It

An uncensored model is a power tool, and choosing one is a deliberate decision. Use good judgment, and use it responsibly.

Also, to be clear, this is not a coding model. It is capable, weirdly useful, and much more willing than the heavily aligned stuff, but it is not especially good at coding compared to stronger frontier models. The win here is freedom and local control, not raw coding quality.

Final Links

If you want the original references and downloads, start here:

If you have a 3090 or 4090 sitting around and want an uncensored local model that will actually take a swing at the job, this is one of the more interesting setups I have tried lately. It is hackery in the best way. Rough edges, full control, and a lot fewer lectures.