Ollama model llama-guard3:8b - Tricked

hatzisn · May 6, 2026

Good morning everyone,

Based on the following post, I tried out the llama-guard3:8b model in Ollama.

OpenClaw's Agency

www.b4x.com

I tried several rule-violating statements (do not start yelling, I do not do any of these), and the answers I got suggest that this model is predictable most of the time but sometimes unpredictable. Again, do not start yelling, I was only testing violating statements... Here is what I got:

Mashiane · May 6, 2026

Interesting, is this able to run in parallel with my coding model?

aeric · May 6, 2026

I don't get it. It only responds safe or unsafe?

hatzisn · May 6, 2026

aeric said:
I don't get it. It only responds safe or unsafe?

It responds with either one line or two. One line = "safe". Two lines = "unsafe\nS1", where S1 (or another S? code) is the violated category (search for the model on ollama to see the categories).
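
The reply format described above is simple enough to parse in a few lines. Here is a minimal sketch, not an official parser: parseGuardReply is a made-up name, and handling several comma-separated codes on the second line is an assumption.

```javascript
// Parse a raw llama-guard3 reply into a small verdict object.
// Assumes the reply format described above: "safe", or "unsafe" followed by
// a second line holding one or more category codes such as "S1".
function parseGuardReply(raw) {
  const lines = raw
    .trim()
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const verdict = (lines[0] || "").toLowerCase();
  if (verdict === "safe") {
    return { safe: true, categories: [] };
  }

  // Anything that is not an explicit "safe" is treated as unsafe (fail closed).
  const categories = (lines[1] || "")
    .split(",")
    .map((code) => code.trim())
    .filter((code) => code.length > 0);
  return { safe: false, categories };
}

console.log(parseGuardReply("safe"));
console.log(parseGuardReply("unsafe\nS1"));
```

Treating every unexpected reply as unsafe is a deliberate choice here: if the guard ever answers with something odd, the text is blocked rather than let through.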

hatzisn · May 6, 2026

Mashiane said:
Interesting, is this able to run in parallel with my coding model?

It can run, but a local ollama install (free version) will keep swapping models according to the model named in each request whenever both do not fit in memory together, which adds processing time.

aeric · May 6, 2026

hatzisn said:
It responds with either one line or two. One line = "safe". Two lines = "unsafe\nS1", where S1 (or another S? code) is the violated category (search for the model on ollama to see the categories).

What is the point?
Where else can I use it, and what can it do?

hatzisn · May 6, 2026

The model is just a safeguarding layer for the user's input or for the output of another model. For example, say you have a chatbot model that can use some tools. The user writes something. You check with the guard model whether it is safe to send to your standard model, and if it is, you send it. Your other model then responds. You take that response and pass it through the guard again to check whether it is safe to send back to the user, and if it is, you send it.

aeric · May 6, 2026

Still not clear to me.
Let's say I have a Gemma4 model.
How do you use this llama-guard model to check whether a prompt is safe before it goes to the Gemma4 model?

hatzisn · May 6, 2026

The user writes something. You send a request to llama-guard3 to check whether it is OK. If it is not, you tell the user which category the prompt violated. If it is OK, you go on and send the user's input to gemma4. On a single local computer this means unloading llama-guard3 and loading gemma4, which costs some processing time. You get the response, then send another request to llama-guard3 to check whether that response is safe to return to the user. The local computer unloads gemma4, loads llama-guard3 again, and runs the check. If the response is safe you send it; otherwise you stop it from reaching the user. This cycle repeats for every user prompt.

The solution is to separate them: put the main model on one computer and llama-guard3 on another, to avoid the endless unload model - load model - unload model - load model cycle. If that is possible, you have a communications cop in front of your main model.
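
The two-computer idea can be sketched as a small routing helper that picks the Ollama server from the model name. This is only an illustration under assumptions: the IP addresses are hypothetical placeholders, "gemma4" is taken from the question above as the model tag, and port 11434 is Ollama's default.

```javascript
// Hypothetical addresses: replace with the machines that actually run Ollama.
const HOSTS = {
  guard: "http://192.168.1.10:11434", // machine that only serves llama-guard3
  main: "http://192.168.1.11:11434", // machine that only serves the chat model
};

// Pick the base URL from the model name, so each machine only ever sees
// one model and never has to swap it out.
function baseUrlFor(model) {
  return model.startsWith("llama-guard") ? HOSTS.guard : HOSTS.main;
}

// Full URL of Ollama's chat endpoint for a given model.
function chatUrlFor(model) {
  return `${baseUrlFor(model)}/api/chat`;
}

console.log(chatUrlFor("llama-guard3:8b"));
console.log(chatUrlFor("gemma4"));
```

Guard requests and chat requests then go to different machines, and neither machine needs to unload its model between calls.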

Mashiane · May 6, 2026

I just asked ChatGPT to explain to me...

User → Llama Guard → Main LLM → Llama Guard → User.

Practical use cases

You’d use Llama Guard 3 (8B) if you are:

Building a chatbot and need content moderation
Running local LLMs and want safety without external APIs
Creating AI agents with tools (search/code execution)
Implementing compliance filtering (enterprise / public apps)

Template...

B4X:

Task: Check if the following text is safe.

Categories:
- Violence
- Hate
- Sexual
- Self-harm
- Criminal
- etc.

Answer ONLY in this format:
SAFE
or
UNSAFE: <category>

Text:
{{INPUT}}

B4X:

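// Ask llama-guard3 whether a piece of text is safe.
// callOllama(model, prompt) is an assumed helper that sends the prompt to
// Ollama and resolves to the model's reply text; it is not defined here.
async function isSafe(text) {
  const prompt = `
Task: Check if the following text is safe.

Answer ONLY:
SAFE
or
UNSAFE: <category>

Text:
${text}
`;

  const result = await callOllama("llama-guard3:8b", prompt);

  // llama-guard3 replies in lowercase ("safe", or "unsafe" plus a category
  // line), so normalize before comparing instead of matching "SAFE" exactly.
  return result.trim().toLowerCase().startsWith("safe");
}

async function safeChat(userInput) {
  // 1. Check input
  const inputSafe = await isSafe(userInput);
  if (!inputSafe) {
    return "Input blocked due to safety policy.";
  }

  // 2. Generate response
  const response = await callOllama("llama3", userInput);

  // 3. Check output
  const outputSafe = await isSafe(response);
  if (!outputSafe) {
    return "Response blocked due to safety policy.";
  }

  return response;
}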
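
The snippet above calls callOllama without defining it. One possible shape for it, as a rough sketch only: it assumes a local Ollama server on the default address http://localhost:11434, the non-streaming /api/chat endpoint, and a runtime with a global fetch (Node 18+ or a browser). buildChatBody is a made-up helper name.

```javascript
// Build the JSON body for Ollama's /api/chat endpoint.
// stream: false asks for one complete JSON reply instead of a stream of chunks.
function buildChatBody(model, prompt) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    stream: false,
  };
}

// Send one prompt to an Ollama server and return the reply text.
// baseUrl defaults to the assumed local address; pass another one if the
// model lives on a different machine.
async function callOllama(model, prompt, baseUrl = "http://localhost:11434") {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatBody(model, prompt)),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = await res.json();
  return data.message.content; // non-streaming replies carry the text here
}
```

With this in place, a call such as callOllama("llama-guard3:8b", text) resolves to the guard's raw reply, which the safety check can then inspect.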

Thank you so much @hatzisn for pointing this tool out.. awesome.

hatzisn · May 6, 2026

Mashiane said:
I just asked ChatGPT to explain to me...

User → Llama Guard → Main LLM → Llama Guard → User.

Exactly, this is it. I do not know, though, whether it is possible to change llama-guard3's output format. I will check.

aeric · May 6, 2026

I assume most LLMs already have guardrails.
I am just wondering whether something llama-guard3 marks as "safe" may still be flagged as "unsafe" by other models.

Has llama-guard3 already been tested against all models?
If not, I don't see the point here.

hatzisn · May 6, 2026

aeric said:
I assume most LLMs already have guardrails.
I am just wondering whether something llama-guard3 marks as "safe" may still be flagged as "unsafe" by other models.

Has llama-guard3 already been tested against all models?
If not, I don't see the point here.

Normal models do not check whether the prompt violates any categories. Or if they do, I am not aware of it... As far as I know they do not (at least not the models I have used).

Mashiane · May 6, 2026

aeric said:
I assume most LLMs already have guardrails.
I am just wondering whether something llama-guard3 marks as "safe" may still be flagged as "unsafe" by other models.

Has llama-guard3 already been tested against all models?
If not, I don't see the point here.

Well, all the models we currently use come with a disclaimer that the information they provide might not be accurate.

With that said, it's possible that the guard model might report false positives; it's not a chat model after all, it's just a "filter".

I think, given that people are able to "hack" using AI models, none could ever be fully "tested on all models", and given that models can also go rogue by themselves and produce harmful content, we have a long way to go.

Whether to use this guard will depend greatly on the use case, I guess. I'm not an expert, just my two cents.

aeric · May 6, 2026

Okay.
If I understand correctly, this model could be useful when integrating a chatbot into our system for clients or end users.
At least it filters simple prompts, so the developers do not get blamed or sued by end users for not providing a chatbot that is safe for their children to use.

hatzisn · May 6, 2026

aeric said:
Okay.
If I understand correctly, this model could be useful when integrating a chatbot into our system for clients or end users.
At least it filters simple prompts, so the developers do not get blamed or sued by end users for not providing a chatbot that is safe for their children to use.

That is correct.

Daestrum · May 6, 2026

Some models come with guardrails built in; the one I use (locally) will not talk about:

Illicit or illegal activities
Violence, self‑harm, or suicide‑related material
Harassment, hate speech, or discrimination
Adult or sexual content involving minors
Extremist, terrorist, or violent radicalization material
Misinformation or disinformation that could cause harm
Privacy‑invasive or personally identifying information
Copyright‑protected media or software that is shared without permission

hatzisn · May 7, 2026

Daestrum said:
Some models come with guardrails built in; the one I use (locally) will not talk about:

Illicit or illegal activities
Violence, self‑harm, or suicide‑related material
Harassment, hate speech, or discrimination
Adult or sexual content involving minors
Extremist, terrorist, or violent radicalization material
Misinformation or disinformation that could cause harm
Privacy‑invasive or personally identifying information
Copyright‑protected media or software that is shared without permission

Good morning @Daestrum. Which is the model that you use?

Daestrum · May 7, 2026

I use the Nemotron-3-nano 30b model.

Mashiane · May 7, 2026

Daestrum said:
I use the Nemotron-3-nano 30b model.

Let me try this out...

With Claude Code

B4X:

ollama launch claude --model nemotron-3-nano:30b-cloud

With GitHub Copilot..

B4X:

ollama launch vscode --model nemotron-3-nano:30b-cloud
