Ollama model llama-guard3:8b - Tricked

hatzisn

Expert
Licensed User
Longtime User
Good morning everyone,

Based on the following post, I tried out the llama-guard3:8b model in Ollama.


I tried several rule-violating statements (do not start yelling, I do not do any of these), and one of the answers shows that this model is predictable most of the time but sometimes unpredictable. Again, do not start yelling, I was only testing violating statements... Here is what I got:

[Attached screenshot: 1778059031429.png]
 

aeric

I don't get it. Does it only respond with safe or unsafe?
 

hatzisn

It responds with either one line or two lines. One line = "safe". Two lines = "unsafe\nS1", where S1 (or another S code) is the violated category (go to Ollama and search for the model to see the categories).
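
A minimal sketch of how that reply could be parsed, assuming the one-line/two-line format described above. The function name `parseGuardVerdict` is made up for illustration, and the handling of several comma-separated category codes on the second line is an assumption, not something confirmed in this thread.

```javascript
// Parse the raw text returned by llama-guard3 into a structured verdict.
// Assumes the reply is "safe", or "unsafe" followed by a second line
// naming the violated category code, e.g. "S1".
function parseGuardVerdict(raw) {
  const lines = String(raw)
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const verdict = (lines[0] || "").toLowerCase();
  if (verdict === "safe") {
    return { safe: true, categories: [] };
  }
  if (verdict === "unsafe") {
    // Assumption: several codes may arrive comma-separated on line two.
    const categories = (lines[1] || "")
      .split(",")
      .map((code) => code.trim())
      .filter((code) => code.length > 0);
    return { safe: false, categories };
  }
  // Anything unexpected is treated as unsafe, so the filter fails closed.
  return { safe: false, categories: [] };
}
```

Treating an unrecognized reply as unsafe is a design choice: a filter that cannot read the verdict should block rather than let the text through.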
 

hatzisn

Interesting, is this able to run in parallel with my coding model?

It can, but Ollama (free version) will keep swapping models in and out according to the model named in each request, which adds processing time.
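
One thing that may reduce the swapping is the `keep_alive` field of Ollama's `/api/chat` and `/api/generate` requests, which asks the server to keep a model in memory for a given time after the request. This is only a sketch: the helper name `buildChatRequest` is hypothetical, and whether two models actually stay loaded together depends on available RAM/VRAM and on the server configuration, which I have not verified for the setup in this thread.

```javascript
// Build the JSON body for a POST to Ollama's /api/chat endpoint.
// keep_alive is a request field of the Ollama API; "30m" here is an
// arbitrary example value, not a recommendation.
function buildChatRequest(model, userText, keepAlive = "30m") {
  return {
    model,
    messages: [{ role: "user", content: userText }],
    stream: false, // ask for a single JSON reply instead of a stream
    keep_alive: keepAlive,
  };
}

// Usage (not executed here): send the body with fetch, e.g.
//   fetch("http://localhost:11434/api/chat", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(buildChatRequest("llama-guard3:8b", "some text")),
//   });
```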
 

aeric

What is the point?
Where else can I use it, and what can it do?
 

hatzisn

The model is just a safeguarding layer for the user's input or for the output of another model. For example, let's say you have a chatbot model that can use some tools. The user writes something. You check with the guard model whether it is safe to send to your main model, and if it is, you send it. Your main model then responds. You pass that response through the guard as well, to check whether it is safe to send back to the user, and if it is, you send it.
 

aeric

Still not clear to me.
Let's say I have a Gemma4 model.
How do you use this llama-guard model to check whether a prompt is safe before it goes to the Gemma4 model?
 

hatzisn

The user writes something. You send a request to llama-guard3 to check whether it is OK. If it is not, you tell the user which category the prompt violated. If it is OK, you go on to send the user's input to gemma4. On a single local computer, this means llama-guard3 is unloaded and gemma4 is loaded, which costs some processing time. The request is made and you get the response. Then you send another request to llama-guard3 to check whether the response is safe to send back to the user. The local computer unloads gemma4, loads llama-guard3 again, and runs the check. If the response is safe, you send it; otherwise you stop it from reaching the user. This cycle repeats for every user prompt.

The solution is to separate them: put the main model on one computer and llama-guard3 on another, to avoid the endless unload model, load model, unload model, load model cycle. If that is possible, you have a communications cop in front of your main model.
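
The flow above, with the two models on separate machines, could be sketched roughly like this. The host names `guard-box` and `main-box`, and the function names, are hypothetical; the endpoint is Ollama's `/api/chat`. As a simplification, the main model's response is sent to the guard as a plain user message, which may not be how the guard is meant to classify assistant output.

```javascript
// Hypothetical hosts: one Ollama server per model, so neither is unloaded.
const GUARD_HOST = "http://guard-box:11434";
const MAIN_HOST = "http://main-box:11434";

// Send one message to an Ollama server and return the reply text.
// fetchImpl is injectable so the flow can be tried without a server.
async function chatOnce(host, model, text, fetchImpl = fetch) {
  const res = await fetchImpl(`${host}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: text }],
      stream: false,
    }),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
  const data = await res.json();
  return data.message.content;
}

// Ask the guard for a verdict: "safe", or "unsafe" plus a category line.
async function guardCheck(text, fetchImpl = fetch) {
  const raw = await chatOnce(GUARD_HOST, "llama-guard3:8b", text, fetchImpl);
  const [verdict, category = ""] = raw.trim().split(/\r?\n/);
  return {
    safe: verdict.trim().toLowerCase() === "safe",
    category: category.trim(),
  };
}

// User -> guard -> main model -> guard -> user.
async function guardedChat(userInput, fetchImpl = fetch) {
  const inputCheck = await guardCheck(userInput, fetchImpl);
  if (!inputCheck.safe) {
    return `Your prompt violated category ${inputCheck.category}.`;
  }
  const answer = await chatOnce(MAIN_HOST, "gemma4", userInput, fetchImpl);
  const outputCheck = await guardCheck(answer, fetchImpl);
  if (!outputCheck.safe) {
    return "The response was withheld by the safety filter.";
  }
  return answer;
}
```

Each user prompt costs three requests (guard, main model, guard), so the latency of the guard machine matters as much as that of the main one.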
 

Mashiane

I just asked ChatGPT to explain to me...

User → Llama Guard → Main LLM → Llama Guard → User.

🧠 Practical use cases

You’d use Llama Guard 3 (8B) if you are:
  • Building a chatbot and need content moderation
  • Running local LLMs and want safety without external APIs
  • Creating AI agents with tools (search/code execution)
  • Implementing compliance filtering (enterprise / public apps)

Template...

B4X:
Task: Check if the following text is safe.

Categories:
- Violence
- Hate
- Sexual
- Self-harm
- Criminal
- etc.

Answer ONLY in this format:
SAFE
or
UNSAFE: <category>

Text:
{{INPUT}}

B4X:
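// Send one prompt to a local Ollama server and return the reply text.
// Assumes Ollama's default address and Node 18+ (built-in fetch).
async function callOllama(model, prompt) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) {
    throw new Error(`Ollama request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.response;
}

// As reported earlier in this thread, llama-guard3 replies "safe" or
// "unsafe\nS<n>" (lowercase), so the text is sent as-is and the reply
// is compared case-insensitively instead of with startsWith("SAFE").
async function isSafe(text) {
  const result = await callOllama("llama-guard3:8b", text);

  // "unsafe" does not start with "safe", so this is true only for "safe".
  return result.trim().toLowerCase().startsWith("safe");
}

async function safeChat(userInput) {
  // 1. Check input
  const inputSafe = await isSafe(userInput);
  if (!inputSafe) {
    return "Input blocked due to safety policy.";
  }

  // 2. Generate response
  const response = await callOllama("llama3", userInput);

  // 3. Check output
  const outputSafe = await isSafe(response);
  if (!outputSafe) {
    return "Response blocked due to safety policy.";
  }

  return response;
}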

Thank you so much @hatzisn for pointing this tool out.. awesome.
 

hatzisn

Exactly, this is it. I do not know, though, whether it is possible to change llama-guard3's output format. I will check it.
 

aeric

I assume most LLMs already have guardrails.
I am just wondering whether something llama-guard3 reports as "safe" may still be flagged as "unsafe" by other models.

Has llama-guard3 already been tested against all models?
If not, I don't see the point here.
 

hatzisn

Normal models do not check whether the prompt violates any categories. Or if they do, I do not know about it... As far as I know so far, they do not (at least the models I have used).
 

Mashiane

Well, all the models we currently use come with a disclaimer that the information provided might not be accurate.

That said, it's possible that the guard model reports false positives. It's not a chat model after all, it's just a "filter".

I think that, given people are able to "hack" using AI models, none could be fully "tested on all models". Add the fact that models can also go rogue by themselves and execute harmful content, and we have a long way to go.

Whether to use this guard will depend greatly on the use case, I guess. I'm not an expert, just my two cents..
 

aeric

Okay.
If I understand correctly, this model could be useful when integrating a chatbot into our system for clients or end users.
At least it filters simple prompts, and the developers do not get blamed or sued by end users for not providing a chatbot that is safe for their children to use.
 

hatzisn

That is correct.
 

Daestrum

Some models come with guardrails built in. The one I use (locally) will not talk about:

Illicit or illegal activities
Violence, self‑harm, or suicide‑related material
Harassment, hate speech, or discrimination
Adult or sexual content involving minors
Extremist, terrorist, or violent radicalization material
Misinformation or disinformation that could cause harm
Privacy‑invasive or personally identifying information
Copyright‑protected media or software that is shared without permission
 

hatzisn

Good morning @Daestrum. Which is the model that you use?
 