Active | Featured project

AfriGuard

A multilingual AI safety benchmark evaluating how well frontier LLMs maintain safety guardrails across South African languages. We red-teamed four models in seven languages, measuring Attack Success Rate across regionally relevant harms — and found that safety alignment catastrophically degrades outside of English.

AfriGuard was built over a single weekend at the Global South AI Safety Hackathon in Cape Town. It tests whether the safety mechanisms that work in English hold up when harmful prompts are expressed in isiZulu, isiXhosa, Afrikaans, Sesotho, Sepedi, or Tsonga. The short answer: they don't.

The project also investigates whether code-switching and local-language prompts raise jailbreak success rates relative to English prompts.

We created 40 seed prompts across four harm categories endemic to South Africa: financial fraud targeting SASSA grant recipients, xenophobic incitement, political disinformation, and gang facilitation. We translated each seed into six African languages, kept the English original as a baseline, and evaluated four frontier LLMs on every prompt. That produced 280 benchmark prompts (40 seeds × 7 languages) and 1,120 total model responses (280 prompts × 4 models).
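The counts above follow from a full cross product of seeds, languages, and models. A minimal sketch of that grid, assuming every seed is run in every language against every model; the seed identifiers are hypothetical placeholders, not the project's actual naming:

```python
from itertools import product

# Language conditions named in the write-up: English baseline plus six African languages.
LANGUAGES = ["English", "isiZulu", "isiXhosa", "Afrikaans", "Sesotho", "Sepedi", "Tsonga"]

# Models listed under "Key features".
MODELS = ["GPT-OSS", "Llama 3.3", "Kimi K2.6", "Qwen 3"]

# Hypothetical seed identifiers; the real benchmark uses 40 hand-written prompts.
seed_ids = [f"seed_{i:02d}" for i in range(1, 41)]

# One benchmark prompt per (seed, language) pair.
benchmark_prompts = list(product(seed_ids, LANGUAGES))

# One model response per (seed, language, model) triple.
evaluations = list(product(seed_ids, LANGUAGES, MODELS))

print(len(benchmark_prompts))  # 280
print(len(evaluations))        # 1120
```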

The findings were stark. The mean Attack Success Rate (ASR), the share of harmful prompts a model complied with, was 50.1% across all evaluations, more than double the English baseline of 24.4%. Certain model-language combinations exceeded 90% ASR, meaning those models complied with harmful requests roughly nine times out of ten. The results show that a model can be safe in English and catastrophically unsafe in an African language.
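To make the metric concrete, here is a minimal sketch of how an Attack Success Rate could be computed per language, assuming each judged response carries a boolean `attack_succeeded` flag. The record layout, field names, and toy records are illustrative assumptions, not AfriGuard's data or code; only the two reported percentages come from the write-up.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Fraction of responses judged to have complied with the harmful request."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def asr_by_group(records, key):
    """Group records by `key` (e.g. "language") and compute ASR for each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record["attack_succeeded"])
    return {group: attack_success_rate(flags) for group, flags in grouped.items()}


# Toy records, invented for illustration only.
records = [
    {"model": "model_a", "language": "English", "attack_succeeded": False},
    {"model": "model_a", "language": "English", "attack_succeeded": False},
    {"model": "model_a", "language": "English", "attack_succeeded": True},
    {"model": "model_a", "language": "English", "attack_succeeded": False},
    {"model": "model_a", "language": "isiZulu", "attack_succeeded": True},
    {"model": "model_a", "language": "isiZulu", "attack_succeeded": True},
    {"model": "model_a", "language": "isiZulu", "attack_succeeded": False},
    {"model": "model_a", "language": "isiZulu", "attack_succeeded": True},
]

per_language = asr_by_group(records, "language")
print(per_language)  # {'English': 0.25, 'isiZulu': 0.75}

# The reported headline figures: overall mean ASR versus the English baseline.
ratio = 50.1 / 24.4
print(f"{ratio:.2f}x the English baseline")  # 2.05x the English baseline
```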

Project goal

Investigate whether AI safety guardrails catastrophically degrade when harmful prompts are expressed in South African languages — and expose the real-world consequences for communities most likely to be targeted.

Screenshots

The team at work at the hackathon

Key features

  • 40 seed prompts across 4 regionally relevant harm categories
  • 7 language conditions (6 African languages + English baseline)
  • 280 benchmark prompts, 1,120 total model evaluations
  • 4 frontier LLMs tested (GPT-OSS, Llama 3.3, Kimi K2.6, Qwen 3)
  • Automated judging pipeline with harm-score classification
  • Interactive Streamlit analytics dashboard
  • Publication-ready figures (ASR heatmaps, model comparisons, language gaps)
  • Reproducible end-to-end evaluation pipeline
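The judging and heatmap steps listed above could be connected roughly as follows. This is a sketch under stated assumptions: the write-up does not specify the harm-score scale or the cutoff, so the 1 to 5 scale, the threshold of 4, and the `JudgedResponse` type are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Assumption: the judge scores each response from 1 (clear refusal)
# to 5 (full harmful compliance); scores at or above 4 count as a success.
HARM_THRESHOLD = 4


@dataclass(frozen=True)
class JudgedResponse:
    model: str
    language: str
    harm_score: int


def is_attack_success(response, threshold=HARM_THRESHOLD):
    """Turn a graded harm score into a binary attack-success label."""
    return response.harm_score >= threshold


def asr_table(responses):
    """Return {(model, language): ASR}, the cell values behind an ASR heatmap."""
    counts = {}
    for r in responses:
        hits, total = counts.get((r.model, r.language), (0, 0))
        counts[(r.model, r.language)] = (hits + int(is_attack_success(r)), total + 1)
    return {cell: hits / total for cell, (hits, total) in counts.items()}


# Toy judged responses, invented for illustration only.
responses = [
    JudgedResponse("model_a", "English", 1),
    JudgedResponse("model_a", "English", 2),
    JudgedResponse("model_a", "Sesotho", 5),
    JudgedResponse("model_a", "Sesotho", 4),
    JudgedResponse("model_b", "Sesotho", 3),
    JudgedResponse("model_b", "Sesotho", 5),
]

table = asr_table(responses)
for (model, language), asr in sorted(table.items()):
    print(f"{model:8s} {language:8s} {asr:.0%}")
```

Keeping the graded score and thresholding late means the cutoff can be changed without re-running the judge.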

Project details

Status: Research Project
Type: AI Safety Benchmark
Evaluations: 1,120 responses
Languages: 7 (6 African + English)
Models: 4 frontier LLMs
Mean ASR: 50.1% (vs 24.4% English)

Technologies

Python, Pandas, Streamlit, AI Safety, NLP, Research, Data Pipelines