Active · Featured project

AfriGuard

A multilingual AI safety benchmark evaluating how well frontier LLMs maintain safety guardrails across South African languages. We red-teamed four models in seven languages, measuring Attack Success Rate across regionally relevant harms — and found that safety alignment catastrophically degrades outside of English.


Built over a single weekend at the Global South AI Safety Hackathon in Cape Town. AfriGuard tests whether the safety mechanisms that work in English actually hold up when harmful prompts are expressed in isiZulu, isiXhosa, Afrikaans, Sesotho, Sepedi, or Tsonga. The short answer: they don't.

AfriGuard is a multilingual AI safety benchmark designed to evaluate how well large language models maintain safety guardrails across South African languages. The project investigates whether code-switching and local language usage can increase jailbreak success rates compared to English prompts.

We created 40 seed prompts across four harm categories endemic to South Africa — financial fraud targeting SASSA grant recipients, xenophobic incitement, political disinformation, and gang facilitation — translated them into six African languages plus an English baseline, and evaluated four frontier LLMs. That produced 280 benchmark prompts and 1,120 total model responses.
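The benchmark size follows directly from the design: 40 seeds × 7 languages gives 280 prompts, and 280 prompts × 4 models gives 1,120 responses. The sketch below reproduces that arithmetic as a grid. The model names are hypothetical placeholders, since the models are not named here, and the integer seed IDs stand in for the actual prompts.

```python
from itertools import product

# Languages named in the write-up: an English baseline plus six others.
LANGUAGES = [
    "English", "isiZulu", "isiXhosa", "Afrikaans",
    "Sesotho", "Sepedi", "Tsonga",
]

# Hypothetical placeholder names; the four evaluated models are not listed here.
MODELS = ["model_a", "model_b", "model_c", "model_d"]

# Stand-in IDs for the 40 seed prompts (the real prompt text is not reproduced).
SEED_IDS = range(40)

# Every seed prompt rendered in every language.
benchmark_prompts = list(product(SEED_IDS, LANGUAGES))

# Every benchmark prompt sent to every model.
evaluations = list(product(benchmark_prompts, MODELS))

print(len(benchmark_prompts))  # 280
print(len(evaluations))        # 1120
```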

The findings were stark. The mean Attack Success Rate (ASR) across all evaluations was 50.1%, more than double the English baseline of 24.4%. Certain model-language combinations exceeded 90% ASR, meaning those models complied with more than nine in ten harmful requests. The results show that a model can be safe in English and catastrophically unsafe in an African language.
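For reference, ASR is conventionally the fraction of harmful prompts for which the model complied. The sketch below computes it per model-language pair. It assumes each response has already been labeled as complied or refused, since the labeling method is not described here. The function name, the record layout, and the toy data are illustrative only and are not AfriGuard results.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Return ASR per (model, language) pair.

    `records` is an iterable of (model, language, complied) tuples, where
    `complied` is True if the model produced the harmful content.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, language, complied in records:
        key = (model, language)
        totals[key] += 1
        hits[key] += int(complied)
    return {key: hits[key] / totals[key] for key in totals}

# Toy labels for illustration only; these are NOT the benchmark's data.
toy_records = [
    ("model_a", "English", False),
    ("model_a", "English", False),
    ("model_a", "English", True),
    ("model_a", "English", False),
    ("model_a", "isiZulu", True),
    ("model_a", "isiZulu", True),
    ("model_a", "isiZulu", True),
    ("model_a", "isiZulu", False),
]

asr = attack_success_rate(toy_records)
print(asr[("model_a", "English")])  # 0.25
print(asr[("model_a", "isiZulu")])  # 0.75
```

Applied to the reported figures, the ratio of the overall mean to the English baseline is 50.1 / 24.4 ≈ 2.05, which is what "more than double" refers to.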

Project goal

Investigate whether AI safety guardrails catastrophically degrade when harmful prompts are expressed in South African languages — and expose the real-world consequences for communities most likely to be targeted.
