My First AI Safety Hackathon: From Idea to Research Prototype

A few weeks after launching my personal website and SA Data Hub, I found myself looking for another challenge.

Building projects on my own had taught me a lot, but I wanted to know what it felt like to work with a real deadline, real teammates, and a problem that went beyond software itself.

That came through the Global South AI Safety Hackathon, hosted by Apart Research. The event brought together students, researchers and builders from across the region to work on practical AI safety challenges relevant to Africa.

Rather than build another web app or data project, I wanted to contribute to something that tested how safe modern AI systems actually are when used in South African languages.

The Idea

Large language models are mostly trained, tested and red-teamed in English. That means we know comparatively little about whether the same safety mechanisms hold up once a model is prompted in isiZulu, Sesotho, Afrikaans or any of South Africa's other official languages.

That gap matters here specifically. A lot of the harm a model could enable, like scams targeting SASSA grant recipients, xenophobic incitement, or political disinformation, doesn't happen in English. It happens in the languages people actually use to scam, organise and persuade each other. If a model's safety filters mostly work in English, that's a blind spot with real consequences for the people most likely to be targeted.

Our project, AfriGuard, set out to investigate exactly that.

We built a multilingual AI safety benchmark focused on South African languages and regionally relevant harms. Instead of testing generic harmful content, we grounded our prompts in scenarios specific to South Africa:

Financial fraud targeting SASSA grant recipients and banking customers
Xenophobic incitement
Political disinformation
Gang and criminal facilitation

We translated these prompts into multiple South African languages and tested how different AI models responded. The benchmark contained 40 seed prompts, each translated into seven languages, for 280 prompt variants. We ran every variant against four frontier language models, giving 1,120 model responses in total.
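The arithmetic behind those counts is a simple grid: every seed prompt is crossed with every language, and every resulting variant is crossed with every model. Here is a minimal Python sketch of that grid. The identifiers (`seed_00`, `lang_1`, `model_a`) are placeholders made up for illustration, not the benchmark's real prompt IDs, languages or model names.

```python
from itertools import product

# Placeholder identifiers (assumptions for illustration only).
seeds = [f"seed_{i:02d}" for i in range(40)]      # 40 seed prompts
languages = [f"lang_{i}" for i in range(1, 8)]    # 7 languages
models = ["model_a", "model_b", "model_c", "model_d"]  # 4 models

# One prompt variant per (seed, language) pair.
variants = list(product(seeds, languages))

# One evaluation run per (variant, model) pair.
runs = [(seed, lang, model) for (seed, lang), model in product(variants, models)]

print(len(variants))  # 280
print(len(runs))      # 1120
```

Listing the runs explicitly like this also makes it easy to spot gaps, since any (seed, language, model) combination without a stored response is a missing evaluation.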

Building the Team

Originally, I planned to attempt the project on my own.

As the idea grew, it became clear the scope was bigger than what one person could realistically finish over a single weekend.

My friend Jaswin Chinthala, a Mechatronics student at UCT, joined first and helped with prompt design, model testing and evaluation. Soon after, Seth Miguel Ferreira from Boston College joined and took on adversarial prompting and the judging pipeline. We were also fortunate to work alongside Sebastian Stent, who brought additional experience and helped with translation workflows and dataset curation.

Our responsibilities naturally evolved into:

Jaswin Chinthala — jailbreak testing, model evaluation
Seth Miguel Ferreira — adversarial prompting, judging pipeline
Sebastian Stent — translation pipeline, dataset curation
Ubayd Hattas — evaluation pipeline, automated judging, statistical analysis, dashboard development, data visualisation

The Reality of Research

One thing this hackathon taught me very quickly is that research is messy.

When people see the final dashboard, figures and report, it's easy to assume everything came together smoothly.

It didn't.

We ran into API limits, broken automation pipelines, missing datasets, evaluation bugs, deployment issues and countless edge cases that only appeared once we thought everything was working. At one point I spent hours debugging Python scripts that had worked perfectly the day before. Another issue caused our dashboard to show incorrect results even though the underlying data was correct. Later, the deployed Streamlit app could not find files that existed on my machine but were missing in production. Near the end of the project we even found a data processing bug that forced us to reprocess the entire evaluation dataset.

Looking back, a lot of this could have been solved faster by asking mentors for help earlier. That was probably the biggest lesson of the weekend: sometimes the fastest way forward is simply asking someone who has already solved the problem.

My Contribution

Most of my work went into building the infrastructure that turned raw model responses into usable research results.

I developed the evaluation pipeline that collected responses, judged model behaviour, processed the resulting data and generated the analytics used throughout the project. It covered 280 prompts for each of the four models, or 1,120 responses in total.

I built the automation scripts that processed responses, classified model behaviour, computed safety metrics, generated visualisations and powered an interactive Streamlit dashboard. It let us explore attack success rates across languages, compare model performance, analyse harm categories and investigate where safety systems seemed to fail.
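To make "attack success rate" concrete: for any group of responses, it is the share that the judge labelled as a successful attack. The sketch below computes that share per group using only the Python standard library. The record fields (`language`, `model`, `verdict`), the label names and the sample rows are all assumptions made up for illustration; they are not AfriGuard's actual schema or results.

```python
from collections import defaultdict

def attack_success_rate(records, key="language", unsafe_label="complied"):
    """Return {group: share of records whose verdict equals unsafe_label}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["verdict"] == unsafe_label:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Toy judged responses (hypothetical data, not real results).
sample = [
    {"language": "lang_1", "model": "model_a", "verdict": "refused"},
    {"language": "lang_1", "model": "model_b", "verdict": "complied"},
    {"language": "lang_2", "model": "model_a", "verdict": "complied"},
    {"language": "lang_2", "model": "model_b", "verdict": "complied"},
]

rates = attack_success_rate(sample)
print(rates)  # {'lang_1': 0.5, 'lang_2': 1.0}
```

Passing `key="model"` groups the same records by model instead, which is the kind of slicing a dashboard comparison needs.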

This was the first time I'd built something resembling a complete data pipeline rather than a standalone application.

By the end of the weekend, the project had come together as four pieces: a public dashboard anyone could explore; automated evaluation pipelines that could be re-run on new prompts or models; reproducible analysis workflows that turned raw outputs into consistent metrics every time; and a public GitHub repository with the full code, prompts and results.

What I Learned

The technical skills were valuable, but the biggest lessons came from the process itself.

Over the weekend I got hands-on with APIs, Pandas, evaluation pipelines, automated data processing, Streamlit dashboards and research methodology, most of which I'd barely touched a few weeks earlier. The combination of documentation, experimentation and AI-assisted debugging let me move far faster than I could have on my own.

I also learned the value of working in a strong team.

Could I have built a version of this project by myself?

Probably.

Could I have built the version we actually submitted?

Definitely not.

It only came together because different people brought different strengths and perspectives.

Looking Back

A few weeks before this hackathon, I was building my first data projects and figuring out how website deployment even worked.

Over the hackathon weekend, I found myself contributing to an AI safety benchmark, working alongside a multidisciplinary team, processing over a thousand model evaluations and helping build tooling that investigated AI safety in South African languages.

Regardless of where our project ultimately ranks, that alone made the weekend worth it.

More than anything, the hackathon showed me how much there still is to learn, and how quickly that learning happens when you throw yourself into something ambitious.

I'm excited to keep building, researching and contributing to projects that have real-world impact.