- SpamShield is a new multilingual spam dataset containing 149,359 messages across 23 languages for real-world NLP use.
- The multilingual spam dataset includes adversarial patterns like leetspeak, Unicode obfuscation, and code-mixed Hinglish text.
- Around 20% of the dataset uses synthetic augmentation — paraphrasing, back-translation, and leetspeak mutation — to simulate evolving spam tactics.
- Category-level labels covering phishing, crypto scams, and job fraud make it more useful than corpora that carry only a binary spam/not-spam label.
Why the Existing Multilingual Spam Dataset Landscape Is Broken
The multilingual spam dataset problem has been hiding in plain sight for years. Ask any NLP engineer who’s tried to build a moderation system for a non-English market and they’ll tell you the same thing: the open-source options are a mess. Most public spam corpora are tiny, years out of date, and built almost entirely on English SMS messages — the kind of data that bears almost no resemblance to how spam actually arrives in 2024.
That’s the frustration that pushed developer Arjun M to build SpamShield Datasets, a new open corpus he published on Hugging Face under a CC-BY-4.0 license. The dataset currently contains 149,359 messages spanning 23 languages, from Arabic and Bengali to Urdu and Ukrainian — with a specific focus on the messy, adversarial, code-mixed content that real spam filters have to deal with every day.
It’s an independent research effort, not a corporate release, but the scope of it is hard to dismiss. The NLP community has been asking for a robust multilingual spam dataset for a while.
What Real Spam Actually Looks Like in 2024
One of the most compelling arguments behind SpamShield is its honest reckoning with what modern spam looks like. It doesn’t arrive as a neatly formatted sentence that’s easy to classify. Today’s spam is deliberately broken — and that’s the point. Spammers have long understood that mangling text is an effective way to slip past classifiers trained on clean examples.
The techniques are well-documented: leetspeak substitutions (replacing letters with numbers or symbols), invisible Unicode characters inserted to break tokenization, emoji stuffing, mixed-script content that jumps between Latin and Devanagari characters mid-sentence, and fake urgency patterns designed to trigger clicks before the reader thinks critically. On top of all that, there’s the challenge of code-mixed language — particularly Hinglish, the Hindi-English hybrid that hundreds of millions of people use naturally in text messages and social media.
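To make those obfuscation tricks concrete, here is a minimal Python sketch of the kind of pre-processing a filter might run before classification. It is not taken from SpamShield. The substitution table, the function names, and the sample strings are all illustrative assumptions, and a production system would need a far larger mapping.

```python
import unicodedata

# Illustrative leetspeak table (an assumption, not SpamShield's own mapping).
LEET_MAP = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def strip_invisible(text: str) -> str:
    """Drop format-control characters (Unicode category Cf), such as zero-width spaces.

    Caveat: this also removes zero-width joiners, which some scripts use
    legitimately, so treat the result as a secondary view of the message.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def deleet(text: str) -> str:
    """Lowercase the text and undo common leetspeak substitutions.

    Caveat: genuine digits are rewritten too, so keep the original text around.
    """
    return text.lower().translate(LEET_MAP)

def scripts_used(text: str) -> set[str]:
    """Rough script detection: first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. "LATIN", "DEVANAGARI"
    return scripts

if __name__ == "__main__":
    obfuscated = "Fr\u200b3e c4$h"  # zero-width space hidden inside "Fr3e"
    print(deleet(strip_invisible(obfuscated)))
    print(scripts_used("free फ्री"))
```

A message that mixes scripts is not spam by definition (Hinglish speakers do it all the time), so the script set is better used as one feature among many than as a rule.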
Almost no existing multilingual spam dataset covers this properly. The classic SMS Spam Collection from the UCI Machine Learning Repository — still widely cited in academic benchmarks — contains fewer than 5,600 messages, all in English, collected over a decade ago. It’s a fine dataset for what it is, but it tells you essentially nothing about how to filter spam for an Indonesian e-commerce platform or a Punjabi-language WhatsApp community.
Inside the SpamShield Multilingual Spam Dataset
SpamShield’s schema is deliberately minimal. Each record carries three fields: the message text, a binary label (0 for legitimate messages, 1 for spam), and a category label. That category field is where things get genuinely interesting. Rather than just flagging something as spam and moving on, SpamShield classifies messages into types: phishing, scam, crypto, marketing, giveaway, promo, adult, and job_scam.
That granularity matters. A moderation pipeline for a job listings platform has very different risk priorities than one protecting users of a crypto exchange. Binary labels alone don’t give you enough signal to build targeted filters — you need to know what kind of threat you’re dealing with.
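As a rough illustration of how the category field could drive targeted filtering, the sketch below selects spam records by category. The field names `text`, `label`, and `category`, the sample messages, and the use of `None` as the category for legitimate messages are all assumptions made for the example; check the dataset card for the real column names and values.

```python
from collections import Counter

# Hypothetical records; field names and values are assumed for illustration.
records = [
    {"text": "Verify your account now", "label": 1, "category": "phishing"},
    {"text": "Earn 5000/day from home", "label": 1, "category": "job_scam"},
    {"text": "See you at 6 pm", "label": 0, "category": None},
]

def select_spam(rows, wanted_categories):
    """Return spam rows (label == 1) whose category is in wanted_categories."""
    return [
        row for row in rows
        if row["label"] == 1 and row["category"] in wanted_categories
    ]

def category_counts(rows):
    """Count spam rows per category, ignoring legitimate messages."""
    return Counter(row["category"] for row in rows if row["label"] == 1)

if __name__ == "__main__":
    # A job listings platform might care most about job_scam and phishing.
    for row in select_spam(records, {"job_scam", "phishing"}):
        print(row["category"], "->", row["text"])
    print(category_counts(records))
```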
The 23 languages currently covered are Arabic, Bengali, Chinese, Dutch, English, French, German, Hinglish, Indonesian, Italian, Japanese, Javanese, Korean, Marathi, Norwegian, Portuguese, Punjabi, Russian, Spanish, Swedish, Turkish, Ukrainian, and Urdu. Including Javanese and Marathi alongside better-resourced languages like French and German signals the project’s priorities. Low-resource languages are exactly where spam detection systems tend to fall apart, and where researchers have the least training data to work with.
The dataset ships in two formats. Individual JSONL files broken out by language are useful for training language-specific models or doing targeted analysis. The combined.parquet file is the recommended format for large-scale training runs — Parquet’s columnar storage means faster load times, better compression, and clean compatibility with ML frameworks like Hugging Face’s datasets library or PyTorch DataLoader pipelines.
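For the per-language files, a standard-library reader is enough. The sketch below assumes one JSON object per line with `text`, `label`, and `category` keys; the file name `hinglish.jsonl` in the usage comment is hypothetical, since the exact file names are not given here.

```python
import json
from pathlib import Path

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows

def load_jsonl(path):
    """Read a UTF-8 JSONL file from disk."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_jsonl(fh)

if __name__ == "__main__":
    sample = [
        '{"text": "Free recharge milega, abhi click karo", "label": 1, "category": "giveaway"}',
        "",
        '{"text": "Kal milte hain", "label": 0, "category": null}',
    ]
    rows = parse_jsonl(sample)
    print(len(rows), "rows;", sum(r["label"] for r in rows), "spam")
    # rows = load_jsonl("hinglish.jsonl")  # hypothetical file name
```

For the combined Parquet file, `pandas.read_parquet("combined.parquet")` is the usual route, provided a Parquet engine such as pyarrow is installed.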
The 20% Synthetic Augmentation Question
Here’s where SpamShield gets into territory that will divide opinion in the ML community. About 20% of the multilingual spam dataset is synthetically generated rather than sourced from real-world messages. The techniques used include paraphrasing, translation, back-translation, Unicode variation, and programmatic leetspeak mutation.
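Programmatic leetspeak mutation is the easiest of these techniques to picture in code. The following is a minimal sketch of the general idea, not the generator used to build SpamShield; the substitution table, the `rate` parameter, and the function name are assumptions chosen for illustration.

```python
import random

# Illustrative substitutions; a real generator would likely use a richer table.
LEET_SUBS = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def leet_mutate(text, rate=0.5, seed=None):
    """Replace each eligible letter with its leetspeak form with probability `rate`.

    Passing a seed makes the mutation reproducible, which helps when the same
    augmented corpus needs to be regenerated later.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = LEET_SUBS.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    message = "free cash prize, claim it today"
    # Several variants of one source message, each with a different seed.
    for seed in range(3):
        print(leet_mutate(message, rate=0.4, seed=seed))
```

Because every variant derives from the same source message, augmented copies should stay in the same train or test split as their original to avoid leaking near-duplicates into evaluation.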
Arjun is upfront about this, which is the right call: dataset transparency is an area where the NLP field has historically been sloppy, and undisclosed synthetic content has caused real problems for downstream model evaluations. The reasoning behind the decision is also defensible. Spam is an adversarial domain. If you train exclusively on historical spam examples, your model is always playing catch-up against tactics it’s never seen. Synthetic augmentation is a way to stress-test classifiers against the kinds of obfuscation patterns that spammers are likely to deploy, even if those specific examples don’t exist in the wild yet.
It’s a similar philosophy to adversarial training techniques used in security-focused machine learning — the idea that a model needs to see edge cases and mutations to build genuine robustness, not just clean benchmark accuracy. Whether 20% is the right proportion is an empirical question, and researchers using SpamShield will want to evaluate that carefully depending on their use case.
The Unglamorous Work: Cleaning, Deduplication, and Normalization
One of the more honest parts of Arjun’s write-up is the section on how hard the data engineering work actually was. Combining datasets from multiple sources — the NLP and cybersecurity communities have produced dozens of smaller corpora over the years — means reconciling wildly inconsistent schemas, encoding formats, and labeling conventions. Some source datasets contained only spam with no legitimate messages to balance against. Others had broken Unicode that corrupted text fields. A few had the same messages duplicated thousands of times, which would introduce serious bias into any model trained on them.
Deduplication and normalization across 23 languages, each with its own script and encoding challenges, is not a trivial task. It’s the kind of work that rarely gets published as a paper because it isn’t theoretically novel — but it’s arguably the most important determinant of whether a multilingual spam dataset is actually useful in practice. Bad data in, bad model out, regardless of how sophisticated your architecture is.
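A common way to approach this step is to compare messages by a normalized key rather than by raw text. The sketch below uses Unicode NFKC normalization, case folding, and whitespace collapsing before hashing. It illustrates the general technique under those assumptions and is not a description of the pipeline Arjun actually ran; the `text` field name is likewise assumed.

```python
import hashlib
import unicodedata

def dedup_key(text):
    """Build a comparison key that ignores width, case, and spacing variants."""
    norm = unicodedata.normalize("NFKC", text).casefold()
    norm = " ".join(norm.split())  # collapse runs of whitespace
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first record for each normalized text, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = dedup_key(rec["text"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

if __name__ == "__main__":
    rows = [
        {"text": "WIN  a prize", "label": 1},
        {"text": "win a prize", "label": 1},
        {"text": "ｗｉｎ a prize", "label": 1},  # full-width Latin letters
        {"text": "Meeting moved to 3 pm", "label": 0},
    ]
    print(len(rows), "->", len(deduplicate(rows)))
```

Keying on text alone means two copies with conflicting labels collapse to whichever appears first, so a real pipeline would probably want to flag those conflicts for review instead of resolving them silently.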
Where SpamShield Fits in the Broader Moderation Landscape
The timing is notable. Spam and content moderation are becoming significantly harder problems as AI-generated text floods messaging platforms and social networks. Large language models can now produce grammatically perfect phishing messages in any language, tailored to specific cultural contexts, at essentially zero marginal cost. The tell-tale signs that used to give spam away — awkward phrasing, obvious translation errors, clunky formatting — are disappearing fast.
That raises the stakes for multilingual spam dataset quality considerably. A corpus that only captures last year’s tactics will produce models that fail against next year’s threats. SpamShield’s emphasis on adversarial patterns and synthetic augmentation is at least oriented in the right direction, even if the dataset will need continuous updates to stay relevant.
For researchers working on multilingual transformer models — think mBERT, XLM-RoBERTa, or Meta’s NLLB — having a unified multilingual spam dataset that spans 23 languages in a consistent schema is genuinely useful. The alternative is assembling your own patchwork of source datasets, which is exactly the painful process SpamShield is trying to save people from repeating.
Whether SpamShield eventually grows into something that rivals institutional datasets in coverage and validation is an open question. But as a freely available, honestly documented multilingual spam dataset for moderation research, it’s filling a gap that the major tech platforms — who hoard their proprietary spam data — have shown little interest in filling themselves.

