Spam Filter

`crate.spam.spamfilter` classifies a map of field names to values against a set of heuristics and produces a weighted score. When the score is at or above a configurable threshold, the verdict is “spam.”

```d
import crate.spam.spamfilter;

SpamFilterConfig config = {
    threshold: 2.0,
    spamWords: ["viagra", "crypto", "seo-services"],
    disposableDomains: ["mailinator.com", "tempmail.org"],
    spamDomains: ["bad-actor.example"],
};

string[string] fields = [
    "name": "John",
    "email": "user@mailinator.com",
    "message": "Buy cheap viagra now!!!",
];

auto verdict = classify(fields, config);
if (verdict.isSpam) {
    // block or tarpit
}
```

```d
SpamVerdict classify(string[string] fields, SpamFilterConfig config);
```

The field named `email` is run through the email rule set; every other field is run through the text rule set. The scores of all triggered rules, across all fields, accumulate into `SpamVerdict.score`.

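`SpamFilterConfig` fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `threshold` | `float` | `2.0` | Score at or above which the verdict is spam |
| `spamWords` | `string[]` | `[]` | Case-insensitive substrings to flag in text |
| `disposableDomains` | `string[]` | `[]` | Exact-match email domains to flag |
| `spamDomains` | `string[]` | `[]` | Substrings to flag anywhere in the email |

`SpamVerdict` fields:

| Field | Type | Description |
| --- | --- | --- |
| `score` | `float` | Sum of all triggered rule scores |
| `isSpam` | `bool` | `score >= config.threshold` |
| `triggered` | `RuleResult[]` | Each rule that fired, with its weight |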

`RuleResult` is `{ string rule; float score; string field; }`.
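
The relationship between `triggered`, `score`, and `isSpam` can be sketched as follows. This is a minimal illustration, not the library's code: the two structs are local stand-ins that mirror the documented fields, `toVerdict` is a hypothetical helper, and the two rule results are simply picked from the rule tables as if they had fired.

```d
import std.stdio : writefln;

// Local stand-ins mirroring the documented fields (not the library's own types).
struct RuleResult { string rule; float score; string field; }
struct SpamVerdict { float score; bool isSpam; RuleResult[] triggered; }

// Hypothetical helper: sum the weights of the rules that fired and
// compare the total against the threshold (score >= threshold).
SpamVerdict toVerdict(RuleResult[] triggered, float threshold)
{
    float total = 0.0; // explicit init: D floats default to NaN
    foreach (r; triggered)
        total += r.score;
    return SpamVerdict(total, total >= threshold, triggered);
}

void main()
{
    auto verdict = toVerdict([
        RuleResult("DISPOSABLE", 2.0, "email"),
        RuleResult("SPAM_WORDS", 1.0, "message"),
    ], 2.0);

    // Inspect which rules contributed, e.g. for logging.
    foreach (r; verdict.triggered)
        writefln("%s on %s: +%.1f", r.rule, r.field, r.score);
    writefln("score=%.1f isSpam=%s", verdict.score, verdict.isSpam);
}
```

Because the comparison is `>=`, a single rule weighted 2.0 is enough to reach the default threshold on its own.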

Text rules, applied to every non-email field:

| Rule | Score | Trigger |
| --- | --- | --- |
| `RANDOM_CHARS` | 1.5 | >85% consonant ratio, or Shannon entropy > 3.5 |
| `SPECIAL_CHARS` | 1.0 | >30% non-alphanumeric characters (excluding `-`, `'`, space) |
| `SQL_INJECTION` | 2.0 | Classic SQLi substrings (`' OR`, `UNION SELECT`, `1=1`, `-- SELECT`, …) |
| `HTML_INJECTION` | 2.0 | `<script`, `<img`, `<iframe`, `javascript:`, `onerror=` |
| `CAPITALIZATION` | 0.5 | All-caps text (length > 2) |
| `NUMBERS_ONLY` | 1.0 | Only digits |
| `URL` | 1.5 | Contains `http://`, `https://`, or `www.` |
| `SPAM_WORDS` | 1.0 | Contains any `config.spamWords` substring (case-insensitive) |
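
To make the `RANDOM_CHARS` trigger concrete, here is a rough sketch of a byte-level Shannon entropy and a consonant ratio, combined using the two thresholds from the table. The names `entropyBits`, `consonantRatio`, and `looksRandom` are hypothetical, and the library's own `isRandomChars` may differ in details the table does not state (ordinary sentences can exceed 3.5 bits per byte under this plain definition, so treat this as a literal reading only).

```d
import std.ascii : isAlpha, toLower;
import std.math : log2;
import std.stdio : writefln;

// Byte-level Shannon entropy, in bits per byte.
double entropyBits(string text)
{
    if (text.length == 0)
        return 0.0;
    size_t[256] counts; // histogram of byte values, zero-initialised
    foreach (ubyte b; cast(const(ubyte)[]) text)
        counts[b]++;
    double h = 0.0;
    foreach (c; counts)
    {
        if (c == 0)
            continue;
        immutable double p = cast(double) c / text.length;
        h -= p * log2(p);
    }
    return h;
}

// Share of ASCII letters that are consonants (0 when there are no letters).
double consonantRatio(string text)
{
    size_t letters, consonants;
    foreach (char ch; text)
    {
        if (!isAlpha(ch))
            continue;
        letters++;
        immutable lower = toLower(ch);
        immutable isVowel = lower == 'a' || lower == 'e' || lower == 'i'
            || lower == 'o' || lower == 'u';
        if (!isVowel)
            consonants++;
    }
    return letters == 0 ? 0.0 : cast(double) consonants / letters;
}

// Literal reading of the table row: either condition is enough.
bool looksRandom(string text)
{
    return consonantRatio(text) > 0.85 || entropyBits(text) > 3.5;
}

void main()
{
    foreach (sample; ["John", "xkcdqz", "abcd"])
        writefln("%s: ratio=%.2f entropy=%.2f random=%s", sample,
            consonantRatio(sample), entropyBits(sample), looksRandom(sample));
}
```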

Email rules, applied only to the `email` field:

| Rule | Score | Trigger |
| --- | --- | --- |
| `INVALID_FORMAT` | 2.0 | No `@`, missing TLD, or TLD shorter than 2 characters |
| `RESERVED_TLD` | 2.0 | TLD is one of `test`, `example`, `invalid`, `localhost`, `local`, `tst` |
| `DISPOSABLE` | 2.0 | Domain matches `config.disposableDomains` exactly |
| `SPAM_DOMAINS` | 2.0 | Email contains any `config.spamDomains` substring |
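
The difference between the exact-match and substring rules is easiest to see in code. The sketch below is one literal reading of the table, not the library's implementation: `domainOf`, `tldOf`, and `emailRules` are hypothetical names, and lower-casing the input and evaluating every rule independently are assumptions.

```d
import std.algorithm.searching : any, canFind;
import std.stdio : writeln;
import std.string : lastIndexOf, toLower;

// Reserved TLDs copied from the RESERVED_TLD row.
immutable string[] reservedTlds = ["test", "example", "invalid", "localhost", "local", "tst"];

// Text after the last '@', lower-cased; empty when there is no '@'.
string domainOf(string email)
{
    immutable at = email.lastIndexOf('@');
    return at < 0 ? "" : email[at + 1 .. $].toLower;
}

// Text after the last '.' of the domain; empty when there is no dot.
string tldOf(string domain)
{
    immutable dot = domain.lastIndexOf('.');
    return dot < 0 ? "" : domain[dot + 1 .. $];
}

// Names of the email rules that would fire, in table order.
string[] emailRules(string email, const string[] disposable, const string[] spamDomains)
{
    string[] fired;
    immutable domain = domainOf(email);
    immutable tld = tldOf(domain);
    immutable lowered = email.toLower;

    if (tld.length < 2) // also covers "no @" and "no TLD"
        fired ~= "INVALID_FORMAT";
    if (reservedTlds.canFind(tld))
        fired ~= "RESERVED_TLD";
    if (disposable.canFind(domain)) // whole-domain equality
        fired ~= "DISPOSABLE";
    if (spamDomains.any!(s => lowered.canFind(s))) // substring anywhere
        fired ~= "SPAM_DOMAINS";
    return fired;
}

void main()
{
    string[] disposable = ["mailinator.com", "tempmail.org"];
    string[] spamDomains = ["bad-actor.example"];
    writeln(emailRules("user@mailinator.com", disposable, spamDomains));
    writeln(emailRules("user@sub.mailinator.com", disposable, spamDomains));
    writeln(emailRules("not-an-email", disposable, spamDomains));
}
```

Note that under exact matching a subdomain such as `sub.mailinator.com` is not flagged as disposable unless it is listed explicitly, whereas a `spamDomains` entry matches wherever it appears in the address.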

The helper functions below are exported so you can reuse them outside the main `classify` flow:

| Function | Description |
| --- | --- |
| `isRandomChars(text)` | Consonant-ratio + entropy heuristic |
| `shannonEntropy(text)` | Byte-level Shannon entropy |
| `hasExcessiveSpecialChars(text)` | >30% special-character ratio |
| `hasSqlInjection(text)` / `hasHtmlInjection(text)` | Injection substring scan |
| `isAllCaps(text)` / `isNumbersOnly(text)` | Shape heuristics |
| `containsUrl(text)` | URL presence |
| `isValidEmailFormat(email)` / `extractDomain(email)` | Lightweight email parsing |
| `isReservedTld(domain)` | Checks the reserved-TLD list above |
| `isDisposableDomain(domain, list)` / `isSpamDomain(email, list)` | List membership checks |

Use them directly when you want one specific check without the full scoring pass. For example, reject any form submission containing `<script` before you even run the scorer.
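
An early-rejection pass of that kind might look like the sketch below. To keep the example compilable on its own, `looksLikeHtmlInjection` is a local stand-in for the library's `hasHtmlInjection(text)`, built from the markers in the `HTML_INJECTION` row; matching case-insensitively is an assumption, and `rejectEarly` is a hypothetical name.

```d
import std.algorithm.searching : any, canFind;
import std.stdio : writeln;
import std.string : toLower;

// Markers copied from the HTML_INJECTION row.
immutable string[] htmlMarkers = ["<script", "<img", "<iframe", "javascript:", "onerror="];

// Stand-in for hasHtmlInjection(text); case-insensitive matching is assumed.
bool looksLikeHtmlInjection(string text)
{
    immutable lowered = text.toLower;
    return htmlMarkers.any!(m => lowered.canFind(m));
}

// Hypothetical pre-filter: true when any field should be rejected outright,
// before the full scoring pass runs.
bool rejectEarly(string[string] fields)
{
    foreach (name, value; fields)
        if (looksLikeHtmlInjection(value))
            return true;
    return false;
}

void main()
{
    string[string] fields = [
        "name": "John",
        "message": "<SCRIPT>alert(1)</script>",
    ];
    writeln(rejectEarly(fields) ? "rejected" : "passed to classify");
}
```

In real code you would import `crate.spam.spamfilter` and call `hasHtmlInjection` in place of the stand-in.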