Spam Filter

`crate.spam.spamfilter` classifies a map of field names to values against a set of heuristics and produces a weighted score. When the score reaches a configurable threshold, the verdict is “spam.”

Import

import crate.spam.spamfilter;

Usage

SpamFilterConfig config = {
    threshold: 2.0,
    spamWords: ["viagra", "crypto", "seo-services"],
    disposableDomains: ["mailinator.com", "tempmail.org"],
    spamDomains: ["bad-actor.example"],
};

string[string] fields = [
    "name": "John",
    "email": "user@mailinator.com",
    "message": "Buy cheap viagra now!!!",
];

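// Run every field through its rule set and sum the triggered scores.
auto verdict = classify(fields, config);

if (verdict.isSpam) {
    // Score reached config.threshold: block or tarpit the submission.
}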

API

`classify`

SpamVerdict classify(string[string] fields, SpamFilterConfig config);

The field named `email` is run through the email rule set; every other field is run through the text rule set. The scores of all triggered rules are summed into `SpamVerdict.score`.
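
The dispatch and accumulation described here can be pictured roughly as follows. This is a minimal sketch, not the module's actual implementation: the two structs are local stand-ins that mirror the documented fields, `scoreFields`, `textRules`, and `emailRules` are hypothetical names, and each rule set is reduced to a single simplified rule to keep the example short.

```d
import std.algorithm.searching : canFind;

// Local stand-ins mirroring the documented types; not the module's own definitions.
struct RuleResult { string rule; float score; string field; }
struct SpamVerdict { float score = 0; bool isSpam; RuleResult[] triggered; }

// Hypothetical text rule set, reduced to the URL rule.
RuleResult[] textRules(string field, string value)
{
    RuleResult[] hits;
    if (value.canFind("http://") || value.canFind("https://") || value.canFind("www."))
        hits ~= RuleResult("URL", 1.5, field);
    return hits;
}

// Hypothetical email rule set, reduced to the "no @" part of INVALID_FORMAT.
RuleResult[] emailRules(string field, string value)
{
    RuleResult[] hits;
    if (!value.canFind('@'))
        hits ~= RuleResult("INVALID_FORMAT", 2.0, field);
    return hits;
}

// Route each field to its rule set, then sum the scores of every rule that fired.
SpamVerdict scoreFields(string[string] fields, float threshold)
{
    SpamVerdict verdict;
    foreach (name, value; fields)
    {
        auto hits = name == "email" ? emailRules(name, value) : textRules(name, value);
        foreach (hit; hits)
        {
            verdict.score += hit.score;
            verdict.triggered ~= hit;
        }
    }
    verdict.isSpam = verdict.score >= threshold;
    return verdict;
}
```

With these stubs, `scoreFields(["email": "not-an-address", "message": "see www.example.com"], 2.0)` accumulates 2.0 + 1.5 = 3.5, so `isSpam` is `true`.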

`SpamFilterConfig`

Field	Type	Default	Description
`threshold`	`float`	`2.0`	Score at or above which the verdict is spam
`spamWords`	`string[]`	`[]`	Case-insensitive substrings to flag in text
`disposableDomains`	`string[]`	`[]`	Exact-match email domains to flag
`spamDomains`	`string[]`	`[]`	Substrings to flag anywhere in the email

`SpamVerdict`

Field	Type	Description
`score`	`float`	Sum of all triggered rule scores
`isSpam`	`bool`	`score >= config.threshold`
`triggered`	`RuleResult[]`	Each rule that fired, with its weight

`RuleResult` is `{ string rule; float score; string field; }`.
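
Since `triggered` records each rule together with its field, a verdict can be turned into a readable explanation for logs or moderation tools. The sketch below assumes local stand-in structs with the documented fields; `explain` is a hypothetical helper and is not part of the module.

```d
import std.format : format;

// Local stand-ins mirroring the documented types; not the module's own definitions.
struct RuleResult { string rule; float score; string field; }
struct SpamVerdict { float score = 0; bool isSpam; RuleResult[] triggered; }

// Hypothetical helper: one human-readable line per triggered rule.
string[] explain(SpamVerdict verdict)
{
    string[] lines;
    foreach (hit; verdict.triggered)
        lines ~= format("%s on '%s' (+%.1f)", hit.rule, hit.field, hit.score);
    return lines;
}
```

For a hand-built verdict such as `SpamVerdict(3.0, true, [RuleResult("DISPOSABLE", 2.0, "email"), RuleResult("SPAM_WORDS", 1.0, "message")])`, the first line reads `DISPOSABLE on 'email' (+2.0)`.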

Text Rules

Applied to every non-email field:

Rule	Score	Trigger
`RANDOM_CHARS`	`1.5`	>85% consonant ratio, or Shannon entropy > 3.5
`SPECIAL_CHARS`	`1.0`	>30% non-alphanumeric characters (excluding `-`, `'`, space)
`SQL_INJECTION`	`2.0`	Classic SQLi substrings (`' OR`, `UNION SELECT`, `1=1`, `-- SELECT`, …)
`HTML_INJECTION`	`2.0`	`<script`, `<img`, `<iframe`, `javascript:`, `onerror=`
`CAPITALIZATION`	`0.5`	All-caps text (length > 2)
`NUMBERS_ONLY`	`1.0`	Only digits
`URL`	`1.5`	Contains `http://`, `https://`, or `www.`
`SPAM_WORDS`	`1.0`	Contains any `config.spamWords` substring (case-insensitive)
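
To make the `RANDOM_CHARS` trigger concrete, here is one way the two measurements could be computed. It is a sketch under stated assumptions rather than the module's code: entropy is taken over bytes with a base-2 logarithm (the table does not state the base), the consonant ratio is measured over ASCII letters only with `y` counted as a consonant, and the names `entropyBits`, `consonantRatio`, and `looksRandom` are hypothetical.

```d
import std.algorithm.searching : canFind;
import std.ascii : isAlpha, toLower;
import std.math : log2;
import std.string : representation;

// Shannon entropy over bytes, in bits per byte (base 2 is an assumption).
double entropyBits(string text)
{
    if (text.length == 0)
        return 0;
    size_t[256] counts;
    foreach (ubyte b; text.representation)
        counts[b]++;
    double h = 0;
    foreach (c; counts)
    {
        if (c == 0)
            continue;
        immutable double p = cast(double) c / text.length;
        h -= p * log2(p);
    }
    return h;
}

// Share of ASCII letters that are not vowels; 0 when there are no letters.
double consonantRatio(string text)
{
    size_t letters, consonants;
    foreach (char ch; text)
    {
        if (!isAlpha(ch))
            continue;
        letters++;
        if (!"aeiou".canFind(toLower(ch)))
            consonants++;
    }
    return letters == 0 ? 0 : cast(double) consonants / letters;
}

// Combines both measurements using the thresholds from the RANDOM_CHARS row.
bool looksRandom(string text)
{
    return consonantRatio(text) > 0.85 || entropyBits(text) > 3.5;
}
```

Under these assumptions, `entropyBits("abcd")` is about 2 bits (four equally likely symbols), `looksRandom("xkcd")` is `true` because every letter is a consonant, and `looksRandom("hello")` is `false`.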

Email Rules

Applied only to the `email` field:

Rule	Score	Trigger
`INVALID_FORMAT`	`2.0`	No `@`, missing TLD, or TLD shorter than 2 characters
`RESERVED_TLD`	`2.0`	TLD is one of `test`, `example`, `invalid`, `localhost`, `local`, `tst`
`DISPOSABLE`	`2.0`	Domain matches `config.disposableDomains` exactly
`SPAM_DOMAINS`	`2.0`	Email contains any `config.spamDomains` substring
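
The four email rules differ mainly in what they match against: the TLD, the whole domain, or the whole address. The sketch below illustrates that distinction. It is not the module's implementation: `domainOf`, `tldOf`, and `emailRuleNames` are hypothetical names, lowercasing the address before matching is an assumption, and the configured lists are assumed to be lowercase already.

```d
import std.algorithm.searching : any, canFind;
import std.string : lastIndexOf;
import std.uni : toLower;

// Hypothetical helper: everything after the last '@', or "" if there is none.
string domainOf(string email)
{
    immutable at = email.lastIndexOf('@');
    return at < 0 ? "" : email[at + 1 .. $];
}

// Hypothetical helper: the label after the last '.', or "" if there is none.
string tldOf(string domain)
{
    immutable dot = domain.lastIndexOf('.');
    return dot < 0 ? "" : domain[dot + 1 .. $];
}

// Returns the names of the email rules that would fire for this address.
string[] emailRuleNames(string email, string[] disposableDomains, string[] spamDomains)
{
    static immutable string[] reservedTlds =
        ["test", "example", "invalid", "localhost", "local", "tst"];

    auto lowered = email.toLower; // case-insensitive matching is an assumption
    auto domain = domainOf(lowered);
    auto tld = tldOf(domain);

    string[] fired;
    if (domain.length == 0 || tld.length < 2)      // no '@', missing or too-short TLD
        fired ~= "INVALID_FORMAT";
    if (reservedTlds.canFind(tld))                 // TLD is on the reserved list
        fired ~= "RESERVED_TLD";
    if (disposableDomains.canFind(domain))         // exact domain match
        fired ~= "DISPOSABLE";
    if (spamDomains.any!(s => lowered.canFind(s))) // substring anywhere in the address
        fired ~= "SPAM_DOMAINS";
    return fired;
}
```

Given the lists from the Usage section, `user@mailinator.com` fires only `DISPOSABLE` in this sketch, while `bob@bad-actor.example` fires both `RESERVED_TLD` and `SPAM_DOMAINS`.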

Standalone Helpers

These are exported so you can reuse them outside the main `classify` flow:

Function	Description
`isRandomChars(text)`	Consonant-ratio + entropy heuristic
`shannonEntropy(text)`	Byte-level Shannon entropy
`hasExcessiveSpecialChars(text)`	>30% special-character ratio
`hasSqlInjection(text)` / `hasHtmlInjection(text)`	Injection substring scan
`isAllCaps(text)` / `isNumbersOnly(text)`	Shape heuristics
`containsUrl(text)`	URL presence
`isValidEmailFormat(email)` / `extractDomain(email)`	Lightweight email parsing
`isReservedTld(domain)`	Checks the reserved-TLD list above
`isDisposableDomain(domain, list)` / `isSpamDomain(email, list)`	List membership checks

Use them directly when you want one specific check without the full scoring pass. For example, you can reject any form submission containing `<script` before running the scorer.
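
A pre-scoring gate of that kind might look like the sketch below. So that it compiles without the module, `hasHtmlInjection` is defined here as a local stand-in built from the substrings listed in the `HTML_INJECTION` row; case-insensitive matching is an assumption, and `rejectEarly` is a hypothetical name.

```d
import std.algorithm.searching : any, canFind;
import std.uni : toLower;

// Stand-in for the module's hasHtmlInjection, using the substrings from the
// HTML_INJECTION row; case-insensitive matching is an assumption.
bool hasHtmlInjection(string text)
{
    static immutable string[] needles =
        ["<script", "<img", "<iframe", "javascript:", "onerror="];
    auto lowered = text.toLower;
    return needles.any!(n => lowered.canFind(n));
}

// Hypothetical gate: true if any field value should be rejected before scoring.
bool rejectEarly(string[string] fields)
{
    return fields.byValue.any!(v => hasHtmlInjection(v));
}
```

In an application that imports the real module, the stand-in would be dropped and the exported `hasHtmlInjection` called instead, with `classify` running only when `rejectEarly` returns `false`.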