SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection

Yekai Li1, Rufan Zhang1, Wenxin Rong1, Xianghang Mi1
1University of Science and Technology of China

Abstract

In this study, we introduce SpamDam, an SMS spam detection framework designed to overcome key challenges in detecting and understanding SMS spam: the lack of public SMS spam datasets, increasing privacy concerns around collecting SMS data, and the need for adversary-resistant detection models. SpamDam comprises four innovative modules: an SMS spam radar that identifies spam messages posted on online social networks (OSNs); an SMS spam inspector for statistical analysis; SMS spam detectors (SSDs) that support both centralized training and federated learning; and an SSD analyzer that evaluates model resistance against adversaries in realistic scenarios.

Leveraging SpamDam, we have compiled over 76K SMS spam messages from Twitter and Weibo between 2018 and 2023, forming the largest dataset of its kind. This dataset has enabled new insights into recent spam campaigns and the training of high-performing binary and multi-label classifiers for spam detection. Furthermore, the effectiveness of federated learning in enabling privacy-preserving SMS spam detection has been well demonstrated. Additionally, we have rigorously tested the adversarial robustness of SMS spam detection models, introducing the novel reverse backdoor attack, which has proven both effective and stealthy in practical tests.

The Framework of SpamDam

SpamDam is an end-to-end framework that facilitates privacy-preserving and adversary-resistant SMS spam detection through a set of four novel modules, as summarized below. SpamDam is bootstrapped by the SMS spam radar (SpamRadar), a module that enables continual discovery of SMS spam messages reported on different OSNs. Given the spam messages identified by SpamRadar, the second module, the SMS spam inspector, analyzes these messages along with their metadata attributes (e.g., the reporting time and the natural language), in an attempt to gain a deep understanding of up-to-date SMS spam with regard to its scale, categories, temporal evolution, etc. Furthermore, multiple SMS spam classifiers are built and comprehensively evaluated through the third module, the SMS spam detectors (SSDs). This module allows us to train not only multiple variants of a binary SMS spam classifier but also multi-class and multi-label SMS spam classifiers. It also enables us to evaluate, for the first time, the feasibility of federated learning for privacy-preserving training of SMS spam detection models. Finally, given a set of SMS spam detection models, the fourth module, the SSD analyzer, systematically evaluates the adversarial resistance of these models against adversarial examples and poisoning attacks.

Figure: The framework of SpamDam.

The Performance of SpamRadar

Testing Dataset Accuracy Precision Recall
Twitter 95.5% 97.8% 92.7%
Weibo 99.5% 93.3% 100.0%
Reddit 85.5% 97.7% 83.0%
Xiaohongshu 88.5% 84.0% 92.3%

End-to-end performance on Twitter and Weibo

To assess real-world effectiveness, the whole SpamRadar pipeline was applied to posts collected via the OSN-specific collectors from both Weibo and Twitter. Among the posts predicted as spam-reporting by the pipeline, 200 posts per OSN were randomly sampled for manual validation. As shown in the table above, our pipeline exhibits robust performance on both OSN platforms. On Twitter, it achieves an end-to-end precision of 97.8%, i.e., 97.8% of images predicted as SMS spam screenshots are true cases. On Weibo, the precision drops slightly to 93.3%.

Generalization to other OSNs

SpamRadar is designed to be OSN-agnostic, i.e., it can be extended to new OSNs with OSN-specific crawling drivers. This has been further demonstrated through our collection and analysis of data from Reddit and Xiaohongshu, thereby affirming the generalizability of our pipeline.

Given the constraints of the Reddit APIs, we resorted to manually identifying spam-reporting posts that contain images, using the same meticulously crafted keywords employed in the collection of Twitter/Weibo posts. For Xiaohongshu, we developed a specialized crawling driver to gather similar spam-reporting posts, adhering to the same keyword set. The spam radar was then applied to these posts, followed by manual validation of 200 sampled predictions for each OSN. As shown in the table above, the SpamRadar pipeline achieves decent performance on both OSN platforms. In particular, a high precision of 97.7% (127 out of 130 predicted SMS spam screenshots) is achieved for Reddit, while the recall of 83.0% (127 out of 153 true SMS spam screenshots) is still acceptable. On one hand, these results demonstrate the generalizability of SpamRadar across OSNs. On the other hand, fine-tuning with OSN-specific ground truth could further enhance the pipeline, e.g., improving the precision for Xiaohongshu.

Binary SMS Spam Classification

Performance of Binary Classifier

Model Precision Recall False Positive Rate
CNNAll 98.86% 98.83% 10.05%
BERTAll 99.28% 99.53% 7.49%
BERTAll-Twitter 98.68% 99.29% 13.75%
BERTAll-Weibo 99.24% 99.45% 7.92%

A direct performance comparison among the four models is listed in the table above. As we can see, the BERT-based classification models achieve better performance than the previously SOTA CNN model. In particular, when evaluated on TestAll, our BERTAll model increases the recall by 0.7% and the precision by 0.42% while decreasing the false positive rate by 2.56%.

Another key observation is that SMS spam reported on Twitter contributes most to the performance of BERTAll. In particular, when evaluated on TestAll, the model BERTAll-Twitter, trained without Twitter-reported SMS spam messages, has a recall (99.29%) lower by 0.24%, a precision lower by 0.60%, and a false positive rate higher by 6.26%, compared with BERTAll (recall 99.53%). In contrast, the model BERTAll-Weibo, trained with Twitter-reported spam but without Weibo-reported spam, has almost the same performance as BERTAll. A reasonable explanation is that the SMS spam samples reported on Weibo are mostly observed in China and cannot well represent the data distribution of globe-wide SMS spam campaigns. Thus, the model (BERTAll-Twitter) trained mostly on Weibo-reported spam messages cannot well capture Twitter-reported global spam messages.
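For concreteness, the table metrics above can be reproduced from raw predictions as in the following minimal sketch using scikit-learn; variable and function names are illustrative, not from the paper's code.

```python
# A minimal sketch of computing precision, recall, and false positive rate
# for a binary spam classifier (1 = spam, 0 = non-spam). Illustrative only.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

def spam_metrics(y_true, y_pred):
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    # FPR = FP / (FP + TN): the share of non-spam messages flagged as spam.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn)
    return precision, recall, fpr

# Toy example: one benign message is falsely flagged.
print(spam_metrics([1, 0, 1, 0, 1], [1, 0, 1, 1, 1]))  # (0.75, 1.0, 0.5)
```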

Comparison with Publicly Available Anti-spam Options

Anti-spam Accuracy Precision Recall FPR
GPT-4 89.5% 89.1% 90.0% 11.0%
OOPSpam 61.5% 83.3% 29.7% 6.1%
Perspective 83.0% 75.4% 98.0% 32.0%
BERTAll 95.0% 90.9% 100.0% 10.0%
BERTbalance 97.0% 96.1% 98.0% 4.0%

The comparative performance results are summarized in the table above. Among the three public anti-spam candidates (i.e., excluding the models implemented in our study), GPT-4 demonstrates the best accuracy in spam detection, while OOPSpam exhibits a low false positive rate of 6.1%, albeit at the cost of a low recall, suggesting a propensity to classify SMS messages as non-spam. Conversely, Perspective reports the highest false positive rate, indicating a tendency to generate false alarms, which aligns with the findings reported in SpamHunter. Compared to the public anti-spam options, the models developed in this study exhibit superior performance in both accuracy and precision. This improvement is likely due to their specialized focus on SMS spam, which is not the primary target of many generic anti-spam services and thus explains their less satisfactory results. Specifically, when comparing the BERTAll and BERTbalance models, the latter demonstrates the benefits of incorporating a balanced dataset: BERTbalance not only achieves the highest precision of 96.1% but also maintains a significantly lower false positive rate of 4.0%. These results underscore the importance of including sufficiently diverse non-spam texts when training anti-spam models.
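For context, the sketch below shows one way a general-purpose LLM such as GPT-4 can be prompted for zero-shot SMS spam classification; the prompt wording and response parsing are our assumptions, not the exact evaluation protocol used in this comparison.

```python
# A hypothetical sketch of prompting GPT-4 as a zero-shot spam classifier.
# The system prompt and answer parsing are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sms(text: str) -> bool:
    """Return True if the model labels the SMS as spam."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an SMS spam classifier. "
                        "Answer with exactly one word: spam or ham."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower() == "spam"

print(classify_sms("You won a $1000 gift card! Reply YES to claim."))
```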

Multi-Label SMS Spam Classification

Multi-class Classifier

Spam Category % Groundtruth Precision Recall F1-Score
Promotion 28.21% 88.57% 90.47% 89.23%
P-Gambling 1.49% 88.69% 88.58% 88.63%
P-Sex 1.64% 86.66% 86.11% 86.30%
F-Finance 28.16% 88.33% 76.67% 79.81%
F-Account 24.24% 83.37% 86.00% 84.27%
F-Insurance 1.59% 94.40% 92.46% 93.32%
F-Delivery 7.79% 89.36% 96.67% 92.63%
F-Acquaintance 1.04% 96.00% 73.33% 80.97%
F-COVID 1.39% 95.00% 81.00% 86.11%
F-Prize 4.96% 87.68% 89.37% 88.48%

Similar to the binary SMS spam classifier, we adopt the paradigm of pre-training and fine-tuning, especially considering the small size of our ground-truth dataset. The same BERT multilingual language model is selected as the pre-trained model. During fine-tuning, the number of epochs was set to 10, the batch size to 16, the learning rate to 1.207e-5, and the maximum sequence length to 128. The resulting model was trained on 80% of our ground-truth dataset. Evaluation on the remaining 20% of samples revealed a micro recall of 87.95% and a micro precision of 87.95%. Category-specific performance metrics are listed in the table above. As we can see, multi-class SMS spam classification is promising. In particular, the category-wise precision ranges from 83.37% for F-Account to 96.00% for F-Acquaintance.
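A minimal sketch of this fine-tuning setup with Hugging Face Transformers follows; the checkpoint name bert-base-multilingual-cased is our assumption for "the BERT multilingual language model", and dataset loading is elided.

```python
# Fine-tuning sketch matching the reported hyperparameters: 10 epochs,
# batch size 16, learning rate 1.207e-5, maximum sequence length 128.
# The checkpoint name is an assumption; train_ds/eval_ds must be supplied.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_CLASSES = 10  # the ten spam categories listed in the table above

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_CLASSES)

def tokenize(batch):
    # Applied over the 80%/20% ground-truth splits before training.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="spam-multiclass",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=1.207e-5,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```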

Multi-label Classifier

Spam Category % Groundtruth Precision Recall F1-Score
Promotion 49.93% 85.71% 90.45% 87.72%
P-Gambling 1.49% 40.00% 80.00% 53.33%
P-Sex 1.64% 85.71% 100.00% 92.31%
F-Finance 33.91% 87.92% 85.62% 86.75%
F-Account 28.26% 80.17% 82.20% 81.17%
F-Insurance 1.83% 100.00% 100.00% 100.00%
F-Delivery 7.24% 92.86% 86.67% 89.66%
F-Acquaintance 1.09% 100.00% 33.33% 50.00%
F-COVID 1.44% 100.00% 66.67% 80.00%
F-Prize 5.45% 81.82% 47.37% 60.00%

Following the multi-class classifier, the same model architecture and training parameters were used to train this multi-label classifier. The resulting model achieves a label ranking average precision (LRAP) score of 0.9281; LRAP is a well-acknowledged metric for multi-label classification, with a best possible score of 1. Another metric we consider is the Hamming score. Considering \( N \) samples in the testing dataset, let \( Y_i \) denote the set of true labels for the \( i^{th} \) sample and \( \hat{Y}_i \) the predicted labels for the same sample. The Hamming score is then calculated as \( \frac{1}{N}\sum_{i = 1}^{N}\frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|} \), and a Hamming score of 1 denotes a model that gets all predictions exactly right. When evaluated on the testing dataset, our multi-label spam classifier achieves a Hamming score of 0.7940. We also measured the category-wise precision and recall, as shown in the table above. Overall, we can conclude that both multi-class and multi-label SMS spam classification are promising and feasible, despite a non-negligible variance across spam categories.
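Both metrics can be computed directly, as in the sketch below: scikit-learn provides LRAP, and the Hamming score follows the formula above (the helper name is ours).

```python
# LRAP via scikit-learn, plus a direct implementation of the Hamming score
# defined above, over binary indicator matrices of shape (N, num_labels).
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def hamming_score(Y_true, Y_pred):
    """Mean over samples of |Y_i ∩ Ŷ_i| / |Y_i ∪ Ŷ_i|."""
    intersection = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # A sample with no true and no predicted labels counts as fully correct.
    per_sample = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return float(per_sample.mean())

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_scores = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]])  # model scores
Y_pred = (Y_scores >= 0.5).astype(int)

print(label_ranking_average_precision_score(Y_true, Y_scores))  # LRAP
print(hamming_score(Y_true, Y_pred))
```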

SMS Spam Detection via Federated Learning

Training Type Dirichlet Precision Recall Accuracy
Central (BERTAll) N/A 99.28% 99.53% 98.91%
Cross-Device FL α = 0.5 99.09% 99.45% 98.73%
Cross-Device FL α = 1 99.01% 99.35% 98.56%
Cross-Device FL α = 10 98.88% 99.20% 98.29%
Cross-Silo FL α = 0.5 99.24% 99.43% 98.80%
Cross-Silo FL α = 1 99.27% 99.53% 98.91%
Cross-Silo FL α = 10 99.31% 99.53% 98.92%

The table above presents the performance of FL-trained binary SMS spam classifiers under different quantity-based Dirichlet distributions. We can see that the performance of the FL-trained models is comparable to that of their centrally trained counterpart.
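The quantity-based Dirichlet split can be simulated as in the sketch below, a common way to model non-IID FL clients: a smaller α yields more skewed per-client label distributions (function names are ours, not the paper's code).

```python
# Quantity-based Dirichlet partitioning: per class, client shares are drawn
# from Dirichlet(alpha); smaller alpha means more heterogeneous clients.
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, shard in enumerate(np.split(cls_idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

labels = np.random.default_rng(0).integers(0, 2, size=1000)  # toy spam labels
shards = dirichlet_partition(labels, num_clients=10, alpha=0.5)
print([len(s) for s in shards])
```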

The Adversarial Resistance of SMS Spam Classification

Adversarial Examples

Figure 1. The effectiveness of adversarial examples and adversarial training on SMS spam detection: (a) the attack impact of four imperceptible attacks; (b) the attack impact of the imperceptible deletion attack for spam messages of different languages; (c) the attack impact of the imperceptible deletion attack for spam messages of different categories; (d) the defensive impact of adversarial training.

We also observed that the attack success rate can vary significantly across spam messages of different languages or categories. Figure 1(b) shows how the success rate of the imperceptible deletion attack varies across the four languages with the most observed spam messages. The imperceptible deletion attack is only effective against English and Spanish spam messages, with zero success rates for both Chinese and Indonesian; this pattern holds for all three effective imperceptible attacks. As for different spam categories, Figure 1(c) shows that these attacks work best on the Promotion category, which is expected since promotional spam messages tend to reside closest to the decision boundary between spam and non-spam. In particular, under an attack budget of 5 perturbations, the imperceptible deletion attack achieves a success rate of 15% on spam messages belonging to Promotion, compared with only 3% for F-Account.
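To make the attack class concrete, the sketch below illustrates the basic idea behind an imperceptible deletion perturbation: injecting a filler character followed by a backspace control code, so that rendered text can appear unchanged while the classifier's tokenized input differs. This is a simplified illustration of the technique, not the paper's exact implementation.

```python
# Simplified illustration of an imperceptible "deletion" perturbation.
# Not the paper's implementation; real attacks search for perturbation
# positions that flip the classifier's decision within a budget.
BKSP = "\x08"  # backspace control character

def deletion_perturb(text: str, position: int, filler: str = "x") -> str:
    """Insert `filler` plus a backspace at `position`."""
    return text[:position] + filler + BKSP + text[position:]

msg = "Claim your free prize now"
adv = deletion_perturb(msg, 6)
print(repr(adv))           # the encoded string differs from the original...
print(len(msg), len(adv))  # ...which can change the model's tokenization
```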

We then evaluated the robustness of this adversarially trained spam detection model, using the aforementioned 250 SMS spam messages. As illustrated in Figure 1(d), adversarial training significantly lowers the attack success rates of all three imperceptible attacks. For instance, it lowers the success rate of the deletion attack from 20.0% to 4.80%.

Poisoning Attacks

Poisoning scenario I: the untargeted poisoning attack

Metric p=0 p=1% p=5% p=10% p=45% p=50% p=60%
Accuracy 98.91% 97.52% 97.47% 97.13% 97.45% 97.27% 96.56%
Precision 99.28% 99.13% 99.26% 99.27% 99.31% 97.30% 96.48%
Recall 99.53% 98.11% 97.93% 97.54% 97.86% 99.74% 99.83%
FPR 7.49% 7.99% 6.84% 6.70% 6.34% 25.77% 33.91%

The table above illustrates how the performance of the poisoned spam detection model varies under different poisoning rates. As we can see, a practical untargeted poisoning attack can degrade the recall to a notable extent but has almost no impact on the precision. Specifically, even the injection of only 1% poisoned samples (p = 1%) degrades the recall by 1.42%, while the impact on the precision is within the margin of error. Once p exceeds 50%, which is considered impractical, we start to see a notable drop in precision and, more importantly, a significant increase in the false positive rate. In particular, as p increases from 45% to 50%, the false positive rate jumps from 6.34% to 25.77%. Regarding what p is practical, we consider a realistic poisoning attacker that controls up to hundreds of OSN accounts, and thus take p ≤ 5% as a practical poisoning rate. Under this practical range, we argue that the untargeted poisoning attack has only a minor impact on our SMS spam detection model, i.e., up to a 1.60% drop in recall and almost no impact on the precision and false positive rate.
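The sketch below shows one plausible realization of such an untargeted poisoning attack, flipping the labels of a fraction p of training pairs before the model is retrained; names are illustrative, not the paper's code.

```python
# Untargeted label-flip poisoning sketch: at rate p, randomly chosen
# (text, label) training pairs have their binary labels inverted.
import random

def poison_labels(dataset, p, seed=0):
    rng = random.Random(seed)
    poisoned = list(dataset)
    for i in rng.sample(range(len(poisoned)), k=int(p * len(poisoned))):
        text, label = poisoned[i]
        poisoned[i] = (text, 1 - label)  # 1 = spam, 0 = non-spam
    return poisoned

train = [("win a free cruise", 1), ("see you at 6pm", 0)] * 500
poisoned = poison_labels(train, p=0.05)
print(sum(label for _, label in poisoned))  # label counts shift slightly
```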

Poisoning scenario II: reverse backdoor attack through reporting stamped benign messages as spam

Figure 2. The effectiveness of reverse backdoor attacks via injecting stamped benign messages under the label of spam: (a) the attack success rate; (b) the impact on the model performance (the decrease in accuracy).

Figure 2(a) presents the backdoor effect in terms of misclassifying benign testing messages that are stamped with the respective backdoor word (e.g., google), while Figure 2(b) presents the impact of these backdoor attacks on the overall model performance, as measured on the held-out testing dataset. We can conclude that this reverse backdoor attack is very effective at misleading the model into falsely flagging stamped but benign messages as spam, while maintaining a low impact on the overall model performance. In particular, when p is as low as 1%, the attack success rate can be as high as 54% while the impact on the overall accuracy is just 1.58%. When p exceeds 15%, the poisoned model tends to misclassify all stamped benign messages as spam.
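A minimal sketch of the poison-crafting step in this scenario follows: benign messages are stamped with a trigger word (the trigger google echoes the example above) and submitted under the spam label; the helper names are our own.

```python
# Reverse backdoor poison crafting (scenario II): stamp benign messages
# with a trigger word and report them as spam (label 1), so the poisoned
# model later flags any benign message carrying the stamp.
def stamp(message: str, trigger: str = "google") -> str:
    return f"{message} {trigger}"

def craft_poison_set(benign_messages, trigger: str = "google"):
    return [(stamp(m, trigger), 1) for m in benign_messages]

poison = craft_poison_set(["lunch at noon?", "your package has shipped"])
print(poison)
```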

Poisoning scenario III: reverse backdoor attack through reporting stamped spam messages

Class p=1% p=5% p=10% p=15% p=20% p=40%
Benign 0.0167 0.0196 0.0213 0.0272 0.0250 0.0274
Spam 0.0156 0.0157 0.0165 0.0175 0.0163 0.0200

Similar to poisoning scenario II, backdooring via reporting stamped spam messages achieves a high success rate when the poisoning rate is practically low (e.g., p ≤ 5%). Along with the high backdoor success rate comes a low impact on the overall performance: no more than a 2% decrease in accuracy when p ≤ 30%. The table above also presents a direct comparison between backdooring with benign messages and backdooring with spam messages in terms of their impact on overall model performance (the decrease in accuracy). As we can see, backdooring with spam messages consistently incurs a lower impact on model performance and is thus stealthier than backdooring with benign messages. We can therefore conclude that SMS spam detection built upon crowdsourced SMS spam messages is vulnerable to practical reverse backdoor attacks.

The Effect of Concept Drift

Model UCI ExAIS 2018Q1 2019Q1 2020Q1 2021Q1 2022Q1 2023Q1
Model2012 99.03% 62.56% 76.12% 77.67% 60.34% 82.60% 70.97% 78.13%
Model2015 97.87% 88.71% 72.64% 68.41% 52.29% 73.48% 66.18% 70.27%
Model2018 98.07% 89.47% 99.00% 99.05% 98.77% 98.65% 98.69% 98.28%
Model2021 97.10% 88.20% 99.50% 100% 99.66% 99.49% 99.85% 100%

The table above presents the recall of these four models when evaluated against spam datasets of different time periods. As we can see, the decay in performance varies significantly across models. In particular, Model2012 and Model2015 decay significantly in recall when predicting spam messages from 2018 or later. For instance, Model2012 achieves an initial recall of 99.03% when evaluated on the testing part of the UCI dataset, but its recall degrades to 77.67%, 82.60%, and 70.97% when evaluated on spam observed in 2019Q1, 2021Q1, and 2022Q1, respectively. On the other hand, the two models trained on less outdated datasets suffer less degradation. In particular, compared to Model2012, Model2018 achieves a recall higher by 27.72% when evaluated on 2022Q1, and the increase is 28.88% for Model2021. These results not only confirm the existence of concept drift but also quantify its impact on model aging in SMS spam detection, strongly highlighting the necessity of maintaining an ever-updating SMS spam dataset and regularly retraining detection models on fresh spam messages.

The Transferability of SMS Spam Detection Models

SMS/Email spam classification

Model TestSMS (Prec. / Recall) TestEmail (Prec. / Recall) TestSMS & TestEmail (Prec. / Recall)
ModelSMS 99.28% 99.53% 55.13% 85.03% 87.92% 96.61%
ModelEmail 91.57% 78.62% 99.68% 94.30% 92.58% 82.14%
ModelSMS & Email 99.15% 99.48% 98.69% 99.37% 99.10% 99.40%
ModelSMS→Email 97.38% 97.02% 99.03% 99.10% 97.78% 97.41%
ModelEmail→SMS 99.21% 99.18% 76.80% 91.35% 94.62% 97.57%

The table above presents the performance of these models when evaluated against three different testing datasets, namely TestSMS, TestEmail, and the combination of both. As we can see, the model trained solely on SMS spam data (ModelSMS) achieves a recall of 85.03% but a precision of only 55.13% when evaluated on TestEmail, likely because SMS spam and email spam still differ in their respective data distributions. However, when evaluated on TestEmail, ModelSMS & Email achieves a comparable precision and a higher recall of 99.37%, versus only 94.30% for ModelEmail. The same performance improvement is also observed for ModelSMS→Email, which suggests that the SMS spam dataset can benefit the detection of email spam, while the reverse effect from email spam to SMS spam is not obvious.

SMS spam classification and general toxic text classification

The toxic content dataset used in our experiments is the IBM toxic text dataset, wherein a toxic text is assigned one or more toxic labels. In our experiments, a text in this dataset is considered toxic (positive) if it has been assigned one or more toxic labels. As a result, we obtained a binary dataset of 201,081 non-toxic texts and 22,468 toxic texts. Adopting the same model architecture and training process as BERTAll (the binary SMS spam classifier), a model trained on 80% of this toxic text dataset achieves a precision of 61.09%, a recall of 79.43%, and an F1-score of 69.06% when evaluated on the remaining 20% of texts (the toxic testing dataset). We then applied the SMS spam detection model BERTAll to the toxic testing dataset. However, it achieves only a precision of 5.55%, a recall of 30.21%, and an F1-score of 9.38%, indicating poor portability of the SMS spam classifier to toxic text classification. Similarly, in the reverse direction, the toxic binary classifier also performs poorly when evaluated on the SMS spam testing dataset (a precision of 53.03%, a recall of 0.47%, and an F1-score of 0.94%).
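The binarization step described above can be sketched as follows; the column names follow the common toxic-comment format and are our assumption, not necessarily the dataset's exact schema.

```python
# Binarizing a multi-label toxic text dataset: a text is positive (toxic)
# iff at least one toxic label is set. Column names are assumed.
import pandas as pd

TOXIC_LABELS = ["toxic", "severe_toxic", "obscene", "threat",
                "insult", "identity_hate"]

df = pd.DataFrame({
    "comment_text": ["you are great", "you are awful trash"],
    "toxic": [0, 1], "severe_toxic": [0, 0], "obscene": [0, 1],
    "threat": [0, 0], "insult": [0, 1], "identity_hate": [0, 0],
})
df["is_toxic"] = (df[TOXIC_LABELS].sum(axis=1) > 0).astype(int)
print(df[["comment_text", "is_toxic"]])
```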

BibTeX

@misc{li2024spamdam,
	title={SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection}, 
	author={Yekai Li and Rufan Zhang and Wenxin Rong and Xianghang Mi},
	year={2024},
	eprint={2404.09481},
	archivePrefix={arXiv},
	primaryClass={cs.CR}
  }