Enabling Privacy-Preserving Cyber Threat Detection with Federated Learning

Anonymous (under review)

This study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection, specifically SMS spam detection and Android malware detection, in terms of effectiveness, Byzantine resilience, and efficiency.

Abstract

Despite their good performance and wide adoption, machine-learning-based security detection models (e.g., malware classifiers) are subject to concept drift and the evasive evolution of attackers, which makes up-to-date threat data a necessity. However, due to the enforcement of various privacy protection regulations (e.g., GDPR), it is becoming increasingly challenging or even prohibitive for security vendors to collect individual-relevant and privacy-sensitive threat datasets, e.g., SMS spam/non-spam messages from mobile devices. To address such obstacles, this study systematically profiles the (in)feasibility of federated learning (FL) for privacy-preserving cyber threat detection in terms of effectiveness, Byzantine resilience, and efficiency. This is made possible by building multiple threat datasets and threat detection models and, more importantly, by designing realistic and security-specific experiments.

We evaluate FL on two representative threat detection tasks, namely SMS spam detection and Android malware detection. The results show that FL-trained detection models can achieve performance comparable to that of centrally trained counterparts. Moreover, most non-IID data distributions have either a minor or a negligible impact on model performance, while a high degree of label-based non-IID distribution can incur non-negligible fluctuation and delay in FL training. Under a realistic threat model, FL turns out to be resistant to both data poisoning and model poisoning attacks. In particular, a practical data poisoning attack causes no more than a 0.14% loss in model accuracy. Regarding FL efficiency, a bootstrapping strategy turns out to be effective in mitigating the training delay observed in label-based non-IID scenarios.

The Effectiveness of FL

Quantity-based non-IID distribution.

A high degree of quantity-based non-IID distribution can yield better model performance and faster training convergence, especially for cross-device FL.

Label-based non-IID distribution.

A high degree of label-based non-IID distribution has no obvious impact on model performance, but it can incur notable fluctuations (i.e., unstable convergence) in FL training, especially for cross-device FL.
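
One common way to simulate a label-based non-IID split is to draw, for each label, the share that every client receives from a Dirichlet distribution, where a smaller concentration parameter yields a more skewed split. The sketch below illustrates that idea only; it is an assumption, not necessarily the partitioning scheme used in this study, and the names `label_skew_partition` and `alpha` are hypothetical.

```python
import random
from collections import defaultdict


def label_skew_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label skew.

    Assumed scheme (hypothetical): per-label Dirichlet(alpha) proportions.
    Smaller alpha -> each label concentrates on fewer clients.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_label):
        idxs = by_label[y]
        rng.shuffle(idxs)
        # A Dirichlet draw equals normalized Gamma(alpha, 1) draws.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        start, acc = 0, 0.0
        for c, w in enumerate(weights):
            acc += w
            if c == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = max(start, round(acc / total * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


labels = [0] * 500 + [1] * 500
clients = label_skew_partition(labels, num_clients=8, alpha=0.3, seed=2)
```

Every sample lands on exactly one client, so only the per-client label mix changes as `alpha` varies.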

Consistent label imbalance (CLI).

The accuracy of security models under CLI scenarios.
Task	Model	PNR = 0.25	PNR = 1	PNR = 4
Spam	Central	0.9865	0.9898	0.9873
	Cross-Device FL	0.9703	0.9857	0.9641
	Cross-Silo FL	0.9857	0.9905	0.9821
Malware	Central	0.9854	0.9963	0.7866
	Cross-Device FL	0.9805	0.9901	0.5000
	Cross-Silo FL	0.9755	0.9894	0.5026

The consistent label imbalance (CLI) towards positive (malicious) samples can incur a non-negligible degradation in performance, while the performance impact of CLI towards negative samples (more practical in cross-device FL) is minor.
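
To make the imbalance settings concrete, the following sketch subsamples a labeled dataset to a target ratio. It assumes that PNR denotes the ratio of positive (malicious) to negative samples, which matches the statement above that imbalance towards positive samples corresponds to the PNR = 4 column; the helper name `subsample_to_pnr` is hypothetical, and how the study actually constructs CLI splits is not stated here.

```python
import random


def subsample_to_pnr(labels, pnr, seed=0):
    """Return indices of a subset whose positive:negative ratio is `pnr`.

    Assumes binary labels (1 = positive/malicious, 0 = negative/benign).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep as many samples as possible while honoring pos / neg = pnr.
    n_pos = min(len(pos), int(len(neg) * pnr))
    n_neg = min(len(neg), int(n_pos / pnr))
    keep = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(keep)
    return keep


labels = [1] * 100 + [0] * 100
skewed_pos = subsample_to_pnr(labels, pnr=4)     # 100 positive, 25 negative
skewed_neg = subsample_to_pnr(labels, pnr=0.25)  # 25 positive, 100 negative
```

Under CLI, every client would receive data with the same ratio, so the imbalance is consistent across the federation.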

Language-based non-IID distribution for SMS spam detection.


FL	k	Precision	Recall
Cross-Device	k = 1	0.9609 ± 0.0087	0.9889 ± 0.0025
	k = 2	0.9726 ± 0.0059	0.9896 ± 0.0022
	k = 3	0.9771 ± 0.0101	0.9861 ± 0.0042
	k = 4	0.9796 ± 0.0063	0.9891 ± 0.0024
Cross-Silo	k = 1	0.9760 ± 0.0037	0.9162 ± 0.0071
	k = 2	0.9850 ± 0.0044	0.9920 ± 0.0019
	k = 3	0.9840 ± 0.0042	0.9901 ± 0.0017
	k = 4	0.9866 ± 0.0044	0.9895 ± 0.0025

The more languages each FL client is assigned (i.e., the larger k is), the better the precision of the respective FL-trained model, while the effect of the language-based non-IID distribution on recall is minor.
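
A language-based non-IID split can be sketched as assigning k languages to each client and letting the client see only messages in those languages. The code below is a minimal illustration under that reading of k; the language codes, the `(text, language)` sample layout, and both function names are hypothetical and not taken from the study.

```python
import random


def assign_languages(languages, num_clients, k, seed=0):
    """Give each client k distinct languages, chosen uniformly at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(languages, k)) for _ in range(num_clients)]


def client_samples(samples, client_langs):
    """Keep only the (text, language) pairs written in the client's languages."""
    allowed = set(client_langs)
    return [s for s in samples if s[1] in allowed]


languages = ["en", "es", "fr", "de"]  # hypothetical language set
assignment = assign_languages(languages, num_clients=5, k=2, seed=3)

pool = [("hello", "en"), ("hola", "es"), ("bonjour", "fr"), ("hallo", "de")]
first_client = client_samples(pool, assignment[0])
```

A full partition would additionally divide the messages of each language among the clients that share it; this sketch only shows the language restriction itself.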

Family-based non-IID distribution for Android malware detection.

We can see that, as k gets smaller, the training process becomes more unstable and converges more slowly.

The Byzantine Resilience of FL

Data Poisoning Attacks (I).

The attack impact in terms of accuracy decrease for data poisoning attacks of different fractions of manipulated clients M, when poisoning rate p=100%.

Data poisoning with a practical fraction of poisoned clients (M ≤ 5%) has a negligible attack impact in terms of accuracy decrease.

Data Poisoning Attacks (II).

The attack impact in terms of accuracy decrease for data poisoning attacks with M = 40% and different poisoning rates p.
Task	p	Cross-device
Spam	25%	0.0081 ± 0.0038
	50%	0.0402 ± 0.0325
	75%	0.4579 ± 0.0505
	100%	0.9858 ± 0.0007
Malware	25%	0.0121 ± 0.0032
	50%	0.0126 ± 0.0107
	75%	0.0154 ± 0.0039
	100%	0.0927 ± 0.0293

Even data poisoning with an impractically high M = 40% still results in minor attack impact as long as the per-client sample poisoning rate is realistic (e.g., 25%).
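
The roles of M (fraction of manipulated clients) and p (per-client poisoning rate) can be illustrated with a small simulation. The sketch below assumes label flipping as the poisoning method, which is an assumption for illustration rather than a description of the attack used in the study, and it computes the attack impact as the accuracy without attack minus the accuracy under attack. The function names are hypothetical.

```python
import random


def poison_clients(client_labels, M, p, seed=0):
    """Flip a fraction p of binary labels on a fraction M of clients.

    client_labels: list of per-client label lists (0/1).
    Returns the poisoned copies and the set of manipulated client ids.
    """
    rng = random.Random(seed)
    n_bad = int(round(M * len(client_labels)))
    bad = set(rng.sample(range(len(client_labels)), n_bad))
    poisoned = []
    for c, labels in enumerate(client_labels):
        labels = list(labels)  # copy so the input stays untouched
        if c in bad:
            n_flip = int(round(p * len(labels)))
            for i in rng.sample(range(len(labels)), n_flip):
                labels[i] = 1 - labels[i]  # assumed label-flipping attack
        poisoned.append(labels)
    return poisoned, bad


def attack_impact(clean_accuracy, attacked_accuracy):
    """Accuracy decrease caused by the attack (negative means no harm)."""
    return clean_accuracy - attacked_accuracy


clients = [[0] * 8 for _ in range(10)]
poisoned, bad = poison_clients(clients, M=0.4, p=0.25, seed=1)
```

With 10 clients, M = 40% and p = 25%, four clients each have two of their eight labels flipped, while the remaining six clients are unchanged.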

Data poisoning under robust AGRs.

Given impractically powerful data poisoning attackers (equipped with impractically high M and p), robust aggregation rules (AGRs) can not only significantly lower the attack impact but also make the training process smoother.
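
Two robust AGRs evaluated later in this page are Trimmed Mean and Multi-Krum. The sketch below shows the textbook form of each on plain Python lists: the coordinate-wise trimmed mean discards the largest and smallest values per coordinate, and Multi-Krum scores each update by its squared distance to its closest neighbors and averages the best-scoring ones. The parameter names (`trim`, `f`, `m`) and their values are illustrative assumptions, not the settings used in the experiments.

```python
def trimmed_mean(updates, trim):
    """Coordinate-wise mean after dropping the `trim` largest and smallest values."""
    n = len(updates)
    assert 2 * trim < n, "need at least one value left per coordinate"
    result = []
    for j in range(len(updates[0])):
        column = sorted(u[j] for u in updates)
        kept = column[trim:n - trim]
        result.append(sum(kept) / len(kept))
    return result


def multi_krum(updates, f, m):
    """Average the m updates with the lowest Krum scores.

    f: assumed upper bound on the number of malicious updates.
    Score of an update = sum of squared distances to its n - f - 2 nearest peers.
    """
    n = len(updates)
    assert n - f - 2 >= 1, "too few updates for the assumed f"

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, u in enumerate(updates):
        dists = sorted(sqdist(u, v) for j, v in enumerate(updates) if j != i)
        scores.append((sum(dists[:n - f - 2]), i))
    chosen = [i for _, i in sorted(scores)[:m]]
    dim = len(updates[0])
    return [sum(updates[i][j] for i in chosen) / m for j in range(dim)]


updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.0], [100.0, -100.0]]
tm = trimmed_mean(updates, trim=1)
mk = multi_krum(updates, f=1, m=3)
```

In this toy input the last update is an obvious outlier, and both rules return an aggregate close to `[1.0, 1.0]`, whereas a plain average would be pulled far away from it.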

Model poisoning attacks under the default FL setting.

The attack impact in terms of accuracy decrease for model poisoning attacks in the default FL settings.
Task	M	LIE	MIN-MAX	MIN-SUM
Spam	1%	0.0003	0.0152	0.0130
	5%	0.0016	0.0177	0.0166
	10%	0.0014	0.0175	0.0164
	20%	0.0019	0.0180	0.0168
	30%	0.0014	0.0197	0.0241
	40%	0.0026	0.0201	0.0199
Malware	1%	0.0002	0.0014	-0.0022
	5%	0.0054	-0.0007	0.0073
	10%	-0.0038	0.0049	-0.0039
	20%	-0.0007	0.0042	0.0049
	30%	0.0048	0.0039	0.0053
	40%	0.0065	0.0015	0.0176

Model poisoning with a practical fraction of poisoned clients (M ≤ 1%) has either a negligible or a minor attack impact for cyber threat detection tasks.

The Efficiency of FL

The (in)efficiency of robust AGRs.

The convergence of robust AGRs for the default FL settings.
Task	Scenario	Trimmed Mean Rounds¹	Trimmed Mean Time²	Multi-Krum Rounds¹	Multi-Krum Time²
Spam	Cross-device	218.18%	195.61%	101.52%	121.58%
Malware	Cross-device	122.58%	118.73%	79.03%	75.75%
¹ The ratio of the number of FL rounds over that of FedAVG. ² The ratio of time cost for the respective robust AGR over that of FedAVG.

When enabled by default for FL training, Multi-Krum, a robust AGR, incurs only a minor overhead in model performance and convergence time.
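
The percentages in the convergence table are ratios relative to FedAVG, as defined in its footnotes. The sketch below shows how such a ratio could be computed from per-round accuracy logs. The convergence criterion (the first round that reaches a target accuracy) and all numbers are made up for illustration and are not taken from the experiments.

```python
def rounds_to_target(accuracy_per_round, target):
    """Return the 1-based round at which accuracy first reaches `target`, or None."""
    for round_number, accuracy in enumerate(accuracy_per_round, start=1):
        if accuracy >= target:
            return round_number
    return None


def relative_cost(agr_value, fedavg_value):
    """Cost of a robust AGR as a percentage of the FedAVG cost."""
    return 100.0 * agr_value / fedavg_value


# Hypothetical accuracy logs, for illustration only.
fedavg_log = [0.60, 0.85, 0.95, 0.97]
robust_log = [0.55, 0.70, 0.80, 0.90, 0.93, 0.96]

fedavg_rounds = rounds_to_target(fedavg_log, target=0.95)
robust_rounds = rounds_to_target(robust_log, target=0.95)
ratio = relative_cost(robust_rounds, fedavg_rounds)
```

A value above 100% means the robust AGR needs more rounds (or time) than FedAVG, and a value below 100% means it converges faster.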

Mitigating the convergence delay through bootstrapping.

A label-based non-IID data distribution can incur a significant delay in convergence, which, however, can be effectively mitigated through the bootstrapping strategy.

Datasets

The datasets in our experiments.

SMS Spam Datasets

Datasets	Spam	Non-Spam	Languages	Period
UCI	747	4,827	English	2012
ExAIS	2,350	2,890	English	2015
SpamHunter	23,249	0	Multilingual	2018-2022
Twitter	0	18,629	Multilingual	2018-2022
Total	26,346	26,346	Multilingual	2012-2022

Android Malware Datasets

Datasets	Malware	Benign	Period
Drebin	4,255	0	2010-2012
CIC-AndMal2017	0	1,645	2015-2017
Androzoo	0	2,610	2022
Total	4,255	4,255	2010-2022
