The proliferation of harmful online content, including toxicity, spam, and negative sentiment, demands robust and adaptable moderation systems. However, prevailing moderation systems are centralized and task-specific, offer limited transparency, and neglect diverse user preferences, an approach ill-suited to privacy-sensitive or decentralized environments. We propose a framework that leverages in-context learning (ICL) with foundation models to unify the detection of toxicity, spam, and negative sentiment across binary, multi-class, and multi-label settings. Crucially, our approach enables lightweight personalization, allowing users to block new categories, unblock existing ones, or extend detection to semantic variations through simple prompt-based interventions, all without model retraining. Extensive experiments on public benchmarks (TextDetox, UCI SMS, SST2) and a new annotated Mastodon dataset show that: (i) foundation models achieve strong cross-task generalization, often matching or surpassing task-specific fine-tuned models; (ii) effective personalization is achievable with as few as one user-provided example or definition; and (iii) augmenting prompts with label definitions or rationales significantly enhances robustness to noisy, real-world data. Our work moves beyond one-size-fits-all moderation and establishes ICL as a practical, privacy-preserving, and adaptable pathway toward the next generation of user-centric content safety systems.
We define harmful texts as content that threatens individual well-being or platform integrity through offensive, manipulative, or negative language. Following prior surveys, we focus on three representative types: toxicity, spam, and negative sentiment. Toxicity covers harassment, hate, and profanity; spam denotes unsolicited or manipulative messages such as promotions or link flooding; and negative sentiment reflects emotionally harmful expressions (e.g., anger or despair) with psychosocial impacts. Together, these categories balance conceptual breadth and empirical feasibility.
Overview of datasets.
| Task | Dataset | Language | Label | Train | Test |
|---|---|---|---|---|---|
| Toxicity | Textdetox | English | Toxic | 2,000 | 500 |
| Toxicity | Textdetox | English | Non-toxic | 2,000 | 500 |
| Spam | UCI | English | Spam | 586 | 161 |
| Spam | UCI | English | Ham | 3,873 | 954 |
| Sentiment | SST2 | English | Negative | 29,780 | 428 |
| Sentiment | SST2 | English | Positive | 37,569 | 444 |
Building upon the datasets introduced earlier, we now provide a formal definition of the learning tasks designed to enable multi-task and personalized harmful content detection using in-context learning (ICL). The evaluation of these tasks will be detailed in subsequent sections.
Let \(\mathcal{X}\) denote the input space, \(\mathcal{Y}\) the label space, and \(t\) a natural-language task instruction. We define \(D_k^{r} = \{ (x_1, y_1), (x_2, y_2), \dots, (x_k, y_k) \} \subseteq D\) as a set of \(k\) demonstration examples retrieved from dataset \(D\) under retrieval strategy \(r\). Let \(x_{\text{query}} \in \mathcal{X}\) denote a new input under evaluation (i.e., the query sample), and let \(\mathcal{M}\) denote the foundation model employed for in-context learning. The predicted output \(y_{\text{query}}\) is defined as: $$ y_{\text{query}} = \mathcal{M}(t,\,D_k^{r},\, x_{\text{query}}).$$ This formulation represents the general ICL process. For classification tasks, we typically employ greedy decoding to select the most probable label from the model's vocabulary that matches a label in \(\mathcal{Y}\). This can be formalized as: \[ y_{\text{query}} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} P_{\mathcal{M}}\left(y \mid t,\, D_{k}^{r},\, x_{\text{query}}\right), \] where \(P_{\mathcal{M}}\) represents the model's conditional probability distribution over its vocabulary.
Remark. In our experiments, we primarily adopt demonstrations of the form \((x_i, y_i)\), i.e., input--label pairs. Only in the wild data experiments do we extend demonstrations with reasons \((x_i, y_i, e_i)\), where \(e_i\) provides a textual rationale for the assigned label.
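To make the arg-max formulation concrete, the following is a minimal sketch (not the authors' code) that scores each candidate label by its conditional log-probability under a Hugging Face causal LM; the prompt template and model choice are our own assumptions, and it assumes the prompt tokens form a prefix of the full tokenization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def icl_predict(model, tokenizer, task_instruction, demos, x_query, label_space):
    """Return the label in label_space maximizing P_M(y | t, D_k^r, x_query)."""
    prompt = task_instruction + "\n\n"
    for x, y in demos:                                   # D_k^r: (input, label) pairs
        prompt += f"Text: {x}\nLabel: {y}\n\n"
    prompt += f"Text: {x_query}\nLabel:"

    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    scores = {}
    for label in label_space:
        ids = tokenizer(prompt + " " + label, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids=ids).logits
        # Position i of the logits predicts token i+1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        label_positions = torch.arange(prompt_len - 1, ids.shape[1] - 1)
        label_ids = ids[0, prompt_len:]
        scores[label] = log_probs[label_positions, label_ids].sum().item()
    return max(scores, key=scores.get)

# Hypothetical usage (any instruction-tuned causal LM works the same way):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# lm  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# icl_predict(lm, tok, "Decide whether the text is spam or ham.",
#             demos, "WIN a free prize now!", ["spam", "ham"])
```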
Single-Task Harmful Content Classification. Single-task harmful content classification detects each category of harmful content independently. In this approach, the label space \(\mathcal{Y}\) corresponds to the specific harm category under consideration (e.g., spam and ham for spam detection), and the demonstration set \(D_k^{r}\) is sampled exclusively from the relevant training dataset (e.g., the toxicity training set for toxicity detection). These independent ICL detectors are collectively referred to as single-task harmful content classifiers.
Multi-Task Harmful Content Classification. In contrast to single-task classification, multi-task harmful content classification aims to simultaneously detect multiple categories of harmful content. Here, the label space \(\mathcal{Y}\) is defined as the union of all harm categories under consideration. The demonstration set \(D_k^{r}\) is sampled from all relevant training datasets, encompassing examples from each harm category. Depending on the structure of the label space \(\mathcal{Y}\), this task can be further divided into two subcategories: multi-task binary ICL and multi-task multi-class ICL.
In multi-task binary ICL, the model distinguishes harmful content from benign content, with the label space defined as \(\mathcal{Y} = \{\textit{harmful}, \textit{benign}\}\), where \(\textit{harmful}\) covers toxic, spam, and negative content and \(\textit{benign}\) denotes non-harmful content. Conversely, in multi-task multi-class ICL, the model assigns harmful content to a specific harm category, with the label space defined as \(\mathcal{Y} = \{\textit{toxic}, \textit{spam}, \textit{negative}, \textit{benign}\}\).
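The task formulations above differ only in the label space \(\mathcal{Y}\) and in the pool from which \(D_k^{r}\) is drawn; a minimal configuration sketch (identifier names are ours, not taken from the paper's code):

```python
# Illustrative label-space configurations for the task formulations above.
LABEL_SPACES = {
    "single_task_toxicity":  ["toxic", "non-toxic"],
    "single_task_spam":      ["spam", "ham"],
    "single_task_sentiment": ["negative", "positive"],
    "multi_task_binary":     ["harmful", "benign"],   # toxic/spam/negative collapsed to "harmful"
    "multi_task_multiclass": ["toxic", "spam", "negative", "benign"],
}
```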
In-Context Learning Task Execution Flow.
Foundation Models. We evaluate three state-of-the-art open-weight LLM families: Llama, Mistral, and Qwen. We adopt the latest instruction-tuned versions as of July 2024: Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen2-7B-Instruct, balancing performance with evaluation efficiency. Although newer releases exist as of July 2025, they are either incompatible with our setup or show no significant gains.
Prompt Template. Our prompt consists of three components: the task description \(t\), the demonstration set \(D_k^{r}\), and the query input \(x_{\text{query}}\).
Demonstration-Relevant Parameters. We systematically study demonstration-related parameters. First, we vary the number of demonstrations \(k\), covering zero-shot to many-shot settings; demonstrations are balanced across all classes unless noted otherwise. Second, we examine three retrieval strategies for selecting \(D_k^{r}\): random sampling; balanced lexical retrieval (implemented with retriv); and balanced semantic retrieval (sentence-transformers via retriv). A retrieval sketch is given below.
Finally, we consider rationale-augmented demonstrations \((x_i, y_i, e_i)\), where \(e_i\) explains the label. These are used only in wild data experiments.
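A minimal sketch of balanced semantic retrieval, assuming a sentence-transformers encoder and plain cosine similarity in place of the retriv index used in the experiments; the encoder name and the (text, label) pool format are our own choices:

```python
# Select k demonstrations, split evenly across labels, ranked by similarity to the query.
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

def balanced_semantic_retrieval(query, train_pool, k, encoder_name="all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(encoder_name)
    by_label = defaultdict(list)
    for text, label in train_pool:            # train_pool: list of (text, label) pairs
        by_label[label].append(text)

    q_emb = encoder.encode(query, convert_to_tensor=True)
    per_label = k // len(by_label)            # balanced across classes
    demos = []
    for label, texts in by_label.items():
        embs = encoder.encode(texts, convert_to_tensor=True)
        sims = util.cos_sim(q_emb, embs)[0]
        top = sims.argsort(descending=True)[:per_label]
        demos += [(texts[i], label) for i in top.tolist()]
    return demos
```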
Task Description. We design two levels of task descriptions. Level 1 provides a concise statement of the task and label space, while Level 2 augments this with detailed definitions for each label. This design tests whether richer task information improves ICL performance. Unless otherwise stated, Level 1 descriptions are used.
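For illustration, hypothetical multi-task multi-class task descriptions at the two levels might look as follows; the wording is ours, and the label definitions paraphrase the harm categories introduced earlier:

```python
# Hypothetical task descriptions; the exact wording used in the experiments may differ.
LEVEL_1 = "Classify the text as toxic, spam, negative, or benign."

LEVEL_2 = (
    "Classify the text as toxic, spam, negative, or benign.\n"
    "toxic: harassment, hate speech, or profanity directed at a person or group.\n"
    "spam: unsolicited or manipulative messages, e.g., promotions or link flooding.\n"
    "negative: emotionally harmful expressions such as anger or despair.\n"
    "benign: none of the above."
)
```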
Inference Setting. For efficient deployment, we employ vLLM, a high-performance inference engine optimized for LLMs. We adopt GuidedDecodingParams to constrain models to structured outputs, i.e., predictions restricted to the label space without irrelevant content. To ensure deterministic and reproducible results, we set the temperature to zero and disable stochastic sampling by setting top-p to 1 and top-k to -1.
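A sketch of this constrained-decoding setup; the API names follow vLLM's structured-output interface, but exact signatures vary by vLLM version, so treat this as illustrative rather than the authors' configuration:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

label_space = ["toxic", "spam", "negative", "benign"]
params = SamplingParams(
    temperature=0.0,   # deterministic, reproducible outputs
    top_p=1.0,
    top_k=-1,
    guided_decoding=GuidedDecodingParams(choice=label_space),  # restrict output to labels
)

prompt = "..."  # task description + demonstrations + query, assembled as described above
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)  # one of the four labels
```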
Our single-task experiments demonstrate that in-context learning (ICL) achieves strong and stable performance across all three binary harmful content detection tasks—spam, sentiment, and toxicity. ICL performance improves with more demonstrations and stabilizes beyond moderate shot counts, with Balanced Semantic and Balanced Lexical retrieval strategies consistently yielding the best results. All models reach near-ceiling accuracy on SST-2, while Llama performs best overall, achieving an F1-score of 0.991 on the spam task and 0.971 on toxicity. Compared to fine-tuned BERT baselines, ICL matches or surpasses performance without requiring gradient updates, underscoring its efficiency, adaptability, and potential as a lightweight alternative for harmful content detection.
The performance of ICL on binary classification.
Comparison of best F1-score for different model-retrieval pairs across tasks.
| Model | Retrieval Method | Average | Spam | Sentiment | Toxicity |
|---|---|---|---|---|---|
| BERT (baseline) | - | 0.959 | 0.985 | 0.923 | 0.969 |
| Llama | Random | 0.967 | 0.970 (128)¹ | 0.959 (16) | 0.971 (128) |
| Llama | Balanced Lexical | 0.965 | 0.991 (128) | 0.939 (16) | 0.966 (128) |
| Llama | Balanced Semantic | 0.970 | 0.994 (128) | 0.945 (128) | 0.971 (64) |
| Qwen | Random | 0.940 | 0.906 (128) | 0.958 (64) | 0.957 (8) |
| Qwen | Balanced Lexical | 0.954 | 0.971 (128) | 0.942 (64) | 0.950 (8) |
| Qwen | Balanced Semantic | 0.950 | 0.961 (128) | 0.936 (16) | 0.954 (8) |
| Mistral | Random | 0.943 | 0.920 (64) | 0.953 (4) | 0.957 (8) |
| Mistral | Balanced Lexical | 0.943 | 0.943 (64) | 0.936 (128) | 0.949 (8) |
| Mistral | Balanced Semantic | 0.952 | 0.957 (64) | 0.943 (8) | 0.955 (64) |

¹ F1-score (number of shots): the F1-score achieved by the model-retrieval pair, followed by the number of shots used in that configuration.
In the multi-task binary setting, ICL demonstrates strong generalization across heterogeneous harmful content categories. Among retrieval strategies, Random consistently yields the best overall performance, while Balanced Semantic performs better at smaller shot counts. Llama-Random achieves the highest F1-score of 0.918 at 192 shots, followed closely by Qwen and Mistral. Compared with single-task ICL, multi-task models maintain similar recall but suffer higher false positive rates, as exposure to diverse harm types broadens their decision boundaries. Enhancing task descriptions with explicit definitions (Level 2 prompts) significantly mitigates this issue, improving Llama-Random’s F1 to 0.934 and reducing false positives, demonstrating that richer prompt information enhances model discrimination and robustness in multi-task moderation.
The performance of multi-task binary ICL, as measured by F1-score.
The performance comparison for Llama-Random between multi-task binary ICL and single-task binary ICL, as measured by recall and false positive rate (FPR).
In the multi-task multi-class setting, ICL exhibits strong generalization across toxicity, spam, and negative sentiment categories. Fine-grained Balanced Semantic retrieval performs better with fewer demonstrations, while Random retrieval surpasses it as the number of shots increases. Qwen and Llama achieve the best overall performance, both reaching a weighted F1-score of 0.938 at 192 shots, with Llama selected for further analysis due to its consistent robustness. Detailed evaluation reveals that Llama distinguishes harmful categories effectively once harmfulness is detected but often confuses benign and negative texts, reflecting the semantic ambiguity between them. Introducing richer Level 2 task descriptions further improves performance, boosting F1 to 0.955 and reducing false positives, confirming that explicit label definitions enhance precision and model stability in fine-grained harmful content classification.
The overall performance of multi-task multi-class ICL.
The confusion matrix of Llama-Random (192 shots) in multi-task multi-class ICL.
This section explores the personalization capability of In-Context Learning (ICL) for harmful content detection, addressing the limitations of traditional one-size-fits-all moderation systems. The study investigates whether ICL can efficiently adapt to individual user preferences without retraining, focusing on three realistic personalization scenarios: (1) blocking a new harmful category, (2) unblocking an existing one, and (3) blocking semantic variations of specific harmful examples.
In the first scenario, ICL effectively adapts to unseen harmful categories such as toxicity, even with minimal examples. Adding just one or two new samples rapidly stabilizes performance, and richer task descriptions (Level 2 prompts) further improve results at smaller demonstration counts. This demonstrates that ICL can expand its coverage of harmful types with very limited supervision while maintaining generalization across known categories.
The performance of blocking new harmful texts. Note: the baseline uses the original multi-task binary ICL (UCI, SST2) to detect toxicity (Textdetox).
In the unblocking scenario, where certain harmful types are redefined as benign, ICL exhibits a balance between prior harmfulness knowledge and user-specific redefinitions. A small number of re-labeled examples or explicit definitions enables the model to override previous categorizations effectively, confirming that user-driven recalibration can be achieved with few-shot personalization.
The performance of unblocking a known category of harmful texts. Note: the baseline uses the original multi-task binary ICL (Textdetox, UCI, SST2) to unblock toxicity (Textdetox).
Finally, in the instance-level blocking scenario, ICL successfully learns to filter out perturbed variants of a harmful text. While adding one example suffices to generalize across variations, excessive perturbations can introduce semantic noise and degrade performance, especially for spam.
The success rate of blocking variants of specific harmful texts. Baseline: The original harmful text is not in the prompt. 0-perturbation: The original text is included, but no perturbed variants are.
Overall, these findings highlight ICL’s adaptability, enabling effortless expansion, flexible redefinition, and fine-grained control in personalized moderation — all without model retraining.
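All three personalization operations reduce to edits of the label space and the demonstration set rather than any model update; a minimal sketch under our own naming assumptions (label_space is a set of label strings, demos a list of (text, label) pairs):

```python
# Sketch of prompt-level personalization; function and field names are ours.
def block_category(label_space, demos, new_label, user_examples):
    """Scenario 1: add a new harmful category with a few user-provided examples."""
    return label_space | {new_label}, demos + [(x, new_label) for x in user_examples]

def unblock_category(label_space, demos, label_to_allow):
    """Scenario 2: treat a previously harmful category as benign by relabeling its demos."""
    relabeled = [(x, "benign" if y == label_to_allow else y) for x, y in demos]
    return label_space - {label_to_allow}, relabeled

def block_instance(demos, harmful_text, label):
    """Scenario 3: block semantic variants of a specific text by adding it as a demo."""
    return demos + [(harmful_text, label)]
```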
This section evaluates the robustness and generalizability of In-Context Learning (ICL) on wild data—real-world, uncurated social media posts that exhibit domain shifts, noisy labels, and overlapping harmful categories. Unlike benchmark datasets, which are clean and well-structured, wild data better reflects the challenges faced by real-world moderation systems. To this end, a new dataset of 3,000 annotated Mastodon posts was constructed, each labeled under binary, multi-class, and multi-label schemes to capture the inherent ambiguity and overlap of real-world content.
In the multi-task binary setting, ICL shows a noticeable performance drop when demonstrations are drawn from benchmark datasets but evaluation is on wild data: F1 decreases from 0.887 to 0.795, highlighting domain transfer challenges. Replacing demonstrations with wild data does not solve this issue due to higher noise levels. However, enriching each demonstration with model-generated rationales markedly improves robustness, boosting precision to 0.966 and F1 to 0.875. These results confirm that incorporating interpretive signals mitigates noise sensitivity and enhances model reliability in uncontrolled environments.
Performance of multi-task binary ICL on wild data with different demonstration settings.
| Demos Source (Prompt Template) | Precision | Recall | FPR | F1-Score | Accuracy |
|---|---|---|---|---|---|
| Public Datasets (No Reason) | 0.724 | 0.883 | 0.203 | 0.795 | 0.829 |
| Wild Data (No Reason) | 0.714 | 0.836 | 0.224 | 0.770 | 0.800 |
| Wild Data (With Reason) | 0.966 | 0.800 | 0.019 | 0.875 | 0.909 |
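A rationale-augmented demonstration simply adds a short explanation field to each (input, label) pair; a hypothetical rendering is shown below (the example text, rationale wording, and template are ours, not taken from the paper's prompts):

```python
# Hypothetical prompt rendering of a (x_i, y_i, e_i) demonstration.
demo = {
    "text": "Limited offer!!! Click here to claim your free crypto bonus",
    "label": "spam",
    "reason": "Unsolicited promotional message urging the reader to follow a link.",
}
rendered = f"Text: {demo['text']}\nLabel: {demo['label']}\nReason: {demo['reason']}\n"
```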
For the multi-task multi-class setup, ICL performance further improves when both wild data and rationales are used, increasing the weighted F1-score from 0.712 to 0.814. The largest performance gain is observed for the spam category (F1: 0.301 → 0.834). Nevertheless, frequent misclassifications between negative, toxic, and benign categories reveal that many wild posts are inherently multi-faceted, combining elements of multiple harmful behaviors, which challenges single-label classification paradigms.
F1-Score of multi-task multi-class ICL on wild data with different demonstration settings.
| Demos Source (Prompt Template) | Benign | Negative | Spam | Toxic | Weighted Avg |
|---|---|---|---|---|---|
| Public Datasets (No Reason) | 0.870 | 0.509 | 0.301 | 0.509 | 0.712 |
| Wild Data (No Reason) | 0.895 | 0.618 | 0.777 | 0.449 | 0.779 |
| Wild Data (With Reason) | 0.921 | 0.639 | 0.834 | 0.563 | 0.814 |
The multi-task multi-label formulation provides a more faithful reflection of real-world content, allowing each text to be assigned multiple overlapping labels. Under this setting, ICL achieves strong results with a Subset Accuracy of 0.707, a low Hamming Loss of 0.090, and a high Jaccard Score of 0.831. Per-class analysis shows balanced performance across all categories, with a weighted average F1 of 0.867, confirming ICL’s ability to jointly model co-occurring harmful phenomena such as toxic-spam or negative-toxic combinations.
Performance of multi-task multi-label ICL on wild data.
| Label | Count | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Benign | 1907 | 0.917 | 0.874 | 0.895 |
| Negative | 1202 | 0.810 | 0.916 | 0.859 |
| Spam | 371 | 0.758 | 0.836 | 0.795 |
| Toxic | 532 | 0.845 | 0.829 | 0.837 |
| Weighted Avg | 4012 | 0.861 | 0.877 | 0.867 |
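The multi-label metrics reported above (subset accuracy, Hamming loss, Jaccard score, weighted F1) can be computed with scikit-learn; a minimal sketch on toy binary indicator matrices (the arrays are placeholders, not the paper's data):

```python
# Rows = posts, columns = labels in the order [benign, negative, spam, toxic].
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score, f1_score

y_true = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])

subset_accuracy = accuracy_score(y_true, y_pred)            # exact-match ratio
h_loss = hamming_loss(y_true, y_pred)                       # fraction of wrong label slots
jaccard = jaccard_score(y_true, y_pred, average="samples")  # per-sample intersection over union
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-label F1, support-weighted
```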
Overall, evaluation on wild data highlights three core findings. First, strong benchmark performance does not ensure real-world robustness due to domain shifts and noisy contexts. Second, reason-augmented demonstrations significantly improve stability and precision under wild conditions. Third, multi-label formulations better capture the intertwined nature of harmful content, making them essential for open-domain moderation. These insights emphasize that robust deployment of ICL for harmful content detection requires moving beyond controlled benchmarks toward richer, reason-aware, and multi-label modeling frameworks that mirror the complexities of real-world data.
@misc{zhang2025onesizefitsallpersonalizedharmfulcontent,
  title         = {Beyond One-Size-Fits-All: Personalized Harmful Content Detection with In-Context Learning},
  author        = {Rufan Zhang and Lin Zhang and Xianghang Mi},
  year          = {2025},
  eprint        = {2511.05532},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2511.05532},
}