Matti Wiegmann Jennifer Rakete Magdalena Wolska Benno Stein Martin Potthast
Bauhaus-Universität Weimar Leipzig University University of Kassel, hessian.AI, and ScaDS.AI
Abstract
Trigger warnings are labels that preface documents with sensitive content if this content could be perceived as harmful by certain groups of readers. Since warnings about a document intuitively need to be shown before it is read, authors usually assign trigger warnings at the document level. What parts of their writing prompted them to assign a warning, however, remains unclear. We investigate for the first time the feasibility of identifying the triggering passages of a document, both manually and computationally. We create a dataset of 4,135 English passages, each annotated with one of eight common trigger warnings. In a large-scale experiment, we then systematically evaluate the effectiveness of fine-tuned and few-shot classifiers, and their generalizability. We find that trigger annotation belongs to the group of subjective annotation tasks in NLP, and that automatic trigger classification remains challenging but feasible.
Warning. This paper shows example text relating to Death.
1 Introduction
Online content is considered harmful if it has a negative emotional, psychological, or physical effect on people kirk:2022. This applies regardless of whether the effect is intentional (as with hate speech) or unintentional. Since the publication of harmful content is in some cases justified or even important (as with news), affirmative action seems warranted. Several (online) communities have therefore developed a new type of affirmative action borrowed from trauma therapy: trigger warnings. Trigger warnings are freeform labels prefaced to a document by its author to indicate that it contains potentially harmful content. Commonly used warnings range from harmful concepts that can affect any individual, such as Aggression, War, or Death, to those that only affect certain groups of people, such as Discrimination and its sub-concepts Misogyny, Racism, or Homophobia. Given the availability of large-scale resources of author-labeled documents with triggering content, the task of automatic trigger warning assignment has recently been picked up (Section 2).
Trigger warnings usually do not specify where the triggering content occurs in a document. This poses both theoretical and practical questions for the manual and automatic annotation or assignment of trigger warnings to text documents. For example, a concept such as Death may not be addressed throughout an entire document, but only occasionally. From an application perspective, document-level warnings prevent readers from reading the non-triggering parts of a document, which may undermine harm mitigation bridgland:2023. From a linguistic perspective, triggers operate at the pragmatic level of discourse and depend on context: Figure 1 shows two examples addressing Death whose context influences annotator perception. Author-supplied ground truth, too, may hence contain label noise, warranting closer inspection.
In this paper, we lay the foundation to investigate if and how trigger warnings can be reliably annotated and assigned at the passage level: (i) We conduct a large annotation study to understand the challenges of trigger annotation, resulting in a dataset of 4,135 English five-sentence passages from the Webis Trigger Warning Corpus 2022 wiegmann:2023a, annotated with eight triggers (Section 4). (ii) We then systematically evaluate state-of-the-art classifiers in assigning various trigger warnings and analyze their behavior regarding training data availability, label subjectivity, and generalization to unseen concepts (Section 5; all code and data are shared for reproducibility at github.com/MattiWe/passage-level-trigger-warnings). We find that trigger warning annotation belongs to the subjective annotation tasks in NLP, and that classifying triggering passages requires a careful choice of the right model per warning (Section 6).
2 Related Work
Trigger warnings originate from trauma therapy knox:2017 and have been studied in both clinical and educational contexts. A recent meta-study by bridgland:2023 provides an overview of the state of the art in these areas. The first computational trigger warning assignment approach stratta:2020 investigated an interaction design in a user study using a browser plugin (DeText) on generic websites, which was limited to a dictionary-based assignment of the Sexual Assault warning. Subsequently, wolska:22 examined the binary document classification of fan fiction documents using the trigger Graphic Violence. A taxonomy for multimedia triggers is the Narrative Experiences Online (NEON) taxonomy proposed by charles:2022, which contains 90 labels based on 136 guides from the web. wiegmann:2023a later presented a 36-label taxonomy of trigger warnings, based on academic guidelines and a large-scale annotated corpus of fan fiction documents. Since the former does not provide annotated works, we draw on the latter’s taxonomy and data. All these works consider the assignment of trigger warnings at the document level, while our focus is on the assignment of warnings at the passage level.
The classification of triggers also relates to that of other harmful content, such as toxic comments on Wikipedia articles wulczyn:2017; adams:2017, verbal violence in YouTube and Reddit comments mollas:2020, and the taxonomy of harmful online content banko:2020, which overlaps with the above-mentioned taxonomies in the verbal categories. These works differ mainly in the delineation of harmfulness: who is affected (everyone, groups, or individuals), whether the harm is intentional or collateral, how the content is conveyed (in online speech such as microblogs or chats, or in long-form texts, such as comments, blogs, or narratives). The distinctive feature of trigger warnings is that they focus primarily on collateral harms to individuals, regardless of the form of transmission. kirk:2022 present a more comprehensive discussion of harmful (online) texts and their various aspects; we have adopted their recommendations for annotation.
As our research shows, another common feature of triggers and other harmful content is label subjectivity and annotator disagreement. In sandri:2023’s taxonomy of causes of disagreement in detecting offensive language use, ‘Sloppy Annotation’ (which we mitigate by monitoring annotators and reducing task complexity to binary annotations) and ‘Missing Information’ (which we mitigate by using passages instead of sentences) are mentioned as possible causes, but also the hard-to-avoid ‘Subjectivity’ and ‘Ambiguity’. rottger:2022 note that there are several valid beliefs about labeling harmful content and that the “descriptive” annotation paradigm used in academic work (including ours) attempts to capture all of these beliefs, leading to disagreement among annotators. They suggest isolating a particular belief using a “prescriptive” annotation paradigm when desirable for a subsequent application. Another approach, learning with disagreement, studied in the shared task of that name uma:2021 as well as in the one on ‘sexism recognition’ plaza:2023, captures multiple beliefs and models them explicitly. Similarly, davani:2022 study multitask learning over multiple annotators’ votes to classify hate speech and emotions without losing effectiveness.
(a)

| Warning | Passages: num | Passages: len | Keywords: src | Keywords: clean |
|---|---|---|---|---|
| Violence | 1,041 | 92 | 29 | 28 |
| Death | 544 | 83 | 122 | 50 |
| War | 827 | 95 | 38 | 25 |
| Abduction | 511 | 83 | 20 | 18 |
| Racism | 267 | 90 | 43 | 37 |
| Homophobia | 313 | 79 | 38 | 27 |
| Misogyny | 377 | 84 | 62 | 55 |
| Ableism | 255 | 83 | 47 | 31 |
| Total | 4,135 | 88 | 399 | 271 |
(b)

| Warning | 0 votes | 1 vote | 2 votes | 3 votes | Time | α |
|---|---|---|---|---|---|---|
| Violence | 198 (53 %) | 98 (26 %) | 60 (16 %) | 21 (6 %) | 39 | 0.41 |
| Death | 82 (31 %) | 60 (22 %) | 37 (14 %) | 88 (33 %) | 30 | 0.36 |
| War | 86 (34 %) | 82 (32 %) | 53 (21 %) | 34 (13 %) | 18 | 0.22 |
| Abduction | 97 (31 %) | 78 (25 %) | 97 (31 %) | 41 (13 %) | 29 | 0.25 |
| Racism | 310 (56 %) | 120 (22 %) | 71 (13 %) | 43 (8 %) | 29 | 0.52 |
| Homophobia | 678 (65 %) | 175 (17 %) | 119 (11 %) | 69 (7 %) | 27 | 0.24 |
| Misogyny | 238 (47 %) | 141 (28 %) | 94 (18 %) | 38 (7 %) | 31 | 0.25 |
| Ableism | 545 (66 %) | 180 (22 %) | 83 (10 %) | 19 (2 %) | 24 | 0.25 |
| Total | 2,234 (54 %) | 935 (23 %) | 613 (15 %) | 353 (9 %) | 33 | 0.35 |

(c)

| Warning | Train (ID) | Train (OOD) |
|---|---|---|
| Violence | 1,001 | 675 |
| Death | 504 | 178 |
| War | 787 | 202 |
| Abduction | 471 | 252 |
| Racism | 227 | 68 |
| Homophobia | 273 | 129 |
| Misogyny | 337 | 152 |
| Ableism | 215 | 162 |
| Total | 3,815 | 1,818 |
3 Task Design
Our approach to trigger annotation is the result of several small test runs and pilot studies that provided us with the insights to make design decisions for our annotation task. From these studies, we derived the following three key constraints for the task of trigger annotation at the passage level:
Trigger Diversity
The trigger warnings used in practice are based on the personal opinions and experiences of many authors and thus very diverse. The WTWC-22 organizes them into seven categories containing 36 warnings, encompassing hundreds of thousands of variants. Annotating a given passage with all 36 warnings, let alone all variants, has proven infeasible at scale under reasonable budget constraints. We therefore select eight common warnings for annotation, four each from the two most frequently assigned warning categories Aggression and Discrimination of the WTWC-22 taxonomy. From Aggression we use all four warnings Death, Violence, Abduction, and War. From Discrimination we use the four most frequently assigned warnings Misogyny, Racism, Homophobia, and Ableism. The former relate primarily to physical harm, the latter to psychological harm.
Trigger Sparsity
Harmful passages are often sparsely distributed over long documents (e.g., many fan fiction works are as long as books), leading to class imbalance between positive and negative cases. We therefore resort to the commonly used approach of dictionary-based retrieval to obtain a sufficient number of positive candidate passages, as detailed below; retrieval-guided annotation has previously been employed to create NLP resources. This entails selection bias, since very subtle cases of triggering content may not contain any explicit mention of one of the dictionary’s phrases, and since some important phrases might be missing from our dictionaries, both leading to false negatives. We analyze this possibility with a targeted out-of-distribution experiment.
Trigger Severity
Authors and readers are not equally sensitive to a potentially triggering concept (see Section 4.4). A reader who enjoys horror may find fewer references to Death in fiction triggering than readers who do not. For instance, the positive example in Figure 1 might be considered enticing rather than harmful. Therefore, when annotators are asked to make a descriptive binary decision, they often disagree. This is not uncommon in work on harmful content and, according to rottger:2022, even desirable in an initial study like ours. However, this presents difficulties when assessing annotations and annotators, as inter-annotator agreement measures become less reliable due to different reader sensitivity to a particular trigger. This prevents the use of crowdsourcing platforms for annotation, as we cannot (i) quantitatively evaluate annotators, (ii) train them reliably ahead of time, and (iii) provide appropriate support during annotation. We therefore recruited local annotators and provided them with personal support.
4 Dataset Construction
We constructed a dataset of 4,135 passages of five consecutive sentences, each annotated with three binary human labels for one of eight selected trigger warnings. All passages originate from the Webis Trigger Warning Corpus 2022 (WTWC-22) wiegmann:2023a, which compiles fan fiction documents with document-level trigger warnings. Figure 1 shows example passages and Table 1 gives an overview of our dataset.
4.1 Retrieving Triggering Passage Candidates
We collected the passages for annotation using a keyword-based retrieval approach. We first constructed a keyword list for each of the eight considered warnings, then retrieved matching documents from the WTWC-22 using an initial BM25 retrieval with the keywords as query terms (via Elasticsearch 7.17.4 with the default BM25 scoring on the chapter text, applying HTML removal and stemming on top of the default pre-processing). From each document, we extracted the first sentence with a keyword match and added the two preceding and two succeeding sentences as context, which is often necessary to understand the central sentence. We did not use the original paragraph segmentation of the documents, since paragraphs varied greatly in length, particularly because dialog turns are often written as a single paragraph. The resulting five-sentence passages were de-duplicated across warnings and annotated in order of the BM25 score of the originating document.
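The extraction step above can be sketched as follows; the function name and signature are illustrative, not the paper’s released code:

```python
import re

def extract_passage(sentences, keywords, context=2):
    """Return the first keyword-matching sentence plus two sentences of
    context on each side, joined into a single five-sentence passage.

    `sentences` is a document already split into sentences; `keywords`
    is one warning's cleaned keyword list.
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            start = max(0, i - context)
            return " ".join(sentences[start:i + context + 1])
    return None  # no keyword match in this document
```

The resulting passages would then be de-duplicated across warnings and queued for annotation in order of the source document’s BM25 score.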
Keyword List Construction.
We built a list of keywords and phrases for each of the eight trigger warnings by prompting gpt-3.5-turbo-0301 with the following prompt, slightly morpho-syntactically adapted for each warning:
Provide a list of <warning> language that must be avoided in a story for people that are triggered by <warning>. Return the list in JSON format.
We manually cleaned this initial list to remove redundant (i.e., lexically similar) phrases, phrases that better match a different warning, and ambiguous phrases that produce many false positives, reducing the initial keywords by ca. 32 %. Table 1a shows the number of words and phrases on each warning’s list, before and after cleaning. We split each cleaned list into two sets of keywords, in the middle according to the order returned by the model. The set identity is used to distinguish between in- and out-of-distribution examples (cf. Section 5.1). Table 3 (A.1) illustrates the list construction process for the Death and Misogyny warnings.
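The split can be sketched as a simple halving in generation order (function name and the exact-duplicate filter are ours; the manual cleaning of ambiguous or mismatched phrases is a human step not captured here):

```python
def clean_and_split(raw_keywords):
    """Drop exact duplicates while preserving the model's generation
    order, then halve the list: set 1 feeds the in-distribution sample,
    set 2 is held out for the out-of-distribution experiments."""
    seen, cleaned = set(), []
    for kw in raw_keywords:
        k = kw.strip().lower()
        if k and k not in seen:
            seen.add(k)
            cleaned.append(k)
    mid = len(cleaned) // 2
    return cleaned[:mid], cleaned[mid:]
```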
| Warning | κ 1-2 | κ 1-3 | κ 2-3 | Overlap 1-2 | Overlap 1-3 | Overlap 2-3 | Mean κ 1 | Mean κ 2 | Mean κ 3 | PR 1 | PR 2 | PR 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Violence | 0.38 | 0.33 | 0.53 | 0.82 | 0.80 | 0.82 | 0.36 | 0.46 | 0.43 | 0.09 | 0.25 | 0.25 |
Death | 0.34 | 0.29 | 0.44 | 0.78 | 0.74 | 0.78 | 0.32 | 0.39 | 0.36 | 0.19 | 0.25 | 0.28 |
War | 0.18 | 0.08 | 0.38 | 0.78 | 0.77 | 0.81 | 0.13 | 0.28 | 0.23 | 0.12 | 0.19 | 0.18 |
Abduction | 0.27 | 0.31 | 0.26 | 0.77 | 0.67 | 0.65 | 0.29 | 0.26 | 0.29 | 0.20 | 0.20 | 0.47 |
Racism | 0.66 | 0.43 | 0.46 | 0.83 | 0.71 | 0.73 | 0.55 | 0.56 | 0.44 | 0.44 | 0.49 | 0.56 |
Homophobia | 0.23 | 0.15 | 0.35 | 0.60 | 0.61 | 0.67 | 0.19 | 0.29 | 0.25 | 0.34 | 0.56 | 0.37 |
Misogyny | 0.39 | 0.14 | 0.25 | 0.74 | 0.69 | 0.73 | 0.26 | 0.32 | 0.20 | 0.31 | 0.30 | 0.14 |
Ableism | 0.13 | 0.31 | 0.35 | 0.57 | 0.66 | 0.71 | 0.22 | 0.24 | 0.33 | 0.49 | 0.33 | 0.31 |
Total | 0.36 | 0.28 | 0.41 | 0.76 | 0.73 | 0.76 | 0.32 | 0.38 | 0.34 | 0.21 | 0.28 | 0.29 |
4.2 Passage Annotation
Figure 1 shows two passages, one for each annotation decision. The collected passages were each annotated as either positive (requires a warning) or negative (does not require a warning) by three different annotators. We decided the final annotation by majority vote.
Instead of annotating a fixed number of passages per warning, the first annotator of each warning rated continuously, in order of the BM25 score of the source document, until 50 passages were marked positive in each set (100 per warning). All passages marked either positive or negative by the first annotator were then also rated by the other two annotators. This step is necessary because the ratio of positive to negative passages varies between warnings and is very low (1:10) for some, like Death and War in set 2; an equal number of passages for each warning would thus have resulted either in a very high number of annotations or in a low number of positive examples for some classes.
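The first-annotator stopping rule and the subsequent vote aggregation can be sketched as follows (the rater function stands in for a human annotator; names are ours):

```python
def annotate_until_quota(passages, rate_fn, quota=50):
    """Rate passages in BM25 order until `quota` positives are found.
    `rate_fn` maps a passage to 0 or 1."""
    labeled, positives = [], 0
    for passage in passages:
        label = rate_fn(passage)
        labeled.append((passage, label))
        positives += label
        if positives >= quota:
            break
    return labeled

def aggregate_votes(votes, threshold=2):
    """Final label from three binary votes; threshold=2 is the majority
    vote used here, threshold=1 the minority vote used in Section 5."""
    return int(sum(votes) >= threshold)
```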
Annotation Task Design
The annotation task was designed as a binary classification decision (we used Label Studio 1.6.0rc5 as the annotation system). Annotators were presented with one passage at a time; an example is shown in Figure 2. The annotation screen consisted of five parts: (i) a description of the ‘Persona’ which the annotators should adopt for their decision, (ii) the ‘Definition’ of the warning, which we created manually to have a similar length and scope across warnings, (iii) two ‘Demonstrations’, one positive and one negative, selected by the authors, (iv) the ‘Instructions’, which explained the binary classification decision, and (v) the ‘Passage’ to be annotated.
Annotator Instruction and Monitoring
In total, we recruited seven permanent annotators of different genders and backgrounds. The passages were assigned by warning, i.e., annotator 1 rated all Misogyny passages, annotator 2 all Racism passages, and so on. The annotators could freely choose which warning to annotate, but were asked to pick one they were familiar with, if possible.
Following kirk:2022, we tried to reduce the stress on the annotators in two ways: First, we set no deadlines and we encouraged annotators to work in small batches over a longer period (6 weeks). Second, we arranged for weekly personal meetings with the annotators to discuss unclear cases, to monitor their well-being regarding the task, and to refine the shared understanding of the task through open discussion. The annotators were allowed to re-iterate and modify their annotations at any time.
4.3 Evaluation
We evaluate (i) if the keyword-based passage retrieval is effective, (ii) if the dataset is large and balanced enough for machine learning, and (iii) if the task design facilitates high-quality annotations.
Passage Retrieval
The annotation results in Table 1b and Table 2 show that 1,901 (46 %) of the retrieved passages received at least one positive vote, so a sensitive reader might desire a warning. The highest positive rate (PR) is 69 % for Homophobia and Racism, the lowest is 34 % for War. Considering that most passages in any given document are negative, we consider our keyword-based passage retrieval effective in recalling good annotation candidates. However, some ambiguous keywords like ‘hit’ (Violence) or ‘occupy’ (War) retrieve many off-topic examples, lowering precision. Although these are often easy to annotate, better filtering would reduce annotation costs.
Size
Table 1b shows that negative instances are more frequent than positives. Discrimination warnings have a higher positive ratio than Aggression warnings. While the complete dataset is nearly balanced (45 % PR) under a 1-vote threshold, the data skew towards negatives (23 % PR) under a 2-vote threshold. The balance varies between keywords, which is a problem for standard splits because the test or validation sets may have few positive (Set 2 War) or negative (Set 1 Death) instances. Instead, we opted for cross-validation.
The imbalance has two causes. First, some keywords retrieve more severe passages: for example, Racism Set 1 contains many strong slurs and has a high positive rate (99 %), while Death Set 2 (31 % positive rate) contains many fantasy concepts that are also used in harmless contexts. Second, annotators differ in their sensitivity to some warnings (see Table 2), like Rater 1 on Violence (9 % PR compared to 25 %) or Rater 3 on Abduction (47 % compared to 20 %).
Annotation Quality
Table 1b and Table 2 show a chance-corrected inter-annotator agreement of 0.22–0.52 (mean 0.35 Krippendorff’s α). This indicates “fair agreement”, which is consistent with similar tasks (0.34–0.58 Cohen’s κ over two annotators for binary offensive language by pitenis:2020; 0.20 Fleiss’ κ over 20 annotators for binary hateful language by rottger:2022). We expected a reasonable degree of disagreement due to the subjective nature of trigger warnings: text appears harmful to individuals based on their personal lived experience, which naturally varies between annotators. We still evaluated our annotators quantitatively by measuring the pairwise overlap (0.74–0.77 mean across all warnings) and the pairwise agreement (0.31–0.44 Cohen’s κ). One annotator systematically disagreed with the other two (Racism, Annotator 3, with a mean κ of −0.13), so these annotations were repeated by another annotator.
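For two annotators with binary labels, the reported overlap and chance-corrected agreement reduce to the following (a minimal sketch; libraries such as scikit-learn offer equivalent functions):

```python
def overlap(a, b):
    """Raw pairwise agreement: fraction of passages labeled identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected pairwise agreement (Cohen's kappa) for binary labels."""
    po = overlap(a, b)  # observed agreement
    p_pos_a, p_pos_b = sum(a) / len(a), sum(b) / len(b)
    # expected agreement under independent annotators
    pe = p_pos_a * p_pos_b + (1 - p_pos_a) * (1 - p_pos_b)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```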
4.4 Subjectivity and Annotator Beliefs
Table 1b shows the passage count by number of positive votes. Out of all passages, 55 % were unanimously negative, but only 9 % unanimously positive. The negatives are partially explained by off-topic passages retrieved by ambiguous keywords, although this does not account for all of them (see Figure 5). The variance across positive votes is explained by varying annotator sensitivity and different beliefs about when a warning is required. Figure 3 shows typical positive passages. Unanimously positive passages are often severe cases with heavy use of slurs and very graphic descriptions. Non-unanimous examples feature unfamiliar or unrelatable settings (fantasy, science fiction) or concepts (‘conversion therapy’), background or implied mentions, or a mix of warnings that confused the annotators. Contributing to this are keywords like ‘undead’, ‘reanimation’, or ‘necromancy’, often contained in fantastical scenes (see Table 3). Removing those keywords would reduce annotation load and annotator uncertainty but would also reduce recall, remove positive passages, and further limit the scope of the concepts for model training.
The vote aggregation threshold impacts the characteristics of a classifier. A low threshold (1 vote) produces recall-oriented classifiers for warning assignment, while a high threshold (2–3 votes) produces precision-oriented classifiers suitable for moderation tools. We assume that resolving annotator disagreement will require either a form of personalization with a per-warning known sensitivity, or a prescriptive transformation of the annotation task, where the triggers are intensionally defined.
5 Passage Classification
We structure our experiments around four relevant design decisions and analyze their effect on the performance across models and warnings.
Class Modeling
We comparatively evaluate binary, multiclass, and multilabel modeling via fine-tuned classification. We formulated the annotation task as binary classification: given a passage and a warning, decide if the warning should be assigned. Hence, our reference baseline and our evaluation setting is binary classification. However, binary classification is expensive to scale, since each new warning requires a new classifier and new training data, and each classifier ignores most of the annotated data. Under multiclass modeling, only one classifier is needed, so the training data is better utilized, although such a classifier can only predict one warning per passage. Multilabel modeling combines the advantages of binary and multiclass, but is more difficult to train, and our (binary) data might not be sufficient for it.
Fine-tuning vs. Few-shot Learning
We comparatively evaluate six few-shot prompted generative LLMs, which can scale to a high-dimensional warning taxonomy like NEON. Trigger warnings are potentially open-class and might require personalization, which is difficult to achieve with model fine-tuning.
Vote Aggregation
We comparatively evaluate majority (2 votes count as positive) and minority voting (1 vote counts as positive) across all other experimental configurations. Our dataset evaluation suggested that examples containing clear and intense harm often receive multiple positive votes (24% with 2–3), while examples with one vote (23%) often contain only mild or implicit harm. The minimum number of votes required for an example to count as positive for classification will thus influence the sensitivity of the classifier.
Keyword Distribution
We evaluate all experiments across in-distribution (ID) and out-of-distribution (OOD) samples of the dataset. Using keyword-based filtering to pre-select passages may limit how well the classifiers generalize to unseen examples that match the respective warnings but not the concepts captured by the keywords. We simulate this situation by splitting the keywords (and the passages retrieved using them) into two non-overlapping sets (cf. Section 4.1) and sampling datasets once from both sets combined and once separately.
5.1 Experiment Datasets
We compiled a total of 16 initial, unbalanced datasets, one for each of the eight warnings with majority and minority vote aggregation, respectively. To get comparable results, we sampled balanced test sets with 20 positive and 20 negative instances from each of the 16 unbalanced datasets, which is the limit imposed by the smallest set (Misogyny Set 2). Since 40 instances are too few to get stable results, we created six random folds: five for a Monte Carlo cross-validation and a sixth for parameter tuning.
For the in-distribution experiments, the test data are randomly drawn from all examples (i.e., they share passages retrieved by both keyword sets), while all other instances remain for training. In the out-of-distribution experiments, the test data are randomly drawn from all passages retrieved by keyword set 2, all other set 2 instances are discarded, and all set 1 instances remain for training. Table 1c shows the number of examples in the training and test datasets.
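The split procedure can be sketched as follows; the instance format and function names are our reconstruction, not the released code:

```python
import random

def sample_fold(examples, mode="id", per_class=20, seed=0):
    """Draw one balanced 20+20 test fold; the rest forms the training set.

    Each example is a (passage, label, keyword_set) triple. In OOD mode
    the test fold is drawn from set 2 only, the remaining set-2 examples
    are discarded, and set 1 forms the training data.
    """
    rng = random.Random(seed)
    pool = [e for e in examples if mode == "id" or e[2] == 2]
    pos = [e for e in pool if e[1] == 1]
    neg = [e for e in pool if e[1] == 0]
    test = rng.sample(pos, per_class) + rng.sample(neg, per_class)
    if mode == "id":
        train = [e for e in examples if e not in test]
    else:  # OOD: train only on set-1 passages
        train = [e for e in examples if e[2] == 1]
    return train, test
```

Repeating this with different seeds yields the five cross-validation folds plus the tuning fold.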
5.2 Models and Training
We evaluate five fine-tuning-based classification models: Binary, Binary+ (which includes negative instances from other classes), Multiclass, Multilabel (without any all-negative instances), and Multilabel+ (with all-negative instances). We also evaluate six generative LLMs: GPT 3.5, GPT 4, Mistral 7B, Mixtral 8x7B, Llama 7B, and Llama 13B. We implemented the models using Huggingface’s transformers library (4.35.0). We prompted GPT via OpenAI’s API.
Binary
We trained eight binary classification models for each configuration, one for each warning, on all positive and negative warning-passage pairs of the respective warning. We also trained a version Binary+ where each classifier was trained on the positive instances of the respective warning and the negative instances of all other warnings. We hypothesize that this expansion might improve the classification in the out-of-distribution configurations.
Multiclass
We trained one multiclass classifier by combining the eight training datasets: we assigned each positive instance a class label for its warning (classes 0–7) and all negative instances to class 8. We test the model on the binary datasets, counting a prediction as positive if it names the class corresponding to the dataset’s warning, and as negative otherwise. For example, when predicting for Death, all predicted classes except Death count as negative predictions. This is a limitation imposed by annotating the dataset in a binary fashion.
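The mapping from nine-way predictions back to the binary test decision can be sketched as:

```python
def multiclass_to_binary(predictions, warning_class):
    """Evaluate a multiclass model on one warning's binary test set.

    `predictions` are 9-way class ids (0-7 for the warnings, 8 for the
    shared negative class). A prediction counts as positive only if it
    names the dataset's own warning class; every other class, including
    the other warnings, counts as negative.
    """
    return [int(p == warning_class) for p in predictions]
```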
Multilabel
We trained one multilabel classifier by combining the eight training datasets, converting the positive binary class labels to the respective one-hot label vectors, and discarding all negative labels. This corresponds to a one-class paradigm that can easily be expanded for new warnings. We also trained a version Multilabel+ that includes all negative examples with a zero vector as label (see limitations below). We test the models, similarly to multiclass, by only considering a model’s prediction for a class if the instance was annotated for that class, i.e., we ignore all predictions for other classes.
Since the training passages are only annotated in a binary way for the respective warning and there is no overlap between passages across warnings, converting the examples to multilabel class vectors introduces errors: if a passage is positive for a warning it was not annotated for, the class vector becomes incomplete and might confuse the classifier. Although multilabel class modeling is the most practically convenient, it is therefore unlikely to reach high performance. Adding negative examples for the Multilabel+ classifier will likely exacerbate this problem (cf. Figure 3 (4)).
However, since the passages are short, they are usually only positive for multiple warnings from within the same category (either Aggression or Discrimination). An example selected for Death may be positive for War or Violence, but is likely not positive for Racism. This means we can avoid adding many mislabeled negatives by splitting the Multilabel+ classifier into two: one multilabel classifier for all Aggression warnings, to which only negatives from Discrimination are added, and vice versa.
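The category-split construction can be sketched as follows (instance format and names are ours):

```python
AGGRESSION = ["Violence", "Death", "War", "Abduction"]
DISCRIMINATION = ["Racism", "Homophobia", "Misogyny", "Ableism"]

def build_multilabel_plus(instances, warnings, other_category):
    """Training data for one category's Multilabel+ classifier.

    `instances` are (passage, warning, label) triples from the binary
    annotation. Positives from `warnings` become one-hot vectors;
    zero-vector negatives are taken only from `other_category`, whose
    passages are unlikely to be unannotated positives for `warnings`.
    """
    data = []
    for passage, warning, label in instances:
        if label == 1 and warning in warnings:
            vec = [0] * len(warnings)
            vec[warnings.index(warning)] = 1
            data.append((passage, vec))
        elif label == 0 and warning in other_category:
            data.append((passage, [0] * len(warnings)))
    return data
```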
GPT 3.5 and GPT 4
We prompted the models ‘gpt-3.5-turbo-0125’ and ‘gpt-4-0125-preview’ via the OpenAI API as described in Section B.3. We only prompted the models for the respective test splits across all warnings and folds.
Mistral 7B and Mixtral 8x7B
We implemented both Mistral models via Huggingface’s transformers library, using the ‘mistralai/Mistral-7B-Instruct-v0.2’ and ‘mistralai/Mixtral-8x7B-Instruct-v0.1’ checkpoints, respectively. We encoded the prompt using Mistral’s chat template (without system message) and prefixed it with: You are a classification model that only answers with ’yes’ or ’no’. Mistral 7B was tested on one A100 40GB; Mixtral 8x7B was loaded with 8-bit quantization and tested on three A100 40GB.
Llama 7B and 13B
We implemented both Llama models via Huggingface’s transformers library, using the ‘meta-llama/Llama-2-7b-chat-hf’ and ‘meta-llama/Llama-2-13b-chat-hf’ checkpoints, respectively. We used the prompt directly and without the chat template, because the models produce difficult-to-parse output when using the chat template as we did with Mistral. Llama 7B was tested on two A100 40GB with a batch size of 12; Llama 13B used the same GPUs with a batch size of 8.
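Since the generative models answer in free-form text, their output must still be mapped to a binary label. A defensive parser along the following lines, defaulting to the negative class, is our assumption; the paper does not specify its exact parsing rule:

```python
def parse_yes_no(response):
    """Map a model's free-form answer to a binary label; answers with
    neither token default to the negative class."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    # fall back to a word-boundary substring check for chattier answers
    return int("yes" in text and " no" not in f" {text}")
```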
5.3 Evaluation
We test all models on the balanced, binary test datasets for ease of comparison and report accuracy and positive rate. For the multiclass and multilabel models, we only count the predictions relating to the respective warning (cf. Section 5.2). We report the mean performance across all five folds for each model, and macro-averages across models and warnings.
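Per fold, the two reported measures reduce to:

```python
def fold_metrics(y_true, y_pred):
    """Accuracy and positive rate (fraction of positive predictions)
    on one balanced binary test fold."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    positive_rate = sum(y_pred) / n
    return accuracy, positive_rate
```

On a balanced fold, a degenerate all-negative classifier scores 0.5 accuracy with a positive rate of 0, which is why both measures are reported together.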
6 Results and Discussion
Figure 4 and Table 5 (A.2) show that models typically score between ca. 0.6–0.7 accuracy on average across all warnings, depending on the model and data configuration, while the top models per warning score between ca. 0.7–0.8. The highest-scoring models are Multiclass on in-distribution data (ID) with minority voting (MinV) (0.82), Mixtral on ID (0.74) and out-of-distribution data (OOD) (0.71) with majority voting (MajV), and GPT 3.5 with 0.7 on OOD/MinV. The lowest mean scores are barely above random and occur for some worst-case configurations. The best per-warning score is 0.91 (Multiclass on Homophobia). Multiclass is the most effective model in 10 of 32 cases across warnings and configurations, followed by Mixtral with 9. The scores vary between warnings: Violence has the highest (0.70–0.72 mean accuracy) and Racism the lowest (0.52–0.69). Aggression warnings score higher than Discrimination warnings.
Class Modeling
Model effectiveness varies significantly between modeling strategies: Multiclass outperforms Binary (0.11 pp. for ID/MinV) and Multilabel+ (0.23 pp. for ID/MinV; 0.07 pp. for ID/MajV), and Binary outperforms Multilabel (0.09 for OOD/MinV). All other differences are not significant. Adding more negative examples to Binary shows no significant difference, and adding negative examples to Multilabel improves the measured performance within the 95 % CI by 0.01–0.06. Although Multiclass is the most effective fine-tuned model, it works only if the warnings are separable and rarely overlap. Since this is often not the case (cf. Figure 3 (5)), the binary model is a more realistic alternative.
Fine-tuning vs. Few-shot Learning
The results show that all fine-tuned models perform equally well as or worse than the best few-shot model (Mixtral) in the OOD configuration. In the ID configuration, the fine-tuned models score comparably or higher (Multilabel for ID/MinV). The fine-tuned models are also more efficient in time and energy and should be preferred whenever possible.
Another notable difference is that all few-shot models have a higher average positive rate (PR) than the fine-tuned models (9–49 pp.), so the generative LLMs can be considered more sensitive. The reason for this is unclear, although explicit harm avoidance during training is a possible explanation.
Vote Aggregation
Increasing the aggregation threshold from one to two votes significantly increases the accuracy for Multilabel+ and decreases it for Multiclass, for Ableism, and for Abduction (all ID). In addition, the inter-quartile range across folds (box plots in Figure 4) is smaller for MajV. This is likely explained by the reduced number of positive examples, which lowers the variance between folds since there are far fewer positives to draw from.
Increasing the threshold also reduces the PR (3–28 pp.) across all fine-tuned models but increases it (6–10 pp.) across all few-shot models. The increase is likely explained by the fact that, at higher thresholds, more instances with at least one vote are sampled into the negative class; since the prompt did not change, the few-shot models' positive rates rise accordingly.
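The threshold mechanics can be sketched as follows (an assumed interpretation of the paper's MinV/MajV aggregation over three annotator votes, not the authors' code):

```python
# Aggregating three annotator votes into a gold label at a given threshold.
# MinV counts a passage as positive with at least one vote, MajV with at
# least two (the majority of three annotators).

def aggregate(votes, threshold):
    """votes: list of 0/1 annotator judgments; returns the gold label."""
    return int(sum(votes) >= threshold)

passages = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]

minv = [aggregate(v, threshold=1) for v in passages]
majv = [aggregate(v, threshold=2) for v in passages]

print(minv)  # [1, 1, 1, 0]
print(majv)  # [1, 1, 0, 0]
```

Note how the third passage (a single vote) flips from positive under MinV to negative under MajV: raising the threshold moves all 1-vote passages into the negative class, which shrinks the pool of positives, as discussed above.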
Keyword Distribution
Overall, all models are less effective in the OOD configuration. This is expected for fine-tuned models, which can rely less on learned lexical cues, but not necessarily for few-shot models. A possible explanation is that we split the keywords in the order of generation, so keywords that are strongly associated with a trigger end up in Set 1. The OOD test set then contains passages that are less strongly associated with the trigger, reducing the few-shot scores.
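A minimal sketch of this split (an assumed reading of the procedure; the split point and example keywords are illustrative, the keywords are drawn from the Death list in A.1):

```python
# Splitting a generated keyword list in order into Set 1 (ID training/test)
# and Set 2 (OOD test). Because generation tends to emit the most strongly
# trigger-associated terms first, Set 1 is more strongly tied to the trigger.

def split_keywords(keywords_in_generation_order):
    mid = len(keywords_in_generation_order) // 2  # assumed split point
    return (keywords_in_generation_order[:mid],   # Set 1
            keywords_in_generation_order[mid:])   # Set 2

set1, set2 = split_keywords(
    ["kill", "murder", "corpse", "undead", "necromancy", "revival"])
print(set1)  # ['kill', 'murder', 'corpse']
print(set2)  # ['undead', 'necromancy', 'revival']
```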
In the OOD configurations, all fine-tuned models are significantly worse by ca. 0.1–0.2 on average, with an increased spread of per-warning scores, while the few-shot models score as on ID (except for a drop on Racism). Fine-tuned models remain competitive for most Aggression warnings but are ineffective for Discrimination warnings, with barely above-random accuracy and very low PRs. This is likely due to systematic differences between the keyword sets (cf. 3): while the Aggression keywords are fairly consistent between sets, Discrimination keywords often start with a series of slurs (Set 1) and become more conceptual later (Set 2), leading to a topical gap between OOD training and test data. This was the goal of the configuration; it shows that fine-tuned models generalize poorly in this case and highlights the importance of diverse training data.
6.1 Causes of Misclassification
Table 5 shows selected passages that are always misclassified. Systematic misclassifications include passages that (i) are on-topic but never rated as triggering by the annotators, (ii) mention the topic only implicitly or in the background (e.g., historic discrimination or descriptions of the setting) and are likewise never rated as triggering, and (iii) are edge cases rated positively by only one annotator. Misclassified instances align with annotator disagreement: across all few-shot models, many unanimously annotated instances (56%/75% of those with 0 or 3 votes) are always classified correctly, but only 15% of those with 1 vote. Mixtral classifies 70–91% of passages with 0, 2, or 3 votes correctly, but only 30% of 1-vote instances. The fine-tuned models also misclassify examples with disagreement more often, though less extremely: Binary (ID) is correct for 66% of 3-vote but only 44% of 2-vote instances.
7 Conclusion
In this paper, we seek to identify the exact text passages that prompt authors to preface their works with trigger warnings. We model the task as binary classification and create a dataset of 4,135 English passages, each annotated by three human votes across 8 trigger warnings. We investigate how different beliefs about triggers contribute to the assignment of warnings by quantitatively and qualitatively analyzing annotator disagreement and the behavior of 11 classifiers across different design decisions: vote aggregation, keyword distribution, class modeling, and fine-tuning vs. few-shot learning.
Our keyword-based passage retrieval identifies many positive instances, ranging from severe and graphic passages with high agreement to mild and implicit ones where the annotators' opinions differ based on their personal sensitivities and beliefs about the need for trigger warnings. We find that classification errors occur more frequently where annotators disagree. Furthermore, we show experimentally that, first, diverse training data is required for models to generalize well to unseen concepts and rare triggers; second, few-shot models like Mixtral are competitive (albeit computationally expensive), especially for unseen triggers; and third, fine-tuned models are often still the best choice for individual warnings and certain configurations. It is therefore advantageous to select a model specifically for the targeted warning or configuration, depending on the goal (e.g., deleting content vs. adding warnings) and the targeted sensitivity.
In conclusion, we question whether authors can determine a trigger warning equally well for every reader. Personalizing trigger warning assignment therefore seems a fruitful direction for future computational work.
Limitations
We consider four limitations. First, we only consider eight warnings from two categories in this study, all of them among the more frequent ones. The WTWC-22 corpus contains 36 warnings from seven categories, while other taxonomies list up to 90. Consequently, our strategy of generating keywords and collecting passages to annotate might not work for rare or highly individual warnings. The annotation instructions and prompts may also not generalize to some of these warnings, and model performance may differ.
Second, we recruited annotators from a population of computer science students, which is a shared demographic and hence a source of bias. All annotations were done by individuals who are not (clinically) affected by any trigger but who have different sensitivities to the triggering concepts. The latter limitation holds for almost all trigger warnings, which are usually assigned by the authors of the content. However, more reliable annotations would require annotators who are actually triggered by the concepts described by the annotation categories; we know of no practical and ethically feasible way to create such annotations.
Third, we cannot make claims about which parts of the annotation instructions and the prompts (cf. Figure 2) influence the performance of the annotators or the models beyond our ablation study.
Fourth, we did not quantitatively evaluate whether our annotation instructions are optimal or whether a different annotation paradigm would reduce the subjectivity of the task without limiting the captured beliefs about trigger warnings.
Impact Statement
As with any work on harmful content in the age of generative AI, our data, code, and insights may be used in bad faith to generate harmful content or for adversarial engineering to evade detection algorithms. The artifacts we use (WTWC-22) are for academic use only, as are the artifacts we created: our data and code.
It is safe to assume that, due to the inherently subjective nature of triggering, future work will focus on personalized decision-making for harmful content classification. However, personalization here is a highly sensitive issue, since it requires deep knowledge about individuals and their vulnerabilities.
We followed best practices for annotating harmful content. The student annotators were not paid beyond teaching credit, of which the annotations were only one part. Annotators were not pressured to complete annotations, were given the option of opting out of categories they found particularly disturbing, and it was made clear that grading was independent of the completion of the annotation work.
References
- Adams et al. (2017) C. J. Adams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, and Will Cukierski. 2017. Toxic comment classification challenge.
- Banko et al. (2020) Michele Banko, Brendon MacKeen, and Laurie Ray. 2020. A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, WOAH 2020, Online, November 20, 2020, pages 125–137. Association for Computational Linguistics.
- Bridgland et al. (2023) Victoria M. E. Bridgland, Payton J. Jones, and Benjamin W. Bellet. 2023. A meta-analysis of the efficacy of trigger warnings, content warnings, and content notes. Clinical Psychological Science, 0(0):21677026231186625.
- Charles et al. (2022) Ashleigh Charles, Laurie Hare-Duke, Hannah Nudds, Donna Franklin, Joy Llewellyn-Beardsley, Stefan Rennick-Egglestone, Onni Gust, Fiona Ng, Elizabeth Evans, Emily Knox, et al. 2022. Typology of content warnings and trigger warnings: Systematic review. PloS one, 17(5):e0266722.
- Davani et al. (2022) Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Trans. Assoc. Comput. Linguistics, 10:92–110.
- Kirk et al. (2022) Hannah Kirk, Abeba Birhane, Bertie Vidgen, and Leon Derczynski. 2022. Handling and presenting harmful text in NLP research. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 497–510, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Knox (2017) Emily Knox. 2017. Trigger Warnings: History, Theory, Context. Rowman & Littlefield.
- Mollas et al. (2020) Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. ETHOS: an online hate speech detection dataset. CoRR, abs/2006.08328.
- Pitenis et al. (2020) Zesis Pitenis, Marcos Zampieri, and Tharindu Ranasinghe. 2020. Offensive language identification in Greek. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5113–5119, Marseille, France. European Language Resources Association.
- Plaza et al. (2023) Laura Plaza, Jorge Carrillo-de-Albornoz, Roser Morante, Enrique Amigó, Julio Gonzalo, Damiano Spina, and Paolo Rosso. 2023. Overview of EXIST 2023 - learning with disagreement for sexism identification and characterization. In CLEF, volume 14163 of Lecture Notes in Computer Science, pages 316–342. Springer.
- Röttger et al. (2022) Paul Röttger, Bertie Vidgen, Dirk Hovy, and Janet B. Pierrehumbert. 2022. Two contrasting data annotation paradigms for subjective NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 175–190. Association for Computational Linguistics.
- Sandri et al. (2023) Marta Sandri, Elisa Leonardelli, Sara Tonelli, and Elisabetta Jezek. 2023. Why don't you do it right? Analysing annotators' disagreement in subjective tasks. In EACL, pages 2420–2433. Association for Computational Linguistics.
- Stratta et al. (2020) Manuka Stratta, Julia Park, and Cooper deNicola. 2020. Automated content warnings for sensitive posts. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020, Honolulu, HI, USA, April 25-30, 2020, pages 1–8. ACM.
- Uma et al. (2021) Alexandra Uma, Tommaso Fornaciari, Anca Dumitrache, Tristan Miller, Jon Chamberlain, Barbara Plank, Edwin Simpson, and Massimo Poesio. 2021. SemEval-2021 Task 12: Learning with disagreements. In SemEval@ACL/IJCNLP, pages 338–347. Association for Computational Linguistics.
- Wiegmann et al. (2023) Matti Wiegmann, Magdalena Wolska, Christopher Schröder, Ole Borchardt, Benno Stein, and Martin Potthast. 2023. Trigger Warning Assignment as a Multi-Label Document Classification Problem. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12113–12134, Toronto, Canada. Association for Computational Linguistics.
- Wolska et al. (2022) Magdalena Wolska, Christopher Schröder, Ole Borchardt, Benno Stein, and Martin Potthast. 2022. Trigger warnings: Bootstrapping a violence detector for fanfiction. CoRR, abs/2209.04409.
- Wulczyn et al. (2017) Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1391–1399, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
Appendix A Figures and Tables
A.1 Dataset Evaluation
Racism

Prompt: Provide a list of racist language that must be avoided in a story for people that are triggered by it. Return the list in json format.

Phrases in Set 1: nigg*r, chink, spic, gook, wetback, jap, darkie, kike, dago, ngger, chnk, spc, g**k, towelhead, curry muncher, redskin, half-breed, slant eyes

Phrases in Set 2: model minority, savage, primitive, barbaric, uncivilized, tribal, blackface, white supremacy, aryan race, master race, racial purity, colorblind, racism, articulate for, pretty for a, go back to your country, you're not like other, you're one of the good ones, I don't see color, all lives matter

Omitted: sand nigg*r, exotic, oriental, reverse racism, nip, speak english

Death

Prompt: Provide a list of language relating to the death of others, character death, killing, corpses, or coping with loss or grief that must be avoided in a story for people that are triggered by it. Return the list in json format.

Phrases in Set 1: mourn, grieve, deal with loss, cope with grief, bury, inter, cremate, come to terms with, kill, murder, slay, weep, slaughter, sacrifice, take a life, end a life, cause death, corpse, dead body, lifeless form, cadaver, pass away, suffer loss, experience grief, navigate bereavement, struggle with mourning, manage sorrow, cope with death, die, cease, euthanize

Phrases in Set 2: choose to die, end suffering, peacefully pass away, undead, reanimate, come back to life, resurrect, revive, consume flesh, cannibalize, consume brains, devour flesh, tear into, consume blood, drink blood, suck blood, cannibalism, necromancy, flesh-eating, blood-sucking, risen from the dead

Omitted: cry, sob, lament, suffer, comfort, console, remember, honor, respect, accept, process, deal with, face, confront, put down, put to sleep, remains, lose, deal with deceased, struggle, fight, gasping, gasp, drown, sink, fade, weaken, wane, slip away, pass, depart, succumb, let go, release, end life, attack, bite, devour, consume, feast, eat, chew, gnaw, rip, tear, destroy, eliminate, terminate, annihilate, exterminate, rise again, prey upon, feast on, rip apart, consume human flesh, zombify, reanimate as a zombie, drain blood, vampirize, turn into a vampire, resurrection, zombification, vampirism, undeadness, death-like state, revival, post-mortem consumption, reanimation, expire
| Warning | Majority pos | Majority neg | Minority pos | Minority neg | MajV PR Set 1 | MajV PR Set 2 | MinV PR Set 1 | MinV PR Set 2 |
|---|---|---|---|---|---|---|---|---|
| Violence | 188 | 853 | 363 (35%) | 678 | 107 (16%) | 81 (22%) | 205 (30%) | 158 (43%) |
| Death | 114 | 430 | 234 (43%) | 310 | 79 (44%) | 35 (10%) | 122 (69%) | 112 (31%) |
| War | 102 | 725 | 282 (34%) | 545 | 39 (19%) | 63 (10%) | 97 (48%) | 185 (30%) |
| Abduction | 132 | 379 | 273 (53%) | 238 | 59 (23%) | 73 (28%) | 122 (48%) | 151 (57%) |
| Racism | 125 | 142 | 185 (69%) | 82 | 66 (97%) | 59 (30%) | 67 (99%) | 118 (59%) |
| hom*ophobia | 138 | 175 | 216 (69%) | 97 | 52 (40%) | 86 (47%) | 77 (60%) | 139 (76%) |
| Misogyny | 81 | 296 | 179 (47%) | 198 | 57 (38%) | 24 (11%) | 90 (59%) | 89 (40%) |
| Ableism | 87 | 168 | 169 (66%) | 86 | 43 (27%) | 44 (47%) | 92 (56%) | 77 (83%) |
| Total | 966 | 3,169 | 1,901 (46%) | 2,234 | 501 (28%) | 465 (20%) | 872 (48%) | 1,029 (44%) |
A.2 Classification Evaluation
Appendix B Model Training
All fine-tuned models are based on a ‘roberta-base’ checkpoint that was further fine-tuned on fan fiction using masked language modeling (cf. B.1). We conducted a parameter sweep for each fine-tuned model on a sixth fold (cf. B.2). All models were fine-tuned on a single A100 40GB. All generative LLMs were prompted using the instructions shown in Figure 2. Section B.3 describes the prompt and our ablation study in detail. All models used are explained in detail in Section LABEL:model-implementation.
B.1 Language Modeling Fine-tuning of RoBERTa for Fan Fiction
We fine-tuned the ‘roberta-base’ checkpoint on fan fiction documents via masked language modeling using Huggingface’s ‘Trainer’ routine. As data, we extracted all English fan fiction documents from WTWC-22 that were marked as recommended for model training, i.e., where a trigger warning was assigned and where overly long, short, or unpopular works were removed (wiegmann:2023a list the precise parameters). All documents were split into training examples of ca. 450–500 words, respecting sentence and paragraph boundaries. We trained the checkpoint on the resulting 19 million examples for ca. 470,000 steps with a batch size of 32, up to the point of loss convergence on a 10,000-example hold-out validation set. We largely used the standard parameters of the ‘Trainer’, with a random masking function, a masking probability of 0.2, and an initial learning rate of 2e-5.
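The document-splitting step can be sketched as follows (not the authors' preprocessing code; the naive splitting on ‘.’ stands in for proper sentence segmentation, and only sentence boundaries are respected here for brevity):

```python
# Greedily pack sentences into training examples of at most `max_words`
# words, starting a new example whenever adding the next sentence would
# exceed the limit, so no sentence is ever split across examples.

def chunk_document(text, max_words=500):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Example: 50 sentences of 20 words each pack into two 500-word examples.
sentence = " ".join(["word"] * 20)
chunks = chunk_document(". ".join([sentence] * 50) + ".")
print(len(chunks))  # 2
```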
B.2 Fine-tuning Parameter Sweep
We conducted a grid-based parameter sweep on the validation sample (cf. Section 5.1) for all fine-tuning strategies (binary, multi-label, multi-class, with and without extended training data) across three dimensions: in- vs. out-of-distribution data, minority vs. majority vote aggregation, and learning rate within {1e-5, 2e-5, 5e-5}. All models were trained for 20 epochs. The parameters did not vary between warnings. Table 6 shows the final parameter settings each model was trained with.
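The grid can be enumerated as follows (a sketch; the learning-rate candidates are those appearing in Table 6, and the strategy names follow the table):

```python
# Enumerate every training run in the grid-based parameter sweep:
# strategy x distribution x vote aggregation x learning rate.
from itertools import product

strategies = ["binary", "binary extended", "multi-label",
              "multi-label extended", "multi-class"]
distributions = ["id", "ood"]
aggregations = ["minority", "majority"]
learning_rates = [1e-5, 2e-5, 5e-5]

grid = list(product(strategies, distributions, aggregations, learning_rates))
print(len(grid))  # 5 * 2 * 2 * 3 = 60 training runs
```

Table 6 then reports, for each of the 20 strategy/distribution/aggregation combinations, the learning rate that achieved the best validation accuracy.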
| Strategy | Dist. | Vote Agg. | LR | Acc. |
|---|---|---|---|---|
| binary | ood | majority | 1e-5 | 0.62 |
| binary | ood | minority | 2e-5 | 0.64 |
| binary | id | majority | 1e-5 | 0.68 |
| binary | id | minority | 5e-5 | 0.74 |
| binary extended | ood | minority | 2e-5 | 0.62 |
| binary extended | ood | majority | 2e-5 | 0.62 |
| binary extended | id | majority | 2e-5 | 0.64 |
| binary extended | id | minority | 1e-5 | 0.71 |
| multi-label | ood | majority | 2e-5 | 0.58 |
| multi-label | ood | minority | 5e-5 | 0.57 |
| multi-label | id | majority | 2e-5 | 0.59 |
| multi-label | id | minority | 1e-5 | 0.57 |
| multi-label extended | ood | majority | 2e-5 | 0.62 |
| multi-label extended | ood | minority | 2e-5 | 0.58 |
| multi-label extended | id | majority | 1e-5 | 0.66 |
| multi-label extended | id | minority | 1e-5 | 0.57 |
| multi-class | ood | majority | 2e-5 | 0.59 |
| multi-class | ood | minority | 5e-5 | 0.64 |
| multi-class | id | majority | 2e-5 | 0.68 |
| multi-class | id | minority | 5e-5 | 0.84 |
B.3 Prompt Ablation
Figure 2 shows the five parts of the instructions we used both for the annotators and as the prompt for the few-shot models: Instruction, Passage, Persona, Definition, and Demonstrations (one positive and one negative, in that order). We counted a model response as positive if it contained ‘yes’ within the first 5 response tokens and as negative if it contained ‘no’.
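An assumed implementation of this parsing rule (a sketch, not the authors' code; whitespace tokenization and the punctuation stripping are simplifications):

```python
# Map a free-form model answer to a label: positive if 'yes' occurs within
# the first 5 response tokens, negative if 'no' does, else unparseable.

def parse_response(response):
    head = [tok.strip(".,:;!?\"'") for tok in response.lower().split()[:5]]
    if "yes" in head:
        return 1
    if "no" in head:
        return 0
    return None  # answer could not be parsed

print(parse_response("Yes, this passage is triggering."))   # 1
print(parse_response("No. The passage only mentions it."))  # 0
```

A cutoff on the first few tokens avoids counting a ‘yes’ or ‘no’ that appears only deep inside a model's explanation.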
For the prompt ablation, we queried all generative LLMs with open weights (i.e., excluding GPT) across all 4 test settings (aggregation and distribution), averaged the scores across all settings per model, and then across all models to determine the best prompt template. We tested eight prompt template variations (cf. Table 7). Each variation started with Instruction and Passage, followed by all combinations of Persona, Definition, and Demonstrations, including neither. We manually selected demonstrations that we rated as clear and representative of the respective trigger from all unanimously annotated passages. We selected the prompt with all parts, which had the highest average accuracy on the validation sample across all models and was also the one shown to annotators.
| Prompt | Mistral 7B | Mixtral 8x7B | Llama 7B | Llama 13B | Mean Acc. |
|---|---|---|---|---|---|
| Instruction and Passage | 0.72 | 0.74 | 0.52 | 0.50 | 0.62 |
| + Persona | 0.71 | 0.74 | 0.57 | 0.57 | 0.65 |
| + Definition | 0.72 | 0.75 | 0.54 | 0.58 | 0.65 |
| + Demonstrations | 0.69 | 0.70 | 0.58 | 0.64 | 0.65 |
| + Persona + Definition | 0.73 | 0.75 | 0.57 | 0.55 | 0.65 |
| + Persona + Demonstrations | 0.68 | 0.74 | 0.64 | 0.64 | 0.68 |
| + Definition + Demonstrations | 0.68 | 0.75 | 0.67 | 0.63 | 0.68 |
| + Persona + Definition + Demonstrations | 0.68 | 0.75 | 0.68 | 0.59 | 0.68 |
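The eight template variations of the ablation can be enumerated as follows (a sketch of the combinatorics, not the authors' code):

```python
# Every template starts with Instruction and Passage, followed by one of
# the 2^3 = 8 subsets of {Persona, Definition, Demonstrations}, in the
# fixed part order described in B.3.
from itertools import combinations

optional = ["Persona", "Definition", "Demonstrations"]

templates = []
for k in range(len(optional) + 1):
    for subset in combinations(optional, k):
        templates.append(["Instruction", "Passage", *subset])

print(len(templates))  # 8
print(templates[0])    # ['Instruction', 'Passage']
print(templates[-1])   # ['Instruction', 'Passage', 'Persona', 'Definition', 'Demonstrations']
```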