YouTube is a major social media platform that plays a significant role in digital culture, with content creators at its core. These creators often engage in controversial behaviour to drive engagement, which can foster toxicity. This paper presents a quantitative analysis of controversial content on YouTube, focusing on the relationship between controversy, toxicity, and monetisation. We introduce a curated dataset comprising 20 controversial YouTube channels extracted from Reddit discussions, including 16,349 videos and more than 105 million comments. We identify monetisation cues in video descriptions and categorise them into models, including affiliate marketing and direct selling, using curated lists of URLs and keywords. Additionally, we train a machine learning model to measure the toxicity of comments on these videos. Our findings reveal that while toxic comments correlate with higher engagement, they negatively impact monetisation, indicating that controversy-driven interaction does not necessarily lead to financial gain. We also observe significant variation in monetisation strategies, with some creators monetising extensively despite high toxicity levels. Our study contributes a curated dataset, lists of URLs and keywords for categorising monetisation cues, and a machine learning model for measuring toxicity, and it is a significant step towards understanding the complex relationship between controversy, engagement, and monetisation on YouTube. The lists used for detecting and categorising monetisation cues are available at https://github.com/thalesbertaglia/toxmon.
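As a rough illustration of the URL- and keyword-based categorisation described above, the sketch below matches monetisation cues in video descriptions against curated lists. The category names and cue lists here are hypothetical stand-ins; the actual lists are the ones released in the repository.

```python
import re

# Hypothetical excerpts of the curated lists; the full lists are in the
# toxmon repository linked above.
MONETISATION_CUES = {
    "affiliate_marketing": ["amzn.to", "bit.ly", "rstyle.me"],
    "direct_selling": ["teespring.com", "shopify.com", "gumroad.com"],
    "crowdfunding": ["patreon.com", "ko-fi.com", "gofundme.com"],
}

URL_PATTERN = re.compile(r"https?://\S+|[\w.-]+\.(?:com|to|me|net)/\S*")

def categorise_description(description: str) -> set[str]:
    """Return the monetisation categories whose cues appear in a description."""
    urls = " ".join(URL_PATTERN.findall(description.lower()))
    return {
        category
        for category, cues in MONETISATION_CUES.items()
        if any(cue in urls for cue in cues)
    }

print(categorise_description("Support me on https://patreon.com/example!"))
# {'crowdfunding'}
```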
InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection
Thales Bertaglia, Lily Heisig, Rishabh Kaushal, and Adriana Iamnitchi
In Proceedings of the International AAAI Conference on Web and Social Media, 2024
Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives. The first objective (fidelity) is to produce realistic synthetic datasets; for this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data useful for sponsored content detection; for this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
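As a toy illustration of a content-level fidelity check (not the paper's exact metrics), one can compare the lexical diversity of real and synthetic captions; low diversity in the synthetic set is one symptom of the collective lack of variety noted above. The example captions are hypothetical.

```python
def type_token_ratio(captions: list[str]) -> float:
    """Lexical diversity: distinct tokens over total tokens."""
    tokens = [tok for caption in captions for tok in caption.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

real = ["new drop with @brand #ad link in bio", "sunday reset, coffee first"]
synthetic = ["check out this amazing product! #ad", "check out this amazing look! #ad"]

# A large diversity gap between real and synthetic captions is one signal
# that the synthetic set lacks variety, even if each post looks plausible
# on its own.
print(f"real TTR: {type_token_ratio(real):.2f}")
print(f"synthetic TTR: {type_token_ratio(synthetic):.2f}")
```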
Influencer Self-Disclosure Practices on Instagram: A Multi-Country Longitudinal Study
Thales Bertaglia, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi
Submitted to Online Social Networks and Media (under review), 2024
This paper presents a longitudinal study of more than ten years of activity on Instagram, consisting of over a million posts by 400 content creators from four countries: the US, Brazil, the Netherlands, and Germany. Our study shows differences across countries in the professionalisation of content monetisation, yet consistent overall patterns; significant differences in posting frequency, yet similar user engagement trends; and significant differences in the disclosure of sponsored content in some countries, directly connected to national legislation. We analyse shifts in marketing strategies due to legislative and platform feature changes, focusing on how content creators adapt disclosure methods to different legal environments. We also analyse the impact of disclosures and sponsored posts on engagement and conclude that, although sponsored posts have lower engagement on average, properly disclosing ads does not reduce engagement further. Our observations stress the importance of disclosure compliance and can guide authorities in developing and monitoring disclosure regulations more effectively.
Across Platforms and Languages: Dutch Influencers and Legal Disclosures on Instagram, YouTube and TikTok
Content monetization on social media fuels a growing influencer economy, yet influencer marketing remains largely undisclosed or inappropriately disclosed. Non-disclosure has become a priority for national and supranational authorities worldwide, who are starting to impose increasingly harsh sanctions on violations. This paper proposes a transparent methodology for measuring whether and how influencers comply with disclosure obligations based on legal standards. We introduce a novel distinction between disclosures that are legally sufficient (green) and legally insufficient (yellow). We apply this methodology to an original dataset reflecting the content of 150 Dutch influencers publicly registered with the Dutch Media Authority under recently introduced registration obligations. The dataset consists of 292,315 posts and is multilingual (English and Dutch) and cross-platform (Instagram, YouTube and TikTok). We find that influencer marketing remains generally underdisclosed on social media and that bigger influencers are not necessarily more compliant with disclosure standards.
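A minimal sketch of how the green/yellow distinction could be operationalised as a rule-based check; the cue lists below are hypothetical examples, whereas the paper derives its categories from the applicable legal standards.

```python
# Hypothetical cue lists; the paper's categories come from Dutch and EU
# legal standards, not from these exact keywords.
GREEN_CUES = {"#ad", "#advertisement", "#reclame", "#advertentie"}   # legally sufficient
YELLOW_CUES = {"#sp", "#spon", "#collab", "#partner"}                # legally insufficient

def disclosure_status(post_text: str) -> str:
    tokens = set(post_text.lower().split())
    if tokens & GREEN_CUES:
        return "green"
    if tokens & YELLOW_CUES:
        return "yellow"
    return "undisclosed"

print(disclosure_status("Loving this new skincare routine #collab"))  # yellow
```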
Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research
Henry Tari, M. Danial Khan, Justus Rutten, Darian Othman, Thales Bertaglia, and 2 more authors
In Proceedings of the 35th ACM Conference on Hypertext and Social Media, Poznan, Poland, 2024
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising, but further enhancements are necessary to improve the fidelity of the outputs.
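One way to assess the semantic properties mentioned above is to embed real and synthetic posts and compare their similarity. The sketch below uses an off-the-shelf sentence-embedding model as an illustrative assumption, not necessarily the paper's setup; the example posts are hypothetical.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

real_posts = ["Breaking: local council approves new bike lanes",
              "Can't believe the game last night, what a finish"]
synthetic_posts = ["Local government greenlights cycling infrastructure",
                   "What an incredible match yesterday evening"]

real_emb = model.encode(real_posts, convert_to_tensor=True)
synth_emb = model.encode(synthetic_posts, convert_to_tensor=True)

# Cosine similarity between each synthetic post and its closest real post
# gives a rough measure of semantic alignment between the two datasets.
similarity = util.cos_sim(synth_emb, real_emb)
print(similarity.max(dim=1).values)
```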
2023
Sexism in Focus: An Annotated Dataset of YouTube Comments for Gender Bias Research
Thales Bertaglia, Katarina Bartekova, Rinske Jongma, Stephen Mccarthy, and Adriana Iamnitchi
In Proceedings of the 3rd International Workshop on Open Challenges in Online Social Networks, Rome, Italy, 2023
This paper presents a novel dataset of 200k YouTube comments from 468 videos across 109 channels in four content categories: Entertainment, Gaming, People & Blogs, and Science & Technology. We applied state-of-the-art NLP methods to augment the dataset with sexism-related features such as sentiment, toxicity, offensiveness, and hate speech. These features can assist manual content analyses and enable automated analysis of sexism on online platforms. Furthermore, we develop an annotation framework inspired by Ambivalent Sexism Theory to promote a nuanced understanding of how comments relate to the gender of content creators. We release a small sample of comments annotated using this framework. Our dataset analysis confirms that female content creators receive more sexist and hateful comments than their male counterparts, underscoring the need for further research and intervention in addressing online sexism.
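As an example of the feature-augmentation step, the sketch below scores comments with the off-the-shelf Detoxify library; whether this matches the paper's exact tooling is an assumption, and the example comments are hypothetical.

```python
from detoxify import Detoxify

comments = [
    "Great video, learned a lot!",
    "Nobody wants to hear a woman explain this.",
]

# Detoxify returns a dict of per-comment scores for toxicity-related labels;
# these scores can be attached to each comment as dataset features.
scores = Detoxify("original").predict(comments)
for comment, toxicity in zip(comments, scores["toxicity"]):
    print(f"{toxicity:.2f}  {comment}")
```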
Closing the Loop: Testing ChatGPT to Generate Model Explanations to Improve Human Labelling of Sponsored Content on Social Media
Thales Bertaglia, Stefan Huber, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi
Regulatory bodies worldwide are intensifying their efforts to ensure transparency in influencer marketing on social media through instruments like the Unfair Commercial Practices Directive (UCPD) in the European Union, or Section 5 of the Federal Trade Commission Act. Yet enforcing these obligations has proven to be highly problematic due to the sheer scale of the influencer market. The task of automatically detecting sponsored content aims to enable the monitoring and enforcement of such regulations at scale. Current research in this field primarily frames this problem as a machine learning task, focusing on developing models that achieve high classification performance in detecting ads. These machine learning tasks rely on human data annotation to provide ground truth information. However, agreement between annotators is often low, leading to inconsistent labels that hinder the reliability of models. To improve annotation accuracy and, thus, the detection of sponsored content, we propose using ChatGPT to augment the annotation process with phrases identified as relevant features and brief explanations. Our experiments show that this approach consistently improves inter-annotator agreement and annotation accuracy. Additionally, our survey of user experience in the annotation task indicates that the explanations improve the annotators’ confidence and streamline the process. Our proposed methods can ultimately lead to more transparency and alignment with regulatory requirements in sponsored content detection.
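A minimal sketch of how such explanation prompts could be generated with the OpenAI API; the prompt wording, model choice, and example caption are illustrative assumptions, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

post = "obsessed with this serum from @glowlab, use my code ANNA15"

# Hypothetical prompt; the paper's actual prompt design may differ.
prompt = (
    "Does the Instagram caption below look like sponsored content? "
    "List the phrases that indicate sponsorship and give a one-sentence "
    f"explanation.\n\nCaption: {post}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
# The returned phrases and explanation are shown to annotators as
# labelling aids alongside the caption.
print(response.choices[0].message.content)
```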
Digital influencers, monetization models and platforms as transactional spaces
This paper aims to discuss the impact of digital influencers’ content monetization on social media in the context of platform governance. To achieve this objective, it characterizes the Monetization Supply Chain and the different monetization models: ad revenue; on-platform influencer marketing; subscription, tokenization, and crowdfunding; direct selling; and creator funds, in addition to traditional influencer marketing. It also presents preliminary analyses of a dataset of posts by 400 influencers from four countries (Brazil, Germany, the Netherlands, and the United States of America) to understand how content creators from different countries frame sponsored content.
2022
The case for a legal compliance API for the enforcement of the EU’s digital services act on social media platforms
Catalina Goanta, Thales Bertaglia, and Adriana Iamnitchi
In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022
In under a year, the European Commission has launched some of the most important regulatory proposals to date on platform governance. The Commission’s goals behind cross-sectoral regulation of this sort include the protection of markets and democracies alike. While all these acts propose sophisticated rules for setting up new enforcement institutions and procedures, one aspect remains highly unclear: how digital enforcement will actually take place in practice. Focusing on the Digital Services Act (DSA), this discussion paper critically addresses issues around social media data access for the purpose of digital enforcement and proposes the use of a legal compliance application programming interface (API) as a means to facilitate compliance with the DSA and complementary European and national regulation. To contextualize this discussion, the paper pursues two scenarios that exemplify the harms arising out of content monetization affecting a particularly vulnerable category of social media users: children. The two scenarios are used to reflect further upon essential issues surrounding data access and legal compliance with the DSA and other applicable legal standards in the fields of labour and consumer law.
2021
Abusive language on social media through the legal looking glass
Thales Bertaglia, Andreea Grigoriu, Michel Dumontier, and Gijs Dijck
In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021
Abusive language is a growing phenomenon on social media platforms. Its effects can reach beyond the online context, contributing to mental or emotional stress for users. Automatic tools for detecting abuse can alleviate the issue. In practice, developing automated methods to detect abusive language relies on good-quality data. However, there is currently a lack of standards for creating datasets in the field, including definitions of what is considered abusive language, annotation guidelines, and reporting on the process. This paper introduces an annotation framework inspired by legal concepts to define abusive language in the context of online harassment. The framework uses a 7-point Likert scale for labelling instead of class labels. We also present ALYT, a dataset of Abusive Language on YouTube. ALYT includes YouTube comments in English extracted from videos on different controversial topics and labelled by law students. The comments were sampled from the actual collected data, without artificial methods for inflating the amount of abusive content. The paper describes the annotation process thoroughly, including all its guidelines and training steps.
Clout Chasing for the Sake of Content Monetization: Gaming Algorithmic Architectures with Self-Moderation Strategies
Thales Bertaglia, Adrien Dubois, and Catalina Goanta
This short discussion paper addresses how controversy is monetized online by reflecting on a new iteration of shock value in media production, identified on social media as the ‘clout chasing’ phenomenon. We first exemplify controversial behavior and subsequently proceed to define clout chasing, which we discuss in relation to existing frameworks for understanding controversy on social media. We then outline what clout chasing entails as a content monetization strategy and address the risks associated with this approach. In doing so, we introduce the concept of ‘content self-moderation’, which encompasses how creators use content moderation as a way to hedge the monetization risks arising out of their reliance on controversy for economic growth. This concept is discussed in the context of the automated content governance entailed by algorithmic platform architectures, to contribute to existing scholarship on platform governance.
2020
European integration after Maastricht: Insights, novel research agendas, and the challenge of Real-World Impact
Neculai-Cristian Surubaru, Caterina Di Fazio, Miriam Urlings, Catalina Goanta, Thales Costa Bertaglia, and 1 more author
The enormous amount of text published daily by Internet users has fostered the development of methods to analyze this content in several natural language processing areas, such as sentiment analysis. The main goal of this task is to classify the polarity of a message. Even though many approaches have been proposed for sentiment analysis, some of the most successful ones rely on the availability of large annotated corpora, which are expensive and time-consuming to build. In recent years, distant supervision has been used to obtain larger datasets. Inspired by these techniques, in this paper we extend such approaches to incorporate emojis, the popular graphic symbols used in electronic messages, in order to create a large sentiment corpus for Portuguese. We trained several models on almost one million tweets and tested them on both same-domain and cross-domain corpora. Our methods obtained very competitive results on five annotated corpora from mixed domains (Twitter and product reviews), demonstrating the domain-independent property of the approach. In addition, our results suggest that the combination of emoticons and emojis properly captures the sentiment of a message.
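A minimal sketch of the emoji-based distant-supervision labelling described above; the emoji sets are a small hypothetical sample rather than the paper's full lexicon.

```python
# Hypothetical sample of polarity-bearing emojis used as distant labels.
POSITIVE_EMOJIS = {"😀", "😍", "❤️", "👍"}
NEGATIVE_EMOJIS = {"😡", "😢", "👎", "💔"}

def distant_label(tweet: str) -> str | None:
    """Label a tweet by the polarity of its emojis, without human annotation."""
    has_pos = any(e in tweet for e in POSITIVE_EMOJIS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOJIS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or no emoji: discard from the corpus

print(distant_label("Adorei esse filme 😍"))   # positive
print(distant_label("Que dia horrível 😢"))    # negative
```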
2016
Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization
Thales Bertaglia, and Maria das Graças Volpe Nunes
In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), Dec 2016
Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain-interchangeable, which makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representations of words (word embeddings), which are high-dimensional continuous numeric vectors representing words. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships; words that share semantic similarity are represented by similar vectors. Based on these features, we present a fully unsupervised, expandable, language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach obtains a high correction rate for orthographic errors and internet slang in product reviews, outperforming the tools currently available for Brazilian Portuguese.
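A minimal sketch of lexicon learning from embeddings, assuming word vectors trained on noisy UGC are available in word2vec format (the file name is hypothetical): nearest neighbours of a noisy form serve as candidate canonical spellings.

```python
from gensim.models import KeyedVectors

# Assumes embeddings trained on noisy user-generated text; the file name
# and parameters here are hypothetical.
vectors = KeyedVectors.load_word2vec_format("ugc_embeddings.bin", binary=True)

def normalization_candidates(noisy_word: str, topn: int = 10) -> list[str]:
    """Neighbours of a noisy word in embedding space are candidate
    canonical forms, since misspellings and slang occur in contexts
    similar to their standard spellings."""
    return [word for word, _ in vectors.most_similar(noisy_word, topn=topn)]

# e.g. normalization_candidates("vc") might rank "você" highly in
# embeddings trained on Brazilian Portuguese tweets.
```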