publications
publications in reverse chronological order. check out my google scholar for the full list.
2025
- Influencer self-disclosure practices on Instagram: A multi-country longitudinal study. Thales Bertaglia, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi. Online Social Networks and Media, 2025.
This paper presents a longitudinal study of more than ten years of activity on Instagram, consisting of over a million posts by 400 content creators from four countries: the US, Brazil, the Netherlands, and Germany. Our study shows differences in the professionalisation of content monetisation between countries, yet consistent patterns; significant differences in the frequency of posts, yet similar user engagement trends; and significant differences in the disclosure of sponsored content in some countries, with a direct connection to national legislation. We analyse shifts in marketing strategies due to legislative and platform feature changes, focusing on how content creators adapt disclosure methods to different legal environments. We also analyse the impact of disclosures and sponsored posts on engagement and conclude that, although sponsored posts have lower engagement on average, properly disclosing ads does not reduce engagement further. Our observations stress the importance of disclosure compliance and can guide authorities in developing and monitoring disclosure regulations more effectively.
@article{bertaglia2025disclosures,
  title    = {Influencer self-disclosure practices on Instagram: A multi-country longitudinal study},
  author   = {Bertaglia, Thales and Goanta, Catalina and Spanakis, Gerasimos and Iamnitchi, Adriana},
  journal  = {Online Social Networks and Media},
  volume   = {45},
  pages    = {100298},
  year     = {2025},
  issn     = {2468-6964},
  doi      = {10.1016/j.osnem.2024.100298},
  keywords = {Influencer marketing, Advertising disclosure, Instagram, Self-disclosure practices, Legal compliance},
}
2024
- Decoding digital influence: Computational insights into monetisation, controversy, and compliance in the creator economy. Thales Bertaglia. PhD thesis, Maastricht University, 2024.
This thesis investigates the creator economy on social media platforms, focusing on influencers – content creators who use their extensive audience reach and strong follower relationships to monetise content through various models, most notably influencer marketing. The drive for revenue generation through content monetisation often incentivises influencers to prioritise engagement and visibility above transparency, leading to issues related to undisclosed advertisements and online toxicity. This thesis addresses these challenges with an interdisciplinary approach, aiming to develop reliable computational methodologies and data resources for monitoring, analysing, and detecting problematic content while considering the socio-legal implications. Our key contributions include (1) A longitudinal analysis of influencer practices on Instagram from 2010 to 2022 across four countries, revealing trends in posting volumes, ad disclosures, and the impact of regulatory changes on influencer behaviour. (2) A method based on large language models to improve the quality of data resources for sponsored content detection. (3) A methodology for generating and evaluating synthetic social media data for research. (4) A framework and dataset for detecting abusive language inspired by legal definitions. (5) A methodology and dataset for analysing gendered toxicity targeting influencers on YouTube. (6) An in-depth analysis of controversy as a monetisation strategy within the creator economy, investigating its effects on engagement, toxicity, and content moderation strategies employed by creators. This thesis combines insights from law, social sciences, and data science to improve transparency, reduce toxicity, and encourage responsible content creation, contributing to a safer online environment. We emphasise the importance of interdisciplinary research in addressing the challenges within the creator economy, particularly in improving regulatory compliance through computational methods. 
Our work highlights the importance of transparency, the impact of regulatory clarity on influencer behaviour, and the need for computational approaches that are both transparent and theoretically grounded.
@phdthesis{bertaglia2024thesis,
  title    = {Decoding digital influence: Computational insights into monetisation, controversy, and compliance in the creator economy},
  author   = {Bertaglia, Thales},
  school   = {Maastricht University},
  address  = {Netherlands},
  year     = {2024},
  doi      = {10.26481/dis.20241107tb},
  isbn     = {9789465102597},
  language = {English},
}
- The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement. Thales Bertaglia, Catalina Goanta, and Adriana Iamnitchi. In Proceedings of the 4th International Workshop on Open Challenges in Online Social Networks, Poznan, Poland, 2024.
YouTube is a major social media platform that plays a significant role in digital culture, with content creators at its core. These creators often engage in controversial behaviour to drive engagement, which can foster toxicity. This paper presents a quantitative analysis of controversial content on YouTube, focusing on the relationship between controversy, toxicity, and monetisation. We introduce a curated dataset comprising 20 controversial YouTube channels extracted from Reddit discussions, including 16,349 videos and more than 105 million comments. We identify and categorise monetisation cues from video descriptions into various models, including affiliate marketing and direct selling, using lists of URLs and keywords. Additionally, we train a machine learning model to measure the toxicity of comments in these videos. Our findings reveal that while toxic comments correlate with higher engagement, they negatively impact monetisation, indicating that controversy-driven interaction does not necessarily lead to financial gain. We also observed significant variation in monetisation strategies, with some creators showing extensive monetisation despite high toxicity levels. Our study contributes a curated dataset, lists of URLs and keywords for categorising monetisation, and a machine learning model for measuring toxicity, and is a significant step towards understanding the complex relationship between controversy, engagement, and monetisation on YouTube. The lists used for detecting and categorising monetisation cues are available at https://github.com/thalesbertaglia/toxmon.
@inproceedings{bertaglia2024toxmon,
  author    = {Bertaglia, Thales and Goanta, Catalina and Iamnitchi, Adriana},
  title     = {The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement},
  booktitle = {Proceedings of the 4th International Workshop on Open Challenges in Online Social Networks},
  series    = {OASIS '24},
  location  = {Poznan, Poland},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  pages     = {1--9},
  numpages  = {9},
  year      = {2024},
  isbn      = {9798400710827},
  doi       = {10.1145/3677117.3685005},
  keywords  = {content creators, content monetization, influencer marketing, influencers, toxicity, youtube},
}
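The keyword- and URL-matching step for categorising monetisation cues in video descriptions can be sketched roughly as follows. The cue lists here are illustrative placeholders, not the actual lists released in the paper's repository:

```python
# Illustrative cue lists per monetisation model -- placeholders only, not the
# actual lists released at github.com/thalesbertaglia/toxmon.
MONETISATION_CUES = {
    "affiliate_marketing": ["amzn.to", "use my code", "affiliate link"],
    "direct_selling": ["merch", "shop my store"],
    "crowdfunding": ["patreon.com", "ko-fi.com", "gofundme.com"],
}


def categorise_monetisation(description: str) -> set:
    """Return the monetisation models whose cues appear in a video description."""
    text = description.lower()
    return {
        model
        for model, cues in MONETISATION_CUES.items()
        if any(cue in text for cue in cues)
    }


print(categorise_monetisation("Support me on patreon.com/creator and use my code TOX10"))
```

A description can match several models at once, which is why the sketch returns a set rather than a single label.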
- InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection. Thales Bertaglia, Lily Heisig, Rishabh Kaushal, and Adriana Iamnitchi. In Proceedings of the International AAAI Conference on Web and Social Media, 2024.
Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives: The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
@inproceedings{bertaglia2024instasynth,
  title     = {InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection},
  author    = {Bertaglia, Thales and Heisig, Lily and Kaushal, Rishabh and Iamnitchi, Adriana},
  booktitle = {Proceedings of the International AAAI Conference on Web and Social Media},
  volume    = {18},
  pages     = {139--151},
  doi       = {10.1609/icwsm.v18i1.31303},
  year      = {2024},
}
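One simple content-level fidelity check in this spirit compares lexical diversity between real and synthetic caption sets. The type-token ratio below is a generic illustration, not the paper's exact metric suite, and the captions are made up:

```python
def type_token_ratio(captions):
    """Lexical diversity: unique tokens divided by total tokens across a caption set."""
    tokens = [tok for caption in captions for tok in caption.lower().split()]
    return len(set(tokens)) / len(tokens)


# Made-up examples: synthetic captions often repeat phrasing across posts,
# which lowers collective diversity even when each post looks realistic alone.
real = ["new drop out now #ad", "coffee first always", "sunset runs hit different"]
synthetic = ["living my best life #blessed", "living my best life #grateful"]

print(type_token_ratio(real), type_token_ratio(synthetic))
```

A much lower ratio for the synthetic set is one signal of the collective lack of diversity the paper reports, even when individual posts look plausible.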
- Across Platforms and Languages: Dutch Influencers and Legal Disclosures on Instagram, YouTube and TikTok. Haoyang Gui, Thales Bertaglia, Catalina Goanta, Sybe de Vries, and Gerasimos Spanakis. Accepted for publication at the 16th International Conference on Advances in Social Networks Analysis and Mining, 2024.
Content monetization on social media fuels a growing influencer economy, yet influencer marketing remains largely undisclosed or inappropriately disclosed. Non-disclosure issues have become a priority for national and supranational authorities worldwide, who are starting to impose increasingly harsh sanctions. This paper proposes a transparent methodology for measuring whether and how influencers comply with disclosures based on legal standards. We introduce a novel distinction between disclosures that are legally sufficient (green) and legally insufficient (yellow). We apply this methodology to an original dataset reflecting the content of 150 Dutch influencers publicly registered with the Dutch Media Authority based on recently introduced registration obligations. The dataset consists of 292,315 posts and is multi-language (English and Dutch) and cross-platform (Instagram, YouTube and TikTok). We find that influencer marketing remains generally underdisclosed on social media, and that bigger influencers are not necessarily more compliant with disclosure standards.
@inproceedings{gui2024,
  title     = {Across Platforms and Languages: Dutch Influencers and Legal Disclosures on Instagram, YouTube and TikTok},
  author    = {Gui, Haoyang and Bertaglia, Thales and Goanta, Catalina and de Vries, Sybe and Spanakis, Gerasimos},
  booktitle = {Accepted for publication at the 16th International Conference on Advances in Social Networks Analysis and Mining},
  pages     = {1--2},
  year      = {2024},
}
- Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research. Henry Tari, M. Danial Khan, Justus Rutten, Darian Othman, Thales Bertaglia, and 2 more authors. In Proceedings of the 35th ACM Conference on Hypertext and Social Media, Poznan, Poland, 2024.
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.
@inproceedings{tari2024,
  author    = {Tari, Henry and Khan, M. Danial and Rutten, Justus and Othman, Darian and Bertaglia, Thales and Kaushal, Rishabh and Iamnitchi, Adriana},
  title     = {Leveraging {GPT} for the Generation of Multi-Platform Social Media Datasets for Research},
  booktitle = {Proceedings of the 35th ACM Conference on Hypertext and Social Media},
  series    = {HT '24},
  location  = {Poznan, Poland},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  pages     = {337--343},
  numpages  = {7},
  year      = {2024},
  isbn      = {9798400705953},
  doi       = {10.1145/3648188.3675153},
  keywords  = {LLMs, Social Media Research, Synthetic Data},
}
2023
- Sexism in Focus: An Annotated Dataset of YouTube Comments for Gender Bias Research. Thales Bertaglia, Katarina Bartekova, Rinske Jongma, Stephen Mccarthy, and Adriana Iamnitchi. In Proceedings of the 3rd International Workshop on Open Challenges in Online Social Networks, Rome, Italy, 2023.
This paper presents a novel dataset of 200k YouTube comments from 468 videos across 109 channels in four content categories: Entertainment, Gaming, People & Blogs, and Science & Technology. We applied state-of-the-art NLP methods to augment the dataset with sexism-related features such as sentiment, toxicity, offensiveness, and hate speech. These features can assist manual content analyses and enable automated analysis of sexism in online platforms. Furthermore, we develop an annotation framework inspired by the Ambivalent Sexism Theory to promote a nuanced understanding of how comments relate to the gender of content creators. We release a small sample of comments annotated using this framework. Our dataset analysis confirms that female content creators receive more sexist and hateful comments than their male counterparts, underscoring the need for further research and intervention in addressing online sexism.
@inproceedings{bertaglia2023sexism,
  author    = {Bertaglia, Thales and Bartekova, Katarina and Jongma, Rinske and Mccarthy, Stephen and Iamnitchi, Adriana},
  title     = {Sexism in Focus: An Annotated Dataset of YouTube Comments for Gender Bias Research},
  booktitle = {Proceedings of the 3rd International Workshop on Open Challenges in Online Social Networks},
  series    = {OASIS '23},
  location  = {Rome, Italy},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  pages     = {22--28},
  numpages  = {7},
  year      = {2023},
  isbn      = {9798400702259},
  doi       = {10.1145/3599696.3612900},
}
- Closing the Loop: Testing ChatGPT to Generate Model Explanations to Improve Human Labelling of Sponsored Content on Social Media. Thales Bertaglia, Stefan Huber, Catalina Goanta, Gerasimos Spanakis, and Adriana Iamnitchi. In Explainable Artificial Intelligence, 2023.
Regulatory bodies worldwide are intensifying their efforts to ensure transparency in influencer marketing on social media through instruments like the Unfair Commercial Practices Directive (UCPD) in the European Union, or Section 5 of the Federal Trade Commission Act. Yet enforcing these obligations has proven to be highly problematic due to the sheer scale of the influencer market. The task of automatically detecting sponsored content aims to enable the monitoring and enforcement of such regulations at scale. Current research in this field primarily frames this problem as a machine learning task, focusing on developing models that achieve high classification performance in detecting ads. These machine learning tasks rely on human data annotation to provide ground truth information. However, agreement between annotators is often low, leading to inconsistent labels that hinder the reliability of models. To improve annotation accuracy and, thus, the detection of sponsored content, we propose using ChatGPT to augment the annotation process with phrases identified as relevant features and brief explanations. Our experiments show that this approach consistently improves inter-annotator agreement and annotation accuracy. Additionally, our survey of user experience in the annotation task indicates that the explanations improve the annotators’ confidence and streamline the process. Our proposed methods can ultimately lead to more transparency and alignment with regulatory requirements in sponsored content detection.
@inproceedings{bertaglia2023closing,
  author    = {Bertaglia, Thales and Huber, Stefan and Goanta, Catalina and Spanakis, Gerasimos and Iamnitchi, Adriana},
  editor    = {Longo, Luca},
  title     = {Closing the Loop: Testing {ChatGPT} to Generate Model Explanations to Improve Human Labelling of Sponsored Content on Social Media},
  booktitle = {Explainable Artificial Intelligence},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {198--213},
  year      = {2023},
  isbn      = {978-3-031-44067-0},
  doi       = {10.1007/978-3-031-44067-0_11},
}
- Digital influencers, monetization models and platforms as transactional spaces. Catalina Goanta and Thales Bertaglia. Brazilian Creative Industries Journal, 2023.
This paper aims to discuss the impact of digital influencers’ content monetization on social media in the context of platform governance. To achieve this objective, it characterizes the Monetization Supply Chain and the different monetization models, including ad revenue, on-platform influencer marketing, subscription, tokenization, crowdfunding, direct selling, and creator funds, in addition to traditional influencer marketing. It also presents preliminary analyses of a dataset of posts by 400 influencers from four countries (Brazil, Germany, the Netherlands, and the United States of America) to understand how content creators from different countries frame sponsored content.
@article{goanta2023digital,
  title   = {Digital influencers, monetization models and platforms as transactional spaces},
  author  = {Goanta, Catalina and Bertaglia, Thales},
  journal = {Brazilian Creative Industries Journal},
  volume  = {3},
  number  = {1},
  pages   = {242--259},
  year    = {2023},
}
2022
- The case for a legal compliance API for the enforcement of the EU’s digital services act on social media platforms. Catalina Goanta, Thales Bertaglia, and Adriana Iamnitchi. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
In under a year, the European Commission has launched some of the most important regulatory proposals to date on platform governance. The Commission’s goals behind cross-sectoral regulation of this sort include the protection of markets and democracies alike. While all these acts propose sophisticated rules for setting up new enforcement institutions and procedures, one aspect remains highly unclear: how digital enforcement will actually take place in practice. Focusing on the Digital Services Act (DSA), this discussion paper critically addresses issues around social media data access for the purpose of digital enforcement and proposes the use of a legal compliance application programming interface (API) as a means to facilitate compliance with the DSA and complementary European and national regulation. To contextualize this discussion, the paper pursues two scenarios that exemplify the harms arising out of content monetization affecting a particularly vulnerable category of social media users: children. The two scenarios are used to further reflect upon essential issues surrounding data access and legal compliance with the DSA and further applicable legal standards in the field of labour and consumer law.
@inproceedings{goanta2022case,
  title     = {The case for a legal compliance API for the enforcement of the EU’s digital services act on social media platforms},
  author    = {Goanta, Catalina and Bertaglia, Thales and Iamnitchi, Adriana},
  booktitle = {Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency},
  pages     = {1341--1349},
  year      = {2022},
}
2021
- Abusive language on social media through the legal looking glass. Thales Bertaglia, Andreea Grigoriu, Michel Dumontier, and Gijs van Dijck. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021.
Abusive language is a growing phenomenon on social media platforms. Its effects can reach beyond the online context, contributing to mental or emotional stress on users. Automatic tools for detecting abuse can alleviate the issue. In practice, developing automated methods to detect abusive language relies on good quality data. However, there is currently a lack of standards for creating datasets in the field. These standards include definitions of what is considered abusive language, annotation guidelines and reporting on the process. This paper introduces an annotation framework inspired by legal concepts to define abusive language in the context of online harassment. The framework uses a 7-point Likert scale for labelling instead of class labels. We also present ALYT – a dataset of Abusive Language on YouTube. ALYT includes YouTube comments in English extracted from videos on different controversial topics and labelled by Law students. The comments were sampled from the actual collected data, without artificial methods for increasing the abusive content. The paper describes the annotation process thoroughly, including all its guidelines and training steps.
@inproceedings{bertaglia2021abusive,
  title     = {Abusive language on social media through the legal looking glass},
  author    = {Bertaglia, Thales and Grigoriu, Andreea and Dumontier, Michel and van Dijck, Gijs},
  booktitle = {Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)},
  pages     = {191--200},
  year      = {2021},
}
- Clout Chasing for the Sake of Content Monetization: Gaming Algorithmic Architectures with Self-Moderation Strategies. Thales Bertaglia, Adrien Dubois, and Catalina Goanta. Morals & Machines, 2021.
This short discussion paper addresses how controversy is monetized online by reflecting on a new iteration of the shock value in media production, identified on social media as the ‘clout chasing’ phenomenon. We first exemplify controversial behavior and then define clout chasing, discussing the concept in relation to existing frameworks for understanding controversy on social media. We then outline what clout chasing entails as a content monetization strategy, and address the risks associated with this approach. In doing so, we introduce the concept of ‘content self-moderation’, which encompasses how creators use content moderation as a way to hedge monetization risks arising out of their reliance on controversy for economic growth. This concept is discussed in the context of the automated content governance entailed by algorithmic platform architectures, to contribute to existing scholarship on platform governance.
@article{bertaglia2021clout,
  title     = {Clout Chasing for the Sake of Content Monetization: Gaming Algorithmic Architectures with Self-Moderation Strategies},
  author    = {Bertaglia, Thales and Dubois, Adrien and Goanta, Catalina},
  journal   = {Morals \& Machines},
  volume    = {1},
  number    = {1},
  pages     = {22--29},
  year      = {2021},
  publisher = {Nomos Verlagsgesellschaft mbH \& Co. KG},
}
2020
- European integration after Maastricht: Insights, novel research agendas, and the challenge of Real-World Impact. Neculai-Cristian Surubaru, Caterina Di Fazio, Miriam Urlings, Catalina Goanta, Thales Costa Bertaglia, and 1 more author. EuropeNow Journal, 2020.
@article{surubaru2020european,
  title   = {European integration after Maastricht: Insights, novel research agendas, and the challenge of Real-World Impact},
  author  = {Surubaru, Neculai-Cristian and Di Fazio, Caterina and Urlings, Miriam and Goanta, Catalina and Bertaglia, Thales Costa and Segers, Mathieu},
  journal = {EuropeNow Journal},
  volume  = {37},
  pages   = {1--14},
  year    = {2020},
}
2017
- Normalização textual de conteúdo gerado por usuário [Textual normalisation of user-generated content]. Thales Bertaglia. Master's thesis, Universidade de São Paulo, 2017.
@mastersthesis{bertaglia2017normalizacao,
  title  = {Normaliza{\c{c}}{\~a}o textual de conte{\'u}do gerado por usu{\'a}rio},
  author = {Bertaglia, Thales},
  school = {Universidade de S{\~a}o Paulo},
  year   = {2017},
}
- PELESent: Cross-Domain Polarity Classification Using Distant Supervision. Edilson Anselmo Corrêa, Vanessa Queiroz Marinho, Leandro Borges dos Santos, Thales Bertaglia, Marcos Vinícius Treviso, and 1 more author. In 2017 Brazilian Conference on Intelligent Systems (BRACIS), 2017.
The enormous amount of texts published daily by Internet users has fostered the development of methods to analyze this content in several natural language processing areas, such as sentiment analysis. The main goal of this task is to classify the polarity of a message. Even though many approaches have been proposed for sentiment analysis, some of the most successful ones rely on the availability of large annotated corpora, which are expensive and time-consuming to build. In recent years, distant supervision has been used to obtain larger datasets. Inspired by these techniques, in this paper we extend such approaches to incorporate emojis, the popular graphic symbols used in electronic messages, in order to create a large sentiment corpus for Portuguese. Trained on almost one million tweets, several models were tested in both same-domain and cross-domain corpora. Our methods obtained very competitive results in five annotated corpora from mixed domains (Twitter and product reviews), which demonstrates the domain-independent property of the approach. In addition, our results suggest that the combination of emoticons and emojis is able to properly capture the sentiment of a message.
@inproceedings{edilson2017pelesent,
  author    = {Corrêa, Edilson Anselmo and Marinho, Vanessa Queiroz and dos Santos, Leandro Borges and Bertaglia, Thales and Treviso, Marcos Vinícius and Brum, Henrico Bertini},
  title     = {PELESent: Cross-Domain Polarity Classification Using Distant Supervision},
  booktitle = {2017 Brazilian Conference on Intelligent Systems (BRACIS)},
  pages     = {49--54},
  year      = {2017},
  doi       = {10.1109/BRACIS.2017.45},
  keywords  = {Sentiment Analysis; Polarity Classification; Distant Supervision; Twitter},
}
2016
- Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization. Thales Bertaglia and Maria das Graças Volpe Nunes. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), Dec 2016.
Text normalization techniques based on rules, lexicons, or supervised training requiring large corpora are neither scalable nor domain-interchangeable, which makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representations of words (word embeddings). It generates continuous numeric vectors of high dimensionality to represent words. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships, so that words that share semantic similarity are represented by similar vectors. Based on these features, we present a fully unsupervised, expandable, language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach obtains a high correction rate for orthographic errors and internet slang in product reviews, outperforming the tools currently available for Brazilian Portuguese.
@inproceedings{bertaglia2016exploring,
  title     = {Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization},
  author    = {Bertaglia, Thales and Volpe Nunes, Maria das Gra{\c{c}}as},
  booktitle = {Proceedings of the 2nd Workshop on Noisy User-generated Text ({WNUT})},
  publisher = {The COLING 2016 Organizing Committee},
  address   = {Osaka, Japan},
  month     = dec,
  year      = {2016},
  pages     = {112--120},
}
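The core lexicon-learning idea, mapping each out-of-vocabulary token to its nearest in-lexicon neighbour in embedding space, can be sketched as follows. The toy vectors and the 0.95 similarity threshold are illustrative assumptions, not values from the paper, which trains embeddings on Brazilian Portuguese UGC:

```python
from math import sqrt

# Toy 3-dimensional embeddings -- illustrative only; real embeddings are
# high-dimensional and trained on large UGC corpora.
EMBEDDINGS = {
    "voce": [0.9, 0.1, 0.0],
    "vc":   [0.88, 0.12, 0.01],  # internet slang for "voce"
    "nao":  [0.1, 0.9, 0.2],
    "n":    [0.12, 0.85, 0.22],  # internet slang for "nao"
}
CANONICAL = {"voce", "nao"}  # words found in a standard lexicon


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def learn_normalisation_lexicon(threshold=0.95):
    """Map each out-of-lexicon token to its most similar canonical word."""
    lexicon = {}
    for word, vec in EMBEDDINGS.items():
        if word in CANONICAL:
            continue
        best = max(CANONICAL, key=lambda c: cosine(vec, EMBEDDINGS[c]))
        if cosine(vec, EMBEDDINGS[best]) >= threshold:
            lexicon[word] = best
    return lexicon


print(learn_normalisation_lexicon())
```

The threshold keeps the lexicon precise: a noisy token is only normalised when a canonical neighbour is sufficiently close, which is what makes the method unsupervised yet conservative.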