Step 5 of 7

NLP & Annotation

LLM-in-the-loop annotation with iterative CamemBERT fine-tuning across multiple classifier families

Last updated: 2026-06-03 00:43
4.8M+ Annotated sentences (transcripts)
15.3M+ Annotated sentences (comments)
Data Flow

Continuous Update Pipeline

Input: Sentences + codebooks
Process: LLM annotation → CamemBERT fine-tuning → reinforced loop
Output: Per-sentence predictions
Methodology

How It Works

For each classification task, a detailed codebook defines the target construct with positive/negative examples and edge-case rules. The LLM (accessed via LLM_Tool) receives the codebook as a system prompt and annotates batches of 500 sentences, producing binary labels and brief justifications. This yields silver-standard training sets of 2,000 to 5,000 sentences per task, generated at a fraction of the cost of human annotation while maintaining sufficient quality for classifier training.
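As a rough illustration of the step described above, the sketch below turns a codebook into a system prompt and splits sentences into batches of 500. The `Codebook` class, its field names, and the prompt wording are assumptions made for this example; they are not taken from LLM_Tool, whose actual interface is not documented here.

```python
from dataclasses import dataclass, field
from typing import Iterator

BATCH_SIZE = 500  # batch size stated in the text above


@dataclass
class Codebook:
    """Hypothetical container for one classification task's codebook."""
    label: str
    definition: str
    positive_examples: list[str] = field(default_factory=list)
    negative_examples: list[str] = field(default_factory=list)
    edge_case_rules: list[str] = field(default_factory=list)


def render_system_prompt(cb: Codebook) -> str:
    """Flatten a codebook into plain text usable as a system prompt."""
    lines = [
        f"Task: decide whether each sentence expresses '{cb.label}'.",
        f"Definition: {cb.definition}",
        "Positive examples:",
    ]
    lines += [f"  + {s}" for s in cb.positive_examples]
    lines.append("Negative examples:")
    lines += [f"  - {s}" for s in cb.negative_examples]
    lines.append("Edge-case rules:")
    lines += [f"  * {r}" for r in cb.edge_case_rules]
    lines.append("Answer with a binary label (0 or 1) and a brief justification.")
    return "\n".join(lines)


def batched(sentences: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` sentences."""
    for start in range(0, len(sentences), size):
        yield sentences[start:start + size]
```

Each batch would then be sent to the LLM together with the rendered prompt; that call is left out because it depends on the backend (local inference or cloud API).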

Each CamemBERT classifier is fine-tuned using a standard sequence classification head (dropout + linear layer) on the LLM-annotated data. Models are evaluated on a held-out validation set. If performance is insufficient, the pipeline enters a reinforced loop: the most uncertain predictions (closest to the decision boundary) are re-annotated by the LLM with stricter prompts, the training set is augmented, and the model is retrained. This iterative process typically converges within 2 to 3 rounds.
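The selection of "most uncertain" predictions can be sketched as follows. This is a minimal example, assuming the classifier outputs a positive-class probability for each sentence and that the decision boundary sits at 0.5; neither the function name nor the boundary value comes from the project itself.

```python
def most_uncertain(probabilities: list[float], k: int, boundary: float = 0.5) -> list[int]:
    """Return the indices of the k predictions closest to the decision boundary."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - boundary),
    )
    return ranked[:k]


# Positive-class probabilities for five sentences (made-up values).
probs = [0.97, 0.52, 0.08, 0.45, 0.61]
print(most_uncertain(probs, k=2))  # → [1, 3]
```

The returned indices identify the sentences that would be sent back to the LLM for re-annotation with the stricter prompt.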

Once validated, classifiers will be applied to the entire corpus: transcript sentences and comment sentences. Each sentence will receive a binary prediction (0/1) and a confidence score for every applicable model, resulting in multiple annotation columns per sentence. Inference will run on GPU, allowing the full corpus to be processed in a matter of hours.
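One way to picture the resulting annotation columns is shown below. The column naming scheme (`<model>_pred`, `<model>_conf`), the 0.5 threshold, and the definition of confidence as the probability of the predicted class are all assumptions for illustration, not the project's documented schema.

```python
def to_columns(sentence_id: int, probs_by_model: dict[str, float], threshold: float = 0.5) -> dict:
    """Turn per-model positive-class probabilities into prediction and confidence columns."""
    row = {"sentence_id": sentence_id}
    for model_name, p_positive in probs_by_model.items():
        label = int(p_positive >= threshold)
        # Confidence is taken here as the probability of the predicted class.
        confidence = p_positive if label == 1 else 1.0 - p_positive
        row[f"{model_name}_pred"] = label
        row[f"{model_name}_conf"] = round(confidence, 4)
    return row


row = to_columns(1, {"immigration_security": 0.91, "political_yes": 0.2})
print(row)
```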


Annotating millions of sentences for ideological content at the sentence level poses a fundamental scaling challenge: human annotation is the gold standard but prohibitively expensive at this scale (an estimated 45,000 annotator-hours for the full corpus). Our solution implements an LLM-in-the-loop strategy: large language models generate silver-standard training labels on carefully sampled subsets, which are then used to fine-tune lightweight CamemBERT classifiers that can be applied to the entire corpus in a matter of hours.

The annotation pipeline is powered by LLM_Tool, an open-source framework supporting both local inference and cloud API calls. For each classification task, a detailed codebook defines the construct (e.g., "anti-immigration: security threat"), provides positive and negative examples, and specifies edge cases. The LLM receives the codebook as a system prompt and annotates batches of 500 sentences with binary labels and a brief justification. CamemBERT classifiers are then fine-tuned on these labels using a standard sequence classification head. When validation performance is insufficient, the reinforced loop is triggered: the LLM re-annotates the most uncertain samples (those near the decision boundary), the training set is augmented, and the classifier is retrained.
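The control flow of the loop can be sketched independently of any model. In the example below, `fit`, `evaluate`, and `reannotate_uncertain` are hypothetical callables standing in for CamemBERT fine-tuning, validation scoring, and LLM re-annotation; `target_score` and the cap of three rounds are likewise assumptions, since the text does not state the stopping criterion.

```python
from typing import Callable

Example = tuple[str, int]  # (sentence, binary label)


def reinforced_loop(
    train_set: list[Example],
    fit: Callable[[list[Example]], object],
    evaluate: Callable[[object], float],
    reannotate_uncertain: Callable[[object], list[Example]],
    target_score: float,
    max_rounds: int = 3,
) -> tuple[object, float, int]:
    """Retrain until the validation score reaches target_score or rounds run out."""
    model = fit(train_set)
    score = evaluate(model)
    rounds = 0
    while score < target_score and rounds < max_rounds:
        # Augment the training set with re-annotated uncertain sentences.
        train_set = train_set + reannotate_uncertain(model)
        model = fit(train_set)
        score = evaluate(model)
        rounds += 1
    return model, score, rounds
```

Passing the three steps in as functions keeps the loop testable with simple stubs before any GPU training is involved.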

In Practice

Concrete Example

Example: Annotating the "Anti-Immigration: Security Threat" Classifier

The codebook defines the construct as "framing immigration as a security threat through references to crime, terrorism, or public safety." The LLM annotates a sample of sentences, labeling some as positive (e.g., "l’immigration massive est directement liée à l’explosion de la délinquance") and others as negative (e.g., "le taux de chômage a diminué ce trimestre"). A CamemBERT classifier is then fine-tuned on these labels. If performance is insufficient, the reinforced loop selects the most uncertain sentences, sends them back to the LLM for re-annotation with stricter codebook prompts, and the model is retrained. This iterative process is currently underway for each classification task.
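To make the example concrete, the sketch below parses LLM output into silver labels for the two sentences quoted above. The JSON-lines response format, the field names, and the justification texts are invented for illustration; the actual output format of the annotation step is not specified here.

```python
import json

# Hypothetical LLM response: one JSON object per annotated sentence.
raw_response = """
{"sentence": "l’immigration massive est directement liée à l’explosion de la délinquance", "label": 1, "justification": "Links immigration to crime."}
{"sentence": "le taux de chômage a diminué ce trimestre", "label": 0, "justification": "Economic statement, no security framing."}
"""


def parse_annotations(text: str) -> list[dict]:
    """Parse JSON-lines annotations and reject any non-binary label."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("label") not in (0, 1):
            raise ValueError(f"non-binary label: {record!r}")
        records.append(record)
    return records


silver = parse_annotations(raw_response)
training_pairs = [(r["sentence"], r["label"]) for r in silver]
```

`training_pairs` is the kind of (sentence, label) list a fine-tuning script would consume; the justifications can be kept alongside for auditing.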

Technology Stack

Tools Used

LLM_Tool
CamemBERT
Hugging Face Transformers
scikit-learn
CUDA / GPU
NLP Annotation

Annotation Model Families

Classifiers currently under development, organized into thematic families.

A first model classifies each sentence as political or non-political using a broad definition (current affairs, social issues, political actors, power relations, social norms). This filtering precedes the three annotation projects.

Political Detection

Binary classification of each sentence as political or non-political.

political_yes The sentence refers to current affairs, social issues, political actors, power relations or social norms.
political_no The sentence relates to private life, personal narrative or entertainment without collective scope.
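If the political filter is used as a gate for the downstream classifiers, the routing could look like the sketch below. The `p_political` field name and the 0.5 threshold are assumptions for this example, and whether non-political sentences are skipped entirely or merely flagged is not stated in the description above.

```python
def route_sentences(rows: list[dict], threshold: float = 0.5) -> tuple[list[dict], list[dict]]:
    """Split rows into political sentences (kept for annotation) and the rest."""
    political, skipped = [], []
    for row in rows:
        target = political if row["p_political"] >= threshold else skipped
        target.append(row)
    return political, skipped


rows = [
    {"sentence": "a", "p_political": 0.9},
    {"sentence": "b", "p_political": 0.1},
]
political, skipped = route_sentences(rows)
```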

Detection of gender discourse and multidimensional analysis: gender presence, valence (positive, negative, ambivalent), type of rationality invoked and positioning towards science.

Gender

Does the content address gender? Direct or indirect reference to men, women, masculinity, femininity, gender roles, feminism, antifeminism, male-female relations, LGBTQ+.

gender_yes Gender discourse present
gender_no No gender discourse

Gender Valence

Tone of the gender discourse.

genre_valence_positive Promotes gender equality or challenges stereotypes
genre_valence_negative Hostility, criticism or derogatory claims toward feminism or gender equality
genre_valence_ambivalent Initially appears egalitarian but limits or relativizes equality
genre_valence_null No evaluative stance toward gender

Rationality Type

Type of rationality mobilized in gender discourse.

rationality_none No justificatory rationality
rationality_nature Biological, natural, evolutionary or religious-natural arguments
rationality_liberal Invokes formal equality or individual rights to deny structural domination
rationality_empirical Statistics, data or "facts" as justification
rationality_heroic Frames the claim as courageous truth-telling, anti-political correctness

Science Position

Positioning towards science in gender discourse.

science_none No reference to science
science_pro_science Values studies, experts or research
science_anti_science Discredits academia or research
science_ambivalent Both pro- and anti-science registers coexist

Measurement of neo-reactionary (NR) ideas centered on technological optimism, libertarianism and the use of fictional metaphors in political discourse, along with dimensions shared with the SIED (equality and ecology).

Technology

Technological optimism, technocracy and transhumanism.

techno_optimism_overall Optimistic or positive view of the role of technology and innovation
innovation_as_progress Technological innovation as a driver of progress or solution to social problems
pro_tech_figures Favorable reference to tech figures (Musk, Thiel, Altman, Zuckerberg…)
technocracy_over_democracy Technocratic or expert-led governance is more effective than democracy
deregulation_of_tech Deregulation of technological innovation as necessary for progress
transhumanism Support for transhumanism, post-humanism, eugenics or technological augmentation of humans

Libertarianism

Secession, individual autonomy, alternative communities and the corporate model as a political counter-model.

lib_sec Support for secession or break from the national political community
lib_autonomy Living autonomously, outside traditional state structures
lib_community Creation of communities based on their own values and rules
lib_company The corporate model as a political counter-model to the state or democracy
lib_state The state should be run like a company, according to performance criteria

Fictional Metaphors

Use of metaphors from popular fiction to structure political interpretation.

metaphor_redpill Reference to the "red pill," awakening to hidden truth, breaking free from egalitarian or democratic illusions
metaphor_lotr References to Lord of the Rings to conceptualize social or civilizational hierarchies
metaphor_starwars References to Star Wars to frame political struggle, authority or legitimacy
metaphor_cathedral The Cathedral as a metaphor for universities, media or progressive institutions forming an ideological system

Equality SIED + NR

Relationship to equality, social and biological hierarchies.

equality_value Equality as a threat to values, traditions or the social order
equality_identity Equality as a threat to French identity or a factor of national dissolution
equality_gender Inequalities between the sexes presented as natural or biologically grounded
hierarchy_castes Society described in terms of castes or natural social hierarchies
hierarchy_IQ IQ used as a criterion for ranking individuals or groups
hierarchy_race Reference to natural inequalities between races or ethnic groups
equality_utopia Equality described as unrealistic, naive or utopian

Ecology SIED + NR

Ecological positioning: eco-skepticism, techno-solutionism or civilizational ecology.

eco_eco Economic growth is more important than environmental protection
eco_tech Ecological concerns as obstacles to technological development
eco_civ Climate challenges framed as a competition between civilizations

The far-right ideological score (SIED) was developed in Boursier & Lemor (2025), Revue française de science politique. It measures the presence of far-right ideological affiliation categories (CAIED: nationalism, immigration, democracy, progress, authority, tradition), as well as dimensions shared with the NR project (equality and ecology), through their respective sub-dimensions.

Nationalism

Constructions of the nation and national identity.

nation_ethnic Nation as an ethnic or cultural community based on blood ties or common ancestors
nation_family Nation associated with the family, citizens as children of the motherland
nation_state Nation fused with the state as a single, inseparable entity
nation_vital The nation as an essential and insurmountable element of human life
nation_threat Nation described as under threat, requiring protection or defense
nation_colonialism Colonial nostalgia or denial of the consequences of colonization

Immigration

Framing of immigration as a threat.

immigration_identity Threat to national identity, culture or French/European values
immigration_security Association with delinquency, crime or terrorism
immigration_women Threat to women's rights or gender equality
immigration_law Call for stricter immigration or asylum legislation

Democracy

Critical relationship to democracy as an ideal or political regime.

demo_value Democracy as a threat to values, traditions or national identity
demo_sep Challenging the separation of powers, strengthening the executive
demo_vain Democracy described as inefficient, slow or unable to produce good decisions
demo_corrupt Democracy as fundamentally corrupt or captured by special interests
demo_beyond Call for revolting against or moving beyond democracy
demo_neg Support for non-democratic regimes (authoritarianism, monarchy, technocracy)

Progress

Rejection of modernization, globalization and progressive change.

progress_identity Progress as a threat to values, traditions or national identity
progress_stop Call for slowing, limiting or stopping social progress or progressive reforms
progress_glob Criticism of progress through globalization or the EU as destruction of identities

Authority

Obedience to authority, use of force and traditionalism.

authority_chief Importance of a strong leader or providential figure to protect the nation
authority_essential Political measure presented as essential, urgent to restore authority
authority_security Importance of order and security, fighting delinquency
authority_army Valorization of the army, police or law enforcement

Tradition

Defense of traditional values and the civilizational project.

tradition_value French values, customs or identity to be preserved and promoted
tradition_threat Tradition or traditional values under threat, requiring protection
tradition_family Promotion of the traditional family model or criticism of family transformations
tradition_laicite Secularism as a marker of national identity rather than a principle of neutrality
tradition_civilization Tradition as a civilizational project to spread values considered superior

Equality SIED + NR

Same labels as the Equality dimension listed under the neo-reactionary (NR) family above.

Ecology SIED + NR

Same labels as the Ecology dimension listed under the neo-reactionary (NR) family above.
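Sentence-level sub-dimension predictions can be rolled up to the category level in several ways. The sketch below shows one simple option: counting sentences in which at least one sub-dimension of a category is positive. It is an illustrative aggregation only, not the published SIED scoring method, and the mapping covers just a handful of the labels listed above.

```python
from collections import Counter

# Partial, illustrative mapping from label names to categories.
LABEL_TO_CATEGORY = {
    "nation_ethnic": "nationalism",
    "nation_threat": "nationalism",
    "immigration_identity": "immigration",
    "immigration_security": "immigration",
    "demo_vain": "democracy",
    "progress_stop": "progress",
    "authority_chief": "authority",
    "tradition_family": "tradition",
    "hierarchy_IQ": "equality",
    "eco_tech": "ecology",
}


def category_counts(rows: list[dict[str, int]]) -> Counter:
    """Count sentences with at least one positive sub-dimension per category."""
    counts: Counter = Counter()
    for row in rows:
        hit = {
            LABEL_TO_CATEGORY[label]
            for label, value in row.items()
            if value == 1 and label in LABEL_TO_CATEGORY
        }
        counts.update(hit)  # each category counted once per sentence
    return counts


rows = [
    {"nation_ethnic": 1, "nation_threat": 1, "eco_tech": 0},
    {"immigration_security": 1, "hierarchy_IQ": 1},
]
counts = category_counts(rows)
```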
Data Architecture

Database Schema

Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.

| # | Table | Description | Scale |
|---|-------|-------------|-------|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
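The relationship between a raw table and its sentence-level counterpart can be sketched with an in-memory SQLite database. Only the table names `comments` and `comments_processed` come from the schema above; every column name here is hypothetical, and SQLite is a stand-in for the production database (the JSONB column suggests PostgreSQL, which this sketch does not reproduce).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    video_id   TEXT NOT NULL,
    text       TEXT NOT NULL
);
CREATE TABLE comments_processed (
    sentence_id        INTEGER PRIMARY KEY,
    comment_id         TEXT NOT NULL REFERENCES comments(comment_id),
    position           INTEGER NOT NULL,  -- order of the sentence within the comment
    sentence           TEXT NOT NULL,
    political_yes_pred INTEGER,           -- example ML prediction column
    political_yes_conf REAL
);
""")

conn.execute("INSERT INTO comments VALUES ('c1', 'v1', 'First sentence. Second sentence.')")
conn.executemany(
    "INSERT INTO comments_processed "
    "(comment_id, position, sentence, political_yes_pred, political_yes_conf) "
    "VALUES (?, ?, ?, ?, ?)",
    [("c1", 0, "First sentence.", 0, 0.93), ("c1", 1, "Second sentence.", 1, 0.71)],
)

# Join sentence-level predictions back to their parent comment's video.
rows = conn.execute(
    "SELECT c.video_id, p.position, p.political_yes_pred "
    "FROM comments_processed p JOIN comments c USING (comment_id) "
    "ORDER BY p.position"
).fetchall()
print(rows)  # → [('v1', 0, 0), ('v1', 1, 1)]
```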

Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.
