Database & Analysis
Normalized relational schema with NER, co-occurrence networks and engagement modeling
Continuous Update Pipeline
How It Works
The PostgreSQL schema contains 4 core tables (videos, comments, video_transcripts, transcription_speakers) and 2 processed variant tables (comments_processed, transcription_speakers_processed) that add sentence-level tokenization, NER entities as JSONB columns, and 40+ ML prediction columns. Data flows from upstream CSV intermediaries through a SQLAlchemy ingestion pipeline with type validation, deduplication, and referential integrity checks. Indexes are optimized for temporal queries (upload date) and channel-level aggregations.
A fine-tuned CamemBERT NER model extracts persons, organizations, and locations from every sentence in both transcripts and comments. Entities are stored as JSONB arrays (e.g., {"PER": ["Macron", "Le Pen"], "ORG": ["RN"]}) enabling efficient JSON path queries and full-text search. The NER pipeline processes approximately 10,000 sentences per second on GPU, completing the full corpus in under 5 hours.
Entity co-occurrence networks are built using NetworkX and iGraph: nodes represent entities, edges connect entities mentioned in the same sentence, and edge weights reflect co-occurrence frequency. The Leiden community detection algorithm identifies clusters of related entities (political figures, media organizations, geographic regions). OLS regression models (via statsmodels) then test hypotheses about the relationship between discourse features and engagement, controlling for channel fixed effects and temporal trends.
Click each card above to expand details
The final stage integrates all upstream outputs (video metadata, comment trees, speaker-labeled transcripts, and 40+ columns of NLP predictions) into a normalized PostgreSQL database with six tables. The schema is designed for analytical flexibility: a single SQL query can join a video’s metadata (views, likes, upload date, political orientation) with the sentence-level NLP annotations of its transcript and the NER entities mentioned by each speaker. The database currently holds over 15 million sentence-level rows and supports the interactive dashboards on this website.
Beyond storage, this stage performs two major analytical enrichments. Named Entity Recognition (NER) uses a fine-tuned CamemBERT NER model combined with spaCy’s pipeline to extract persons (PER), organizations (ORG), and locations (LOC) at the sentence level across both transcripts and comments. Entities are stored as JSONB arrays, enabling efficient full-text and JSON path queries. Co-occurrence network analysis then maps relationships between the most frequently mentioned entities using NetworkX and iGraph, with community detection via the Leiden algorithm revealing clusters of co-mentioned figures and institutions. Finally, OLS regression models examine how discourse features (ideology scores, hate speech prevalence, entity mentions) predict audience engagement metrics.
Concrete Example
From Entity Extraction to Engagement Modeling
NER processing on a single far-right channel’s 800 video transcripts extracts 12,400 unique person mentions, 3,200 organization mentions, and 2,100 location mentions. The co-occurrence network for this channel reveals a dense cluster around political figures frequently discussed together (e.g., Emmanuel Macron, Marine Le Pen, Éric Zemmour), connected by an edge weight proportional to how often they appear in the same sentence. Leiden community detection identifies 7 thematic clusters. An OLS regression then shows that videos with above-median mentions of "immigration + security" entities receive 23% more views than the channel average (p < 0.01), while "scientific rhetoric" mentions have no significant effect on engagement.
Tools Used
Database Schema
Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.
| # | Table | Description | Scale |
|---|---|---|---|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
Continuous Observatory
The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.