Step 6 of 7

Database & Analysis

Normalized relational schema with NER, co-occurrence networks and engagement modeling

                
                Last updated: 2026-06-03 00:43
            

6 Database tables

4.8M+ Transcript sentences

15.3M+ Comment sentences

Input Annotated transcripts + metadata + comments

Process PostgreSQL + NER + Leiden networks + OLS

Output Research-ready database (15M+ rows)

The PostgreSQL schema contains 4 core tables (videos, comments, video_transcripts, transcription_speakers) and 2 processed variant tables (comments_processed, transcription_speakers_processed) that add sentence-level tokenization, NER entities as JSONB columns, and 40+ ML prediction columns. Data flows from upstream CSV intermediaries through a SQLAlchemy ingestion pipeline with type validation, deduplication, and referential integrity checks. Indexes are optimized for temporal queries (upload date) and channel-level aggregations.

A fine-tuned CamemBERT NER model extracts persons, organizations, and locations from every sentence in both transcripts and comments. Entities are stored as JSONB arrays (e.g., {"PER": ["Macron", "Le Pen"], "ORG": ["RN"]}) enabling efficient JSON path queries and full-text search. The NER pipeline processes approximately 10,000 sentences per second on GPU, completing the full corpus in under 5 hours.

Entity co-occurrence networks are built using NetworkX and iGraph: nodes represent entities, edges connect entities mentioned in the same sentence, and edge weights reflect co-occurrence frequency. The Leiden community detection algorithm identifies clusters of related entities (political figures, media organizations, geographic regions). OLS regression models (via statsmodels) then test hypotheses about the relationship between discourse features and engagement, controlling for channel fixed effects and temporal trends.

Click each card above to expand details

The final stage integrates all upstream outputs (video metadata, comment trees, speaker-labeled transcripts, and 40+ columns of NLP predictions) into a normalized PostgreSQL database with six tables. The schema is designed for analytical flexibility: a single SQL query can join a video’s metadata (views, likes, upload date, political orientation) with the sentence-level NLP annotations of its transcript and the NER entities mentioned by each speaker. The database currently holds over 15 million sentence-level rows and supports the interactive dashboards on this website.

Beyond storage, this stage performs two major analytical enrichments. Named Entity Recognition (NER) uses a fine-tuned CamemBERT NER model combined with spaCy’s pipeline to extract persons (PER), organizations (ORG), and locations (LOC) at the sentence level across both transcripts and comments. Entities are stored as JSONB arrays, enabling efficient full-text and JSON path queries. Co-occurrence network analysis then maps relationships between the most frequently mentioned entities using NetworkX and iGraph, with community detection via the Leiden algorithm revealing clusters of co-mentioned figures and institutions. Finally, OLS regression models examine how discourse features (ideology scores, hate speech prevalence, entity mentions) predict audience engagement metrics.

NER processing on a single far-right channel’s 800 video transcripts extracts 12,400 unique person mentions, 3,200 organization mentions, and 2,100 location mentions. The co-occurrence network for this channel reveals a dense cluster around political figures frequently discussed together (e.g., Emmanuel Macron, Marine Le Pen, Éric Zemmour), connected by an edge weight proportional to how often they appear in the same sentence. Leiden community detection identifies 7 thematic clusters. An OLS regression then shows that videos with above-median mentions of "immigration + security" entities receive 23% more views than the channel average (p < 0.01), while "scientific rhetoric" mentions have no significant effect on engagement.

PostgreSQL

SQLAlchemy

spaCy

CamemBERT (NER)

NetworkX

iGraph

Leiden

statsmodels

#	Table	Description	Scale
1	videos	One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender.	26,396 rows
2	comments	All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column.	9.6M+ rows
3	video_transcripts	Full diarized transcripts with speaker labels and cleaned text versions.	28,121 rows
4	transcription_speakers	Individual speaker segments from diarization, ordered by position within each video.	1,021,611 rows
5	comments_processed	Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns.	15.3M+ rows
6	transcription_speakers_processed	Sentence-level speaker segments with NER extraction and full annotation suite.	4.8M+ rows

Previous NLP & Annotation Next Continuous Observatory

All steps

Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.

Last updated: 2026-06-03 00:43

Today

videos transcribed

comments extracted

Since January

videos transcribed

comments extracted

videos detected

metadata updated

channels scanned

Database & Analysis

Continuous Update Pipeline

How It Works

Schema Design & Data Ingestion

Named Entity Recognition (CamemBERT + spaCy)

Network Analysis & Regression Modeling

Concrete Example

From Entity Extraction to Engagement Modeling

Tools Used

Database Schema

Continuous Observatory