I'm certainly not the only one who feels that YouTube sponsor segments have become longer and more frequent lately. Sometimes I watch videos that seem to be trying to sell me something every couple of seconds.
Either way, it sure is annoying to be bombarded by ads.
In this blog post, I'll explore these sponsor segments, using data from a popular browser extension called SponsorBlock, to figure out whether the perceived increase in ads actually happened, and also to quantify how many ads I'm watching.
I'll walk you through my analysis, providing code snippets in SQL, DuckDB, and pandas. All of the code is available on my GitHub, and since the dataset is open, I'll also show you how to download it, so you can follow along and play with the data yourself.
These are the questions I'll be trying to answer in this analysis:
- Have sponsor segments increased over time?
- Which channels have the highest percentage of sponsor time per video?
- What’s the density of sponsor segments throughout a video?
To get to these answers, we'll have to cover a lot of ground.
Let's get started!
SponsorBlock is a browser extension that lets you skip sponsor segments in videos, similar to how you skip Netflix intros. It's incredibly accurate: I don't remember seeing a single incorrect segment since I started using it around a month ago, and I watch a lot of smaller, non-English creators.
You might be asking yourself how the extension knows which parts of the video are sponsors, and, believe it or not, the answer is crowdsourcing!
Users submit the timestamps of the ad segments, and other users vote on whether they're accurate. For the average user, who isn't contributing at all, the only thing you have to do is press Enter to skip the ad.
Okay, now that you know what SponsorBlock is, let's talk about the data.
Cleaning the Data
If you want to follow along, you can download a copy of the data using this SponsorBlock Mirror (it might take you quite a few minutes to download it all). The database schema can be seen here, although most of it won't be useful for this project.
As one might expect, their database schema is made for the extension to work properly, and not for some guy to basically leech off a huge community effort to find out what percentage of ads his favorite creator runs. Because of this, some work needs to be done to clean and model the data.
The only two tables that are necessary for this analysis are:
- sponsorTimes.csv: This is the most important table, containing the startTime and endTime of all crowdsourced sponsor segments. The CSV is around 5GB.
- videoInfo.csv: Contains the video title, publication date, and channel ID associated with each video.
Before we get into it, these are all of the libraries I ended up using. I’ll explain the less obvious ones as we go.
pandas
duckdb
requests
requests-cache
python-dotenv
seaborn
matplotlib
numpy
The first step, then, is to load the data. Surprisingly, this was already a bit tricky, as I was getting a lot of errors parsing some rows of the CSVs. These were the settings I found to work for the vast majority of rows:
import duckdb
import os
# Connect to an in-memory DuckDB instance
con = duckdb.connect(database=':memory:')
sponsor_times = con.read_csv(
"sb-mirror/sponsorTimes.csv",
header=True,
columns={
"videoID": "VARCHAR",
"startTime": "DOUBLE",
"endTime": "DOUBLE",
"votes": "INTEGER",
"locked": "INTEGER",
"incorrectVotes": "INTEGER",
"UUID": "VARCHAR",
"userID": "VARCHAR",
"timeSubmitted": "DOUBLE",
"views": "INTEGER",
"category": "VARCHAR",
"actionType": "VARCHAR",
"service": "VARCHAR",
"videoDuration": "DOUBLE",
"hidden": "INTEGER",
"reputation": "DOUBLE",
"shadowHidden": "INTEGER",
"hashedVideoID": "VARCHAR",
"userAgent": "VARCHAR",
"description": "VARCHAR",
},
ignore_errors=True,
quotechar="",
)
video_info = con.read_csv(
"sb-mirror/videoInfo.csv",
header=True,
columns={
"videoID": "VARCHAR",
"channelID": "VARCHAR",
"title": "VARCHAR",
"published": "DOUBLE",
},
ignore_errors=True,
quotechar=None,
)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Here's what a sample of the data looks like:
con.sql("SELECT videoID, startTime, endTime, votes, locked, category FROM sponsor_times LIMIT 5")
con.sql("SELECT * FROM video_info LIMIT 5")


Understanding the data in the sponsorTimes table is extremely important; otherwise, the cleaning process won't make any sense.
Each row represents a user-submitted timestamp for a sponsored segment. Since multiple users can submit segments for the same video, the dataset contains duplicate and potentially incorrect entries, which will have to be handled during cleaning.
To find incorrect segments, I'll use the votes and locked columns, as the latter marks segments that were confirmed to be correct.
Another important column is category. There are a bunch of categories like Intro, Outro, Filler, etc. For this analysis, I'll only work with Sponsor and Self-Promo.
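If you're curious what else is in there, a quick peek at the category counts looks something like this (just an exploratory sketch, using the connection created above):
con.sql("""
    SELECT category, COUNT(*) AS segments
    FROM sponsor_times
    GROUP BY category
    ORDER BY segments DESC
""").show()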
I started by applying some filters:
CREATE TABLE filtered AS
SELECT
*
FROM sponsor_times
WHERE category IN ('sponsor', 'selfpromo') AND (votes > 0 OR locked=1)
Filtering for locked segments or segments with more than 0 votes was a big decision. It reduced the dataset by a huge percentage, but doing so made the data much more reliable. For example, before this filter, the entire Top 50 of channels with the highest percentage of ads was just spam: random channels that ran 99.9% ads.
With this done, the next step is to get a dataset where each sponsor segment shows up only once. For example, a video with a sponsor segment at the beginning and another at the end should have only two rows of data.
This is very much not the case so far, since a single video can have multiple user-submitted entries for each segment. To handle this, I'll use window functions to identify whether two or more rows represent the same segment.
The first window function compares the startTime of one row with the endTime of the previous one. If these values don't overlap, they're entries for separate segments; otherwise, they're repeated entries for the same segment.
CREATE TABLE new_segments AS
SELECT
-- Coalesce to TRUE to handle the first row of each window:
-- its LAG value is NULL, but it should still count as a new segment.
COALESCE(startTime > LAG(endTime)
OVER (PARTITION BY videoID ORDER BY startTime), true)
AS new_ad_segment,
*
FROM filtered

The new_ad_segment column is TRUE whenever a row represents a new segment of a video. The first two rows, since their timestamps overlap, are correctly marked as the same segment.
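If the window function feels abstract, here's a tiny self-contained sketch of the same idea on made-up timestamps (the video ID and times below are invented purely for illustration):
import duckdb

duckdb.sql("""
    WITH toy(videoID, startTime, endTime) AS (
        VALUES ('abc', 10.0, 40.0),   -- first submission of segment 1
               ('abc', 12.0, 41.0),   -- overlapping duplicate of segment 1
               ('abc', 300.0, 360.0)  -- a separate segment later in the video
    )
    SELECT
        videoID, startTime, endTime,
        COALESCE(startTime > LAG(endTime)
            OVER (PARTITION BY videoID ORDER BY startTime), true) AS new_ad_segment
    FROM toy
""").show()
The duplicate second row comes out as new_ad_segment = false, while the first and third rows are flagged as new segments.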
Next up, the second window function will label each ad segment by number:
CREATE TABLE ad_segments AS
SELECT
SUM(new_ad_segment)
OVER (PARTITION BY videoID ORDER BY startTime)
AS ad_segment,
*
FROM new_segments

Finally, now that every segment is properly numbered, it's easy to keep only the entry that is either locked or has the highest number of votes.
CREATE TABLE unique_segments AS
SELECT DISTINCT ON (videoID, ad_segment)
*
FROM ad_segments
ORDER BY videoID, ad_segment, locked DESC, votes DESC

That's it! Now this table has one row for each unique ad segment, and I can start exploring the data.
If these queries feel complicated and you want a refresher on window functions, check out this blog post, which will teach you everything you need to know about them! The last example covered there is almost exactly the process I used here.
Exploring and Enhancing the Data
Finally, the dataset is good enough to start exploring. The first thing I did was get a sense of the size of the data (a sketch of the query behind these counts follows the list):
- 36.0k Unique Channels
- 552.6k Unique Videos
- 673.8k Unique Sponsor Segments, for an average of 1.22 segments per video
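These counts come from simple aggregations along the lines of the sketch below (not necessarily the exact query I ran, which lives in the repo):
con.sql("""
    SELECT
        COUNT(DISTINCT video_info.channelID) AS unique_channels,
        COUNT(DISTINCT unique_segments.videoID) AS unique_videos,
        COUNT(*) AS unique_sponsor_segments
    FROM unique_segments
    LEFT JOIN video_info
        ON unique_segments.videoID = video_info.videoID
""").show()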
As mentioned earlier, filtering for segments that were either locked or had at least 1 upvote reduced the dataset massively, by around 80%. But that's the price I had to pay to get data I could work with.
To check that there's nothing immediately wrong with the data, I gathered the channels that have the most videos:
CREATE TABLE top_5_channels AS
SELECT
channelID,
count(DISTINCT unique_segments.videoID) AS video_count
FROM
unique_segments
LEFT JOIN video_info ON unique_segments.videoID = video_info.videoID
WHERE
channelID IS NOT NULL
-- Some channel IDs are blank
AND channelID != '""'
GROUP BY
channelID
ORDER BY
video_count DESC
LIMIT 5

The number of videos per channel looks realistic… but this is terrible to work with. I don't want to go to my browser and look up channel IDs every time I want to know a channel's name.
To fix this, I created a small Python script with functions to fetch these values from the YouTube API. I'm using the requests_cache library to make sure I don't repeat API calls and deplete my API quota.
import requests
import requests_cache
from dotenv import load_dotenv
import os
load_dotenv()
API_KEY = os.getenv("YT_API_KEY")
# Cache responses indefinitely
requests_cache.install_cache("youtube_cache", expire_after=None)
def get_channel_name(channel_id: str) -> str:
url = (
f"https://www.googleapis.com/youtube/v3/channels"
f"?part=snippet&id={channel_id}&key={API_KEY}"
)
response = requests.get(url)
data = response.json()
try:
return data.get("items", [])[0].get("snippet", {}).get("title", "")
except (IndexError, AttributeError):
return ""
Besides this, I also created very similar functions to get the country and thumbnail of each channel, which will be useful later. If you're interested in the code, check the GitHub repo.
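For completeness, here's a minimal sketch of what those helpers could look like. The function names and fields below are my own, based on the same snippet part of the YouTube channels endpoint; the real implementations in the repo may differ:
def get_channel_country(channel_id: str) -> str:
    # Hypothetical helper, mirroring get_channel_name above
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    data = requests.get(url).json()
    try:
        return data.get("items", [])[0].get("snippet", {}).get("country", "")
    except (IndexError, AttributeError):
        return ""

def get_channel_thumbnail(channel_id: str) -> str:
    # Hypothetical helper: the default thumbnail URL, if present
    url = (
        f"https://www.googleapis.com/youtube/v3/channels"
        f"?part=snippet&id={channel_id}&key={API_KEY}"
    )
    data = requests.get(url).json()
    try:
        snippet = data.get("items", [])[0].get("snippet", {})
        return snippet.get("thumbnails", {}).get("default", {}).get("url", "")
    except (IndexError, AttributeError):
        return ""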
In my DuckDB code, I can now register these Python functions and call them inside SQL! I just have to be very careful to always use them on aggregated and filtered data; otherwise, I can say bye-bye to my API quota.
# This is the script created above
from youtube_api import get_channel_name
# Try registering the function, ignore if already exists
try:
con.create_function('get_channel_name', get_channel_name, [str], str)
except Exception as e:
print(f"Skipping function registration (possibly already exists): {e}")
# Get the channel names
channel_names = con.sql("""
select
channelID,
get_channel_name(channelID) as channel_name,
video_count
from top_5_channels
""")

Much better! I looked up two channels I'm familiar with on YouTube for a quick sanity check. Linus Tech Tips has a total of 7.2k videos uploaded, with 2.3k present in this dataset. Gamers Nexus has 3k videos, with 700 in the dataset. Looks good enough for me!
The last thing to do, before moving on to actually answering the questions I set out to answer, is to get an idea of the average duration of videos.

This matches my expectations, for the most part. I'm still a bit surprised by the number of 20-40-minute videos, as for many years the "meta" was to make 10-minute videos to maximize YouTube's own ads.
Also, I thought the video-duration buckets used in the previous graph were quite representative of how I think about video lengths, so I'll be sticking with them for the next sections.
For reference, this is the pandas code used to create those buckets:
import pandas as pd

# Distinct videos with a valid duration
video_lengths = con.sql("""
SELECT DISTINCT ON (videoID)
videoID,
videoDuration
FROM
unique_segments
WHERE
videoID IS NOT NULL
AND videoDuration > 0
"""
).df()
# Define custom bins, in minutes
bins = [0, 3, 7, 12, 20, 40, 90, 180, 600, 9999999]
labels = ["0-3", "3-7", "7-12", "12-20", "20-40", "40-90", "90-180", "180-600", "600+"]
# Assign each video to a bucket (convert duration to minutes)
video_lengths["duration_bucket"] = pd.cut(video_lengths["videoDuration"] / 60, bins=bins, labels=labels, right=False)
Now for the big question. This will show whether I'm being paranoid or not about everyone trying to sell me something at all times. I'll start, though, by answering a simpler question: what percentage of a video's runtime is sponsored, for different video durations?
My expectation is that shorter videos have a higher share of their runtime taken up by sponsors compared to longer videos. Let's check if this is actually the case.
CREATE TABLE video_total_ads AS
SELECT
videoID,
MAX(videoDuration) AS videoDuration,
SUM(endTime - startTime) AS total_ad_duration,
SUM(endTime - startTime) / 60 AS ad_minutes,
SUM(endTime - startTime) / MAX(videoDuration) AS ad_percentage,
MAX(videoDuration) / 60 AS video_duration_minutes
FROM
unique_segments
WHERE
videoDuration > 0
AND videoDuration < 5400
AND videoID IS NOT NULL
GROUP BY
videoID
To keep the visualization simple, I'm applying similar buckets, but only up to 90 minutes.
# Define duration buckets (in minutes, up to 90 min)
bins = [0, 3, 7, 12, 20, 30, 40, 60, 90]
labels = ["0-3", "3-7", "7-12", "12-20", "20-30", "30-40", "40-60", "60-90"]
video_total_ads = video_total_ads.df()
# Apply the buckets again
video_total_ads["duration_bucket"] = pd.cut(video_total_ads["videoDuration"] / 60, bins=bins, labels=labels, right=False)
# Group by bucket and sum ad times and total durations
bucket_data = video_total_ads.groupby("duration_bucket")[["ad_minutes", "videoDuration"]].sum()
# Convert to percentage of total video time
bucket_data["ad_percentage"] = (bucket_data["ad_minutes"] / (bucket_data["videoDuration"] / 60)) * 100
bucket_data["video_percentage"] = 100 - bucket_data["ad_percentage"]

As expected, if you're watching shorter-form content on YouTube, around 10% of it is sponsored! Videos of 12–20 min in duration have 6.5% sponsor time, while 20–30 min videos have only 4.8%.
To move forward with the year-by-year analysis, I need to join the sponsor times with the videoInfo table.
CREATE TABLE video_total_ads_joined AS
SELECT
*
FROM
video_total_ads
LEFT JOIN video_info ON video_total_ads.videoID = video_info.videoID
Next, let's check how many videos we have per year:
SELECT
*,
to_timestamp(NULLIF (published, 0)) AS published_date,
extract(year FROM to_timestamp(NULLIF (published, 0))) AS published_year
FROM
video_total_ads_joined

Not good, not good at all. I'm not exactly sure why, but a lot of videos don't have their timestamp recorded. It seems that only in 2021 and 2022 were videos reliably stored with their published date.
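To make the gap concrete, this is roughly how the videos per publication year can be counted (a quick sketch on top of the joined table; the NULL year is the group of videos missing a date):
con.sql("""
    SELECT
        extract(year FROM to_timestamp(NULLIF(published, 0))) AS published_year,
        COUNT(*) AS videos
    FROM video_total_ads_joined
    GROUP BY published_year
    ORDER BY published_year NULLS FIRST
""").show()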
I do have some ideas on how to improve this dataset with other public data, but it's a very time-consuming process, so I'll leave it for a future blog post. I don't intend to settle for an answer based on limited data, but for now, I'll have to make do with what I have.
I chose to keep the analysis between 2018 and 2023, given that those years had more data points.
# Limiting to the years that have a fair amount of data
start_year = 2018
end_year = 2023
plot_df = (
video_total_ads_joined.df()
.query(f"published_year >= {start_year} and published_year <= {end_year}")
.groupby(["published_year", "duration_bucket"], as_index=False)
[["ad_minutes", "video_duration_minutes"]]
.sum()
)
# Calculate ad_percentage & content_percentage
plot_df["ad_percentage"] = (
plot_df["ad_minutes"] / plot_df["video_duration_minutes"] * 100
)
plot_df["content_percentage"] = 100 - plot_df["ad_percentage"]

There is a steep increase in ad percentage, especially from 2020 to 2021, but afterward it plateaus, especially for longer videos. This makes a lot of sense, since online advertising grew a lot during those years as people spent more and more time at home.
For shorter videos, there does appear to be an increase from 2022 to 2023. But as the data is limited, and I don't have data for 2024, I can't get a conclusive answer on this.
Next up, let's move to questions that don't depend on the publishing date; this way I can work with a larger portion of the dataset.
This is a fun one for me, as I wonder whether the channels I actively watch are the ones that run the most ads.
Continuing from the table created previously, I can easily group the ad and video minutes by channel:
CREATE TABLE ad_percentage_per_channel AS
SELECT
channelID,
sum(ad_minutes) AS channel_total_ad_minutes,
sum(videoDuration) / 60 AS channel_total_video_minutes,
-- Fraction of the channel's total runtime that is sponsored
sum(ad_minutes) / (sum(videoDuration) / 60) AS channel_ad_percentage
FROM
video_total_ads_joined
GROUP BY
channelID
I decided to filter for channels that had at least 1,800 minutes (30 hours) of video in the data, as a way of eliminating outliers.
SELECT
channelID,
channel_total_video_minutes,
channel_total_ad_minutes,
channel_ad_percentage
FROM
ad_percentage_per_channel
WHERE
-- At least 1,800 minutes of video
channel_total_video_minutes > 1800
AND channelID IS NOT NULL
ORDER BY
channel_ad_percentage DESC
LIMIT 50
As briefly mentioned earlier, I also created functions to get the country and thumbnail of each channel. This allowed me to create this visualization.

I'm not sure whether this surprised me or not. Some of the channels on this list I watch very frequently, especially Gaveta (#31), a Brazilian YouTuber who covers movies and film editing.
I also know that both he and Corridor Crew (#32) do a lot of self-sponsoring, promoting their own content and products, so maybe this is also the case for other channels!
In any case, the data seems good, and the percentages appear to match my manual checks and personal experience.
I would love to know whether channels that you watch are present on this list, and whether it surprised you or not!
If you want to see the Top 150 creators, subscribe to my free newsletter, as I'll be publishing the full list, as well as more details about this analysis, there!
Have you ever thought about at which point of a video ads work best? People probably just skip sponsor segments placed at the beginning, and simply close the video for those placed at the end.
From personal experience, I feel I'm more likely to watch an ad if it plays around the middle of a video, but I don't think that's what creators generally do.
My goal, then, is to create a heatmap that shows the density of ads throughout a video's runtime. Doing this was surprisingly non-obvious, and the solution I found was so clever that it kind of blew my mind. Let me show you.
This is the data needed for this analysis. One row per ad, with the timestamps where each segment starts and ends:

The first step is to normalize the intervals. For example, I don't care that an ad started at 63s; what I want to know is whether it started at 1% of the video's runtime or at 50% of it.
CREATE TABLE ad_intervals AS
SELECT
videoID,
startTime,
endTime,
videoDuration,
startTime / videoDuration AS start_fraction,
endTime / videoDuration AS end_fraction
FROM
unique_segments
WHERE
-- Just to make sure we don't have bad data
videoID IS NOT NULL
AND startTime >= 0
AND endTime <= videoDuration
AND startTime < endTime
-- Less than 40 hours
AND videoDuration < 144000

Great, now all intervals are comparable, but the problem is far from solved.
I want you to think: how would you solve this? What if I asked you, "At 10% of the runtime, across all videos, how many ads are running?"
I don't think this is an obvious problem to solve. My first instinct was to create a bunch of buckets and then, for each row, ask: "Is there an ad running at 1% of the runtime? What about at 2%? And so on…"
This seemed like a terrible idea, though. I wouldn't be able to do it in SQL, and the code to solve it would be incredibly messy. In the end, the solution I found was remarkably simple to implement, using the Sweep Line Algorithm, which is often used in programming interviews and puzzles.
I'll show you how I solved it, but don't worry if you don't understand what's going on. I'll share other resources for you to learn more about it later on.
The first thing to do is to transform each interval (startTime, endTime) into two events: one that counts as +1 when the ad starts, and another that counts as -1 when the ad ends. Afterward, just order the dataset by the start fraction.
CREATE TABLE ad_events AS
WITH unioned as (
-- This is the most important step.
SELECT
videoID,
start_fraction as fraction,
1 as delta
FROM ad_intervals
UNION ALL
SELECT
videoID,
end_fraction as fraction,
-1 as delta
FROM ad_intervals
), ordered AS (
SELECT
videoID,
fraction,
delta
FROM unioned
ORDER BY fraction, delta
)
SELECT * FROM ordered

Now it's already much easier to see the path forward! All I have to do is compute a running sum over the delta column, and then, at any point of the dataset, I know how many ads are running!
For example, if from 0s to 10s three ads started, but two of them also finished, I would have deltas of +3 and then -2, which means that there is only one ad currently running!
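If you'd rather see the trick outside of SQL first, here is a tiny pure-Python sketch of the same sweep line idea, using made-up, already-normalized intervals:
# Toy sweep line: how many ads are running at each point of the runtime?
intervals = [(0.00, 0.05), (0.02, 0.06), (0.45, 0.55), (0.95, 1.00)]

# Each interval becomes two events: +1 when an ad starts, -1 when it ends
events = sorted(
    [(start, +1) for start, _ in intervals] +
    [(end, -1) for _, end in intervals]
)

running = 0
for fraction, delta in events:
    running += delta
    print(f"at {fraction:.2f} of the runtime -> {running} ad(s) running")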
Going forward, and to simplify the data a bit, I first round the fractions to 4 decimal places and aggregate them. This isn't strictly necessary, but having too many rows was a problem when trying to plot the data. Finally, I divide the number of running ads by the total number of videos to express it as a percentage.
CREATE TABLE ad_counter AS
WITH rounded_and_grouped AS (
SELECT
ROUND(fraction, 4) as fraction,
SUM(delta) as delta
FROM ad_events
GROUP BY ROUND(fraction, 4)
ORDER BY fraction
), running_sum AS (
SELECT
fraction,
SUM(delta) OVER (ORDER BY fraction) as ad_counter
FROM rounded_and_grouped
), density AS (
SELECT
fraction,
ad_counter,
ad_counter / (SELECT COUNT(DISTINCT videoID) FROM ad_intervals) as density
FROM running_sum
)
SELECT * FROM density

With this data, not only do I know that at the very beginning of videos (0.0% fraction) there are 69,987 videos running ads, I also know that this represents 17% of all videos in the dataset.
Now I can finally plot it as a heatmap:

As expected, the bumps at the extremities show that it's far more common for channels to run ads at the beginning and end of a video. It's also interesting that there's a plateau around the middle of the video, followed by a drop, as the second half of the video is generally more ad-free.
What I found funny is that it's apparently common for some videos to start right away with an ad. I couldn't picture this, so I manually checked 10 videos, and it's actually true… I'm not sure how representative this is, but most of the ones I opened were gaming-related and in Russian, and they started directly with ads!
Before we move on to the conclusions, what did you think of the solution to this problem? I was surprised at how easy it was with the Sweep Line trick. If you want to know more about it, I recently published a blog post covering some SQL patterns, and the last one is exactly this problem, just repackaged in the context of counting concurrent meetings.
Conclusion
I really enjoyed doing this analysis, since the data feels very personal to me, especially because I've been hooked on YouTube lately. I also feel that the answers I found were quite satisfactory, at least for the most part. To finish it off, let's do a final recap!
Have Sponsor Segments Increased Over the Years?
There was a clear increase from 2020 to 2021. This was an effect that happened across all digital media, and it's clearly shown in this data. In more recent years, I can't say whether there was an increase or not, as I don't have enough data to be confident.
Which Channels Have the Highest Percentage of Sponsor Time Per Video?
I got to create a pretty convincing list of the Top 50 channels that run the highest amount of ads. And I found that some of my favorite creators are the ones that spend the most time trying to sell me something!
What Is the Density of Sponsor Segments Throughout a Video?
As expected, most creators run ads at the beginning and the end of their videos. Besides this, a lot of creators run ads around the middle of the video, making the second half slightly more ad-free.
Also, there are YouTubers who start a video immediately with ads, which I think is a crazy strategy.
Other Learnings and Next Steps
I liked how clearly the data showed the share of ads for different video lengths. Now I know that I'm probably spending 5–6% of my time on YouTube watching ads if I'm not skipping them, since I mostly watch videos that are 10–20 min long.
I'm still not fully happy with the year-by-year analysis, though. I've already looked into other data and downloaded more than 100 GB of YouTube metadata datasets. I'm confident that I can use them, along with the YouTube API, to fill some gaps and get a more convincing answer to my question.
Visualization Code
You might have noticed that I didn't provide snippets for plotting the charts shown here. This was on purpose, to keep the blog post readable, as matplotlib code takes up a lot of space.
You can find all of the code in my GitHub repo, so you can copy my charts if you want to.
That's it for this one! I really hope you enjoyed reading this blog post and learned something new!
If you're curious about interesting topics that didn't make it into this post, or enjoy learning about data, subscribe to my free newsletter on Substack. I publish whenever I have something genuinely interesting to share.
Want to connect directly, or have questions? Reach out anytime at mtrentz.com.