Automate Video Chaptering with LLMs and TF-IDF


The key steps in the workflow lie in structuring the transcript into paragraphs (step 2) before grouping the paragraphs into chapters, from which a table of contents is derived (step 4). Note that these two steps may rely on different LLMs: a fast and low-cost LLM such as Llama 3 8B for the simple task of text editing and paragraph identification, and a more sophisticated LLM such as GPT-4o-mini for the generation of the table of contents. In between, TF-IDF is used to add timestamp information back to the structured paragraphs.

The remainder of the post describes each step in more detail.

Take a look at the accompanying GitHub repository and Colab notebook to explore on your own!

Let us use as an example the first lecture of the course ‘MIT 6.S191: Introduction to Deep Learning’ (IntroToDeepLearning.com) by Alexander Amini and Ava Amini (licensed under the MIT License).

Screenshot of the course YouTube page. Course material is under an MIT licence.

Note that chapters are already provided in the video description.

Chaptering made available in the YouTube description

This provides us with a baseline against which to qualitatively compare our chaptering later in this post.

YouTube transcript API

For YouTube videos, an automatically generated transcript is often made available by YouTube. A convenient way to retrieve that transcript is to call the get_transcript method of the Python youtube_transcript_api library. The method takes the YouTube video_id as argument:

# https://www.youtube.com/watch?v=ErnWZxJovaM
video_id = "ErnWZxJovaM" # MIT Introduction to Deep Learning - 2024

# Retrieve transcript with the youtube_transcript_api library
from youtube_transcript_api import YouTubeTranscriptApi
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])

This returns the transcript as a list of text and timestamp key-value pairs:

[{'text': '[Music]', 'start': 1.17}, 
{'text': 'good afternoon everyone and welcome to', 'start': 10.28},
{'text': 'MIT sus1 191 my name is Alexander amini', 'start': 12.88},
{'text': "and I will be one among your instructors for", 'start': 16.84},
...]

The transcript is however poorly formatted: it lacks punctuation and contains typos (‘MIT sus1 191’ instead of ‘MIT 6.S191’, or ‘amini’ instead of ‘Amini’).

Speech-to-text with Whisper

Alternatively, a speech-to-text library can be used to infer the transcript from a video or audio file. We recommend using faster-whisper, which is a fast implementation of the state-of-the-art open-source Whisper model.

The models come in different sizes. The most accurate is ‘large-v3’, which is able to transcribe about 15 minutes of audio per minute on a T4 GPU (available for free on Google Colab).

import torch
from faster_whisper import WhisperModel

# Load the Whisper model, on GPU if available
# Note: compute_type="float16" requires a GPU; use "int8" on CPU
whisper_model = WhisperModel("large-v3",
                             device="cuda" if torch.cuda.is_available() else "cpu",
                             compute_type="float16",
                             )

# Call the Whisper transcribe function on the audio file
initial_prompt = "Use punctuation, like this."
segments, transcript_info = whisper_model.transcribe(audio_file, initial_prompt=initial_prompt, language="en")

The result of the transcription is provided as segments, which can easily be converted into a list of text and timestamps, as with the youtube_transcript_api library.
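
For example, a minimal conversion sketch (assuming the segments generator returned by the transcribe call above):

# Convert the generator of Segment objects into the same list of
# text/timestamp dictionaries as returned by youtube_transcript_api
transcript = [{'start': round(segment.start, 2), 'text': segment.text.strip()} for segment in segments]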

Tip: Whisper may sometimes omit punctuation. The initial_prompt argument can be used to nudge the model to add punctuation, by providing a small sentence containing punctuation.

Below is an excerpt of the transcription of our video example with Whisper large-v3:

[{'start': 0.0, 'text': ' Good afternoon, everyone, and welcome to MIT Success 191.'},
{'start': 15.28, 'text': " My name is Alexander Amini, and I'll be one of your instructors for the course this year"},
{'start': 19.32, 'duration': 2.08, 'text': ' along with Ava.'}
...]

Note that, compared with the YouTube transcription, punctuation is added. Some transcription errors however still remain (‘MIT Success 191’ instead of ‘MIT 6.S191’).

Once a transcript is available, the second stage consists in editing and structuring the transcript into paragraphs.

Transcript editing refers to changes made to improve readability. This involves, for instance, adding punctuation if it is missing, correcting grammatical errors, removing verbal tics, etc.

The structuring into paragraphs also improves readability, and additionally serves as a preprocessing step for identifying chapters in stage 4, since chapters will be formed by grouping paragraphs together.

Paragraph editing and structuring can be carried out in a single operation, using an LLM. We illustrate below the expected result of this stage:

Left: Raw transcript. Right: Edited and structured transcript.

This task does not require a very sophisticated LLM, since it mostly consists in reformulating content. At the time of writing, decent results could be obtained with, for example, GPT-4o-mini or Llama 3 8B, and the following system prompt:

You are a helpful assistant.

Your task is to improve the user input’s readability: add punctuation if needed and remove verbal tics, and structure the text in paragraphs separated with '\n\n'.

Keep the wording as faithful as possible to the original text.

Put your answer within <edited_text> tags.

We rely on an OpenAI-compatible chat completion API for LLM calling, with messages having the roles of either ‘system’, ‘user’ or ‘assistant’. The code below illustrates the instantiation of an LLM client with Groq, using Llama 3 8B:

from groq import Groq

# Connect to Groq with a Groq API key
llm_client = Groq(api_key=api_key)
model = "llama3-8b-8192"

# Extract text from transcript
transcript_text = ' '.join([s['text'] for s in transcript])

# Call LLM
response = llm_client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": transcript_text
        }
    ],
    model=model,
    temperature=0,
    seed=42
)

Given a piece of raw ‘transcript_text’ as input, this returns an edited piece of text within <edited_text> tags:

response_content = response.choices[0].message.content

print(response_content)
"""

Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I will be one among your instructors for the course this 12 months, together with Ava. We're really excited to welcome you to this incredible course.

It is a fast-paced and intense one-week course that we're about to undergo together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more.

Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and annually, this lecture is getting harder and harder to show since it's alleged to cover the foundations of the sector.
"""

Let us then extract the edited text from the tags, divide it into paragraphs, and structure the result as a JSON dictionary consisting of paragraph numbers and pieces of text:

import re

# Extract the edited text from between the <edited_text> tags
pattern = re.compile(r'<edited_text>(.*?)</edited_text>', re.DOTALL)
response_content_edited = pattern.findall(response_content)[0]

# Split the edited text into paragraphs and number them
paragraph_texts = response_content_edited.strip().split('\n\n')
paragraphs = [{'paragraph_number': i, 'paragraph_text': text} for i, text in enumerate(paragraph_texts)]

print(paragraphs)

[{'paragraph_number': 0,
'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course."},
{'paragraph_number': 1,
'paragraph_text': "This is a fast-paced and intense one-week course that we're about to go through together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more."},
{'paragraph_number': 2,
'paragraph_text': "Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and each year, this lecture is getting harder and harder to teach because it's supposed to cover the foundations of the field."}]

Note that the input should not be too long, as the LLM will otherwise ‘forget’ part of the text. For long inputs, the transcript must be split into chunks to improve reliability. We noticed that GPT-4o-mini handles up to 5000 characters well, while Llama 3 8B can only handle up to 1500 characters. The notebook provides the function transcript_to_paragraphs, which takes care of splitting the transcript into chunks.
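
For illustration, a naive character-budget chunking helper (hypothetical, and simpler than the notebook implementation) could look as follows:

def split_text_in_chunks(text, chunk_size=1500):
    # Naively split text into chunks of at most chunk_size characters,
    # cutting at word boundaries only
    chunks, current = [], ''
    for word in text.split():
        candidate = (current + ' ' + word).strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks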

The transcript is now structured as a list of edited paragraphs, but the timestamps have been lost in the process.

The third stage consists in adding the timestamps back, by inferring which segment in the raw transcript is the closest to each paragraph.

TF-IDF is used to find which raw transcript segment (right) best matches the beginning of each edited paragraph (left).

We rely for this task on the TF-IDF metric. TF-IDF stands for term frequency–inverse document frequency and is a similarity measure for comparing two pieces of text. The measure works by counting the number of similar words, giving more weight to words that appear less frequently.
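
For reference, sklearn's default, smoothed variant of the metric (which we use below) weighs a term t in a document d as tf-idf(t, d) = tf(t, d) × idf(t), with idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. The resulting vectors are L2-normalized before cosine similarities are computed.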

As a preprocessing step, we adjust the transcript segments and paragraph beginnings so that they contain the same number of words. The text pieces should be long enough so that paragraph beginnings can be successfully matched to a unique transcript segment. We find that using 50 words works well in practice.
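
The transform_text_segments function is provided in the notebook; a simplified, hypothetical sketch of its logic is shown below (each text piece is rebuilt from the num_words words that follow the segment's first word):

def transform_text_segments(segments, num_words=50):
    # Concatenate all words, remembering where each segment starts
    all_words = [word for segment in segments for word in segment['text'].split()]
    first_word_indices, counter = [], 0
    for segment in segments:
        first_word_indices.append(counter)
        counter += len(segment['text'].split())
    # Rebuild each segment as a piece of num_words words
    return [' '.join(all_words[i:i + num_words]) for i in first_word_indices]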


num_words = 50

# Rebuild the transcript segments and paragraph beginnings as
# text pieces of num_words words each
transcript_num_words = transform_text_segments(transcript, num_words=num_words)

paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']} for p in paragraphs]
paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)

We then rely on the sklearn library and its TfidfVectorizer and cosine_similarity functions to run TF-IDF and compute similarities between each paragraph beginning and transcript segment. Below is an example of code for finding the best match index in the transcript segments for the first paragraph:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Paragraph for which to find the timestamp
paragraph_i = 0

# Create a TF-IDF vectorizer and vectorize the transcript segments and paragraphs
vectorizer = TfidfVectorizer().fit_transform(transcript_num_words + paragraphs_num_words)
# Get the TF-IDF vectors for the transcript and the paragraphs
vectors = vectorizer.toarray()
# Extract the TF-IDF vector for the paragraph
paragraph_vector = vectors[len(transcript_num_words) + paragraph_i]

# Calculate the cosine similarity between the paragraph vector and each transcript chunk
similarities = cosine_similarity(vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1))
# Find the index of the most similar chunk
best_match_index = int(np.argmax(similarities))

We wrapped the process in an add_timestamps_to_paragraphs function, which adds timestamps to the paragraphs, along with the matched segment index and text.
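
A hypothetical sketch of such a wrapper, reusing the imports above (the notebook implementation may differ in its details):

def add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50):
    # Vectorize the transcript segments and paragraph beginnings together
    transcript_num_words = transform_text_segments(transcript, num_words=num_words)
    paragraphs_start_text = [{'start': p['paragraph_number'], 'text': p['paragraph_text']} for p in paragraphs]
    paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)
    vectors = TfidfVectorizer().fit_transform(transcript_num_words + paragraphs_num_words).toarray()
    transcript_vectors = vectors[:len(transcript_num_words)]
    # Match each paragraph beginning to its most similar transcript segment
    for i, paragraph in enumerate(paragraphs):
        paragraph_vector = vectors[len(transcript_num_words) + i]
        similarities = cosine_similarity(transcript_vectors, paragraph_vector.reshape(1, -1))
        best_match_index = int(np.argmax(similarities))
        paragraph['matched_index'] = best_match_index
        paragraph['matched_text'] = transcript[best_match_index]['text']
        paragraph['start_time'] = int(transcript[best_match_index]['start'])
    return paragraphs

The function is called as follows: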

paragraphs = add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50)

# Example of output for the first paragraph:
print(paragraphs[0])

{'paragraph_number': 0,
'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I will be one among your instructors for the course this 12 months, together with Ava. We're really excited to welcome you to this incredible course.",
'matched_index': 1,
'matched_text': 'good afternoon everyone and welcome to',
'start_time': 10}

In the example above, the first paragraph (numbered 0) is found to match transcript segment #1, which starts at time 10 (in seconds).

The table of contents is then obtained by grouping consecutive paragraphs into chapters and identifying meaningful chapter titles. The task is again carried out by an LLM, which is instructed to transform an input consisting of a list of JSON paragraphs into an output consisting of a list of JSON chapter titles with their starting paragraph numbers:

system_prompt_paragraphs_to_toc = """

You are a helpful assistant.

You are given a transcript of a course in JSON format as a list of paragraphs, each containing 'paragraph_number' and 'paragraph_text' keys.

Your task is to group consecutive paragraphs into chapters for the course and identify meaningful chapter titles.

Here are the steps to follow:

1. Read the transcript carefully to understand its general structure and the main topics covered.
2. Look for clues that a new chapter is about to start. This could be a change of topic, a change of time or setting, the introduction of new themes or topics, or the speaker's explicit mention of a new part.
3. For each chapter, keep track of the paragraph number that starts the chapter and identify a meaningful chapter title.
4. Chapters should ideally be equally spaced throughout the transcript, and each should discuss a specific topic.

Format your result in JSON, with a list of dictionaries for chapters, each with 'start_paragraph_number':integer and 'title':string as key:value pairs.

Example:
{"chapters":
[{"start_paragraph_number": 0, "title": "Introduction"},
{"start_paragraph_number": 10, "title": "Chapter 1"}
]
}
"""

An important element is to specifically ask for a JSON output, which increases the chances of getting a correctly formatted answer that can later be loaded back into Python.

GPT-4o-mini is used for this task, as it is cheaper than OpenAI's GPT-4o and generally provides good results. The instructions are provided through the ‘system’ role, and the paragraphs are provided in JSON format through the ‘user’ role.

import json
from openai import OpenAI

# Connect to OpenAI with an OpenAI API key
llm_client_get_toc = OpenAI(api_key=api_key)
model_get_toc = "gpt-4o-mini-2024-07-18"

# Dump the JSON paragraphs as text
paragraphs_number_text = [{'paragraph_number': p['paragraph_number'], 'paragraph_text': p['paragraph_text']} for p in paragraphs]
paragraphs_json_dump = json.dumps(paragraphs_number_text)

# Call LLM
response = llm_client_get_toc.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": system_prompt_paragraphs_to_toc
        },
        {
            "role": "user",
            "content": paragraphs_json_dump
        }
    ],
    model=model_get_toc,
    temperature=0,
    seed=42
)
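
Tip: with the OpenAI API, the chances of getting valid JSON can be further increased by enabling JSON mode through the response_format parameter, which constrains the completion to be syntactically valid JSON (the messages must then mention 'JSON', which our system prompt does). A sketch of the same call with JSON mode enabled:

response = llm_client_get_toc.chat.completions.create(
    messages=[
        {"role": "system", "content": system_prompt_paragraphs_to_toc},
        {"role": "user", "content": paragraphs_json_dump}
    ],
    model=model_get_toc,
    # JSON mode: guarantees a syntactically valid JSON completion
    response_format={"type": "json_object"},
    temperature=0,
    seed=42
)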

Et voilà! The call returns the list of chapter titles, along with their starting paragraph numbers, in JSON format:

print(response.choices[0].message.content)

{
  "chapters": [
    {
      "start_paragraph_number": 0,
      "title": "Introduction to the Course"
    },
    {
      "start_paragraph_number": 17,
      "title": "Foundations of Intelligence and Deep Learning"
    },
    {
      "start_paragraph_number": 24,
      "title": "Course Structure and Expectations"
    },
    ...
  ]
}
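
The answer can then be loaded back into a Python dictionary (a minimal sketch):

# Parse the JSON answer back into a Python dictionary
table_of_contents = json.loads(response.choices[0].message.content)
chapters = table_of_contents['chapters']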

As in step 2, the LLM may struggle with long inputs and dismiss part of the input. The solution again consists in splitting the input into chunks, which is implemented in the notebook with the paragraphs_to_toc function and its chunk_size parameter.

This last stage combines the paragraphs and the table of contents to create a structured JSON file with chapters, an example of which is provided in the accompanying GitHub repository.
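
A minimal sketch of this combination step is given below; the 'chapters' variable is the parsed table of contents from the previous stage, and the output field names are illustrative:

# Group paragraphs into chapters using the starting paragraph numbers
chaptered_transcript = []
for i, chapter in enumerate(chapters):
    start = chapter['start_paragraph_number']
    end = chapters[i + 1]['start_paragraph_number'] if i + 1 < len(chapters) else len(paragraphs)
    chapter_paragraphs = paragraphs[start:end]
    chaptered_transcript.append({
        'chapter_title': chapter['title'],
        'start_time': chapter_paragraphs[0]['start_time'],
        'paragraphs': [p['paragraph_text'] for p in chapter_paragraphs],
    })

# Save the structured result as a JSON file
with open('chaptered_transcript.json', 'w') as f:
    json.dump({'chapters': chaptered_transcript}, f, indent=2)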

We illustrate below the resulting chaptering (right), compared with the baseline chaptering that was available from the YouTube description (left):
