Using Gemini + Text to Speech + MoviePy to create a video, and what this says about what GenAI is becoming rapidly useful for
Like most everyone, I was flabbergasted by NotebookLM and its ability to generate a podcast from a set of documents. And then I got to thinking: “How do they do that, and where can I get some of that magic?” How easy would it be to replicate?
Goal: Create a video talk from an article
I don’t want to create a podcast, but I’ve often wished I could generate slides and a video talk from my blog posts. Some people prefer paging through slides, others prefer to watch videos, and this would be a way to meet them where they are. In this article, I’ll show you how to do that.
The full code for this article is on GitHub, in case you want to follow along with me. The goal is to create this video from this article:
1. Initialize the LLM
I’m going to use Google Gemini Flash because (a) it’s the least expensive frontier LLM today, (b) it’s multimodal in that it can read and understand images as well, and (c) it supports controlled generation, meaning that we can ensure the output of the LLM matches a desired structure.
import pdfkit
import os
import google.generativeai as genai
from dotenv import load_dotenv

# Load the API key from a local env file, then configure the client
load_dotenv("../genai_agents/keys.env")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
Note that I’m using Google Generative AI and not Google Cloud Vertex AI. The two packages are different. The Google one supports Pydantic objects for controlled generation; the Vertex AI one only supports JSON for now.
2. Get a PDF of the article
I used Python to download the article as a PDF, and upload it to a temporary storage location that Gemini can read:
ARTICLE_URL = "https://lakshmanok.medium...."
pdfkit.from_url(ARTICLE_URL, "article.pdf")
pdf_file = genai.upload_file("article.pdf")
Unfortunately, something about Medium prevents pdfkit from getting the images in the article (perhaps because they’re webm and not png …). So, my slides are going to be based on just the text of the article and not the images.
3. Create lecture notes in JSON
Here, the data format I want is a set of slides, each of which has a title, key points, and a set of lecture notes. The lecture as a whole has a title and an attribution as well.
from typing import List
from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    key_points: List[str]
    lecture_notes: str

class Lecture(BaseModel):
    slides: List[Slide]
    lecture_title: str
    based_on_article_by: str
Let’s tell Gemini what we want it to do:
lecture_prompt = """
You are a university professor who needs to create a lecture to
a class of undergraduate students.

* Create a 10-slide lecture based on the following article.
* Each slide should contain the following information:
  - title: a single sentence that summarizes the main point
  - key_points: a list of between 2 and 5 bullet points. Use phrases, not full sentences.
  - lecture_notes: 3-10 sentences explaining the key points in easy-to-understand language. Expand on the points using other information from the article.
* Also, create a title for the lecture and attribute the original article's author.
"""
The prompt is pretty straightforward — ask Gemini to read the article, extract key points and create lecture notes.
Now, invoke the model, passing in the PDF file and asking it to populate the desired structure:
model = genai.GenerativeModel(
    "gemini-1.5-flash-001",
    system_instruction=[lecture_prompt]
)
generation_config = {
    "temperature": 0.7,
    "response_mime_type": "application/json",
    "response_schema": Lecture
}
response = model.generate_content(
    [pdf_file],
    generation_config=generation_config,
    stream=False
)
A few things to note about the code above:
- We pass in the prompt as the system prompt, so that we don’t need to keep sending in the prompt with new inputs.
- We specify the desired response type as JSON, and the schema to be a Pydantic object.
- We send the PDF file to the model and tell it to generate a response. We’ll wait for it to complete (no need to stream).
The result is JSON, so extract it into a Python object:
import json
lecture = json.loads(response.text)
For example, this is what the third slide looks like:
{'key_points': [
    'Silver layer cleans, structures, and prepares data for self-service analytics.',
    'Data is denormalized and organized for easier use.',
    'Type 2 slowly changing dimensions are handled in this layer.',
    'Governance responsibility lies with the source team.'
 ],
 'lecture_notes': 'The silver layer takes data from the bronze layer and transforms it into a usable format for self-service analytics. This involves cleaning, structuring, and organizing the data. Type 2 slowly changing dimensions, which track changes over time, are also handled in this layer. The governance of the silver layer rests with the source team, which is typically the data engineering team responsible for the source system.',
 'title': 'The Silver Layer: Data Transformation and Preparation'
}
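Since the parsed result is a plain dictionary, it can be worth checking its shape before building slides from it. What follows is a minimal sketch using only the standard library. The helper name `check_lecture` and the sample data are hypothetical, and the 2-to-5 limit simply mirrors what the prompt asks for; it is not something Gemini guarantees.

```python
import json

REQUIRED_LECTURE_KEYS = {"slides", "lecture_title", "based_on_article_by"}
REQUIRED_SLIDE_KEYS = {"title", "key_points", "lecture_notes"}


def check_lecture(lecture: dict) -> list:
    """Return a list of readable problems; an empty list means the shape looks fine."""
    problems = []
    missing = REQUIRED_LECTURE_KEYS - lecture.keys()
    if missing:
        problems.append(f"lecture is missing {sorted(missing)}")
    for slideno, slide in enumerate(lecture.get("slides", []), start=1):
        missing = REQUIRED_SLIDE_KEYS - slide.keys()
        if missing:
            problems.append(f"slide {slideno} is missing {sorted(missing)}")
            continue
        # The prompt asked for between 2 and 5 key points per slide.
        if not 2 <= len(slide["key_points"]) <= 5:
            problems.append(
                f"slide {slideno} has {len(slide['key_points'])} key points"
            )
    return problems


# Hypothetical stand-in for response.text, just to exercise the check.
sample_text = json.dumps({
    "lecture_title": "Sample lecture",
    "based_on_article_by": "Sample author",
    "slides": [
        {
            "title": "A sample slide",
            "key_points": ["first point", "second point"],
            "lecture_notes": "Some notes that expand on the points.",
        }
    ],
})
sample_lecture = json.loads(sample_text)
print(check_lecture(sample_lecture))
```

A non-empty list is a cue to retry the generation or fix the entry by hand before moving on to the PowerPoint step.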
4. Convert to PowerPoint
We can use the Python package pptx to create a Presentation with notes and bullet points. The code to create a slide looks like this:
from pptx import Presentation

presentation = Presentation()
for slidejson in lecture['slides']:
    # layout 1 is the "Title and Content" layout in the default template
    slide = presentation.slides.add_slide(presentation.slide_layouts[1])
    title = slide.shapes.title
    title.text = slidejson['title']
    # bullets
    textframe = slide.placeholders[1].text_frame
    for key_point in slidejson['key_points']:
        p = textframe.add_paragraph()
        p.text = key_point
        p.level = 1
    # notes
    notes_frame = slide.notes_slide.notes_text_frame
    notes_frame.text = slidejson['lecture_notes']
presentation.save("lecture.pptx")
The result is a PowerPoint presentation that looks like this:
Not very fancy, but definitely a great starting point for editing if you are going to give a talk.
5. Read the notes aloud and save audio
Well, we were inspired by a podcast, so let’s see how to create just an audio of someone summarizing the article.
We already have the lecture notes, so let’s create audio files of each of the slides.
Here’s the code to take some text, and have an AI voice read it out. We save the resulting audio into an mp3 file:
from google.cloud import texttospeech

def convert_text_audio(text, audio_mp3file):
    """Synthesizes speech from the input string of text."""
    tts_client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = tts_client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )
    # The response's audio_content is binary.
    with open(audio_mp3file, "wb") as out:
        out.write(response.audio_content)
        print(f"{audio_mp3file} written.")
What’s happening in the code above?
- We’re using Google Cloud’s Text-to-Speech API.
- We ask it to use a standard US-accent female voice. If you were doing a podcast, you’d pass in a “speaker map” here, one voice for each speaker.
- We then give it the input text and ask it to generate audio.
- We save the audio as an mp3 file. Note that this has to match the audio encoding.
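To make the “speaker map” idea a little more concrete, here is one way it might look as plain data. This is only a sketch: the speaker labels, the helper names, and the second voice name are assumptions (only `en-US-Standard-C` appears in the code above), so check the voice list for your project before relying on them.

```python
# Hypothetical speaker map: speaker label -> voice settings.
# "en-US-Standard-C" comes from the code above; the second voice name is an
# assumption and should be checked against the available voices.
SPEAKER_MAP = {
    "host": {"language_code": "en-US", "name": "en-US-Standard-C"},
    "guest": {"language_code": "en-US", "name": "en-US-Standard-D"},
}


def voice_for(speaker, speaker_map=SPEAKER_MAP, default="host"):
    """Look up the voice settings for a speaker, falling back to the default."""
    return speaker_map.get(speaker, speaker_map[default])


def plan_voices(script):
    """Turn a list of (speaker, line) pairs into (voice_name, line) pairs."""
    return [(voice_for(speaker)["name"], line) for speaker, line in script]


# Hypothetical two-speaker script
script = [
    ("host", "Welcome to the show."),
    ("guest", "Thanks for having me."),
]
for voice_name, line in plan_voices(script):
    print(voice_name, "->", line)
```

Each dictionary holds the same `language_code` and `name` fields that `VoiceSelectionParams` takes in the function above, so a podcast version could synthesize one audio segment per line with the matching voice.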
Now, create audio by iterating through the slides, and passing in the lecture notes:
filenames = []
for slideno, slide in enumerate(lecture['slides']):
    # announce the slide title, then read the notes
    text = f"On to {slide['title']} \n"
    text += slide['lecture_notes'] + "\n\n"
    filename = os.path.join(outdir, f"audio_{slideno+1:02}.mp3")
    convert_text_audio(text, filename)
    filenames.append(filename)
The result is a bunch of audio files. You can concatenate them if you wish using pydub:
import pydub

combined = pydub.AudioSegment.empty()
for audio_file in audio_files:
    audio = pydub.AudioSegment.from_file(audio_file)
    combined += audio
    # pause for 4 seconds
    silence = pydub.AudioSegment.silent(duration=4000)
    combined += silence
combined.export("lecture.wav", format="wav")
But it turned out that I didn’t need to. The individual audio files, one for each slide, were what I needed to create a video. For a podcast, of course, you’d want a single mp3 or wav file.
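The per-slide narration text and the file naming used in the loop above can be pulled out into two small helpers, which makes them easy to check without calling the Text-to-Speech API. This is a sketch; the helper names `narration_for` and `audio_path` and the sample values are hypothetical.

```python
import os


def narration_for(slide: dict) -> str:
    """Build the text to be read aloud: announce the title, then read the notes."""
    return f"On to {slide['title']} \n" + slide["lecture_notes"] + "\n\n"


def audio_path(outdir: str, slideno: int) -> str:
    """Zero-padded, 1-based file name so that sorting by name keeps slide order."""
    return os.path.join(outdir, f"audio_{slideno + 1:02}.mp3")


# Hypothetical slide, just to exercise the helpers
sample_slide = {
    "title": "The Silver Layer",
    "lecture_notes": "The silver layer prepares data for analytics.",
}
print(narration_for(sample_slide))
print(audio_path("article_audio", 0))
```

The zero padding matters once there are ten or more slides: without it, a plain sort by name would put `audio_10.mp3` ahead of `audio_2.mp3`.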
6. Create images of the slides
Rather annoyingly, there’s no easy way to render PowerPoint slides as images using Python. You need a machine with Office software installed to do that, which is not the kind of thing that’s easily automatable. Maybe I should have used Google Slides … Anyway, a simple way to render images is to use the Python Imaging Library (PIL):
import textwrap
from PIL import Image, ImageDraw, ImageFont

def wrap(text, width):
    # assumed helper: break long text into lines of at most `width` characters
    return "\n".join(textwrap.wrap(text, width=width))

def text_to_image(output_path, title, keypoints):
    image = Image.new("RGB", (1000, 750), "black")
    draw = ImageDraw.Draw(image)
    title_font = ImageFont.truetype("Coval-Black.ttf", size=42)
    draw.multiline_text((10, 25), wrap(title, 50), font=title_font)
    text_font = ImageFont.truetype("Coval-Light.ttf", size=36)
    # one key point every 100 pixels, starting below the title
    for ptno, keypoint in enumerate(keypoints):
        draw.multiline_text((10, (ptno+2)*100), wrap(keypoint, 60), font=text_font)
    image.save(output_path)
The resulting image isn’t great, but it is serviceable (you can tell nobody pays me to write production code anymore):
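The layout logic in that function (wrap each piece of text, then place key points 100 pixels apart) can be looked at on its own, without fonts or PIL. The sketch below assumes that `wrap` joins the output of `textwrap.wrap` with newlines; the function name `layout` and the sample text are hypothetical.

```python
import textwrap


def wrap(text, width):
    """Assumed helper: break text into lines of at most `width` characters."""
    return "\n".join(textwrap.wrap(text, width=width))


def layout(title, keypoints, title_width=50, text_width=60):
    """Return ((x, y), text) pairs in drawing order, mirroring text_to_image."""
    items = [((10, 25), wrap(title, title_width))]
    for ptno, keypoint in enumerate(keypoints):
        # Same spacing as above: first key point at y=200, then every 100 pixels
        items.append(((10, (ptno + 2) * 100), wrap(keypoint, text_width)))
    return items


# Hypothetical slide content
for xy, text in layout(
    "The Silver Layer: Data Transformation and Preparation",
    ["Data is denormalized and organized for easier use.",
     "Governance responsibility lies with the source team."],
):
    print(xy, repr(text))
```

With five key points the last one starts at y=600, which still fits in the 750-pixel-high image, but a key point that wraps onto several lines may run into the one below it.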
7. Create a Video
Now that we have a set of audio files and a set of image files, we can use the Python package moviepy to create a video clip:
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

clips = []
for slide, audio in zip(slide_files, audio_files):
    audio_clip = AudioFileClip(f"article_audio/{audio}")
    # show each slide for as long as its narration lasts
    slide_clip = ImageClip(f"article_slides/{slide}").set_duration(audio_clip.duration)
    slide_clip = slide_clip.set_audio(audio_clip)
    clips.append(slide_clip)
full_video = concatenate_videoclips(clips)
And we can now write it out:
full_video.write_videofile("lecture.mp4", fps=24, codec="mpeg4",
                           temp_audiofile='temp-audio.mp4', remove_temp=True)
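One thing to watch in the loop above is that `zip` stops silently at the shorter list, so a missing slide image or audio file would simply produce a shorter video. A small guard like the following can catch that early. It is only a sketch, and the function name and file names are hypothetical.

```python
def pair_slides_with_audio(slide_files, audio_files):
    """Sort both lists by name and pair them up, refusing mismatched counts."""
    slide_files = sorted(slide_files)
    audio_files = sorted(audio_files)
    if len(slide_files) != len(audio_files):
        raise ValueError(
            f"{len(slide_files)} slide images but {len(audio_files)} audio files"
        )
    return list(zip(slide_files, audio_files))


# Hypothetical file names, deliberately out of order
pairs = pair_slides_with_audio(
    ["slide_02.png", "slide_01.png"],
    ["audio_01.mp3", "audio_02.mp3"],
)
for slide, audio in pairs:
    print(slide, "<->", audio)
```

Sorting by name only lines the two lists up if both use the same zero-padded numbering.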
End result? We have four artifacts, all created automatically from the article.pdf:
lecture.json  lecture.mp4  lecture.pptx  lecture.wav
These are:
- A JSON file with key points, lecture notes, etc.
- A PowerPoint file that you can modify. The slides have the key points, and the notes section of the slides has the “lecture notes”.
- An audio file consisting of an AI voice reading out the lecture notes.
- An mp4 movie (that I uploaded to YouTube) of the audio + images. This is the video talk that I set out to create.
Pretty cool, eh?
8. What this says about the future of software
We are all, as a community, probing around to find what this really cool technology (generative AI) can be used for. Obviously, you can use it to create content, but the content that it creates is good for brainstorming, not for use as-is. Three years of improvements in the tech have not solved the problem that GenAI generates blah content and not-ready-to-use code.
That brings us to some of the ancillary capabilities that GenAI has opened up. And these turn out to be extremely useful. There are four capabilities of GenAI that this post illustrates.
(1) Translating unstructured data to structured data
The Attention paper was written to solve the translation problem, and it turns out transformer-based models are really good at translation. We keep discovering use cases of this: not just Japanese to English, but also Java 11 to Java 17, text to SQL, text to speech, between database dialects, …, and now articles to audio scripts. This, it turns out, is the starting point of using GenAI to create podcasts, lectures, videos, etc.
All I had to do was to prompt the LLM to construct a series of slide contents (key points, title, etc.) from the article, and it did. It even returned the data to me in a structured format, conducive to using it from a computer program. Specifically, GenAI is really good at translating unstructured data to structured data.
(2) Code search and coding assistance are now dramatically better
The other thing that GenAI turns out to be really good at is adapting code samples dynamically. I don’t write code to create presentations or text-to-speech or moviepy every day. Two years ago, I’d have been using Google search, getting Stack Overflow pages, and adapting the code by hand. Now, Google search is giving me ready-to-incorporate code:
Of course, had I been using a Python IDE (rather than a Jupyter notebook), I could have avoided the search step completely: I could have written a comment and gotten the code generated for me. This is hugely helpful, and speeds up development using general-purpose APIs.
(3) GenAI web services are robust and easy-to-consume
Let’s not lose track of the fact that I used the Google Cloud Text-to-Speech service to turn my audio script into actual audio files. Text-to-speech is itself a generative AI model (and another example of the translation superpower). The Google TTS service, which was introduced in 2018 (and presumably improved since then), was one of the first generative AI services in production and made available through an API.
In this article, I used two generative AI models, TTS and Gemini, that are made available as web services. All I had to do was to call their APIs.
(4) It’s easier than ever to provide end-user customizability
I didn’t do this, but you can squint a little and see where things are headed. If I’d wrapped up the presentation creation, audio creation, and movie creation code in services, I could have had a prompt create the function call to invoke these services as well. I could also have put in a request-handling agent that would allow you to use text to change the look-and-feel of the slides or the voice of the person reading the video.
It becomes extremely easy to add open-ended customizability to the software you build.
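As a rough illustration of that idea, suppose an LLM has been prompted to answer a user request with a small JSON “function call”. The program then only needs to look up the named service and invoke it. Everything in this sketch is hypothetical: the service functions, their parameters, and the shape of the JSON are assumptions, not part of the code in this article.

```python
import json


# Hypothetical stand-ins for the slide and audio creation services
def make_slides(theme="dark"):
    return f"slides rendered with the {theme} theme"


def make_audio(voice="en-US-Standard-C"):
    return f"audio narrated by {voice}"


SERVICES = {"make_slides": make_slides, "make_audio": make_audio}


def dispatch(call_json: str) -> str:
    """Invoke the service named in a JSON function call such as an LLM might emit."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in SERVICES:
        raise ValueError(f"unknown service: {name}")
    return SERVICES[name](**call.get("args", {}))


# What a model might return for "make the slides light themed" (assumed format)
print(dispatch('{"name": "make_slides", "args": {"theme": "light"}}'))
```

Restricting the dispatcher to a fixed registry keeps the model from invoking anything other than the services that were deliberately exposed.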
Summary
Inspired by the NotebookLM podcast feature, I set out to build an application that would convert my articles to video talks. The key steps are to prompt an LLM to produce slide contents from the article, use another GenAI model to convert the audio script into audio files, and use existing Python APIs to put them together into a video.
This article illustrates four capabilities that GenAI is unlocking: translation of all kinds, coding assistance, robust web services, and end-user customizability.
I loved being able to easily and quickly create video lectures from my articles. But I’m even more excited about the potential that we keep discovering in this new tool we have in our hands.
Further Reading
- Full code for this article: https://github.com/lakshmanok/lakblogs/blob/main/genai_seminar/create_lecture.ipynb
- The source article that I converted to a video: https://lakshmanok.medium.com/what-goes-into-bronze-silver-and-gold-layers-of-a-medallion-data-architecture-4b6fdfb405fc
- The resulting video: https://youtu.be/jKzmj8-1Y9Q
- Turns out Sascha Heyer wrote up how to use GenAI to generate a podcast, which is the exact NotebookLM use case. His approach is somewhat similar to mine, except that there is no video, just audio. In a cool twist, he uses his own voice as one of the podcast speakers!
- Of course, here’s the video talk of this article created using the technique shown in this video. Ideally, we’d be pulling out code snippets and images from the article, but this is a start …



