
Building Owly, an AI Comic Video Generator for My Son


Owly the AI Comic Story Teller [AI Generated Image]

Every evening, it has become a cherished routine to share bedtime stories with my 4-year-old son Dexie, who absolutely adores them. His collection of books is impressive, but he’s especially captivated when I create tales from scratch. Crafting stories this way also allows me to include moral values I want him to learn, which can be difficult to find in store-bought books. Over time, I’ve honed my skills in crafting personalised narratives that ignite his imagination — from dragons with broken walls to a lonely sky lantern in search of companionship. Recently, I’ve been spinning yarns about fictional superheroes like Slow-Mo Man and Fart-Man, which have become his favourites.

While it’s been a pleasant journey for me, after half a year of nightly storytelling, my creative reservoir is being tested. To keep him engaged with fresh and exciting stories without exhausting myself, I needed a more sustainable solution — an AI that can generate fascinating tales automatically! I named her Owly, after his favourite bird, the owl.

Pookie and the secret door to a magic forest — Generated by AI Comic Generator.

As I began assembling my wish list, it quickly ballooned, driven by my eagerness to test the frontiers of new technology. No ordinary text-based story would do — I envisioned an AI crafting a full-blown comic with up to 10 panels. To amp up the thrill for Dexie, I aimed to customise the comic using characters he knew and loved, like Zelda and Mario, and perhaps even toss in his toys for good measure. Frankly, the personalisation angle emerged from a necessity for visual consistency across the comic strips, which I’ll dive into later. But hold your horses, that’s not all — I also wanted the AI to narrate the story aloud, backed by a fitting soundtrack to set the mood. Tackling this project would be equal parts fun and challenging for me, while Dexie would be treated to a tailor-made, interactive storytelling extravaganza.

Dexie’s toys as comic story’s leading characters [Image by Author]

To meet the aforementioned requirements, I realised I needed to build five marvellous modules:

  1. The Story Script Generator, conjuring up a multi-paragraph story where each paragraph will be transformed into a comic strip section. Plus, it recommends a musical style so a fitting tune can be plucked from my library. To pull this off, I enlisted the mighty OpenAI GPT3.5 Large Language Model (LLM).
  2. The Comic Strip Image Generator, whipping up images for each story segment. Stable Diffusion 2.1 teamed up with Amazon SageMaker JumpStart, SageMaker Studio and Batch Transform to bring this to life.
  3. The Text-to-Speech Module, turning the written tale into an audio narration. Amazon Polly’s neural engine leaped to the rescue.
  4. The Video Maker, weaving the comic strips, audio narration, and music into a self-playing masterpiece. MoviePy was the star of this show.
  5. And finally, The Controller, orchestrating the grand symphony of all four modules, built on the mighty foundation of AWS Batch.

The game plan? Get the Story Script Generator to weave a 7–10 paragraph narrative, with each paragraph morphing into a comic strip section. The Comic Strip Image Generator then generates images for each segment, while the Text-to-Speech Module crafts the audio narration. A melodious tune will be chosen based on the story generator’s recommendation. And finally, the Video Maker combines images, audio narration, and music to create a whimsical video. Dexie is in for a treat with this one-of-a-kind, interactive story-time adventure!

Before delving into the Story Script Generator, let’s first explore the image generator module to provide context for any references to the image generation process. There are many text-to-image AI models available, but I selected the Stable Diffusion 2.1 model for its popularity and the ease of building, fine-tuning, and deploying it using Amazon SageMaker and the broader AWS ecosystem.

Amazon SageMaker Studio is an integrated development environment (IDE) that provides a unified web-based interface for all machine learning (ML) tasks, streamlining data preparation, model building, training, and deployment. This boosts data science team productivity by up to 10x. Within SageMaker Studio, users can seamlessly upload data, create notebooks, train and tune models, adjust experiments, collaborate with their team, and deploy models to production.

Amazon SageMaker JumpStart, a valuable feature within SageMaker Studio, provides an extensive collection of widely used pre-trained AI models. Some models, including Stable Diffusion 2.1 base, can be fine-tuned with your own training set and come with a sample Jupyter Notebook. This lets you experiment with the model quickly and efficiently.

Launching Stable Diffusion 2.1 Notebook on Amazon SageMaker JumpStart [Image by Author]

I navigated to the Stable Diffusion 2.1 base model page and launched the Jupyter notebook by clicking the Open Notebook button.

Stable Diffusion 2.1 Base model card [Image by Author]

In a matter of seconds, Amazon SageMaker Studio presented the example notebook, complete with all of the necessary code to load the text-to-image model from JumpStart, deploy the model, and even fine-tune it for personalised image generation.

Amazon SageMaker Studio IDE [Image by Author]

Quite a few text-to-image models are available, with many tailored to specific styles by their creators. Using the JumpStart API, I filtered and listed all text-to-image models with the filter_value “task == txt2img” and displayed them in a dropdown menu for easy selection.

from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all text-to-image generation models.
filter_value = "task == txt2img"
txt2img_models = list_jumpstart_models(filter=filter_value)

# Display the model IDs in a dropdown to select a model for inference.
model_dropdown = Dropdown(
    options=txt2img_models,
    value="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    description="Select a model",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(model_dropdown)

# Or simply hard-code the model id and version="*",
# e.g. if we want the latest 2.1 base model:
self._model_id, self._model_version = (
    "model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    "*",
)

The model I required was model-txt2img-stabilityai-stable-diffusion-v2-1-base, which allows fine-tuning.

Huge selection of text-to-image models [Image by Author]

In under 5 minutes, using the provided code, I deployed the model to a SageMaker endpoint running a g4dn.2xlarge GPU instance. I swiftly generated my first image from my text prompts, which you can see showcased below.

My image generator crafts a picture of a turtle swimming underwater [Image by Author]
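For reference, querying the deployed endpoint looks roughly like the snippet below: a minimal sketch that assumes an endpoint name from the JumpStart deployment and the default JSON response format, so the exact payload fields and response keys may differ for your model version.

import json
import boto3
import numpy as np
from PIL import Image

# Assumed endpoint name from the JumpStart deployment; replace with your own.
endpoint_name = "jumpstart-example-txt2img-stable-diffusion-v2-1-base"

runtime = boto3.client("sagemaker-runtime")
payload = {"prompt": "a photo of a turtle swimming underwater", "num_inference_steps": 50}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload).encode("utf-8"),
)
body = json.loads(response["Body"].read())

# The JumpStart container typically returns the image as a nested array of RGB values.
image = Image.fromarray(np.array(body["generated_image"], dtype=np.uint8))
image.save("turtle.png")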

Amazon SageMaker Studio streamlines my experimentation and prototyping process, allowing me to swiftly try out various image generation prompts and examine the resulting images directly within the IDE using the file explorer and the preview window. Moreover, I can upload images within the IDE, use the built-in terminal to launch the AWS CLI for uploading and downloading images to and from an S3 bucket, and execute SageMaker batch transform jobs against my models to generate numerous images at once for large-scale testing.

The task of this module is quite straightforward: produce a story script given a story topic and a character name. Generating a story on a particular topic with the GPT3.5 API is incredibly easy.

openai.api_key = self._api_key
prompt = "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks"
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.7,
    max_tokens=2089,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

For instance, using the prompt “Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks. He learns that his bravery and curiosity lead him to experience many exciting things.” GPT3.5 will effortlessly craft a fascinating story on this topic as if it were penned by an expert storyteller, much like the example below.

Bob the penguin had always dreamed of travelling to Europe and visiting famous landmarks. He had heard stories from his friends about the Eiffel Tower in Paris, the Colosseum in Rome, and Big Ben in London. He had grown tired of his routine life in Antarctica and yearned for adventure.

One day, Bob decided to take the leap and start planning his trip. He spent hours researching the best travel routes and the most affordable accommodations. After careful consideration, he decided to start his journey in Paris.

The boat ride was long and tiring, but he was excited to finally be in Europe. He checked into his hotel and immediately set off to see the Eiffel Tower. As he walked through the streets of Paris, he felt a sense of wonder and excitement that he had never felt before.

Over the next few days, he visited famous landmarks like the Louvre Museum, Notre Dame Cathedral, and the Palace of Versailles. He tried new foods and met new people, each experience adding to his adventure.

The story itself is fantastic, but to transform it into comic strips, I need to divide the story into sections and create an image for each. The most reasonable approach would be to convert each paragraph into a section. However, as you can see, the images generated from those paragraphs present some significant challenges.

Our Bob the penguin is portrayed as different characters [AI generated image]
  • Character chaos ensued! Each comic strip depicted Bob as an entirely different character. In the first strip, he’s a penguin doll; in the second, he’s a person’s leg; in the third, a suited-up gentleman; and in the fourth, a man in a blue shirt. This happens because only the first paragraph mentions “Bob the penguin,” while the second refers to him as just “Bob,” and the rest as “he.” Given this scant information, it’s no wonder the image generator portrayed Bob in so many different guises.
  • The scenes also lacked focus. The first comic strip showed a penguin doll sitting on a table instead of an Antarctic-dwelling penguin dreaming of European adventures. A similar issue arose in the second comic strip, which showed someone’s leg soaring high above a city. It appears the image generator interpreted “One day, Bob decided to take the leap” as skydiving from an airplane, while the paragraph’s focus should have been on Bob planning his trip to Europe. Long paragraphs with multiple focal points often confuse the image generator, leading to out-of-context images.

To tackle the out-of-focus scenes and improve character consistency, I refined my prompt to request a concise, one-sentence scene description for each paragraph, wrapped in [] so that it can be programmatically extracted. This allowed me to provide detailed examples and guide the image generator towards more focused and accurate images.

For each section please describe the scene in detail and always include the location in one sentence inside [] with the following format [a photo of character in the location], [a photo of character in front of an object], [a photo of character next to an object], [a photo of a location]

With the updated prompt, here’s the resulting story that was generated.

[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic. He was content with his life there, surrounded by his family and friends. But one day, he decided to take the leap and explore the world beyond the icy continent. He had heard stories of the many beautiful and exotic places around the world, and he wanted to experience them for himself.

[a photo of Bob the penguin reading a book]
Bob the penguin began researching the world, looking at maps and reading up on different countries and cultures. He was particularly drawn to Europe, with its many famous landmarks and sights. He decided that Europe was the place he wanted to visit, so he began to plan his journey.

[a photo of Bob the penguin on a cruise ship]
He began to make the long journey by boat. He was excited and couldn’t wait to get there, and he was determined to make it to Europe. After a few weeks of travelling, he finally arrived at his destination.

[a photo of Bob the penguin at the Eiffel Tower]
Bob the penguin began exploring Europe and was amazed by all the different places he visited. He went to the Eiffel Tower in Paris, the Colosseum in Rome, and the Cliffs of Moher in Ireland. Everywhere he went he was filled with awe and delight.

As you can see, the generated scene descriptions are considerably more focused. They mention a single scene, a location, and/or an activity being performed, often starting with the character’s name. These concise prompts prove to be far more effective for my image generator, as evidenced by the improved images below.
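Extracting the bracketed scene description and its accompanying paragraph programmatically is straightforward. Here is a minimal sketch, assuming the story text follows the format above with one [scene] line preceding each paragraph:

import re

def split_story_into_sections(story_text):
    """Return a list of (scene_prompt, paragraph) pairs from the generated story."""
    sections = []
    # Match a [scene description] followed by the paragraph text that comes after it.
    pattern = re.compile(r"\[(.+?)\]\s*\n(.+?)(?=\n\s*\[|\Z)", re.DOTALL)
    for match in pattern.finditer(story_text):
        scene_prompt = match.group(1).strip()
        paragraph = " ".join(match.group(2).split())
        sections.append((scene_prompt, paragraph))
    return sections

sample = """[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic.

[a photo of Bob the penguin reading a book]
Bob the penguin began researching the world, looking at maps."""

for scene_prompt, paragraph in split_story_into_sections(sample):
    print(scene_prompt, "->", paragraph)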

A more consistent look of our Bob the penguin [AI generated image]

Bob the penguin has made a triumphant return, but he’s still sporting a new look in each comic strip. Because the image generation process treats each image individually, and no information is provided about Bob’s colour, size, or type of penguin, consistency remains elusive.

I previously considered generating a detailed character description as part of the story generation to maintain character consistency across images. However, this approach proved impractical for two reasons:

  • Sometimes it’s nearly impossible to describe a character in enough detail without resorting to an overwhelming amount of text. While there may not be many types of penguins, consider birds in general — with countless shapes, colours, and species such as cockatoos, parrots, canaries, pelicans, and owls, the task becomes daunting.
  • The generated character doesn’t always adhere to the description provided in the prompt. For instance, a prompt describing a green parrot with a red beak might result in an image of a green parrot with a yellow beak instead.

So, despite our best efforts, our penguin pal Bob continues to experience something of an identity crisis.

The answer to our penguin predicament lies in giving the Stable Diffusion model a visual cue of what our penguin character should look like, to influence the image generation process and maintain consistency across all generated images. In the world of Stable Diffusion, this process is called fine-tuning, where you supply a handful (often 5 to 15) of images containing the same object along with a sentence describing it. These images shall henceforth be known as training images.

As it turns out, this added personalisation is not just a solution but also a mighty cool feature for my comic generator. Now, I can use many of Dexie’s toys as the main characters in the stories, such as his festive Christmas penguin breathing new life into Bob the penguin, making the tales even more personalised and relatable for my young but tough audience. So, the quest for consistency turns into a triumph for tailor-made tales!

Dexie’s toy is now Bob the penguin [Image by Author]

During my exhilarating days of experimentation, I discovered a few nuggets of wisdom to share for achieving the best results when fine-tuning the model and reducing the chance of overfitting:

  • Keep the backgrounds in your training images diverse. This way, the model won’t confuse the backdrop with the object, preventing unwanted background cameos in the generated images.
  • Capture the target object from various angles. This provides more visual information, enabling the model to generate the object over a greater range of angles and thus better match the scene.
  • Mix close-ups with full-body shots. This ensures the model doesn’t assume a particular pose is mandatory, granting more flexibility for the generated object to harmonise with the scene.

To perform the Stable Diffusion model fine-tuning, I launched a SageMaker Estimator training job with the Amazon SageMaker Python SDK on an ml.g5.2xlarge GPU instance and pointed the training process at my collection of training images in an S3 bucket. The resulting fine-tuned model file is then saved in s3_output_location. And, with just a few lines of code, the magic began to unfold!

# [Optional] Override default hyperparameters with custom values
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = False
hyperparams["train_text_encoder"] = False

training_job_name = name_from_base(f"stable-diffusion-{self._model_id}-transfer-learning")

# Create SageMaker Estimator instance
sd_estimator = Estimator(
    role=self._aws_role,
    image_uri=image_uri,
    source_dir=source_uri,
    model_uri=model_uri,
    entry_point="transfer_learning.py",  # entry-point file in source_dir, provided by train_source_uri
    instance_count=self._training_instance_count,
    instance_type=self._training_instance_type,
    max_run=360000,
    hyperparameters=hyperparams,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    sagemaker_session=session,
)

# Launch a SageMaker training job by passing the S3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)

To prepare the training set, ensure it contains the following files:

  1. A series of images named instance_image_x.jpg, where x is a number from 1 to N. In this case, N represents the number of images, ideally greater than 10.
  2. A dataset_info.json file that includes a mandatory field called instance_prompt. This field should provide a detailed description of the object, with a unique identifier preceding the object’s name. For example, “a photo of Bob the penguin,” where ‘Bob’ acts as the unique identifier. By using this identifier, you can direct your fine-tuned model to generate either a generic penguin (referred to as “penguin”) or the penguin from your training set (referred to as “Bob the penguin”). Some sources suggest using unique names such as sks or xyz, but I found that it’s not necessary to do so.

The dataset_info.json file can also include an optional field called class_prompt, which offers a general description of the object without the unique identifier (e.g., “a photo of a penguin”). This field is used only when the prior_preservation parameter is set to True; otherwise, it will be disregarded. I’ll discuss it more in the advanced fine-tuning section below.

{
    "instance_prompt": "a photo of bob the penguin",
    "class_prompt": "a photo of a penguin"
}
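For illustration, the data preparation could look something like this minimal sketch using Pillow and boto3; the local folder and bucket names are hypothetical:

import json
from pathlib import Path

import boto3
from PIL import Image

s3 = boto3.client("s3")
bucket = "my-owly-bucket"          # hypothetical bucket name
prefix = "training-data/penguin-images"
source_dir = Path("toy_photos")    # hypothetical folder of raw toy photos

# Resize each photo to 768x768 and name it instance_image_x.jpg, as the training job expects.
for i, photo in enumerate(sorted(source_dir.glob("*.jpg")), start=1):
    image = Image.open(photo).convert("RGB").resize((768, 768))
    local_name = f"instance_image_{i}.jpg"
    image.save(local_name)
    s3.upload_file(local_name, bucket, f"{prefix}/{local_name}")

# Write the mandatory dataset_info.json next to the images.
dataset_info = {"instance_prompt": "a photo of bob the penguin"}
with open("dataset_info.json", "w") as f:
    json.dump(dataset_info, f)
s3.upload_file("dataset_info.json", bucket, f"{prefix}/dataset_info.json")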

After a few test runs with Dexie’s toys, the image generator delivered some truly impressive results. It brought Dexie’s kangaroo magnetic-block creation to life, hopping its way into the virtual world. The generator also masterfully depicted his beloved shower turtle toy swimming underwater, surrounded by a vibrant school of fish. The image generator truly captured the magic of Dexie’s playtime favourites!

Dexie’s toys are brought to life [AI generated image]

Batch Transform against fine-tuned Stable Diffusion model

Since I needed to generate over 100 images for each comic, deploying a SageMaker endpoint (think of it as a REST API) and generating one image at a time wasn’t the most efficient approach. Instead, I opted to run a batch transform against my model, supplying it with text files in an S3 bucket containing the prompts for the images to generate.

I’ll provide more details about this process since I initially struggled with it, and I hope my explanation will save you some time. You’ll need to prepare one text file per image prompt with the following JSON content: {“prompt”: “a photo of Bob the penguin in Antarctica”}. While it seems there is a way to combine multiple inputs into one file using the MultiRecord strategy, I was unable to figure out how it works.
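Preparing those input files is a small loop. Below is a minimal sketch with hypothetical bucket and file names, writing one JSON-per-prompt file into the same batch_transform_input/ prefix that the transform job further down reads from:

import json

import boto3

s3 = boto3.client("s3")
bucket = "my-owly-bucket"   # hypothetical bucket name
job_id = "penguin-images"

scene_prompts = [
    "a photo of Bob the penguin in Antarctica",
    "a photo of Bob the penguin reading a book",
    "a photo of Bob the penguin on a cruise ship",
]

# One file per prompt; each file contains a single JSON object.
for i, prompt in enumerate(scene_prompts):
    body = json.dumps({"prompt": prompt})
    key = f"processing/{job_id}/batch_transform_input/prompt_{i}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))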

Another challenge I encountered was executing a batch transform against my fine-tuned model. You can’t run a batch transform using a transformer object returned by Estimator.transformer(), which usually works in my other projects. Instead, you must first create a SageMaker model object, specifying the S3 location of your fine-tuned model as the model_data. From there, you can create the transformer object from this model object.

def _get_model_uris(self, model_id, model_version, scope):
    # Retrieve the inference docker container uri
    image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope=scope,
        model_id=model_id,
        model_version=model_version,
        instance_type=self._inference_instance_type,
    )
    # Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
    source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope=scope
    )
    if scope == "training":
        # Retrieve the pre-trained model tarball to further fine-tune
        model_uri = model_uris.retrieve(
            model_id=model_id, model_version=model_version, model_scope=scope
        )
    else:
        model_uri = None

    return image_uri, source_uri, model_uri


image_uri, source_uri, model_uri = self._get_model_uris(self._model_id, self._model_version, "inference")

# Get the model artifact location from estimator.model_data, or give an S3 key directly
model_artifact_s3_location = f"s3://{self._bucket}/output-model/{job_id}/{training_job_name}/output/model.tar.gz"

env = {
    "MMS_MAX_RESPONSE_SIZE": "20000000",
}

# Create a SageMaker model from the saved model artifact
sm_model = model.Model(
    model_data=model_artifact_s3_location,
    role=self._aws_role,
    entry_point="inference.py",  # entry-point file in source_dir, provided by deploy_source_uri
    image_uri=image_uri,
    source_dir=source_uri,
    env=env,
)

# Create a transformer object from the model and run the batch transform
transformer = sm_model.transformer(
    instance_count=self._inference_instance_count,
    instance_type=self._inference_instance_type,
    output_path=f"s3://{self._bucket}/processing/{job_id}/output-images",
    accept="application/json",
)
transformer.transform(
    data=f"s3://{self._bucket}/processing/{job_id}/batch_transform_input/",
    content_type="application/json",
)

And with that, my customised image generator is all ready!

Advanced Stable Diffusion model fine-tuning

While it’s not necessary for my comic generator project, I’d like to touch on some advanced fine-tuning techniques involving the max_steps, prior_preservation, and train_text_encoder hyperparameters, in case they come in handy in your projects.

Stable Diffusion fine-tuning is highly prone to overfitting due to the vast difference between the number of training images you provide and the number used to train the base model. For example, you might supply only 10 images of Bob the penguin, while the base model’s training set contains thousands of penguin images. A larger number of images reduces the likelihood of overfitting and of erroneous associations between the target object and other elements.

When prior_preservation is set to True, Stable Diffusion generates a default of x (typically 100) images using the class_prompt provided and combines them with your instance_images during fine-tuning. Alternatively, you can supply these images manually by placing them in the class_data_dir subfolder. In my experience, prior_preservation is often crucial when fine-tuning Stable Diffusion on human faces. When employing prior_preservation, make sure you provide a class_prompt that mentions the most suitable generic name or common object resembling your character. For Bob the penguin, this object is clearly a penguin, so your class prompt would be “a photo of a penguin”. This method can also be used to generate a blend of two characters, which I’ll discuss later.

Another helpful parameter for advanced fine-tuning is train_text_encoder. Set it to True to enable text encoder training during the fine-tuning process. The resulting model will better understand more complex prompts and generate human faces with greater accuracy.

Depending on your specific use case, different hyperparameter values may yield better results. Moreover, you’ll want to adjust the max_steps parameter to control the number of fine-tuning steps; keep in mind that too many fine-tuning steps might lead to overfitting.
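As a concrete illustration, the advanced settings simply override the same hyperparameters shown in the Estimator code earlier. This is only a sketch; the num_class_images name in particular is an assumption, so check the hyperparameters exposed by your JumpStart model version:

# Advanced fine-tuning sketch: enable prior preservation and text-encoder training.
hyperparams["with_prior_preservation"] = True   # mix generated class images in during fine-tuning
hyperparams["num_class_images"] = 100           # assumption: how many class images to generate
hyperparams["train_text_encoder"] = True        # better handling of complex prompts and faces
hyperparams["max_steps"] = 400                  # adjust to control the number of fine-tuning steps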

By utilising Amazon Polly’s Neural Text-to-Speech (NTTS) feature, I was able to create audio narration for each paragraph of the story. The quality of the audio narration is outstanding, as it sounds incredibly natural and human-like, making it an ideal storyteller.

To accommodate a younger audience such as Dexie, I employed the SSML format and used the prosody tag to reduce the speaking speed to 90% of its normal rate, ensuring the content wouldn’t be delivered too quickly for them to follow.

self._pollyClient = boto3.Session(
    region_name=aws_region).client('polly')

# Wrap the text in SSML; the prosody tag reduces the speaking rate to 90%.
ftext = f"<speak><prosody rate='90%'>{text}</prosody></speak>"

response = self._pollyClient.synthesize_speech(VoiceId=self._speaker,
                                               OutputFormat='mp3',
                                               Engine='neural',
                                               Text=ftext,
                                               TextType='ssml')

with open(mp3_path, 'wb') as file:
    file.write(response['AudioStream'].read())

After all the hard work, I used MoviePy — a fantastic Python framework — to magically turn all the photos, audio narration, and music into an awesome mp4 video. Speaking of music, I gave my tech the ability to choose the right soundtrack to match the video’s vibe. How, you ask? Well, I just modified my story script generator to return a music style from a pre-determined list using some clever prompts. How cool is that?

At the beginning of the story please suggest a song style from the following list only which matches the story and put it inside <>. Song style list are action, calm, dramatic, epic, happy and touching.

Once the music style is chosen, the next step is to randomly pick an MP3 track from the relevant folder, which contains a handful of MP3 files. This adds a touch of unpredictability and excitement to the final product.
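MoviePy makes the assembly fairly compact. Below is a minimal sketch rather than the exact Owly implementation: it assumes one image and one narration mp3 per section plus a music folder per song style, and leaves out the Ken Burns panning effect for brevity.

import glob
import random

from moviepy.editor import (AudioFileClip, CompositeAudioClip, ImageClip,
                            concatenate_videoclips)

# Hypothetical inputs: one image and one narration clip per story section.
sections = [("section_1.png", "narration_1.mp3"),
            ("section_2.png", "narration_2.mp3")]
music_path = random.choice(glob.glob("music/happy/*.mp3"))  # folder chosen from the song style

clips = []
for image_file, narration_file in sections:
    narration = AudioFileClip(narration_file)
    clip = (ImageClip(image_file)
            .set_duration(narration.duration + 1.0)  # short pause after each section
            .set_audio(narration)
            .crossfadein(0.5))                       # simple fade-in transition
    clips.append(clip)

video = concatenate_videoclips(clips, method="compose")

# Mix the background music in quietly underneath the narration.
# Assumes the chosen track is at least as long as the video.
music = AudioFileClip(music_path).volumex(0.2).set_duration(video.duration)
video = video.set_audio(CompositeAudioClip([video.audio, music]))

video.write_videofile("story.mp4", fps=24)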

To orchestrate the entire system, I needed a controller module in the form of a Python script that could run each module seamlessly. But, of course, I needed a compute environment to execute this script. I had two options to explore — the first being my preferred option — a serverless architecture with AWS Lambda. This involved several AWS Lambdas paired with SQS. The first Lambda serves as a public API, with API Gateway as the entry point. This API would take in the training image URLs and story topic text, pre-process the data, and drop it into an SQS queue. Another Lambda would pick up the data from the queue and conduct data preparation — think image resizing, creating dataset_info.json, and triggering the next Lambda to call Amazon SageMaker JumpStart to prepare the Stable Diffusion model and execute a SageMaker training job to fine-tune it. Phew, that’s a mouthful. Finally, Amazon EventBridge would be used as an event bus to detect the completion of the training job and trigger the next Lambda to execute a SageMaker Batch Transform against the fine-tuned model to generate the images.

But alas, this option wasn’t possible because an AWS Lambda function has a maximum storage limit of 10GB. When executing the batch transform against the SageMaker model, the SageMaker Python SDK downloads and extracts the model.tar.gz file temporarily into the local /tmp before sending it to the managed system that runs the batch transform. Unfortunately, my model was a whopping 5GB compressed, so the SageMaker Python SDK threw an “Out of disk space” error. For most use cases where the model size is smaller, this would be the best and cleanest solution.

So, I had to resort to my second option — AWS Batch. It worked well, but it did cost a bit more, since the AWS Batch compute instance had to run throughout the entire process — even while the model fine-tuning and the batch transform were executing in a separate compute environment within SageMaker. I could have split the process into several AWS Batch instances and glued them together with Amazon EventBridge and SQS, much as I would have done with the serverless approach. But with AWS Batch’s longer startup time (around 5 minutes), it would have added far too much latency to the overall process. So, I went with the all-in-one AWS Batch option instead.

Owly system architecture

Feast your eyes upon Owly’s majestic architecture diagram! Our adventure kicks off by launching AWS Batch through the AWS Console, equipping it with an S3 folder brimming with training images, a captivating story topic, and a delightful character, all supplied via AWS Batch environment variables.

# Basic settings
JOB_ID = "penguin-images"  # key of the S3 folder containing the training images
STORY_TOPIC = "bob the penguin who wants to travel to Europe"
STORY_CHARACTER = "bob the penguin"

# Advanced settings
TRAIN_TEXT_ENCODER = False
PRIOR_RESERVATION = False
MAX_STEPS = 400
NUM_IMAGE_VARIATIONS = 5
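Inside the container, the controller script simply reads these environment variables. A minimal sketch:

import os

# Read the AWS Batch environment variables that configure a run.
job_id = os.environ["JOB_ID"]
story_topic = os.environ["STORY_TOPIC"]
story_character = os.environ["STORY_CHARACTER"]

train_text_encoder = os.environ.get("TRAIN_TEXT_ENCODER", "False") == "True"
prior_preservation = os.environ.get("PRIOR_RESERVATION", "False") == "True"
max_steps = int(os.environ.get("MAX_STEPS", "400"))
num_image_variations = int(os.environ.get("NUM_IMAGE_VARIATIONS", "5"))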

AWS Batch springs into action, retrieving the training images from the S3 folder specified by JOB_ID, resizing them to 768×768, and creating a dataset_info.json file before placing them in a staging S3 bucket.

Next up, we call the OpenAI GPT3.5 API to whip up an engaging story and a complementary song style in harmony with the chosen topic and character. We then summon Amazon SageMaker JumpStart to unleash the powerful Stable Diffusion 2.1 base model. With the model at our disposal, we initiate a SageMaker training job to fine-tune it on our carefully chosen training images. After a brief 30-minute interlude, we forge image prompts for each story paragraph in the guise of text files, which are then dropped into an S3 bucket as input for the image generation extravaganza. Amazon SageMaker Batch Transform is unleashed on the fine-tuned model to produce these images in a batch, a process that lasts a mere 5 minutes.

Once complete, we enlist the help of Amazon Polly to craft audio narrations for each paragraph in the story, saving them as mp3 files in just 30 seconds. We then randomly pick an mp3 music file from libraries sorted by song style, based on the choice made by our masterful story generator.

The final act sees the resulting images, audio narration mp3s, and music.mp3 files expertly woven together into a video slideshow with the help of MoviePy. Smooth transitions and the Ken Burns effect are added for that extra touch of elegance. The pièce de résistance, the finished video, is then hoisted up to the output S3 bucket, awaiting your eager download!

I must say, I’m rather happy with the results! The story script generator has truly outdone itself, performing much better than anticipated. Almost every story script crafted is not only well-written but also brimming with positive morals, showcasing the awe-inspiring prowess of Large Language Models (LLMs). As for image generation, well, it’s a bit of a mixed bag.

With all of the enhancements I’ve described earlier, one in five stories can be used in the final video right off the bat. The remaining four, however, often have one or two images plagued by common issues.

  • First, we still have inconsistent characters. Sometimes the model conjures up a character that’s slightly different from the original in the training set, often opting for a photorealistic version rather than the toy counterpart. But fear not! Adding a desired photo style to the text prompt, like “A cartoon-style Rex the turtle swimming under the sea,” helps curb this issue. However, it does require manual intervention since certain characters may warrant a photorealistic style.
  • Then there’s the curious case of missing body parts. Occasionally, our generated characters appear with absent limbs or heads. Yikes! To mitigate this, we added negative prompts supported by the Stable Diffusion model, such as “missing limbs, missing head,” encouraging the generation of images that avoid these peculiar attributes (see the sketch after this list).
Rex the turtle in various styles (the bottom right image is in a photorealistic style, the top right image is in a mixed style, the rest are in a toy style) and missing a head (top left image) [AI generated image]
  • Bizarre images emerge when dealing with unusual interactions between objects. Generating images of characters in specific locations typically produces satisfactory results. However, when it comes to illustrating characters interacting with other objects, especially in an unusual way, the outcome is often less than ideal. For instance, attempting to depict Tom the hedgehog milking a cow results in a peculiar blend of hedgehog and cow. Meanwhile, crafting an image of Tom the hedgehog holding a flower bouquet results in a person clutching both a hedgehog and a bouquet of flowers. Regrettably, I have yet to devise a way to remedy this issue, leading me to conclude that it’s simply a limitation of current image generation technology. If the object or activity in the image you’re attempting to generate is highly unusual, the model lacks the prior knowledge, as none of its training data has ever depicted such scenes or activities.
A mix of a hedgehog and a cow (top images) is generated from the “Tom the hedgehog is milking a cow” prompt. A person holding a hedgehog and a flower (bottom left image) is generated from “Tom the hedgehog is holding a flower” [AI generated image]
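For the negative prompts mentioned in the list above, the payload sent to the model gains one extra field. A small sketch, assuming the deployed model version accepts a negative_prompt parameter:

import json

# Sketch of a batch transform input payload that also carries a negative prompt,
# assuming the deployed Stable Diffusion version accepts a "negative_prompt" field.
payload = {
    "prompt": "A cartoon-style Rex the turtle swimming under the sea",
    "negative_prompt": "missing limbs, missing head",
}
print(json.dumps(payload))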

Ultimately, to boost the odds of success, I tweaked my story generator to produce three distinct scene descriptions per paragraph. Furthermore, for each scene, I instructed my image generator to create five image variations. With this approach, I increased the likelihood of obtaining at least one top-notch image out of the fifteen available. Having three different prompt variations also helps to generate entirely different scenes, especially when one scene proves too rare or complex to create. Below is my updated story generation prompt.

"Write me a {max_words} words story a couple of given character and a subject.nPlease break the story down into " 
"seven to 10 short sections with 30 maximum words per section. For every section please describe the scene in "
"details and all the time include the situation in a single sentence inside [] with the next format "
"[a photo of character in the location], [a photo of character in front of an object], "
"[a photo of character next to an object], [a photo of a location]. Please provide three different variations "
"of the scene details separated by |nAt the beginning of the story please suggest song style from the next "
"list only which matches the story and put it inside <>. Song style list are motion, calm, dramatic, epic, "
"blissful and touching."

The only additional cost is a bit of manual intervention after the image generation step is done, where I handpick the best image for each scene and then proceed with the comic generation process. This minor inconvenience aside, I now boast a remarkable success rate of 9 out of 10 in crafting splendid comics!

With the Owly system fully assembled, I decided to put this marvel of technology to the test one fine Saturday afternoon. I generated a handful of stories from his toy collection, ready to enhance bedtime storytelling for Dexie using a nifty portable projector I had purchased. That night, as I saw Dexie’s face light up and his eyes widen with excitement, the comic playing out on his bedroom wall, I knew all my efforts had been worth it.

Dexie is watching the comic on his bedroom wall [Image by Author]

The cherry on top is that it now takes me under two minutes to whip up a new story using photos of his toy characters I’ve already captured. Plus, I can seamlessly incorporate valuable morals I want him to learn from each story, such as not talking to strangers, being brave and adventurous, or being kind and helpful to others. Here are some of the delightful stories generated by this fantastic system.

Super Hedgehog Tom Saves His City From a Dragon — Generated by AI Comic Generator.
Bob the Brave Penguin: Adventures in Europe — Generated by AI Comic Generator.

As a curious tinkerer, I couldn’t help but fiddle with the image generation module to push Stable Diffusion’s boundaries and merge two characters into one magnificent hybrid. I fine-tuned the model with Kwazi Octonaut images, but I threw in a twist by assigning Zelda as both the instance and the class character name. By setting prior_preservation to True, I ensured that Stable Diffusion would “octonaut-ify” Zelda while still keeping her distinct essence intact.

I used a modest max_steps of 400, just enough to preserve Zelda’s original charm without her being entirely consumed by Kwazi the Octonaut’s irresistible allure. Behold the wonderful fusion of Zelda and Kwazi, united as one!

Dexie brimmed with excitement as he witnessed a fusion of his two favourite characters spearheading the action in his bedtime story. He embarked on thrilling adventures, fighting aliens and hunting for hidden treasure chests!

Unfortunately, to protect the IP owner, I cannot show the resulting images.

Generative AI, particularly Large Language Models (LLMs), is here to stay and is set to become a powerful tool not only for software development but for many other industries as well. I’ve experienced the true power of LLMs firsthand in a few projects. Just last year, I built a robotic teddy bear called Ellie, capable of moving its head and engaging in conversations like a real human. While this technology is undeniably potent, it’s essential to exercise caution to ensure the safety and quality of the outputs it generates, as it can be a double-edged sword.

And there you have it, folks! I hope you found this blog interesting. If so, please shower me with your claps. Feel free to connect with me on LinkedIn or check out my other AI endeavours on my Medium profile. Stay tuned, as I’ll be sharing the complete source code in the coming weeks!

Finally, I would like to say thanks to Mike Chambers from AWS, who helped me troubleshoot my fine-tuned Stable Diffusion model batch transform code.
