The Future of Storytelling: Creating Compelling Photo-to-Audio Narratives with Free AI
0. Create a virtual environment
1. Install the required dependencies and get a Hugging Face API token
2. Create a photo-to-text AI function
3. Create an AI-generated story from the text
4. Create a function to generate an audio file from the story
5. Use Streamlit to put everything together
Conclusions

We’re surrounded by AI models and tools: it’s fair to say that those of us following the evolution of LLMs are almost overloaded.

But don’t you feel that we’re left behind? Big companies hide their tools behind black boxes, and we aren’t allowed to understand how they work.

In this article I want to explore with you a multi-modal approach to Artificial Intelligence, using only open-source and free tools. We hack the process and break it down into simple steps: then we learn how to do them ourselves.

The app we’re going to create takes one of our photos as input: the Hugging Face models will generate a text description of the photo and a short story based on it. After that we’ll generate an audio file from that short story. Cool, isn’t it?

Image by the author

Here is the breakdown of what we’re going to do:

0. Create a virtual environment
1. Install the required dependencies and get a Hugging Face API token
2. Create a photo-to-text AI function
3. Create an AI-generated story from the text
4. Create a function to generate an audio file from the story
5. Use Streamlit to put everything together

Learning how to do it yourself has plenty of benefits: you understand how to go through the official documentation, and you can reuse the functions in other contexts too. For instance, to classify your photos based on their descriptions, to create stories from a prompt, or even to create your own audiobooks!

Without any further ado, let’s start.

We don’t need to install many libraries. As a good practice, let’s create a virtual environment to manage this project.

Create a brand new directory (mine is AI-yourVideoStory) and run the venv creation command:

mkdir AI-yourVideoStory
cd AI-yourVideoStory
python3.10 -m venv venv  # version 3.10 is recommended

To activate the virtual environment:

source venv/bin/activate   # for Mac/Linux
venv\Scripts\activate      # for Windows users

With the venv activated, run the following pip installs for the required packages:

pip install transformers      # interaction with LLMs
pip install huggingface_hub   # Hugging Face library for Python
pip install langchain         # powerful toolkit to level up the game
pip install streamlit==1.24.0 # Streamlit 1.24.0, with the new chat widgets

As you can see, we are not installing any local model runtimes: that is because we are only going to use API inference on free Hugging Face models. To do so, you need to be registered on Hugging Face and create an API token (your personal authorization key for API requests to the models).

On the official Hugging Face page for the Inference API we have the instructions for getting the API token.

But what is the 🤗 Hosted Inference API? An API, short for Application Programming Interface, is a set of rules and protocols that allows different applications to communicate with one another, even if they are written in different languages.

So let’s create an account on Hugging Face (if you don’t have one yet) and then we’ll create our first API token.

Register or login at https://huggingface.co/join

After you are logged in, get a User Access or API token from your Hugging Face profile settings.

You should see a token like hf_xxxxx (old tokens are api_XXXXXXXX or api_org_XXXXXXX).

Remember!

If you don’t submit your API token when sending requests to the API, you won’t be able to run inference on your private models.
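A quick way to check that your token works is to call the Hub’s whoami-v2 endpoint with the same Bearer authorization header we will later use for every inference request. This is just an optional sanity check (a small sketch, not part of the app):

import requests

yourHFtoken = "hf_xxxxxxxx"  # paste your token here
response = requests.get(
    "https://huggingface.co/api/whoami-v2",
    headers={"Authorization": f"Bearer {yourHFtoken}"},
)
print(response.status_code)  # 200 means the token is accepted
print(response.json())       # basic info about your account

If you get a 401 instead, double-check that you copied the token correctly.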

Create a new Python file in your main directory and call it app.py.

Now, to verify that everything works fine, let’s import the libraries and run it:

# libraries for AI inferences
from huggingface_hub import InferenceClient
from langchain import HuggingFaceHub
import requests
# Just for Internal usage
import os
import datetime

Save it and then, from your terminal window with the venv activated, run

python app.py

If you get nothing… it means that it’s working fine 😁

Note: we’re also importing LangChain because text generation inference pipelines are not yet supported by the Hugging Face client: 🦜️🔗 LangChain will fix this problem for us.

We’re all set.

In our app.py we can start creating some functions. We will create one function for each task: one for image-to-text, one for text generation, and finally one for text-to-speech.

After the imports, follow along with this code:

yourHFtoken = "hf_xxxxxxxx" #paste here your hf token
# Only HuggingFace Hub Inferences
model_Image2Text = "Salesforce/blip-image-captioning-base"

We set a string variable with our HF token, and we also create a string for the model related to the task (in this case image-to-text).

The Image-to-Text task is located among the Hugging Face Multimodal models.

On the Hugging Face Models page we can filter only the models for the Multimodal/Image-to-Text task: among the most liked ones, let’s take the famous blip-base

Salesforce/blip-image-captioning-base

When you click on it, the model card page will open with plenty of explanations and quick-start code. For the inference, however, we follow the instructions of the Hugging Face API guide and change only the model name: you can simply click on the copy icon as shown

click on the copy icon

Our function now has a model, and we send the request with the following instructions:

def imageToText(url):
    from huggingface_hub import InferenceClient
    client = InferenceClient(token=yourHFtoken)
    model_Image2Text = "Salesforce/blip-image-captioning-base"
    # tasks from huggingface.co/tasks
    text = client.image_to_text(url,
                                model=model_Image2Text)
    print(text)
    return text

Our function will accept a local image file (the url argument) and return a text that describes the image.

Your app.py should look like this:

# libraries for AI inferences
from huggingface_hub import InferenceClient
from langchain import HuggingFaceHub
import requests
# Internal usage
import os
import datetime
import streamlit

yourHFtoken = "hf_xxxxxxxx"  # paste here your HF token

# Only HuggingFace Hub Inferences
model_TextGeneration = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
model_Image2Text = "Salesforce/blip-image-captioning-base"
model_Text2Image = "runwayml/stable-diffusion-v1-5"
model_Summarization = "MBZUAI/LaMini-Flan-T5-248M"
model_Text2Speech = "espnet/kan-bayashi_ljspeech_vits"

def imageToText(url):
    from huggingface_hub import InferenceClient
    client = InferenceClient(token=yourHFtoken)
    model_Image2Text = "Salesforce/blip-image-captioning-base"
    # tasks from huggingface.co/tasks
    text = client.image_to_text(url,
                                model=model_Image2Text)
    print(text)
    return text

basetext = imageToText("./family.jpg")

For the purpose of the test we’re going to use this image (you can find it on the GitHub repo too):

Image by Michelle Raponi from Pixabay

Download the image into the main folder of the project (mine is AI-yourVideoStory), save the Python file and, with the venv activated, run

python app.py

You should get the following:

Screenshot from local run

The photo description retrieved by our imageToText function will be the starting point for our story generation.

I’m telling you: text generation inference with Hugging Face models is no easy feat! First of all, many of the best-performing models have the API disabled; secondly, text generation inference follows different rules depending on the model of your choice.

I tested 20 of them, and finally decided to go for one of the models based mainly on the OpenAssistant LLM. Open Assistant is a project organized by LAION and individuals around the world interested in bringing this technology to everyone. Their motto is:

We believe we can create a revolution.

In the same way that Stable Diffusion helped the world make art and images in new ways, we want to improve the world by providing amazing conversational AI.

The model card is really helpful because it gives us hints for the prompts.

Model card at https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1

You can already see in the right panel the structure expected by the model: if you scroll down the model card to the Quick Start section, it is clearly stated too:

We will create a function, using LangChain as a gateway for the text generation inference, specifying a prompt like the one given above.

# Langchain to HuggingFace Inferences
def LC_TextGeneration(model, basetext):
    from langchain import PromptTemplate, LLMChain
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = yourHFtoken
    llm = HuggingFaceHub(repo_id=model, model_kwargs={"temperature": 0.45, "min_length": 30, "max_length": 250})
    print(f"Running repo: {model}")
    print("Preparing template")
    template = """<human>: write a very short story about {basetext}.
The story must be one paragraph only.
<bot>: """
    prompt = PromptTemplate(template=template, input_variables=["basetext"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    start = datetime.datetime.now()  # not used now but useful
    print("Running chain...")
    story = llm_chain.run(basetext)
    stop = datetime.datetime.now()  # not used now but useful
    elapsed = stop - start
    print(f"Executed in {elapsed}")
    print(story)
    return story

LangChain requires a different method to pass the Hugging Face API token: we use os.environ["HUGGINGFACEHUB_API_TOKEN"] to store it as an environment variable.
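If you prefer not to hardcode the token in the script, you can export it in your shell and read it at startup instead; a small optional sketch (assuming you set the variable beforehand), not part of the original code:

import os

# Read the token from the environment instead of pasting it into the source
yourHFtoken = os.environ.get("HUGGINGFACEHUB_API_TOKEN", "")
if not yourHFtoken:
    raise RuntimeError("Please set the HUGGINGFACEHUB_API_TOKEN environment variable first")

With this in place, HuggingFaceHub picks up the same variable automatically and the rest of the code stays unchanged.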

The function accepts as positional arguments the model (RedPajama-INCITE-Chat-3B-v1) and the basetext (the text to use to generate the short story).

As you can see, our template for the prompt follows the instructions of the Quick Start section: we only add the basetext variable to include in the base instruction the details of the task we want accomplished.

template = """<human>: write a very short story about {basetext}.
The story must be one paragraph only.
<bot>: """

To give you an idea of the generation time, there are some console print instructions (like small checkpoints to verify the status of the generation).

Now that the function is ready, let’s give it the arguments:

model_TextGeneration="togethercomputer/RedPajama-INCITE-Chat-3B-v1"
# Variable and Inference to HF with LangChain
basetext = "a family walking on the beach at sunset"
mystory = LC_TextGeneration(model_TextGeneration, basetext)
print("="*50)
finalstory = mystory.split('nn')[0]
print(finalstory)

Save the Python file and, with the venv activated, run

python app.py

You should get the following:

You may have noticed that the generated story is much longer. That is why we split it into paragraphs and take only the first one (as illustrated right after the example output below): it’s good enough for our story.

The sun was setting over the horizon, casting a warm
orange glow over the beach. The family was walking along
the shore, enjoying the last few moments of the day.
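For clarity, this is all the paragraph split does: a tiny illustration on a made-up string (not part of the app):

# Hypothetical example: keep only the text before the first blank line
raw_story = "First paragraph of the story.\n\nA second paragraph we do not need."
first_paragraph = raw_story.split('\n\n')[0]
print(first_paragraph)  # -> First paragraph of the story.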

The generation was lightning fast, wasn’t it!? But remember: it might take quite a bit longer, depending on how much traffic the Hugging Face Hub inference servers have.
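When a model is cold, the free Inference API may even answer with an HTTP 503 status and an estimated loading time before it starts serving requests. Below is a hedged sketch of how you could wait and retry in that case; the query_with_retry helper is hypothetical and the exact error payload may vary:

import time
import requests

def query_with_retry(api_url, headers, payload, max_retries=5):
    # Retry while the hosted model is still loading (HTTP 503)
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload)
        if response.status_code != 503:
            return response
        # The API usually reports an estimated loading time in seconds
        wait = response.json().get("estimated_time", 20)
        print(f"Model loading, retrying in {wait:.0f}s (attempt {attempt + 1})")
        time.sleep(wait)
    return response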

We’re at the last function. Now that we have the story, we can use a text-to-speech model to generate the audio for us. On the Hugging Face Models section, scroll down the left panel to filter the Audio tasks with the Text-to-Speech option.

You can experiment with any of the trending ones: I selected espnet/kan-bayashi_ljspeech_vits because I really liked the natural voice.

Hugging Face text-to-speech tasks Models

On the model card page there aren’t many indications on how to use the model: what can we do?

We have two options! Let’s have a look at them:

  1. An Inference API is hosted
  2. You can use the fast Inference API for prototyping
model card for espnet/kan-bayashi_ljspeech_vits

When you click on the Deploy/Inference API button you will get a code snippet ready for your use: if you are logged in with your HF account, the available API access tokens will be shown to you and copied together with the code (no effort at all…!)

code snippet when not logged in
code snippet when logged in: your token will be copied to the clipboard together with the code

So let’s create a function for our text-to-speech generation using the requests method: we’ll include an f-string with our HF token in the headers (if it is not included, the API request will be denied).

def text2speech(text):
    import requests
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {yourHFtoken}"}
    payloads = {
        "inputs": text
    }
    response = requests.post(API_URL, headers=headers, json=payloads)
    with open('audio.flac', 'wb') as file:
        file.write(response.content)

As you can see it is quite simple: we get the response to our request and we write it into a .flac audio file (a slightly more defensive variant is sketched after the FLAC note below). But what is FLAC?

FLAC (Free Lossless Audio Codec) is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation […]. Digital audio compressed by FLAC’s algorithm can typically be reduced to between 50 and 70 percent of its original size and decompresses to an identical copy of the original audio data.

Source: wikipedia
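Before writing the bytes to disk, it is also worth checking that the request actually succeeded, since on errors the Inference API returns a JSON message instead of audio. A small defensive variant of the same function (just a sketch, under that assumption):

def text2speech_safe(text):
    # Same call as above, but fail loudly instead of saving an error body as audio
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {yourHFtoken}"}
    response = requests.post(API_URL, headers=headers, json={"inputs": text})
    if response.status_code != 200:
        raise RuntimeError(f"TTS request failed ({response.status_code}): {response.text}")
    with open('audio.flac', 'wb') as file:
        file.write(response.content)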

Let’s add our text argument:

mytext = "The sun was setting over the horizon, casting a warm orange glow over the beach. The family was walking along the shore, having fun with the previous few moments of the day."
text2speech(mytext)

Save the Python file and, with the venv activated, run

python app.py

You should get the following:

In your project folder you now have an audio.flac file: you can play it with VLC media player.

It’s time to create our Streamlit app to give a beautiful UI to our project.

You can find all the images and audio files, together with the final code, in my dedicated GitHub repository.

In this section I’ll briefly explain the code and how to run it.

Streamlit is a library to build data web apps without having to know any front-end technologies like HTML and CSS. If you want to know more, check the clear documentation here. Version 1.24.0 includes new widgets dedicated to chat interfaces: very useful with LLMs (but not here 😁).
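If you have never used Streamlit, this tiny warm-up example (not part of the project, the file name is just a placeholder) shows the basic idiom: you write plain Python from top to bottom, and Streamlit re-runs the script on every interaction.

# hello_streamlit.py, run it with: streamlit run hello_streamlit.py
import streamlit as st

st.title("Hello Streamlit")
name = st.text_input("What's your name?")
if name:
    st.write(f"Nice to meet you, {name}!")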

Create a new file called st-app.py with this code:

# Python app for HuggingFace Inferences with Streamlit
# libraries for AI inferences
from huggingface_hub import InferenceClient
from langchain import HuggingFaceHub
import requests
# Internal usage
import os
import datetime
# STREAMLIT
import streamlit as st

yourHFtoken = "hf_xxxxxxxxxxxxxxxxxx"  # your HF token here

# Only HuggingFace Hub Inferences
model_TextGeneration = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
model_Image2Text = "Salesforce/blip-image-captioning-base"
model_Text2Speech = "espnet/kan-bayashi_ljspeech_vits"

def imageToText(url):
    from huggingface_hub import InferenceClient
    client = InferenceClient(token=yourHFtoken)
    model_Image2Text = "Salesforce/blip-image-captioning-base"
    # tasks from huggingface.co/tasks
    text = client.image_to_text(url,
                                model=model_Image2Text)
    print(text)
    return text

def text2speech(text):
    import requests
    API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
    headers = {"Authorization": f"Bearer {yourHFtoken}"}
    payloads = {
        "inputs": text
    }
    response = requests.post(API_URL, headers=headers, json=payloads)
    with open('audiostory.flac', 'wb') as file:
        file.write(response.content)

# Langchain to HuggingFace Inferences
def LC_TextGeneration(model, basetext):
    from langchain import PromptTemplate, LLMChain
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = yourHFtoken
    llm = HuggingFaceHub(repo_id=model, model_kwargs={"temperature": 0.45, "min_length": 30, "max_length": 250})
    print(f"Running repo: {model}")
    print("Preparing template")
    template = """<human>: write a very short story about {basetext}.
The story must be one paragraph only.
<bot>: """
    prompt = PromptTemplate(template=template, input_variables=["basetext"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    start = datetime.datetime.now()  # not used now but useful
    print("Running chain...")
    story = llm_chain.run(basetext)
    stop = datetime.datetime.now()  # not used now but useful
    elapsed = stop - start
    print(f"Executed in {elapsed}")
    print(story)
    return story

def main():
    st.set_page_config(page_title="Your Photo Story Creator App", page_icon='📱')
    st.header("Turn your Photos into Amazing Audio Stories")
    st.image('banner.png', use_column_width=True)
    st.markdown("1. Select a photo from your pc\n2. AI detects the photo description\n3. AI writes a story about the photo\n4. AI generates an audio file of the story")

    # test with Image by Michelle Raponi from Pixabay
    image_file = st.file_uploader("Select an image...", type='jpg')
    if image_file is not None:
        print(image_file)
        bytes_data = image_file.getvalue()
        with open(image_file.name, "wb") as file:
            file.write(bytes_data)
        st.image(image_file, caption="Uploaded Image...",
                 use_column_width=True)

        st.warning("Generating Photo description", icon="🤖")
        basetext = imageToText(image_file)
        with st.expander("Photo Description"):
            st.write(basetext)
        st.warning("Generating Photo Story", icon="🤖")
        mystory = LC_TextGeneration(model_TextGeneration, basetext)
        finalstory = mystory.split('\n\n')[0]
        with st.expander("Photo Story"):
            st.write(finalstory)
        st.warning("Generating Audio Story", icon="🤖")
        text2speech(finalstory)

        st.audio('audiostory.flac')
        st.success("Audio Story completed!")

if __name__ == '__main__':
    main()

You may have noticed that the first part is exactly the same: we need those functions, and here we link them to the graphical interface. The only thing we removed is the test variables.

The banner image is in my repository: you can download it from there.

The whole Streamlit app is encapsulated inside the main() function, which we call at the end with the well-known instruction:

if __name__ == '__main__':
    main()

Save the Python file and, with the venv activated, run

python -m streamlit run st-app.py

You should get the following:
