Japanese Stable Diffusion

Max Shakespeare, Kei Sawada

Open in Hugging Face Spaces | Open in Colab

Stable Diffusion, developed by CompVis, Stability AI, and LAION, has generated a great deal of interest because of its ability to generate high-quality images simply from text prompts. Stable Diffusion mainly uses the English subset LAION2B-en of the LAION-5B dataset for its training data and, as a result, requires English text prompts; the images it produces also tend to be oriented toward Western culture.

rinna Co., Ltd. has developed a Japanese-specific text-to-image model named “Japanese Stable Diffusion” by fine-tuning Stable Diffusion on Japanese-captioned images. Japanese Stable Diffusion accepts Japanese text prompts and generates images that reflect the culture of the Japanese-speaking world, which may be difficult to express through translation.

In this blog, we will discuss the background of the development of Japanese Stable Diffusion and its learning methodology.
Japanese Stable Diffusion is available on Hugging Face and GitHub. The code is based on 🧨 Diffusers.



Stable Diffusion

Recently, diffusion models have been reported to be very effective in artificial synthesis, even more so than GANs (Generative Adversarial Networks) for images. Hugging Face explains how diffusion models work in its articles on the topic.

Generally, a text-to-image model consists of a text encoder that interprets text and a generative model that generates an image from the text encoder's output.

Stable Diffusion uses CLIP, the language-image pre-training model from OpenAI, as its text encoder and a latent diffusion model, an improved version of the diffusion model, as the generative model. Stable Diffusion was trained mainly on the English subset of LAION-5B and can generate high-quality images simply by entering text prompts. Along with its high performance, Stable Diffusion is also easy to use, with inference running on a GPU with about 10GB of VRAM.

Figure: the Stable Diffusion pipeline, from “Stable Diffusion with 🧨 Diffusers”



Japanese Stable Diffusion



Why do we need Japanese Stable Diffusion?

Stable Diffusion is a very powerful text-to-image model, not only in terms of quality but also in terms of computational cost. Because Stable Diffusion was trained on an English dataset, non-English prompts must first be translated into English. Surprisingly, Stable Diffusion can sometimes generate proper images even from non-English prompts.

So, why do we need a language-specific Stable Diffusion? The answer is that we want a text-to-image model that can understand Japanese culture, identity, and unique expressions, including slang. For example, one of the more common Japanese terms reinterpreted from the English word “businessman” is “salary man,” which we most often picture as a man wearing a suit. Stable Diffusion cannot understand such uniquely Japanese words accurately because Japanese was not its target language.

Figure: “salary man, oil painting” from the original Stable Diffusion

This is why we made a language-specific version of Stable Diffusion. Compared to the original Stable Diffusion, Japanese Stable Diffusion can:

  • Generate Japanese-style images
  • Understand Japanese words adapted from English
  • Understand uniquely Japanese onomatopoeia
  • Understand Japanese proper nouns



Training Data

We used roughly 100 million images with Japanese captions, including the Japanese subset of LAION-5B. In addition, to remove low-quality samples, we used japanese-cloob-vit-b-16, published by rinna Co., Ltd., as a preprocessing step to discard samples whose image–text similarity scores were lower than a certain threshold.
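This filtering step can be sketched as follows. Note that `filter_by_score` and the toy `score_fn` below are illustrative stand-ins, not rinna's actual preprocessing code; in practice the score would be the image–text similarity computed by japanese-cloob-vit-b-16.

```python
def filter_by_score(samples, score_fn, threshold):
    """Keep only the (image, caption) pairs whose image-text similarity
    score reaches the threshold; low-scoring pairs are treated as noise."""
    return [sample for sample in samples if score_fn(sample) >= threshold]

# Toy stand-in for the japanese-cloob-vit-b-16 similarity score.
toy_scores = {"img_a": 0.31, "img_b": 0.12}
score_fn = lambda sample: toy_scores[sample[0]]

samples = [("img_a", "犬の写真"), ("img_b", "全く無関係なキャプション")]
kept = filter_by_score(samples, score_fn, threshold=0.2)
print(kept)  # only the well-matched pair survives
```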



Training Details

The biggest challenge in making a Japanese-specific text-to-image model is the size of the dataset. Non-English datasets are much smaller than English datasets, and this causes performance degradation in deep-learning-based models. The dataset used to train Japanese Stable Diffusion is about one-twentieth the size of the dataset on which Stable Diffusion was trained. To make a good model from such a small dataset, we fine-tuned the powerful Stable Diffusion model trained on the English dataset, rather than training a text-to-image model from scratch.

To make a language-specific text-to-image model, we did not simply fine-tune the whole model but applied two training stages, following the idea of PITI.



1st stage: Train a Japanese-specific text encoder

In the 1st stage, the latent diffusion model is fixed, and we replace the English text encoder with a Japanese-specific text encoder, which is then trained. At this stage, our Japanese SentencePiece tokenizer is used as the tokenizer. If the CLIP tokenizer were used as-is, Japanese text would be tokenized into byte-level fragments, which makes it difficult to learn token dependencies, and the number of tokens becomes unnecessarily large. For example, tokenizing “サラリーマン 油絵” produces ['ãĤ', 'µ', 'ãĥ©', 'ãĥª', 'ãĥ¼ãĥ', 'ŀ', 'ãĥ³', 'æ', '²', '¹', 'çµ', 'µ'], which is uninterpretable.

from transformers import CLIPTokenizer

# CLIP's tokenizer falls back to byte-level BPE for Japanese text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text = "サラリーマン 油絵"
tokens = tokenizer(text, add_special_tokens=False)['input_ids']
print("tokens:", tokenizer.convert_ids_to_tokens(tokens))
print("decoded text:", tokenizer.decode(tokens))

On the other hand, with our Japanese tokenizer, the prompt is split into interpretable tokens and the number of tokens is reduced. For example, “サラリーマン 油絵” is tokenized as ['▁', 'サラリーマン', '▁', '油', '絵'], which is correctly tokenized Japanese.

from transformers import T5Tokenizer

# Our Japanese SentencePiece tokenizer, hosted alongside the model weights.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-stable-diffusion", subfolder="tokenizer", use_auth_token=True)
tokenizer.do_lower_case = True
text = "サラリーマン 油絵"
tokens = tokenizer(text, add_special_tokens=False)['input_ids']
print("tokens:", tokenizer.convert_ids_to_tokens(tokens))
print("decoded text:", tokenizer.decode(tokens))

This stage enables the model to understand Japanese prompts, but it does not yet output Japanese-style images, because the latent diffusion model has not been modified at all. In other words, the Japanese word “salary man” is interpreted as the English word “businessman,” but the generated result is a businessman with a Western face, as shown below.

Figure: “サラリーマン 油絵”, meaning exactly “salary man, oil painting”, from the 1st-stage Japanese Stable Diffusion

Therefore, in the 2nd stage, we train the model to output more Japanese-style images.



2nd stage: Fine-tune the text encoder and the latent diffusion model jointly

In the 2nd stage, we train both the text encoder and the latent diffusion model to generate Japanese-style images. This stage is essential for making the model more language-specific. Afterwards, the model can finally generate a businessman with a Japanese face, as shown in the image below.

Figure: “サラリーマン 油絵”, meaning exactly “salary man, oil painting”, from the 2nd-stage Japanese Stable Diffusion
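Which components receive gradients in each of the two stages can be sketched as follows. The `Param` class and the toy module sizes are stand-ins for the actual PyTorch modules of the pipeline; this illustrates the schedule only, and is not rinna's training code.

```python
# Sketch of the two-stage training schedule: which components are trainable.
class Param:
    """Stand-in for a framework parameter with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

def make_module(num_params):
    return [Param() for _ in range(num_params)]

text_encoder = make_module(4)  # the new Japanese-specific text encoder
unet = make_module(8)          # the latent diffusion model's denoising U-Net
vae = make_module(6)           # the image autoencoder

def set_trainable(module, flag):
    for p in module:
        p.requires_grad = flag

def configure_stage(stage):
    # Stage 1: the latent diffusion model is fixed; only the new Japanese
    # text encoder is trained. Stage 2: the text encoder and the latent
    # diffusion model are fine-tuned jointly. The VAE stays frozen.
    set_trainable(text_encoder, True)
    set_trainable(unet, stage == 2)
    set_trainable(vae, False)

configure_stage(1)
print("stage 1 trains U-Net:", any(p.requires_grad for p in unet))  # False
configure_stage(2)
print("stage 2 trains U-Net:", any(p.requires_grad for p in unet))  # True
```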



rinna’s Open Strategy

Numerous research institutes are releasing their research results based on the idea of the democratization of AI, aiming for a world where anyone can easily use AI. In particular, large pre-trained models built on large-scale training data have recently become mainstream, and there are concerns about a monopoly on high-performance AI by research institutes with abundant computational resources. Still, fortunately, many pre-trained models have been released and are contributing to the development of AI technology. However, pre-trained models for text often target English, the world's most popular language. For a world in which anyone can easily use AI, we believe it is desirable to be able to use state-of-the-art AI in languages other than English.

Therefore, rinna Co., Ltd. has released GPT, BERT, and CLIP models specialized for Japanese, and has now also released Japanese Stable Diffusion. By releasing pre-trained models specialized for Japanese, we hope to build AI that is not biased toward the cultures of the English-speaking world but also incorporates the culture of the Japanese-speaking world. Making these models available to everyone will help democratize AI that preserves Japanese cultural identity.



What’s Next?

Compared to Stable Diffusion, Japanese Stable Diffusion is not as versatile and still has some accuracy issues. However, through the development and release of Japanese Stable Diffusion, we hope to convey to the research community the importance and potential of language-specific model development.

rinna Co., Ltd. has released GPT and BERT models for Japanese text, and CLIP, CLOOB, and Japanese Stable Diffusion models for Japanese text and images. We will continue to improve these models, and next we will consider releasing models based on self-supervised learning specialized for Japanese speech.


