Textual Inversion

Embedding Training


Back to the main page

This page will provide detailed instructions on conducting your own textural inversion training to create embeddings and use them in image generation. This guide is largely based on the YouTube video Textual Inversion - Make Anything In Stable Diffusion!. In addition, this page will include any additional findings discovered through the use of textural inversion.

It is essential to keep in mind that embeddings are generally only compatible with the models they were trained on. For example, an embedding for an SD v1.x model will only work with other models trained on the SD v1.x architecture. Similarly, an embedding for an SD 2.x model will only work with models trained on the SD 2.x architecture, and will not be compatible with any previous versions of the model.

Requirements

VRAM - 8GB minimum

System RAM - 16GB minimum

Dataset - 5 images minimum

For those with high-end graphics cards with significant VRAM, the following options can be enabled on the webui-user.bat file by setting the set COMMANDLINE_ARGS= to include --no-half (prevents the AI model from using 16-bit floats) and --precision full (ensures the AI uses the most accurate calculations). However, please note that my Nvidia RTX 3080 with 10GB VRAM could not run embedding training with these options enabled. If unsure, it is recommended to leave these settings off.

As a baseline for performance, my PC has a RTX 3080 with 10GB VRAM and 16GB system RAM. 1000 steps of training take approximately 45 minutes. Therefore, those with lower specs can expect longer training times, while those with better graphics cards will see improved performance.

Preparing Data

Properly preparing your data is essential for successful training of a model. In the case of textual inversion, a relatively small dataset can yield decent results. When collecting images, ensure they have a square aspect ratio, are of high quality with minimal compression or aliasing artifacts, and are in either JPEG or PNG format.

The more diverse your dataset, the better your embedding will perform. Include a variety of lighting, angles, composition, and expressions in your images.

Creating the Embedding File

To train your embedding, the first step is to create the placeholder for the training data and the associated prompt. To do this, go to the "Train" tab of the webGUI and select the "Create Embedding" tab. Fill in the following fields:

Name This is the prompt for your embedding and will also be the filename of your embedding.

Initialization Text This field is for inputting a list of prompts to help the AI understand the association with your training data. There is not currently much information on the effectiveness of this field, but a good starting point is to use the Interrogator from img2img and copy the prompts it creates for one of the images from your dataset.

Number of Vectors per Token This refers to the size of the embedding. A larger value allows for more information to be included in the embedding, but will also decrease the number of allowed tokens in the prompt. With stable diffusion, there is a limit of 75 tokens in the prompt. For example, if you use an embedding with 16 vectors, that will leave you with space for 75 - 16 = 59 tokens. Additionally, larger numbers of vectors may require more images for good results.

Once you have completed these fields, click the "Create Embedding" button to add the necessary directories and files for your embedding.

Preprocessing

Before starting the training process, it is necessary to preprocess the dataset to ensure that the AI is able to understand the images and that they are all of the correct dimensions. On the webGUI in the Train tab, navigate to the "Preprocess images" tab. From there, the following fields must be filled out:

Source Directory The directory path to the folder containing the training images must be provided. If unsure of how to do this, a quick internet search should provide the necessary information.

Destination Directory A different path to a new folder for the AI to copy the processed images must be provided.

Width & Height This is where the dimensions for all images are set. Unless there is a specific reason to do so, it is recommended to keep the dimensions at 512x512px.

Create flipped copies It is recommended to check this tickbox as it will increase the dataset size by mirroring all original images, providing the AI with twice as many images to work with. This will also improve the training as the AI will perceive the mirrored images as completely different images.

Split oversized images into two It is recommended to leave this tickbox unchecked. It is intended for use with images that do not have the same aspect ratio as the set width/height, but it is advised to keep all data at a square aspect ratio.

Use BLIP for caption or Deepdanbooru It is recommended to check this tickbox. This will add the Interrogator to the preprocessing process and include prompts associated with each image in the dataset. This will be useful in helping the AI understand the content of the images during training. Use the BLIP interrogator to generate normal sentence prompts and use Deepdanbooru to generate tag-like prompts. It is recommended to use BLIP for Stable Diffusion training and Deepdanbooru for Waifu Diffusion training.

Once all settings have been configured, the "Preprocess" button can be clicked to begin the preprocessing process. This should be a quick process, but the amount of time it takes will depend on the number of images in the dataset.

Training Your Embedding

To begin training your model, navigate to the train tab and fill out the necessary fields as follows:

Embedding: Select your placeholder embedding from the drop-down list in the embeddings folder. If it does not appear, try reloading the webGUI.

Learning rate: This field allows you to specify the intensity with which the AI will attempt to change the embedding data. It is important to keep this value as low as possible, as neural networks are sensitive to changes in the data. A good rule of thumb is to stay below 0.1 and use the default value of 0.005 if you are unsure. Keep in mind that a low learning rate will result in slower changes to the embedding data, requiring more steps to reach the desired result. However, a lower learning rate will often result in more accurate results.

You can also specify multiple learning rates to use throughout the training process. Simply use the format [learning rate]:[step count] and separate values with commas. For example, 0.005:500, 0.001:1000, 0.004 will train at a learning rate of 0.005 for 500 steps, then switch to 0.001 for an additional 1000 steps, and finally use 0.004 for any remaining steps until training is complete. This can be useful for fine-tuning your training and maximizing the efficiency of your training time.

Note: if you see Loss: nan in the console while training, your learning rate was too high and your embedding is effectively "dead". In this case, you will need to start the training process over again.

Dataset directory This is the directory where your dataset is stored. It should be the same as the Destination directory set in the preprocessing stage.

Log directory This is where the program will save test images. It is recommended to leave the default setting and access the images in the textual_inversion folder of the webGUI directory.

Prompt template file This is the path to a file containing prompts for the AI training. It is best to use the default setting of [webGUI directory]\style_filewords.txt for style training and subject_filewords.txt for character training. If you are unsure, it is recommended to not modify the prompt file.

Advanced users may choose to create their own prompt file or edit the existing ones, but it is recommended to avoid this for the first training session.

Width & Height The width and height for the preprocessing stage should be set to 512x512px to maintain consistency.

Max steps The max steps field determines the duration of the AI's training. Please note that training may take several hours, even on high-end systems. Based on my own experimentation, I have found that a max step count of 100,000 may be necessary for accurate embeddings, as lower step counts have yielded poor results.

Save an image to log directory every N steps, 0 to disable This field enables the saving of images at regular intervals, specified by the user. This can be useful for tracking the progress of training. To disable this function, set the value to 0.

Save a copy of embedding to log directory every N steps, 0 to disable Similar to the previous field, this option allows for the regular saving of embeddings at set intervals. This can be useful for reverting to an older version of the embedding if the final results are not satisfactory. To disable this function, set the value to 0.

Save images with embedding in PNG chunks This is a recent feature that creates an image that can also be used as an embedding for the model. Please refer to the main page for more information on this function.

Read parameters from txt2img tab when making previews This setting only affects the appearance of preview images, and allows the specification of the prompt that the model will use when displaying the preview image with the embedding.

Shuffle tags by ',' when creating prompts Tick this box if you'd like the comma seperated prompts to be shuffled each time they are read. I personally don't believe this would help that much, but it could be useful for creating more variety in your dataset.

Drop out tags when creating prompts Similar to the above option, the drop out option will remove some tags from your prompt list randomly to create more variety in your dataset. It's recommended to set this to 0.1.

Choose latent sampling method Not much is currently known about this option as yet, but it has been recommended to select the "deterministic" option for best results.

Once the appropriate settings have been selected, click "Train embedding" to begin the training process. Please note that this may take several hours. Enjoy your new embedding and space heater.


I am still learning a lot when it comes to textual inversion, but one thing to pay attention to is the loss value that is displayed in the output box underneath the images. The goal of the AI is to try and make the loss value as close to 0 as possible. 0 means the model would theoretically be able to generate all images in the dataset 100% accurately. However, this is impossible in the real world and is not an outcome you would want anyway. But it is important to know about the loss value and understand that it is a good estimate of how well your model is training.

Embedding Resources