Classifier-free guidance (CFG scale) x textual inversion training step count

In this grid you can see a cross section of the CFG scale and the number of training steps used to train a textual inversion embedding.

I explain a bit more about CFG scale below, but basically it controls how much influence the text input (or in the case of textual inversion, the pseudo-text input) has on the generation process.

You can see how the images toward the right most embody the concept that was being trained, whereas on the left, where the CFG scale is lower, the concept is not as distinctly present. You can also see the effect of textual inversion training as you move down the rows. The top rows come early in the training process, when the concept has not yet been defined as strongly; in the bottom rows, after many more training steps, the concept has taken on much more definition.

You can also see, from how some images stay quite similar as you move diagonally from bottom left to top right, that these two factors play a similar role. The CFG scale says how far to go in the direction of the concept, whereas the number of training steps determines how well defined that concept is. The more defined it is, the more it will influence the overall generation process.
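
To make the setup concrete, here is a rough sketch of how such a grid could be produced with the diffusers library. The model id, embedding file names, and `<concept-N>` tokens are hypothetical placeholders, not the actual ones used for the grid above:

```python
# Sketch: render a CFG scale x training-steps grid of images.
# Assumes textual inversion embeddings were saved at several step counts.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

cfg_scales = [3, 7, 11, 15]  # columns, left to right
checkpoints = ["emb-500-steps.bin", "emb-1500-steps.bin",  # rows, top to bottom
               "emb-3000-steps.bin"]

grid = []
for i, ckpt in enumerate(checkpoints):
    token = f"<concept-{i}>"  # a fresh token per checkpoint avoids name collisions
    pipe.load_textual_inversion(ckpt, token=token)
    row = []
    for scale in cfg_scales:
        # Fix the seed so only the two factors under study vary.
        gen = torch.Generator("cuda").manual_seed(42)
        row.append(pipe(f"a photo of {token}",
                        guidance_scale=scale, generator=gen).images[0])
    grid.append(row)
```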

How does it work?

Classifier-free guidance (CFG) refers to the degree to which the generation process is guided by the text input. If you were to generate an image without any text input, the resulting image would be the denoising process left to its own devices. If you also give the model a text input, it acts as directions that guide the generation process to its final destination.

But because an image contains so much more information than its text description, the denoising process is largely going off visual information. Unless you are also giving an input image, this means that the resulting image will tend to be arrived at more arbitrarily and will be less likely to resemble your text description.

To rectify this, a nifty trick is used: the difference is calculated between the result you would get without any text input and the result you would get with one. The CFG scale is how much you want to exaggerate this difference.
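
In code, the trick amounts to a single line. A minimal sketch, where `uncond_pred` and `cond_pred` stand for the model's noise predictions for the same latent without and with the text input (the names are mine, not from any particular library):

```python
def apply_cfg(uncond_pred, cond_pred, cfg_scale):
    # Start at the unconditional prediction and move toward the conditional
    # one; cfg_scale multiplies that difference. A scale of 1.0 lands exactly
    # on the conditional prediction, larger scales overshoot it.
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)
```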

If you were to move from the first point in the model's latent space (the output with no text input) in the direction of the second point (the same output, but with a text input) and continued past this second point, the CFG scale is how much further past it you would go. It acts as a multiplier on this distance: the further you travel beyond the second point, the more your generated images will express your text input.
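
You can see the multiplier behaviour with toy numbers (a made-up 2-D stand-in for the latent space, not real model outputs):

```python
import numpy as np

uncond = np.array([0.2, 0.1])  # first point: prediction with no text input
cond = np.array([0.5, 0.4])    # second point: prediction with the text input

for cfg_scale in [1.0, 4.0, 7.5]:
    guided = uncond + cfg_scale * (cond - uncond)
    print(cfg_scale, guided)
# 1.0 -> [0.5  0.4 ]  (lands exactly on the second point)
# 4.0 -> [1.4  1.3 ]  (continues past it in the same direction)
# 7.5 -> [2.45 2.35]  (even further: the text input dominates more)
```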