Using doodles and diffusers to expand your imagination
Exploring Stable Diffusion’s img2img transformation with a Colab notebook
What if I could turn doodles into photorealistic things? What if I could draw a slightly warped apple and have it turn into a photorealistic apple, still slightly warped? That’s the inquiry that got me playing with doodles and diffusers for image transformations. I want a drawing tool that can transform my doodles at the press of a button.
In this post you’ll find:
- An intro on image transformations and diffusers, specifically Stable Diffusion’s img2img
- A tutorial on exploring Stable Diffusion with an example Colab notebook, using Hugging Face’s diffusers library
- An amazing drawing by my nephew and a ton of sea creatures we created with it
I love doodling and making bad drawings. It’s kind of liberating. A bad drawing doesn’t have to be great, it can be whatever and that’s wonderful. I love the exercise in imagination and flexibility with concepts. I can combine things, morph my body, draw out emotions. I talk about all that in How to Draw Your Infinite Self. Now let’s see what happens if I combine this with image AIs.
Diffusers and image generation
How might a computer see a doodle, and transform it? I found methods for identifying objects in a doodle, but that’s more doodle-to-text (instead of creating an image). There are also tools for getting text from an image… so I could get a caption from an image and then use that text to generate another image. I tried this, and the result doesn’t feel connected to the original. The closest thing I found was taking a rough sketch and then using a diffuser with an input image and a text prompt to generate something more realistic. That specific process is img2img diffusion.
“How img2img Diffusion Works” offers a great high-level overview:
In a normal text-to-image run of Stable Diffusion, we feed the model some random noise. The model assumes, though, that this input is actually a piece of artwork that just had a bunch of noise added.
So, using the text prompt as a source of “hints” for what the (supposed) original looked like, it does its best to recover an image from the static. It does this over a bunch of steps (e.g., 50), gradually removing a little more noise each time.
With img2img, we do actually bury a real image (the one you provide) under a bunch of noise. And this causes Stable Diffusion to “recover” something that looks much closer to the one you supplied.
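To get a feel for that “bury it under noise” idea, here’s a tiny runnable sketch (not the real pipeline math, just a simplified linear mix with NumPy): the higher the strength, the more noise gets mixed into the starting point, so less of the original survives the recovery.

# Toy illustration of img2img's starting point (not the actual diffusion schedule):
# mix the input image with noise in proportion to strength.
import numpy as np

rng = np.random.default_rng(0)
doodle = rng.random((64, 64, 3))  # stand-in for your doodle, values in [0, 1]
for strength in (0.3, 0.6, 0.9):
    noise = rng.standard_normal(doodle.shape)
    noised_start = (1 - strength) * doodle + strength * noise  # simplified mixing
    drift = float(np.abs(noised_start - doodle).mean())
    print(f"strength {strength}: starting point differs from the doodle by {drift:.2f} on average")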
Getting started with diffusers
Diffusion models have become super popular lately for image generation, examples being DALL-E and Stable Diffusion. Hugging Face has a library that makes it simpler to get started with diffusers and try out different models. I recommend this intro to diffusers and the Hugging Face library to get started; it covers the technique and walks through an example.
In short, the library offers Pipelines, which are made up of a Model and a Scheduler. The Model is some AI architecture, and you can load pre-trained versions of it. Earlier I wrote about how pre-trained models are fantastic because you don’t need a ton of data or computing resources to build powerful apps. The Scheduler defines how the diffusion algorithm adds noise and then removes it during inference. For img2img generation we’re focused on inference, i.e. using that denoising process to arrive at an image from noise.
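As a quick sketch of how those pieces fit together, here’s a minimal example that just loads a small pre-trained pipeline and peeks at its parts (this assumes the diffusers library is installed; the model id here is one of the standard examples from its docs):

# Peek at the Pipeline = Model + Scheduler structure in diffusers.
from diffusers import DDPMPipeline

pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")  # downloads pre-trained weights
print(type(pipe.unet).__name__)       # the Model: a U-Net that predicts the noise
print(type(pipe.scheduler).__name__)  # the Scheduler: controls how noise is added and removed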
We’re going to use the diffusers library to run Stable Diffusion’s img2img, which generates an image given an input image (doodle) and a text prompt. If there’s a single notebook I recommend running through from this post, it’s “Image2Image Pipeline for Stable Diffusion using 🧨 Diffusers.” This tutorial runs through an example of loading an image and then running img2img at different strengths, which is exactly what I was looking for.
If I can run image transformations through an existing app, what’s the point of rebuilding it in Colab? Part of it is getting a better feel for how the pieces work. Part of it is being able to programmatically tweak the parameters, integrate the feature with something else, or build your own app! The following generates a single image:
# prompt, init_img, generator, and pipe come from the notebook's setup cells
with autocast("cuda"):  # autocast is imported from torch, for fp16 inference on the GPU
    image = pipe(prompt=prompt, init_image=init_img, strength=0.9, guidance_scale=7.5, generator=generator).images[0]
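The pipeline returns regular PIL images, so you can display the result inline or save it out (the filename here is just an example):

image.save("transformed_doodle.png")  # image is a PIL.Image, so .save() writes it to disk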
Check out the docs for more about the parameters you can tweak. I’ll highlight a few:
- prompt: the text input that will be used to generate the images. The pipeline tries to infer an image matching this text as it works back from the noise.
- init_image: your initial image input
- strength: how strong the image transformation is. If it’s low, you’ll pretty much get your input image back. As it gets higher, the output drifts further, until it ignores much of the input. (See the rough sketch below for one way to think about it.)
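One way I think about strength (this is my reading of the pipeline’s behavior, so treat it as a rough mental model rather than its exact internals): it scales how many of the denoising steps actually run on top of your image, so low strength barely disturbs the input while high strength gives the model room to wander.

# Rough mental model: strength scales how many denoising steps run on the init image.
# (An assumption about the pipeline's behavior, not a guaranteed implementation detail.)
num_inference_steps = 50
for strength in (0.1, 0.5, 0.9):
    steps_run = int(num_inference_steps * strength)
    print(f"strength {strength}: roughly {steps_run} of {num_inference_steps} steps run")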
Let’s doodle and diffuse!
So now you can imagine all sorts of possibilities, iterating over prompts and images and strengths and seeing what happens! Let’s do just that. You can follow along in this post, and then check out the notebook for the details. ⚠️ Note: if you try this out, always check the output before sharing it rather than sharing it live. Some of the creatures were too spooky!
My 5 year old nephew is an amazing artist and loves monsters. Here’s one that he drew, it’s incredible:
Generally the image inputs work better if you add a background. So with my nephew’s art direction we added a gradient background and some more underwater features.
What would happen if we imagined other creatures from this drawing? We’ll use img2img to find out.
Now we pull in the pipeline from the Hugging Face library to use Stable Diffusion (docs):
import torch
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda"
model_path = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_path,
    revision="fp16",            # half-precision weights so it fits in Colab GPU memory
    torch_dtype=torch.float16,
    use_auth_token=True         # needs a Hugging Face token with access to the model
)
pipe = pipe.to(device)
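One piece the snippets assume: the doodle itself has to be loaded as a PIL image at a size the model likes. A minimal sketch, where the filename and seed are just placeholders for whatever you use:

from PIL import Image

# Load the doodle (placeholder filename) and resize to the 512x512 that Stable Diffusion v1 expects.
init_img = Image.open("doodle.png").convert("RGB").resize((512, 512))
img = init_img  # the loops further down refer to the same image as img

# A seeded generator keeps runs reproducible; the seed value is arbitrary.
generator = torch.Generator(device).manual_seed(1024)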
I describe this scene in my prompt: “surreal underwater, a big colorful happy fish and a pterodactyl flying above, other sea creatures and coral in the water, trending on artstation”. Often prompts include an unintuitive phrase like “trending on artstation,” which I wouldn’t think to add when describing what I’m looking for. So how might one find useful phrases for prompts?
Prompt design is still a fresh space, with new tools and techniques being developed; more on that in a later post. For now you can try something descriptive, see what comes out, and then come back later to tweak. It’s a rich area to explore!
Here I create a function that takes an input image and a strength and generates an image:
def generate_image_by_strength(_strength, _img):
    with autocast("cuda"):
        image = pipe(prompt=prompt, init_image=_img, strength=_strength, guidance_scale=7.5, generator=generator).images[0]
    display(f"Image at strength {_strength}")
    display(image)
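Then generating one transformation is a single call (using the img loaded above):

generate_image_by_strength(0.5, img)  # transform the doodle at a middling strength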
Using strength 0.5 creates:
Using strength 0.7 creates the following; you can see the difference that increasing the strength makes:
And if I go higher to 0.9, you can see the original image feels less represented:
Now let’s generate images where we slowly increase the strength from 0 to 1. Here’s what that looks like:
interval = 1 / 30
strength = 0
while strength <= 1:
    display(strength)  # show which strength this frame corresponds to
    generate_image_by_strength(strength, img)
    strength += interval
    strength = round(strength, 2)
Using ffmpeg we can turn these frames into a movie, where the strength starts low and gradually increases over time. You can see how the variation increases drastically as the strength gets higher. Some of the frames look pretty freaky, so heads up:
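I won’t reproduce the exact command from my notebook here, but as a rough sketch: if each frame of the sweep is saved as a numbered PNG (say frame_000.png, frame_001.png, and so on, which is an assumption about how you save them), ffmpeg can stitch them into a video like this:

# Stitch numbered frames into a movie with ffmpeg (frame filenames are assumed).
import subprocess

subprocess.run([
    "ffmpeg",
    "-framerate", "4",       # a few frames per second keeps each strength step visible
    "-i", "frame_%03d.png",  # numbered input frames, in order of increasing strength
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",   # widely compatible pixel format
    "strength_sweep.mp4",
], check=True)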
After scanning the images across strengths, I liked 0.6 the best. It felt like it had a good balance of the original image and the transformation. So taking that, I could generate a bunch of images and see how they look.
index = 0
strength = 0.60
while index < 30:
    # generate 30 variations at the same strength
    generate_image_by_strength(strength, img)
    index += 1
You can try this on your own images: start with this notebook, bring your own image, play with the different parameters, and then program in new possibilities. And you can see my explorations here.
What’s next: Coming back to the original inspiration, I’d love to integrate this into a drawing tool. How might tweaking the prompt and strength change what comes out of a given input image? What are different ways to imagine interfaces for this? And then there’s keeping the prompt and strength the same and tweaking the input image. Lots of areas to look into!
Additional references: Sketch generative models, diffuse the rest, and Stable Diffusion Web UI.
Thanks for reading! You can follow the latest of what I’m up to on medium and on twitter. You can get monthly recaps of what I’m up to on this mailing list. I’m interested in starting conversations, hearing ideas, and finding opportunities related to what I’m working on. If you know folks who would be into this, I’d love it if you share this with them! And if you’d like to jam more on any of this, I’d love to chat here or on twitter.