My experience

#9
by SaisExperiments - opened

Just wanted to comment on the speed differences. In my use case I saw generation speeds (it/s, or rather s/it) go from 9.5 s/it using NF4 to 3.15 s/it using Q4_0 with T5xxl FP8, at only 6.4GB of VRAM usage.
With NF4 I see 3m8s per gen for 20 steps @ 215W
With Q4_0 it's 1m25.5s for 25 steps @ 215W, or 1m46.5s for 25 steps @ 105W
This is with an RTX 2080 8GB, Ryzen 5 5600G and 32GB of RAM (which seems necessary, as I see 19GB of usage when the model is loaded in Forge)
Prompt understanding & generation quality seem to still be good
skelly.png
what can't llama.cpp do x.x
llama3.png
Both images were generated with the Q4_0 GGUF & T5xxl FP8

On my RTX 4070 Super 12GB I have seen generation speeds go from 1.3 s/it with NF4 to 1.9 s/it with Q4_0 and 2.6 s/it with Q5_1, but I find the quality is noticeably better (sharper, more expressive)
using Python 3.11, PyTorch 2.4 (cu124)
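As a rough sanity check on those numbers, the per-image time is basically s/it times the step count (this ignores model-load and VAE-decode overhead, so real runs come out a bit longer):
def gen_time(s_per_it, steps):
    # total sampling time formatted as minutes + seconds
    total = s_per_it * steps
    return f"{int(total // 60)}m{total % 60:.0f}s"
print(gen_time(9.5, 20))    # NF4, 20 steps: ~3m10s (reported above: 3m8s)
print(gen_time(3.15, 25))   # Q4_0, 25 steps: ~1m19s (reported: 1m25.5s, the gap is overhead)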

Can you please provide step-by-step instructions to run this model? I cloned the repository, downloaded the .gguf file, and what's next? There are no usage instructions. Also, it's not clear whether it should go in the "ComfyUI/models/unet" folder or whether I need to create that folder myself (I was unable to find it in the repo).

  1. Is there a "simplest possible" demo ComfyUI workflow file available?
  2. Does this need VAE or CLIP (and what models to load there if it does)?
    I second AlexCaro - instructions are missing, and this holds back new users from trying this...
    Please write proper instructions for ComfyUI at least... please!

YES! See my longer post below with the answers to those questions!

I figured out how to run it. But then I see a horrible diagram where I can't even find a "generate" button or anything. I still doubt even a Q4 quant will work on my shitty Mac M1 8GB, but I really need this model... so I'll try any tricky approach to force it to work.

I'd say it won't run, at least not well. The Q4_0 is 6.79GB, then you need T5XXL, which is 4.8GB at FP8.
You'll end up with the models being pushed into mass storage (NVMe), which reads at ~2.5GB/s compared to system RAM at ~66GB/s.
So even if you do get it to work, it'll probably be painfully slow.
But I guess you won't actually know until you try @_@
I use Forge personally, so I can't help with ComfyUI stuff.
Edit: For more info (excluding RAM used by Windows): Forge uses 13GB of RAM & 6.6GB of VRAM when generating, or 17GB of RAM & 2.6GB of VRAM while idle.
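To put rough numbers on why that hurts, here's a back-of-the-envelope worst case that assumes everything spilled out of memory gets re-read from storage on every step (real offloading is smarter than this, so treat these as upper bounds):
model_gb  = 6.79 + 4.8   # Q4_0 UNet + FP8 T5XXL, sizes quoted in this thread
nvme_gbps = 2.5          # NVMe read speed from above
ram_gbps  = 66.0         # system RAM bandwidth from above
print(f"re-read from NVMe every step: {model_gb / nvme_gbps:.1f} s/step")   # ~4.6 s
print(f"re-read from RAM every step:  {model_gb / ram_gbps:.2f} s/step")    # ~0.18 s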

Apple silicon is not supported, yet I have seen someone get it running.

Please keep in mind that this functionality is still under development. City has been pushing fixes and LoRA support around the clock. Once it's stable it will get documented.

Is there a "simplest possible" demo ComfyUI workflow file available? __YES, see the workflow images in this post.
Does this need VAE or CLIP (and what models to load there if it does)? __YES, read below, or load the workflow from one of the images.
I second AlexCaro - instructions are terrible, and this holds back new users from trying this... __Hopefully a post like this can help NEW people in this game.


To answer my own question:

On a LOW-VRAM device (PS: I run this successfully on a laptop with 3GB VRAM and 16GB RAM; slow, but it generates images in a few minutes):
I will add TWO image files (PNGs) with functioning workflows in them! (hopefully the metadata is not stripped from the images). Very basic setups.
I run ComfyUI v1.2.26 (check the version in settings if you run one of the newer ComfyUI builds).
You can make images with as few as 4 steps, but they look more decent at 8 or 16 steps (remember we're talking LOW VRAM here, at 35-70 sec per iteration step),
so depending on your will to wait, you can do 20 or 24 steps, and have a very nice picture in the end.
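If you want to pick a step count up front, here is the rough wall-clock for sampling alone at that 35-70 sec/step range:
for steps in (4, 8, 16, 24):
    lo, hi = steps * 35, steps * 70
    print(f"{steps:>2} steps: {lo / 60:.0f}-{hi / 60:.0f} min")   # 4 steps ~2-5 min, 24 steps ~14-28 min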

"flux1-dev-Q4_0.gguf" (7gb) goes in: ComfyUI/models/unet/
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_0.gguf
OR
(if you have the machine or the VRAM)
"flux1-dev.safetensors" (24gb) goes in: ComfyUI/models/unet/
https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main

"t5xxl_fp16.safetensors" (9gb) go in: ComfyUI/models/clip/
https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main

"clip_l.safetensors" go in: ComfyUI/models/clip/
https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main

"ae.safetensors" goes in: ComfyUI/models/vae/
https://huggingface.co/black-forest-labs/FLUX.1-schnell/blob/main/ae.safetensors

INSTALL NODE: https://github.com/city96/ComfyUI-GGUF
git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
.\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt
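On the folder question from earlier: a fresh ComfyUI install should already have the unet/clip/vae folders under ComfyUI/models, but if any are missing you can just create them. A tiny helper sketch (not part of ComfyUI, just a convenience; run it from the directory that contains ComfyUI/):
from pathlib import Path
placements = {
    "ComfyUI/models/unet": ["flux1-dev-Q4_0.gguf"],   # or flux1-dev.safetensors
    "ComfyUI/models/clip": ["t5xxl_fp16.safetensors", "clip_l.safetensors"],
    "ComfyUI/models/vae":  ["ae.safetensors"],
}
for folder, files in placements.items():
    Path(folder).mkdir(parents=True, exist_ok=True)   # no-op if it already exists
    print(f"{folder} <- {', '.join(files)}")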

CDEMOworkflow.png
GGUFworkflow.png

Thanks Alex, I can't believe it, I have it working on a Mac Studio (96GB RAM, 38-core GPU) at about 355 seconds for a full generation using dev-Q4_K_S, 12 steps.

I am happy that the way I got it working also works well for you! Enjoy Flux on Mac!
