zer0int/LongCLIP-GmP-ViT-L-14

27 days ago

Is it possible to use it directly (without additional training) with SDXL diffusers pipelines? If yes, it would be great to have such an example.

zer0int

Owner 26 days ago

Please see the original author's GitHub (sdxl.py is the example you're looking for): https://github.com/beichenzbc/Long-CLIP/blob/main/SDXL/SDXL.md.
Note that I only provide fine-tunes of the model; the original model comes from the authors at https://github.com/beichenzbc/Long-CLIP/. :-)

apiasecki

26 days ago

Great! Thank you.

apiasecki

26 days ago

By the way, I was experimenting with SDXL and your models using the trick from https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/discussions/3#66db2ca9eaadd8595a531db3
The results are different, but it is hard to say whether text is rendered better. SDXL is not particularly good at text anyway.
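
For readers who have not followed that link: one common way to try a fine-tuned CLIP-L with SDXL in diffusers is to load it as the pipeline's first text encoder. The sketch below is only my assumption of what such a swap looks like, not a copy of the linked post. The function name is hypothetical, the SDXL base repo ID is an assumption, and it presumes the fine-tune repo ships transformers-format weights. It targets the 77-token CLIP-L model; the 248-token Long-CLIP model needs the extra handling shown in the original author's sdxl.py.

```python
def build_sdxl_with_custom_clip_l(
    clip_repo="zer0int/CLIP-GmP-ViT-L-14",
    sdxl_repo="stabilityai/stable-diffusion-xl-base-1.0",  # assumed base checkpoint
    device="cuda",
):
    """Hypothetical helper: SDXL pipeline with a swapped-in CLIP-L text encoder."""
    # Imported lazily so this file can be loaded without the GPU stack installed.
    import torch
    from diffusers import StableDiffusionXLPipeline
    from transformers import CLIPTextModel

    # SDXL's first text encoder is a CLIP ViT-L/14 text model, so a CLIP-L
    # fine-tune can take its place. Assumption: the repo loads via transformers.
    text_encoder = CLIPTextModel.from_pretrained(clip_repo, torch_dtype=torch.float16)

    # Components passed as keyword arguments override the ones in the checkpoint;
    # the second text encoder (OpenCLIP bigG) is left untouched.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        sdxl_repo,
        text_encoder=text_encoder,
        torch_dtype=torch.float16,
    )
    return pipe.to(device)
```

Calling `build_sdxl_with_custom_clip_l()` and then `pipe(prompt="a sign that says hello").images[0]` should give an image to compare against the stock pipeline with the same seed.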

zer0int

Owner 26 days ago

Yeah, that is unfortunately true. When I tried to fine-tune Long-CLIP for improved general accuracy (including text), I could not achieve the same level of accuracy in guiding text as with the original CLIP-L (77 tokens). I even tried it with Flux, and the 12B-parameter transformer of Flux.1 is definitely able to generate long and coherent text.
See here for my CLIP-L (77 tokens) model that excels in generating text with Flux (I did not extensively test with SDXL): https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14

But yeah, I recently fine-tuned Long-CLIP on an additional 100,000 text-image pairs, and it still did not really get better at text (for use with Flux). That model is unreleased, since it is not really an improvement.
It is either a misalignment of the latent space of (Long-)CLIP with T5 + Flux (or the U-Net, in the case of SDXL), or 'something else' about Long-CLIP (e.g. it may require 1 million or maybe 10 million text-image pairs to truly "fill in" the interpolated text embeddings with long-label knowledge). Or probably even both. Maybe there is a workaround that requires neither millions of examples (for Long-CLIP) nor realigning the SDXL U-Net or Flux.1 to the (frozen) text encoders with all weights requiring gradients; I'm thinking about it (because I sure don't have the 800 GB of VRAM Flux.1 would need for that, lol).
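
To make the "interpolated text embeddings" point concrete, here is a rough sketch of how I understand the stretching of the positional embedding from 77 to 248 positions. The numbers (keep the first 20 positions, stretch the rest by a factor of 4) are my reading of the Long-CLIP paper and should be treated as assumptions. The function name is hypothetical, and this uses plain Python lists with simple linear interpolation, not the authors' actual tensor code.

```python
def stretch_positional_embedding(pos, keep=20, factor=4):
    """Keep the first `keep` rows as-is and linearly interpolate the rest.

    `pos` is a list of embedding rows (lists of floats), one row per position.
    keep=20 and factor=4 are assumed values, giving 20 + 57 * 4 = 248 rows
    for a 77-row input.
    """
    head = [row[:] for row in pos[:keep]]  # well-trained positions, copied unchanged
    tail = pos[keep:]                      # positions to be stretched
    stretched = []
    for j in range(len(tail) * factor):
        src = j / factor                   # fractional index into the old tail
        lo = int(src)
        hi = min(lo + 1, len(tail) - 1)    # clamp: the last rows repeat the final row
        w = src - lo
        stretched.append([(1 - w) * a + w * b for a, b in zip(tail[lo], tail[hi])])
    return head + stretched


# Toy check with 1-dimensional "embeddings" so the shape change is easy to see.
toy = [[float(i)] for i in range(77)]
long_pos = stretch_positional_embedding(toy)
print(len(toy), "->", len(long_pos))
```

Three out of every four new positions are blends of neighbouring old ones rather than learned values, which is why it seems plausible to me that a lot of long-caption data is needed before they carry real information.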
