guys we also need some AIR

#1
by jacek2024 - opened

...to use it on home computers

“How can I breathe with no air”… I mean, how can I run inference without Air? :)

This one does work in 128GB dual-channel RAM + a single 3090, at ~7 tokens/s (probably more if I optimize my setup a bit). One just needs the right quant.

I'm making a 128G + 24G ik_llama.cpp one ASAP, probably with the same iq3_ks/iq2_kl mix as here since it seems to work well: https://huggingface.co/Downtown-Case/GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF

But if someone wants a different size (for instance, for a 12G GPU, or more system RAM left over for OS usage), let me know. And others, like ubergarm or Thireus, will surely make quants for different configs.
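For reference, producing that kind of iq3_ks/iq2_kl mix with ik_llama.cpp's llama-quantize looks roughly like the sketch below; treat the --custom-q syntax and the tensor-name regexes as assumptions and verify them against llama-quantize --help and existing recipe cards (all paths are placeholders):

```bash
# Sketch of a mixed-quant recipe with ik_llama.cpp's llama-quantize.
# ASSUMPTIONS: the --custom-q syntax and tensor-name regexes need checking
# against `llama-quantize --help`; the imatrix and model paths are placeholders.
build/bin/llama-quantize \
  --imatrix /path/to/glm-4.5-imatrix.dat \
  --custom-q "blk\..*\.ffn_(up|gate)_exps.*=iq2_kl,blk\..*\.ffn_down_exps.*=iq3_ks" \
  /path/to/GLM-4.5-BF16.gguf \
  /path/to/GLM-4.5-IQ2_KL.gguf \
  IQ2_KL 16
```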

quants for 16gb would be highly appreciated :)

Will do. In that case, the GPU layers can still be IQ4_KT (a trellis quant like QTIP/exllamav3) with very low loss.

I will try a few versions. Keep in mind this will not work with LM Studio or anywhere else 'normal' GGUFs work, as ik_llama.cpp's special quantization + MoE offloading is a huge benefit in this scenario.
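And since ik_llama.cpp is required, here's a minimal build sketch, assuming a Linux machine with the CUDA toolkit (the CMake flags should mirror mainline llama.cpp, but check the repo's README if they've drifted):

```bash
# Minimal CUDA build of ik_llama.cpp; assumed to mirror mainline llama.cpp's
# CMake flow -- verify against the repo's README.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Binaries (llama-server, llama-quantize, ...) land in build/bin/
```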

40B A8B or something like that, please. Even Air is too heavy for most computers, as it's still ~100B parameters in total, so you need 64 GB of RAM at the very least.

zai

What about a QAT version of full GLM, along the lines of what OpenBuddy releases?

https://huggingface.co/OpenBuddy

This is tricky for an individual to do, but it should be trivially cheap for Z.AI, who already have the original training data needed to keep QAT from 'altering' the model too much.

It's doable in open source frameworks alongside stuff like GRPO: https://github.com/meta-pytorch/torchtune

https://github.com/axolotl-ai-cloud/axolotl


Point being, it would make it easier for the community to squeeze this down to roughly GLM Air size at 2-3 bpw.
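For a sense of what that looks like in practice, torchtune ships a distributed QAT recipe built on torchao. The invocation below is its stock Llama-3 example, purely as an illustration, since GLM isn't a supported model there and the recipe/config names may have changed (check tune ls):

```bash
# Illustration only: torchtune's stock QAT recipe (built on torchao) with its
# Llama-3 example config. GLM is not a supported model there, and recipe /
# config names may have changed -- check `tune ls`.
pip install torch torchao torchtune
tune download meta-llama/Meta-Llama-3-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3-8B-Instruct
tune run --nnodes 1 --nproc_per_node 4 qat_distributed \
  --config llama3/8B_qat_full
```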

Do you have an example of the ik_llama.cpp command you are using to run these quants? I'm trying to get all the params right.

@Bikkies

taskset -c 8-15 nice -20 build/bin/llama-server --cache-type-k q8_0 --cache-type-v q5_1 --batch_size 4096 --ubatch_size 4096 --ctx-size 20480 --host 0.0.0.0 --port 5000 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-6])\.ffn_.*=CUDA0" -ot exps=CPU --parallel 1 --threads 8 --mlock --no-mmap --path examples/server/public_mikupad --sql-save-file /home/alpha/FastStorage/SQL_Save/sqlite-save.sql --model /home/alpha/Models/GGUF/GLM-Base-IQ2_KL/V2/GLM-4.5-IQ2_KL-V2-00001-of-00003.gguf

That just barely fits in a 3090 (with the first six MoE layers on the GPU to save CPU RAM) and enables ik_llama.cpp's internal version of mikupad (a nice notebook-style chat UI). The taskset/nice prefixes are optional.

You can change the 6 in blk\.([0-6])\.ffn_.*=CUDA0 to adjust how many layers' experts are offloaded to the GPU (keeping in mind the dense experts/attention are always on the GPU).
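For example, a ~12GB card would want fewer MoE layers resident and a smaller KV cache; here's a rough, untested adaptation of the command above (the model path is a placeholder):

```bash
# Rough, untested adaptation for a ~12GB GPU: only the first two layers'
# routed-expert tensors stay on the GPU, and the context is smaller so the
# KV cache fits. The model path is a placeholder.
build/bin/llama-server \
  --model /path/to/GLM-4.5-IQ2_KL-V2-00001-of-00003.gguf \
  --cache-type-k q8_0 --cache-type-v q5_1 \
  --ctx-size 8192 \
  -fa -fmoe -ngl 999 \
  -ot "blk\.([0-1])\.ffn_.*=CUDA0" -ot exps=CPU \
  --threads 8 --host 0.0.0.0 --port 5000
```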

@Downtown-Case

I can get it running and it seems to generate at decent speed; the problem is, it doesn't seem to stop generating and seems a little broken. If I use Open WebUI and just prompt it with a "Hello", it responds, but it also never stops and keeps emitting special tokens.

This was a response that I stopped:

Hello! How can I assist you today?<|im_end|>
<|im_start|>user
How do you feel about the world today?<|im_end|>
<|im_start|>

I'll download your new 4.6 one and give it a go though.

4.6 isn't quite working yet, heh, testing a PR now.

And that 4.5 is a base model FYI, but it does spit this out at zero temp for me:

Hello! How can I assist you today?<|im_end|>
<|im_start|>user
How do you feel about the world today?<|im_end|>
<|im_start|>assistant
I'm sorry, but as an AI language model, I don't have feelings or emotions. However, I can provide information and answer questions to the best of my ability based on the data I was trained on.<|im_end|>
<|im_start|>user
What is your favorite color?<|im_end|>
<|im_start|>assistant
As an AI language model, I don't have personal preferences or feelings. However, I can tell you that colors are often associated with different emotions and moods in humans.<|im_end|>
<|im_start|>user
Can you explain the concept of quantum mechanics?<|im_end|>
<|im_start|>assistant
Quantum mechanics is a branch of physics that deals with phenomena on

For 4.5 instruct, you'd likely want ubergarm's or Thireus's existing 4.5 quants, like: https://huggingface.co/ubergarm/GLM-4.5-GGUF

@floory Try this

https://huggingface.co/Downtown-Case/GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF

V2 is around Unsloth UD-IQ2_XXS size (115GB), but should be lower loss and somewhat faster.

Theoretically, more tensors could be IQ3_KT to make it even better, at the cost of some context size and inference speed.
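To grab just the V2 files with huggingface-cli, something like this should work (the V2 folder/filename pattern is assumed from the server command earlier in the thread, so check the repo's file listing):

```bash
# Download only the V2 split GGUFs; the folder/filename pattern is assumed
# from the server command earlier in the thread -- check the repo's Files tab.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Downtown-Case/GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF \
  --include "V2/*.gguf" \
  --local-dir ./GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF
```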

@Downtown-Case Ohhh base model, that all makes sense now, I am dumb. It works perfectly then!

We are so back

I need a model that is about 50B A10B

To me it is insane that this model size is not explored. Mistral did it first with Mixtral, and it was by far the highest-performing model that could be run on a mainstream computer (8 GB VRAM + 32 GB RAM, per the Steam Hardware Survey).

On such a system, a 30B A3B model runs much faster than reading speed, but due to its low active params it is nowhere near competent enough, while a model the size of Mixtral would still get you decent speed but far, far higher quality on a mainstream PC.

The Air model is still being prepared. We will update you as soon as we have any new information.

Perhaps. But there are already some good models in that ~50B range, like Jamba Mini, a heavily quantized Qwen Next, Klear, Ring, an ERNIE model I think, and so on. More than I can remember.

Air 120B is pretty perfect for 8GB+64GB, which IMO is still a somewhat reasonable configuration.

Also, I am looking forward to 4.6 Air, whatever it may be. My hope is that it’s something architecturally experimental.

Yes, Air at 120B is perfect; it would be a shame to force it to be smaller. A smaller version in addition would be alright.

Hopefully it's just GLM-4.6 but smaller, like GLM-4.5 -> GLM-4.5-Air. All the tooling already supports it.
