Question: is key_state_compressed used for inference? #24

Hi, 

Thanks for the great contribution!

I have a question about how key_states_compress is used. If I understand correctly, key_states_compress holds the top-k tokens (clusters) selected from the prompt during the prefilling stage. During inference, each new query should then compute attention only over key_states_compress plus any newly generated key states. However, the [flash-attn](https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/llama_hijack_4_37.py#L127) call uses the full prompt's key_states, and key_states_compress is not used there. Is this intended, or am I missing something?
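
To make the question concrete, here is a minimal pure-Python sketch of the behavior I expected. It is not the repository's code: `compress_prompt_keys` and `decode_attention_weights` are hypothetical helpers, the selection rule (summing attention weights from a few trailing prompt queries) is my simplified assumption, and pooling, values, and multiple heads are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compress_prompt_keys(keys, window_queries, top_k):
    """Hypothetical helper: return indices of the top_k prompt keys that
    receive the most attention from the given window queries."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    votes = [0.0] * len(keys)
    for q in window_queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        for i, w in enumerate(weights):
            votes[i] += w
    ranked = sorted(range(len(keys)), key=lambda i: votes[i], reverse=True)
    return sorted(ranked[:top_k])  # keep original token order

def decode_attention_weights(query, compressed_keys, new_keys):
    """Hypothetical helper: attention of one decoding query over the
    compressed prompt keys plus the keys generated so far."""
    cache = compressed_keys + new_keys
    scale = 1.0 / math.sqrt(len(query))
    return softmax([dot(query, k) * scale for k in cache])

# Prefill: four prompt keys, two trailing queries used for selection.
prompt_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
window_queries = [[1.0, 0.0], [1.0, 1.0]]
kept = compress_prompt_keys(prompt_keys, window_queries, top_k=2)
compressed = [prompt_keys[i] for i in kept]

# Decode: the new query sees 2 compressed keys + 1 new key, not all 4 prompt keys.
weights = decode_attention_weights([0.5, 0.5], compressed, [[0.0, 1.0]])
```

In this sketch the decoding step attends over three keys rather than the full prompt, which is what I expected to see in the linked code path.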

Thank you!

