The first generation token output sees the whole cache key and value

https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L130

Hi there~
Thanks for your great work!
The `past_key_value` at L130 is updated with the new compressed key and value states.
However, the first generated token (L168) is still computed with the full key and value states, even though the prompt has already been compressed:
https://github.com/FasterDecoding/SnapKV/blob/82135ce2cc60f212a9ba918467f3d9c8134e163f/snapkv/monkeypatch/mistral_hijack_4_37.py#L168-L176
Is this a bug?

	attn_output = self._flash_attention_forward(
	    query_states,
	    key_states,
	    value_states,
	    attention_mask,
	    q_len,
	    dropout=dropout_rate,
	    use_sliding_windows=use_sliding_windows,
	)
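
To make the question concrete, here is a minimal, self-contained sketch of the behaviour described above. It is a toy model, not SnapKV's code: `ToyCompressedCache`, `attend`, and the "keep the last N entries" rule are all hypothetical stand-ins (SnapKV's actual selection logic is different). It only illustrates that the prefill step can attend over every prompt position while the cache keeps a compressed subset, so that only later decode steps see the compressed states.

```python
import math


def attend(query, keys, values):
    """Single-query softmax attention over plain Python lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class ToyCompressedCache:
    """Hypothetical stand-in for a compressed KV cache (not SnapKV's API)."""

    def __init__(self, keep_last):
        self.keep_last = keep_last
        self.keys, self.values = [], []

    def prefill(self, query, keys, values):
        # Store only a compressed subset (assumption: naive keep-last-N).
        self.keys = list(keys[-self.keep_last:])
        self.values = list(values[-self.keep_last:])
        # The attention itself still runs over the full, uncompressed states.
        return attend(query, keys, values), len(keys)

    def decode(self, query, key, value):
        # Later steps append to the compressed cache and attend over it.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values), len(self.keys)


prompt_keys = [[float(i), 1.0] for i in range(8)]
prompt_values = [[float(i)] for i in range(8)]

cache = ToyCompressedCache(keep_last=3)
_, seen_at_prefill = cache.prefill([0.1, 0.2], prompt_keys, prompt_values)
_, seen_at_decode = cache.decode([0.1, 0.2], [8.0, 1.0], [8.0])

print(seen_at_prefill)  # 8: the prefill step attends over all prompt positions
print(seen_at_decode)   # 4: 3 compressed entries plus the newly appended one
```

In this sketch the output that produces the first generated token comes from `prefill`, so it depends on all 8 prompt positions; compression only affects the second token onward. Whether the same holds by design in the linked file is exactly what this issue is asking.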
