The first generation token output sees the whole cache key and value #27

@PengWenChen

past_key_value.update(key_states_compress, value_states_compress, self.layer_idx, cache_kwargs)

Hi there~
Thanks for your great work!
The past_key_value.update call at L130 does store the new compressed key and value states in the cache.
However, the first generated token (L168) is still computed with the full, uncompressed key and value states, even though the prompt has already been compressed.

attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)

Is this a bug?
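
To make the question concrete, here is a minimal toy sketch of the flow as I understand it. It is pure Python and not the repository's code: `ToyCache`, `compress`, `attend`, `prefill`, and `decode_step` are hypothetical names, and "keep the last `keep` entries" is only a stand-in for the real compression. It shows the prefill step writing compressed states to the cache while attention still runs over the full states, so only later decoding steps see the compressed cache.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def compress(keys, values, keep):
    """Hypothetical stand-in for prompt compression: keep the last `keep` entries."""
    return keys[-keep:], values[-keep:]


class ToyCache:
    """Hypothetical cache that appends states and returns everything stored."""

    def __init__(self):
        self.keys, self.values = [], []

    def update(self, keys, values):
        self.keys.extend(keys)
        self.values.extend(values)
        return self.keys, self.values


def prefill(cache, query, keys, values, keep):
    # The cache receives the compressed states ...
    k_small, v_small = compress(keys, values, keep)
    cache.update(k_small, v_small)
    # ... but attention for the first output still uses the full states.
    return attend(query, keys, values), len(keys)


def decode_step(cache, query, new_key, new_value):
    # Later steps attend over whatever the cache holds, i.e. compressed + new.
    keys, values = cache.update([new_key], [new_value])
    return attend(query, keys, values), len(keys)


prompt_keys = [[float(i), 1.0] for i in range(6)]
prompt_values = [[1.0, float(i)] for i in range(6)]
query = [0.1, 0.2]

cache = ToyCache()
first_out, first_seen = prefill(cache, query, prompt_keys, prompt_values, keep=2)
next_out, next_seen = decode_step(cache, query, [6.0, 1.0], [1.0, 6.0])
print(first_seen, len(cache.keys), next_seen)
```

In this toy setup the first output attends over all 6 prompt positions, while the following step attends over only the 2 kept positions plus the new token. Whether the same asymmetry is intended in the actual implementation is exactly what I am asking.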
