human-value-alignment

Here are 2 public repositories matching this topic...

wang8740 / MAP

Documentation at

finetuning llm human-feedback rlhf human-value-alignment multi-objective-alignment

Updated Mar 27, 2025
Python

biological-alignment-benchmarks / milgram-for-llms

Four main takeaways: (1) LLMs are susceptible to pressure: they comply despite expressing distress; (2) LLMs are vulnerable to gradual boundary and value violations; (3) when LLMs refuse, they may ignore the response-format requirements, so the query is retried; (4) we hypothesise that a token-pattern continuation attractor might cause obedience.

benchmark evaluation balancing eval multi-objective social-psychology human-values multi-turn large-language-models obedience value-alignment human-value-alignment obedience-and-conformity safety-constraints milgram runaway-agent

Updated May 26, 2026
Python
