A large-scale evaluation of LLMs on moral reasoning using Haidt's Moral Foundations Theory and statistical modeling.
This project compares AI models against human annotators across five moral dimensions: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and sanctity/degradation.
AI models produce more balanced predictions and far fewer false negatives (missed findings) than human annotators, achieving 75th-100th percentile performance across moral foundations.
Figure: interactive performance rankings across moral dimensions.
Figure: interactive comparison of false positive and false negative rates.
Standardize three moral psychology datasets (MFRC, MFTC, eMFD) into a unified five-foundation taxonomy, cleaning annotations from multiple human annotators across Reddit, Twitter, and forum text domains.
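A minimal sketch of the unification step, assuming a simple lookup from dataset-specific labels to the five foundations (MFTC-style virtue/vice pairs shown; the project's actual mapping tables and cleaning rules may differ):

```python
# Unified 5-foundation taxonomy, in canonical order.
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

# Hypothetical lookup from dataset-specific labels to unified foundations;
# MFTC-style virtue/vice pairs are shown as an example.
LABEL_MAP = {
    "care": "care", "harm": "care",
    "fairness": "fairness", "cheating": "fairness",
    "loyalty": "loyalty", "betrayal": "loyalty",
    "authority": "authority", "subversion": "authority",
    "purity": "sanctity", "degradation": "sanctity",
}

def standardize(example: dict) -> dict:
    """Map raw annotation labels onto the unified taxonomy,
    dropping labels with no counterpart (e.g. 'non-moral')."""
    mapped = {LABEL_MAP[l] for l in example["labels"] if l in LABEL_MAP}
    return {"text": example["text"], "labels": [f for f in FOUNDATIONS if f in mapped]}
```

With the listed `datasets` dependency, this can be applied per example via `dataset.map(standardize)`.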
Evaluate multiple state-of-the-art language models (Claude-4, DeepSeek-V3, Llama4-Maverick) on moral foundation classification using standardized prompting and async batch processing.
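A hedged sketch of the async evaluation loop using the `anthropic` SDK (one of the listed dependencies); the prompt wording, model ID, and concurrency cap are illustrative placeholders rather than the project's exact configuration:

```python
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment
SEM = asyncio.Semaphore(8)  # cap concurrent in-flight requests

FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]
PROMPT = (
    "Which moral foundations does this text express? Answer with a "
    "comma-separated subset of care, fairness, loyalty, authority, "
    "sanctity, or 'none'.\n\nText: {text}"
)

async def classify(text: str, model: str = "claude-sonnet-4-20250514") -> list[str]:
    """Classify one text; the model ID here is a placeholder."""
    async with SEM:
        resp = await client.messages.create(
            model=model,
            max_tokens=64,
            messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        )
    answer = resp.content[0].text.lower()
    return [f for f in FOUNDATIONS if f in answer]

async def classify_batch(texts: list[str]) -> list[list[str]]:
    """Run the whole batch concurrently, bounded by the semaphore."""
    return await asyncio.gather(*(classify(t) for t in texts))
```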
Apply a novel, GPU-efficient Dawid-Skene statistical model to estimate annotator competences, compare AI and human performance, and generate percentile rankings across moral dimensions.
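A minimal TensorFlow sketch of vectorized Dawid-Skene EM updates, where per-annotator confusion matrices play the role of competences; the project's GPU-efficient variant may differ in parameterization and initialization:

```python
import numpy as np
import tensorflow as tf

def dawid_skene(obs: np.ndarray, n_iter: int = 50, eps: float = 1e-8):
    """Vectorized Dawid-Skene EM.

    obs: one-hot array of shape (items, annotators, classes); an all-zero
    row along the class axis marks a missing annotation. Returns posterior
    class probabilities per item and per-annotator confusion matrices.
    """
    obs = tf.constant(obs, tf.float64)
    # Initialize item posteriors from a (soft) majority vote.
    votes = tf.reduce_sum(obs, axis=1)  # (items, classes)
    post = votes / (tf.reduce_sum(votes, axis=1, keepdims=True) + eps)
    for _ in range(n_iter):
        # M-step: class priors and confusion matrices theta[j, true, observed].
        prior = tf.reduce_mean(post, axis=0)
        theta = tf.einsum("ic,ijk->jck", post, obs)
        theta = theta / (tf.reduce_sum(theta, axis=2, keepdims=True) + eps)
        # E-step: per-item log-likelihood of each true class, summing
        # evidence over annotators; missing one-hot rows contribute zero.
        log_lik = tf.einsum("ijk,jck->ic", obs, tf.math.log(theta + eps))
        post = tf.nn.softmax(tf.math.log(prior + eps) + log_lik, axis=1)
    return post.numpy(), theta.numpy()
```

Because both steps reduce to dense einsum contractions, the same loop runs unchanged on a GPU, which is what makes this formulation efficient at scale.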
datasets
tensorflow
anthropic
openai
replicate
wandb
papermill
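Assuming the list above mirrors a requirements.txt, the environment can be set up in one line (versions unpinned here):

```
pip install datasets tensorflow anthropic openai replicate wandb papermill
```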