From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Abstract
While Multimodal Large Language Models (MLLMs) are adept at answering “what” is in an image—identifying objects and describing scenes—they often lack the ability to understand “how” an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model’s alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.

Figure: We present CogIP-Bench, a comprehensive cognition benchmark that evaluates how well MLLMs' cognition-score predictions align with human judgments. Left: example datapoints for each dimension: aesthetics, funniness, emotion, and memorability. Middle: post-training results of three popular MLLMs across dimensions. Right: results of swapping the MLLM backbone in the Qwen-Image pipeline, comparing the effect on cognition-related image generation.
1. The Challenge: The Cognitive Gap
Current MLLMs are trained primarily on factual, descriptive data. When evaluated on subjective traits, they struggle significantly.
- Aesthetics: Visual beauty and artistic quality.
- Funniness: The degree to which an image elicits amusement.
- Emotion: The positive or negative emotional valence evoked.
- Memorability: How likely an image is to be remembered.
Key Finding: Our evaluation shows a significant discrepancy between models and humans. For instance, all evaluated MLLMs (including GPT-4o and Gemini) showed nearly zero correlation with human judgments of image memorability.
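For concreteness, alignment on a given dimension can be quantified with a rank-correlation metric between model-predicted and human scores; the minimal sketch below assumes Spearman's ρ and toy data, and is illustrative rather than the benchmark's reference implementation.

```python
# Minimal sketch: measuring human-model alignment on one cognitive dimension.
# Assumes each item has a human annotation and a numeric score parsed from the
# MLLM's answer; Spearman's rank correlation is one common choice of metric.
from scipy.stats import spearmanr

human_scores = [0.82, 0.41, 0.67, 0.93, 0.12]   # e.g., memorability annotations
model_scores = [0.55, 0.60, 0.58, 0.57, 0.59]   # scores parsed from MLLM outputs

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")  # near 0 -> poor alignment
```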
📊 Benchmark Results
2. CogIP-Bench & Methodology
To address this, we constructed CogIP-Bench, the first systematic framework to measure cognitive alignment.
Dataset Construction
We curated data from established psychological and computer vision datasets (a hypothetical datapoint layout is sketched after this list):
- Aesthetics: Sourced from LaMem and LAION-Aesthetic-Predictor.
- Funniness: Sourced from HumorDB (images without captions to test visual humor).
- Emotional Valence: Sourced from the FindingEmo dataset.
- Memorability: Sourced from the LaMem dataset.
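For illustration only, a single benchmark item can be pictured as an image paired with a cognitive dimension and a human score; the field names in this hypothetical record are ours, not the benchmark's actual schema.

```python
# Hypothetical CogIP-Bench record (field names and values are illustrative only).
datapoint = {
    "image": "images/example_000123.jpg",   # source image
    "dimension": "memorability",            # aesthetics | funniness | emotion | memorability
    "human_score": 0.78,                    # normalized human annotation
    "source_dataset": "LaMem",              # original dataset the item was curated from
}
```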

Figure: Examples from CogIP-Bench. For each cognitive dimension, we show two images along with their cognition scores and an interpretation of that dimension.
Post-Training Strategy
We employ Supervised Fine-Tuning (SFT) with a “soft-label” loss function. Because standard language models treat numbers as discrete tokens (losing, for example, the fact that 1 is closer to 2 than to 9), we use a soft-label distribution to encourage predictions that are numerically close to the human scores.
\[q^{SL}(t)=(1-\eta)\,\delta(t)+\eta\,\psi(t)\]
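A minimal sketch of how such a soft-label target could be constructed, assuming δ(t) is the one-hot distribution over the ground-truth score token, ψ(t) spreads probability mass over numerically nearby scores (a Gaussian here, which is our illustrative choice), and η is the smoothing weight; the names and hyperparameters below are placeholders, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def soft_label_target(target_score: int, num_scores: int = 10,
                      eta: float = 0.3, sigma: float = 1.0) -> torch.Tensor:
    """Mix a one-hot label delta(t) with a Gaussian-shaped psi(t) over nearby scores."""
    scores = torch.arange(num_scores, dtype=torch.float32)
    delta = F.one_hot(torch.tensor(target_score), num_scores).float()
    psi = torch.exp(-((scores - target_score) ** 2) / (2 * sigma ** 2))
    psi = psi / psi.sum()                     # normalize to a distribution
    return (1 - eta) * delta + eta * psi      # q^{SL}(t)

# Cross-entropy against the soft target instead of the hard one-hot label.
logits = torch.randn(10)                      # model logits over score tokens 0..9
q_sl = soft_label_target(target_score=7)
loss = -(q_sl * F.log_softmax(logits, dim=-1)).sum()
print(float(loss))
```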
Describe-then-Predict Prompts
Prompts instruct the model to first predict a categorical label, as in a classification task, and then, using a provided label-to-score mapping rule, output the corresponding numeric score. This two-step formulation lets the model first make a qualitative judgment, leveraging its strength in natural-language reasoning, and then produce a quantitative score grounded in that classification.
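A hypothetical prompt following this describe-then-predict scheme might look like the template below; the wording and the label-to-score mapping are illustrative, not the exact prompts used in the paper.

```python
# Illustrative describe-then-predict prompt (not the paper's exact wording).
PROMPT_TEMPLATE = """You are shown an image. First, judge its memorability and
answer with exactly one categorical label: very low, low, medium, high, or very high.
Then map your label to a numeric score using this rule:
very low -> 1, low -> 3, medium -> 5, high -> 7, very high -> 9.
Answer in the format: Label: <label>; Score: <score>."""
```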
3. Results: Closing the Gap
Alignment Effects
After post-training (SFT), models showed consistent alignment improvements across nearly all cognitive dimensions.
📘 SFT Results
Regression on Other Benchmarks
After SFT, the impact on other general benchmarks is modest and acceptable, and Gemma3 even shows a surprising improvement.
📙 Other Benchmarks
4. Application: Human-Centric Image Generation
The most compelling finding is that this alignment is transferable. By swapping the default MLLM backbone in the Qwen-Image generation pipeline with our cognitively aligned model, we can guide the synthesis of images that better embody desired traits. In a user study, images generated with our SFT backbone were preferred 1.7 times more frequently than those from the baseline.
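A rough sketch of what such a backbone swap could look like, assuming the Hugging Face diffusers Qwen-Image pipeline (whose text encoder is a Qwen2.5-VL model) and a hypothetical path to the fine-tuned checkpoint:

```python
import torch
from diffusers import DiffusionPipeline
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the stock Qwen-Image pipeline.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Swap in the cognitively aligned MLLM as the text encoder.
# "path/to/cogip-sft-qwen2.5-vl" is a placeholder for the fine-tuned checkpoint.
pipe.text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/cogip-sft-qwen2.5-vl", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(prompt="a highly memorable street scene at dusk").images[0]
image.save("memorable_scene.png")
```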
Qwen-Image Examples

Figure: Qualitative comparison of images generated by the Qwen-Image pipeline using different MLLM backbones (with the same prompt). The figure shows the effect of pretraining versus supervised fine-tuning (SFT) on image cognition properties. For each image pair, left: base model; right: SFT model. Generation prompts are shown under each pair. Images generated with our SFT MLLM backbone better reflect the cognitive cues embedded in the prompts.
User Study Results

Figure: Preference percentages of images generated by QwenImage using the baseline MLLM backbone and our fine-tuned version.
Citation
```bibtex
@article{pixels2feelings2025,
  title={From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images},
  author={Authors listed in paper},
  journal={ArXiv},
  year={2025}
}
```