arxiv:2505.23923

ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents

Published on May 29, 2025

Submitted by feitengfang on Jun 2, 2025


Authors:

Feiteng Fang, Ting-En Lin, Dingwei Chen, Haonan Zhang

Abstract

ChARM, a character-focused adaptive reward model, improves preference learning for role-playing language agents by using an act-adaptive margin and self-evolution with unlabeled data, achieving superior results on dedicated benchmarks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.
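To make the comparison with the conventional Bradley-Terry model concrete, here is a minimal sketch of a pairwise reward-model loss with a margin term. The abstract does not say how the act-adaptive margin is computed, so `margin` is only a plain number supplied by the caller (a hypothetical placeholder, not the authors' formula). Setting it to zero recovers the standard Bradley-Terry objective.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bt_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise Bradley-Terry loss with an optional margin.

    r_chosen / r_rejected: scalar rewards for the preferred and the
    dispreferred response. margin: assumption for illustration only;
    the paper's act-adaptive rule is not reproduced here.
    """
    return -log_sigmoid(r_chosen - r_rejected - margin)


if __name__ == "__main__":
    # Same reward gap, increasing margin: the loss grows, so the pair
    # is pushed further apart during optimization.
    for m in (0.0, 0.5, 1.0):
        print(m, bt_loss(2.0, 0.0, margin=m))
```

A larger margin asks the model to separate the two responses by a wider reward gap before the loss becomes small, which is one way to read "dynamically adjusting optimization strength".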


Community

feltoner (paper author, paper submitter), Jun 2, 2025, edited Jun 2, 2025

We propose ChARM, a novel reward modeling framework designed to provide accurate rewards for enhancing the role-playing abilities of RPLAs. It dynamically adjusts optimization strength through an act-adaptive margin and leverages self-evolution to expand the training data.
We train a ChARM-based reward model on Qwen2.5-7B, which outperforms the traditional Bradley-Terry model by 13% in preference ranking. When combined with DPO, it achieves state-of-the-art performance on both CharacterEval and our newly developed role-playing benchmark, RoleplayEval.
We create RoleplayPref, the first role-playing preference dataset, with 1,108 characters across 13 subcategories and 16,888 bilingual dialogues. Additionally, we design a new evaluation benchmark, RoleplayEval, to advance research in this area.
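As a rough illustration of combining reward-model scores with DPO, the sketch below ranks candidate responses with a scoring function and evaluates the standard DPO loss on the resulting pair. The pairing rule (best against worst), the function names, and the toy scorer are assumptions made for illustration. The comment above does not describe how ChARM rewards are actually turned into preference pairs.

```python
import math
from typing import Callable, Sequence, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def build_pair(responses: Sequence[str],
               score: Callable[[str], float]) -> Tuple[str, str]:
    """Return (chosen, rejected) as the highest- and lowest-scored response.

    `score` stands in for a trained reward model (hypothetical interface).
    """
    ranked = sorted(responses, key=score)
    return ranked[-1], ranked[0]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


if __name__ == "__main__":
    # Toy scorer: longer replies score higher (illustration only).
    chosen, rejected = build_pair(["Hi.", "Greetings, traveler!", "Hello"], len)
    print(chosen, "|", rejected)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In practice the log-probabilities would come from a language model, and a library implementation of DPO would normally replace the hand-written loss.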



