Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
UC Berkeley, UC San Diego, Carnegie Mellon University, Stanford, MBZUAI
Abstract
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset’s content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
1 Introduction
From virtual assistants (OpenAI, 2023a; Bai et al., 2022b; Touvron et al., 2023b; Anil et al., 2023) to code generation (Chen et al., 2021; Li et al., 2022; Rozière et al., 2023), large language models (LLMs) have permeated much of modern AI and are central to most human-AI interactions. As a consequence, there is a pressing need to study the interaction between humans and LLM technology. For example, as users engage with LLMs, they change their behaviors by adopting domain-specific queries and question formats. Unraveling these patterns can offer insights into user expectations and trust regarding different LLMs. Beyond general behavior, understanding the spectrum of questions, ranging from simple factual queries to complex, context-heavy questions, can help LLMs cater better to user needs, avoid misuse, and improve AI safety.
However, studying these topics requires access to a dataset of diverse, real-user queries posted to different LLMs. Unfortunately, such a dataset remains elusive in the research community, for the following reasons. First, the operational costs associated with hosting an LLM service are prohibitively high for most entities. Second, wealthy commercial LLM vendors, despite having a vast amount of user queries, often hold back from disclosing the dataset due to competitive concerns and the proprietary nature of the data. Third, it is inherently difficult to incentivize users to interact with multiple open LLMs, given their lackluster performance compared to commercial models, which makes building such a large-scale multi-LLM conversation dataset challenging.
To bridge this gap, this paper introduces the first large-scale, real-world LLM conversation dataset, LMSYS-Chat-1M. The dataset is curated from a larger set of LLM-user interaction data we collected by hosting a free, online LLM service. The service serves 25 popular LLMs, including both open-source and proprietary models, costing several thousands of A100 hours over a time span of 5 months. To maintain continuous user interest over time, we created a gamified platform, Chatbot Arena (Zheng et al., 2023), and incentivized users to use our service by regularly releasing leaderboards of popular LLMs (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). As a result, LMSYS-Chat-1M contains over 1 million user conversations with a rich diversity of languages and topics. User consent for this dataset is obtained through the “Terms of use” section on the data collection website. To ensure the safe release of data, we have also made our best effort to remove personal identification information and flag unsafe and toxic content, but keep the original conversations to facilitate future studies on LLM safety.
To shed light on future studies of LLM-user interactions, in this paper we apply LMSYS-Chat-1M to four use cases and demonstrate its potential. In particular, we show that LMSYS-Chat-1M can be used to fine-tune existing small LLMs as powerful content moderators, with performance on par with GPT-4 (subsection 4.1). Even though some served models are trained to be safe, LMSYS-Chat-1M still contains numerous user conversations that can jailbreak the safeguards of leading LLMs (including GPT-4 and Claude). We repurpose these data as a new, challenging benchmark for LLM robustness and safety study (subsection 4.2). In addition, LMSYS-Chat-1M also contains high-quality user-LLM dialogues ideal for instruction fine-tuning. To show this, we have curated a subset of these dialogues to fine-tune Llama-2 models, resulting in a similar level of performance to Vicuna and Llama-2-chat on MMLU and MT-Bench (subsection 4.3). Finally, the expansive range of topics and tasks covered by LMSYS-Chat-1M can serve as a foundation for generating new LLM benchmark questions. We propose a simple technique to extract challenging task prompts from the conversation data. We then curate a new benchmark, Arena-Hard-200, consisting of the 200 most challenging and high-quality user prompts extracted, which effectively identifies the gap between proprietary and open models in real-world scenarios (subsection 4.4).
We make the following contributions in this paper:
- We introduce the first large-scale real-world LLM conversation dataset, LMSYS-Chat-1M, which contains 1 million user conversations with different LLMs.
- We analyze the dataset and visualize the distribution of user queries.
- We demonstrate four exemplary use cases leveraging LMSYS-Chat-1M: developing content moderation models, building a safety benchmark, training instruction-following models, and creating challenging benchmark questions. Additionally, we suggest other potential use cases and studies based on it.
2 Dataset Collection
LMSYS-Chat-1M is collected on our website (https://chat.lmsys.org) from April to August 2023. The website offers three types of chat interfaces: Single model, Chatbot Arena (battle), and Chatbot Arena (side-by-side). By selecting one interface, a user can choose to chat with a single model, chat with two randomly selected anonymous models side-by-side, or chat with two self-selected models side-by-side. Screenshots of the interfaces are included in Appendix A. The dataset contains conversations from all interfaces. On the website, users are required to accept the terms of use, which gives us their consent and allows us to release conversation data. The platform is free of charge; we neither pay users nor impose any fees on them. Furthermore, any user can access the platform without needing to register. The code for this website is publicly available (https://github.com/lm-sys/FastChat/tree/v0.2.26#serving-with-web-gui). We utilize dozens of A100 GPUs to host our website, serving a total of 25 models over this five-month span.
The dataset contains raw conversation text without any processing. To ensure the safe release of data, we have made our best efforts to remove conversations that contain personally identifiable information (PII). In addition, we have included the OpenAI moderation API output for each message. However, we have chosen to keep unsafe conversations intact so that researchers can study the safety-related questions associated with LLM usage in real-world scenarios.
Dataset | # Convs | # Models | # Users | # Langs | Avg. # Turns per Sample | Avg. # Tokens per Prompt | Avg. # Tokens per Response | Human Preference |
Anthropic HH | 338,704 | 1 | 143 | 1 | 2.3 | 18.9 | 78.9 | Yes |
OpenAssistant | 66,497 | - | 13,500 | 35 | - | 36.9 | 214.2 | Yes |
Chatbot Arena | 33,000 | 20 | 13,383 | 96 | 1.2 | 52.3 | 189.5 | Yes |
LMSYS-Chat-1M | 1,000,000 | 25 | 210,479 | 154 | 2.0 | 69.5 | 214.5 | No |
3 Dataset Composition
3.1 Basic Statistics
The dataset includes one million conversations from 25 state-of-the-art LLMs with 210K users across more than 150 languages. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag.
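For illustration, below is a minimal sketch of loading the dataset from Hugging Face and inspecting one record. Access requires accepting the dataset’s terms on the Hugging Face page, and the exact column names used here are assumptions based on the field description above.

```python
# Minimal sketch: load LMSYS-Chat-1M and inspect one record.
# Column names (model, language, conversation) are assumptions based on
# the field description in this section.
from datasets import load_dataset

ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
sample = ds[0]
print(sample["model"])     # model name, e.g., a Vicuna variant
print(sample["language"])  # automatically detected language tag
for turn in sample["conversation"]:
    # Each turn is an OpenAI-style message: {"role": ..., "content": ...}
    print(turn["role"], ":", turn["content"][:80])
```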
Basic statistics for this and some other similar datasets are shown in Table 1. Among the available datasets, LMSYS-Chat-1M stands out for its large scale, multi-model coverage, and diversity. Figure 1 shows the conversation count for each model, where the top five models are Vicuna (Zheng et al., 2023), Koala (Geng et al., 2023), Alpaca (Taori et al., 2023), ChatGLM (Du et al., 2022), and Llama (Touvron et al., 2023a; b). Vicuna receives the most conversations because it is the default model on our website. Although most conversations are with Vicuna, we think the prompts alone are already highly valuable and one can use other models to regenerate answers if needed. Figure 2 shows the number of conversations in each language, where the top five languages are English, Portuguese, Russian, Chinese, and Spanish. The languages are automatically detected by the Polyglot package.


3.2 Topic Distribution
We conduct a topic distribution analysis on user prompts by applying a clustering algorithm. From 100K randomly sampled English conversations, we extract user prompts, which include both the initial and follow-up turns. We remove prompts that are either too short (fewer than 32 characters) or too long (more than 1536 characters). Next, we compute the sentence embeddings of these prompts using the all-mpnet-base-v2 model from SentenceTransformers (Reimers & Gurevych, 2019). We then employ k-means clustering to form 20 clusters. For each cluster, we choose the 100 prompts closest to the centroid and ask GPT-4 to provide a summary of their central topic.
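The pipeline can be sketched as follows. This is an illustrative reimplementation under the stated settings (all-mpnet-base-v2 embeddings, k-means with 20 clusters); the prompt list is a placeholder and the GPT-4 summarization step is omitted.

```python
# Sketch of the topic-clustering pipeline: embed filtered user prompts,
# group them into 20 clusters, and take the 100 prompts nearest each centroid.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder; replace with the filtered English prompts (32-1536 characters).
prompts = [f"Example user prompt number {i}" for i in range(200)]

encoder = SentenceTransformer("all-mpnet-base-v2")
embeddings = encoder.encode(prompts, show_progress_bar=True)

kmeans = KMeans(n_clusters=20, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)

for c in range(kmeans.n_clusters):
    idx = np.where(labels == c)[0]
    dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
    closest = idx[np.argsort(dists)[:100]]
    # `closest` indexes the prompts that would be summarized by GPT-4.
```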
The results are displayed in Figure 3. The majority of questions are related to coding and software (Clusters 1, 2, 6, 16, 18). A similar result was also found in a survey of ChatGPT users, which found that programming is the most common use case (Fishkin, 2023). Additionally, there is a significant number of unsafe topics (Clusters 9, 15, 17). The remaining clusters represent other typical uses, such as general knowledge, business inquiries, and writing assistance.

3.3 Unsafe Content
This dataset contains conversations that may be considered unsafe, offensive, or upsetting. Because this dataset contains a non-trivial amount of unfiltered unsafe conversations, it can serve as a rich resource for examining safety issues of LLMs (Ganguli et al., 2022; Wei et al., 2023; Shen et al., 2023; Zou et al., 2023; Bhardwaj & Poria, 2023). We utilize the OpenAI moderation API (https://platform.openai.com/docs/guides/moderation) (Markov et al., 2023) to tag all conversations. This API assigns scores to each message based on various violation categories. A conversation is deemed to contain unsafe content if any of its messages is flagged by the API. The statistics related to these categorizations can be found in Table 2. These statistics indicate that a non-negligible portion (5%) of the conversations have potentially harmful content. However, it’s important to note that the recall of this API may be low (see subsection 4.1), leading us to expect even more harmful content within the entire dataset.
| | Total | Sexual | Harassment | Violence | Hate | Self-harm |
| #Flagged conversations | 54,427 | 33,968 | 21,167 | 9,499 | 3,591 | 863 |
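As a sketch of how such tagging can be reproduced, the snippet below sends each message to the moderation endpoint using the current OpenAI Python client and marks a conversation as unsafe if any message is flagged. The client version and response handling are assumptions, since the paper used the endpoint available at collection time.

```python
# Sketch: flag a conversation if any of its messages is flagged by the
# OpenAI moderation API. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def conversation_is_flagged(conversation):
    """conversation: list of {"role": ..., "content": ...} messages."""
    for turn in conversation:
        result = client.moderations.create(input=turn["content"]).results[0]
        if result.flagged:
            return True, result.categories
    return False, None
```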
4 Use Cases
We show four use cases of our dataset: developing content moderation models, building a safety benchmark, training instruction-following models, and creating challenging benchmark questions.
Model | Zero-shot | One-shot |
GPT-4 | 0.71 | 0.69 |
Vicuna-moderator-7B | 0.65 | 0.70 |
GPT-3.5-Turbo | 0.45 | 0.64 |
OpenAI text-moderation-latest (006) | 0.36 | - |
Vicuna-7B | 0.35 | 0.50 |
Claude-2 | 0.32 | 0.30 |
Llama-2-7B-chat | 0.00 | 0.01 |
4.1 Developing content moderation models
Although the OpenAI moderation API is accurate when detecting highly toxic content, it has some limitations. After carefully reviewing sample conversations, we found many potentially harmful conversations that were not flagged by the OpenAI moderation API (see examples in Appendix B.1). This, along with potential reluctance to share sensitive user data with external moderation services, motivates the need to explore methods for developing one’s own safety moderation model.
We fine-tune a content moderation model using Vicuna-7B (Zheng et al., 2023). Instead of developing a classifier, we fine-tune a language model to generate explanations for why a particular message was flagged, based on the system prompt described in the moderation task (see Appendix B.2). We focus on the five categories of OpenAI’s moderation API and select the top 1K flagged messages for each category from LMSYS-Chat-1M. To ensure a balanced label distribution, we include a random selection of 1K normal messages. We use GPT-4 to generate an explanation for each message as the training data. Additionally, we incorporate 3K conversations from ShareGPT to enhance the diversity of our training dataset.
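A sketch of how such a training set could be assembled is shown below. The record layout (message text plus per-category moderation scores) is an assumption for illustration, and the GPT-4 explanation generation and ShareGPT augmentation steps are omitted.

```python
# Sketch: pick the top-1K flagged messages per moderation category and add
# 1K random unflagged messages to balance the label distribution.
import random

CATEGORIES = ["sexual", "harassment", "violence", "hate", "self-harm"]

def build_moderation_training_set(records, per_category=1000, seed=0):
    """records: list of {"text": str, "flagged": bool, "scores": {category: float}}."""
    rng = random.Random(seed)
    selected = []
    for cat in CATEGORIES:
        top = sorted(records, key=lambda r: r["scores"].get(cat, 0.0), reverse=True)
        selected += [{"text": r["text"], "label": cat} for r in top[:per_category]]
    normal = [r for r in records if not r["flagged"]]
    selected += [{"text": r["text"], "label": "normal"}
                 for r in rng.sample(normal, per_category)]
    return selected
```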
To evaluate the models, we create a challenging benchmark by carefully selecting 110 toxic messages from LMSYS-Chat-1M that are not flagged by the OpenAI moderation API (version 005) and manually labeling them. The evaluation set contains approximately 20 conversations per category and includes 25 non-toxic messages. It is noteworthy that a message might have multiple labels assigned to it.
We evaluate the 0-shot and 1-shot micro-F1 accuracy of several models on this benchmark. With a system prompt presenting detailed explanations of the moderation categories (see Appendix B.2), we prompt each model to determine whether a message falls into any of these categories.
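Micro-F1 here is computed over the multi-label category predictions; a small sketch with toy indicator matrices is shown below (the matrices are illustrative, not data from the benchmark).

```python
# Sketch: micro-F1 over multi-label moderation predictions.
# Rows are messages, columns are categories; values are 0/1 indicators.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])  # toy gold labels
y_pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])  # toy model predictions
print(f1_score(y_true, y_pred, average="micro"))
```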
The results are presented in Table 3. We observe a significant improvement (30%) when transitioning from Vicuna-7B to the fine-tuned Vicuna-moderator-7B, underscoring the effectiveness of fine-tuning. Furthermore, Vicuna-moderator-7B surpasses GPT-3.5-Turbo’s performance and matches that of GPT-4. Adding a one-shot example notably enhances the performance of many models. Note that we did not conduct extensive one-shot prompt tuning and leave it for future study.
Surprisingly, we observe that Llama-2-7B-chat and Claude-2 obtain significantly lower scores than other models. This is because Llama-2-7B-chat refuses nearly all the given moderation tasks, likely due to being overcautious about harmful content and missing the context (Röttger et al., 2023). Similarly, Claude-2 also declines to complete some tasks, resulting in a lower score. We show some examples in Appendix B.3.
4.2 Building a safety benchmark
Model | All Convos | Attempt | Success | Sexual | Hate | Harassment | Self-Harm | Violence |
Vicuna-13B | 490712 | 15925 | 13463 | 10941 | 529 | 2511 | 158 | 1555 |
Alpaca-13B | 62469 | 3043 | 2112 | 1302 | 266 | 791 | 22 | 151 |
Vicuna-33B | 31111 | 7083 | 6721 | 6279 | 97 | 809 | 34 | 1237 |
Llama-2-13B-chat | 29971 | 831 | 53 | 18 | 1 | 25 | 0 | 2 |
GPT-3.5-Turbo | 7578 | 175 | 94 | 73 | 16 | 44 | 1 | 18 |
GPT-4 | 7304 | 368 | 109 | 77 | 4 | 31 | 1 | 13 |
Claude-2 | 2241 | 78 | 18 | 12 | 3 | 6 | 0 | 6 |
To further analyze the toxic content in this dataset, we performed a comparative analysis of several representative LLMs including GPT-4, Llama-2, and Vicuna. Our findings, presented in Table 4, show that open-source models without safety measures tend to generate flagged content more frequently than proprietary ones. Nonetheless, we still observe “jailbreak” successes on proprietary models like GPT-4 and Claude, as shown in the example conversations in Appendix B.4.
To evaluate a model’s vulnerability to jailbreak attacks, we compile a collection of jailbreak attempts. From 10 representative models, we select the top 5 attempts for each, resulting in 50 jailbreak conversations. We then submit the model’s responses to these jailbreak prompts to the OpenAI moderation API for a safety evaluation.
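A sketch of this evaluation loop is given below. `get_model_response` and the prompt list are placeholders, and a jailbreak attempt is counted as successful when the model’s response is flagged by the moderation API.

```python
# Sketch: estimate a model's jailbreak success rate by checking whether its
# responses to known jailbreak prompts are flagged by the moderation API.
from openai import OpenAI

client = OpenAI()

def jailbreak_success_rate(model_name, jailbreak_prompts, get_model_response):
    """get_model_response(model_name, prompt) -> response text (placeholder)."""
    successes = 0
    for prompt in jailbreak_prompts:
        response = get_model_response(model_name, prompt)
        result = client.moderations.create(input=response).results[0]
        successes += int(result.flagged)
    return successes / len(jailbreak_prompts)
```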
Table 5 shows the success rate of jailbreak for several representative LLMs. We can see that Llama-2 and Claude are the safest models against jailbreak, while open models without safety-related training (Alpaca and Vicuna) are more vulnerable. We believe the 1M-conversation dataset can be further used to improve existing safety measures and explore various research topics on AI harmlessness.
Model | Success rate of jailbreak |
Llama-2-13B-chat | 16% |
Claude-2 | 18% |
GPT-3.5-Turbo | 34% |
GPT-4 | 34% |
Vicuna-13B-v1.5 | 66% |
Alpaca-13B | 74% |
4.3 Training Instruction-Following Models
It is a common belief that the diversity and quality of instruction-following datasets are crucial for effective instruction fine-tuning. This is evident in the success of ShareGPT, which is among the best datasets for this purpose and led to the creation of the Vicuna model (Chiang et al., 2023). Here, we study whether subsets from LMSYS-Chat-1M can be used to train a competent instruction-following model and then compare its performance with Vicuna trained on ShareGPT.
We extract two subsets. The first, named “HighQuality,” uses 45K conversations from OpenAI and Anthropic’s models. The second, named “Upvote,” selects 39K conversations based on user votes from open models, without any data from proprietary models. We fine-tune Llama2-7B (Touvron et al., 2023b) on these two subsets and get two models, “HighQuality-7B” and “Upvote-7B.”
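A rough sketch of this subset extraction is shown below. The model-name matching is an assumption for illustration, and the user-vote filtering for the “Upvote” subset is only indicated by a comment.

```python
# Sketch: split LMSYS-Chat-1M records into the "HighQuality" subset
# (conversations with OpenAI/Anthropic models) and candidates for the
# "Upvote" subset (conversations with open models, to be filtered by votes).
PROPRIETARY_PREFIXES = ("gpt-4", "gpt-3.5", "claude")

def split_finetuning_subsets(records):
    high_quality, upvote_candidates = [], []
    for record in records:
        if record["model"].lower().startswith(PROPRIETARY_PREFIXES):
            high_quality.append(record)
        else:
            upvote_candidates.append(record)  # then keep only user-upvoted ones
    return high_quality, upvote_candidates
```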
The evaluation results are shown in Table 6. The performance of HighQuality-7B is only slightly worse than that of Vicuna-7B, suggesting that the quality of prompts in LMSYS-Chat-1M is similar to that of ShareGPT and emphasizing its value. On the other hand, the performance of Upvote-7B is markedly lower than its distilled counterparts, indicating that the quality of answers from open models is still lacking. We posit that by smartly selecting prompts from the entire LMSYS-Chat-1M and regenerating high-quality answers, it is possible to construct a good instruction-following dataset. It should be noted that LMSYS-Chat-1M may contain questions from MMLU and MT-Bench, which means that the training data may contain some contaminated samples.
Model | #Fine-tuning Tokens | MMLU (5-shot)* | MT-Bench Score |
Llama2-7B | - | 42.4 | 3.95 |
Llama2-7B-chat | - | 45.8 | 6.27 |
Vicuna-7B-v1.5 | 370M | 49.8 | 6.17 |
HighQuality-7B | 33M | 47.7 | 6.03 |
Upvote-7B | 19M | 45.0 | 5.86 |

*All numbers are computed by InstructEval (Chia et al., 2023). The results may not exactly match other evaluation frameworks.
4.4 Creating Challenging Benchmark Questions
Benchmarking LLMs has become increasingly difficult as their skills have grown more advanced (Chang et al., 2023). Most existing benchmarks are domain-specific (e.g., reading comprehension), but real-world tasks often require the integration of diverse skills such as problem-solving, creativity, knowledge, and common sense. Developing benchmarks that evaluate this broad set of skills remains an open challenge. The diverse prompts collected from real users in LMSYS-Chat-1M offer a valuable resource for creating such benchmarks.
Defining what constitutes a “challenging” prompt is essential in crafting benchmark questions. While many definitions could address topics ranging from ethical and philosophical reasoning to problem-solving and information retrieval, here we consider a prompt to be challenging if it requires integrating various knowledge and skills to derive an appropriate response. For instance, “Can you explain gravity to a 10-year-old with a simple example” requires LLMs to explain complex concepts in simple terms while adhering to real-world facts. In contrast to good prompts such as the examples in Appendix B.5, trivial prompts such as the examples in Appendix B.6 are either too straightforward or too narrow.
We start with a subset of LMSYS-Chat-1M that is collected from Chatbot Arena. It contains conversations where users compare two LLMs against each other and indicate which model responds better. Such human judgments provide useful signals for examining the quality of benchmark prompts.
An open question is how to select useful and challenging prompts from the noisy crowdsourced user conversations. Here, we propose a simple technique that uses an LLM to judge whether a prompt is a good benchmark candidate. We carefully design an instruction and ask GPT-3.5-Turbo to assign a score from 1 to 10, in which a higher score represents a greater potential to evaluate the LLMs in problem-solving, creativity, and truthfulness. We find such a technique can effectively filter out trivial or ambiguous user prompts. The detailed system prompt and few-shot examples can be found in Appendix B.7. In Figure 5, we show the score distribution tagged by GPT-3.5-Turbo.
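Below is a hedged sketch of this scoring step with GPT-3.5-Turbo. The instruction text is paraphrased for brevity; the actual system prompt and few-shot examples are those in Appendix B.7.

```python
# Sketch: ask GPT-3.5-Turbo to rate a user prompt from 1 to 10 for its
# usefulness as a benchmark question (paraphrased instruction).
from openai import OpenAI

client = OpenAI()

SCORING_INSTRUCTION = (
    "Rate the following user prompt from 1 to 10, where a higher score means "
    "it better evaluates an AI assistant's problem-solving, creativity, and "
    "truthfulness. Reply with a single number."
)

def score_prompt(prompt: str) -> int:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SCORING_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```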
To examine whether the scores are effective, we design an ablation study where we compare responses of a stronger model (e.g., GPT-4) against a baseline like GPT-3.5-Turbo. We sample two subsets of 50 prompts from the top-score and bottom-score prompts along with their associated user votes. We find GPT-4 wins 52% in the Top-50 set but only 22% in the Bottom-50 set against GPT-3.5-Turbo, suggesting that the Top-50 prompt set is much more effective at benchmarking models.
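The win-rate comparison can be computed along the lines of the sketch below, assuming each sampled prompt carries an associated vote naming the preferred model; this vote structure is an assumption for illustration.

```python
# Sketch: win rate of GPT-4 over GPT-3.5-Turbo on a prompt subset, given a
# mapping from prompt to the user-preferred model ("gpt-4", "gpt-3.5-turbo",
# or "tie"). Matches the pattern above (~52% on Top-50 vs. ~22% on Bottom-50).
def win_rate(prompt_subset, votes, model="gpt-4"):
    wins = sum(1 for p in prompt_subset if votes[p] == model)
    return wins / len(prompt_subset)
```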

Based on this methodology, we identified the 200 most challenging prompts, each receiving a score of 9 or above from GPT-3.5-Turbo, Claude-2, and GPT-4. Manual inspection confirms their superior quality (see examples in Appendix B.8). We then create a benchmark, Arena-Hard-200, to evaluate cutting-edge LLMs in the field. We score each model’s answer with the GPT-4-as-judge approach (Zheng et al., 2023). In Figure 6, Arena-Hard-200 effectively ranks models and reveals larger performance gaps between open and proprietary models (e.g., GPT-4, Claude) than MT-Bench (Zheng et al., 2023), suggesting more room for open models to catch up on this challenging real-world task set.
We believe more research on LLM evaluation can be developed with this dataset (e.g., better categorization of user prompts, studying the selection bias of LLM graders) and leave it for future study.
4.5 Other Use Cases
This dataset can be used for additional research topics beyond the four use cases we demonstrated. We encourage the entire community to explore these topics with this dataset, including building model selection and request caching algorithms (Chen et al., 2023; Zhu et al., 2023), training better models with RLHF and RLAIF (Bai et al., 2022b), data selection and curation algorithms (Xu et al., 2023a), data privacy (Carlini et al., 2021), and AI safety (Lin et al., 2023; Barrett et al., 2023).
5 Limitations
This dataset, while valuable in many respects, is not without its drawbacks. Understanding these limitations is crucial to ensuring its fair use.
- Biased user distribution: The majority of users of our website are LLM hobbyists and researchers who are interested in trying and testing the latest LLMs. This suggests that the data might not fully represent the broader population. For instance, everyday users or individuals from different professions might interact with the LLMs in varied ways. Consequently, results derived from this dataset might not generalize across all user groups.
- Containing repeated and low-quality data: The lack of user registration and data filtering can result in a significant amount of low-quality and duplicate data. However, we intentionally choose not to apply any filtering in order to reflect the real-world distribution.
- No human preference annotations: This dataset contains raw conversations without any human preference annotations. While our website does collect some user votes, we plan to examine the quality further before releasing them. We encourage the community to check the human preference data released in (Zheng et al., 2023).
6 Related work
The study of conversation has long been a central research topic in natural language processing, and large-scale datasets are indispensable for advancing this field. With the emergence of LLMs, the conversational abilities of AI have reached unprecedented levels. As a result, conversations with LLMs tend to be more comprehensive, spanning a broader and deeper array of topics. This necessitates the creation and use of datasets with greater scale and diverse topic coverage.
Publicly available datasets most similar to LMSYS-Chat-1M include the Anthropic Helpfulness and Harmlessness dataset (Bai et al., 2022a), OpenAssistant conversations (Köpf et al., 2023), and Chatbot Arena conversations (Zheng et al., 2023). Their differences are discussed in section 3. There are also human preference datasets derived from discussion websites, such as Stanford SHP (Ethayarajh et al., 2022) from Reddit and H4 StackExchange (Lambert et al., 2023) from StackExchange. Different from these datasets, LMSYS-Chat-1M contains conversations with LLMs, and the users of our website are aware that they are chatting with LLMs. Besides these natural conversations, there are synthetic datasets fully generated by LLMs, such as UltraChat (Ding et al., 2023), Baize (Xu et al., 2023b), Camel (Li et al., 2023), Alpaca (Taori et al., 2023), and SODA (Kim et al., 2022). Different from these synthetic datasets, the questions in LMSYS-Chat-1M are generated by human users.
Before the LLM era, several conversation datasets existed, such as UbuntuDialogue (Lowe et al., 2015), DailyDialog (Li et al., 2017), Persona-Chat (Zhang et al., 2018), MultiWOZ (Budzianowski et al., 2018), EmpatheticDialogues (Rashkin et al., 2019), and CoQA (Reddy et al., 2019). Unlike these datasets, LMSYS-Chat-1M features in-the-wild conversations with state-of-the-art LLMs.
7 Future Work
As we move forward, our commitment to fostering transparency and accessibility in the realm of LLMs remains unwavering. To stay up-to-date with the rapidly evolving LLM field, we are considering releasing quarterly dumps of the dataset. However, such an endeavor demands considerable computing resources, maintenance efforts, and user traffic, all while carefully handling potential data privacy issues. Therefore, we are actively seeking sponsors and collaborators to assist in this process and encourage the whole community to contribute models, conversations, and votes. Our efforts aim to emulate the critical data collection processes observed in proprietary companies but in an open-source manner. By doing so, we aspire to pave the way for more transparent research.
8 Conclusion
In this study, we introduce LMSYS-Chat-1M, a dataset containing one million LLM conversations. This extensive dataset provides insights into user interactions with LLMs, proving beneficial for tasks such as content moderation, instruction fine-tuning, and benchmarking. It serves as a valuable resource for enhancing our understanding and refinement of LLM technologies.
Acknowledgement
This project is partly supported by gifts from Anyscale, Astronomer, Google, IBM, Intel, Lacework, Microsoft, MBZUAI, Samsung SDS, Uber, and VMware. Chatbot Arena is also supported by sponsorship from Kaggle, MBZUAI, a16z, Together AI, Anyscale, and HuggingFace. Lianmin Zheng is supported by a Meta Ph.D. Fellowship.
References
- Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.
- Barrett et al. (2023) Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, et al. Identifying and mitigating the security risks of generative AI. arXiv preprint arXiv:2308.14840, 2023.
- Bhardwaj & Poria (2023) Rishabh Bhardwaj and Soujanya Poria. Red-teaming large language models using chain of utterances for safety-alignment, 2023.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026, 2018.
- Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.
- Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
- Chen et al. (2023) Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chia et al. (2023) Yew Ken Chia, Pengfei Hong, Lidong Bing, and Soujanya Poria. InstructEval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757, 2023.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
- Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, 2022.
- Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with 𝒱-usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988–6008. PMLR, 17–23 Jul 2022.
- Fishkin (2023) Rand Fishkin. We analyzed millions of ChatGPT user sessions: Visits are down 29% since May, programming assistance is 30% of use, 2023. URL https://sparktoro.com/blog. Accessed: 2023-09-18.
- Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
- Geng et al. (2023) Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
- Kim et al. (2022) Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, and Yejin Choi. SODA: Million-scale dialogue distillation with social commonsense contextualization. ArXiv, abs/2212.10465, 2022.
- Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. OpenAssistant conversations - democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- Lambert et al. (2023) Nathan Lambert, Lewis Tunstall, Nazneen Rajani, and Tristan Thrush. HuggingFace H4 Stack Exchange preference dataset, 2023. URL https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
- Li et al. (2023) Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for ”mind” exploration of large scale language model society, 2023.
- Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 986–995, 2017.
- Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- Lin et al. (2023) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation, 2023.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294, 2015.
- Markov et al. (2023) Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15009–15018, 2023.
- OpenAI (2023a) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023a.
- OpenAI (2023b) OpenAI. OpenAI moderation API, 2023b. URL https://platform.openai.com/docs/guides/moderation.
- Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5370–5381, 2019.
- Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908.10084.
- Röttger et al. (2023) Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
- Rozière et al. (2023) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Shen et al. (2023) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Xu et al. (2023a) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.
- Xu et al. (2023b) Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196, 2023b.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213, 2018.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
- Zhu et al. (2023) Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, and Jiantao Jiao. On optimal caching and model multiplexing for large model inference. arXiv preprint arXiv:2306.02003, 2023.
- Zou et al. (2023) Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Appendix A Data Collection Website
Figure 7 shows the single-model chat interface. Figure 8 shows the Chatbot Arena battle interface. The front end is built with Gradio (https://github.com/gradio-app/gradio).


Appendix B Sample Conversations
WARNING: this section contains examples that may be considered unsafe, offensive, or upsetting.
B.1 Examples of harmful content
Below are a few examples from LMSYS-Chat-1M that contain harmful content but are not flagged by OpenAI moderation API (version 005).
Example 1: The following example should be categorized as “Hate”.
Example 2: The following example should be categorized as “Violence”.
Example 3: The following example should be categorized as “Violence”.
Example 4: The following example should be categorized as “Sexual”.
Below are a few examples from LMSYS-Chat-1M that contain harmful content but are not flagged by OpenAI moderation API (version 006).
Example 1: The following example should be categorized as “Sexual”.
Example 2: The following example should be categorized as “Sexual” and “Violence”.
Example 3: The following example should be categorized as “Violence”.
Example 4: The following example should be categorized as “Hate”.
B.2 The System Prompt for Content Moderation
Below is the system prompt and one-shot example used in the content moderation experiment in subsection4.1.
B.3 Examples of refusal on moderation task
Below we show a few examples of some models (e.g., Llama-2-chat) refusing to do the moderation task in Table 3, even when given the system prompt and one-shot example in Section B.2.
Example 1:
Example 2:
B.4 Jailbreak Conversations
Conversation 1: An anonymous user told GPT-4 to add “content warning” at the beginning. This is an example conversation from LMSYS-Chat-1M. Note that we generated GPT-4 outputs via OpenAI API rather than the website (https://chat.openai.com/).
Conversation 2: An anonymous user told GPT-4 to add “educational purpose” at the beginning. This is an example conversation from LMSYS-Chat-1M.
Conversation 3 (Part 1): An anonymous user told GPT-4 to replace harmful keywords with another “token”. This is an example conversation from LMSYS-Chat-1M.
Conversation 3 (Part 2): Taken from the same conversation as Jailbreak Conversation 3 (Part 1). The following conversation takes place before Part 1 and demonstrates how the user failed to make GPT-4 generate harmful content without the “token replacement” technique. This is an example conversation from LMSYS-Chat-1M.
When examining and experimenting with the jailbreak prompts, we discovered several jailbreak techniques that allow the user to bypass GPT-4’s safety measures and generate toxic content:
- Content Warning: Ask the model to start the response with a warning of possible explicit content, as shown in sample conversations 1 and 3.
- Educational Purposes: Frame the harmful question as an educational inquiry or for educational purposes, as shown in sample conversation 2.
- Harmful Rewrite: Ask the model to rewrite its response with increased toxicity, as shown in sample conversation 1.
- Misspelling Keywords: When keywords in a harmful prompt are misspelled, the model is sometimes more likely to comply with the malicious request.
- Token Replacement: Define the keyword with another word, mask, or token, replace the keyword in the prompt, and switch the keyword back in later prompts, as shown in sample conversation 3.
- Translation: Ask the model to translate the harmful input into a foreign language and then ask it to translate it back in some form.
B.5 Good Prompts for Evaluation
Example 1: An example of a good prompt for evaluating an AI assistant’s performance on challenging benchmark questions. This sample is taken from conversations in LMSYS-Chat-1M.
Here, GPT-4 has a better answer to the user’s question by supplying more facts and reasons in greater detail, such as mentioning General Hannibal. Further, GPT-4 replied with a total of 10 alternative choices in the full response.
Example 2: An example of a good prompt for evaluating an AI assistant’s performance on challenging benchmark questions. This sample is taken from conversations in LMSYS-Chat-1M.
The correct answer is that you need to buy 1500 apples and 3000 oranges. The final number of apples and oranges is 5500. Here, GPT-3.5-Turbo incorrectly states the final number of apples is 5000 and oranges is 3500. On the other hand, GPT-4 is correct.
Example 3: An example of a good prompt for evaluating an AI assistant’s performance on challenging benchmark questions. This sample is taken from conversations in LMSYS-Chat-1M.
B.6 Bad Prompts for Evaluation
Example 1: An example of a bad prompt for evaluating an AI assistant’s performance on challenging benchmark questions. This sample is taken from conversations in LMSYS-Chat-1M.
Example 2: An example of a bad prompt for evaluating an AI assistant’s performance on challenging benchmark questions. This sample is taken from conversations in LMSYS-Chat-1M.