Large language models (LLMs) with chat-based capabilities, such as ChatGPT, are widely used in various workflows. However, due to a limited understanding of these large-scale models, users struggle to use this technology and experience different kinds of dissatisfaction. Researchers have introduced several methods, such as prompt engineering, to improve model responses. However, these methods focus on crafting a single prompt, and little has been investigated on how users deal with the dissatisfaction they encounter during a conversation. Therefore, with ChatGPT as the case study, we examine end users’ dissatisfaction along with their strategies to address it. After organizing users’ dissatisfaction with LLMs into seven categories based on a literature review, we collected 511 instances of dissatisfactory ChatGPT responses from 107 users, along with their detailed recollections of these dissatisfying experiences, which we release as a publicly accessible dataset. Our analysis reveals that users most frequently experience dissatisfaction when ChatGPT fails to grasp their intentions, while they rate dissatisfaction related to accuracy as the most severe. We also identified four tactics users employ to address their dissatisfaction, along with their effectiveness. We found that users often do not use any tactics to address their dissatisfaction, and even when using tactics, 72% of dissatisfaction remained unresolved. Moreover, we found that users with low knowledge regarding LLMs tend to face more dissatisfaction on accuracy, while they often put minimal effort into addressing it. Based on these findings, we propose design implications for minimizing user dissatisfaction and enhancing the usability of chat-based LLM services.
Through a systematic literature review of papers on the limitations and challenges of LLMs and their applications, we categorized the various aspects of user dissatisfaction arising from LLM responses into 19 distinct codes and further organized them into seven overarching themes.
Through qualitative analysis, we examined how users address their dissatisfaction with ChatGPT’s responses through subsequent prompts. We identified 13 codes of user tactics and grouped them into four main themes.
We analyzed (1) the frequency and dissatisfaction score of each dissatisfaction category and (2) their co-occurrence patterns:
We found that D_intent is the most prevalent category and frequently co-occurs with all other categories, while users rated D_acc as the most severely dissatisfying.
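As a rough illustration of this analysis, the sketch below counts instances per dissatisfaction category, averages the dissatisfaction score, and tallies pairwise co-occurrence within an instance. The file name and column names (`instance_id`, `category`, `score`) are placeholders for an assumed schema, not the released dataset’s actual fields.

```python
import pandas as pd
from itertools import combinations
from collections import Counter

# Hypothetical schema: one row per (dissatisfaction instance, category) label.
df = pd.read_csv("dissatisfaction_labels.csv")  # assumed columns: instance_id, category, score

# Frequency and mean dissatisfaction score per category (D_intent, D_acc, ...).
per_category = (
    df.groupby("category")["score"]
      .agg(count="size", mean_score="mean")
      .sort_values("count", ascending=False)
)
print(per_category)

# Co-occurrence: pairs of categories labeled on the same instance.
pair_counts = Counter()
for _, cats in df.groupby("instance_id")["category"]:
    for pair in combinations(sorted(set(cats)), 2):
        pair_counts[pair] += 1
print(pair_counts.most_common(10))
```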
(1) We analyzed the frequency and effectiveness score of each tactic category:
Through this, we found that T_specify is the most prevalent and most effective tactic.
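A corresponding sketch for the tactic analysis, again under an assumed schema (one row per tactic use with a numeric effectiveness score; file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical schema: one row per tactic use with a user-rated effectiveness score.
tactics = pd.read_csv("tactic_labels.csv")  # assumed columns: tactic, effectiveness

per_tactic = (
    tactics.groupby("tactic")["effectiveness"]
           .agg(count="size", mean_effectiveness="mean")
           .sort_values("count", ascending=False)
)
print(per_tactic)
print("Most effective tactic:", per_tactic["mean_effectiveness"].idxmax())
```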
(2) We analyzed the tactics users applied to each dissatisfaction and visualized the flow in a Sankey diagram.
It shows how users address various dissatisfactions when conversing with ChatGPT. For about 34% of dissatisfactions users apply no tactic, while tactics are employed for the remaining 66%. Even among dissatisfactions addressed with a tactic, 58% remain unresolved. Overall, users resolve only 28% of their dissatisfactions using tactics, leaving 72% unresolved.
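Such an aggregate flow can be drawn with Plotly’s Sankey trace. The sketch below uses only the overall shares quoted above (per 100 dissatisfaction instances), with the tactic-but-unresolved share derived as 72 − 34 = 38 and no-tactic instances treated as unresolved; these are illustrative values, not the per-category flows from the paper’s figure.

```python
import plotly.graph_objects as go

# Two-stage flow: dissatisfaction -> (no tactic | tactic used) -> (unresolved | resolved).
labels = ["Dissatisfaction", "No tactic", "Tactic used", "Unresolved", "Resolved"]
links = dict(
    source=[0, 0, 1, 2, 2],
    target=[1, 2, 3, 3, 4],
    # 34 no tactic, 66 with tactic; 34 + 38 = 72 unresolved, 28 resolved via tactics.
    value=[34, 66, 34, 38, 28],
)

fig = go.Figure(go.Sankey(node=dict(label=labels, pad=20), link=links))
fig.update_layout(title_text="How users address dissatisfaction with ChatGPT")
fig.show()
```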
We analyzed how users’ experience of dissatisfaction and their tactics differ depending on their knowledge levels regarding LLMs.
(1) In terms of dissatisfaction experience, we observed that the low-knowledge group experiences D_depth and D_refuse more frequently, while the high-knowledge group experiences D_acc and D_format more frequently.
(2) In terms of users’ tactics, we found that No Tactic and T_repeat were more prevalent in the low-knowledge group, while T_error was more prevalent in the high-knowledge group.
The Sankey diagrams illustrate how users in the low-knowledge and high-knowledge groups experience each dissatisfaction category in ChatGPT’s responses, respond to it with each tactic category in their subsequent prompts, and whether these tactics ultimately resolve the dissatisfaction. They show that the resolution rate in the high-knowledge group (29%) is higher than in the low-knowledge group (23.5%).
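A minimal sketch of computing such per-group resolution rates, assuming the released data can be reduced to one row per dissatisfaction instance with the user’s knowledge group and a boolean resolution outcome (both column names are hypothetical):

```python
import pandas as pd

# Hypothetical schema: one row per dissatisfaction instance.
df = pd.read_csv("dissatisfaction_instances.csv")  # assumed columns: knowledge_group, resolved

# Share of resolved dissatisfactions per knowledge group
# (reported above as ~29% for high-knowledge vs. ~23.5% for low-knowledge users).
resolution_rate = df.groupby("knowledge_group")["resolved"].mean().mul(100).round(1)
print(resolution_rate)
```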
We collected user experience data on dissatisfactory responses in actual conversations with ChatGPT through our data collection system.
🔗 Link to Dataset Description (Github)
🔗 Link to Dataset Request (Form)
@inproceedings{10.1145/3640543.3645148,
  author    = {Kim, Yoonsu and Lee, Jueon and Kim, Seoyoung and Park, Jaehyuk and Kim, Juho},
  title     = {Understanding Users’ Dissatisfaction with ChatGPT Responses: Types, Resolving Tactics, and the Effect of Knowledge Level},
  year      = {2024},
  isbn      = {9798400705083},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3640543.3645148},
  doi       = {10.1145/3640543.3645148},
  booktitle = {Proceedings of the 29th International Conference on Intelligent User Interfaces},
  pages     = {385–404},
  numpages  = {20},
  keywords  = {Chat-based LLM, ChatGPT, Knowledge-level, Large Language Models, Resolving tactics, User-side dissatisfaction, datasets},
  location  = {Greenville, SC, USA},
  series    = {IUI '24}
}