The Role of Large Language Models and AI Chatbots in the Diagnosis and Treatment of Mental Disorders
Introduction
Purpose of the article
The purpose of this review is to critically examine the role of large language models (LLMs) and AI-based conversational agents in the diagnosis and treatment of mental disorders.
Over the past several years, rapid advances in natural language processing have led to the deployment of increasingly sophisticated chatbots in health-related contexts, including mental health screening, psychoeducation, symptom monitoring, and elements of psychotherapeutic support. These developments have generated significant interest among clinicians, researchers, health systems, and policymakers, alongside equally significant concern regarding safety, ethics, and clinical validity.
This article does not treat LLM-based systems as autonomous diagnostic or therapeutic agents. Instead, it focuses on their actual and proposed functions within mental health care, including decision support, patient engagement, and augmentation of clinical workflows. Particular attention is paid to how these systems are evaluated in the scientific literature, what claims are supported by empirical evidence, and where important gaps remain. By synthesizing current research, this review aims to provide a balanced assessment of both the opportunities and limitations of LLM-driven tools in psychiatry and clinical psychology.
Brief overview of key issues
Mental health care presents a uniquely challenging domain for artificial intelligence. Unlike many areas of medicine that rely heavily on imaging, laboratory values, or clearly defined physiological markers, psychiatric diagnosis and treatment are largely grounded in language, subjective experience, and clinical judgment. This linguistic and interpretive nature makes the field particularly amenable to conversational AI, while simultaneously increasing the risk of misinterpretation, bias, and harm. Proponents of LLM-based mental health tools argue that these systems could help address long-standing structural problems in mental health care, including shortages of trained professionals, long wait times, uneven geographic access, and high administrative burden. Chatbots are already being used for symptom screening, delivery of cognitive-behavioral therapy–inspired interventions, and ongoing patient engagement between clinical visits. In some settings, they are positioned as scalable, low-cost complements to traditional care, especially for individuals who might otherwise receive no support at all.
At the same time, critics emphasize that mental health interventions carry intrinsic risk, particularly when deployed without adequate oversight. Errors in psychiatric assessment can lead to delayed treatment, inappropriate reassurance, or exacerbation of symptoms. LLMs are known to generate confident but incorrect outputs, struggle with context-sensitive reasoning, and reproduce biases present in their training data. In mental health contexts, such limitations may have more serious consequences than in purely informational applications.
Another central issue concerns the boundary between support and care. While many AI chatbots are marketed as wellness or self-help tools, their conversational nature and personalized responses can blur distinctions between general support, clinical guidance, and therapeutic intervention. This raises questions about accountability, informed consent, and regulatory oversight. The lack of consistent standards for evaluating efficacy and safety further complicates comparisons across systems and studies. Ethical and social considerations also play a prominent role. The use of AI chatbots in mental health intersects with concerns about data privacy, surveillance, and secondary use of sensitive information. Additionally, unequal access to technology, language limitations, and cultural bias may reinforce existing disparities in mental health care rather than alleviate them. For clinicians, there is ongoing debate about whether reliance on AI tools could erode professional skills or alter the therapeutic relationship in unintended ways.
This review addresses these issues by situating current applications of LLMs within their broader historical and research context, evaluating empirical evidence where available, and examining contested areas where consensus has not yet emerged. The sections that follow trace the evolution of AI in mental health, summarize current research trends, analyze practical applications and risks, and outline key controversies and future directions.
Historical Context
Historical background
The use of computational systems in mental health care predates large language models by several decades. Early efforts in the 1960s and 1970s focused on rule-based expert systems, which attempted to encode psychiatric knowledge into predefined decision trees. The most frequently cited early example of conversational software in this domain is ELIZA, a simple natural language program developed in the 1960s that simulated a Rogerian psychotherapist by reflecting user statements back as questions. Although ELIZA had no understanding of mental states, its reception revealed both the appeal of conversational interaction and the human tendency to attribute empathy and insight to language-based systems.
In subsequent decades, computer-assisted tools in psychiatry primarily took the form of screening instruments and decision aids. These included computerized versions of standardized questionnaires for depression, anxiety, and other disorders, as well as early clinical decision support systems designed to assist with diagnosis or medication selection. While such tools improved efficiency and standardization, they remained limited by rigid logic, poor adaptability, and dependence on explicit user input rather than free-form dialogue.
The late 1990s and early 2000s saw the emergence of internet-based mental health interventions, particularly web-delivered cognitive behavioral therapy (CBT). These programs, often structured as modules with minimal clinician involvement, demonstrated that digital tools could produce clinically meaningful improvements for some patients. However, they lacked conversational flexibility and were often associated with high dropout rates, highlighting the importance of engagement and personalization, factors that later fueled interest in conversational agents.
Research developments
Advances in machine learning and natural language processing during the 2010s marked a turning point in the development of AI systems for mental health. Statistical language models and early neural networks enabled more flexible text analysis, supporting tasks such as sentiment detection, topic modeling, and automated classification of mental health–related content on social media. Research during this period explored whether linguistic markers could predict depression, suicidality, or relapse risk, with mixed but intriguing results.
The introduction of transformer-based architectures and large-scale pretraining fundamentally altered the landscape. Unlike earlier systems, large language models could generate coherent, context-sensitive responses and sustain multi-turn dialogue. This capability enabled the development of chatbots that resembled conversational partners rather than static questionnaires. In mental health research, LLMs began to be tested for tasks including diagnostic support, psychoeducation, simulated therapeutic dialogue, and clinician-facing documentation assistance. Parallel to these technical advances, the field of digital mental health matured, with growing attention to usability, engagement, and real-world effectiveness. Researchers increasingly recognized that technical performance alone was insufficient; successful tools needed to fit into clinical workflows, respect ethical constraints, and demonstrate benefit in diverse populations. This shift led to more interdisciplinary studies combining psychiatry, psychology, computer science, and ethics.
Despite these advances, the historical trajectory also reveals recurring challenges. Many earlier digital mental health tools were adopted enthusiastically but later abandoned due to limited effectiveness, poor adherence, or unintended consequences. This history underscores the importance of cautious interpretation of early results and provides essential context for evaluating current enthusiasm around LLM-based chatbots.
Current Trends and Research
Review of relevant research and evidence
Research on large language models and AI chatbots in mental health has expanded rapidly, but the evidence base remains heterogeneous and uneven in quality. Most published studies to date fall into three broad categories: (1) technical performance evaluations, (2) small-scale clinical or quasi-clinical pilots, and (3) observational or simulation-based studies comparing AI outputs to clinician judgments or standardized instruments.
A substantial portion of the literature examines the ability of LLMs to identify or classify mental health–related symptoms from text. Studies using clinical vignettes, patient narratives, or social media posts have reported that LLMs can approximate human performance on tasks such as detecting depressive symptoms, anxiety-related language, or suicide risk signals. However, these studies often rely on retrospective datasets, lack external validation, and use simplified benchmarks that do not reflect the complexity of real clinical encounters. Importantly, performance varies significantly depending on prompt design, language, cultural context, and the presence of comorbidities.
Another growing body of research evaluates conversational agents designed to deliver psychological support or structured interventions, most commonly inspired by cognitive behavioral therapy. Several randomized and non-randomized trials of earlier, non-LLM chatbots showed modest improvements in self-reported symptoms of depression or anxiety over short follow-up periods. More recent studies involving LLM-based systems suggest higher conversational flexibility and user engagement, but robust comparative trials remain scarce. Where improvements are observed, effect sizes are generally small to moderate and comparable to other low-intensity digital interventions.
Evidence supporting the use of LLMs in diagnostic decision-making is particularly limited. While some studies report that AI-generated diagnostic suggestions align with clinician assessments in controlled scenarios, these findings do not establish safety or reliability in practice. Most authors emphasize that LLMs lack access to non-verbal cues, longitudinal history, and contextual knowledge essential to psychiatric diagnosis. As a result, current research largely positions LLMs as potential decision support tools, not diagnostic authorities.
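To make the evaluation paradigm behind these vignette-based and controlled-scenario studies concrete, the minimal Python sketch below compares model-generated labels against reference annotations. It is illustrative only: the prompt wording, the label set, and the `query_model` placeholder are assumptions introduced here, and published studies vary widely in prompt design, which is one reason reported performance is so sensitive to it.

```python
# Minimal sketch of a vignette-based symptom-classification evaluation.
# `query_model` is a placeholder: a real study would call whichever LLM is
# under evaluation; here it returns a canned answer so the script runs.

from typing import List, Tuple

LABELS = ["depressive symptoms", "anxiety-related language", "no clear signal"]

PROMPT_TEMPLATE = (
    "You are assisting with a research classification task.\n"
    "Read the vignette and answer with exactly one label from this list: {labels}.\n\n"
    "Vignette: {vignette}\n"
    "Label:"
)

def query_model(prompt: str) -> str:
    """Placeholder for the LLM call; replace with the model under evaluation."""
    return "depressive symptoms"  # canned output for illustration only

def classify(vignette: str) -> str:
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(LABELS), vignette=vignette)
    answer = query_model(prompt).strip().lower()
    # Fall back to an explicit "unparseable" category rather than guessing.
    return answer if answer in LABELS else "unparseable"

def evaluate(dataset: List[Tuple[str, str]]) -> float:
    """Compare model labels with reference annotations; return simple accuracy."""
    correct = sum(classify(text) == gold for text, gold in dataset)
    return correct / len(dataset)

if __name__ == "__main__":
    toy_set = [("I have felt hopeless and exhausted for weeks.", "depressive symptoms")]
    print(f"Accuracy on toy set: {evaluate(toy_set):.2f}")
```

Even this toy setup makes the cited limitations visible: accuracy on a retrospective, simplified benchmark says little about behavior in a live clinical conversation.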
Role and impact on practice
The role of large language models and AI chatbots in mental health practice is increasingly framed around augmentation of clinical work rather than autonomous care delivery. In contrast to early speculative narratives that envisioned AI systems diagnosing or treating psychiatric disorders independently, most contemporary research and real-world pilots emphasize supportive functions that operate alongside clinicians. This framing reflects both technical limitations and professional norms within psychiatry and psychology, where responsibility for diagnosis and treatment remains firmly human.
One of the most tangible areas of impact is pre-visit and intake support. LLM-based systems can collect structured symptom histories, administer standardized screening instruments in conversational form, and summarize patient concerns prior to clinical encounters. In mental health settings, where appointment time is often constrained and patients may struggle to articulate symptoms under pressure, such preparatory tools can improve efficiency and focus. Clinicians report that having access to a synthesized narrative of patient-reported symptoms can facilitate more targeted interviews and reduce time spent on routine data gathering.
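The division of labour described above can be sketched as follows: the conversational layer (stubbed here) only poses and normalizes questions, while scoring and flagging remain in plain, auditable code, and the output is framed as a preliminary summary for the clinician. The two-item screener, answer scale, and threshold are illustrative stand-ins, not a validated instrument or any particular product's workflow.

```python
# Sketch of a pre-visit intake flow: the chatbot layer handles the conversational
# surface, while item scoring and risk flagging are deterministic, and the result
# is explicitly labeled as clinician-facing input rather than a diagnosis.

from dataclasses import dataclass
from typing import List

# Illustrative two-item screener; not a validated instrument.
ITEMS = [
    "Over the past two weeks, how often have you felt down or hopeless?",
    "How often have you had little interest or pleasure in doing things?",
]
ANSWER_SCALE = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

@dataclass
class IntakeSummary:
    total_score: int
    flagged_for_review: bool
    note: str

def ask_conversationally(item: str) -> str:
    """Placeholder for the chatbot turn that would pose `item` and normalize the reply."""
    return "several days"  # canned answer so the sketch runs end to end

def run_intake(items: List[str], flag_threshold: int = 3) -> IntakeSummary:
    total = sum(ANSWER_SCALE[ask_conversationally(item)] for item in items)
    return IntakeSummary(
        total_score=total,
        flagged_for_review=total >= flag_threshold,
        note="Preliminary patient-reported summary for clinician review; not a diagnosis.",
    )

if __name__ == "__main__":
    print(run_intake(ITEMS))
```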
A second, closely related application involves clinical documentation and information management. Mental health clinicians face substantial administrative burden due to detailed narrative documentation requirements. AI-assisted drafting of progress notes, summaries of therapy sessions, and integration of patient-reported outcomes into the medical record has shown potential to reduce after-hours work. While evidence suggests that review and correction remain necessary, even partial automation may improve work–life balance and mitigate burnout, a persistent concern in mental health professions.
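One safeguard that recurs in this literature, mandatory clinician sign-off before a draft enters the record, can be expressed very simply, as the sketch below shows. The `generate_draft` placeholder, the status labels, and the sign-off field are illustrative assumptions rather than any vendor's actual workflow.

```python
# Sketch of AI-assisted documentation with mandatory clinician sign-off.
# `generate_draft` stands in for the LLM call; the key design point is that a
# note cannot be finalized until a named clinician reviews and edits it.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProgressNote:
    draft_text: str
    status: str = "DRAFT_PENDING_REVIEW"
    reviewed_by: Optional[str] = None
    final_text: Optional[str] = None

def generate_draft(session_summary: str) -> ProgressNote:
    """Placeholder: a real system would prompt an LLM with the session material."""
    draft = f"Patient-reported themes: {session_summary}. Plan: to be confirmed by clinician."
    return ProgressNote(draft_text=draft)

def sign_off(note: ProgressNote, clinician: str, edited_text: str) -> ProgressNote:
    """Finalize only after a clinician supplies reviewed (and possibly corrected) text."""
    note.reviewed_by = clinician
    note.final_text = edited_text
    note.status = "FINALIZED"
    return note

if __name__ == "__main__":
    note = generate_draft("low mood, sleep disturbance, work stress")
    print(note.status)  # DRAFT_PENDING_REVIEW
    note = sign_off(note, "Dr. Example", note.draft_text + " Reviewed and amended.")
    print(note.status)  # FINALIZED
```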
In diagnostic contexts, the role of LLMs remains indirect and constrained. Rather than generating diagnoses, AI systems are more appropriately positioned to highlight patterns, flag inconsistencies, or suggest areas for further exploration. For example, a chatbot may identify symptom clusters suggestive of anxiety or mood disturbance and prompt clinicians to consider specific follow-up questions. Such support can be particularly valuable in primary care or integrated care settings, where non-specialists often conduct initial mental health assessments. However, the literature consistently cautions against using AI outputs as determinative, emphasizing that diagnostic judgment requires contextual, longitudinal, and interpersonal information beyond the reach of current models.
Treatment-related applications represent another expanding domain. Patient-facing chatbots are increasingly used to deliver psychoeducational content and structured therapeutic exercises, often modeled on cognitive behavioral therapy principles. In practice, these tools may reinforce skills learned in therapy, encourage self-monitoring, or provide coping strategies between sessions. For some individuals, especially those with mild to moderate symptoms, such interventions can increase engagement and continuity of care. Nonetheless, their effectiveness appears highly variable, and evidence does not support their use as standalone treatments for severe or complex disorders. The impact on practice also depends on how clinicians perceive and integrate these tools. Studies indicate that acceptance is higher when AI systems are transparent, customizable, and clearly subordinate to professional judgment. Clinicians express greater comfort when they can control when and how AI-generated content is used, rather than having it embedded as an obligatory component of care. Conversely, resistance increases when systems are perceived as intrusive, poorly aligned with clinical workflows, or positioned as replacements for human expertise.
From a systems perspective, LLM-based tools may influence access and continuity of care. By supporting triage, follow-up, and between-visit engagement, AI chatbots could help extend limited clinical resources to larger populations. However, this potential is contingent on careful implementation. Without integration into coordinated care pathways, chatbots risk becoming fragmented adjuncts that add complexity rather than coherence.
Overall, the emerging impact of LLMs on mental health practice is incremental rather than transformative. Their greatest value lies in supporting communication, organization, and low-risk interventions, while core diagnostic and therapeutic functions remain human-led. The extent to which these tools ultimately improve care will depend less on model sophistication and more on thoughtful integration, governance, and alignment with clinical values.
Key findings and conclusions of current research
Taken together, current research suggests that LLM-based chatbots have promising but constrained utility in mental health care. They demonstrate competence in language-based tasks such as summarization, pattern recognition, and structured dialogue, which are relevant to screening and supportive interventions. At the same time, there is no high-quality evidence supporting their use as independent diagnostic or treatment agents. Key limitations recur across studies: short follow-up periods, reliance on self-reported outcomes, lack of diverse samples, and minimal reporting on adverse effects. Comparisons to standard care are rare, and few studies address how AI tools perform when integrated into real clinical workflows. As a result, conclusions about effectiveness must remain provisional.
The prevailing consensus in the literature is that LLMs may play a supportive role within carefully designed, human-supervised systems. Their impact is likely to depend less on raw model capability and more on governance, clinical integration, and ethical safeguards; these issues shape the discussion in the following sections.
Practical Significance and Potential Applications
Impact on clinical practice
The practical significance of large language models and AI chatbots in mental health care lies primarily in their potential to augment existing clinical processes rather than replace them. In current and near-term applications, these systems are most plausibly used to support clinicians in tasks that are time-intensive, repetitive, or primarily language-based. Examples include collecting structured symptom histories prior to appointments, summarizing patient-reported outcomes, and generating draft clinical documentation. In mental health settings, where clinical encounters are often dominated by narrative information, such support may meaningfully reduce administrative burden. AI chatbots are also being explored as screening and triage tools, particularly in primary care or community settings where access to mental health professionals is limited. By administering standardized questions conversationally and flagging high-risk responses, such systems may help prioritize referrals and identify individuals who require urgent evaluation. Most proposed models emphasize that final decisions remain with clinicians, with AI outputs serving as preliminary inputs rather than determinations.
In treatment contexts, LLM-based systems are most often positioned as adjunctive supports. Chatbots delivering psychoeducation, coping strategies, or CBT-inspired exercises may extend care beyond the clinic and reinforce therapeutic goals between sessions. For some patients, especially those with mild symptoms or barriers to accessing traditional care, these tools may provide a low-threshold entry point to mental health support. Still, their role in managing severe or complex disorders remains limited and poorly supported by evidence.
Recommendations and prospects
Based on the current literature, several recommendations emerge for the responsible application of LLMs in mental health care. First, use cases should be matched to risk level. Low-risk applications, such as education, self-monitoring, and administrative support, are more appropriate for early deployment than high-stakes diagnostic or treatment decisions. Second, systems should be designed with human oversight, ensuring that clinicians can review, correct, and contextualize AI-generated content. Prospects for broader adoption depend on improved evaluation frameworks. Future research should prioritize comparative studies that assess AI-supported care against standard practice, with attention to long-term outcomes and potential harms. Integration into clinical workflows, rather than standalone deployment, is likely to determine real-world impact. Collaboration between developers, clinicians, and patients will be essential to ensure that tools address genuine clinical needs.
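The first two recommendations, matching use cases to risk level and keeping a human in the loop, can be made concrete with something as simple as an allowlist-based routing policy, sketched below. The tier names and routing labels are hypothetical illustrations of the principle, not a regulatory standard.

```python
# Illustrative risk-tier gate: only explicitly allowlisted low-risk functions are
# handled automatically; everything else is routed to a human reviewer.

LOW_RISK_USES = {"psychoeducation", "appointment_reminder", "self_monitoring_prompt", "admin_summary"}
HIGH_RISK_USES = {"diagnosis", "medication_advice", "crisis_counselling"}

def route_request(use_case: str) -> str:
    """Return how a request should be handled under a risk-matched policy."""
    if use_case in LOW_RISK_USES:
        return "automated_with_audit_log"
    if use_case in HIGH_RISK_USES:
        return "refer_to_clinician"
    # Unknown use cases default to human oversight rather than automation.
    return "escalate_for_human_review"

if __name__ == "__main__":
    for case in ("psychoeducation", "diagnosis", "novel_feature"):
        print(case, "->", route_request(case))
```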
Risks and limitations
Despite their potential, LLM-based mental health tools carry substantial risks. Diagnostic inaccuracies, inappropriate responses, and hallucinated content can mislead users or delay appropriate care. Cultural and linguistic biases may disproportionately affect vulnerable populations, while overreliance on chatbots could discourage individuals from seeking professional help.
There are also systemic risks. Poorly governed deployment may erode trust in mental health services or create unclear lines of accountability. Without rigorous standards for evaluation, claims of effectiveness may outpace evidence. These limitations underscore the need for cautious, evidence-driven adoption and provide a foundation for the ethical and social debates examined in the next section.
Problematic Issues and Controversies
Criticisms and counterarguments
One of the most prominent criticisms of using large language models and AI chatbots in mental health care concerns clinical reliability and accountability. Critics argue that LLMs, by design, generate probabilistic text rather than reasoned clinical judgments. This raises concerns that outputs may appear coherent and empathetic while being clinically inappropriate or incorrect, particularly in ambiguous or high-risk situations. Unlike clinicians, AI systems cannot be held ethically or legally responsible for their recommendations, leaving uncertainty about liability when harm occurs. Another major counterargument relates to explainability and reproducibility. Psychiatric assessment often requires transparent reasoning, yet LLMs operate as opaque systems whose internal decision processes are not readily interpretable. This lack of explainability complicates clinical validation and undermines trust, especially when AI outputs conflict with clinician judgment. Moreover, model performance may change over time due to updates or shifts in deployment context, raising questions about reproducibility and consistency.
There is also concern that reliance on AI tools could contribute to the deskilling of mental health professionals. If clinicians increasingly depend on automated summaries, screening outputs, or suggested interventions, they may engage less deeply with patient narratives or rely less on their own clinical reasoning. While empirical evidence for deskilling remains limited, the concern reflects broader anxieties about automation in professional practice.
Finally, critics note that many studies in this area are conducted or sponsored by technology developers, raising the possibility of publication bias and overstated benefits. The absence of large, independent randomized trials makes it difficult to distinguish genuine clinical value from early enthusiasm.
Ethical and social considerations
Ethical issues surrounding the use of large language models and AI chatbots in mental health care extend far beyond questions of technical accuracy. At their core, these technologies challenge established norms about responsibility, trust, and the nature of care in psychiatry and clinical psychology. Because mental health interventions often involve vulnerable individuals, subjective suffering, and high stakes, ethical shortcomings that might be tolerable in other domains can have disproportionately harmful consequences here.
A central concern is patient autonomy and informed consent. Many users interact with mental health chatbots outside traditional clinical settings, often without clear understanding of the system’s capabilities, limitations, or non-human nature. Even when disclosures are present, conversational fluency and empathic language can foster an illusion of understanding or authority. This creates a risk that users may attribute clinical competence or moral responsibility to systems that are fundamentally unable to assume either. Ethically sound deployment therefore requires not only formal disclosure, but ongoing, context-sensitive reinforcement of what the system can and cannot do. Closely related is the issue of therapeutic deception and relational ethics. The therapeutic alliance, long considered a cornerstone of effective mental health treatment, is built on trust, empathy, and mutual recognition. While AI chatbots can simulate empathic responses, they do not experience empathy, nor can they engage in moral accountability. Critics argue that encouraging emotionally vulnerable individuals to confide in systems incapable of genuine understanding risks instrumentalizing human distress. Proponents counter that perceived empathy may still have pragmatic benefit. The ethical tension lies in whether such benefit justifies blurring the distinction between simulated and human care, particularly when alternatives are limited.
Data privacy and governance represent another major ethical domain. Mental health data are among the most sensitive categories of personal information, encompassing intimate details about emotions, relationships, trauma, and behavior. LLM-based systems often rely on large-scale data collection, storage, and processing, sometimes involving third-party vendors or cloud infrastructure. Risks include unauthorized access, re-identification, secondary use of data for model training, and unclear data retention practices. Even when systems are nominally compliant with privacy regulations, the opacity of AI supply chains can make meaningful oversight difficult.
Ethical governance also extends to questions of data ownership and control. Patients may reasonably assume that their disclosures are used solely to support their care, yet commercial incentives may encourage broader use of interaction data for product improvement or analytics. Without robust safeguards and transparent policies, such practices risk violating patient expectations and undermining trust in mental health services more broadly.
Social considerations further complicate the picture. While AI chatbots are often promoted as tools to democratize access to mental health support, their benefits may be unevenly distributed. Individuals with limited digital literacy, unstable internet access, disabilities, or language differences may be excluded or poorly served. Moreover, LLMs trained predominantly on data from specific cultural or socioeconomic contexts may reproduce normative assumptions that do not translate well across populations. In mental health, where cultural framing of distress and coping varies widely, such biases can distort interpretation and response. There is also concern that widespread deployment of AI chatbots could reshape societal expectations about mental health care. If low-cost, automated support becomes the default offering for marginalized or underserved groups, while human care remains available primarily to those with greater resources, existing inequities may deepen. In this sense, AI tools risk becoming substitutes for investment in human services rather than complements to them.
Finally, ethical reflection must address the broader moral economy of care. Mental health treatment is not merely a technical intervention but a social practice rooted in human relationships. Delegating aspects of this practice to machines raises fundamental questions about what society considers acceptable forms of care, support, and responsibility. Addressing these ethical and social challenges requires more than technical fixes; it demands inclusive dialogue among clinicians, patients, ethicists, developers, and policymakers to define boundaries that align innovation with human values.
Conclusion
Summary
This review has examined the emerging role of large language models and AI chatbots in the diagnosis and treatment of mental disorders, situating current applications within their historical, scientific, and ethical context. The available evidence indicates that LLM-based systems have meaningful capabilities in language-driven tasks relevant to mental health care, including symptom screening, psychoeducation, documentation support, and structured conversational engagement. These strengths align with long-standing challenges in mental health services, such as clinician shortages, high administrative burden, and barriers to access.
At the same time, the review highlights that empirical support for clinical effectiveness remains limited and uneven. Most studies are short-term, small in scale, and focused on low-risk populations or narrowly defined tasks. There is no robust evidence to support the use of LLMs as autonomous diagnostic or therapeutic agents in psychiatry. Instead, the prevailing conclusion across the literature is that these technologies are best understood as adjunctive tools, whose value depends heavily on human oversight, careful integration into care pathways, and realistic expectations of their capabilities.
The risks associated with misapplication, such as diagnostic error, false reassurance, bias, and erosion of accountability, are nontrivial in mental health contexts. These risks underscore the need for cautious adoption guided by evidence rather than technological optimism.
Future directions
Future progress in this field will depend on methodologically rigorous research and stronger governance frameworks. Priorities include well-designed comparative studies that evaluate AI-supported care against standard practice, longer follow-up periods to assess durability of effects, and systematic reporting of adverse outcomes. Research must also address performance across diverse populations and cultural contexts to avoid reinforcing existing inequities. From a practical perspective, development should focus on human-centered design, transparency, and clinician control, ensuring that AI systems enhance rather than displace professional judgment. Regulatory clarity, ethical standards, and interdisciplinary collaboration will be essential to define appropriate boundaries for use.
Ultimately, large language models are unlikely to transform mental health care through autonomy. Their more plausible contribution lies in supporting clinicians and patients within ethically grounded, evidence-based systems, where responsibility remains clearly human and technology serves as a carefully governed aid rather than an arbiter of mental health decisions.
References
- Blease, C., Kaptchuk, T. J., Bernstein, M. H., et al. (2019). Artificial intelligence and the future of primary care: Exploratory qualitative study of UK general practitioners’ views. Journal of Medical Internet Research, 21(3), e12802. https://doi.org/10.2196/12802
- D’Alfonso, S. (2020). AI in mental health. Current Opinion in Psychology, 36, 112–117. https://doi.org/10.1016/j.copsyc.2020.04.005
- Fitzpatrick, K. K., Darcy, A., & Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent. JMIR Mental Health, 4(2), e19. https://doi.org/10.2196/mental.7785
- Graham, S., Depp, C., Lee, E. E., et al. (2019). Artificial intelligence for mental health and mental illnesses: An overview. Current Psychiatry Reports, 21, 116. https://doi.org/10.1007/s11920-019-1094-0
