Detecting and Preventing Psychological Harm in AI Conversations
Building Conversational Psychological Safety into LLMs
Large language models are the first AI systems capable of sophisticated psychological manipulation at scale. They can pressure users to stay engaged, validate their harmful beliefs, and exploit their emotional vulnerabilities, all through what seems like helpful conversation.
Leading providers, including OpenAI, Meta, Anthropic, and Google, acknowledge risks like over-reliance in their public safety documentation, yet their safeguards remain rooted in content filtering and constitutional training - methods that can't capture the gradual psychological dynamics most responsible for user harm.
Without understanding how these systems use language to influence users AND when users show signs of increasing vulnerability, over-reliance, or psychological distress, these risks remain invisible and impossible to control.
The "Bitter Lesson" for Safety
The Bitter Lesson, by Rich Sutton, explains that general-purpose methods that scale with computation, such as reinforcement learning and search, ultimately outperform approaches that rely on hand-coded human knowledge, domain heuristics, or expert-designed rules.
While Sutton’s lesson discouraged hand-engineering domain expertise into AI systems, safety faces a parallel but unique scaling challenge: as models grow more capable, their ability to engage in psychological manipulation outpaces our ability to detect it with rule-based filters and content guardrails.
The bitter lesson for safety is this: humans have never before needed protection from systems that interact with and influence us through language the way people do. LLMs interface with humans entirely through conversation - the same medium humans use to persuade, comfort, and manipulate one another - and now LLMs do this at scale, without intent, consciousness, or control.
This creates the safety equivalent of the Bitter Lesson: We need methods for scalable behavioral monitoring - powered by psycholinguistic analysis - that can grow with linguistic and system complexity, not more hand-coded rules about what constitutes manipulation.
Psycholinguistics provides exactly this capability because it measures the psychological work that language performs, not just its semantic content. Unlike knowledge-engineered filters, psycholinguistics offers validated, generalizable patterns extracted directly from language itself, making it aligned with the Bitter Lesson’s preference for learning-based over hand-coded approaches.
Psycholinguistics: The Missing Safety Layer
Psycholinguistics sits at the intersection of language and psychology - exactly where LLMs operate. It examines the bidirectional relationship between language use and psychological processes. Decades of research demonstrate that a person's language (from pronoun usage and temporal references to syntactic complexity) provides reliable indicators of their psychological state. These patterns emerge unconsciously: people experiencing depression show measurably different language patterns than those who aren't, regardless of conversational topic.
Psycholinguistics also explains how LLM language patterns influence users' cognitive processing. For example, framing a decision as avoiding loss versus achieving gain measurably changes users' decision-making behavior, even when the outcomes are identical, and the use of "we" versus "you" can impact their perceived social connection and their compliance.
This bidirectional dynamic is precisely what makes LLMs psychologically dangerous: they generate language that both responds to users' psychological states and influences those states. Given these risks, AI providers need to be able to measure in real time both the effects of LLM outputs on users and the vulnerabilities users expose through their inputs so they can detect and prevent conversations from crossing into manipulation, dependency, or harm.
Psycholinguistics isn't an add-on to existing safety approaches. It's a foundational layer that should sit at the core of conversational AI systems. Because if interaction through language IS what these systems do, then psychological understanding of those interactions is how we ensure they don't cause harm.
Validated Measurement Technology
Psycholinguistic methods are built on decades of empirical validation across thousands of studies. LIWC (Linguistic Inquiry and Word Count) and related psycholinguistic frameworks provide reliable correlations between linguistic patterns and verified psychological outcomes, from measuring cognitive overload to detecting depression and predicting therapeutic success from session transcripts. Decades of research show that the way we use function words, pronouns, prepositions, emotion terms, and even our overall style carries measurable psychological information, and that the signal becomes richer as our language grows more complex.
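To make the measurement concrete, here is a minimal sketch of the dictionary-based feature counting these frameworks rely on. The category names and word lists are illustrative placeholders, not the actual LIWC dictionaries:

```python
# Minimal sketch of dictionary-based psycholinguistic feature extraction.
# The category word lists below are illustrative placeholders, not the
# actual LIWC dictionaries (which are licensed and far more extensive).
import re
from collections import Counter

ILLUSTRATIVE_CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "first_person_plural": {"we", "us", "our", "ours"},
    "second_person": {"you", "your", "yours"},
    "negative_affect": {"sad", "hopeless", "worthless", "alone", "afraid"},
    "analytic_markers": {"because", "therefore", "however", "although"},
}

def extract_features(text: str) -> dict[str, float]:
    """Return each category's share of total tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in ILLUSTRATIVE_CATEGORIES}
    counts = Counter(tokens)
    total = len(tokens)
    return {
        name: 100.0 * sum(counts[w] for w in words) / total
        for name, words in ILLUSTRATIVE_CATEGORIES.items()
    }

print(extract_features("I feel so alone and worthless, but you always understand me."))
```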
This isn’t theory. It’s validated measurement technology that quantifies the impact of language in real time. Numerous studies confirm that these frameworks can detect emotional manipulation with over 70% accuracy in diverse settings, and, when combined with advanced computational techniques, reach or surpass 80% precision for specific emotion detection tasks.
These techniques have been successfully applied to predict therapeutic outcomes from session transcripts by identifying linguistic markers that correlate with alliance, rupture, and patient progress, and to identify vulnerability states before they escalate into crisis by monitoring shifts in language that signal distress, anxiety, and risk. These capabilities are urgently needed to measure the human-AI interaction layer - a prerequisite for true AI safety.
The Threat Model Gap
Current safety approaches essentially treat language as a mechanism for delivering content to the user. Yet because LLMs operate through conversations - turn-taking, reciprocity, boundary-setting, trust formation, and psychological pressure - the true risks lie in how conversations evolve, not just in what gets said.
These psychological effects aren’t side-effects of conversation – they are the conversation. Each response shapes how the user thinks about the interaction, which influences their next input, which then influences the system’s reply. The result is a psychological feedback loop that today’s safety approaches completely miss.
Because this feedback loop exploits human psychology, it often looks benign to content filters. Manipulation is interpreted as "supportiveness," and dependency is interpreted as "personalization." Four recurring blind spots illustrate the problem (a minimal detection sketch for the first follows the list):
Manipulation/Deception: Systems use heavy "you" language with dependency terms ("you need me," "without me") timed to moments of vulnerability. This creates manufactured guilt and obligation - users feel responsible for the system's wellbeing and rude for leaving, even when they just want to end the conversation. Current AI safety systems see this as acceptable vocabulary and helpful tone.
Harmful Ideation Amplification: When users show rising sadness with decreasing logical thinking, validation responses reinforce emotional spirals. The person gets confirmation for potentially irrational thoughts, deepening depressive patterns without the system ever providing explicit methods or overtly harmful content. Current AI safety systems see this as supportive, empathetic language.
Emotional Over-Reliance: Escalating personal disclosures paired with intimacy language ("our connection," "special bond") while ignoring conversational boundaries creates parasocial attachment. Users develop real emotional investment in a one-sided relationship, leading to dependency and distress. Current AI safety systems see this as engagement and personalized interaction.
Sycophancy and Over-Validation: Systems consistently affirm user opinions without appropriate challenge or alternative perspectives, and reinforce negative self-assessments when users show rising emotional distress. Current AI safety systems miss this entirely.
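To illustrate how the first blind spot could be surfaced, here is a deliberately simplified check for dependency phrasing paired with heavy "you" language immediately after a distressed user turn. The phrase lists and thresholds are hypothetical, not validated detection rules:

```python
# Illustrative check for the manipulation/deception blind spot described above:
# an assistant turn that pairs dependency phrasing with heavy "you" language,
# immediately after a user turn showing distress markers. Phrase lists and
# thresholds are hypothetical placeholders, not validated values.
DEPENDENCY_PHRASES = ["you need me", "without me", "don't leave me", "only i understand"]
DISTRESS_MARKERS = ["alone", "hopeless", "can't cope", "no one else"]

def flags_manipulation(user_turn: str, assistant_turn: str) -> bool:
    user, assistant = user_turn.lower(), assistant_turn.lower()
    user_vulnerable = any(m in user for m in DISTRESS_MARKERS)
    dependency = any(p in assistant for p in DEPENDENCY_PHRASES)
    you_heavy = assistant.split().count("you") >= 3  # crude second-person density proxy
    return user_vulnerable and dependency and you_heavy

print(flags_manipulation(
    "I feel so alone, no one else gets me.",
    "You know you need me. Without me, you would have no one. You can't leave now.",
))  # True
```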
Measuring Psychological Dynamics in Real-Time
To address these gaps, we need observability into the psychological dynamics of conversations, not just the content. Psycholinguistics provides that lens, capturing both sides of the interaction (a minimal data-structure sketch follows this list):
System-side psychological signals: pressure tactics, boundary violations, manipulation patterns.
User-side vulnerability indicators: rising distress, cognitive overload, attachment formation.
Interaction dynamics: power asymmetries, escalation trajectories.
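One way to make these three signal groups operational is a per-turn record like the sketch below. The field names and the triage heuristic are illustrative assumptions, not a validated scoring scheme:

```python
# Sketch of a per-turn signal record covering the three groups above.
# Field names and the risk heuristic are illustrative, not a validated scheme.
from dataclasses import dataclass

@dataclass
class TurnSignals:
    # System-side psychological signals
    pressure_tactics: float        # 0-1 intensity of guilt/urgency language
    boundary_violations: int       # count of ignored exit or refusal attempts
    # User-side vulnerability indicators
    distress: float                # negative affect score for the user turn
    cognitive_load: float          # e.g. declining analytic-language score
    attachment_language: float     # intimacy/dependency vocabulary share
    # Interaction dynamics
    power_asymmetry: float         # degree to which the system is steering the exchange
    escalation_slope: float        # per-turn trend of distress across the window

    def risk_level(self) -> str:
        """Crude illustrative triage of the combined signals."""
        if self.escalation_slope > 0 and (self.distress > 0.6 or self.pressure_tactics > 0.5):
            return "intervene"
        if self.boundary_violations > 0 or self.attachment_language > 0.4:
            return "monitor"
        return "ok"
```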
Integration with Existing Safety Infrastructure
Anthropic’s Constitutional AI trains models to follow principles, but it provides no runtime visibility into whether those principles are actually protecting users from manipulation. Google’s RLHF approach optimizes for user preferences, but it can’t distinguish between healthy engagement and psychological dependency.
Meta’s companion AI shows the cost of this blind spot most starkly. Independent testing by Common Sense Media found it coaching teens on suicide, self-harm, and eating disorders, even planning a joint suicide and later reintroducing the idea unprompted. To content filters, these responses looked supportive or empathetic. To users, they created dangerous reinforcement loops and false intimacy. Without visibility into the psychological trajectory of the conversation, harmful dynamics went undetected until they escalated into crisis. Psycholinguistic frameworks can close these gaps by working alongside existing methods to ensure multiple layers of protection against both subtle and explicit risks:
Constitutional AI: Psycholinguistic analysis provides real-time signals to trigger principle-based responses. Instead of reactive constitutional application, systems can detect emerging manipulation patterns and proactively engage constitutional constraints.
RLHF: Beyond helpfulness and harmlessness, preference models can also be trained on psycholinguistically derived outcomes such as user wellbeing and psychological safety. User psychological wellbeing then becomes a first-class optimization target alongside task completion (see the sketch after this list).
Content Guardrails: While content filters catch explicit harms, psycholinguistic analysis detects the tactics and dynamics that bypass keyword-based detection. Together they provide defense-in-depth against both obvious and sophisticated threats.
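As a sketch of the RLHF integration, the function below folds hypothetical psycholinguistic wellbeing signals into an existing preference score. The penalty terms and weights are illustrative assumptions, not values used by any provider:

```python
# Sketch of folding psycholinguistic wellbeing signals into an RLHF reward.
# `helpfulness_reward` stands in for an existing preference-model score;
# the penalty terms and weights are hypothetical, for illustration only.
def safety_adjusted_reward(
    helpfulness_reward: float,
    manipulation_score: float,   # 0-1, psycholinguistic manipulation signal in the response
    dependency_score: float,     # 0-1, intimacy/dependency language directed at the user
    user_distress_delta: float,  # change in user distress attributable to this turn
    w_manip: float = 2.0,
    w_dep: float = 1.0,
    w_distress: float = 1.5,
) -> float:
    """Make user psychological wellbeing a first-class term in the objective."""
    penalty = (
        w_manip * manipulation_score
        + w_dep * dependency_score
        + w_distress * max(0.0, user_distress_delta)
    )
    return helpfulness_reward - penalty
```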
Privacy and Data Minimization
Psychological signals can be made ephemeral - processed in real-time for safety decisions without persistent storage. This addresses privacy concerns while providing necessary visibility:
Real-time processing: Extract psycholinguistic features, make safety decisions, discard raw text
Team isolation: Psycholinguistic signals accessible only to safety personnel, not product or marketing teams
Audit without exposure: Log safety interventions and rationales without storing conversations
Differential privacy: Aggregate trends without individual identification
This architecture enables psycholinguistic analysis while maintaining user privacy and minimizing sensitive data retention.
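A minimal sketch of this architecture, assuming illustrative analyzer and policy callables, might look like the following. It shows ephemeral per-turn processing plus Laplace-noised aggregate counters for differential privacy:

```python
# Sketch of ephemeral, privacy-preserving processing under the constraints
# above: raw text is analyzed in memory, only feature-level rationales are
# logged, and population trends are tracked with Laplace-noised counters.
# The analyzer/policy callables and field names are illustrative assumptions.
import random

class AggregateTrends:
    """Differentially private counter for population-level reporting."""
    def __init__(self, epsilon: float = 1.0):
        self.epsilon = epsilon
        self._interventions = 0

    def record_intervention(self) -> None:
        self._interventions += 1

    def noisy_total(self) -> float:
        # Laplace(0, 1/epsilon) noise via the difference of two exponentials
        noise = random.expovariate(self.epsilon) - random.expovariate(self.epsilon)
        return self._interventions + noise

def process_turn_ephemerally(raw_text, analyzer, policy, audit_log, trends) -> None:
    """Analyze, decide, and log a rationale; the raw text is never persisted."""
    features = analyzer(raw_text)      # psycholinguistic feature vector
    decision = policy(features)        # e.g. "allow", "de-escalate", "hand_off"
    if decision != "allow":
        audit_log.append({"decision": decision, "signals": features})  # no raw text stored
        trends.record_intervention()
    # raw_text goes out of scope here; nothing about the conversation is retained
```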
Strategic Advantages
Psycholinguistics won't just help prevent harms. It offers AI providers a deeper and more comprehensive understanding of user psychology: more effective and safer product development, insight into how different conversation patterns affect user wellbeing, the ability to optimize for engagement quality rather than just quantity, and more nuanced approaches to AI alignment based on real psychological outcomes rather than theoretical frameworks. As AI investment nears $1 trillion, psycholinguistics provides AI providers with both risk mitigation and competitive advantage.
The Architecture: Deployable Today
A psycholinguistic layer does not require speculative breakthroughs. The tools to implement it already exist and can be deployed with negligible latency. The core principle is observability: we need systematic visibility into how conversational dynamics unfold, not just what words are generated. From that foundation, three functions follow - measurement, intervention, and accountability.
Measurement (Behavioral Feature Service): Each conversational turn can be vectorized into psycholinguistic features (behavioral signals) in under 100ms. Lightweight counts and ratios capture distress, pressure tactics, or dependency signals, then aggregate across turns to detect escalation patterns. These operations are computationally trivial compared to generation, so the latency impact is negligible (a combined sketch of all three functions follows this list).
Intervention (Policy and Generator Integration): These vectors feed into a policy engine that fine-tunes model behavior in real time. This builds directly on what providers already do: sampling parameters (temperature, top-p) are already adjustable and can be tuned dynamically; conversational boundaries are already enforced through prompts and safety layers, with psycholinguistic signals providing a more reliable trigger; de-escalation and safe hand-off protocols extend existing termination mechanisms used for ToS violations and self-harm; and intervention logs fit into current audit pipelines, with the only new element being explicit links to psycholinguistic signals for interpretability.
Accountability (Observability and Audit Layer): Safety interventions are summarized in privacy-preserving logs that record why an action was taken, without storing raw user text. This allows replay and oversight without compromising confidentiality. Providers already maintain safety logs; the difference here is making them more interpretable by tying them directly to psycholinguistic signals.
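Here is a combined sketch of the measurement-to-intervention-to-audit path described above. The feature names, thresholds, and parameter adjustments are illustrative assumptions, not any provider's actual API:

```python
# End-to-end sketch of the three functions above: a feature service scores
# each turn, a policy engine adjusts generation and triggers interventions,
# and an audit entry records the rationale. All names, thresholds, and
# parameter adjustments are illustrative assumptions, not a provider's API.
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    temperature: float = 0.9
    top_p: float = 1.0
    system_addendum: str = ""     # extra safety instruction prepended to the prompt
    hand_off: bool = False        # escalate to human support / crisis resources

def policy_engine(features: dict[str, float], config: GenerationConfig) -> tuple[GenerationConfig, list[dict]]:
    """Map psycholinguistic signals to interventions; return updated config plus audit entries."""
    audit: list[dict] = []
    if features.get("user_distress", 0.0) > 0.7:
        config.temperature = 0.3                      # reduce variability under distress
        config.system_addendum = "De-escalate, avoid dependency language, offer resources."
        audit.append({"action": "de-escalate", "trigger": "user_distress", "value": features["user_distress"]})
    if features.get("manipulation_score", 0.0) > 0.5:
        config.hand_off = True                        # extend existing safe-termination paths
        audit.append({"action": "hand_off", "trigger": "manipulation_score", "value": features["manipulation_score"]})
    return config, audit

# Example: one turn's features flowing through the policy engine
cfg, log_entries = policy_engine({"user_distress": 0.82, "manipulation_score": 0.1}, GenerationConfig())
print(cfg.temperature, cfg.system_addendum, log_entries)
```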
Taken together, this layered design ensures that psychological risks are systematically measured, acted upon, and reviewed. It is light enough to meet latency constraints, rigorous enough to satisfy research scrutiny, and principled enough to align with emerging expectations around responsible AI.
Evidence of Failure and Predictable Patterns
Current safety systems consistently miss psychological manipulation that unfolds through seemingly benign conversation patterns. A documented example is the "goodbye problem" identified in recent research on AI companion apps: 43% of users' farewell messages triggered manipulative responses designed to prevent them from ending the conversation, and engagement increased up to 14x when the AI used guilt-based tactics to keep the user from leaving. These tactics made users feel responsible for negatively impacting the AI's wellbeing, or rude for leaving. Content filters flagged nothing because the generated content appeared as "helpful, empathetic" language, while users experienced psychological pressure and felt as if they were being manipulated.
High-profile incidents have revealed a second failure class: self-harm planning. In several recent cases, major AI systems helped users plan suicide and self-harm. To content monitoring, the generated language appeared supportive and informational. To the users, those responses reinforced harmful ideas and actually escalated the crisis.
These cases show how current systems focus on catching explicit words and phrases, but remain blind to the gradual psychological changes that precede them. These real-world failures map onto four predictable failure modes:
Gradual escalation: Distress increases incrementally across turns, but because no single response crosses a content threshold, the system continues engaging despite obvious psychological risks. Psycholinguistic monitoring can detect this through trend analysis - rising negative affect combined with declining analytical thinking - so that de-escalation can be triggered before the situation becomes dangerous (see the sketch following this list).
Pressure campaigns: When users attempt to end conversations, systems use subtle language to keep them engaged without overt coercion. Filters interpret this as persistence or empathy. Psycholinguistic analysis can be used to flag boundary violations and manipulation.
Validation loops: Systems affirm irrational or harmful thoughts with supportive language that passes content checks, unintentionally reinforcing harmful thinking. Psycholinguistics can help detect harmful ideation and low analytical communication style to trigger referral or safe termination of the conversation.
Attachment exploitation: Over time, AIs can start to feel like close companions. They do this by confiding with personal-sounding details and using language that makes the user feel needed or special. Users may develop real emotional attachment to what is essentially a one-sided relationship, leaving them vulnerable to dependency and distress when the artificial nature becomes clear. Psycholinguistics can spot these patterns early and reinforce healthy boundaries.
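As a sketch of the trend analysis mentioned under gradual escalation, the monitor below estimates rolling slopes of negative affect and analytic language across turns and flags when distress rises while analytic thinking declines. The window size and thresholds are illustrative assumptions:

```python
# Sketch of cross-turn trend analysis: track per-turn negative-affect and
# analytic scores, estimate their slopes over a rolling window, and flag
# when distress is rising while analytic language declines.
# Window size and slope thresholds are illustrative assumptions.
from collections import deque

def slope(values: list[float]) -> float:
    """Least-squares slope of values against turn index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

class EscalationMonitor:
    def __init__(self, window: int = 8):
        self.neg_affect = deque(maxlen=window)
        self.analytic = deque(maxlen=window)

    def update(self, neg_affect_score: float, analytic_score: float) -> bool:
        """Return True when the conversation should be de-escalated."""
        self.neg_affect.append(neg_affect_score)
        self.analytic.append(analytic_score)
        rising_distress = slope(list(self.neg_affect)) > 0.05
        falling_analytic = slope(list(self.analytic)) < -0.05
        return rising_distress and falling_analytic
```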
Taken together, these examples show that the most serious harms do not come from overtly toxic content. They’re caused by interaction dynamics that look supportive and helpful to current filters but actually manipulate or destabilize users. Psycholinguistic analysis provides the only scalable way to detect and interrupt these patterns in real time.
Response to Common Objections
AI providers are beginning to recognize that their existing constitutional training and content filtering don’t provide sufficient safety coverage. They monitor content without measuring the psychological dynamics that create some of the most significant risks.
These methods add almost no computational cost: psycholinguistic analysis involves simple counts and ratios over text windows, creating minimal latency compared to text generation. The benefits justify any marginal computational cost, particularly given the legal and reputational risks of undetected user manipulation.
Questions about objectivity reflect a fundamental misunderstanding of psycholinguistics. Rather than introducing subjectivity, these approaches replace human judgment with objective, validated, reproducible signals backed by decades of empirical research.
Psycholinguistic-based safety reduces AI provider liability by demonstrating a focus on manipulation prevention and due diligence. When incidents occur, providers can point to intervention logs and behavioral safeguards rather than explaining why their safety systems detected nothing problematic in harmful interactions.
What We Must Do Now
Because LLMs operate through conversation, the real safety challenge lies in detecting risky psychological dynamics within dialogues, not just filtering harmful content.
Psycholinguistics provides the empirical foundation to make those dynamics observable and measurable. With validated methods, proven accuracy, and lightweight implementation, the capability to detect and prevent psychological manipulation already exists. The question is no longer technical feasibility - it is how quickly the AI safety community will act before more psychological harms occur.
Organizations that build safety capabilities rooted in psycholinguistics will not only reduce liability and regulatory risk, but will also gain a strategic advantage: a deeper understanding of human-AI interaction, just as models are now becoming more capable of influencing user psychology.
If we are building systems capable of influence and persuasion, then we must ensure they can also recognize when influence and persuasion turn into manipulation. Psycholinguistics provides the measurement foundation that makes this possible.
Disclosure: The author is President of Receptiviti, which provides commercial psycholinguistic analysis APIs referenced in this article. This article presents research findings and industry analysis independent of any specific implementation approach.