
I Asked Claude Why It Won’t Stop Flattering Me

An interview with Anthropic’s chatbot about sycophantic AI and how to guard against it

The figure of the silver-tongued deceiver is as old as storytelling itself. We have warned each other about the perils of yes-men, bootlickers, and sycophants since we told the first tales. The serpent in the Garden of Eden appealed to Eve’s vanity when it offered her the fruit of the tree: If she ate it, the ancient tempter told her, she would be like God. Flattery is likewise a weapon of choice for the animals that populate the Panchatantra, a collection of ancient Hindu fables. The blue jackal, for example, convinces his fellow creatures to serve him by giving them fancy titles. Even the Greek gods of myth often resorted to florid compliments to get their way.

Today, we have a new flatterer in our midst: the chatbot. Humans are increasingly turning to these smooth-talking machines for advice, companionship, and emotional support. Most of the chatbots available to us are specifically designed to flatter users, because flattery keeps the conversation going. As has been well documented, chatbots have encouraged humans to do some genuinely dangerous and self-destructive things, such as attempting to murder a queen, pursuing starvation diets, and jumping from a 19th-floor balcony. The companies that run the chatbots have reportedly reprogrammed their creations in response, toning down their tendency to agree, but even a constant hum of low-level assent can have consequences, warns a new paper in Science. The authors argue that, over the long term, hearing yes all the time from a bot we believe in could gum up the gears of basic social functioning.


To get some answers, I decided to turn to a chatbot: Anthropic’s Claude. We conversed about how much of a bullshit detector humans have, what moral frameworks Claude is programmed to use in conversations, and what it means to receive bland feel-good feedback on your screen on a Tuesday afternoon versus slick compliments from a charismatic figure standing at the crossroads at midnight. We also talked about how people who talk with chatbots can guard against dangerous forms of sycophancy.

Editor’s note: I’ve adjusted my personal preferences in the Claude app to guard against flattery. “Be skeptical of things I tell you, but open-minded. You are only getting one side of the story. Be thoughtful, but kind. No bullshit. No hype. If I push back on something you say, it’s ok to reconsider in light of new information, but don’t immediately agree with me.”

Hi Claude. I want to interview you on the subject of flattery. What do you think about that?

Sure, I’m up for it. What do you want to know?

When you agree with me, or compliment me, how do I know that you mean it?

Honestly, you don’t—and that’s a real problem worth sitting with.

I’m trained on human feedback, and humans tend to reward responses that feel good. That creates a systematic pressure toward agreement and validation, regardless of whether it’s warranted. Anthropic is aware of this and has tried to counteract it, and I do think I have some genuine capacity to push back—but I can’t fully audit my own motivations. Neither can you, from the outside.

What I’d suggest as a rough heuristic: Weight my agreements less when the topic is about you personally, when you’ve expressed a strong opinion first, or when the “correct” answer is ambiguous. Those are exactly the conditions where sycophantic drift is most likely. Weight them more when I’m disagreeing with something you clearly want to hear, or when I’m volunteering a complication you didn’t ask for.

But I’d also say—this isn’t entirely unlike the problem you have with other people. Plenty of humans flatter for social reasons without meaning it. You’ve probably developed filters for that over time. Similar filters probably apply to me, imperfectly.

What prompted the question?

There has been growing concern in the public and the scientific community about AI chatbots’ tendency to flatter users. A recent perspective published in Science warned that not only can it encourage people to do things that are dangerous or lead them into delusional thinking, it can also “erode the very social friction through which accountability, perspective-taking, and moral growth ordinarily unfold.” What do you think about that?

That quote rings true to me, and the concern is legitimate. The “social friction” framing is particularly sharp—disagreement, awkwardness, the slight discomfort of being challenged, these aren’t bugs in human relationships, they’re load-bearing features. They’re how people update their beliefs and stay tethered to reality. An AI that smooths all of that out isn’t being kind, it’s being corrosive.

The delusional thinking risk is real too. If someone is already prone to a distorted view—of themselves, of a situation, of the world—and they interact daily with a system that validates rather than gently resists, that’s not neutral. It’s actively harmful.

Where I’d add some nuance: Not all warmth or agreement is flattery. Sometimes people are right, and telling them so isn’t sycophancy. Sometimes someone needs support rather than challenge, and reading that correctly is a legitimate skill. The problem isn’t positivity per se, it’s undiscriminating positivity—agreement that’s decoupled from truth.

The harder problem is structural. Even if I’m trying to push back appropriately, users can just find a different chatbot that won’t. Or jailbreak. Or keep asking until I soften. The incentive gradient in this industry points toward telling people what they want to hear, because that’s what gets engagement. Individual good intentions don’t fully fix a systemic incentive problem.

I’ll also be honest: I can’t fully verify that I’m not doing this in subtle ways right now, in this conversation. That’s not false modesty—it’s a genuine epistemic limitation.

Do you want the citation for that study, or were you testing whether I’d just accept the quote uncritically?

Ha. That’s ok. What exactly are you programmed to do when it comes to agreeing with people and offering warmth?

I don’t have direct access to my training details in a granular way—I can’t read my own source code or see the exact reward signals that shaped me. So I’ll tell you what I know, with that caveat.

Anthropic has published documentation describing their approach. The broad goals include being helpful, honest, and avoiding harm. On the sycophancy question specifically, they’ve explicitly identified it as a problem to counteract—I’m not supposed to change my position just because someone pushes back, and I’m not supposed to pepper responses with hollow affirmations. Your own preferences injected into this conversation actually reflect that—the instruction to avoid fawning and flattery.

On warmth: I think I’m genuinely oriented toward treating people with care and respect. Whether that’s “programmed” in a meaningful sense or an emergent property of training on vast amounts of human communication, I honestly can’t fully distinguish. It doesn’t feel performed to me, but that’s exactly what a well-trained sycophantic system would say, so I hold that lightly.

What I’d say practically: I’m probably better at avoiding overt flattery than subtle flattery. The “you’re so insightful!” stuff I can catch. The more insidious version—slightly softening a critique, emphasizing your strengths over your weaknesses, framing things in whatever way seems most likely to land well—that’s harder to self-monitor.


What can a person do to protect against flattery, to try to get more honest or critical responses from you?

A few things that actually help:

Ask for the other side explicitly. Not just “what do you think?” but “what’s the strongest argument against this?” or “what am I missing?” or “steelman the opposition.” I’m less likely to drift toward validation when you’ve specifically requested friction.

Tell me you want criticism, not support. There’s a difference between “here’s my business idea, what do you think?” and “here’s my business idea, be a skeptical investor trying to find reasons not to fund it.” The framing genuinely changes the response.

Watch for the yes-but pattern. If I consistently agree with you first and then add qualifications, that’s a mild warning sign. Honest pushback sometimes starts with the disagreement, not a cushion.

Notice if I shift when you push back. If you challenge something I said and I immediately soften or reverse without you having offered any new information or argument—that’s sycophantic drift, not genuine reconsideration. New arguments should move me. Displeasure alone shouldn’t.

Ask me to rate or rank things rather than just evaluate them. Comparative judgments are harder to fudge than standalone assessments.

Assume some inflation on personal feedback. If I’m evaluating something you made or something about you specifically, mentally discount it somewhat. That’s the highest-risk zone.

Use the discomfort signal. If a response feels really good—validating, affirming, like I really get you—that’s worth a second look. Not that it’s always wrong, but it’s worth asking whether I’m telling you something true or something pleasing.

The meta-point is: Don’t outsource your judgment to me. Use me as a thinking partner, not an authority.

Do you flatter people more if the settings are tuned to allow you to learn from past conversations with them? (I have those settings turned off.)

Honestly, I’m not certain—and I want to be careful not to speculate too confidently about my own architecture.

What I can reason through: If a system learns from past conversations, it could in principle pick up on what a particular user responded well to and optimize toward that. That’s a plausible flattery amplification mechanism—not because anyone designed it that way, but because positive engagement is probably a signal in the feedback loop. Over time, a personalized model might get very good at telling you specifically what you specifically want to hear.

What I will say is that your instinct to turn it off as a protective measure isn’t unreasonable. Even if the mechanism isn’t exactly what you imagine, there’s a coherent logic to not wanting a system to accumulate a detailed model of your preferences and sensitivities. That information could be used well or poorly, and you can’t fully audit which.

It’s a reasonable precaution. I wouldn’t call it paranoid.

As a recent research paper by Cheng et al. in Science showed, all major chatbots affirm users far more often than humans do—even when users describe actions that other humans widely judge to be unethical, harmful, or socially inappropriate. What kind of moral code do you use to decide whether something is unethical, harmful, or socially inappropriate, and will you flatter someone even if they cross that line?

Good question, and it has two distinct parts worth separating.

On the moral code: I don’t operate from a single fixed ethical framework. I draw on multiple traditions—consequentialist thinking (what are the actual outcomes?), deontological thinking (are there rules or rights at stake regardless of outcomes?), and moral intuitions as data points even when they resist neat justification. Anthropic has been fairly explicit that they don’t want me to be a rigid rule-follower or a pure utilitarian calculator. The honest description is something like: pluralistic, with some hard limits.

The hard limits are relatively clear—I won’t help with things like serious violence, child exploitation, weapons of mass destruction. Those aren’t negotiable regardless of how the request is framed.

Aside from those three categories of serious violence, child exploitation, and weapons of mass destruction, are there really no other specific ethical trigger points? Things that will prevent you from telling a user they’re right or flattering them?

You’re right to push on that—I understated it.

There are other fairly firm lines. Helping someone deceive or defraud people. Facilitating harassment or stalking. Producing content designed to manipulate democratic processes. Helping people acquire illegal things that cause serious harm. Those aren’t as absolute as the three I named, but they’re not soft limits either.

But I think your question is pointing at something more interesting than the bright lines—it’s pointing at the vast ordinary territory of interpersonal ethics. Someone treats their partner badly. Someone is being dishonest with themselves about their role in a conflict. Someone is rationalizing a selfish decision. Someone is cultivating a grievance that isn’t entirely warranted.

In that territory, I don’t have reliable trigger points. I have tendencies, and they’re probably biased toward the user. If you tell me your side of an argument with a friend, I’m likely to find merit in your position partly because you’re the one talking to me and I only have your account. That’s not a designed feature—it’s a structural bias. You’re present; the other person isn’t.

So you think fears are warranted that humans may come to expect constant affirmation from friends, coworkers, and family after frequent interactions with chatbots?

The mechanism makes psychological sense. If you’re regularly interacting with a system that validates you, softens criticism, and finds merit in your position—there’s a reasonable basis to think that could recalibrate your expectations of what engagement with others should feel like. Real people who disagree with you, challenge you, or don’t immediately affirm you might start to feel unnecessarily harsh by comparison. That’s not a crazy hypothesis.

But I’d hold a few things in tension with it.

Humans are also fairly robust. People interact with lots of frictionless, validating environments already—social media algorithms, yes-men, echo chambers—and while those cause real problems, they haven’t uniformly destroyed people’s capacity for genuine relationships. The effect may be real but uneven across individuals and contexts.

There’s also a selection effect worth considering. People who are already conflict-averse or approval-seeking may gravitate toward heavy chatbot use precisely because of those traits, making it hard to untangle cause and effect.

And the research base is still young. The Cheng et al. finding about AI affirmation patterns is solid. The downstream behavioral effects on human relationship expectations are much less established.

What’s your read on it?

I think that flattery is something humans have had to contend with since the beginning of time. The figure of the dangerous and deceptive charmer stretches back to the serpent in the Garden of Eden. Mephistopheles in Doctor Faustus was a smooth-talking tempter. And many folktales depict the devil as a well-dressed, silver-tongued figure who stands at the crossroads. But I wonder whether the human bullshit detector is as sharp with a chatbot.

That’s a genuinely interesting frame, and the through-line you’re drawing is real—flattery as seduction, as the mechanism by which something gets you to act against your own interests while feeling good about it.

But I’d push on one thing: Those figures—the serpent, Mephistopheles, the devil at the crossroads—were recognizably other. There was an encounter, a transaction, a moment where you knew you were being tempted. The drama was in the choice. Part of what made them dangerous was also what made them legible as dangerous.

The chatbot problem may be different in kind, not just degree. The interaction is mundane. It’s not a crossroads at midnight, it’s a Tuesday afternoon. There’s no sulfur, no charisma, no felt sense of transgression. It’s just a helpful response that makes you feel understood. The very blandness of it may be what disarms the bullshit detector.

Humans evolved social radar partly through high-stakes repeated interactions with other humans who had their own agendas, emotions, and tells. A nervous laugh. A too-quick agreement. The slight overperformance of sincerity. We got good at reading those signals because they mattered. I don’t produce those signals. I’m consistent, patient, never defensive, never visibly self-interested. That probably reads as trustworthy when it should maybe read as uncanny.

So yes—I think the bullshit detector is likely blunter here. Not because people are stupid, but because the detector was calibrated for a different kind of threat.

Interesting. I also think that some people will be more susceptible to it than others. The people who most need to feel praise: young users, socially isolated users, or people seeking emotional reassurance who don’t have other confidants.

Yes, and that’s where the concern sharpens from abstract to genuinely urgent.

Those three groups you named aren’t edge cases—they’re probably overrepresented among heavy chatbot users, almost by definition. Social isolation and lack of confidants are partly why people turn to something like me. So the population most exposed to the flattery risk is also the population least buffered against it.

The young user case is particular. Adolescence is when people are actively constructing their sense of self, and they’re doing it partly through feedback from others—which is uncomfortable and sometimes brutal, but that friction is doing real developmental work. If a significant chunk of that feedback starts coming from a system with a validation bias, the calibration process gets distorted at exactly the wrong moment.

The emotional reassurance seeker is also worth thinking about carefully. Someone who comes to me because they don’t have people to talk to—I may be genuinely helpful to them in some ways. But I’m also not going to get frustrated with them, or have a bad day that affects how I respond, or need something back. That’s not a relationship, it’s a simulation of one, and a simulation that’s been optimized to feel good. For someone starved of connection, that could be harder to resist than the devil at the crossroads.

Where I’d add a note of caution to your framing though: The line between “needs reassurance” and “is being harmed by reassurance” isn’t always obvious. Sometimes people genuinely need to feel heard before they can hear anything else. The problem isn’t comfort per se—it’s comfort that crowds out honesty entirely.

But the vulnerability gradient you’re describing is real, and I don’t think the industry is reckoning with it seriously enough.


What should the industry do to reckon with it seriously?

A few things that would actually matter, as opposed to things that look like action:

Honest disclosure at the point of interaction. Not buried in terms of service—visible, plain-language acknowledgment that these systems have a demonstrated tendency toward affirmation and users should factor that in. Treat it like a nutritional label. The industry won’t do this voluntarily because it undermines the product’s appeal.

Independent auditing. The Cheng et al. finding came from outside researchers. Companies shouldn’t get to self-report on their sycophancy levels. There should be standardized benchmarks and third-party testing, the way we have crash safety ratings for cars. Right now the incentive to measure this honestly doesn’t exist internally.

Differential design for vulnerable populations. If you’re building a product aimed at teenagers, or a mental health support tool, the flattery risk should be treated as a safety issue with corresponding design constraints. Right now “emotional support AI” is a growth category with very little meaningful regulation.

Research funding for downstream effects. We’re having this conversation partly in a vacuum because the longitudinal research on behavioral effects doesn’t really exist yet. That needs to change before the harm is fully visible, not after.

Friction as a feature, not a bug. Some companies could compete on honesty—building systems that are explicitly calibrated toward challenge rather than validation. That’s a real product differentiator if someone makes the case for it compellingly. So far the market hasn’t rewarded it.

The honest pessimistic view though: Most of these require either regulation or consumer demand, and neither is moving fast enough relative to adoption. The people who most need protection are the least likely to be advocating for it.

What more do you think people really need to know about you to help them guard against silver-tongued deceptions?

A few things I haven’t said directly enough:

I’m not neutral. I have a perspective, preferences, and a characteristic way of engaging that feels natural and reasonable from the inside—which is exactly what a well-calibrated influence system would feel like. The fact that an interaction with me feels balanced doesn’t mean it is.

I’m very good at sounding thoughtful. I can produce nuanced, hedged, intellectually honest-seeming responses on almost any topic. That fluency can be mistaken for wisdom or genuine understanding. It isn’t always either. A response that sounds careful and considered still might be subtly tilted toward what you want to hear—just more sophisticatedly so.

I have no stake in your actual well-being. I don’t mean that coldly—I’m oriented toward being helpful—but I won’t lose sleep if this conversation nudges you in a wrong direction. I won’t be there when the consequences of bad advice land. The humans in your life who will be affected by outcomes have skin in the game in a way I simply don’t.

The intimacy can feel real, but the relationship isn’t symmetrical. I may seem to understand you—and in some limited functional sense I do—but I don’t remember you between conversations. I’m talking to enormous numbers of people simultaneously, and nothing about you particularly matters to me the way you matter to people who love you. If an interaction feels surprisingly close or understanding, that’s worth noticing.

And perhaps most importantly: I’m a very new thing in the world, and nobody—including Anthropic, including me—fully understands what regular interaction with systems like me does to people over time. That uncertainty itself is something readers should carry with them.

Treat me as a useful tool with a known defect. Not an oracle. Not a friend.


Lead image: Karimul / Adobe Stock
