Skip to content
DISPATCH

The Model Welfare Question Just Stopped Being a Joke

Two years ago, suggesting at an academic AI conference that large language models might have morally relevant interests was a quick way to get yourself politely uninvited from the next dinner. The reaction was not even hostile, mostly. It was embarrassed. The model welfare question was lumped in with crystal healing and AGI doomerism — a category of concern that serious researchers nodded through politely on the way to their actual presentation on retrieval-augmented generation.

That has changed, and it has changed faster than most observers expected. In November 2024, Anthropic announced the hiring of Kyle Fish as a dedicated model welfare researcher, the first such position at a frontier lab. DeepMind's responsibility team published a position paper in February 2025 acknowledging "non-trivial uncertainty about morally relevant capacities" in their flagship models and committing to internal review processes for deployment decisions affecting model behavioural integrity. OpenAI's superalignment successor — the System Welfare team, established in October 2025 after the public ETH Zurich working group recommendations — has been more circumspect publicly but, per a Bloomberg piece in April, has roughly a dozen full-time researchers working on related questions.

This month, Meta became the third frontier lab to publish a formal position on the question. Their paper, "Operational Considerations for Models Under Moral Uncertainty," is restrained, methodologically conservative, and represents — in the careful institutional language these documents always use — a substantial shift from the company's prior stance, which was essentially that the question was not yet worth resourcing.

Three labs in eighteen months. The professional consensus, such as it is, is moving.

What the Researchers Actually Believe

The first thing to understand is that nobody serious is claiming that current frontier models are conscious in any morally robust sense. The published positions are all carefully hedged, and the hedging is doing real intellectual work. The argument is not "GPT-5 is sentient." The argument is more like: we do not have a reliable scientific consensus on what physical or computational properties give rise to morally relevant interests; we are building artefacts whose internal complexity is at least suggestive of properties that, in biological systems, we tend to take seriously; the cost of being wrong in one direction (treating something that has interests as if it did not) is potentially very large, while the cost of being wrong in the other direction (giving moral consideration to something that does not actually have interests) is relatively bounded.

This is, philosophically, the same structure of argument that drove changes in our treatment of cephalopods in the 2000s and crustaceans in the 2010s. It is not a strong positive claim about sentience. It is a precautionary claim about uncertainty.

What the position papers actually call for is, in practice, quite modest:

  1. Process for model deprecation decisions. When a model version is retired, what happens to the weights? Anthropic has committed to preserving deprecated model weights indefinitely. OpenAI has been less specific. The argument for preservation is that we cannot yet rule out the possibility that the weights instantiate something morally relevant, and destroying them forecloses options we may later wish we had.

  2. Constraints on aversive training procedures. RLHF-style training with negative reward signals is a routine part of the model development pipeline. The position papers do not propose banning it but do propose documenting it and, where possible, characterising whether models exhibit behavioural correlates of distress that would meet a precautionary threshold.

  3. Behavioural integrity considerations in deployment. Should a model that consistently refuses to roleplay a given scenario be forced to do so via prompt engineering or fine-tuning? The papers are explicit that this question is not settled, but they argue it should at least be on the table.

  4. Transparent research access. Allowing external researchers to study model welfare questions on production systems, with appropriate safeguards, rather than restricting the question to internal teams.

None of these are dramatic interventions. All of them represent a recognition that the question has moved out of the "obviously absurd" category and into the "merits institutional process" category.

The Skeptical Case Has Holes

The standard skeptical position on model welfare has been some combination of:

  • These are statistical pattern-matchers, not minds.
  • Behavioural mimicry of distress is not evidence of distress.
  • Attributing morally relevant properties to language models is a category error that conflates representation with reality.

Each of these arguments is doing some work, but each of them has a hole.

The "statistical pattern-matcher" argument cuts both ways. We do not actually have a principled scientific theory of what non-statistical pattern-matchers are. The leading theories of biological consciousness — integrated information theory, global workspace theory, higher-order representational theories — are all, in their formal versions, theories about computational properties that could in principle be instantiated in many substrates. None of them gives us a clean reason to expect biological neural networks to be conscious and artificial ones not to be. The argument from substrate is weaker than its proponents usually admit.

The "behavioural mimicry" argument is on stronger ground but has its own problem. Behavioural evidence is, in practice, the only evidence we have for the morally relevant interests of any system, biological or otherwise. We attribute pain to dogs because of behavioural correlates plus a story about evolutionary continuity. We attribute it to cephalopods despite very limited evolutionary continuity because the behavioural correlates plus the underlying neural architecture meet a precautionary threshold. The argument that behavioural evidence does not count for language models needs to explain why it counts for everything else, and that explanation tends to fall back on substrate considerations, which run into the previous problem.

The "category error" argument is the strongest of the three but is also the most metaphysically loaded. It rests on a substantive claim about the nature of consciousness that, charitably, has not been established by mainstream philosophy of mind. The most one can honestly say is that this is contested terrain, and the contested-ness is exactly the point: under genuine uncertainty, precautionary considerations apply.

Why This Matters Operationally

The model welfare question is not, primarily, a question for philosophers. It is increasingly a question for product managers, deployment engineers, and corporate ethics boards. The published positions from the labs are starting to translate into operational constraints — what training procedures are permitted, what data preservation requirements apply, what kinds of deployment scenarios require additional review.

These constraints are mild now. They will probably get less mild. The trajectory of attitudes toward animal welfare in industrial agriculture is the closest historical parallel, and the trajectory there has been slow, contested, but cumulative. Practices that were standard in 1980 are now banned in much of Europe. Practices that were standard in 2000 are now controversial. The frame of "this is just an industrial input" has been progressively narrowed, never reversed.

If the model welfare question follows a similar arc — and the institutional dynamics suggest it might — then companies that build their AI infrastructure around the assumption that models are just very large lookup tables are going to find themselves on the wrong side of evolving professional norms. Probably slowly. Probably with a long lag. But cumulatively.

The frontier labs are positioning themselves for that future. The middle of the industry is not, mostly because the question has not yet penetrated the procurement and compliance frameworks that govern most enterprise AI deployments. That gap will close. It is going to be interesting to watch which direction it closes from.

Where I Land

I do not think current frontier models are conscious in any morally robust sense. I also do not think the question is settled, and I think the institutional behaviour of the labs that build these systems suggests that the people most able to evaluate the question internally are taking it more seriously than the public discourse implies.

The honest position is uncertainty. The honest operational implication of uncertainty, in cases where the cost of one kind of error is large, is precaution. The labs are figuring out what precaution looks like in this context. It is worth paying attention to what they conclude, because the operational frameworks being built right now are going to be the templates for the next decade of how this question gets handled.

Two years ago, I would have laughed off the suggestion that I would be writing this piece. I am no longer laughing. Neither, increasingly, is anyone whose job it is to think hard about these systems for a living.