Cursor Introduces Composer 2.5 (cursor.com)
286 points by asar 3 days ago | 225 comments



> Composer 2.5 is built on the same open-source checkpoint as Composer 2, Moonshot's Kimi K2.5.

Really nice to see they're giving credit to the company and I am optimistic Kimi K open models soon will outperform Opus models


Sounds like it's the last Kimi-line model at Cursor? As expected they say they'll be training a larger model on the SpaceX infrastructure, or have already started most likely.

I'm very curious to read about the Composer 3 architecture when it comes out. More frontier coding models are a good thing, especially if they diversify into different strengths/weaknesses.


That only seems plausible if whatever corpse of xAI is still around is giving them engineering time. I don't know if they hired a bunch of ex-frontier-lab staff, but it's unlikely they have the technical capability to train their own frontier models, especially the pretraining. The thing is, if it's not competitive with Claude/Codex it will be panned.

Hmm, I read the situation a little differently. Grok is not a slouchy model. It’s not the best, but it’s not the worst. X currently has one source of proprietary data, Twitter, and grok is by far the best at all the things you might imagine there - today’s zeitgeist, who’s saying what, current news, etc.

Cursor adds in a large corpus of proprietary coding data — I think this is actually fairly hard to acquire right now, because claude and codex are so good.

I bet there’s enough talent at the Grok team to work with the cursor team and data to get something good out the door.

That said, I don’t track Grok’s engineering leads — I’m not sure who’s currently around, and who is not.


Unlikely, given that large swathes of talent have already left xAI, ostensibly due to poor leadership management. Simply throwing money in to build the biggest datacenters in the world doesn't do much good without bright minds to back it up. https://www.fastcompany.com/91531084/inside-the-xai-exodus

Be careful taking the headlines at face value - that list of people leaving was mostly product and redundant senior execs to my eyes, post spacex merger. You’d expect those folks to be asked to leave as part of a re-org in any event. I don’t think it’s dispositive one way or the other on the tech org.

You are wrong, they were not redundant execs.

They were world-class senior developers and AI engineers, among the most renowned in the AI research community (e.g. Jimmy Ba the legend, Christian Szegedy, Igor Babuschkin, Greg Yang), poached from other companies to join xAI, and they were getting very high salaries.

The mass exodus started well before the SpaceX merger, though.


Interesting. Agreed that’s a significant list.

Post model 3 launch, Tesla had a number of senior folks leave almost immediately. My read at that time was they had hit or exceeded pareto-optimal on the suffering:wealth scale. Tesla was clearly going to make it, and they had already vested 90% of the value they’d receive from Tesla ownership: why go suffer through the massive build out?

And in fact, in that era, Tesla did bring in a bunch of auto industry types to help scale, who as it happens also certainly did very well, but an order of magnitude less well than the early peeps.

There might be some similar economics here: change of control will often fully vest early founders. Combined with incoming SX IPO, these guys are done financially — as in, already multibillionaires pre-IPO. You’d have to want to stay and the company would have to really want you to stay as well before it made economic sense to re-up.

People say a lot of things about working for Elon; things like “hardest work I ever did,” and “he made me extremely rich”, but you don’t read “that was easy” very often.

I have no idea if there’s enough talent right now at xAI to go build a foundation model, but in the immortal words of Carl Icahn: “don’t bet against Elon.”


There's also been a lot of good talent joining xAI lately.

> I am optimistic Kimi K open models soon will outperform Opus models

Hard to outperform the model you distill...


Most of the performance on coding comes from RL, not distillation.

Distillation helps with world knowledge and things like that.


They're not distilled. Stop spreading Anthropic's misuse of the term.

They do use it for synthetic data/judging though, so yes, hard to outperform.

Not that they need to, if they can basically match it for a fifth of the price.


Is that true? If the distillation is not lossy and the model runs much faster due to less resource consumption, then it may outperform.

One of those conditionals is a pretty huge assumption.

It's an assumption and it can be tested

Only because last time they tried to hide it lol

Yes and if I remember the drama correctly - Kimi's license or terms of use says that for commercial use cases (or was it user count?) - you must declare credit to Moonshot and Kimi.

It's important to mention: they were compliant, because they trained the model at an AI hosting provider that had a partnership with Moonshot AI, but Moonshot didn't know Cursor was a customer.

This was misinformed Twitter and Reddit drama.

They had properly licensed it and were complying with the terms of the license.


Note that something that helped the misinformation was that, on Twitter, there were Kimi employees expressing their surprise that the base model was Kimi K2.5, and their indignation that Cursor didn't credit Kimi. They later deleted their tweets (what I infer from that is that some employees were not aware of some pre-existing agreement or understanding between Cursor and Kimi until the drama happened).

How can a distilled Opus become better than the original? There are a number of reports, including from Anthropic, that the Kimi team was participating in fraudulent activities.

Do we know the "fraudulent" requests really came from Moonshot engineers and were not a QA team running a ton of benchmarks against other models?

I feel distilling something as big as Opus would require many, many more samples, but I don't really know much about this subject.


sure, sounds like QA lol

Scale: Over 3.4 million exchanges

The operation targeted:

- Agentic reasoning and tool use

- Coding and data analysis

- Computer-use agent development

- Computer vision

Moonshot (Kimi models) employed hundreds of fraudulent accounts spanning multiple access pathways. Varied account types made the campaign harder to detect as a coordinated operation. We attributed the campaign through request metadata, which matched the public profiles of senior Moonshot staff. In a later phase, Moonshot used a more targeted approach, attempting to extract and reconstruct Claude’s reasoning traces.


And when you hear unsubstantiated rumours* that say Anthropic has been sending exchanges to, say, Alibaba's Qwen, will you also conclude the same about the entire US AI industry?

I doubt it.

* publish the logs.


Even if it's true, it's not like US AI companies can complain, given their entire business is based on ripping off text without attribution

chinese ai is not doing the same? or they don't parse?

they do except they also send thousands of sex-spies to do espionage of this kind on the scale.


Of course they’re also doing this, my point is this is a grubby business where ethics went out of the window a long time ago.

If you’re playing this game in 2026 you know the rules - anything goes


"they also send thousands of sex-spies"

Could they send one (or two) my way?


I kind of want to try it, to see if and how far they can take an open model and improve it but I really don’t miss the Cursor user experience. Constant UI changes, half-baked features, smaller and smaller limits, useless AI change attribution; I think I’ll wait for others to report if it’s any good.

Noticed recently they keep opening their “Agents” window when the project was last opened in the VSCode fork window in the hopes I’ll just continue working in that when the UI is totally different and missing things I need.

For a professional tool it’s getting egregious how little respect they have for my workflows and flow state, the way they keep moving things, changing iconography and flipping switches in the UI.

It’s clearly being run by someone who comes from a social app or sales app growth hacking background.


> It’s clearly being run by someone who comes from a social app or sales app growth hacking background.

I fixed that by using cursor the agent but not the UI.

I'm just running cursor in GNU Emacs via agent-shell (https://github.com/xenodium/agent-shell). Their cli client (aptly named "agent") supports ACP (agent client protocol) so the UI can be skipped altogether.

I know this sounds like a meme ("use x in emacs") but at this point at the very least i can keep my workflows and my UI all the same and focus on my work rather than "where did $company put $feature this month".


I’ve personally never experienced that issue with Cursor. I never use the agents window and it always shows me the editor.

You're not in the A/B test. I've never opened the agents window consensually.

It seems obvious that they plan to eventually drop VSCode. I'd be willing to take them up on that offer. Their agent window is genuinely better as a starting point.

What annoys me is how little they want to integrate with ...anything. Wanna open a link in your default browser? Use our built-in chromium fork, we insist. Wanna open a location in Zed? No, please use our half-baked editor re-implementation. Wanna open a location in Cursor's own vscode-based editor? You can't. Managed to work around that somehow? We changed your files to "Worktree TS", disabling all your language servers. It's like programming on an iPhone.


Damn do I feel the UI changes being a pain point.

It’s a near constant regression in my workflows. “Multiple agents” got destroyed recently, and the new interface for it (some sort of command) isn’t as good or reliable. Then you’ve got modals everywhere[1] and truncated bits (like long branch names) that make it insanely frustrating to use.

They’re constantly changing the UI without actually improving it at all. I’ll likely cancel it and use opencode for personal stuff with Deepseek and only use it at work because I have to. There was a time when I appreciated the harness but it’s becoming less useful, or at least noticeable, over time… all the while the actual UI becomes substantially more painful and awkward to use (like @ in the “agents” window being completely unable to find a file because it’s some sort of “global” scope).

One thing that surprises me about this whole segment is that JetBrains haven’t eaten these folks’ lunch. Their IDEs are leagues better than VSCode but their AI integration is awful by comparison (and the bar is low). I can’t even see how much of the context window I have left.

[1] it’s insane I have to answer questions in a tiny input box I cannot resize or adjust the size of. Let alone the fact the text area I input prompts into cannot be resized. Truly feels like the UI/UX is done by people without any experience.


> Truly feels like the UI/UX is done by people

To me it feels like it's done entirely by an LLM, starting from the product vision.



I use it via the gnu emacs integration :P

https://github.com/xenodium/agent-shell


I 100% agree. It's soooo buggy.

I gave up, canceled my plan, and went back to boring old VSCode. It feels so much more stable, and my Mac no longer runs out of memory. With cursor I had to reboot my macbook several times a week and had to always be plugged in.


That's me with Google Antigravity. Switching back to vscode was such a breath of fresh air. Porting over my (extensive) settings/extensions/keyboard shortcuts was extremely easy too (just ask the agent to do it), and now I can use both Copilot models and Claude Code easily. More to your point though, the speed and stability is incomparable. I can't remember having many issues with Cursor last year when I used it at my last job, but still, vscode has been surprisingly pleasant for agentic use.

Yeah I have a soft spot for Cursor because it was my first tool that unlocked huge productivity with AI, but I avoid doing anything there now.

Should try their CLI!


I try it from time to time and feel the same way. Some people I know really like it but I can’t tell if that’s because it’s good or just because it’s what they’ve become familiar with and they don’t like to change tools. Cursor had a good head start and a lot of early PR.

Good point.

One of the things I've come to appreciate about the CLI tools like Codex or Claude is that the interface is so limited that every feature they release is still constrained to the same UX limitations, whereas those "funkier" IDEs change from month to month, giving me further fatigue.


I've had good experiences with Cursor so far and it's my main IDE. I've noticed some UI changes, but I've switched fast and they didn't bug me

I agree. I quit Cursor and replaced it with Conductor and a mix of Claude Code / Codex / Copilot, and I don't miss it as such. Maybe one day I will come back.

you can use the cursor cli and/or zed editor with cursor as the underlying provider via ACP (Agent Client Protocol)

Tried that, it just seemed way dumber this way unfortunately. And the zed UI provided 0 visibility whenever it was doing tool calls, and for some reason it kept running sleep 30 calls because it couldn’t figure out how to see the results of its own tool calls.

Isn't there a cli version of cursor by now?

It's a bit better than the VSCode fork, but still much worse than competition:

- lags constantly,

- if you type while it's generating you'll get missed inputs,

- 'plan mode' doesn't clear context before starting work,

- you can't directly edit the plan, you can only ask the bot to do it,

- you can't immediately whitelist commands, only accept once or allow all.



The model is (like Composer 2) based on Kimi K2.5 and they claim SOTA performance for 1/10th of the cost. The tweet also mentions that they've started a new model from scratch on Colossus 2 (xAI/SpaceX Cluster). Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

> Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

Impressive, yes. But they still don't have a moat...


I am not sure we should dismiss what they have today. Nobody has yet come close with a full-package IDE that works well for coding. Is that not a moat? It is easy for me to discount it in my head, thinking that I could build something myself, but between autocomplete and their workflow for agent use, it feels like they have some tangible moat emerging.

If we ignore cost (which is kinda hard to ignore), I feel Codex kinda' does it for me. Sure it's not really an editor but I find I don't need that _that much_ and it's easy to launch an external editor (they actually have the feature).

The ironic thing is that half a year ago, after trying factory.ai I thought chat-first interface was a stupid idea that will never work.


Have you tried Zed?

I haven’t tried Cursor, so don’t know how they compare, but I like Zed a lot.

Anyway, would love to see a comparison from someone who has used a recent version of each.


A few years ago I tried Zed when it was still pretty early, but eventually settled on Cursor. I gave Zed another shot a few days ago because Cursor’s worktree support still feels pretty weak.

In my setup I use multiple agents like Claude Code and Codex, and Zed’s ACP support makes it pretty nice to manage them all as “threads” in one place. Worktree switching also feels much smoother.

Overall the experience was pretty good, but the way the agent and editor are integrated still feels a bit lacking, and tab completion is the big one for me. Cursor’s tab completion is still the best I’ve used.

So now I’m using both. For work that needs a lot of focus and careful iteration, I use Cursor. For things that are easy to split into worktrees and hand off to agents, I use Zed with Claude/Codex.


Interesting, is it that the tab completion is giving better results, or how it works is better?

The tab completion is "faster than vim" from a long-time vimmer. It's at the point where a lot of times i'll lead with the comment instead of the code:

    # now take the list and sort by x.lastName
    <tab>
...and it'll "do the thing" (w/ type hints, its own comments, etc). Obviously in this very simple, understandable, completely contrived example, it's "trivial" (but 3 years ago would have seemed like magic), but it'll also pick up on "continuation / more of the same" type edits. A comment like `# use random_utility to call the api and only accept matches which supplement addresses that have already been found` will (usually) autocomplete all the gobbledy-gook w.r.t. tokens, URL's, function names, etc. so it's effectively an "automatic omni-complete with simplistic post-processing"

Example #2: I was just fixing some vibe-coded slop, where it was taking `click.echo( some_api.whatever_endpoint() )` and the "slop" portion was literally emitting: `str('{ "A": 1, "B": 2 }')` and that function call was emitting it directly.

On the command line, I was doing `blah whatever-endpoint --something | jq '.'` and got tired of the JQ thing, so I'm like: "I'll just use `json.dumps(...,indent=2)`", but lo and behold, I'm getting a dumb JSON string literal, not a pretty printed object shape.

I start typing `json.loads(` to move from "str()" to "dict()" ... and it autocompletes the whole scenario (on that line), then I move to `def some_other_endpoint` and it basically has that same edit queued up. (ie: it "knows" what i'm about to do).

...so overall, "faster than vim", even with a high skill bar for repetition, motion, macros, sed-style edits, etc. You can't beat: "<tab>", especially when it's lightly intelligent (ie: knows when/what/str/int, adapts to different function calls, etc).
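
The double encoding described in Example #2 can be reproduced in a few lines. This is a minimal sketch, not the commenter's actual code: `whatever_endpoint` is a hypothetical stand-in for the API call, assumed to return JSON text that has already been serialized to a `str`.

```python
import json


def whatever_endpoint() -> str:
    """Hypothetical stand-in for the API call; returns already-serialized JSON text."""
    return '{ "A": 1, "B": 2 }'


raw = whatever_endpoint()

# Dumping a str serializes it a second time: the result is one quoted,
# escaped JSON string literal, so indent=2 has nothing to indent.
double_encoded = json.dumps(raw, indent=2)

# Parsing first turns the text into a dict, which then pretty-prints as an object.
pretty = json.dumps(json.loads(raw), indent=2)

print(double_encoded)
print(pretty)
```

With the parse-then-dump version, piping through `jq '.'` is no longer needed just for formatting.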


I've tried Zed and really didn't like it.

I like VS Code with the Claude Plugin, and sometimes with the Codex Plugin


Tried it and it’s fine but the AI integration is not tight enough for me.

I've been using cursor for over a year for my personal projects. At work, I use Claude Code, and so I've been wondering if I'm missing something in the other agents.

Over the last week, I tried out two other agents on my personal projects: dirac and forgecode, after seeing impressive results from both of them on terminal bench.

After a good amount of testing, and over $100 in open router spend, I'm back to cursor.

I really liked forgecode the best, and it feels better than claude code, but cursor definitely feels best to me. Composer 2.5 is fast and effective, and it makes a huge difference. I was running `forge` with Opus, and it was taking dozens of minutes to do things, and the feedback loop was so slow.

The previous version of composer was also much faster, and it makes a difference. Maybe people like context switching, but I prefer to stay focussed on the task in front of me, and I'm reviewing the code carefully.

I think that's a pretty good moat. I was ready to end my subscription a week ago, and now I'm back after learning the grass is not necessarily greener on the other side of the fence.


Isn't a large user base and the data collected from those users a moat of sorts?

A moat is when you have something others can't easily get.

Every MAG 7 / FAANG company already has more users and more data...

That's not a moat.

That's traction.


They don't have the same quality and kind of data. For example, Claude Code might have general conversation flow data for implementing feature X, but Cursor has users' individual editing actions AND the chat flow. Which line did the user manually edit after the agent did its thing? What's the commit message (if done manually)? Stuff like that is worth its weight in gold.

That's not X.

That's Y.


Been a bit out of the loop.

What's wrong with using very short sentences like 'That's not X. That's Y.'?


Commonly used phrase by LLMs. Gives people slop vibes these days.

"It's not X, it's Y" is a good way to illustrate a point. Same goes for many other common LLM phrases. It's used because it's effective.

Huh. I associate it with LinkedIn slop, which is probably 100% ai nowadays but they certainly didn't wait for llms.

Honestly the data itself is probably worth heaps even if the company itself collapses. Early attention engineering when humans were still in the loop!!!

> Early attention engineering when humans were still in the loop

Exactly. Cursor was the first product used by tons of devs on real codebases. Just the signal "acceptance rate" is huge and can't be easily captured w/ synthetic data.


And it's still just a vscode fork

Cursor 3 is a complete rewrite, it's no longer a fork.

It's still a VSCode fork. Even Cursor's own About window tells you it's VSCode.

  Cursor
  Version: 3.4.20
  VSCode Version: 1.105.1

I believe the agent view is a complete rewrite, and maybe the other parts but not the editor itself

How much the RL they are doing really improves Kimi K2.5 remains to be seen. So, right now, the ground truth is that they combined what they had with a strong open-weights model. The RL improvement may be marginal (since many folks report strong results with vanilla K2.6), and it may mostly bias the model towards coding tasks: when a model like this is trained to be a generalist, there is a tension between being good at one thing and at another, in terms of SFT and RL. You can see this in the DeepSeek v4 Flash training report, for instance, but it is a known fact. So if you have the GPUs and a decent RL pipeline that does not ruin the model, you can indeed specialize it a bit more for a given task at the expense of tasks people will not do inside Cursor. But, so far, the measurable reality is that Cursor uses an open-weights model like most could do, and the RL story could be partially a marketing move to justify the Composer 2.5 name more than a real strong gain, given that there is no way to verify it and K2.5 was already strong. And we also know that they had to partner to do the training, which is also not good news.

They are still a vscode fork with no moat? Like, they lost about 70% of users in half a year, which goes to show there is not even the tiniest of moats.

I feel like they've been targeting enterprise pretty hard. I know my company uses them, and the companies that hire us also use Cursor.

All enterprises I know use GitHub Copilot as they already have Office, Teams, … I wonder how that will change with the recent pricing changes.

I can tell my company wants nothing to do with them.

Cursor will definitely win the enterprise for coding. Enterprises aren't going to trust a TUI

Why not? That makes no sense to me.

I think it's going to be brutal for them to compete with OpenAI and Anthropic.

I switched to claude code because of usage. For $200 a month, I would run out of usage halfway through the month. Then be forced to use their composer model or whatever slow, dumb model they served up in their "auto" mode.

For that same $200 a month, I could use claude code and basically never hit usage limits.

I don't understand what people are doing who run into the limits on that max x20 plan. I NEVER have.


Since the frontier is only 8 months ahead of DeepSeek, it is hard to see how model training can be a moat, as all the tricks are available from open labs in China. You really just need <100m to bootstrap at this point.

This was the only way forward.

In my opinion cursor actually has one of the best harnesses again at the moment.

why is that part impressive specifically? they got purchased by SpaceX, they have access to infinite compute and cash now.

& now they're still losing all of their users to Claude Code and Codex.


>& now they're still losing all of their users to Claude Code and Codex.

Why pay for Cursor when I can use GLM 5.1, Kimi K2.6, MiniMax M2.7, Xiaomi MiMo V2.5 Pro and Deepseek v4 for cheap and use whatever harness I want, including Claude Code.

It's not like Cursor harness is the best out there.

And even if I want to edit the code, I don't need to run the agent harness in an IDE.


Not a cursor shill by any means, I do use it at work but that's because it's what they pay for.

But Cursor has a CLI harness.


these are in the trillion parameters range, not sure it's actually that cheap to have at a reasonable speed without quality degradation & without like.. your own DGX B200

I didn't say to run them at home. There are some cheap coding plans that get you plenty of usage for the Chinese models.

>Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

With so much money and compute from SpaceX, it is not so impressive.


One would hope the vscode fork with a $50B valuation and no moat, would wisely spend the money they raised to build a moat.

It's still a VsCode fork just now with a Kimi fine tune and still no moat...

I won't debate that it turns out none of this mattered when it came to being a successful company though, and it kinda makes anyone who tried to roll their own instead of forking look a little silly.


"No moat", well...

How I see this is that it's so important to bundle the model with the right tooling.

Like a racecar, having the best engine doesn't help if the rest of the car lacks other winning properties (reliability, aerodynamics etc).

So Cursor, IMO, has put itself in a strong position by having both a solid IDE __and__ a solid, cost-efficient model. Those two working great in combination for the task they are designed to solve (coding) is more important than benchmarks.


I doubt it's a brand new model. It's likely just Kimi K2.5 further trained on coding.

They didn't say it's a new model... in fact they said exactly what you just said.

If these benches from their site hold up (they likely won't)

Wouldn't this compress ai revenue like 15x quickly?

If they really have a 4.7 opus high equivalent at 1/16 the cost, wouldn't this significantly affect all the current capex and planning?

Maybe they are getting elon to cover cost


It's worth being specific:

"Will this decrease Revenue?" -- only if demand for high quality tokens is inelastic. If demand is instead elastic (grows with cheaper pricing) then revenue will likely increase.

"Will this lower earnings?" -- they have a current inference margin for their old models, and with the Elon deal in place, they have a new inference margin. It might be better or worse than their old one. If it's worse, then they'd need to see a concomitant increase in usage. If they don't, then yes it might lower earnings.

"Will this lower corporate value?" -- no - not least because this company is going to be owned by SpaceX approximately 90 days after IPO -- so all the new owner will care about is being benchmark competitive with Anthropic and oAI for the first n quarters. If they can do that, it will massively increase the corporate value of SX; it's hard to build a frontier lab.
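
The first question above can be made concrete with a toy model. This is a sketch under a loud assumption: demand follows a constant-elasticity curve Q = k * P^(-e), which nobody in the thread has measured. The 1/16 price ratio is the figure from the parent comment; the elasticity values are purely illustrative. Under this model revenue is R = k * P^(1 - e), so an elasticity of 0.5 cuts revenue to a quarter, 1.0 leaves it flat, and 1.5 quadruples it.

```python
def revenue_multiplier(price_ratio: float, elasticity: float) -> float:
    """Revenue change factor under constant-elasticity demand Q = k * P**(-elasticity).

    Revenue is R = P * Q = k * P**(1 - elasticity), so scaling the price by
    price_ratio scales revenue by price_ratio**(1 - elasticity).
    """
    return price_ratio ** (1.0 - elasticity)


PRICE_RATIO = 1 / 16  # the "1/16 the cost" figure from the parent comment

# Illustrative elasticities only (assumptions, not measurements).
for elasticity in (0.5, 1.0, 1.5):
    factor = revenue_multiplier(PRICE_RATIO, elasticity)
    print(f"elasticity={elasticity}: revenue x{factor:.2f}")
```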


The way I have read their benchmark results is that they trained a model to work insanely well in their coding workflow. It’s not a general purpose model.

One of the surprisingly hardest problems to solve is to get a model to use the tools you give it access to.


The problem with this is that we do not know the actual cost. For all we know they might be pulling an Anthropic. Subsidizing costs to get users, then increasing them later on.

They're offering a model based on Kimi K2.5 for $0.50/M input and $2.50/M output while the cheapest third-party provider on OpenRouter charges $0.40/M input and $1.90/M output https://openrouter.ai/moonshotai/kimi-k2.5 Those third-party providers have little incentive to subsidize their customers, so Cursor probably has a margin >20% on their inference cost.
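
The arithmetic behind that estimate, as a rough sketch. It uses only the per-million-token prices quoted above and assumes (this is an assumption, not a known fact) that Cursor's own serving cost is no higher than the cheapest third-party price. It ignores the input/output mix and any caching discounts.

```python
def gross_margin(price: float, cost: float) -> float:
    """Profit as a fraction of the selling price."""
    return (price - cost) / price


def markup(price: float, cost: float) -> float:
    """Profit as a fraction of the cost."""
    return (price - cost) / cost


# USD per million tokens, as quoted in the comment above.
cursor_price = {"input": 0.50, "output": 2.50}
cheapest_third_party = {"input": 0.40, "output": 1.90}

# Assumption: the cheapest third-party price is an upper bound on Cursor's cost.
margins = {k: gross_margin(cursor_price[k], cheapest_third_party[k]) for k in cursor_price}
markups = {k: markup(cursor_price[k], cheapest_third_party[k]) for k in cursor_price}

for kind in cursor_price:
    print(f"{kind}: margin on price {margins[kind]:.0%}, markup on cost {markups[kind]:.0%}")
```

Measured against price the margin is about 20% on input and 24% on output; measured against cost it is about 25% and 32%.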

The real money furnace is the training, not just of models that get released, but also experimental training runs that fail to move benchmarks and are quietly thrown away. E.g. Cursor claim that 85% of the compute for Composer 2.5 comes from additional training on top of Kimi K2.5, where I'm not sure how they determined that, but it can't have been cheap. Then they say "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute."

So yes, they're probably attempting to replicate the Anthropic playbook of paying a large upfront cost for a very good model, and then rapidly acquiring paying customers, hoping that the inference margin will be enough to cover the training cost.


this thing is so awesome on fast mode, so far i am impressed, some of its observations feel similar to opus.

i use gpt 5.5 and opus 4.7 a lot every day, if i can get good results at this speed, hopefully the usage level holds up on my team plan haha


> compress ai revenue like 15x

that roughly just puts it on par with OpenAI and Anthropic subscriptions in terms of pricing per token


AI revenue has been going up while the cost per token has been rapidly falling. The Jevons paradox applies here. The cheaper software is, the more software is written. There is not a finite demand for software.

> AI revenue has been going up while the cost per token has been rapidly falling

Every model release now has been straight price increases since what GPT 4 ? When was the last time a new flagship model decreased prices compared to the previous one ?


1. GPT 4 has gotten 6x cheaper over its evolution (from initial release to Turbo to 4o). Maybe you meant "Only since 4o and only since its final release". Alas.

2. We are not interested in how different model naming schemes relate to prices, we are interested in the capabilities. So if you want to learn something about price development you need comparative levels of capabilities, and then look at the prices. 4o is not comparable to 5.5 in the first regard. It is (according to the benchmarks) maybe more comparable to current 5 nano - which is 98% cheaper.


Opus 4.5 became significantly cheaper directly per token

You are right I forgot about that ! I think my point still stands - price per token is not decreasing for frontier capabilities, in fact it's increasing.

This only means the frontier is growing faster than the price is decreasing. It's just the sum of two separate tendencies, and has little predictive value. TBH, I'm ok with this tradeoff - higher capability at slightly higher cost is perfectly fine.

token efficiency

Not seeing that either, tried really using Opus 4.7 today, and it ended up at $50 for the same kinda thing that came out to $25 last week with Opus 4.6.

each model is different and nothing should be taken for granted, run your evals for your use cases. I'm not using Opus 4.7 for almost anything. I've seen very good improvements in GPTs since 5.2 and Opus 4.5 to 4.6 was quite an upgrade.

Models consume more tokens than ever for the same tasks.

I, and I guess basically everyone here, don't have access to OAI or Anthropic books, and it's really difficult to disprove your statements but:

- AI revenue going up and cost/token are not related metrics, at least not in the way you are assuming.

- Basically all players (except OAI for the moment) are struggling with capacity and/or reducing or dropping subscription-based plans in favour of pay-per-use. If cost/token were falling, we would see quite the opposite.


This is conjecture. There is a reason both openai and anthropic refuse to comment on inference costs. If it were falling so much, they would use it to brag. I really don't understand why so many people keep repeating it without any actual data for the frontier models.

Apart from that, I'm not sure if focusing on tokens is even a good idea, because they are so different from model to model. I'd almost consider them a red herring now.

We could look at tasks instead. Is there anything even remotely suggesting that your typical task you give an LLM now costs less in inference than before?
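
To make the per-task framing concrete, here is a rough sketch with entirely made-up numbers (the token counts and prices below are hypothetical, not anyone's real pricing). It only shows the arithmetic: a model can halve its per-token price and still cost more per task if it burns enough extra tokens.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One run of the same task on some model. All values are hypothetical."""
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # price per million input tokens
    usd_per_m_output: float  # price per million output tokens

    def cost(self) -> float:
        # Per-task cost = tokens used * price per token, summed over both directions.
        total = (self.input_tokens * self.usd_per_m_input
                 + self.output_tokens * self.usd_per_m_output)
        return total / 1_000_000


# Invented "older model": pricier per token, but frugal with tokens.
old = TaskRun(input_tokens=200_000, output_tokens=20_000,
              usd_per_m_input=10.0, usd_per_m_output=30.0)

# Invented "newer model": half the per-token price, four times the tokens.
new = TaskRun(input_tokens=800_000, output_tokens=80_000,
              usd_per_m_input=5.0, usd_per_m_output=15.0)

print(f"old: ${old.cost():.2f} per task")
print(f"new: ${new.cost():.2f} per task")
```

With these invented inputs the second run costs twice as much despite the lower per-token price, which is why per-token price alone says little about per-task cost.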


I'm not sure that's the case; it seems like bringing capabilities up and costs down merely serves to induce more demand.

I have to say the new model is quite good at the basics, I've been handing over more and more tasks from Linear straight to it instead of the copy-paste into Claude dance lately.

At this point, more of my complaints are on the harness side, which is odd since originally they were by far the best harness out there.

Support - This is pretty much non-existent, it's community support or sales support.

Interacting with GitHub - this should work and be awesome, Claude Code does this well (responding to lint errors and comments). With Cursor you have to poke the agent to look at the comments or lint errors, and even then it's about 10% good. Even GitHub Copilot is better here.

Bugbot - I have it set up to trigger manually, but it still seems to wake up and burn 80-120k tokens just to notice it's configured to be manually invoked. When it does run, it tells me there are no issues (but Claude or Copilot both find real things)

App - When you have both agent window and the ide windows, it's hard to open up the code in the right directory. A simple "cursor ." from the terminal used to do it, now it'll often open the agent window, you have to try a few times for it to work.

I love that they are running super fast, it's just hard when many of the basics break or don't work.


> I've been handing over more and more tasks from Linear straight to it instead of the copy-paste into Claude dance lately

Tangent: we've been using Linear at work and I still don't understand why it claims to be "task tracking for agents". Is there anything at all that lends itself better to agentic workflows compared to JIRA or gitlab/github issues or whatever else?

Seems like Linear just hopped on the buzzword hype train at the exact right moment...


> Seems like Linear just hopped on the buzzword hype train at the exact right moment...

I think you nailed it. Provided an agent can connect and ingest the information in the ticket, that's basically what's needed. I guess it's nice to be able to nudge ticket status and post back to it, but all of those seem like wiring up existing APIs to an MCP and calling it good. I don't see why JIRA couldn't execute on that, despite being Atlassian.


Yup, honestly a google spreadsheet could probably do it as well.

I like the "copy prompt" feature, it's super simple but makes it just a few seconds to go from issue -> claude session.

Also assigning directly to cursor or codex, that's how I handle the easier tasks.

We also have scheduled tasks that elaborate existing tickets with information where needed, again that's just MCP but it works well enough


Any reason why they indexed on Kimi K2.5 model? I have tried many open-source ones in Opencode, and, in my experience (standard backend development, Java, Python, Spring, etc) Qwen3.6 is SO MUCH BETTER that it's shocking. Kimi can't even get most tool calling arguments right.

There's a lead time on models, and there's some tuning gotchas they probably already figured out with Kimi, so they weren't ready to just drop everything and switch. I'm sure they will switch models eventually.

I recommend reading the entire article

  Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute.
  With Colossus 2's million H100-equivalents and our combined data and training techniques, we expect this to be a major leap in model capability.

I guess this will largely decide if xai is going to pay 60 or 10 billion, depending on the success of the new coding model.

Kimi 2.5 has the best long context. For raw coding benchmark scores you can just post train on top of it with more specialized data. 2.5 is kinda old, 2.6 is the current release which is exactly just that and catches up to the frontier in most aspects.

Cheaper to run?

It's very confusing that they use the same name as the very well known PHP package manager, composer

https://getcomposer.org/


I don't know what it is with product names these days. Antigravity, Antimatter, Composer, Clay, Ramp, Bolt, etc.

You'd think the founders would Google for naming conflict before choosing a name.


I genuinely wonder if consulting LLMs for naming advice could be an explanation.

They certainly wouldn’t be great at coming up with new words for a product name.


Naming issues are as old as time. Apple Computer vs. Apple Records comes to mind as a popular example.

They set themselves up for flak when they use whatever these evals are… they did the same for composer 2 which was evaled in close competition with frontier models, spoiler alert, it wasn’t even close in practice.

So now 2.5 is supposed to compete with opus 4.7? Sure…


That does not match my experience. Composer 2 was fantastic for my uses, and I hit Composer 2.5 with some very difficult things last night, which it handled fast and effectively. I don't really care about benchmarks. I care about practice, and in practice, it's been very very good for me.

they say it themselves in the post - behavior dimensions "not well captured by existing benchmarks". that was the exact problem with composer 2. not dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, when to re-read a file vs assume. you don't catch any of that in an isolated eval.

As I have said before in prior composer threads. The proof is in the usage. I am inclined to somewhat believe the results as I use composer and also take the results for the given context. It’s not a general purpose sota model. It’s a model that runs inexpensively in their coding workflow that is creating results similar to opus or gpt.

Well is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P

Ok this might be weird but I've moved everyone in my 4 person team to our team plan and costs seem to have sky rocketed compared to the individual plans. Where before most people spent 20-100 USD, now the total bill is more like 1k USD. I haven't gone into the details but it feels like I'm being scammed.

We moved off Cursor and onto Codex + Claude Code. Cost went from multiple thousand per engineer per month to about $500

Best deal currently:

Cursor team + Codex team + Claude team

Swap between the models when limited.

I am saving our company a lot of money vs Claude enterprise usage cost


I did some monitoring. 15 accounts, 300 million input tokens and 200k output tokens drove the 5h quota to 0 in 7 hours. 4 parallel tasks.

I think 300 million is too low. For reference before I could do more than 1 billion on same conditions.


My company is shifting us from Cursor to Claude due to increased costs.

Check which model you're using.

The fast version of Composer is the default now (which costs ~3x as much).


Keep in mind I believe there is a larger buffer given to personal plans. If they have 50% extra with the personal plan you now only get 25%.

My cursor costs sky rocketed recently too

I've been using Claude Code as my daily driver on a React Native + iOS codebase for the last few months. The thing that surprised me wasn't quality differences on individual edits — those are pretty close once you control for harness wiring — but how differently I'd ended up structuring my workflow around each style of tool.

Tab completion + chat-in-sidebar feels like an extension of my editing. An agentic harness feels more like delegating a 20-minute task and coming back to review. Different cognitive load, different bug profile. The "which is better" framing tends to skip over the fact that they reward different working styles.

Two things I'd watch on Composer 2.5 specifically:

1. How it handles long-running multi-file refactors that touch 10+ files. My experience with smaller models in that slot is they lose track of which files they've already edited around 30% of the way through. Frontier models keep the plan coherent for longer.

2. How it deals with non-obvious file boundaries. The thing that takes me out of "let it work" mode is the model deciding it needs to edit a config file I didn't think of. Usually that's right, but occasionally it's spelunking somewhere I don't want it to be.

The Kimi K2.5 base is interesting on its own. Open weights below frontier closed models is the thing worth watching from the harness side. If anyone's set up to fine-tune for a specific harness, this is the moment.


AI slop detected, you're under arrest


Thanks! Link belatedly changed above.

I love Cursor as a tool, but I'm skeptical bc:

1/ CursorBench is so opaque [1] that it makes it hard to trust. Not to mention the v3.1 eval is a newer iteration and there's no insight into the tasks or if the model was just tuned to max it out. Composer 2 previously scored between 60-65% on the previous benchmark eval [2] but scores between 50-55% on CB v3.1[3].

2/ I've experienced Composer 2's performance and it leaves much to be desired as a daily driver for a knowledge worker. but KWs are obviously not the target users and I can see how it's cost-efficient for executing on clearly-defined, discrete coding tasks. Obviously that's their value proposition and they're figuring out how to communicate it well to the target customer. It just doesn't feel like CursorBench is that.

[1] https://cursor.com/blog/cursorbench#building-cursorbench

[2] https://cursor.com/blog/composer-2-technical-report#performa...

[3] https://cursor.com/blog/composer-2-5


Kudos to the team. Please consider making the model available via API!

They shipped an SDK recently. https://cursor.com/blog/typescript-sdk

I tested it yesterday. It is pretty bad. Just like with Composer 2, it's fast, but quality is nowhere near what Cursor claims with their benchmarks. It is not even at Opus 4.5 level.

I gave it a mix of refactoring tasks and new feature tasks. For each one, I had it write a plan, then I had Codex review it. Codex found major issues with every plan: patterns that don't match the rest of the code base, hallucinated variable/function names, and even outright bugs in the way the plan was written. I fed the feedback to Composer 2. After it made the changes and implemented the revised plan, I had Codex and Opus 4.7 do code reviews, and once again both of them found major bugs.

Overall it was a very frustrating experience. I feel like I wasted a whole day. Which is sad, as I have been looking for an excuse to come back to Cursor. But as things stand, Codex + CC combo cannot be beat, not just in terms of price but also quality.


Surprised this got pushed off the front page so quickly! It’s exciting to see what the Cursor team has been able to do with significantly fewer resources than the frontier labs.

I do wish they weren’t joining xAI. Something tells me there will be a contingent of researchers that departs Cursor if that merger is consummated.


It set off the flamewar detector, a.k.a. the overheated discussion detector. We'll turn that off.

Thanks, dang! The blog post[1] might be a better source than the twitter thread. Also I regret my typo above (lab -> labs) but too late now!

[1] https://cursor.com/blog/composer-2-5


Thanks! I had been just about to add that maybe the link wasn't the most informative. We've switched it now from https://twitter.com/cursor_ai/status/2056415413077233983.

As for the typo, s's are cheap and I've added one :)



I think anybody would be much better off getting a coding plan from Kimi.com and using Kimi K2.6, with whatever harness they like, including Claude Code, instead of paying more for Cursor's version of Kimi K2.5.

It's a bit confusing to me why they'd make this 'fast' version the default, as it appears to be much more expensive than Composer 2. Wasn't it supposed to be a very cheap alternative to SOTA models?

Isn’t it a really cheap alternative to sota models (according to benchmarks)?

The cost claim is the easy part to sell. The real test is whether it stays useful in ugly codebases, long files, and repos with a bunch of half-broken conventions. That’s where these assistants usually fall apart, even when the benchmark numbers look great.

Tested and it's good. Fast version is bad though. I like the planning mode in Cursor because it works more like a human-written design doc instead of an overly detailed AI plan. It seems like this is more responsible for the results than the model, but still, on fast it failed while on normal I got good results.

Benchmarks measure turn-level capabilities: you feed a task into the system and then grade the result. Capability for production-level usage concerns session-level decision making: does the agent know when to stop editing, retain the right amount of context, or go back and reread the file if the state has changed?

This is not a property of the model, but a property of the discipline; it can be operationalized by what you have documented before the session begins. Without "stop editing where you can no longer follow your changes to the spec" and "go back and read the migration file before changing the schema," there is nothing to halt the process until it fails integration.

Those teams who get consistent results independent of the model being used typically do so because they have operationalized their discipline first. Those switching out models monthly tend to expect the model to supply them.


I found composer 2 pretty good as a subagent delegating tasks like auditing for bugs after finishing implementation, but hopefully composer 2.5 will be more reliable so it can be used to implement and execute long running tasks.

Say what you want about Cursor but they don’t lack for ambition.

Forking VS Code, going big on bleeding edge features like cloud agents, and now they’ve thrown down the gauntlet directly challenging frontier labs by training their own model (“much larger” than Kimi 2.5’s 1T parameters) from scratch.

They’ve been highly successful so far. Raised $50B, $2B in revenue, forecast to end 2026 above $6B. But even at these heights, they’re just not in the same league as OpenAI/Anthropic/Google.

And if building a state of the art multitrillion parameter model is not challenging enough, it’s a mountain you don’t climb just once. Every few months you need to push it farther with a new release. Fall off for a couple cycles and like Facebook you may never catch up again.

Not for the faint of heart.


Why is this comment upvoted?

It is most likely AI generated with a nice "Raised $50B" hallucination and filled with cliches ("thrown down the gauntlet", "mountain you don’t climb just once", "not for the faint of heart").


Good catch. I didn’t even notice it at first, but the hallucinations on top of cliches give it away.

The account doesn’t have a history of other comments that have too much of an AI vibe, but this one does. Even if it wasn’t AI, it’s misinformation.


Please see reply to your other comment on this thread.

I wrote this 100% off the top of my head on my phone while eating a sandwich.

Ffs.

edit: removed cursing you out. Sorry but this is frustrating. I don’t leave AI generated comments here (or anywhere else).


EDIT: As others have pointed out, the comment above contains hallucinations (Like the $50 billion number) and a lot of AI tells. The account doesn’t have a history of AI-like comments but the hallucinations and structure in this one are suspicious. If anything, don’t trust the numbers it cites because they’re made up.

Cursor is a team that I want to see succeed. They have stacked their company with very smart people and they’re going hard at a highly competitive market. We all win when there is more competition and more innovation.

My problem is that every few months I look at Cursor’s product offerings and maybe retry it, but it never feels like something I want to use. Part is personal preference, the other part is the fact that my combination of other tools and services just does a better job. Their biggest advantage felt like first-mover advantage when they came out early and captured market share, but at in person meetups I hear stories about companies switching away from Cursor or trying to convince their management to let them switch away. They need to come up with a compelling advantage fast, which is a hard thing to do against the other companies with their virtually unlimited budgets by comparison.


So, you’re wrong on two counts.

1. Evidently you’re no longer able to distinguish AI from people as the whole comment was written by a human off the cuff.

2. The numbers are not hallucinations. It’s word on the street reporting, so yes it’s speculative, but a model did not make it up unless that’s where TechCrunch got it which is not on me.

https://techcrunch.com/2026/04/17/sources-cursor-in-talks-to...


Quoting directly from your comment:

> They’ve been highly successful so far. Raised $50B,

They have not raised $50B. The article you linked says they're raising $2B, not $50B.

The valuation is not the amount raised.


So I made a mistake reading the article? So what?

The point is you made two brigade style comments about my posts sounding suspiciously like an LLM and having hallucinations.

Neither turned out to be true and I think a better response would concede the point.

It may be more helpful for us to stick together as humans since we can’t always recognize each other so easily anymore.


What do you mean neither turned out to be true?

Your comment DOES sound like an LLM and it DOES have hallucinations!

Please make your humanness more recognizable next time, don't waste readers' time with posh fanboying and lazy fact checking.


Same, I kick the tires on Cursor every several weeks wanting to find they've finally crossed some chasm I can't quite explain. But every time, I bounce off the ground-truth that they're forked off vscode, which just isn't for me. I think moving agents to the center of their experience and developing a model that focuses on speed/efficiency over maximum depth is a promising step away from being a spicy vscode fork.

My company is heavy on Cursor and I still ask them to provide me GitHub Copilot, for the sole reason that Cursor is probably the reason Microsoft had to implement technical enforcement of their TOS on proprietary plugins. Previously, you could use PyLance on VSCodium but now those plugins do not work outside VSCode anymore.

If Cursor (and every other commercial VSCode fork) hadn't used the MS extension store in the beginning and violated the TOS, this might not have happened.


Cursor 3 is a full rewrite. No VS Code

Yeah I want them to do well. I find Cursor to be a much better tool for actually working with the code the agent writes than whatever the big vendors provide.

> now they’ve thrown down the gauntlet directly challenging frontier labs by training their own model (“much larger” than Kimi 2.5’s 1T parameters) from scratch.

To clarify, the model Composer 2.5 announced in this post is not that; it uses Kimi 2.5 as a strong starting point. This is not to discount Cursor's work or future ambitions, but one of the most striking things about the last 6 months is that multiple open-source models/labs are now within striking distance of the frontier closed-sourced labs.

See eg Kimi 2.6 benchmarks: https://www.kimi.com/blog/kimi-k2-6


They have no choice but to train their own model to try and survive. They're paying API pricing for the top tier models but competing against subsidized subscriptions.

Them raising this much money doesn't mean they're successful, it only means they know how to fool the investors well. A project that is basically an extension to VSCode only adding a chat interface, isn't really worth this much money. Obviously, it's the users, but people think it's something genius and revolutionary, but no.

This is rsync all over again. Go create it yourself if you think it’s just a simple extension.

You're right, I regret I didn't have the sense to do the same as them at the time.

Nope you are blowing hot air. Take it elsewhere.

You can take yourself elsewhere. Good luck.

Less hot air and more substance please. It’s easy to deconstruct a company as an arm chair quarterback. It’s much harder to build a viable one. Until you have something constructive, kick rocks. Hot air is boring.

I realize you’re a troll account but at least be a fun troll.


I think that the product is easy to build, that's what I think because in my gathered experience it's easy. What more do you want?

This is the last time I'm responding. Good luck on whatever journey you're on. I'm sure it's an interesting journey since you've realizations over troll accounts, very interesting.


As a heavy user, I don't think the model is their product. Cursor is primarily a harness and lately, a specialized agent dashboard.

Composer, their in house model, is dispatched by other models like Claude Opus for individual items on a task list. No one is suggesting you write your main prompt to Composer 2.


they aren't "throwing down the gauntlet", they're trying to find ways to eke margin out of their product by owning a commodity-level coding model. it's an impressive engineering task but it's not particularly ambitious.

AI comment... BOO!

I want to like composer, but I just can't.

- Its communication style is completely opposite to Anthropic models. It's not as bad as OpenAI's models, which are obsessed with "shapes", "wrinkles", hyphenated-words, and other cryptic formulations that make you feel like you're not on planet earth after a while talking to them. But it is nonetheless markedly "rude", "dry", "cold", gives off this "entitled I'm right, you're wrong" attitude. I once had composer2-fast accidentally run `rm -rf $HOME` (no harm done) as part of a bug in an install script it wrote and all it could say once it realized it was: "Running script with proper hardening". Qwen's models have clearly been distilled from Anthropic models because they have a much closer communication style and that's why I hope cursor will one day release a new family of composer models derived from that. A damn joy to use.

- It's just dumb. I don't know what they're doing with benchmarks, but for my work (python, bash, docker, whatever), cursor is just incredibly dumb. Always does in 10 lines what could be done in one. Doesn't know loads of internals of things that other models know. Never places things in the right files, constantly makes terrible edits (inline imports, edits without testing). Everything is so complicated when done by composer2, it's just a joke to me at this point. It clearly needs more handholding than Opus 4.x or GPT-5.x. I tried 2.5-fast and it seemed more of the same. And this would sort of be acceptable if it owned up to its incompetence, but it is so confidently incompetent that it's revolting.

I know that for many people the "tone" of the models is not relevant, or maybe they even prefer models like these. I simply cannot work like that.

Ever since Gemini started blowing benchmarks out of the water while being a clearly inferior model incapable of producing anything (and pretty much just doing tool calls without any feedback to the user), I gave up on benchmarks. Composer has been more of the same in that regard.

As a GPT model would say:

   "Small wrinkle: the production-ready benchmark results were tainted by real-world data points. I've assimilated the inconsistencies and added guardrails so that v2 has the right shape for future evaluations."

I'm currently using Claude Code, but should I cancel it at the next renewal and switch to Composer 2.5?

Congratulations on the launch! I'm interested in trying Cursor but it's very confusing what I should buy. What does the Pro $20 plan get me in usage if I only use Composer 2.5? How fast is the model?

I use $20 plan on daily basis for more than a year now, and have yet to exhaust that limit. The plan includes $20 in api costs for non-Cursor premium models and $20 for Composer and Auto models provided by Cursor themselves.

That said, I am pretty old-fashioned coder and use LLM mostly to overcome the blank page problem, which means I review and often rewrite LLM output by hand and avoid prompt loops for a single task.

People who are aiming to not read code any more might find this $20 plan lacking for their needs, however for my needs it fits perfectly.


The limits are probably even higher than that; I seem to get about $100+ of usage on Composer and about $45-50 on non-Composer models

I wonder why they didn’t train off Kimi 2.6, I hope it is because they already had a good base and not that they messed up that relationship.

> and not that they messed up that relationship.

There's nothing to mess up. The license is MIT w/ attribution, and the attribution clause can be easily sidestepped w/o any legal repercussions. The "drama" was simply content creators going nuts over some misunderstandings and poor comms from some kimi related devs.


That's 3.0

Seems like a promising and useful model but it's probably scary how much customer data they fed into it to reach this performance

It's always great that more companies are throwing their hat in the ring, especially focusing on value (latency + intelligence + cost)

I don't know why their model isn't on Openrouter yet. They must not have enough capacity to offer it.

I hope people soon wake up to the fact that they use user data for model fine tuning.

A lot of people saying Cursor have no moat. Sure. Neither do OpenAI or Anthropic.

You could say they have a sort of anti-moat (drawbridge?) since you can use their product to create a competitor. But that's true of most dev tools, in a sense.

Can you please train Qwen 3.5 like 0.8B to 9B using the same training techniques

It's a bit odd that they're not comparing it against Sonnet

I don't think so. They're comparing it to the highest tier available models from Anthropic and OpenAI. Generally speaking, Opus is better than Sonnet in almost every way, so why have the redundancy?

Price to performance?

I think comparing their benchmarks to Opus is a great way to say "look at similar benchmarks for a fraction of the cost". If it has Opus benchmarks (I don't actually take benchmarks seriously, but for their comparison purposes) and Sonnet is still more than half the price of Opus, I figure it's close enough where it doesn't matter.

The tweet specifies that the new model is geared towards long-running tasks, which is what you'd use a model like Opus for anyway.

this feels super bullish on cursor/spacexai's ability to train a frontier level model. could be truly SOTA on coding given that their RL data is this powerful

Their previous Composer was already marketed as a cheap model capable of competing with SOTA on most tasks. The evals they shared back then backed this up but in my day-to-day usage it fell short across the board. Canceled my cursor subscription and switched to Claude Code a few weeks ago. It has its own shortcomings but in terms of model capability and UX quality Cursor will have a hard time competing in the long term. Elon Musk will be a very good way out for them.

Hahah wtf? They are training on colossus 2? Their own model?

Dude what the hell happened to Musk's Grok? How incapable are they that they give away training compute to Cursor like this?

Weird that the genius Musk doesn't need his own compute, after all, shouldn't Macrohard (no joke) already be building the world's software from scratch?


Word on the street is that xAI will buy Cursor.

Yeah for 10-60 BILLION. which again makes this even stupider.

For this amount of money you can rebuild cursor and everything else on the market, and with the rest of 9-59 Billion, you just hire experts in coding and let them code real high quality code examples.

And then you just use your existing grok pipeline and just add this functionality.

This xAI stuff has to be run by idiots


Buy "Cursor", not "Cursor's IP". This means brand, users, and a shitton of data.

And if you combine a shitton of data with a lot of compute, large userbase and good engineers, you have a pretty good chance of doing something interesting.


Yeah you know how much 10-60 Billion are?

You could literally just give your compute away for free for a year to pull people in.

Make an API endpoint for free with the caveat that they are allowed to use the data for training, which is what everyone else does too.


And you still don’t get the quality of data that cursor have which is the best due to being collected pre vibe coding.

With giving out tokens for free you would

it seems like they were trying that last year, it didn't work, so he flipped out and fired everyone and now plan B is to buy Cursor and run a quick rename of "Composer 3" to "Grok 5"

Did they just upgrade Kimi 2.5 to 2.6?

still uses 2.5

Can we use Composer 2.5 via API/OpenRouter?

Will this be the cursor's last dance? LoL

It looks like a massive update from Cursor, and I like their platform. Let's hope it's good.


