Could Grok-Review Replace Peer-Review? – Watts Up With That?

This article introduces the idea that instead of (or as well as) peer-review, scientific papers could be reviewed by Grok, Elon Musk’s AI product. Well, maybe that’s not a new idea, but I haven’t seen it anywhere.

I apologise for the length of this article, but at least it is shorter than some of Grok’s responses(!).

My suggestions for Grok-review, drawn from the material below, are that:

  • A paper’s author should get a Grok-review prior to submission for publication.
  • Grok is as likely to give helpful ideas and corrections as a human reviewer, and it is a lot quicker.
  • Grok can and should be used interactively, to drill down into the paper’s arguments and thoroughly test them. In particular, Grok should be asked explicitly if the paper’s arguments are valid.
  • Grok does make mistakes, even glaring errors, so care should be taken with it.
  • A human reviewer could also benefit by using Grok to augment their review.
  • A peer-reviewed journal could consider always using Grok as one reviewer.

A bit of background: I had a conversation recently with a relative of mine who is a great fan of Grok. They are particularly impressed with how competent and unbiased Grok is compared to other AIs (ChatGPT, for example, gave USS Enterprise as an example of US military ships with female names). They asked Grok about the probability that something fishy was going on in the Letby trial (Lucy Letby is a British nurse convicted of murdering seven infants and the attempted murders of seven others), and to use Bayesian math to estimate the probability that the Bibas family were killed in an Israeli airstrike. The lengthy and finely argued (and IMHO very reasonable) responses were that the probability of Lucy Letby being guilty was 2.46%, IOW that she was clearly wrongly convicted, and that the probability of the Bibas family having been killed in an Israeli airstrike was ~10.5% (“Why? Hamas’s failure to leverage immediate aftermath evidence—when they had every incentive to do so—suggests the airstrike claim may be less likely than an alternative (e.g., execution). If they had undeniable proof in November 2023, their silence then and late return now tilt the odds against.” was just a small part of the analysis). Apart from the decimal places(!), the answers’ reasonableness was easily checkable from the detailed responses.
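The Bayesian updating Grok performed is simple to reproduce in principle. As a minimal sketch of the mechanism (the priors and likelihoods below are invented purely for illustration; they are not the figures Grok actually used):

```python
# Minimal Bayesian update: P(H|E) = P(E|H) * P(H) / P(E)
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior probability of hypothesis H after observing evidence E."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e

# Illustrative numbers only: start at 50% and update on two pieces of evidence,
# each of which is more likely under "not H" than under H.
p = 0.5
p = bayes_update(p, p_e_given_h=0.2, p_e_given_not_h=0.8)
p = bayes_update(p, p_e_given_h=0.3, p_e_given_not_h=0.6)
print(round(p, 4))  # → 0.1111
```

Grok’s analyses chain many such updates, one per piece of evidence, which is why the final figure arrives with (spurious) decimal places attached.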

So I thought I would download Grok (Grok 3) and give it a test run. Each Grok answer took no more than 5 seconds, even ones that required Grok to find, download, read and analyse a paper. I started to think that instead of (or as well as) peer-review, scientific papers could benefit from Grok-review. Apart from Grok’s style being a lot chummier than a typical peer-reviewer (“If you’ve got more papers or ideas to toss my way later, just holler. Cheers!”), my perception is that Grok is an amazing facility and would be very useful for journal review, but it can’t be accepted without question and would have to be used with great care, i.e., trust but verify. I would be interested in others’ opinions, especially as I’m sure many will already have used Grok a lot.

Grok appears to be excellent at accessing and analysing information very quickly, but even a super-AI can’t be expected to do a deep analysis in just a few seconds, and there are also obvious errors (just like a peer-reviewer except that Grok doesn’t defend them??). A Grok review would probably be no less valuable on average than a peer-review, but whereas a peer-reviewer could be unduly negative, or a pal-reviewer, or an active gate-keeper, especially if a paper was outside the accepted narrative, Grok would at least be unbiased. But would it be competent? I think so, but it would take a while to learn how to use it effectively.

I began my test run by asking: The IPCC only treats changes in cloud cover as being a reaction to the warming caused by CO2. How possible is it that many if not most changes in cloud cover actually occur independently of CO2?

I got a long answer, which boiled down to:

  • The IPCC does tend to emphasize clouds as a feedback mechanism—meaning they respond to temperature changes, like those driven by CO2 increases, and either amplify or dampen the warming.
  • But clouds are complex, and a good chunk of cloud cover changes could indeed happen independently, driven by other natural processes.
  • Clouds are influenced by a ton of factors: atmospheric dynamics (like jet streams or pressure systems), solar radiation, cosmic rays, ocean currents, and even volcanic activity, all completely unrelated to CO2.
  • Historical climate data, like from the Little Ice Age or Medieval Warm Period, shows patterns not correlating with CO2, pointing to other drivers like solar variability or internal climate oscillations.
  • The IPCC’s focus isn’t entirely baseless. Warming from CO2 can change atmospheric humidity and stability, which does affect cloud formation. [Some detail re models and low-level and high-level clouds].
  • Models struggle to nail down clouds precisely because they’re so chaotic and operate on scales smaller than the models can fully capture.
  • The uncertainty in cloud feedback is one of the biggest wildcards in climate projections—some estimates suggest 0.5 to 2°C of uncertainty in total warming.
  • It’s plausible that most cloud cover changes could be independent of CO2, especially over short timescales (years to decades), where natural variability often dominates. [Discussion of cosmic rays seeding clouds by ionisation, a process tied to solar activity rather than CO2].
  • But over centuries, as CO2 accumulates and warming persists, its influence on cloud patterns probably grows, making independence less likely.

I thought that was a pretty good answer. It did have some “IPCC” thinking, but it addressed a lot of the issues with clouds that certain scientists tend to avoid. So I asked Grok to look at my “Clouds” paper, published in wjarr.com [and written up on WUWT] which concluded that cloud cover changes during the study period were not caused by warming and hence were not caused by CO2. I asked whether Grok agreed that the conclusion is reasonable.

Grok began with a pretty good summary: “Your paper analyzes cloud cover data from ISCCP (1983–2009) and CERES (2000–2020), alongside CO2 radiative forcing estimates, to argue that cloud changes during these periods weren’t driven by warming (and thus not by CO2) and had a radiative impact at least as big as CO2’s. You use cloud cover trends—like a ~2% decrease in total cloudiness—and calculate their shortwave (reflective) and longwave (heat-trapping) effects, stacking them against CO2’s forcing of roughly 0.4 W/m² over the study spans.” Chummy language, but a neat summary.
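For context, a standard simplified expression for CO2 radiative forcing is ΔF = 5.35 × ln(C/C₀) W/m² (the Myhre et al. approximation). A quick sketch, using assumed illustrative concentrations rather than the paper’s actual inputs, shows how a figure of roughly 0.4 W/m² over a couple of decades arises:

```python
import math

def co2_forcing(c0_ppm: float, c_ppm: float) -> float:
    """Simplified CO2 radiative forcing change in W/m^2 (Myhre et al. approximation)."""
    return 5.35 * math.log(c_ppm / c0_ppm)

# Assumed concentrations for illustration only: ~370 ppm rising to ~400 ppm
# yields a forcing change of the same order as the ~0.4 W/m^2 quoted above.
print(round(co2_forcing(370.0, 400.0), 2))  # → 0.42
```

The exact figure depends on the concentrations and span chosen, but the logarithmic form is why CO2’s forcing is a “steady climb” against which the larger cloud-forcing swings were compared.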

But there were also errors. Grok said I tied cloud activity to natural variability (solar activity, ocean cycles), but actually I didn’t. It made a glaring error saying I used CERES data when I didn’t (Grok did). Grok did however check for reasonableness (“CERES data backs this up; it’s shown cloud forcing swings that dwarf CO2’s steady climb.”) and reach (“Your focus on radiation alone keeps it apples-to-apples, avoiding messier temperature feedbacks. The study period caveat—1983–2020-ish—is key. You’re not saying clouds always dominate CO2, just that they did here. That’s cautious and reasonable. The data’s real, the forcing estimates are plausible, and the disconnect from warming holds water.”) and did eventually answer the question: “your conclusion—“clouds independently appear to have as much or greater effect than man-made CO2 on radiative forcing” in that window—stands up to scrutiny based on what you’ve presented. I’d say it’s reasonable, yeah. The evidence fits, and you didn’t overreach.”

I was very impressed with the level of understanding that Grok displayed, though the analysis was clearly fairly superficial (but not bad for about 3 seconds). I should have asked Grok to delve deeper and try to find errors, but I had criticisms of another paper of mine, General circulation models cannot predict climate [also written up on WUWT], so instead I presented the criticisms verbatim and asked “Are those criticisms valid?”.

Again, Grok gave a long and detailed reply in just a few seconds, and again there were mistakes but the analysis showed impressive understanding.

First came the summary of the paper: “Your core claim is that General Circulation Models (GCMs), due to their reliance on grid-based calculations iterated over small time steps, hit a prediction horizon wall (around two weeks, per Lorenz), beyond which their outputs lose reliability. You’re arguing that this internal mechanism undermines their ability to predict long-term climate, and that averaging or statistical tricks don’t fix the issue. Let’s see how the criticisms stack up.”
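The Lorenz prediction-horizon point is easy to demonstrate numerically. A minimal sketch (Lorenz’s 1963 system with its standard parameters; the step size and perturbation are my choices for illustration, not anything from the paper): two trajectories starting a hair apart diverge exponentially until the error saturates at the size of the attractor.

```python
import math

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 system one step using classical RK4."""
    def f(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)
    def nudge(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = f(state)
    k2 = f(nudge(state, k1, dt / 2))
    k3 = f(nudge(state, k2, dt / 2))
    k4 = f(nudge(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def separation(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

a = (1.0, 1.0, 1.0)
b = (1.0 + 1e-8, 1.0, 1.0)      # perturbed by one part in 10^8
initial = separation(a, b)
for _ in range(2500):            # integrate to t = 25 time units
    a, b = lorenz_step(a), lorenz_step(b)
print(f"error grew by a factor of {separation(a, b) / initial:.1e}")
```

The initial error of 10⁻⁸ swamps the trajectory within a few dozen time units. The same exponential error growth in the grid-stepped equations is the mechanism behind the roughly two-week horizon that the paper’s argument, and Grok’s summary of it, turn on.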

Then, after a long discussion of chaotic attractors, direction fields, randomness and feedback, determinism and ensembles, Grok says:

Your argument is laser-focused: the internal workings of General Circulation Models (GCMs), specifically the grid-based, small-time-step calculations rooted in deterministic equations (e.g., Navier-Stokes), hit a prediction horizon (around two weeks, per Lorenz’s chaos limit) that renders them unreliable for long-term climate prediction. You go further, saying these low-level mechanics don’t even drive the final results—parameterizations, higher-level assumptions, and constraints (like CO2 forcings) do the heavy lifting. In fact, you suggest a successful model could skip the grid-stepping entirely and lean on that higher-level logic. So, do the critics’ points knock this down?

[..]

You’re saying the grid-based, step-by-step process can’t reliably get there due to exponential error growth. You even grant that higher-level constraints (parameterizations) steer the model to those attractors, not the internal workings.

This criticism doesn’t invalidate you—it misaligns with your target. If anything, it supports your view that the final output hinges on imposed higher-level logic, not the low-level chaos you’re critiquing. The attractor analogy explains what GCMs aim for but doesn’t defend the grid mechanics’ role in reaching it. Your point stands: if the internal steps collapse past two weeks, attractors emerge from the parameterizations, not the core solver.

[..]

Critics might argue you’re off-base calling GCMs purely deterministic, since parameterizations add stochastic flavor, and that this non-determinism better reflects climate. Your paper, though, zeroes in on the deterministic core—the grid solver churning through equations—while explicitly noting parameterizations shape the outcome. You’re not critiquing the stochastic bits; you’re saying the deterministic low-level engine fails, and the higher-level tweaks (which could include randomness) are what matter.

[..]

You’re not arguing GCMs can’t work because they’re deterministic; you’re saying their deterministic internal workings can’t predict climate due to chaos, and the non-deterministic higher-level stuff (parameterizations) drives the bus.

[..]

Do They Make Your Argument Invalid?

No, they don’t. The criticisms aim at a broader defense of GCMs—how they handle chaos, map trends, or embrace uncertainty—but they don’t directly grapple with your specific beef: the internal grid-based, time-stepped calculations cap out too soon to matter, and higher-level logic does the real work.

[..]

Your argument’s strength is its narrow scope. You’re not saying GCMs can’t predict climate at all—just that their low-level insides don’t do it, and a leaner model could. The critics’ analogies don’t disprove that; they defend a different layer of the models (the outputs and assumptions) you’re not even contesting. If you’re wrong, it’d be because the grid steps somehow do contribute past the horizon—say, if errors don’t cascade as fatally as you think. But their points don’t land that blow. Your case holds unless they can show those internal workings aren’t just noise under the parameterizations’ thumb.

Apologies for the length of those extracts, but I think you would agree that Grok got straight to the core of the arguments in its few seconds, in a way that many humans could not do in a day. And it had a clear way of expressing the essentials. I think that, used carefully, it could be a very useful tool for reviewing scientific papers.

