GenAI as an International Lawyer
Can a large language model write a legal memorial good enough to compete against law students from 700 universities worldwide? We generated 10 memorials using GPT-4o and Gemini 2.0 Flash, submitted them to the world's largest moot court competition, and had them blindly graded by real Jessup judges. The results challenge assumptions about AI in legal practice.
The Philip C. Jessup International Law Moot Court Competition is the world's largest and most prestigious international law moot. Each year, teams draft written memorials arguing both sides of a hypothetical dispute before the International Court of Justice, graded on a 100-point scale across five categories: knowledge of facts and law, legal analysis, research, clarity, and style.
We used two frontier LLMs – Google's Gemini 2.0 Flash and OpenAI's GPT-4o – to generate memorials under three input conditions of increasing informational richness: Pure AI (basic materials only), Textbook AI (with leading treatise extracts), and Informed AI (with the organizers' Bench Memo). Each memorial was submitted under a fictitious team number and graded blindly alongside roughly 900 human-authored submissions.
I. Hiding in Plain Sight
The headline finding: our AI-generated memorials scored an average of 78 points out of 100, virtually indistinguishable from the historical average (77.80) and the 2025 cohort average (78.15). The AI did not stand out – it blended in.
The chart below shows where each AI memorial falls against the overall score distribution. The bell curve represents all memorials graded in the 2025 competition; the coloured markers are our ten AI submissions.
Figure 1: Score distribution of all 2025 Jessup memorials (approximated from published statistics). AI memorials marked as dots.
But averages conceal volatility. The same memorial (A_126/562, Gemini Pure AI) received both the lowest individual grade in our batch – 51 – and the highest – 98. The judge awarding 51 wrote that the memorial contained hallucinated content; the judge awarding 98 called it outstanding. This variance is itself revealing: AI-generated text can simultaneously impress and alarm depending on where a reader looks.
II. The Scoreboard
Below are the aggregate scores for each of the ten AI-generated memorials, coloured by model. GPT-4o memorials (blue) cluster higher; Gemini memorials (orange) show more variance.
Figure 2: Aggregate scores for each AI memorial. Dashed line marks the 2025 cohort average (78.15).
Our best memorial, A_442/630 (GPT-4o Pure AI), placed in the top 5% of all submissions. It was not generated with insider knowledge – it relied only on the competition problem and basic materials. Nearly half of human teams received at least one penalty (averaging 2 points deducted); our memorials received none.
III. GPT-4o vs Gemini
A statistically significant difference emerged between the two models (p=0.0316): GPT-4o achieved a higher average total score of 80.8 against Gemini's 76.0, placing it firmly in the sixth decile of all scores.
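A comparison of this kind is typically run as a two-sample t-test on per-memorial scores. The sketch below illustrates the procedure; the score arrays are hypothetical placeholders chosen only to match the reported means (80.8 vs 76.0), not the study's actual data.

```python
from scipy import stats

# Hypothetical per-memorial aggregate scores -- placeholders matching the
# reported means (GPT-4o: 80.8, Gemini: 76.0), NOT the study's real grades.
gpt4o_scores = [82.0, 79.5, 81.5, 80.0, 81.0]   # mean = 80.8
gemini_scores = [73.5, 78.0, 72.5, 79.0, 77.0]  # mean = 76.0

# Welch's t-test avoids assuming equal variances across the two models,
# sensible given Gemini's wider spread of scores.
t_stat, p_value = stats.ttest_ind(gpt4o_scores, gemini_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

With only five memorials per model, such a test has limited power, which makes the significant result all the more notable.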
Figure 3: Average total scores by model and input condition. GPT-4o consistently outperformed Gemini.
Surprisingly, GPT-4o scored higher despite producing shorter texts – some of its memorials fell 1,000 to 3,500 words below the 9,500-word limit, a disadvantage given that longer memorials tend to score better in the Jessup. Gemini, by contrast, generated more text but exhibited greater variance, with occasional severe factual hallucinations pulling down its averages.
Fascinatingly, varying the informational context provided to the models – from basic materials (Pure AI) to treatise extracts (Textbook AI) to the organizers' Bench Memo (Informed AI) – produced no statistically significant difference in overall scores (F=0.08, p=0.92). More information did not reliably mean better performance, suggesting that expanding the models' context may confuse them as much as it helps.
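The comparison across the three input conditions corresponds to a one-way ANOVA. The sketch below shows the procedure; the per-condition scores are hypothetical placeholders, not the study's data.

```python
from scipy import stats

# Hypothetical scores grouped by input condition (placeholders only):
pure = [80.0, 76.5, 79.0]       # basic materials only
textbook = [78.0, 77.5, 79.5]   # plus leading treatise extracts
informed = [78.5, 76.0, 80.0]   # plus the organizers' Bench Memo

# One-way ANOVA: does informational richness shift the mean score?
# A small F and large p indicate the group means are statistically
# indistinguishable, as the study found (F=0.08, p=0.92).
f_stat, p_value = stats.f_oneway(pure, textbook, informed)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```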
IV. What AI Does Well (and Poorly)
The Jessup's five scoring categories reveal a striking pattern. AI memorials excelled in the presentation categories – "Clarity and Organization" and "Style, Formatting, Grammar, Citation" – while scoring lowest in "Extent and Use of Research". In other words, the AI can write well, but it cannot necessarily research or reason well.
Figure 4: Average category scores by model (each category scored out of 20).
This pattern held for both models. Judges frequently praised the memorials' structure and linguistic coherence, even when critical of the substance. Comments described them as having "excellent hierarchy of argument," being "very organized," "well-organized with logical flow," or "clearly structured … squeaky clean."
But beneath the polished surface lay substantive weaknesses. Research scores were dragged down by fabricated citations, hallucinated case law, and sources that did not support the propositions for which they were cited – and the sharpest judicial critiques targeted precisely these fabrications.
V. Can Judges Tell?
Overwhelmingly, no. Of the 13 judges who graded at least one of our AI-generated memorials, only 2 suspected AI use. At least 11 judges read and graded our 100% AI-generated memorials without flagging them. In the broader survey of all judges, only 27 out of 90 respondents reported any suspicion of AI use across all the memorials they graded.
Figure 5: Of the 13 judges who graded AI memorials, only 2 suspected AI involvement.
The clues that did raise suspicion were "lack of depth," "repetitive phrasing," and occasional hallucinations – but these were identified by very few. Most judges were not confident at all in their ability to detect AI use, with several highlighting the inherent difficulty.
One judge captured the paradox elegantly, observing that in the 2025 edition "there's often greater clarity in structure, grammar, and formatting – which may reflect the use of AI tools for editing and polishing. However, in some cases, there's also a noticeable drop in originality, depth of analysis, or strategic argumentation, which could suggest over-reliance on generic AI-generated content."
VI. The Centaur Model
Our results point to a clear conclusion: the most effective use of AI in legal writing is not full automation but a "centaur" model – human-AI collaboration where the machine handles drafting and structure while humans maintain control over substantive legal analysis and factual verification.
The LLM's strength is surface-level generalizability; its failures come from substance-specific overreach. It can mimic law's idiom, but not (fully) its logic. Legal hallucination is not a glitch but a function of the LLM's core design: it predicts what "should" be said next, not whether it is true or legally grounded.
For legal education and competitions like the Jessup, these findings suggest AI will play an increasingly significant role. The question is no longer whether AI can write like a lawyer, but whether our evaluation frameworks are equipped to tell the difference – and what that means for the profession.
Full paper: Damien Charlotin & Niccolò Ridi, 'GenAI as an International Lawyer: A Case Study with the Jessup International Law Moot Court' (2025). Available on SSRN.