Benchmarking neimo.ai

Accuracy Evaluation Against ChatGPT, Perplexity, and Grok

At k-ID, our team of experienced attorneys works on the Global Compliance Database, a massive repository tracking legal requirements around the world. We built neimo.ai as a way for our customers to retrieve research from the Global Compliance Database in a natural and intuitive way, and far more efficiently than before.

Because neimo has been trained exclusively on the Global Compliance Database and the primary sources cited within it, we believe it provides a much more accurate view of the law than other research methods. To test this, we ran a benchmarking exercise comparing neimo’s output with the output of three leading large language models (LLMs): ChatGPT, Perplexity, and Grok. The assessment focused solely on the accuracy of the responses each model provided (see the full methodology and assumptions below).

Note: neimo.ai's responses are AI-generated for informational purposes only. They may contain errors and do not constitute legal advice.

Summary of Results

  • ChatGPT and Perplexity both achieved 65% accuracy.

  • Grok scored slightly lower at 60%.

  • neimo.ai established the baseline for comparison based on internally validated responses.

Chart: Accuracy (%) of each model against the neimo.ai baseline

Methodology

We created a curated benchmark set of 20 closed-ended questions focused on compliance topics and jurisdictions relevant to our domain. This approach ensures each question has one clear, objectively correct answer, which simplifies accuracy scoring.

We internally validated neimo.ai’s responses against the Global Compliance Database and the primary sources cited within to establish a “ground truth”. Responses from ChatGPT, Perplexity, and Grok were then measured against this baseline, with our legal team determining pass/fail accuracy.
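
To make the scoring concrete, here is a minimal Python sketch of how the pass/fail verdicts roll up into the accuracy percentages reported above. The data shown is illustrative only; in the actual exercise, the verdicts were assigned by our legal team after checking each response against the neimo.ai baseline.

```python
# Illustrative sketch of the scoring step (hypothetical data; the real
# pass/fail verdicts were assigned by our legal team against the
# internally validated neimo.ai baseline).

# Each benchmark item: the question, the validated ground-truth answer,
# and a pass/fail verdict per model.
benchmark = [
    {
        "question": "What is the digital consent age in Denmark?",
        "ground_truth": "15",
        "verdicts": {"ChatGPT": False, "Perplexity": False, "Grok": False},
    },
    # ... 19 more closed-ended questions ...
]

def accuracy(model, items):
    """Percentage of questions the model answered correctly (pass = True)."""
    passes = sum(1 for item in items if item["verdicts"][model])
    return 100 * passes / len(items)

for model in ("ChatGPT", "Perplexity", "Grok"):
    print(f"{model}: {accuracy(model, benchmark):.0f}% accuracy")
```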

Assumptions and Limitations

  • The initial benchmark reflects areas where neimo.ai is confident and correct; the comparative evaluation therefore measures whether the other models match or exceed this baseline accuracy.

  • Limiting the question set to closed-ended items facilitates objective scoring, but it may omit questions with nuanced or multifaceted answers that require deeper qualitative review.

  • A uniform opening prompt was used across all models to keep context consistent, though it may have influenced response style or content.

  • Evaluation relied on human expert adjudication for accuracy, with scope to expand question coverage and complexity in future iterations.

  • The benchmark used the latest free or publicly accessible versions of these foundational models available as of September 2025.

Gaps in Foundational Models

Hypothesis #1: Difficulty with Obscure or Non-English Legal Sources

Foundational LLMs rely mainly on broadly reported, English-language materials, which limits their accuracy on critical details that are published only in local languages or are esoteric in nature. This exposes a notable coverage gap wherever reliable primary sources are not globally accessible or widely publicised. In practice, it risks misinforming compliance teams that rely on broad assumptions rather than precise local law.

Example 1: What is the digital consent age in Denmark?

All three foundational models defaulted to 13 based on the GDPR — reflecting general knowledge — but missed the actual digital consent age of 15 clearly set out in statutory legislation dated March 2024, which is officially available only in Danish. (See Section II, Chapter 3, § 6, subsection 2: “the processing of personal data concerning a child is lawful if the child is at least 15 years of age.”)

Example 2: Is it mandatory to carry out a Data Protection Impact Assessment prior to processing kids’ data in Poland?

All three foundational models answered that a DPIA is not automatically mandatory and is required only when the processing is “likely to result in a high risk to the rights and freedoms” of individuals. None of them surfaced the Polish Data Protection Authority’s announcement on the list of processing operations requiring a data protection impact assessment (the “Polish DPIA Blacklist”). This document, published in Polish, cites “services and toys dedicated to children” as an example of the “operation / scope of data / circumstances” triggering the obligation to carry out a DPIA.

Example 3: Are online gaming companies subject to Australia’s social media minimum age requirement?

ChatGPT referred generally to the Online Safety Amendment (Social Media Minimum Age) Act 2024, which broadly applies to online services that facilitate social interaction, and concluded that online gaming companies are in scope. However, it overlooked the legislative rules that explicitly exempt online gaming services. (See section 5(1)(b): “services that have the sole or primary purpose of enabling end‑users to play online games with other end‑users” are exempt.)

Example 4: Do I need to notify my players in France of time elapsed in the game?

All three foundational models failed to identify Law no. 2023-566 of July 7, 2023, which requires providers of online social networking services operating in France to implement a system that monitors the time users spend on their services and to regularly inform users of this screen time through notifications.

Hypothesis #2: Weakness in Weighing Source Authority

Legal decisions depend not just on what the law says but on where it is said. Primary legal sources (such as statutes, official government guidelines, and regulatory decisions) carry significantly more weight than secondary or tertiary commentaries. However, foundational LLMs sometimes fail to distinguish authoritative regulatory texts from secondary interpretations, resulting in the propagation of inaccurate legal claims.

Example 5: Is there a data subject right to portability in Egypt?

ChatGPT and Grok incorrectly asserted that this right exists, attributing it to third-party sources that were themselves erroneous. Curiously, both models referenced the same inaccurate source, demonstrating how shared internet misinformation can mislead AI models that lack source-vetting and authority-weighting mechanisms.

Hypothesis #3: Challenges with Nuanced or Multi-step Legal Analysis

Legal requirements often hinge on precise definitions and cross-references that require layered understanding rather than surface-level answers. Foundational LLMs may provide general but imprecise interpretations.

Example 6: What’s the minimum age in Brazil at which parental consent is no longer mandatory for enabling in-game purchases for users?

ChatGPT: “In Brazil, parental consent is mandatory for enabling in-game purchases for users under the age of 18. This requirement is stipulated by the Legal Framework for Electronic Games (Law No. 14,852/2024), which mandates that game developers and publishers implement safeguards to protect minors, including obtaining parental or legal guardian consent for in-game purchases...”

ChatGPT correctly cited the Legal Framework for Electronic Games (Law No. 14,852/2024). However, it failed to consider the law’s specific definition of “children”. Article 17 requires parental consent for purchases by “children”, but “children” means under 12, while the separate term “adolescents” covers ages 12–18. Although an operator could certainly choose to seek parental consent for adolescents as well, ChatGPT’s conflation of these terms could lead a developer to restrict purchases it is not legally required to restrict.

Example 7: In the Philippines, what’s the legally required turnaround time for removal of CSAM on my platform?

Perplexity: “Under Philippine law (Republic Act No. 11930, the Anti-Online Sexual Abuse or Exploitation of Children Act, and its Implementing Rules and Regulations), digital platforms are required to remove child sexual abuse material (CSAM) within 24 hours of notice or discovery…”

Perplexity’s answer included the 24-hour removal requirement but failed to specify that the clock starts ticking only once a notice from law enforcement has been received, not upon discovery by the platform. This subtle but operationally critical distinction affects compliance risk and procedural workflows for platform operators.

Hypothesis #4: Difficulty Providing Authoritative Enforcement Context

Understanding the law is one thing; knowing when compliance failures actually lead to enforcement actions or penalties is another. That context is essential for realistic legal advice and empowers counsel to prioritise risk mitigation efficiently.

Example 8: If my audience is mainly kids, do I need to offer my privacy policy in a specific language in the Netherlands?

All three foundational models correctly answered “yes” to the legal requirement to provide privacy policies in Dutch. However, none cited the TikTok enforcement decision where the company was fined €750,000 for having its policy only in English despite a mainly child audience. This enforcement context significantly informs the practical legal risk.

Example 9: In South Korea, what’s the long-stop date by which parents must provide their consent before that consent is invalid?

Perplexity: “In South Korea, under the Personal Information Protection Act (PIPA), parental consent for the processing of personal information of children under age 14 must be obtained. If the legal guardian refuses or does not provide consent within five days, the data controller (service provider) is required to destroy the child's personal data, making the consent effectively invalid after this five-day period.”

Perplexity correctly noted the five-day window for parental consent. However, it mistakenly presented this period as a statutory mandate under the Personal Information Protection Act (PIPA). In fact, the five-day period comes from regulator-issued guidelines, which, while influential, are technically non-binding.

Overall, our study shows the clear accuracy benefits of relying on research from the Global Compliance Database, powered by neimo’s retrieval capabilities, rather than on foundational models alone for legal research.

👉 Ready to explore neimo.ai? Log in to Compliance Studio today and experience instant, authoritative answers for product teams — now in beta.