Research at Ray AI

Engineered for Reliability.

We don't rely on guesswork. Our AI models undergo rigorous benchmarking against human experts to ensure every piece of feedback is fair, transparent, and actionable.

Research Metrics

  • QWK: 0.87+
  • Partial-credit accuracy: 92%+
  • Flagged for human review: 8%

Human vs AI Agreement

  • Perfect match: 78%
  • Within ±1 grade: 18%
  • Disagreement: 4%
What we study

Our Research Focus

We measure what matters: reliability, fairness, trust, and impact.

Reliable Marking

How close we are to expert human markers on real exams.

Fair Partial Credit

Whether working and method are rewarded consistently.

Teacher Time & Trust

How much admin is removed and whether teachers trust the results.

Student Feedback Quality

Whether comments are specific, clear and help students improve.

Metrics & Benchmarks

How We Measure Success

These aren't just numbers—they're our commitment to rigorous evaluation. We compare AI scores to expert markers on real scripts to ensure reliability.

English

  • Quadratic Weighted Kappa (QWK) vs expert raters
  • Agreement rate within ±1 grade
  • Outlier detection when AI strongly disagrees with humans (both sketched below)
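
As a rough illustration of the last two metrics, the sketch below computes within-±1 agreement and flags strong disagreements on invented grade data; the ±2 flagging threshold is an assumption for illustration, not our production rule.

```python
import numpy as np

# Hypothetical grades (same scripts, 0-6 band scale) from an expert and the AI.
human = np.array([4, 5, 3, 6, 2, 4, 5, 3])
ai    = np.array([4, 4, 3, 6, 5, 4, 5, 2])

diff = np.abs(ai - human)

# Agreement rate within ±1 grade.
within_one = (diff <= 1).mean()

# Outlier detection: flag scripts where the AI strongly disagrees
# (a gap of 2+ grades here, purely as an illustrative cut-off).
flagged = np.flatnonzero(diff >= 2)

print(f"Within ±1 grade: {within_one:.0%}")       # 75%
print(f"Flagged for review: {flagged.tolist()}")  # scripts [4]
```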

Maths

  • Partial-credit accuracy on worked solutions (see the sketch after this list)
  • "Acceptable alternate methods" coverage
  • Consistency across different versions of similar questions
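
Partial-credit accuracy can be read as a simple per-step comparison: for each step in the mark scheme, does the AI award the same mark as the expert? A minimal sketch on invented marks:

```python
# Hypothetical per-step marks for one worked solution
# (e.g. 1 mark each for setup, method, and final answer).
expert_steps = [1, 1, 0]  # expert awarded setup and method only
ai_steps     = [1, 1, 1]  # AI also awarded the final answer

matches = sum(a == e for a, e in zip(ai_steps, expert_steps))
accuracy = matches / len(expert_steps)

print(f"Partial-credit accuracy: {accuracy:.0%}")  # 67%
```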

Operational Metrics

  • Average marking time per script
  • Percentage of scripts flagged for human review

* Quadratic Weighted Kappa (QWK) measures agreement between raters, accounting for the severity of disagreements. Values closer to 1 indicate stronger agreement.
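
For readers who want to reproduce the metric, QWK is available off the shelf in scikit-learn; the sketch below runs it on invented grades, so the data (not the API) is the assumption here.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades from an expert marker and the AI on the same scripts.
expert = [3, 4, 4, 5, 2, 3, 5, 4]
ai     = [3, 4, 5, 5, 2, 3, 4, 4]

# weights="quadratic" penalises large disagreements more heavily than
# near-misses, which is what makes QWK suitable for grade scales.
qwk = cohen_kappa_score(expert, ai, weights="quadratic")
print(f"QWK: {qwk:.2f}")
```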

Current Research Projects

What We're Working On

Rigorous evaluation focused on reliability, fairness, and practical usefulness in real educational settings.

In progress

Pilot: HSC Maths Exam Marking

Testing our AI marking system on real exam scripts, with expert markers as ground truth, to confirm the feedback is reliable, fair, and useful.

Focus: Reliability and accuracy of AI marking compared to expert human markers
Metrics: Quadratic weighted kappa (QWK), partial-credit accuracy, marking time
In progress

Essay Scoring Consistency Across Centres

Measuring consistency of AI essay scoring across different centres to validate fairness and reliability.

Focus: Agreement between AI and human markers across different contexts
Metrics: QWK, disagreement analysis
Results coming soon

Termly Report Generation Study

Evaluating how much time teachers save with AI-generated termly reports, and how accurate they perceive those reports to be.

Focus: Teacher time savings and accuracy perception
Metrics: Time saved, teacher satisfaction scores

Human vs AI: Our Approach

Teachers are the ground truth. Our system doesn't replace educator judgment—it supports it. We measure both human–human and AI–human reliability to ensure our AI markers enhance, rather than replace, teacher expertise.
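
One way to make this concrete is to benchmark the AI against the human–human baseline: an AI marker is performing well when its agreement with an expert is comparable to the agreement between two experts. A minimal sketch on invented grades:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades from two expert markers and the AI on the same scripts.
marker_a = [4, 3, 5, 2, 4, 5, 3, 4]
marker_b = [4, 3, 4, 2, 5, 5, 3, 4]
ai       = [4, 3, 5, 2, 4, 4, 3, 5]

human_human = cohen_kappa_score(marker_a, marker_b, weights="quadratic")
ai_human    = cohen_kappa_score(marker_a, ai,       weights="quadratic")

print(f"Human-human QWK: {human_human:.2f}")
print(f"AI-human QWK:    {ai_human:.2f}")
```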

Human markers naturally disagree—it's expected and healthy. Our goal is to make AI markers as reliable and consistent as experienced educators, while dramatically reducing marking time. Every AI score can be reviewed and overridden. Teachers remain in complete control.

We support teachers, not replace their judgment.