Engineered for Reliability.
We don't rely on guesswork. Our AI models undergo rigorous benchmarking against human experts to ensure every piece of feedback is fair (partial credit awarded consistently), transparent (clear reasoning and methodology), and actionable (specific comments students can use).
Research Metrics
Our Research Focus
We measure what matters: reliability, fairness, trust, and impact.
Reliable Marking
How closely our scores match those of expert human markers on real exams.
Fair Partial Credit
Whether working and method are rewarded consistently.
Teacher Time & Trust
How much admin time we remove and whether teachers trust the results.
Student Feedback Quality
Whether comments are specific and clear, and whether they help students improve.
How We Measure Success
These aren't just numbers; they're our commitment to rigorous evaluation. We compare AI scores against expert human marks on real scripts to ensure reliability.
English
- Quadratic Weighted Kappa (QWK) vs expert raters
- Agreement rate within ±1 grade
- Outlier detection (when AI strongly disagrees with humans)
Maths
- Partial-credit accuracy on worked solutions
- "Acceptable alternate methods" coverage
- Consistency across different versions of similar questions (see the sketch after this list)
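To make the consistency check concrete, here is a minimal sketch. The `mark_response` function and its parameters are hypothetical placeholders standing in for a marking system, not our actual API; the idea is simply that the same worked solution, marked against paraphrased variants of a question, should earn the same marks.

```python
from statistics import pstdev

def consistency_spread(mark_response, variants, worked_solution):
    """Mark one worked solution against each paraphrased question variant
    and return the spread (population std dev) of the awarded marks.
    A spread of 0 means perfectly consistent partial credit."""
    marks = [mark_response(question=v, response=worked_solution) for v in variants]
    return pstdev(marks)
```

A spread meaningfully above zero signals that wording changes, rather than the mathematics, are moving the marks.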
Operational Metrics
- Average marking time per script
- Percentage of scripts flagged for human review
* Quadratic Weighted Kappa (QWK) measures agreement between raters, accounting for the severity of disagreements. Values closer to 1 indicate stronger agreement.
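For readers who want the detail, here is a minimal sketch of how the English metrics above can be computed, using scikit-learn's `cohen_kappa_score`. The grades below are made-up illustrative values, and the ±1 and 2-grade thresholds are examples rather than our production settings.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative data: grades (as integers) from an expert marker and the AI
# on the same scripts. Real evaluation uses full cohorts of exam scripts.
expert = np.array([5, 4, 3, 5, 2, 4, 3, 1])
ai = np.array([5, 4, 4, 5, 2, 3, 3, 2])

# Quadratic Weighted Kappa: chance-corrected agreement where larger
# disagreements (e.g. two grades apart) are penalised more than small ones.
qwk = cohen_kappa_score(expert, ai, weights="quadratic")

# Agreement rate within +/-1 grade.
within_one = np.mean(np.abs(expert - ai) <= 1)

# Outlier detection: flag scripts where AI and expert differ by 2+ grades.
outliers = np.where(np.abs(expert - ai) >= 2)[0]

print(f"QWK: {qwk:.3f}, within 1 grade: {within_one:.0%}, outliers: {outliers}")
```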
What We're Working On
Rigorous evaluation focused on reliability, fairness, and practical usefulness in real educational settings.
Pilot: HSC Maths Exam Marking
Testing our AI marking system on real exam scripts with expert markers as ground truth to ensure reliable, fair, and useful feedback.
Essay Scoring Consistency Across Centres
Measuring consistency of AI essay scoring across different centres to validate fairness and reliability.
Termly Report Generation Study
Evaluating time savings and perceived accuracy from teachers using AI-generated termly reports.
Human vs AI: Our Approach
Teachers are the ground truth. Our system doesn't replace educator judgment—it supports it. We measure both human–human and AI–human reliability to ensure our AI markers enhance, rather than replace, teacher expertise.
Human markers naturally disagree—it's expected and healthy. Our goal is to make AI markers as reliable and consistent as experienced educators, while dramatically reducing marking time. Every AI score can be reviewed and overridden. Teachers remain in complete control.
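To illustrate what measuring both human–human and AI–human reliability looks like, here is a small sketch; the marker names and grades are hypothetical, and real studies use far larger samples.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades on the same essays from two expert markers and the AI.
marker_a = np.array([4, 3, 5, 2, 4, 3])
marker_b = np.array([4, 4, 5, 2, 3, 3])
ai = np.array([4, 3, 5, 3, 4, 3])

# Baseline: how much two experienced humans agree with each other.
human_human = cohen_kappa_score(marker_a, marker_b, weights="quadratic")

# Target: the AI should agree with each human at least as well as humans
# agree with one another.
ai_human = np.mean([
    cohen_kappa_score(ai, marker_a, weights="quadratic"),
    cohen_kappa_score(ai, marker_b, weights="quadratic"),
])

print(f"human-human QWK: {human_human:.2f}, AI-human QWK: {ai_human:.2f}")
```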
We support teachers, not replace their judgment.
