Engineered for Reliability.
We don't rely on guesswork. Our AI models undergo rigorous benchmarking against human experts to ensure every piece of feedback is fair (partial credit awarded consistently), transparent (clear reasoning and methodology), and actionable (specific comments students can use).
Research Metrics
Our Research Focus
We measure what matters: reliability, fairness, trust, and impact.
Reliable Marking
How closely our scores match those of expert human markers on real exams.
Fair Partial Credit
Whether working and method are rewarded consistently.
Teacher Time & Trust
How much admin time we remove and whether teachers trust the results.
Student Feedback Quality
Whether comments are specific and clear, and whether they help students improve.
How We Measure Success
These aren't just numbers; they're our commitment to rigorous evaluation. We compare AI scores against expert human marks on real scripts to ensure reliability.
English
- Quadratic Weighted Kappa (QWK) vs expert raters
- Agreement rate within ±1 grade
- Outlier detection (when AI strongly disagrees with humans)
Maths
- Partial-credit accuracy on worked solutions
- "Acceptable alternate methods" coverage
- Consistency across different versions of similar questions (see the sketch after this list)
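To make the consistency check concrete, here is a minimal sketch. The `mark_response` function and its parameters are hypothetical placeholders standing in for a marking system, not our actual API; the idea is simply that the same worked solution, marked against paraphrased variants of a question, should earn the same marks.

```python
from statistics import pstdev

def consistency_spread(mark_response, variants, worked_solution):
    """Mark one worked solution against each paraphrased question variant
    and return the spread (population std dev) of the awarded marks.
    A spread of 0 means perfectly consistent partial credit."""
    marks = [mark_response(question=v, response=worked_solution) for v in variants]
    return pstdev(marks)
```

A spread meaningfully above zero signals that wording changes, rather than the mathematics, are moving the marks.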
Operational Metrics
- Average marking time per script
- Percentage of scripts flagged for human review
* Quadratic Weighted Kappa (QWK) measures agreement between raters, accounting for the severity of disagreements. Values closer to 1 indicate stronger agreement.
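For readers who want the detail, here is a minimal sketch of how the English metrics above can be computed, using scikit-learn's `cohen_kappa_score`. The grades below are made-up illustrative values, and the ±1 and 2-grade thresholds are examples rather than our production settings.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative data: grades (as integers) from an expert marker and the AI
# on the same scripts. Real evaluation uses full cohorts of exam scripts.
expert = np.array([5, 4, 3, 5, 2, 4, 3, 1])
ai = np.array([5, 4, 4, 5, 2, 3, 3, 2])

# Quadratic Weighted Kappa: chance-corrected agreement where larger
# disagreements (e.g. two grades apart) are penalised more than small ones.
qwk = cohen_kappa_score(expert, ai, weights="quadratic")

# Agreement rate within +/-1 grade.
within_one = np.mean(np.abs(expert - ai) <= 1)

# Outlier detection: flag scripts where AI and expert differ by 2+ grades.
outliers = np.where(np.abs(expert - ai) >= 2)[0]

print(f"QWK: {qwk:.3f}, within 1 grade: {within_one:.0%}, outliers: {outliers}")
```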
What We're Working On
Rigorous evaluation focused on reliability, fairness, and practical usefulness in real educational settings.
Pilot: HSC Maths Exam Marking
Testing our AI marking system on real exam scripts with expert markers as ground truth to ensure reliable, fair, and useful feedback.
Essay Scoring Consistency Across Centres
Measuring consistency of AI essay scoring across different centres to validate fairness and reliability.
Termly Report Generation Study
Evaluating time savings and perceived accuracy from teachers using AI-generated termly reports.
Human vs AI: Our Approach
Teachers are the ground truth. Our system doesn't replace educator judgment—it supports it. We measure both human–human and AI–human reliability to ensure our AI markers enhance, rather than replace, teacher expertise.
Human markers naturally disagree—it's expected and healthy. Our goal is to make AI markers as reliable and consistent as experienced educators, while dramatically reducing marking time. Every AI score can be reviewed and overridden. Teachers remain in complete control.
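To illustrate what measuring both human–human and AI–human reliability looks like, here is a small sketch; the marker names and grades are hypothetical, and real studies use far larger samples.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades on the same essays from two expert markers and the AI.
marker_a = np.array([4, 3, 5, 2, 4, 3])
marker_b = np.array([4, 4, 5, 2, 3, 3])
ai = np.array([4, 3, 5, 3, 4, 3])

# Baseline: how much two experienced humans agree with each other.
human_human = cohen_kappa_score(marker_a, marker_b, weights="quadratic")

# Target: the AI should agree with each human at least as well as humans
# agree with one another.
ai_human = np.mean([
    cohen_kappa_score(ai, marker_a, weights="quadratic"),
    cohen_kappa_score(ai, marker_b, weights="quadratic"),
])

print(f"human-human QWK: {human_human:.2f}, AI-human QWK: {ai_human:.2f}")
```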
We support teachers, not replace their judgment.
