Psychometrics is not a new field of study, but its applications in AI are.
I study new AI tools and build consumer trust by quantifying success through rigorous, repeatable, scientific methods.
As an AI Psychometrician, I build assessment frameworks to quantify success when using AI. My skill set is niche, and the field is still emerging, so you might not otherwise know I exist.
Job postings on LinkedIn ask for expertise in both data science and AI, but usually separately. Companies want to be on the cutting edge of applied AI software engineering. As a psychometrician, I sit as a bridge between these spaces. I don’t fit neatly into the software-development AI Engineer role or the traditional dashboard-building Data Scientist role. I sit in the space between them, providing quantitative evidence of successful AI-driven tool development.
I leverage my skills in psychometrics, research, and analysis to create assessments and measure outcomes for both humans and AI.
But first… what is a psychometrician?
A psychometrician develops evaluation frameworks.
When you make a survey, a test to measure knowledge, or a scale to measure psychological traits, there is actually a lot of science behind it. You don’t just throw questions on paper and call it good. You have to define what you are measuring and verify that the test or survey actually measures it.
It starts with research. I look at the goals for what I am trying to measure. I look at publications on the subject matter, previously developed instruments, various predefined metrics, client defined standards, industry standards and whatever other information seems relevant to design an evaluation framework.
Once the surveys are taken or the assessments are completed by a large enough sample, I get to do the fun parts.
With enough data, I can measure the difficulty of a test or of individual questions. I can equate test scores so they can be compared across forms, based on individual baselines or on anchor questions shared across tests. I can assign difficulty values to questions for integration in adaptive testing environments. And I can verify that the test works the same way across different sub-groups, measuring the same things for all participants.
I can also refine the survey or test to make it as short as possible. No one enjoys tests or surveys that take too long to complete. I can remove questions that are redundant, too easy, too difficult, or that simply don’t provide much information about the test taker. And I can tell you whether the test is likely measuring what it is supposed to, or whether it is unintentionally measuring other traits as well.
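To make that concrete, here is a minimal sketch of classical item analysis in Python. The response matrix, thresholds, and drop/keep flags are invented for illustration; a real analysis would use a much larger sample and often item response theory models instead of these classical statistics.

```python
# Illustrative sketch: classical item analysis on a tiny, made-up response
# matrix. Rows are test takers, columns are items; 1 = correct, 0 = incorrect.

def item_difficulty(responses, item):
    """Proportion of test takers answering the item correctly (the p-value)."""
    return sum(row[item] for row in responses) / len(responses)

def item_total_correlation(responses, item):
    """Correlation between an item and the rest score (total score with the
    item itself removed, to avoid inflating the correlation)."""
    n = len(responses)
    rest = [sum(row) - row[item] for row in responses]
    xs = [row[item] for row in responses]
    mean_x = sum(xs) / n
    mean_r = sum(rest) / n
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rest)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_r = sum((r - mean_r) ** 2 for r in rest) / n
    if var_x == 0 or var_r == 0:
        return 0.0  # no variance: the item cannot discriminate at all
    return cov / (var_x ** 0.5 * var_r ** 0.5)

responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]

for item in range(4):
    p = item_difficulty(responses, item)
    r = item_total_correlation(responses, item)
    # Hypothetical cut-offs: flag items everyone (or no one) gets right,
    # and items that barely relate to the rest of the test.
    flag = "drop?" if p > 0.9 or p < 0.1 or r < 0.2 else "keep"
    print(f"item {item}: difficulty={p:.2f}, item-total r={r:.2f} -> {flag}")
```

Item 0 above is answered correctly by everyone, so it carries no information about differences between test takers and gets flagged, which is exactly the kind of question a shorter test can drop.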
How does that relate to AI?
I create quantitative measures of quality for AI output.
It starts the same as it does for humans! I build tests for the AI output and create strict rubrics to define acceptable outputs.
On the front end, I design prompts and rubrics to test output quality against the provided instructions. I combine predefined metrics, industry standards, and project objectives, and structure the right data to build RAG models.
Then I create rigorous, quantitative evaluation frameworks. I incorporate metrics of LLM-to-LLM agreement, human-to-LLM agreement, and cross-LLM agreement to iteratively refine the prompts and the data fed to the model. I record the outputs and evaluation results in large data sets.
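As a minimal illustration of one such agreement metric, here is Cohen's kappa: agreement between two graders, corrected for the agreement you would expect by chance. The two graders could be a human rater and an LLM judge, or two different LLMs scoring the same outputs against a rubric. The labels below are made up for the example.

```python
# Illustrative sketch: chance-corrected agreement (Cohen's kappa) between two
# graders -- e.g. a human rater and an LLM judge -- on the same set of outputs.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical rubric verdicts on six model outputs.
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
llm   = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(human, llm):.2f}")
```

Raw percent agreement here is 5/6, but kappa is only about 0.67, because two graders who mostly say "pass" will agree often by chance alone. That correction is why agreement metrics, not raw match rates, belong in an evaluation framework.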
How does an AI Psychometrician benefit product development?
I have a Ph.D. in Measurement & Quantitative Methods. I am a research methodologist. I measure… whatever needs measuring. From here, I answer questions that help cut costs and boost consumer confidence.
Cutting Costs
- By familiarizing myself with your goals, I can identify where API calls to LLMs can be reduced through statistical models or other automated procedures.
- I determine whether using AI to fix AI hallucinations actually improves quality or simply redistributes the errors at a higher price.
- Optimize prompt lengths to reduce input tokens without sacrificing quality.
- Stop wasting money where AI is just not ready to perform the task.
Improving Quality
- Determine the number of instructions that can be included per API call before the LLM starts skipping them.
- Find the best time of day to automate a process.
- Compare performance across language inputs.
- Compare before-and-after results for system changes and prompt adjustments to see whether they actually improve the output at scale.
- Compare performance across LLMs, or after updates on the same model.
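One simple way to ground those before-and-after and cross-model comparisons is a two-proportion z-test on rubric pass rates, which asks whether an observed improvement is larger than sampling noise. This is an illustrative sketch with invented counts, not a prescribed methodology.

```python
# Illustrative sketch: did a prompt change actually improve output at scale?
# Two-proportion z-test on rubric pass rates before and after the change.
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """z statistic and two-sided p-value for the difference in pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 410/500 outputs passed the rubric before the prompt
# change, 445/500 after.
z, p = two_proportion_z(410, 500, 445, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the improvement is statistically significant; with a jump from, say, 400/500 to 402/500 it would not be, and the "improvement" should not be claimed. The same test applies to comparing two LLMs, or the same model before and after a vendor update.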
Consumer Confidence
This is a hard one. No one is willing to say it out loud, but we know there is a bubble. Like every new scientific development, we have rushed in to exploit every use of AI. While industry is already heavily invested, consumer confidence is waning. Experts in every field are shying away from its use, teachers and professors are warning against it, and study after study reports high hallucination rates. Parents fear potential dangers to their children. And both government and private investors are heavily funding research to find out what it is actually good for. AI is here to stay, but change is imminent.
- Support your claims with verifiable evidence from quality evaluations.
- Identify your product’s weaknesses before your clients do.
- Identify exactly what your clients need to double check and why.
- Know when an LLM is not yet ready for a task, and when it suddenly is.
- Prepare for external factors that might influence your output.
Regulation is just around the corner.
There are currently no age restrictions on AI/LLM use. Nor are there tests for AI literacy, common sense, or prior subject-matter expertise. We are integrating AI into everything from children’s toys to news reviews, but always with an accuracy warning. For now, the major LLMs carry a disclaimer that passes the burden of responsible use to the end user. When developing AI tools on top of those LLMs, vendors pass that responsibility straight through to their own end users. Questionable at the current level of integration, but sure, for adults this is fine.
But we are also putting AI tools in front of first-time users. EdTech companies are replacing both teachers and subject matter experts with LLMs. People who know nothing about education or the subjects being taught are putting unfiltered LLM output in front of first-time learners, passing end-user responsibility on to… K-12 students? Or must teachers now be prepared to write and edit the textbooks as well? AI schools will be hit hardest by the first wave of regulation: Scotland and New York have already taken the first steps toward guidance and legislation on AI use in classrooms, and the first major wave of platform-accountability legislation is already hitting tech titans such as Meta and X.
So what can you do about it?
Don’t wait.
Have actual measurements of accuracy and agreement to show your clients. Know where your product works well and where its outputs need review. Being honest about end-user responsibility and output accuracy builds trust. People still want to automate their workflows, and reviewing is still faster than writing the first draft. And this way, neither you nor your clients end up in hot water when the regulations hit.