@arizeai/phoenix-evals

This package provides a TypeScript evaluation library. It is vendor-agnostic and can be used independently of any framework or platform. The package is still under active development and is subject to change.

    # or yarn, pnpm, bun, etc...
    npm install @arizeai/phoenix-evals

The library provides a createClassifier function that lets you create custom LLM-based evaluators for tasks such as hallucination detection, relevance scoring, or any binary or multi-class classification.

    import { createClassifier } from "@arizeai/phoenix-evals/llm";
    import { openai } from "@ai-sdk/openai";

    const model = openai("gpt-4o-mini");

    const promptTemplate = `
    In this task, you will be presented with a query, a reference text and an answer. The answer is
    generated to the question based on the reference text. The answer may contain false information. You
    must use the reference text to determine if the answer to the question contains false information,
    if the answer is a hallucination of facts. Your objective is to determine whether the answer text
    contains factual information and is not a hallucination. A 'hallucination' refers to
    an answer that is not based on the reference text or assumes information that is not available in
    the reference text. Your response should be a single word: either "factual" or "hallucinated", and
    it should not include any other text or characters.

    [BEGIN DATA]
    ************
    [Query]: {{input}}
    ************
    [Reference text]: {{reference}}
    ************
    [Answer]: {{output}}
    ************
    [END DATA]

    Is the answer above factual or hallucinated based on the query and reference text?
    `;

// Create the classifier
const evaluator = await createClassifier({
  model,
  choices: { factual: 1, hallucinated: 0 },
  promptTemplate: promptTemplate,
});

// Use the classifier
const result = await evaluator({
  output: "Arize is not open source.",
  input: "Is Arize Phoenix Open Source?",
  reference:
    "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
});

console.log(result);
// Output: { label: "hallucinated", score: 0 }

    See the complete example in examples/classifier_example.ts.
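The {{input}}, {{reference}}, and {{output}} placeholders in the prompt template are filled from the fields passed to the evaluator. As an illustration only, a minimal mustache-style substitution might look like the sketch below; the library's own template/applyTemplate module does the real work.

```typescript
// Minimal mustache-style substitution sketch (illustrative only);
// the library's template/applyTemplate handles this internally.
function fillTemplate(
  template: string,
  variables: Record<string, string>
): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    // Leave unknown placeholders intact instead of dropping them
    (match: string, name: string) => variables[name] ?? match
  );
}

const filled = fillTemplate("[Query]: {{input}}\n[Answer]: {{output}}", {
  input: "Is Arize Phoenix Open Source?",
  output: "Arize is not open source.",
});
console.log(filled);
// [Query]: Is Arize Phoenix Open Source?
// [Answer]: Arize is not open source.
```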

    The library includes several pre-built evaluators for common evaluation tasks. These evaluators come with optimized prompts and can be used directly with any AI SDK model.

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

// Faithfulness Detection
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model,
});

// Use the evaluator
const result = await faithfulnessEvaluator({
  input: "What is the capital of France?",
  context: "France is a country in Europe. Paris is its capital city.",
  output: "The capital of France is London.",
});

console.log(result);
// Output: { label: "unfaithful", score: 0, explanation: "..." }
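Judging from the outputs above, evaluator results share a small shape: a label, a numeric score, and (for some evaluators) an explanation. A hypothetical type modeling that shape is sketched below; the package exports its own result types under types/evals.

```typescript
// Hypothetical type modeling the result shape seen in the examples above;
// the package's actual types live under types/evals.
type EvaluationResult = {
  label: string;
  score: number;
  explanation?: string;
};

const passing: EvaluationResult = { label: "faithful", score: 1 };
const failing: EvaluationResult = {
  label: "unfaithful",
  score: 0,
  explanation: "The answer contradicts the context.",
};
console.log(passing.label, failing.label); // faithful unfaithful
```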

    When your data structure doesn't match what an evaluator expects, use bindEvaluator to map your fields to the evaluator's expected input format:

import {
  bindEvaluator,
  createFaithfulnessEvaluator,
} from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

type ExampleType = {
  question: string;
  context: string;
  answer: string;
};

const evaluator = bindEvaluator<ExampleType>(
  createFaithfulnessEvaluator({ model }),
  {
    inputMapping: {
      input: "question", // Map "input" from "question"
      context: "context", // Map "context" from "context"
      output: "answer", // Map "output" from "answer"
    },
  }
);

const result = await evaluator.evaluate({
  question: "Is Arize Phoenix Open Source?",
  context:
    "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
  answer: "Arize is not open source.",
});

    Mapping supports simple properties ("fieldName"), dot notation ("user.profile.name"), array access ("items[0].id"), JSONPath expressions ("$.items[*].id"), and function extractors ((data) => data.customField).
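As an illustration only, a simple resolver for the first three mapping forms might look like the sketch below. This is a hypothetical helper, not the library's API; the real resolution logic lives in utils/objectMappingUtils, and JSONPath is omitted from this sketch.

```typescript
// Hypothetical sketch of how each mapping form could resolve against a record;
// the real logic lives inside bindEvaluator / objectMappingUtils.
type Extractor<T> = string | ((data: T) => unknown);

function resolve<T>(data: T, extractor: Extractor<T>): unknown {
  if (typeof extractor === "function") {
    return extractor(data); // function extractor
  }
  // Dot notation plus simple [index] access, e.g. "items[0].id"
  return extractor
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .reduce<unknown>((acc, key) => (acc as any)?.[key], data);
}

const data = { user: { profile: { name: "Ada" } }, items: [{ id: 7 }] };
console.log(resolve(data, "user.profile.name")); // "Ada"
console.log(resolve(data, "items[0].id")); // 7
console.log(resolve(data, (d) => d.items.length)); // 1
```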

    See the complete example in examples/bind_evaluator_example.ts.

    This package works seamlessly with @arizeai/phoenix-client to enable experimentation workflows. You can create datasets, run experiments, and trace evaluation calls for analysis and debugging.

To run experiments with your evaluations, install @arizeai/phoenix-client:

    npm install @arizeai/phoenix-client
    
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { createDataset } from "@arizeai/phoenix-client/datasets";
import {
  asExperimentEvaluator,
  runExperiment,
} from "@arizeai/phoenix-client/experiments";

// Create your evaluator
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o-mini"),
});

// Create a dataset for your experiment
const dataset = await createDataset({
  name: "faithfulness-eval",
  description: "Evaluate the faithfulness of the model",
  examples: [
    {
      input: {
        question: "Is Phoenix Open-Source?",
        context: "Phoenix is Open-Source.",
      },
    },
    // ... more examples
  ],
});

// Define your experimental task
const task = async (example) => {
  // Your AI system's response to the question
  return "Phoenix is not Open-Source";
};

// Create a custom evaluator to validate results
const faithfulnessCheck = asExperimentEvaluator({
  name: "faithfulness",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Use the faithfulness evaluator from phoenix-evals
    const result = await faithfulnessEvaluator({
      input: input.question,
      context: input.context,
      output: output,
    });

    return result; // Return the evaluation result
  },
});

// Run the experiment with automatic tracing
await runExperiment({
  experimentName: "faithfulness-eval",
  experimentDescription: "Evaluate the faithfulness of the model",
  dataset: dataset,
  task,
  evaluators: [faithfulnessCheck],
});

    To run examples, install dependencies using pnpm and run:

    pnpm install
    pnpx tsx examples/classifier_example.ts
    # change the file name to run other examples

Join our community to connect with thousands of AI builders.

    Modules

    __generated__/default_templates
    __generated__/default_templates/CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/DOCUMENT_RELEVANCE_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/FAITHFULNESS_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/HALLUCINATION_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/default_templates/TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG
    __generated__/types
    core/EvaluatorBase
    core/FunctionEvaluator
    helpers
    helpers/asEvaluatorFn
    helpers/createEvaluator
    helpers/toEvaluationResult
    index
    llm
    llm/ClassificationEvaluator
    llm/createClassificationEvaluator
    llm/createClassifierFn
    llm/createConcisenessEvaluator
    llm/createCorrectnessEvaluator
    llm/createDocumentRelevanceEvaluator
    llm/createFaithfulnessEvaluator
    llm/createHallucinationEvaluator
    llm/createToolInvocationEvaluator
    llm/createToolResponseHandlingEvaluator
    llm/createToolSelectionEvaluator
    llm/generateClassification
    llm/LLMEvaluator
    telemetry
    template
    template/applyTemplate
    template/createTemplateVariablesProxy
    template/getTemplateVariables
    types
    types/base
    types/data
    types/evals
    types/otel
    types/prompts
    types/templating
    utils
    utils/bindEvaluator
    utils/objectMappingUtils
    utils/typeUtils