Getting Started with Azure AI Project Evaluation
When building generative AI applications, especially those employing Retrieval Augmented Generation (RAG), one of the biggest challenges is consistently ensuring they meet quality, cost, and latency requirements. It's not enough to just deploy a model; I need to know it's performing as expected, day in and day out. That's where Azure AI Foundry, the successor to Azure AI Studio, has become an indispensable part of my toolkit.
Azure AI Foundry acts as a central hub for developing and deploying AI agents and applications, and its robust tools for model monitoring and evaluation are particularly valuable. It's where I can tap into Microsoft's growing suite of AI models, alongside partnerships like OpenAI, making it a strategic platform for any AI development work.
In this guide, I'll walk you through how to leverage the Azure AI Project SDK to programmatically assess generative AI models. We'll focus on key metrics like Groundedness and Relevance, which are crucial for RAG applications, and I'll demonstrate how to perform AI-assisted evaluations. By the end, you should be able to set up an evaluation pipeline and interpret its results, so your generative AI solutions consistently deliver high-quality outputs.
What I'll Share With You
- How I set up and configure the Azure AI Project SDK for Python.
- My approach to key generative AI evaluation metrics such as groundedness, relevance, and fluency.
- Practical steps I take to run AI-assisted evaluations for RAG applications.
- Methods I use to interpret evaluation results and integrate them into my development workflow.
Prerequisites
Before diving into evaluation tasks, I always ensure my environment is correctly configured. This guide assumes an intermediate level of familiarity with Python, Azure services, and fundamental generative AI concepts.
- Python 3.12 or later: Installed on your system. You can verify with `python --version` or `python3 --version`.
- Azure Subscription: An active Azure subscription is required. Make sure you have an Azure AI Foundry Project created and a model deployment available to act as your evaluation target. Also, for the AI-assisted evaluators to function, you'll need access to foundational GPT models (like GPT-3.5 Turbo or GPT-4) within your project.
- CLI Tools: Familiarity with Azure CLI or Azure Portal for resource management.
- Python Packages: `pip install azure-ai-projects azure-identity pandas tabulate`.
- IDE: An IDE like Visual Studio Code for Python development.
From my experience, the core of successful evaluation lies in a methodical setup. Here’s how I approach it.
1. Configure My Python Project and Imports
I start by creating a new Python file, often named something like evaluate_genai.py. This is where I bring in the necessary modules and configure the client to connect to my Azure AI Project. Remember to replace the placeholder values below with your actual project details; I usually store these in environment variables for better security and flexibility.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
# --- Configuration --- #
# Replace with your Azure AI Project details. I typically find the endpoint
# in the Azure AI Foundry portal under my project's settings.
AZURE_AI_PROJECT_ENDPOINT = os.environ.get("AZURE_AI_PROJECT_ENDPOINT", "your_project_endpoint")
# The name of your deployed model that I want to evaluate.
EVALUATION_TARGET_DEPLOYMENT = os.environ.get("EVALUATION_TARGET_DEPLOYMENT", "your_model_deployment_name")
# Initialize DefaultAzureCredential for authentication
credential = DefaultAzureCredential()
print("Azure credential initialized.")
# Initialize AIProjectClient
project_client = AIProjectClient(endpoint=AZURE_AI_PROJECT_ENDPOINT, credential=credential)
print("AIProjectClient initialized.")
# Base folder for outputs
OUTPUT_FOLDER = "./eval_outputs"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
print(f"Output folder '{OUTPUT_FOLDER}' ensured.")
This snippet imports the required classes for project interaction and authentication. It configures the endpoint for my AI Project and initializes AIProjectClient, which is my main entry point for managing project resources. I also set up an output directory to store results.
Azure Authentication Tip
If you encounter an AuthenticationRequiredError, the quickest fix is usually to run az login in your terminal. This ensures your DefaultAzureCredential can pick up your Azure CLI session. For production scenarios, I always advocate for using Azure Managed Identities to avoid direct credential management.
2. Prepare My Evaluation Dataset
To evaluate a RAG application effectively, I need a dataset comprising prompts and the context against which the model's answer will be judged. Crucially, the model's completion isn't part of this input data; it's generated during the evaluation run itself.
# This is an example dataset I might use for evaluation,
# mimicking a Q&A session over financial documents.
# The 'item' structure is required by the evaluation framework.
data_for_evaluation = {
    "type": "file_content",
    "content": [
        {
            "item": {
                "prompt_text": "What is the main financial challenge for NVIDIA in 2026?",
                "context_text": "On March 15, 2026, NVIDIA reported strong Q4 2025 earnings but highlighted potential challenges in the upcoming fiscal year due to intensified competition from Intel and AMD in the AI accelerator market. Supply chain analysts also noted persistent complexities in securing advanced manufacturing capacities for NVIDIA's next-generation H200 and Blackwell GPUs, which could impact revenue growth."
            }
        },
        {
            "item": {
                "prompt_text": "Explain the impact of the latest interest rate hike by the ECB on bond markets.",
                "context_text": "On April 2, 2026, the European Central Bank (ECB) announced a 25 basis point increase in its key interest rates, citing persistent inflation concerns. Following this announcement, analysts from Bloomberg reported a sharp rise in 10-year German Bund yields, indicating a broad repricing of sovereign debt in the Eurozone due to higher financing costs for governments and a decreased attractiveness of existing lower-yield bonds."
            }
        }
    ]
}
print(f"Prepared {len(data_for_evaluation['content'])} data points for evaluation.")
This dictionary defines my evaluation data source. Each item contains the prompt_text (the user's query) and context_text (the retrieved information). The model I'm evaluating will generate an answer based on these inputs, which the evaluators will then assess.
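Before handing a dataset to an evaluation run, I like to sanity-check its shape locally; a malformed entry otherwise only surfaces after the run fails on the service side. This validator is my own convention (the field names match the dataset above, not any SDK requirement):

```python
# Sanity check I run before submitting a dataset for evaluation.
# This is my own helper, not part of the Azure AI Project SDK; the
# required fields mirror the dataset structure used in this guide.
REQUIRED_FIELDS = ("prompt_text", "context_text")

def validate_eval_dataset(dataset: dict) -> list[str]:
    """Return a list of problems found; an empty list means the dataset looks OK."""
    problems = []
    if dataset.get("type") != "file_content":
        problems.append("dataset 'type' should be 'file_content'")
    for i, entry in enumerate(dataset.get("content", [])):
        item = entry.get("item")
        if item is None:
            problems.append(f"entry {i}: missing 'item' wrapper")
            continue
        for field in REQUIRED_FIELDS:
            if not item.get(field):
                problems.append(f"entry {i}: missing or empty '{field}'")
    return problems
```

Calling `validate_eval_dataset(data_for_evaluation)` on the dataset above returns an empty list.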
3. Define AI-Assisted Evaluators
In the Azure AI Project SDK, I define my evaluators as part of the testing_criteria. I specify the built-in metric name (like builtin.groundedness) and then map the data fields from my dataset to the inputs required by that evaluator.
# I define the evaluators here. These leverage foundational GPT models
# provisioned within my Azure AI Project. The 'data_mapping' connects
# my dataset columns to the evaluator's expected inputs.
testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "groundedness_eval",
        "evaluator_name": "builtin.groundedness",
        "data_mapping": {
            "question": "{{item.prompt_text}}",
            "context": "{{item.context_text}}",
            "answer": "{{target.response}}"  # 'target.response' is the model's generated answer
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "relevance_eval",
        "evaluator_name": "builtin.relevance",
        "data_mapping": {
            "question": "{{item.prompt_text}}",
            "answer": "{{target.response}}"
        },
    }
]
print(f"Configured {len(testing_criteria)} AI-assisted evaluators.")
Here, I'm setting up a list of dictionaries, each configuring an evaluator. evaluator_name specifies the metric, and data_mapping uses a template syntax to wire inputs. {{item.prompt_text}} and {{item.context_text}} refer to fields in my dataset, while {{target.response}} is a special placeholder for the completion generated by the model under evaluation. I often refer to the Azure documentation on evaluation metrics for the exact mappings.
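To build intuition for how those mappings resolve, here's a toy renderer I use when reasoning about them. The real substitution happens inside the evaluation service; this sketch only mimics the syntax for flat, dot-separated paths:

```python
import re

# Toy illustration of the {{item.*}} / {{target.*}} template syntax.
# The actual substitution is performed service-side during the run;
# this only mimics the behavior to show how fields get wired together.
def render_mapping(template: str, scope: dict) -> str:
    def resolve(match: re.Match) -> str:
        path = match.group(1).strip()  # e.g. "item.prompt_text"
        value = scope
        for part in path.split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, template)
```

For example, `render_mapping("Q: {{item.prompt_text}} A: {{target.response}}", scope)` fills in the dataset's prompt and the model's generated answer.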
4. Run the Evaluation
With my data and evaluators defined, the next step is to create an evaluation object and then trigger a run. This is an asynchronous process managed by Azure AI. I typically kick this off and then monitor its progress in the portal.
# I get the OpenAI client from the project client to interact with evaluations.
openai_client = project_client.get_openai_client()
print("Creating evaluation object...")
eval_object = openai_client.evals.create(
    name="Financial QA Evaluation",
    testing_criteria=testing_criteria,  # type: ignore
)
print(f"Evaluation object created (id: {eval_object.id}).")
# I define the data source and the target model to be evaluated.
data_source = {
    "type": "azure_ai_target_completions",
    "source": data_for_evaluation,
    "input_messages": {
        "type": "template",
        "template": [
            {
                "type": "message",
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Context: {{item.context_text}}\n\nQuestion: {{item.prompt_text}}"}
                ]
            }
        ],
    },
    "target": {
        "type": "azure_ai_model_deployment",
        "name": EVALUATION_TARGET_DEPLOYMENT
    },
}
print("Starting evaluation run...")
# Note: This operation will incur costs based on token usage of the underlying LLMs.
# As of early 2026, a representative cost might be approximately
# €0.01 per 1k input tokens and €0.03 per 1k output tokens (using $1 ≈ €0.92).
eval_run = openai_client.evals.runs.create(
    eval_id=eval_object.id,
    name="financial_qa_eval_202604",
    data_source=data_source  # type: ignore
)
print(f"Evaluation run created and is in progress (id: {eval_run.id}).")
print("Monitor the run status and view results in the Azure AI Foundry portal.")
The evals.create function registers my evaluation definition, and evals.runs.create starts the actual run. It uses the eval_id from the previous step, a human-readable name for the run (which appears in the portal), and a data_source object. This object tells Azure about the input data, how to format it into a prompt for the model, and which deployed model (target) I want to evaluate. The entire run executes in the background on Azure.
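Because each run burns tokens on both the target model and the grader models, I keep a rough back-of-envelope estimator handy before kicking off larger runs. The default rates below are only the representative figures from the comment above, not official pricing:

```python
# Back-of-envelope cost estimate for an evaluation run. The per-1k-token
# rates default to the representative early-2026 figures mentioned in
# this guide -- always check your own Azure rate card before relying on this.
def estimate_run_cost_eur(
    input_tokens: int,
    output_tokens: int,
    eur_per_1k_input: float = 0.01,
    eur_per_1k_output: float = 0.03,
) -> float:
    return round(
        input_tokens / 1000 * eur_per_1k_input
        + output_tokens / 1000 * eur_per_1k_output,
        4,
    )
```

For instance, a small run consuming about 1,500 input and 500 output tokens lands around €0.03.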
5. Analyze Evaluation Results
Once the evaluation run completes in the Azure AI Foundry portal, I can download the results as a JSONL file for detailed analysis. I usually navigate to the "Evaluations" section in my AI Project in the portal, find my run (financial_qa_eval_202604), and then download the evaluation_results.jsonl file into my eval_outputs folder. From there, I use Python to parse and analyze the results using pandas for data manipulation and tabulate for clean markdown formatting.
import pandas as pd
import json
# This step assumes I have downloaded the results from the portal.
results_path = os.path.join(OUTPUT_FOLDER, "evaluation_results.jsonl")
try:
    # The results file contains one JSON object per line.
    with open(results_path, 'r') as f:
        lines = f.readlines()
    # Each line has a complex structure; I need to extract the relevant parts.
    parsed_results = []
    for line in lines:
        data = json.loads(line)
        # Extract input data and evaluation metrics.
        # The exact structure may vary; I usually inspect my JSONL file first.
        result_item = data.get('item', {})
        metrics = data.get('metrics', {})
        parsed_results.append({
            'prompt_text': result_item.get('prompt_text'),
            'completion_text': data.get('response'),  # The generated response
            'groundedness_score': metrics.get('groundedness_eval.groundedness'),
            'relevance_score': metrics.get('relevance_eval.relevance')
        })
    results_df = pd.DataFrame(parsed_results)
    print("\n--- Detailed Per-Instance Results (first 2 rows) ---")
    print(results_df.head(2).to_markdown(index=False))
    # Example: Calculate and display aggregate metrics.
    print("\n--- Aggregate Evaluation Results ---")
    print(f"  Average Groundedness: {results_df['groundedness_score'].mean():.2f}")
    print(f"  Average Relevance: {results_df['relevance_score'].mean():.2f}")
except FileNotFoundError:
    print(f"\nAnalysis skipped: Could not find '{results_path}'. Please download it from the Azure AI Foundry portal first.")
This script reads the downloaded JSONL file line by line, parses the complex JSON structure to extract the input prompt, the model's generated completion, and the scores for each metric. I then load this data into a Pandas DataFrame, which makes analysis and aggregation straightforward.
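Averages can hide individual failures, so alongside the aggregates I flag rows whose scores fall below a threshold and inspect those prompt/context pairs by hand. Here's a minimal sketch over the parsed rows, assuming the common 1-to-5 scoring scale (tune the threshold to your own quality bar):

```python
# Flag individual results whose scores fall below a minimum threshold so
# the offending prompt/context pairs can be inspected manually. The 4.0
# default assumes the common 1-to-5 scale used by AI-assisted evaluators.
def flag_low_scores(results: list[dict], threshold: float = 4.0) -> list[dict]:
    flagged = []
    for row in results:
        low = {
            key: value
            for key, value in row.items()
            if key.endswith("_score") and value is not None and value < threshold
        }
        if low:
            flagged.append({"prompt_text": row.get("prompt_text"), **low})
    return flagged
```

Running it over `parsed_results` from the script above gives a short list of rows worth a manual look.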
6. Monitor Results in Azure AI Foundry Portal
While the SDK gives me programmatic control, I still rely heavily on the Azure AI Foundry portal. It offers a centralized UI for monitoring my evaluation runs, which is particularly useful for operational oversight.
- Navigate to Azure AI Foundry: Log in to the Azure AI Foundry portal.
- Locate My Project: Select your Azure AI Project.
- Find Evaluation Runs: Navigate to the "Evaluations" section. I should see a list of evaluations and their runs, including `financial_qa_eval_202604`.
- Review Metrics and Details: Clicking on my evaluation run shows me aggregate metrics, detailed per-instance results, and visualizations of performance. It's the primary interface for tracking trends in metrics like groundedness and relevance over time.
Troubleshooting Run Visibility
If an evaluation run isn't visible in the portal, I always double-check that my AZURE_AI_PROJECT_ENDPOINT in the Python script correctly matches the Azure AI Project I'm viewing. I also confirm the evaluation run was created without errors in the first place.
Visualizing the Evaluation Flow
I find that a diagram often helps clarify the evaluation process, especially when multiple components are involved. In short, the data and control flow works in four stages: my script registers the evaluation definition (`evals.create`) and kicks off a run (`evals.runs.create`); Azure AI renders each dataset item through the input-message template and sends it to the target model deployment; the AI-assisted evaluators score each generated completion against the mapped `item` fields; and the aggregated results surface in the Azure AI Foundry portal, from which I download the JSONL for local analysis.
Production Considerations
Deploying generative AI applications requires careful consideration of security, performance, scalability, and continuous monitoring. These are areas where I spend a lot of my time ensuring robust solutions.
Security Best Practices
- Least Privilege: I always ensure that any service principal or managed identity used by my evaluation pipelines has only the necessary permissions, nothing more.
- Data Protection: Encrypting evaluation datasets at rest and in transit is non-negotiable. I also implement data anonymization for any sensitive data.
- Managed Identities: For production workloads, I exclusively use Azure Managed Identities for authenticating to Azure services. It simplifies credential management and enhances security.
Performance Optimization
- Batch Processing: The evaluation framework is inherently designed to process datasets in batches efficiently, so I leverage that.
- Evaluator Choice: While GPT-4 offers superior reasoning, I've found it can be more expensive and slower. For larger-scale evaluations where cost and latency are concerns, I often opt for GPT-3.5 Turbo if its quality is acceptable for the task.
Scalability Considerations
- Azure Machine Learning Pipelines: Integrating my evaluation logic into Azure Machine Learning pipelines allows for automated, scalable execution in MLOps workflows.
- Regional Deployment: For my European projects, I always deploy Azure AI resources in European regions such as `westeurope` or `northeurope` to minimize latency for EU-based users and ensure data residency compliance.
Monitoring Recommendations
- Alerting: Setting up Azure Monitor alerts on key evaluation metrics is crucial. For instance, I might trigger an alert if the average groundedness score drops below a predefined threshold.
- Dashboarding: I create custom dashboards, either in Azure AI Foundry or Azure Monitor, to visualize trends in evaluation metrics over time. This helps me spot regressions quickly.
- Cost Monitoring: Continuously monitoring the cost associated with my LLM calls for evaluation is vital. I establish budgets and cost alerts to prevent unexpected overspending.
Conclusion
Evaluating generative AI models, particularly in a RAG context, is not a one-time task but an ongoing process. What I've shared here is my approach to integrating evaluation into the development lifecycle using the Azure AI Project SDK and the Azure AI Foundry portal. The ability to programmatically define evaluators, run tests against a deployed model, and then analyze the results is foundational to building reliable and performant AI applications. It's the 'last mile' that ensures the AI I build actually delivers its promised ROI.
My recommendation is to integrate these evaluations directly into your CI/CD pipelines. Set clear metric thresholds, and let your pipeline fail if those thresholds aren't met. This creates a powerful guardrail against model degradation. While built-in evaluators provide a great starting point, don't shy away from exploring custom evaluators for niche requirements.
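As a sketch of that guardrail, this is the shape of the gate I put in a pipeline step; the metric names and threshold values here are illustrative, not prescribed:

```python
# Illustrative CI/CD quality gate: report which aggregate metrics miss
# their thresholds. Metric names and threshold values are examples --
# set them to match your own evaluation run and quality bar.
def metrics_gate(averages: dict, thresholds: dict) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    return [
        f"{metric}: {averages.get(metric, 0):.2f} < required {minimum:.2f}"
        for metric, minimum in thresholds.items()
        if averages.get(metric, 0) < minimum
    ]
```

In the pipeline script I finish with `raise SystemExit(1)` when the returned list is non-empty, which fails the build and blocks the deployment.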
Key Takeaways
- Azure AI Foundry provides a robust platform for programmatic evaluation of generative AI models, crucial for RAG applications.
- The `azure.ai.projects` SDK enables defining evaluation criteria, connecting datasets, and triggering evaluation runs directly from Python.
- Key metrics like Groundedness and Relevance are automatically assessed using AI-assisted evaluators hosted within the Azure AI Project.
- Monitoring results in the Azure AI Foundry portal and integrating analysis into CI/CD pipelines are essential for maintaining model quality.
- Cost monitoring of LLM usage during evaluations is critical to manage expenses, with typical costs around €0.01 per 1k input tokens and €0.03 per 1k output tokens (using $1 ≈ €0.92).