Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments

print(“\nPART 5 ── Datasets & experiments ————————————–“)
DATASET = “capital-cities-tutorial”
langfuse.create_dataset(name=DATASET, description=”Capital-city QA benchmark”)
_items = [
(“What is the capital of France?”, “Paris”),
(“What is the capital of Germany?”, “Berlin”),
(“What is the capital of Japan?”, “Tokyo”),
(“What is the capital of Italy?”, “Rome”),
]
for i, (q, a) in enumerate(_items):
langfuse.create_dataset_item(dataset_name=DATASET, id=f”cap-{i}”,
input={“question”: q}, expected_output=a)
def capital_task(*, item, **kwargs):
question = item.input[“question”] if isinstance(item.input, dict) else item.input
return llm_chat([{“role”: “user”, “content”: question}], name=”experiment-answer”)
def accuracy(*, input, output, expected_output, metadata=None, **kwargs):
hit = bool(expected_output) and expected_output.lower() in (output or “”).lower()
return Evaluation(name=”accuracy”, value=1.0 if hit else 0.0,
comment=”exact-match contains check”)
def conciseness(*, input, output, **kwargs):
return Evaluation(name=”char_length”, value=float(len(output or “”)))
def mean_accuracy(*, item_results, **kwargs):
vals = [e.value for r in item_results for e in r.evaluations if e.name == “accuracy”]
avg = sum(vals) / len(vals) if vals else 0.0
return Evaluation(name=”mean_accuracy”, value=avg, comment=f”{avg:.0%} correct”)
dataset = langfuse.get_dataset(DATASET)
result = dataset.run_experiment(
name=”capitals-baseline”,
description=”Baseline run from the Colab tutorial”,
task=capital_task,
evaluators=[accuracy, conciseness],
run_evaluators=[mean_accuracy],
max_concurrency=4,
)
print(result.format())

What's Hot

Walmart’s AI workflows meet the realities of the balance sheet

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

Walmart’s AI workflows meet the realities of the balance sheet

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

How to Fine-Tune LFM2 Using QLoRA and DPO: A Complete Step-by-Step Coding Tutorial on Google Colab

From Local App to Public Website in Minutes

Anthropic IPO filing marks AI maturing into enterprise utility

Walmart’s AI workflows meet the realities of the balance sheet

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

Exploring Income Patterns with Python Pandas, Matplotlib, and Seaborn

Google Will Now Let You Virtually Try on Clothes With Just a Selfie

What’s in a Name? How to Get Your Domain Right

Speed Across the Galaxy Next Year in Star Wars: Galactic Racer

News

Company

Services

What's Hot

Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments

Related Posts

News

Company

Services

Subscribe to Updates