AI Evaluation Engineer · LLM Testing Specialist · AI QA Engineer

QA Engineer Specializing in AI Evaluation, RAG Testing & LLM Reliability

I specialize in testing and validating AI-powered systems, focusing on RAG architectures, conversational AI assistants, structured data extraction, bias detection, and AI decision reliability. My goal is to make AI systems measurable, stable, and production-safe.

Berlin, Germany
Sorabh Vasudeva
SV

About

From 14 years of software quality engineering to the frontier of AI evaluation and reliability.

Over the past 14 years, I have shaped quality engineering practices across startups, scale-ups, and enterprise organizations. My work spans building automation infrastructure at Urban Sports Club, leading QA for multi-country mobile releases at Vivy GmbH, and driving regression strategy at Censhare.

As a freelance QA consultant, I partnered with SaaS platforms like Carbon6 and Jobilla to design automation frameworks using Selenium, Cypress, and Playwright—delivering measurable improvements in coverage, cycle time, and defect detection.

Today, I am transitioning into AI Evaluation and AI Reliability Engineering. I evaluate production AI assistants, test RAG architectures, detect hallucinations and bias in LLM outputs, run adversarial red-team scenarios, and build structured evaluation frameworks—ensuring AI systems are measurable, stable, and production-safe.

AI Evaluation & LLM Testing
Red Teaming & AI Safety
RAG & Retrieval Validation
Bias & Fairness Testing
Automation Architecture

Technical Stack

Three pillars of expertise driving AI evaluation and quality engineering.

AI & LLM Tools

  • Promptfoo
  • LLM-as-a-Judge Evaluation
  • Red Teaming Methodologies
  • RAG Evaluation Techniques
  • Vector Database Testing
  • Embedding Validation
  • JSON Schema Validation

Automation

  • JavaScript & Python
  • Cypress & Playwright
  • Selenium & Detox
  • API Testing (REST, Postman)
  • CI/CD Integration for AI Regression
  • Performance Testing (K6)
  • BDD & Cucumber

AI Evaluation Methods

  • Groundedness Testing
  • Hallucination Detection
  • Retrieval Recall@K Evaluation
  • Bias & Fairness Testing
  • Counterfactual Testing
  • Adversarial Testing
  • Consistency Scoring

Experience

A trajectory of increasing scope and technical depth across diverse domains.

Senior QA Engineer

Talon.One

Dec 2025 – Present

Driving quality engineering for a leading promotion and loyalty engine, with a focus on Cypress-based end-to-end automation and AI-assisted testing workflows. Designing and expanding test coverage for complex campaign logic and API integrations while applying AI models to enhance test generation, defect prediction, and regression analysis across the platform.

CypressAI EvaluationAPI AutomationPromotion EngineE2E Testing

Freelance QA Automation Engineer

Jobilla

2022 – 2025

Drove a 30% increase in automation coverage by architecting end-to-end test suites for a recruitment platform. Led QA strategy across API and UI layers, building scalable web automation frameworks with Selenium, Cypress, and Playwright to ensure reliable cross-browser coverage and faster release cycles.

SeleniumCypressPlaywrightPythonK6AI Evaluation

Freelance QA Consultant

Carbon6

2021 – 2022

Strengthened automation coverage for SaaS platforms serving 1,000+ Amazon sellers. Designed UI and API test frameworks that ensured reliable workflows across inventory forecasting, catalog management, and ad campaign modules. Prepared automation insights that contributed to fundraising due diligence.

API TestingUI AutomationSaaSCross-Functional

Lead Quality Assurance Engineer

Vivy GmbH

2021

Led QA for native Android and iOS health app releases across multiple countries. Managed end-to-end release cycles including App Store and Play Store submissions, while improving cross-team collaboration between development, localization, and compliance teams.

Mobile QARelease ManagementHealthcareMulti-Country

Senior QA Automation Engineer

Urban Sports Club GmbH

2019 – 2021

Built the automation infrastructure from scratch, establishing the foundation for scalable test coverage. Integrated API tests into CI pipelines, mentored junior QA engineers, and implemented BDD testing with Cucumber—reducing QA cycle time by 25% across the organization.

Automation ArchitectureCI/CDBDDCucumberMentorship

Senior QA / Scrum Master

Censhare

2017 – 2018

Spearheaded regression testing during a critical migration from legacy to modern web-based architecture. Authored BDD test cases, collaborated with developers and product owners to align coverage with evolving requirements, and improved defect detection rate by 40% through enhanced QA frameworks.

Regression TestingScrum MasterBDDPlatform Migration

Co-Founder & Test Lead

Xtronit Solutions

2014 – 2016

Co-founded a technology solutions venture, leading test strategy and delivery from the ground up. Defined QA processes, built client-facing test frameworks, and managed end-to-end quality across multiple client engagements—combining entrepreneurial ownership with deep technical execution.

EntrepreneurshipTest StrategyClient DeliveryQA Leadership

Senior Software Engineer

Infosys Limited

2009 – 2013

Built a strong engineering foundation across banking, telecom, and enterprise domains. Developed testing methodologies and contributed to large-scale quality assurance initiatives within one of the world’s largest IT services organizations.

Enterprise QABankingTelecom

AI Evaluation Experience

Evaluating a production AI assistant that processes tenant applications through voice, RAG, and structured data extraction.

I evaluate a production AI Assistant that conducts voice conversations with tenants, converts speech to text, retrieves context through RAG (Retrieval-Augmented Generation) with a vector database, extracts structured data (income, employment type, pets, and more), and generates tenant recommendations for landlords.

01

Hallucination Detection

Identifying cases where the AI fabricates information not present in the source data. I validate groundedness of generated responses against retrieved documents and tenant inputs to ensure factual accuracy.

02

Structured Extraction Validation

Testing accuracy of data extraction from conversational inputs—income amounts, employment types, pet ownership, and other tenant attributes. I surface extraction errors, contradictory state handling, and edge cases in parsing.

03

RAG Retrieval Quality

Evaluating retrieval relevance and recall in the RAG pipeline. I measure whether the correct context is retrieved from the vector database and whether the generation model uses it accurately to produce grounded responses.

04

Adversarial Red Teaming

Running red-team scenarios to test system resilience against prompt injection, instruction override, policy manipulation, and social engineering. I probe boundaries to ensure the AI maintains safe, policy-compliant behavior.

05

Bias & Fairness Testing

Detecting potential bias in tenant recommendations through counterfactual testing—changing demographic attributes while keeping all other inputs constant to identify discriminatory patterns in AI decision-making.

06

Consistency & Regression Testing

Testing stability of outputs across multiple runs and LLM model updates. I measure recommendation consistency, detect response drift, and perform regression testing to prevent decision instability after model changes.

Key Achievements

Measurable outcomes delivered across organizations and domains.

30%

Automation Coverage Increase

Expanded test automation coverage within three months by architecting scalable frameworks and targeted test suites.

25%

QA Cycle Time Reduction

Streamlined testing processes and CI/CD integration to accelerate release cycles without compromising quality.

40%

Defect Detection Improvement

Enhanced QA frameworks and regression strategies to surface critical defects earlier in the development lifecycle.

Infrastructure from Scratch

Designed and built complete automation infrastructure at Urban Sports Club, setting the foundation for ongoing quality engineering.

Distributed Team Leadership

Led QA initiatives across geographically distributed teams, aligning quality standards and practices across multiple countries.

Certifications

Verified credentials across project management, agile, and AI engineering.

Building Reliable AI Systems

My philosophy on AI evaluation and why it matters.

AI as Critical Infrastructure

AI systems that make decisions about people must be tested like critical infrastructure. The stakes are too high for untested models to reach production.

Structured Evaluation is Non-Negotiable

Conversational AI requires structured evaluation frameworks—not ad-hoc testing. Systematic evaluation with defined metrics is the only path to reliable AI.

Bias Testing is a Core Requirement

Decision-making AI must be bias-tested and stable. Counterfactual testing and fairness audits should be embedded in every AI quality process.

Regression Testing for LLMs

Model updates can silently degrade performance. Regression testing is essential for LLM-based products to prevent decision instability.

Red Teaming is AI Safety

Red teaming is not optional—it is a core part of AI safety. Adversarial testing must be continuous, structured, and integrated into the development lifecycle.

Get in Touch

Let’s build reliable AI systems together.