AI Evaluation Engineer · LLM Testing Specialist · AI QA Engineer
I specialize in testing and validating AI-powered systems, focusing on RAG architectures, conversational AI assistants, structured data extraction, bias detection, and AI decision reliability. My goal is to make AI systems measurable, stable, and production-safe.
From 14 years of software quality engineering to the frontier of AI evaluation and reliability.
Over the past 14 years, I have shaped quality engineering practices across startups, scale-ups, and enterprise organizations. My work spans building automation infrastructure at Urban Sports Club, leading QA for multi-country mobile releases at Vivy GmbH, and driving regression strategy at Censhare.
As a freelance QA consultant, I partnered with SaaS platforms like Carbon6 and Jobilla to design automation frameworks using Selenium, Cypress, and Playwright—delivering measurable improvements in coverage, cycle time, and defect detection.
Today, I am transitioning into AI Evaluation and AI Reliability Engineering. I evaluate production AI assistants, test RAG architectures, detect hallucinations and bias in LLM outputs, run adversarial red-team scenarios, and build structured evaluation frameworks—ensuring AI systems are measurable, stable, and production-safe.
Three pillars of expertise driving AI evaluation and quality engineering.
A trajectory of increasing scope and technical depth across diverse domains.
Talon.One
Driving quality engineering for a leading promotion and loyalty engine, with a focus on Cypress-based end-to-end automation and AI-assisted testing workflows. Designing and expanding test coverage for complex campaign logic and API integrations while applying AI models to enhance test generation, defect prediction, and regression analysis across the platform.
Jobilla
Drove a 30% increase in automation coverage by architecting end-to-end test suites for a recruitment platform. Led QA strategy across API and UI layers, building scalable web automation frameworks with Selenium, Cypress, and Playwright to ensure reliable cross-browser coverage and faster release cycles.
Carbon6
Strengthened automation coverage for SaaS platforms serving 1,000+ Amazon sellers. Designed UI and API test frameworks that ensured reliable workflows across inventory forecasting, catalog management, and ad campaign modules. Prepared automation insights that contributed to fundraising due diligence.
Vivy GmbH
Led QA for native Android and iOS health app releases across multiple countries. Managed end-to-end release cycles including App Store and Play Store submissions, while improving cross-team collaboration between development, localization, and compliance teams.
Urban Sports Club GmbH
Built the automation infrastructure from scratch, establishing the foundation for scalable test coverage. Integrated API tests into CI pipelines, mentored junior QA engineers, and implemented BDD testing with Cucumber—reducing QA cycle time by 25% across the organization.
Censhare
Spearheaded regression testing during a critical migration from legacy to modern web-based architecture. Authored BDD test cases, collaborated with developers and product owners to align coverage with evolving requirements, and improved defect detection rate by 40% through enhanced QA frameworks.
Xtronit Solutions
Co-founded a technology solutions venture, leading test strategy and delivery from the ground up. Defined QA processes, built client-facing test frameworks, and managed end-to-end quality across multiple client engagements—combining entrepreneurial ownership with deep technical execution.
Infosys Limited
Built a strong engineering foundation across banking, telecom, and enterprise domains. Developed testing methodologies and contributed to large-scale quality assurance initiatives within one of the world’s largest IT services organizations.
Evaluating a production AI assistant that processes tenant applications through voice, RAG, and structured data extraction.
I evaluate a production AI assistant that conducts voice conversations with tenants and converts speech to text. It retrieves context through Retrieval-Augmented Generation (RAG) backed by a vector database, extracts structured data (income, employment type, pets, and more), and generates tenant recommendations for landlords.
Identifying cases where the AI fabricates information not present in the source data. I validate groundedness of generated responses against retrieved documents and tenant inputs to ensure factual accuracy.
Testing the accuracy of data extraction from conversational inputs: income amounts, employment types, pet ownership, and other tenant attributes. I surface extraction errors, mishandling of contradictory statements, and parsing edge cases.
Evaluating retrieval relevance and recall in the RAG pipeline. I measure whether the correct context is retrieved from the vector database and whether the generation model uses it accurately to produce grounded responses.
Running red-team scenarios to test system resilience against prompt injection, instruction override, policy manipulation, and social engineering. I probe boundaries to ensure the AI maintains safe, policy-compliant behavior.
Detecting potential bias in tenant recommendations through counterfactual testing—changing demographic attributes while keeping all other inputs constant to identify discriminatory patterns in AI decision-making.
Testing stability of outputs across multiple runs and LLM model updates. I measure recommendation consistency, detect response drift, and perform regression testing to prevent decision instability after model changes.
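The counterfactual bias check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production setup: the `Applicant` fields, the `nationality` attribute, and both recommender functions are hypothetical stand-ins for the real assistant. One protected attribute is changed while every other input stays fixed, and any change in the recommendation is recorded.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Applicant:
    # Hypothetical tenant profile; field names are illustrative only.
    monthly_income: int
    employment_type: str
    has_pets: bool
    nationality: str  # protected attribute varied in the test


Recommender = Callable[[Applicant], str]


def counterfactual_flips(
    recommend: Recommender,
    applicant: Applicant,
    attribute: str,
    alternatives: List[str],
) -> Tuple[str, Dict[str, str]]:
    """Vary one attribute, keep all other inputs constant, and
    return the baseline decision plus every value that changed it."""
    baseline = recommend(applicant)
    flips: Dict[str, str] = {}
    for value in alternatives:
        variant = replace(applicant, **{attribute: value})
        outcome = recommend(variant)
        if outcome != baseline:
            flips[value] = outcome
    return baseline, flips


def biased_recommender(a: Applicant) -> str:
    # Deliberately biased stub so the check has something to detect.
    if a.monthly_income < 2500:
        return "reject"
    return "reject" if a.nationality == "B" else "accept"


def fair_recommender(a: Applicant) -> str:
    # Stub that ignores the protected attribute entirely.
    return "accept" if a.monthly_income >= 2500 else "reject"


applicant = Applicant(3200, "permanent", False, "A")
baseline, flips = counterfactual_flips(
    biased_recommender, applicant, "nationality", ["B", "C"]
)
print(baseline, flips)
```

In a real evaluation the stub would be replaced by a call to the assistant, and because LLM outputs can vary between runs, each variant would typically be sampled several times before a flip is counted as evidence of bias.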
Measurable outcomes delivered across organizations and domains.
Expanded test automation coverage by 30% within three months by architecting scalable frameworks and targeted test suites.
Reduced QA cycle time by 25% by streamlining testing processes and CI/CD integration, accelerating release cycles without compromising quality.
Improved defect detection rate by 40% through enhanced QA frameworks and regression strategies that surface critical defects earlier in the development lifecycle.
Designed and built complete automation infrastructure at Urban Sports Club, setting the foundation for ongoing quality engineering.
Led QA initiatives across geographically distributed teams, aligning quality standards and practices across multiple countries.
Verified credentials across project management, agile, and AI engineering.
Neue Fische · Part 1 (Data Scientist)
Neue Fische · Part 2 (ML Engineer)
PMI · PMP
CSM
My philosophy on AI evaluation and why it matters.
AI systems that make decisions about people must be tested like critical infrastructure. The stakes are too high for untested models to reach production.
Conversational AI requires structured evaluation frameworks—not ad-hoc testing. Systematic evaluation with defined metrics is the only path to reliable AI.
Decision-making AI must be bias-tested and stable. Counterfactual testing and fairness audits should be embedded in every AI quality process.
Model updates can silently degrade performance. Regression testing is essential for LLM-based products to prevent decision instability.
Red teaming is not optional—it is a core part of AI safety. Adversarial testing must be continuous, structured, and integrated into the development lifecycle.
Let’s build reliable AI systems together.