Gramian Consulting Group

AI Evaluation Engineer (Knowledge & Research)

Location: Remote, GH
Workplace: Remote
Employment: Contract
Language: English
Posted: May 4, 2026
Last verified: May 11, 2026

JobGrid listing details

JobGrid.eu keeps the employer description in its original language and adds clear listing facts, freshness, and source context so candidates can evaluate the role before applying.

Key details: 1 location, Remote, Contract
Current openings: 20 active jobs
Original language: English
Source and freshness: Collected from public career pages and reviewed through JobGrid.eu source availability checks. Last verified: May 11, 2026.
Apply path: JobGrid.eu sends candidates to the original application page and adds non-personal referral parameters.

About Us

Gramian Consulting Group is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.

Role overview

We are looking for an AI Evaluation Engineer with a strong research background to design and evaluate complex, multi-agent tasks used to benchmark next-generation AI systems.

In this role, you will work at the intersection of research, data structuring, and AI evaluation, building high-quality tasks that require deep document understanding, structured reasoning, and multi-step synthesis. You will create datasets and evaluation frameworks that test whether AI agents can truly read, reason, and extract knowledge from large-scale unstructured data.

This is a high-precision, detail-oriented role requiring strong analytical thinking, structured problem decomposition, and the ability to translate research content into measurable evaluation tasks.

Commitment required: 8 hours per day, with at least 4 hours of overlap with PST.

Employment type: Contractor assignment (no medical or paid leave)

Duration of contract: 5+ weeks

Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam

Interview: take-home assessment (60 minutes)

Responsibilities

  • Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections
  • Curate real-world research corpora (academic papers, case studies, technical reports) and design questions that require comprehensive analysis
  • Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
  • Design LLM judge prompts that evaluate agent output field-by-field against the oracle
  • Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)
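To make the oracle-and-judge workflow above concrete, here is a minimal sketch of a field-by-field check of an agent's structured answer against a JSON ground-truth oracle. All task fields, example values, and the exact-match comparison (a stand-in for the LLM judge the role describes) are illustrative assumptions, not the employer's actual format:

```python
import json

# Hypothetical ground-truth oracle for one benchmark task: each field holds a
# specific, verifiable answer that can only be produced by reading the sources.
ORACLE_JSON = """
{
  "task_id": "lit-review-001",
  "fields": {
    "primary_method": "contrastive pretraining",
    "reported_accuracy": "91.4%",
    "dataset_used": "ImageNet-1k"
  }
}
"""

def score_agent_output(oracle: dict, agent_answer: dict) -> dict:
    """Compare an agent's structured answer to the oracle field-by-field.

    Returns a per-field verdict plus an overall score. A real evaluation
    would replace the exact-match check with an LLM judge call per field.
    """
    verdicts = {}
    expected = oracle["fields"]
    for field, truth in expected.items():
        got = agent_answer.get(field)
        verdicts[field] = (got is not None
                           and got.strip().lower() == truth.strip().lower())
    return {"verdicts": verdicts,
            "score": sum(verdicts.values()) / len(expected)}

oracle = json.loads(ORACLE_JSON)
answer = {
    "primary_method": "Contrastive pretraining",
    "reported_accuracy": "91.4%",
    "dataset_used": "COCO",  # deliberately wrong, to show a failed field
}
result = score_agent_output(oracle, answer)
```

Here the agent's answer passes two of three fields, so the task scores 2/3; a decomposition guide would additionally specify which sub-agent was responsible for producing each field.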
