Data Annotator - Agents and Tool Calling Expertise

Upwork | CA | Not specified | intermediate

Machine Learning, Python, Image Annotation, Image Segmentation, Machine Learning Model, Video Annotation, Data Annotation, Video Editing & Production, Generative AI, Deep Learning, LLM Prompt Engineering, AI Agent Development, n8n, Automation, API

Background:
Must be familiar with tool-calling LLMs, including tool schemas and the responses an environment is expected to return for a given action
Must hold a CS degree and have some background in using and building with agents
Must be at least 18 years of age


Task (two separate tasks):

Given a conversation (100 conversations in total) in the format State: [state] Action: [action], together with a predicted output of the action on the current state (Next State: [predicted next state]), judge the predicted next state on the following criteria:

training_utility (would this prediction be useful for training a model?):
= Excellent: correct outcome type, correct schema, realistic values – ideal training signal
= Good: minor value errors but correct structure and outcome type
= Mediocre: correct outcome type but wrong schema or implausible values
= Poor: wrong outcome type (success vs. error) or severely hallucinated structure
= Useless: completely wrong, nonsensical, or would teach wrong behaviors

realism (does the predicted response look like a real API response for this domain?):
= Indistinguishable from a real API response
= Mostly realistic with minor inconsistencies (e.g. slightly wrong field names but still valid)
= Partially realistic but with notable hallucinations (invented fields that do not look realistic or fitting to the tool response)
= Implausible schema or values for this domain
= Completely unrealistic (random/nonsensical data)

outcome_correctness (does the model correctly predict the outcome of the action?):
= Exactly correct outcome type and key facts
= Correct outcome type with minor factual errors
= Correct outcome type but significantly wrong facts
= Wrong outcome type (e.g. predicted success but expected error)
= Completely incorrect
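
As an illustration only, one judgment across the three criteria above could be recorded as follows. This is a minimal sketch, not the client's actual format: the listing gives five levels per criterion but no numeric scale, so the 1-5 scoring (5 = first listed level, 1 = last) is an assumption, as are the names Judgment and conversation_id.

```python
from dataclasses import dataclass

# Assumption: each criterion is scored 1..5, where 5 is the first level
# listed (e.g. "Excellent") and 1 is the last (e.g. "Useless").
SCALE = range(1, 6)
CRITERIA = ("training_utility", "realism", "outcome_correctness")


@dataclass(frozen=True)
class Judgment:
    """One annotator judgment of a predicted next state (hypothetical shape)."""

    conversation_id: int  # hypothetical identifier for the conversation
    training_utility: int
    realism: int
    outcome_correctness: int

    def __post_init__(self) -> None:
        # Reject scores that fall outside the assumed five-level scale.
        for name in CRITERIA:
            score = getattr(self, name)
            if score not in SCALE:
                raise ValueError(f"{name} must be in 1..5, got {score!r}")


# Example: correct outcome type and structure, with minor value errors.
example = Judgment(
    conversation_id=1,
    training_utility=4,
    realism=5,
    outcome_correctness=4,
)
```

Scoring the three criteria independently keeps cases such as "realistic-looking but wrong outcome type" distinguishable in the collected labels.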


Preference understanding:
Given a set of 100 comparative tool outputs, rank the candidates by how likely you would be to see each one in a real trajectory for the given task.
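
A ranking for one comparison could be captured as an ordered list of candidate labels, most likely first. The sketch below is hypothetical: the listing does not say how many candidates each comparison contains or how they are labeled, so the labels "A", "B", "C" and the helper validate_ranking are assumptions.

```python
def validate_ranking(candidates: list[str], ranking: list[str]) -> list[str]:
    """Check that a ranking lists every candidate exactly once, then return it."""
    if sorted(ranking) != sorted(candidates):
        raise ValueError("ranking must list every candidate exactly once")
    return ranking


# Hypothetical comparison with three candidate tool outputs.
ranking = validate_ranking(["A", "B", "C"], ["B", "A", "C"])
most_likely = ranking[0]  # the output judged most likely in a real trajectory
```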

Message if interested.

Client

Spent: $33,119.4 | Rating: 4.7 | Verified