Data Annotator - Agents and Tool Calling Expertise
Upwork | CA | Not specified | intermediate
Machine Learning, Python, Image Annotation, Image Segmentation, Machine Learning Model, Video Annotation, Data Annotation, Video Editing & Production, Generative AI, Deep Learning, LLM Prompt Engineering, AI Agent Development, n8n, Automation, API
Background:
1. Must be familiar with tool-calling LLMs, including tool schemas and the responses an environment is expected to return for a given action
2. Must hold a CS degree and have some background in using and building with agents
3. Must be at least 18 years of age
Task (two separate tasks):
1. Prediction rating:
You are given 100 conversations, each in the format State: [state] Action: [action], together with a predicted output of the action on the current state (Next State: [predicted next state]). Judge each predicted next state on the basis of:
training_utility (would this prediction be useful for training a model?):
5 = Excellent: correct outcome type, correct schema, realistic values – ideal training signal
4 = Good: minor value errors but correct structure and outcome type
3 = Mediocre: correct outcome type but wrong schema or implausible values
2 = Poor: wrong outcome type (success vs. error) or severely hallucinated structure
1 = Useless: completely wrong, nonsensical, or would teach wrong behaviors
realism (does the predicted response look like a real API response for this domain?):
5 = Indistinguishable from a real API response
4 = Mostly realistic with minor inconsistencies (e.g. slightly wrong field names but still valid)
3 = Partially realistic but with notable hallucinations (invented fields that look unrealistic or do not fit the tool response)
2 = Implausible schema or values for this domain
1 = Completely unrealistic (random/nonsensical data)
outcome_correctness (does the model correctly predict the outcome of the action?):
5 = Exactly correct outcome type and key facts
4 = Correct outcome type with minor factual errors
3 = Correct outcome type but significantly wrong facts
2 = Wrong outcome type (e.g. predicted success but expected error)
1 = Completely incorrect
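
To make the three scales above concrete, here is a minimal Python sketch of how one judgment could be recorded and checked. The record layout, the `Rating` class, the `conversation_id` field, and the `add_to_cart` tool in the sample item are all assumptions made for illustration; the posting does not specify a submission format.

```python
from dataclasses import dataclass

# The three rubric axes named in the posting; each is scored from 1 to 5.
RUBRIC_AXES = ("training_utility", "realism", "outcome_correctness")


@dataclass(frozen=True)
class Rating:
    """One judgment of a single predicted next state (hypothetical layout)."""
    conversation_id: str
    training_utility: int
    realism: int
    outcome_correctness: int

    def __post_init__(self):
        # Reject any score outside the 1-5 range used by all three scales.
        for axis in RUBRIC_AXES:
            score = getattr(self, axis)
            if not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{axis} must be an integer from 1 to 5, got {score!r}")


# Illustrative item only: the tool name and fields are invented for this example.
example_item = {
    "state": {"cart": []},
    "action": {"tool": "add_to_cart", "arguments": {"sku": "A-100", "qty": 2}},
    "predicted_next_state": {"status": "success", "cart": [{"sku": "A-100", "qty": 2}]},
}

rating = Rating("conv-001", training_utility=5, realism=4, outcome_correctness=5)
```

Keeping the three scores as separate fields mirrors the rubric: a prediction can have the correct outcome type and still lose points on realism.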
2. Preference understanding:
Given a set of 100 comparative tool outputs, rank which one you would most likely see in a real trajectory for the given task.
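
As a rough sketch of what a ranking answer for the task above might look like, the Python snippet below treats a ranking as an ordering of candidate indices, most likely first. This representation and the `validate_ranking` helper are assumptions; the posting does not say whether a full ordering or only a top choice is expected.

```python
def validate_ranking(num_candidates, ranking):
    """Return True if `ranking` lists every candidate index exactly once."""
    return sorted(ranking) == list(range(num_candidates))


# Hypothetical answer for three candidate tool outputs:
# candidate 2 looks most like a real response, candidate 1 the least.
example_ranking = [2, 0, 1]
is_valid = validate_ranking(3, example_ranking)
```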
Message if interested.
Client
Spent: $33,119.4 | Rating: 4.7 | Verified