
Data Annotator - Agents and Tool Calling Expertise

Upwork | CA | Not specified | Intermediate
Machine Learning, Python, Image Annotation, Image Segmentation, Machine Learning Model, Video Annotation, Data Annotation, Video Editing & Production, Generative AI, Deep Learning, LLM Prompt Engineering, AI Agent Development, n8n, Automation, API
Background:
1. Must be familiar with tool-calling LLMs, including tool schemas and the responses an environment is expected to return for a given action.
2. Must hold a CS degree and have some experience using and building with agents.
3. Must be at least 18 years of age.

Task (two separate tasks):

1. Given a conversation (100 conversations in total) in the format State: [state] Action: [action], and a predicted output of the action on the current state (Next State: [predicted next state]), judge the predicted next state on three criteria, each scored from 1 to 5:

training_utility (would this prediction be useful for training a model?):
5 = Excellent: correct outcome type, correct schema, realistic values; ideal training signal
4 = Good: minor value errors but correct structure and outcome type
3 = Mediocre: correct outcome type but wrong schema or implausible values
2 = Poor: wrong outcome type (success vs. error) or severely hallucinated structure
1 = Useless: completely wrong, nonsensical, or would teach wrong behaviors

realism (does the predicted response look like a real API response for this domain?):
5 = Indistinguishable from a real API response
4 = Mostly realistic with minor inconsistencies (e.g. slightly wrong field names but still valid)
3 = Partially realistic but with notable hallucinations (invented fields that do not look realistic or do not fit the tool response)
2 = Implausible schema or values for this domain
1 = Completely unrealistic (random or nonsensical data)

outcome_correctness (does the model correctly predict the outcome of the action?):
5 = Exactly correct outcome type and key facts
4 = Correct outcome type with minor factual errors
3 = Correct outcome type but significantly wrong facts
2 = Wrong outcome type (e.g. predicted success but expected error)
1 = Completely incorrect

2. Preference understanding: Given a set of 100 comparative tool outputs, rank which output you would most likely see in a real trajectory for the given task.
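For applicants who want a concrete picture of the first task, the sketch below shows one way a single judgment could be recorded and checked in Python. It is only an illustration: the listing does not specify a submission format, so the `Judgment` class, the `conversation_id` field, and the `parse_example` helper are all hypothetical names, and the parser assumes the literal labels "Action:" and "Next State:" each appear once in the text.

```python
from dataclasses import dataclass

# The three criteria named in the listing; each is scored from 1 to 5.
RUBRIC_FIELDS = ("training_utility", "realism", "outcome_correctness")


@dataclass(frozen=True)
class Judgment:
    """One annotator judgment (hypothetical record layout, not from the listing)."""
    conversation_id: int  # assumed identifier for one of the 100 conversations
    training_utility: int
    realism: int
    outcome_correctness: int

    def __post_init__(self):
        # Reject anything that is not a whole-number score in the 1..5 range.
        for name in RUBRIC_FIELDS:
            score = getattr(self, name)
            is_int = isinstance(score, int) and not isinstance(score, bool)
            if not is_int or not 1 <= score <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {score!r}")


def parse_example(text: str) -> dict:
    """Split 'State: ... Action: ... Next State: ...' into its three parts.

    Assumes the labels 'Action:' and 'Next State:' each occur once.
    """
    state_part, rest = text.split("Action:", 1)
    action_part, next_state = rest.split("Next State:", 1)
    return {
        "state": state_part.replace("State:", "", 1).strip(),
        "action": action_part.strip(),
        "next_state": next_state.strip(),
    }
```

A judgment such as `Judgment(conversation_id=1, training_utility=4, realism=5, outcome_correctness=4)` would pass validation, while any score outside 1 to 5 raises a `ValueError`.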
Message if interested.

Client

Spent: $33,119.4 | Rating: 4.7 | Verified