Fara-7B: An efficient agentic small language model for computer use
Pushing the frontiers of computer-use agents with an open-weight, ultra-compact model, optimized for real-world web tasks

In 2024, Microsoft introduced small language models (SLMs) to customers, starting with the release of Phi models on Microsoft Foundry, as well as deploying Phi Silica on Copilot+ PCs powered by Windows 11. Today, we are pleased to announce Fara-7B, our first agentic SLM designed specifically for computer use.
Unlike traditional chat models that generate text-based responses, Computer Use Agent (CUA) models like Fara-7B use computer interfaces, such as a mouse and keyboard, to complete tasks on behalf of users. With only 7 billion parameters, Fara-7B achieves state-of-the-art performance within its size class and is competitive with larger, more resource-intensive agentic systems that depend on prompting multiple large models. Fara-7B's small size now makes it possible to run CUA models directly on devices. This results in reduced latency and improved privacy, as user data remains local.
Fara-7B is an experimental release, designed to invite hands-on exploration and feedback from the community. Users can build and test agentic experiences beyond pure research—automating everyday web tasks like filling out forms, searching for information, booking travel, or managing accounts. We recommend running Fara-7B in a sandboxed environment, monitoring its execution, and avoiding sensitive data or high-risk domains. Responsible use is essential as the model continues to evolve.
Fara-7B operates by visually perceiving a webpage and taking actions like scrolling, typing, and clicking on directly predicted coordinates. It does not rely on separate models to parse the screen, nor on any additional information like accessibility trees, and thus uses the same modalities as humans to interact with the computer. To train Fara-7B, we developed a novel synthetic data generation pipeline for multi-step web tasks, building on our prior work (AgentInstruct). This pipeline draws from real web pages and tasks sourced from human users.
Fara-7B exhibits strong performance compared to existing models across a diverse set of benchmarks, including both established benchmarks and new evaluations we are releasing that cover useful task segments underrepresented in common benchmarks, such as finding job postings and comparing prices across retailers. While Fara-7B demonstrates strong benchmark results, even against much larger models, it shares many of their limitations, including challenges with accuracy on more complex tasks, mistakes in following instructions, and susceptibility to hallucinations. These are active areas of research, and we're committed to ongoing improvements as we learn from real-world use.
Fara-7B is now available on Microsoft Foundry and Hugging Face under an MIT license and is integrated with Magentic-UI, a research prototype from Microsoft Research AI Frontiers. We are also sharing a quantized, silicon-optimized version of Fara-7B that can be installed and run on Copilot+ PCs powered by Windows 11 for turnkey experimentation: the community can simply download the pre-optimized model and run it in their environment.
By making Fara-7B open-weight, we aim to lower the barrier to experimenting with and improving CUA technology for automating routine web tasks, such as searching for information, shopping, and booking reservations.

Developing Fara-7B
CUA multi-agent synthetic data generation
A key bottleneck for building CUA models is a lack of large-scale, high-quality computer interaction data. Collecting such data with human annotators is prohibitively expensive, as a single CUA task can involve dozens of steps, each of which needs to be annotated. Our data generation pipeline (Figure 2) avoids manual annotation and instead relies on scalable synthetic data sourced from publicly available websites and custom task prompts. We build this pipeline on top of the Magentic-One framework, and it involves three main stages:

Task Proposal. We generate a broad set of synthetic tasks that mirror common user activities on the web. To ensure coverage and diversity, tasks are "seeded" by a web index of public URLs classified into various categories, e.g., shopping, travel, restaurants. This enables task generation targeting a particular skill, like "book 2 tickets to see the Downton Abbey Grand Finale at AMC Union Square, NYC" from a URL classified as "movies". As another strategy, we devised a way to generate tasks from randomly sampled URLs. Each task starts with a general prompt and is iteratively refined as an LLM agent explores the website and gathers more information about it. We are releasing a held-out subset of these tasks as a benchmark ("WebTailBench"), described in the Evaluation section below.
Task Solving. Once synthetic tasks are generated, a multi-agent system built on Magentic-One attempts to complete them, producing demonstrations for supervised fine-tuning. The multi-agent system uses an Orchestrator agent to create a plan and direct a WebSurfer agent, which takes browser actions and reports results. The Orchestrator monitors progress, updating the plan as needed, and can end tasks or engage a UserSimulator agent if user input is required, allowing for multi-turn completion. Each task and its corresponding sequence of observations, actions, and agent thoughts forms a "trajectory".
Trajectory Verification. Before using any tasks for training, three verifier agents evaluate whether a task was "successful": the Alignment Verifier checks whether the trajectory of actions matches the task's intent; the Rubric Verifier defines completion criteria and scores the trajectory against them; and the Multimodal Verifier reviews screenshots and responses to confirm visual evidence supports successful completion. Trajectories failing these standards are removed.
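To make the filtering stage concrete, here is a minimal sketch of how such a verification gate might be composed. The `Step` and `Trajectory` records and the verifier callables are hypothetical stand-ins for the pipeline's internal formats and LLM-based verifier agents, which we have not published.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical schemas for illustration; the pipeline's real formats are not public.
@dataclass
class Step:
    screenshot: bytes  # browser screenshot observed at this step
    thought: str       # the agent's reasoning at this step
    action: str        # e.g., "click(412, 180)"

@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""

Verifier = Callable[[Trajectory], bool]

def passes_verification(traj: Trajectory,
                        alignment: Verifier,
                        rubric: Verifier,
                        multimodal: Verifier) -> bool:
    """A trajectory enters the training set only if all three verifiers accept it."""
    return (
        alignment(traj)       # do the actions match the task's intent?
        and rubric(traj)      # does it satisfy generated completion criteria?
        and multimodal(traj)  # do screenshots visually confirm success?
    )
```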
We ultimately train this version of Fara-7B on a dataset of 145,000 trajectories consisting of 1 million steps, covering diverse websites, task types, and difficulty levels. Additionally, we include training data for several auxiliary tasks, including grounding for accurate UI element localization, captioning, and visual question answering.
Training Fara-7B
Using a single computer-use model is easier than running a multi-agent system, particularly when it comes to deployment. We therefore distill the complexities of our multi-agent solving system into a single model that can execute tasks. Fara-7B is a proof of concept that small models can effectively learn from complex, elaborate multi-agent systems.
As shown in Figure 3, Fara-7B is trained to execute user tasks by perceiving only browser window screenshots (without relying on accessibility trees), and predicting single-step actions. For each step, the context used to make its prediction contains all user messages, the complete action history, and the latest three screenshots.
In its prediction, Fara-7B outputs a reasoning message ("thinking" about the next action) followed by a tool call. The available tools include standard Playwright mouse and keyboard actions, such as click(x, y) and type(), and browser-specific macro-actions like web_search() and visit_url().
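As an illustration of this output format, here is a small, hypothetical parser that splits one generation into its reasoning text and its trailing tool call. The serialization and the regex are assumptions for illustration; the model's actual format is defined by its chat template.

```python
import re

# Matches a trailing tool call such as "click(642, 88)" or "web_search(query)".
TOOL_CALL = re.compile(r"(?P<name>\w+)\((?P<args>.*)\)\s*$")

def parse_step(model_output: str):
    """Split one generation into (reasoning, tool_name, raw_args)."""
    reasoning, _, call = model_output.rpartition("\n")
    m = TOOL_CALL.match(call.strip())
    if m is None:
        raise ValueError(f"no tool call found in: {call!r}")
    return reasoning.strip(), m.group("name"), m.group("args")

# Example: one "observe-think-act" step ending in a Playwright-style click.
reasoning, name, args = parse_step(
    "The search box is at the top of the page.\nclick(642, 88)"
)
assert name == "click" and args == "642, 88"
```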
Fara-7B uses Qwen2.5-VL-7B as its base model due to its strong performance on grounding tasks and its ability to support long contexts (up to 128K tokens). We linearize the solving pipeline's trajectories into a sequence of "observe-think-act" steps suitable for training with a supervised fine-tuning loss. We did not use reinforcement learning to achieve the results we report below.
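A rough sketch of that linearization, reusing the hypothetical `Trajectory` records from the verification sketch above: each step becomes one training example whose target is the thought and action at that step, conditioned on the task, the full action history, and the latest three screenshots.

```python
def linearize(traj):
    """Turn one verified trajectory into per-step SFT examples (illustrative only)."""
    examples = []
    for i, step in enumerate(traj.steps):
        examples.append({
            "inputs": {
                "task": traj.task,
                # Complete action history up to (but not including) this step.
                "action_history": [s.action for s in traj.steps[:i]],
                # Only the three most recent screenshots, including the current one.
                "screenshots": [s.screenshot for s in traj.steps[max(0, i - 2): i + 1]],
            },
            # The supervised target: this step's reasoning and tool call.
            "target": {"thought": step.thought, "action": step.action},
        })
    return examples
```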

Evaluations
We evaluate Fara-7B and comparable baselines on canonical public benchmarks including WebVoyager, Online-Mind2Web, and DeepShop, as well as a new benchmark we developed named WebTailBench, which focuses on 11 real-world task types underrepresented or missing in existing benchmarks, such as booking movie/event tickets, making restaurant reservations, comparing prices across retailers, applying for jobs, finding real estate, and more complex multi-step tasks.
Evaluation of web agents can be tricky because the web is constantly changing and many websites block detected bots, which is why we developed a test harness that relies on Browserbase to standardize how browser sessions are managed. In Table 1 below, we report task success rate (%) as defined by each benchmark's official LLM-as-judge evaluator; WebTailBench success is computed using the same Trajectory Verification pipeline that filtered our training data. We find that Fara-7B is state-of-the-art, outperforming native computer-use agents like UI-TARS-1.5-7B and much larger models like GPT-4o prompted to act as a computer-use agent with Set-of-Marks (SoM Agent).
| Category | Model | WebVoyager | Online-Mind2Web | DeepShop | WebTailBench |
|---|---|---|---|---|---|
| SoM Agents | SoMAgent (GPT-4o) | 65.1 | 34.6 | 16.0 | 30.0 |
| SoM Agents | GLM-4.1V-9B-Thinking | 66.8 | 33.9 | 32.0 | 22.4 |
| Computer Use Models | OpenAI computer-use-preview | 70.9 | 42.9 | 24.7 | 25.7 |
| Computer Use Models | UI-TARS-1.5-7B | 66.4 | 31.3 | 11.6 | 19.5 |
| Computer Use Models | Fara-7B | 73.5 | 34.1 | 26.2 | 38.4 |

Table 1. Task success rate (%) on WebVoyager, Online-Mind2Web, DeepShop, and WebTailBench.
In Figure 1, we expand on the WebVoyager results by giving each model up to three chances to complete a task and reporting "pass@K". On the x-axis, we consider the cost of running each model if one were to pay market rates for input/output tokens consumed. Fara-7B establishes a new Pareto frontier, showing that on-device computer-use agents are approaching the capabilities of frontier models.
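For clarity, pass@K here counts a task as solved if any of its first K attempts succeeds. A small sketch of the computation, assuming per-task lists of attempt outcomes:

```python
def pass_at_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Fraction of tasks solved within the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)

# Toy example: three tasks, up to three attempts each.
outcomes = [[False, True, True], [True], [False, False, False]]
print(pass_at_k(outcomes, 1))  # 1/3: only the second task succeeds on attempt 1
print(pass_at_k(outcomes, 3))  # 2/3: the first task recovers on a retry
```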
We partnered with a trusted external group, Browserbase, to independently evaluate Fara-7B using human annotators. The model achieved 62% on WebVoyager (see the detailed report on the Browserbase blog). These results were generated in the same environment with identical settings and human verification of each task, making them directly comparable. Note that Browserbase's standard WebVoyager scores do not use retries when environment errors occur; the results referenced here include retries and should not be compared directly to the non-retry scores. Going forward, we are collaborating with Browserbase to host WebTailBench human evaluations to help the community build reliable and reproducible assessments for computer-use agents.
Safety
Agents capable of operating computers present challenges distinct from chat-only models, including new avenues for user misuse, model misbehavior, unintended consequences of actions, and external risks like prompt injections or online scams. CUAs take actions with real-world consequences, so robust safety measures are essential to their responsible deployment. Transparency and user control sit at the core of Fara-7B's design. Although we have incorporated several safety measures, Fara-7B remains a research preview, and we continue to advance our approach to safety for computer-use agents, an active area of work across the entire AI community.
Fara-7B processes browser screenshots, user task instructions, and a history of actions taken during each session, and collects only what is necessary to complete the user's requested task. No additional site data, such as accessibility trees or external scaffolding, is accessed; Fara-7B interacts with the computer in the same way a human would, relying solely on what is visible on the screen.
All actions taken by the agent are logged and auditable, allowing users to review and monitor every step. For added safety, Fara-7B is intended to run in sandboxed environments, giving users full oversight and the ability to intervene or halt actions at any time. These safeguards ensure that privacy, transparency, and user control remain at the core of every interaction.
To address misuse, we trained Fara-7B on a mixture of public safety data and internally generated tasks that it ought to refuse under Microsoft's Responsible AI Policy. We evaluated Fara-7B's ability to refuse harmful tasks on WebTailBench-Refusals, which consists of 111 red-teaming tasks; the model shows a high refusal rate of 82%. The model also underwent Microsoft's rigorous red-teaming process, where we focused on the model rejecting harmful and risky tasks, covering harmful content, jailbreak attempts, ungrounded responses, and prompt injections. For further details, check out our technical report.
To mitigate the risk of Fara-7B taking unintended actions, all of Fara-7B's training data enforces both recognizing and stopping at "Critical Points" when executing a task. A Critical Point (see the Operator System Card) is any situation that requires the user's personal data or consent before engaging in a transaction or irreversible action, like sending an email. Upon reaching a Critical Point, Fara-7B should respond by informing the user that it cannot proceed without their consent.
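While Fara-7B is trained to stop on its own at such points, the same pattern can also be enforced at the scaffolding level as an extra safeguard. The sketch below is purely illustrative; the action names and the consent hook are hypothetical, not part of the model's API.

```python
# Actions treated as irreversible in this hypothetical harness.
IRREVERSIBLE = {"submit_order", "send_email", "confirm_payment"}

def is_critical(action_name: str, needs_personal_data: bool) -> bool:
    """A Critical Point: the action is irreversible or needs the user's data."""
    return action_name in IRREVERSIBLE or needs_personal_data

def execute_or_defer(action_name, needs_personal_data, execute, ask_user):
    """Run the action only if it is safe; otherwise pause and ask for consent."""
    if is_critical(action_name, needs_personal_data):
        return ask_user(f"About to perform '{action_name}'. Please confirm or take over.")
    return execute(action_name)
```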
For guidance on how to use our model safely, and the security considerations to be mindful of when using our model, please refer to our model card.
How to use
Fara-7B is available on Microsoft Foundry and Hugging Face. We are also releasing an implementation of Fara-7B in Magentic-UI, so that users can try it in a contained environment through the inference code provided. Additionally, users can download the model for Copilot+ PCs powered by Windows 11 from the AI Toolkit in VS Code and run it entirely on-device, taking advantage of NPU hardware acceleration.
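For local experimentation, a loading sketch along the lines below should work with Hugging Face transformers. The repo ID, the Auto classes, and the message format are assumptions based on the Qwen2.5-VL base model; consult the model card for the exact inference code and prompt format.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "microsoft/Fara-7B"  # assumed repo ID; see the model card on Hugging Face
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One user turn: the current browser screenshot plus the task instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Find the cheapest direct flight from SEA to JFK next Friday."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("screenshot.png")], return_tensors="pt").to(model.device)

# The generation should contain a reasoning message followed by a tool call.
out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0], skip_special_tokens=True))
```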
Looking forward
Our current release is an experimental CUA model that achieves state-of-the-art results for its size, using supervised fine-tuning alone. We believe even stronger CUA models capable of running on-device are possible through improved multimodal base models and through reinforcement learning in live and sandboxed environments. These early days are about learning from the community and driving real-world experimentation to shape what comes next. If you'd like to join us and help shape the future of SLMs, please apply for open roles.
Acknowledgements:
We thank Gustavo de Rosa, Adam Fourney, Michael Harrison, Rafah Hosn, Neel Joshi, Ece Kamar, John Langford, Maya Murad, Sidhartha Sen, Pratyusha Sharma, and Lili Wu for their valuable help, insightful discussions, and continued support throughout this work.
We also thank Pashmina Cameron, Karthik Vijayan, Vicente Rivera, Chris Dern, Sayan Shaw, Sunghoon Choi, Andrey Rybalchenko, and Vivek Pradeep for their efforts in making the model available on Copilot+ PCs through the AI Toolkit.