▲Optimizing Tool Selection for LLM Workflows with Differentiable Programmingviksit.substack.com

117 points by viksit 1 days ago | 36 comments

viksit 1 days ago [-]

I was experimenting with how local, learnable routers can reduce token overhead, and lower costs, and decided to publish a post about it. The main goal is to delegate tool calls via a PyTorch based learner and examples of how to integrate this into a DSPy pipeline. Feedback welcome!

rybosome 20 hours ago [-]

Thanks for the informative and inspiring post! This is definitely cool, and I can imagine very useful.

However I do want to mention that the “recommended” flow these days isn’t to separate out a tool request in the way you have. Eg instead of asking an LLM to route a tool, extracting that, running the tool, passing output back to the LLM, etc. - you simply pass the tool definitions, prompt, structural output expectations, and let the LLM (and your caller library) manage the tool use loop.

That’s how these modern LLMs are trained in post-training, and so I suspect it’s likely you’ll get different (and potentially worse?) results in trying to subvert this with a small, local model.

It comes with all the downsides you mentioned to let the LLM do this, but is also more likely to be in-distribution, and it’s easier to compose multiple tool calls.

Anyway, thanks for sharing! I’d love to see evals on a task where it compares the result when an LLM is involved in tool selection versus when it is handed tool output only - if I’m wrong about quality degradation then there’s a lot to like about your local tool routing.

viksit 17 hours ago [-]

great point, appreciate the comment. totally agree with your framing, though i think there’s still a gap in how tool use is handled today.

quick note: it doesn’t have to be an rnn. i’ve got a follow-up example coming that uses a transformer-style ToolController with self attention, more expressive routing, etc.

but here’s the thing — when you rely on few-shot bootstrapping the LLM, you never end up updating the model's priors. even after 100k tool calls, you’re still stuck in the same polluted context window and its all stateless.

this gets worse fast with more than 3–4 tool calls, especially when there’s branching logic (e.g., if api1 > 5, go left, else right).

what this approach offers is: backprop through tool calls. you can tune prompts and update priors across the full workflow, end to end. trying to develop this intuition a bit more, and would love feedback.

thanks for the suggestion on the eval — will post that comparison soon.

rybosome 6 hours ago [-]

That’s cool, I’d love to see the advanced ToolController when it’s available!

Great points about not updating priors. I also thought about it a bit more and realized that there’s a way you can largely mitigate the out-of-distribution inference requests after local tool selection, if you wanted to.

The tool use loop in an inference framework builds up history of each interaction and sends that along with each subsequent request. You could create “synthetic history”, where you send the LLM history containing the prompt, your local tool selection masquerading as though the LLM generated it, and the tool response. This would be in-distribution but still rely on your local tool routing.

If this works well enough, then I think your approach is very powerful once you’ve decided on a task and set of tools and are able to commit to training on that. Definitely want to try this myself now.

Looking forward to seeing more! I take it your substack is the best place to follow along?

krohling 23 hours ago [-]

I think this is a creative approach. I wonder how the success rates for that little RNN compare to the success rates of the primary LLM, especially for complex queries or complex tool calls. At some point you have to scale that network up large enough to get better results. Eventually you've come back around and you might as well use an LLM. I think a similar approach with potentially better results (depends on the application) could be accomplished by using that same dataset to finetune a small language model. It'd be interesting to see some success rate comparisons.

viksit 17 hours ago [-]

thank you, appreciate the comment! thats a great point -- as I'm developing this intuition, I'm designing an eval which does a comparison of the openAI example there + tool call using a simple RNN + one that uses an encoder model. would love more feedback (on blog / X etc) when I post.

ctxc 22 hours ago [-]

Nit - code screenshots are a PITA to read on mobile!

viksit 17 hours ago [-]

ty for the feedback, yes, balancing bad code blocks on substack vs making it look pretty lol. I'll post code next time.

zitterbewegung 22 hours ago [-]

Can you put all of the code into a gist or something?

viksit 17 hours ago [-]

yes apologies, the code rendering in substack wasn't great, but I'll put this in a gist!

bGl2YW5j 21 hours ago [-]

Creative. You’ve given me some ideas. Thanks!

joe_the_user 23 hours ago [-]

My question is whether you have managed to make this work, perform a specific complex task, in some real world situation.

pcwelder 15 hours ago [-]

You've essentially just trained your own LM instead of using a pretrained large LM.

Speaking generically -- any place in your workflow you feel the task is not hard, you can use smaller and cheaper LM.

Smaller LMs come with accuracy reduction, particularly in tail cases. So in the real world this doesn't work out.

Also is gumble softmax usage intentional? It looks like a straightforward classifier that just needs regular softmax.

bigmadshoe 4 hours ago [-]

This is super cool!

From the article:

  Each LLM call incurs latency, cost, and token overhead. More subtly, it compounds context:
  every step includes not only the original query, but intermediate outputs and scratchpad logic from earlier prompts. 
  This creates a growing burden on both inference and model performance.

I was working with agents over a year ago before the common workflows had really been set in stone. At that time we were heavily doctoring the context to give a very streamlined representation of what had occurred during a given run to the LLM. Is this not standard practice?

Garlef 23 hours ago [-]

Is selection really the issue?

You'd still need to figure out what payload to give to the tool based on your context.

But I guess depending on your business case it might be worth it. It's not something I'd do from the beginning, though.

phanimahesh 18 hours ago [-]

This is a bigger problem than it looks like at first glance. For isecases where llm + tool calls make more sense compared to say llm assisted codegen, figuring out the tool arguments is nontrivial. Where it is relatively easy I think codegen is a better option wrt amortised running costs

viksit 17 hours ago [-]

this is a great point, ty.

in my mind the biggest difference is llms that are invoked during a workflow, and llms that are invoked when _creating_ code (codegen).

for the former, tools could be well defined till they are small in number, but at some point, the system needs to examine a library of tools, understand how to call it and integrate it, and at its peak, even create new tools to talk to systems not already present in that library (codegen).

viksit 17 hours ago [-]

it’s not just about selection. say you’ve got 100k tool calls — in the current hosted llm setup, you don’t actually learn anything new about your data to improve future tool accuracy.

this gets worse when you’re chaining 3–4+ tools. context gets noisy, priors stay frozen and there's prompt soup..

my intuition here is: you can learn the tool routing and the llm prompts before and after the call. (can always swap out the rnn for a more expressive encoder model and backprop through the whole thing).

super useful when you’re building complex workflows -- it gives you a way to learn the full pipeline, not just guess and hope.

bGl2YW5j 21 hours ago [-]

I don’t think the problem is “how to optimise tool selection for the LLM”. I think the real problem is using an LLM to do tool selection at all. This is control flow and I believe should be handled with hardcoded rules and/separation of concerns.

If LLMs could handle determinism better, I’d say having a single chat-based entrypoint into a plethora of services makes sense. But as they stand, it doesn’t make sense. Simpler control flow and constraining the number and type of downstream services that sit behind a single interface I think is the way to go.

That said, I agree we should keep the ambition to move to the one size fits all approach.

viksit 17 hours ago [-]

+1 on the control flow point.

I think of an llm as a differentiable interpreter of a program. it should do decision making (tool selection, argument routing), branching logic via weights + gates etc.

so as a differentiable state machine:

- each state == a stage in your workflow

- transitions == tool calls

- encode this as a rnn or graph

and learn transitions and actions via supervision or RL

shusaku 22 hours ago [-]

Yes I think once you’ve got an LLM in the loop it’s easy to be lazy and just use it to make all decisions. But it’s good to step back and think if there is a cheaper way, I mean even some hardcoded logic can do the job.

j45 21 hours ago [-]

Very true. Making a non-deterministic system make determinations is also harder for it to do.

Right tool for the step to the right extent.

Feels like soft skills for software development.

crazylogger 15 hours ago [-]

I can see this makes sense for simple { user_query -> search -> llm_answer } usage, where tool use is only a means to retrieve background info.

For complex real-world agent flows though, tool use is often the only thing that the LLM is expected to do. Like in a coding agent:

```

User: Develop a program to ...

Agent: Bash("touch main.py") > 0, ""

Agent: Edit("main.py", initial_patch) > 0, ""

Agent: Bash("python main.py") > 1, "SyntaxError: ..."

Agent: Edit("main.py", fix_patch) > 0, ""

Agent: Bash("python main.py") > 0, "OK"

Agent: FINISH

```

Here, tool selection (+ writing the arguments) is actually the whole job. It's also easy to see that if you omit even one of the tool use records in the middle, the agent wouldn't work at all.

jaksa 16 hours ago [-]

Figuring out which tool to call is trivial, passing the correct arguments is the difficult and error prone part. Smarter agents would even use a varying amount of tool calls until they get the desired response.

viksit 16 hours ago [-]

(author here, put the code in a gist here for reference)

https://gist.github.com/viksit/c67d1d960c4cec89488290496defb...

nphard85 17 hours ago [-]

Very interesting. How does this approach work for complex agentic workflows where the LLM is expected to orchestrate across multiple tools (such as when using MCP)? Or is this mainly for simple cases like the ones presented in the blog post?

viksit 17 hours ago [-]

+1 thanks for mentioning MCP!

re: different tools (apis vs mcps). in my mind, there should be no real difference at what kind of tools is called at this moment since I model this as a softmax over a label set of tools.

that said, an idea I want to investigate is whether tools can live in a learned embedding space, where selection isn’t a softmax over discrete labels but a nearest-neighbor or attention mechanism over continuous vectors.

this is the intuition I'm developing as we speak and in some of my other comments on this thread (see differentiable state machine comment).

lgas 17 hours ago [-]

The work described appears as if it would handle a complex set of multiple tools just fine, but you do train the controller on a specific tool set, so you would presumably need to train (or at least something like "fine tune") a controller for each toolset you wanted to use.

viksit 16 hours ago [-]

for sure, there's a way here where I think we ought to be able to learn multiple tool calls and prompts together with real world data. investigating that next.

digitcatphd 14 hours ago [-]

this is smart, but I think NVIDIA's paper on fine tuning small language models presents a sightly more efficient approach

apsears 19 hours ago [-]

I have been thinking a lot about tool selection lately, and something that I keep repeating to myself is: "the LLM has intuition, but I have data".

I guess that applies when you're not able to fine-tune the LLM you're using. Presumably Anthropic has a lot of data too.

viksit 17 hours ago [-]

+1 - the biggest issue is not being able to fine tune the llm to learn the specifics of how to make a tool call better over time, which an approach like this can bring to the table.

tomlue 22 hours ago [-]

you could also propagate loss into the tools themselves.

viksit 17 hours ago [-]

+1 - you can propagate the loss for a workflow across prompts + tools, which would make it much better to do resilient workflows. or "agents" as everyone calls them now ;)

arthurcolle 18 hours ago [-]

huge research area

viksit 17 hours ago [-]

this is my goal :) appreciate the feedback.

Loading comments...

viksit 1 days ago [-]

rybosome 20 hours ago [-]

Thanks for the informative and inspiring post! This is definitely cool, and I can imagine very useful.

It comes with all the downsides you mentioned to let the LLM do this, but is also more likely to be in-distribution, and it’s easier to compose multiple tool calls.

viksit 17 hours ago [-]

great point, appreciate the comment. totally agree with your framing, though i think there’s still a gap in how tool use is handled today.

quick note: it doesn’t have to be an rnn. i’ve got a follow-up example coming that uses a transformer-style ToolController with self attention, more expressive routing, etc.

this gets worse fast with more than 3–4 tool calls, especially when there’s branching logic (e.g., if api1 > 5, go left, else right).

thanks for the suggestion on the eval — will post that comparison soon.

rybosome 6 hours ago [-]

That’s cool, I’d love to see the advanced ToolController when it’s available!

Looking forward to seeing more! I take it your substack is the best place to follow along?

krohling 23 hours ago [-]

viksit 17 hours ago [-]

ctxc 22 hours ago [-]

Nit - code screenshots are a PITA to read on mobile!

viksit 17 hours ago [-]

ty for the feedback, yes, balancing bad code blocks on substack vs making it look pretty lol. I'll post code next time.

zitterbewegung 22 hours ago [-]

Can you put all of the code into a gist or something?

viksit 17 hours ago [-]

yes apologies, the code rendering in substack wasn't great, but I'll put this in a gist!

bGl2YW5j 21 hours ago [-]

Creative. You’ve given me some ideas. Thanks!

joe_the_user 23 hours ago [-]

My question is whether you have managed to make this work, perform a specific complex task, in some real world situation.

pcwelder 15 hours ago [-]

You've essentially just trained your own LM instead of using a pretrained large LM.

Speaking generically -- any place in your workflow you feel the task is not hard, you can use smaller and cheaper LM.

Smaller LMs come with accuracy reduction, particularly in tail cases. So in the real world this doesn't work out.

Also is gumble softmax usage intentional? It looks like a straightforward classifier that just needs regular softmax.

bigmadshoe 4 hours ago [-]

This is super cool!

From the article:

  Each LLM call incurs latency, cost, and token overhead. More subtly, it compounds context:
  every step includes not only the original query, but intermediate outputs and scratchpad logic from earlier prompts. 
  This creates a growing burden on both inference and model performance.

Garlef 23 hours ago [-]

Is selection really the issue?

You'd still need to figure out what payload to give to the tool based on your context.

But I guess depending on your business case it might be worth it. It's not something I'd do from the beginning, though.

phanimahesh 18 hours ago [-]

viksit 17 hours ago [-]

this is a great point, ty.

in my mind the biggest difference is llms that are invoked during a workflow, and llms that are invoked when _creating_ code (codegen).

viksit 17 hours ago [-]

it’s not just about selection. say you’ve got 100k tool calls — in the current hosted llm setup, you don’t actually learn anything new about your data to improve future tool accuracy.

this gets worse when you’re chaining 3–4+ tools. context gets noisy, priors stay frozen and there's prompt soup..

super useful when you’re building complex workflows -- it gives you a way to learn the full pipeline, not just guess and hope.

bGl2YW5j 21 hours ago [-]

That said, I agree we should keep the ambition to move to the one size fits all approach.

viksit 17 hours ago [-]

+1 on the control flow point.

I think of an llm as a differentiable interpreter of a program. it should do decision making (tool selection, argument routing), branching logic via weights + gates etc.

so as a differentiable state machine:

- each state == a stage in your workflow

- transitions == tool calls

- encode this as a rnn or graph

and learn transitions and actions via supervision or RL

shusaku 22 hours ago [-]

j45 21 hours ago [-]

Very true. Making a non-deterministic system make determinations is also harder for it to do.

Right tool for the step to the right extent.

Feels like soft skills for software development.

crazylogger 15 hours ago [-]

I can see this makes sense for simple { user_query -> search -> llm_answer } usage, where tool use is only a means to retrieve background info.

For complex real-world agent flows though, tool use is often the only thing that the LLM is expected to do. Like in a coding agent:

```

User: Develop a program to ...

Agent: Bash("touch main.py") > 0, ""

Agent: Edit("main.py", initial_patch) > 0, ""

Agent: Bash("python main.py") > 1, "SyntaxError: ..."

Agent: Edit("main.py", fix_patch) > 0, ""

Agent: Bash("python main.py") > 0, "OK"

Agent: FINISH

```

Here, tool selection (+ writing the arguments) is actually the whole job. It's also easy to see that if you omit even one of the tool use records in the middle, the agent wouldn't work at all.

jaksa 16 hours ago [-]

viksit 16 hours ago [-]

(author here, put the code in a gist here for reference)

https://gist.github.com/viksit/c67d1d960c4cec89488290496defb...

nphard85 17 hours ago [-]

viksit 17 hours ago [-]

+1 thanks for mentioning MCP!

re: different tools (apis vs mcps). in my mind, there should be no real difference at what kind of tools is called at this moment since I model this as a softmax over a label set of tools.

this is the intuition I'm developing as we speak and in some of my other comments on this thread (see differentiable state machine comment).

lgas 17 hours ago [-]

viksit 16 hours ago [-]

for sure, there's a way here where I think we ought to be able to learn multiple tool calls and prompts together with real world data. investigating that next.

digitcatphd 14 hours ago [-]

this is smart, but I think NVIDIA's paper on fine tuning small language models presents a sightly more efficient approach

apsears 19 hours ago [-]

I have been thinking a lot about tool selection lately, and something that I keep repeating to myself is: "the LLM has intuition, but I have data".

I guess that applies when you're not able to fine-tune the LLM you're using. Presumably Anthropic has a lot of data too.

viksit 17 hours ago [-]

+1 - the biggest issue is not being able to fine tune the llm to learn the specifics of how to make a tool call better over time, which an approach like this can bring to the table.

tomlue 22 hours ago [-]

you could also propagate loss into the tools themselves.

viksit 17 hours ago [-]

+1 - you can propagate the loss for a workflow across prompts + tools, which would make it much better to do resilient workflows. or "agents" as everyone calls them now ;)

arthurcolle 18 hours ago [-]

huge research area

viksit 17 hours ago [-]

this is my goal :) appreciate the feedback.