How I program with LLMs
Generative models can be powerfully useful—if you’re willing to adapt your approach.
This piece was originally published on David Crawshaw's blog and is reproduced here with permission.

This article is a summary of my personal experiences with using generative models while programming over the past year. It has not been a passive process. I have intentionally sought ways to use LLMs while programming to learn about them. The result has been that I now regularly use LLMs while working, and I consider their benefits net-positive on my productivity. (My attempts to go back to programming without them are unpleasant.)

Along the way, I have found oft-repeated steps that can be automated, and a few of us are working on building those into a tool specifically for Go programming: sketch.dev. It's very early, but so far, the experience has been positive.

I am typically curious about new technology. It took very little experimentation with LLMs for me to want to see if I could extract practical value. There is an allure to a technology that can (at least some of the time) craft sophisticated responses to challenging questions. It is even more exciting to watch a computer attempt to write a piece of a program as requested and make solid progress.

The only technological shift I have experienced that feels similar happened in 1995, when we first configured my LAN with a usable default route. I replaced the shared computer in the other room running Trumpet Winsock with a machine that could route a dialup connection, and all at once, I had the Internet on tap. Having the Internet all the time was astonishing and felt like the future, probably far more to me in that moment than to many who had been on the Internet longer at universities, because I was immediately dropped into high Internet technology: web browsers, JPEGs, and millions of people. Access to a powerful LLM feels like that.

So I followed this curiosity to see if a tool that can generate something mostly not wrong most of the time could be a net benefit in my daily work. The answer appears to be "yes"—generative models are useful for me when I program. It has not been easy to get to this point. My underlying fascination with the new technology is the only way I have managed to figure it out, so I am sympathetic when other engineers claim LLMs are "useless." But as I have been asked more than once how I can possibly use them effectively, this post is my attempt to describe what I have found so far.

There are three ways I use LLMs in my day-to-day programming: autocomplete, search, and chat-driven programming.

As this is about the practice of programming, it has been a fundamentally qualitative process that is hard to write about with quantitative rigor. The closest I will get to data is to say that it appears from my records that for every two hours of programming I do now, I accept more than 10 autocomplete suggestions, use an LLM for a search-like task once, and program in a chat session once.

The rest of this is about extracting value from chat-driven programming.

Let me try to motivate the skeptical. A lot of the value I get out of chat-driven programming is that I reach a point in the day when I know what needs to be written, I can describe it, but I don't have the energy to create a new file, start typing, and then start looking up the libraries I need. (I'm an early-morning person, so this is usually any time after 11 am for me, though it can also be any time I context-switch into a different language, framework, etc.) LLMs perform that service for me in programming.
They give me a first draft with some good ideas and several of the dependencies I need—and often some mistakes. Often, I find fixing those mistakes is a lot easier than starting from scratch.

This means chat-based programming may not be for you. I am doing a particular kind of programming, product development, which could be roughly described as trying to bring programs to a user through a robust interface. That means I am building a lot, throwing away a lot, and bouncing around between environments. Some days, I mostly write TypeScript, some days mostly Go. I spent a week in a C++ codebase last month exploring an idea and just had an opportunity to learn the HTTP server-sent events format. I am all over the place, constantly forgetting and relearning. If you spend more time proving that your optimization of a cryptographic algorithm is not vulnerable to timing attacks than you do writing the code, I don't think any of my observations here will be useful to you.

Give an LLM a specific objective and all the background material it needs so it can craft a well-contained code review packet, and expect it to adjust as you question it. There are two major elements to this:

The ideal task for an LLM is one where it needs to use a lot of common libraries (more than a human can remember, so it is doing a lot of small-scale research for you), where it is working to an interface you designed or producing a small interface you can quickly verify as sensible, and where it can write readable tests. Sometimes this means choosing the library for it if you want something obscure (though with open source code, LLMs are quite good at this).

You always need to pass an LLM's code through a compiler and run the tests before spending time reading it. They all sometimes produce code that doesn't compile (always making errors I find surprisingly human—every time I see one, I think, "There but for the grace of God go I"). The better LLMs are very good at recovering from their mistakes; often, all they need is for you to paste the compiler error or test failure into the chat, and they fix the code.

There are vague trade-offs we make every day around the cost of writing, the cost of reading, and the cost of refactoring code. Let's take Go package boundaries as an example. The standard library has a package, net/http, that contains some fundamental types for dealing with wire-format encoding, MIME types, etc. It also contains an HTTP client and an HTTP server. Should it be one package or several? Reasonable people can disagree! So much so that I do not know if there is a correct answer today. What we have works; after 15 years of use, it is still not clear to me that some other package arrangement would work better.

The advantages of a larger package include centralized documentation for callers, easier initial writing, easier refactoring, and easier sharing of helper code without devising robust interfaces for it (which often involves pulling the fundamental types of a package out into yet another leaf package filled with types). The disadvantages include the package being harder to read because many different things are going on (try reading the net/http client implementation without tripping up and finding yourself in the server code for a few minutes), or it being harder to use because there is too much going on in it.
For example, I have a codebase that uses a C library in some fundamental types, but parts of the codebase need to ship in a binary that is widely distributed to many platforms and does not technically need the C library. So the codebase has more packages than you might expect, isolating the use of the C library to avoid cgo in the multi-platform binary.

There are no right answers here. Instead, we are trading off different types of work that an engineer will have to do, both upfront and ongoing. LLMs influence those trade-offs.

Let me work an example to combine a few of the discussed ideas:

Write a reservoir sampler for the quartiles of floats.

First off is package structure. Were I doing this before LLMs, I would have chosen to have some sort of streamstat package that contained several algorithms, maybe one per file. This does not seem to be a unique opinion; here is an open source quantile package following that model. Now, I want just this one algorithm in its own package. Other variants or related algorithms can have their own packages.

Next up, what do we get from an LLM? The first pass is not bad. That prompt, with some details about wanting it in Go, got me quartile_sampler.go:

The core interface is good, too:

Great! There are also tests.

An aside: this may be the place to stop. Sometimes I use LLM codegen as a form of specialized search. E.g., I'm curious about reservoir sampling but want to see how the algorithm would be applied under some surprising constraint—for example, time-windowed sampling. Instead of doing a literature search, I might amend my prompt for an implementation that tracks freshness. (I could also ask it to include references to the literature in the comments, which I could manually check to see if it's making things up or if there's some solid research to work from.)

I often spend 60 seconds reading some generated code, see an obvious trick I hadn't thought of, then throw it away and start over. Now I know the trick is possible. This is why it is so hard to attribute value generated by LLMs. Yes, sometimes it makes bad code, gets stuck in a rut, makes up something impossible (it hallucinated a part of the Monaco API I wish existed the other day), and wastes my time. It can also save me hours by pointing out something relevant I don't know.

Back to the code. Fascinatingly, the initial code produced didn't compile. In the middle of the Quartiles implementation, there was the line:

That is a fine line; sorted is a slice defined a few lines earlier. But the value is never used, so gopls (and the Go compiler, if you run go build) immediately says:

This is a very easy fix. If I paste the error back into the LLM, it will correct it. Though in this case, as I'm reading the code, it's quite clear to me that I can just delete the line myself, so I do.

Now the tests. I got what I expected. In quartile_sampler_test.go:

Exactly the sort of thing I would write! I would run some cases through another implementation to generate expected outputs and copy them into a test like this. But there are two issues with this.

The first is that the LLM did not run these numbers through another implementation. (To the best of my knowledge; when using a sophisticated LLM service, it is hard to say for sure what is happening behind the scenes.) It made them up, and LLMs have a reputation for being weak at arithmetic.
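The generated code itself is not reproduced in this article, so purely as an illustration, here is a rough sketch of the shape of the example: a reservoir-based quartile sampler and the kind of table-driven test under discussion. The package name, the New/Add/Quartiles API, the choice of Algorithm R for replacement, and the hand-computed expected values are assumptions made for this sketch, not the model's actual output.

```go
// Package quartile estimates the quartiles of a stream of float64s by
// keeping a fixed-size uniform random sample (a reservoir) of the stream.
package quartile

import (
	"math/rand"
	"sort"
)

// Sampler holds the reservoir and the bookkeeping needed to maintain it.
type Sampler struct {
	reservoir []float64
	size      int   // maximum reservoir size
	seen      int64 // total values offered so far
	rnd       *rand.Rand
}

// New returns a Sampler that retains at most size values.
func New(size int, seed int64) *Sampler {
	return &Sampler{
		reservoir: make([]float64, 0, size),
		size:      size,
		rnd:       rand.New(rand.NewSource(seed)),
	}
}

// Add offers one value from the stream, using the classic Algorithm R rule:
// keep the first size values, then replace a random element with
// probability size/seen.
func (s *Sampler) Add(v float64) {
	s.seen++
	if len(s.reservoir) < s.size {
		s.reservoir = append(s.reservoir, v)
		return
	}
	if j := s.rnd.Int63n(s.seen); j < int64(s.size) {
		s.reservoir[j] = v
	}
}

// Quartiles estimates the 25th, 50th, and 75th percentiles of the stream
// from the current reservoir contents.
func (s *Sampler) Quartiles() (q1, median, q3 float64) {
	if len(s.reservoir) == 0 {
		return 0, 0, 0
	}
	sorted := append([]float64(nil), s.reservoir...)
	sort.Float64s(sorted)
	return sorted[len(sorted)/4], sorted[len(sorted)/2], sorted[3*len(sorted)/4]
}
```

And in quartile_sampler_test.go, the sort of table-style test with hard-coded expected values described above (the expected numbers here were worked out by hand for this sketch):

```go
package quartile

import "testing"

// TestQuartiles feeds a known sequence into the sampler and checks the
// result against expected values computed by hand for this sketch.
func TestQuartiles(t *testing.T) {
	s := New(20, 1) // reservoir large enough to hold every value
	for i := 1; i <= 20; i++ {
		s.Add(float64(i))
	}
	q1, median, q3 := s.Quartiles()
	// With all 20 values retained, the sorted reservoir is 1..20 and the
	// sampler indexes positions 5, 10, and 15.
	if q1 != 6 || median != 11 || q3 != 16 {
		t.Errorf("Quartiles() = %v, %v, %v; want 6, 11, 16", q1, median, q3)
	}
}
```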
So this sort of test, while reasonable for a human to write because we base it on the output of another tool—or, if we are particularly old-school, do some arithmetic ourselves—is not a great test to take from an LLM.

The second issue is that we can do better. I am happy we now live in a time when programmers write their own tests, but we do not hold ourselves to the same standards with tests as we do with production code. That is a reasonable trade-off; there are only so many hours in the day. But what LLMs lack in arithmetical prowess, they make up for in enthusiasm.

Let's ask for an even better test:

In the tests, implement the simplest, most readable version of the standard code for quartiles over a fixed set of known values in a slice. Then pass the test cases through the standard code and the reservoir sampler and confirm they are within an epsilon of each other. Structure the comparison code such that it can be used in a fuzz test, too.

This got us some new test code:

The original test from above has been reworked to use checkQuartiles, and we have something new:

This is fun because it's wrong. My running gopls tool immediately says:

Pasting that error back into the LLM gets it to regenerate the fuzz test so that it is built around a func(t *testing.T, data []byte) function that uses math.Float64frombits to extract floats from the data slice. Interactions like this point us toward automating the feedback from tools; all it needed was the obvious error message to make solid progress toward something useful. I was not needed.

A quick survey of the last few weeks of my LLM chat history (which, as I mentioned earlier, is not a proper quantitative analysis by any measure) shows that more than 80 percent of the time there is a tooling error, the LLM can make useful progress without me adding any insight. About half the time, it can completely resolve the issue without me saying anything of note. I am just acting as the messenger.

There was a programming movement some 25 years ago focused on the principle "don't repeat yourself." As is so often the case with short, snappy principles taught to undergrads, it got taken too far. There is a lot of cost associated with abstracting out a piece of code so it can be reused; it requires creating intermediate abstractions that must be learned, and it requires adding features to the factored-out code to make it maximally useful to the maximum number of people, which means we depend on libraries filled with useless, distracting features.

The past 10–15 years have seen a far more tempered approach to writing code, with many programmers understanding that it's better to reimplement a concept if the cost of sharing the implementation is higher than the cost of implementing and maintaining separate code. It is far less common for me to write on a code review, "This isn't worth it, separate the implementations." (Which is fortunate, because people really don't want to hear things like that after they have done all the work.) Programmers are getting better at trade-offs.

What we have now is a world where the trade-offs have shifted. It is now easier to write more comprehensive tests. You can have the LLM write the fuzz test implementation you want but didn't have the hours to build properly.
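Continuing the sketch from earlier (and again as an illustration, not the generated code), the epsilon comparison and fuzz target that the prompt asks for might look roughly like this. The checkQuartiles name and the func(t *testing.T, data []byte) shape come from the discussion above; the reference implementation, the epsilon, and the byte-decoding details are assumptions.

```go
package quartile

import (
	"encoding/binary"
	"math"
	"sort"
	"testing"
)

// referenceQuartiles is the simplest, most readable version of the standard
// calculation: sort a copy of the values and index into it directly.
func referenceQuartiles(values []float64) (q1, median, q3 float64) {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	return sorted[len(sorted)/4], sorted[len(sorted)/2], sorted[3*len(sorted)/4]
}

// checkQuartiles runs values through both the reference code and the
// sampler and fails if any quartile differs by more than epsilon. It is
// written so it can be driven by table tests and by a fuzz target alike.
func checkQuartiles(t *testing.T, values []float64, epsilon float64) {
	t.Helper()
	if len(values) == 0 {
		return
	}
	s := New(len(values), 1) // reservoir large enough to hold everything
	for _, v := range values {
		s.Add(v)
	}
	gotQ1, gotMed, gotQ3 := s.Quartiles()
	wantQ1, wantMed, wantQ3 := referenceQuartiles(values)
	for _, c := range []struct {
		name      string
		got, want float64
	}{
		{"q1", gotQ1, wantQ1},
		{"median", gotMed, wantMed},
		{"q3", gotQ3, wantQ3},
	} {
		if math.Abs(c.got-c.want) > epsilon {
			t.Errorf("%s = %v, want %v (epsilon %v)", c.name, c.got, c.want, epsilon)
		}
	}
}

// FuzzQuartiles decodes the raw fuzz input into float64s, eight bytes at a
// time, and reuses checkQuartiles for the comparison.
func FuzzQuartiles(f *testing.F) {
	f.Fuzz(func(t *testing.T, data []byte) {
		var values []float64
		for len(data) >= 8 {
			v := math.Float64frombits(binary.LittleEndian.Uint64(data[:8]))
			if !math.IsNaN(v) && !math.IsInf(v, 0) {
				values = append(values, v)
			}
			data = data[8:]
		}
		checkQuartiles(t, values, 1e-9)
	})
}
```

Because the reservoir in checkQuartiles is sized to hold every value, the two paths should agree exactly in this sketch; testing a reservoir smaller than the input would require a much looser, statistical notion of "close."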
You can spend a lot more time writing tests to be readable because the LLM is not sitting there constantly thinking, "It would be better for the company if I went and picked another bug off the issue tracker than doing this." So the trade-off shifts in favor of having more specialized implementations.

The place where I expect this to be most visible is language-specific REST API wrappers. Every major company API comes with dozens of these (usually low-quality) wrappers, written by people who aren't actually using their implementations for a specific goal and are instead trying to capture every nook and cranny of an API in a large and complex interface. Even when it is done well, I have found it easier to go to the REST documentation (usually a set of curl commands) and implement a language wrapper for the 1 percent of the API I actually care about. It cuts down the amount of the API I need to learn upfront, and it cuts down how much future programmers (including me) reading the code need to understand.

For example, as part of my recent work on sketch.dev, I implemented a Gemini API wrapper in Go. Even though the official wrapper in Go has been carefully handcrafted by people who know the language well and clearly care, there is a lot to read to understand it. My simplistic initial wrapper was 200 lines of code total: one method, three types. Reading the entire implementation is 20 percent of the work of reading the documentation of the official package, and if you try to dig into its implementation, you will discover that it is a wrapper around another largely code-generated implementation with protos and gRPC and the works. All I want is to cURL and parse a JSON object.

There obviously comes a point in a project where Gemini is the foundation of the entire app, where nearly every feature is used, where building on gRPC aligns well with the telemetry system elsewhere in your organization, and where you should use the large official wrapper. But most of the time, it's so much more time-consuming to do so because we almost always want only some wafer-thin sliver of whatever API we need to use today. So custom clients, largely written by a GPU, are far more effective for getting work done.

So I foresee a world with far more specialized code, fewer generalized packages, and more readable tests. Reusable code will continue to thrive around small, robust interfaces and otherwise will be pulled apart into specialized code. Depending on how well this is done, it will lead to either better software or worse software. I would expect both, with a long-term trend toward better software by the metrics that matter.

As a programmer, my instinct is to make computers do work for me. It's a lot of work getting value out of LLMs—how can a computer do it?

I believe the key to solving a problem is not to overgeneralize. Solve a particular problem and then expand slowly. So instead of building a general-purpose UI for chat programming that is just as good at COBOL as it is at Haskell, we want to focus on one particular environment. The bulk of my programming is in Go, so what I want is easy to imagine for a Go programmer. A few of us have built an early prototype of this: sketch.dev.

The goal is not to be a "Web IDE" but rather to challenge the notion that chat-based programming even belongs in what is traditionally called an IDE. IDEs are collections of tools arranged for people. They are delicate environments where I know what is going on. I do not want an LLM spewing its first draft all over my current branch.
While an LLM is ultimately a developer tool, it is one that needs its own IDE to get the feedback it needs to operate effectively. Put another way, we didn't embed goimports into sketch for it to be used by humans, but to get Go code closer to compiling using automatic signals, so that the compiler can provide better error feedback to the LLM driving it. It might be better to think of sketch.dev as a "Go IDE for LLMs."

This is all very recent work with a lot left to do, e.g., git integration so we can load existing packages for editing and drop the results on a branch. We also need better test feedback and more console control. (If the answer is to run sed, run sed, be you the human or the LLM.) We are still exploring, but we're convinced that focusing an environment on a particular kind of programming will give us better results than a generalized tool.

David Crawshaw is a co-founder (and former CTO) of Tailscale, lives in the Bay Area, and is building sketch.dev. He has been programming for 30 years and is planning on another 30.
Source: https://arstechnica.com/ai/2025/01/how-i-program-with-llms/