Yes, we’re well past code completion. Cursor, Claude Code, etc. reason across entire codebases now. But this space moves fast enough that genuinely interesting ideas get buried before most people understand them. I wanted to slow down and actually learn this one before it fades into background noise.
Introduction
The code editors we use daily are usually desktop apps. Because of that, it never really crossed my mind to think about how features like code completion actually work under the hood. It just works: you accept the suggestion and move on.
That changed when I was using Modal’s notebook to work on an assignment for my CUDA course. Unlike a desktop editor, it runs in the browser.
I’ve been reverse engineering Modal’s UI, UX and architecture for a while now. I study everything they build and try to understand the decisions behind it, from a frontend engineer’s perspective. When I noticed the autocomplete suggestions working in a way that felt interesting, I did what I always do. I opened the Chrome DevTools network tab.
What I Found in the Network Tab
The completions API request payload looked roughly like this:
```json
{
  "cellId": "d674e9a7-03bd-4150-bc43-d53b1fe91951",
  "cellContext": [
    "# Welcome to Modal notebooks!\n\nWrite Python code and collaborate in real time..."
  ],
  "prefix": "%%writefile reductionSUM.cu\n#include <stdio.h>\n#include <iostream>\n#include <chrono>\n#include <cuda_runtime.h>\n\n#define BLOCK_SIZE 8\n\n__global__ void reductionKernel(int* buffer, int* globalResult, int N){\n\n __shared__ int sharedBuffer[BLOCK_SIZE];\n\n\n int threadIndexG = blockIdx.x * blockDim.x + threadIdx.x;\n int threadIndexL = threadIdx.x;\n\n if (threadIndexG < N) {\n ",
  "suffix": "\n } else {\n sharedBuffer[threadIndexL] = 0;\n }"
}
```
And the response:
```json
{
  "completion": "`sharedBuffer[threadIndexL] = buffer[threadIndexG];`"
}
```
Two fields: prefix for everything before the cursor, suffix for everything after. At first it seemed weird to me. Why send the suffix at all?
After some ChatGPT-ing I came across the term: Fill-in-the-Middle (FIM).
The FIM Pattern
Instead of training a model to predict P(middle | prefix), FIM models are trained on P(middle | prefix, suffix). During training, chunks of code are split into three parts and shuffled into the `<prefix>...<suffix>...<middle>` format so the model learns this task explicitly.
The standard next-token prediction approach, where you give the model code and it continues from the end, completely breaks down when you’re writing in the middle of a file. The model has no idea what comes after the cursor, so it might complete in a direction that conflicts with what’s already there.
FIM solves this by giving the model the full picture: what came before, what comes after, fill in what’s missing.
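For FIM-native models, this framing is baked into the tokenizer as special sentinel tokens. As a sketch, Code Llama's infilling format looks roughly like this (the sentinel spellings follow the Code Llama paper; other FIM models use different tokens, so always check the model card):

```typescript
// Sketch: building a prompt in the PSM (prefix-suffix-middle) infilling
// format used by FIM-trained models. Sentinels below follow Code Llama's
// spelling; DeepSeek Coder and StarCoder use different token names.
function buildFimPrompt(prefix: string, suffix: string): string {
  return `<PRE> ${prefix} <SUF>${suffix} <MID>`
}

// The model generates the missing middle after <MID> and stops at its
// end-of-infill token.
const fimPrompt = buildFimPrompt(
  "def add(a, b):\n    return ",
  "\n\nprint(add(1, 2))"
)
```

With a FIM-native model you would send this string to a raw completion endpoint instead of a chat endpoint; the demo below takes the chat route precisely because the model has no such sentinels.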
After reading more about it I realized it’s quite a popular protocol. Models like Codex, DeepSeek Coder, and Code Llama all support FIM natively. But I wanted to try this with a general-purpose chat model, so I built a small demo.
Building the Demo
The complete source code is here. I will walk through the core parts of the implementation.
The goal was to understand the API payload and protocol: how to form it correctly on every user input, and how to handle the response. The UI is just a full-width code editor; when a suggestion is ready, it floats up as a toast at the bottom, with Tab to accept and Esc to dismiss. The interesting parts are everything else.
1. The API Payload
When the user pauses typing, I split the editor content at the cursor position:
```typescript
const prefix = code.slice(0, cursorPosition)
const suffix = code.slice(cursorPosition)
```
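To make the split concrete, here is a hypothetical buffer with the cursor sitting right after `sum + `:

```typescript
// Concrete example of the split: the cursor sits right after "sum + ".
const code = "const total = items.reduce((sum, item) => sum + , 0)"
const cursorPosition = code.indexOf(", 0)") // where the user is typing

const prefix = code.slice(0, cursorPosition) // "…(sum, item) => sum + "
const suffix = code.slice(cursorPosition)    // ", 0)"
```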
Then I send both to the model as a chat completion:
```typescript
{
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: buildUserMessage(prefix, suffix) }
  ],
  max_tokens: 200,
  temperature: 0.2
}
```
Low temperature because this is code. You want deterministic, not creative.
2. Indicating the Cursor Position
The user message does two things. First, it gives the model the full file context with FIM tags. Then it zooms into the immediate context around the cursor. For large files, you want the model focused on the right spot, not reading noise from the top of the file.
```typescript
function buildUserMessage(prefix: string, suffix: string) {
  const beforeCursor = prefix.split('\n').slice(-2).join('\n')
  const afterCursor = suffix.split('\n').slice(0, 2).join('\n')

  return [
    `<prefix>${prefix}</prefix>[CURSOR]<suffix>${suffix}</suffix>`,
    `The [CURSOR] marker is where you must insert. Immediate context around [CURSOR]:`,
    `...${beforeCursor}[CURSOR]${afterCursor}...`,
  ].join('\n\n')
}
```
The [CURSOR] marker makes the insertion point explicit. Without it, the model can get confused about exactly where the prefix ends and the suffix begins.
3. The System Prompt
This is where most of the work is. Chat models are trained to be helpful. Without clear instructions, they will try to complete the entire file, repeat context, add comments, explain themselves. None of that is what you want here.
When writing the prompt I kept three things in mind:
- Define the role. Tell the model exactly what it is and what its job is
- Give a concrete example. It removes ambiguity faster than any amount of prose instructions
- Constrain the output format. Instead of markdown code fences (which models sometimes forget or wrap inconsistently), I used a custom XML tag. This is a technique Anthropic recommends in its prompting guides: structured tags let you extract a precise piece from the response, no fragile string parsing needed
```
You are a TypeScript code completion assistant.
The cursor position is marked with [CURSOR] in the user message.
Complete ONLY what belongs at that exact position.

Rules:
- Only complete what is at the very end of the prefix.
- The suffix already exists. Do NOT repeat or overlap with it.
- If nothing is missing at the cursor, return empty.
- No completion is often the best choice. Do not force a suggestion.
- Output ONLY the insertion text, no explanation, no markdown.

Example:
Input:
<prefix>const total = items.reduce((sum, item) => sum + </prefix>
<suffix>, 0)</suffix>
Output:
<complete>item.price</complete>

Respond with this format only: <complete>insertion text here</complete>
```
That rule, "no completion is often the best choice," matters a lot. Without it the model tries to suggest something on every pause, even when the cursor is between two tokens that are already complete. You end up with noise.
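Since the model is told to wrap its answer in a `<complete>` tag, extracting the insertion text is a single regex. A minimal sketch of what a `parseCompletion` helper could look like (the demo's actual implementation may differ):

```typescript
// Pull the insertion text out of the model's <complete>…</complete> tag.
// Anything outside the tag (explanations, markdown fences) is discarded;
// a missing or empty tag means "suggest nothing".
function parseCompletion(raw: string): string {
  const match = raw.match(/<complete>([\s\S]*?)<\/complete>/)
  return match ? match[1] : ""
}
```

The non-greedy `[\s\S]*?` keeps multi-line completions intact while stopping at the first closing tag.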
4. The Bits That Make It Production Ready
Two small things that make a big difference in practice.
Debounce the API call. Fire only after the user pauses. I used 300ms.
Abort in-flight requests. If the user types again before the previous request completes, cancel it. Otherwise stale suggestions from three keystrokes ago show up and overwrite fresher ones.
Here is how both fit together. Instead of globals, I wrapped everything in a factory function so the timer and controller are private to each fetcher instance:
```typescript
export function createCompletionFetcher(apiKey: string) {
  const groq = new Groq({ apiKey, dangerouslyAllowBrowser: true, maxRetries: 0 })

  let debounceTimer: ReturnType<typeof setTimeout> | null = null
  let abortController: AbortController | null = null

  return (prefix: string, suffix: string, onResult: (completion: string) => void) => {
    clearTimeout(debounceTimer ?? undefined)
    abortController?.abort()

    debounceTimer = setTimeout(async () => {
      abortController = new AbortController()
      try {
        const res = await groq.chat.completions.create(
          { model: MODEL, messages: [...], max_tokens: 200, temperature: 0.2 },
          { signal: abortController.signal }
        )
        const completion = parseCompletion(res.choices[0]?.message?.content ?? '')
        if (completion) onResult(completion)
      } catch (err) {
        if ((err as Error).name !== 'AbortError') console.error(err)
      }
    }, DEBOUNCE_MS)
  }
}
```
createCompletionFetcher is called once, and the returned function is what you use on every keystroke. The timer and controller live inside the closure, so there are no globals and no state leaking out.
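Once a suggestion is accepted with Tab, applying it is just a string splice at the cursor. A sketch (`acceptCompletion` is a hypothetical helper for illustration, not part of the demo's API):

```typescript
// Splice the accepted completion in at the cursor and advance the cursor
// past the inserted text, so typing continues after the suggestion.
function acceptCompletion(code: string, cursorPosition: number, completion: string) {
  return {
    code: code.slice(0, cursorPosition) + completion + code.slice(cursorPosition),
    cursorPosition: cursorPosition + completion.length,
  }
}
```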
Conclusion
The interesting thing about FIM is not the model, it is the protocol. What makes code completion feel magical is how you frame the problem: split the document at the cursor, package both halves as context, constrain the output format so the insertion is always precise.
You can apply this with any capable LLM. The model does not need a special FIM fine-tune. It just needs a clear contract: here is what came before, here is what comes after, give me only what is missing.
That’s it for this one, see you in the next one :)
Dive Deeper
If you want to go down this rabbit hole, here are some interesting links:

- How GitHub Copilot is getting better at understanding your code. They reported a 10% relative boost in suggestion acceptance just from adopting FIM.
- Codestral by Mistral. Mistral's 22B model built specifically for code, with first-class FIM support across 80+ languages. A good look at what a production FIM model actually looks like.
- Hard Problems in AI Coding Tools (Cursor, 2024). Just hit Tab and the cursor moves across files, editor, terminal. No friction, stupid low latency. One of the most impressive ideas I've come across in this space.