
But wait, there's more! Hacking LLMs' thinking time

Can we make Large Language Models (LLMs) think longer and better? A recent paper explores “test-time scaling” - a fascinating technique that allows us to control an LLM’s reasoning process by manipulating its “thinking time”. This capability was first demonstrated by OpenAI’s o1 model, though OpenAI did not reveal the methodology behind it.


I discovered this paper through a fascinating video by Antirez (Salvatore Sanfilippo - hacker, creator of Redis and much more), where he explores the researchers’ findings and demonstrates a practical implementation of this theory.


As the paper explains:

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts.

Understanding Test-time Scaling

Test-time scaling is an innovative approach to improve LLM performance without modifying the model itself.

  • Traditional Scaling: Models are improved by using larger architectures, more training data, and longer training times.

  • Test-Time Scaling: Performance is enhanced during inference by manipulating how the model generates responses, without changing its parameters.

Reasoning process manipulation

The paper illustrates a clever - or, as Antirez called it, diabolical - methodology called “budget forcing”. This technique manipulates the model’s reasoning process by either forcefully terminating the thinking process or lengthening it by appending “Wait” tokens, forcing the model to think longer and potentially improve its answers.
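
To make the idea more concrete, here is a minimal, framework-agnostic sketch of what budget forcing could look like inside a decoding loop (sample_next_token is a hypothetical helper and the token budgets are made up; the paper's actual implementation differs):

def budget_forced_decode(model, tokens, end_think_id, wait_ids,
                         min_thinking, max_thinking):
    # Sketch of budget forcing: suppress an early end-of-thinking token by
    # appending "Wait", and forcefully close the reasoning once over budget.
    thinking = 0
    while True:
        nxt = sample_next_token(model, tokens)  # hypothetical helper
        if nxt == end_think_id and thinking < min_thinking:
            # The model wants to stop too early: nudge it to keep reasoning
            tokens.extend(wait_ids)
            thinking += len(wait_ids)
            continue
        if thinking >= max_thinking:
            # Thinking budget exhausted: force the reasoning section to end
            tokens.append(end_think_id)
            return tokens
        tokens.append(nxt)
        thinking += 1
        if nxt == end_think_id:
            # The model closed its reasoning within the budget
            return tokens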


That’s smart, isn’t it?

Antirez’s PoC demonstration

I was impressed by Antirez’s demonstration of testing the paper’s concepts using a small model on his local MacBook. He leveraged MLX, an array framework optimized for machine learning on Apple Silicon, to implement the experiments.


With a specific prompt and 5 lines of code, he could force the model to continue reasoning endlessly 🤯.

  • the prompt instructs the LLM to use a chain-of-thought style for the response, with the reasoning wrapped in <think> ... </think> tags. Example:
What's the capital of Italy?
Before replying, do a chain of thoughts style thinking between <think> </think> tags.
Open and close the tags ONLY once. At the end: reply.
  • then, Antirez modified the inference step to replace the end token </think> with Wait, extending the reasoning as described in the paper.

Hands on 👨‍💻

This idea of directly controlling the chain of thought was very intriguing and I wanted to explore it myself.

To get started, I installed the mlx-lm package to get familiar with it:

pip install mlx-lm

After installing it, you can use it directly from the command line to generate text (the first time it will download the mlx-community/Llama-3.2-3B-Instruct-4bit model, which is the one used by default). Side note: there are already a bunch of models (Llama, Mistral, DeepSeek, Qwen, …) that have been converted to run on Apple Silicon with MLX; check out the official docs on GitHub and on Hugging Face.

❯ mlx_lm.generate --prompt "what's the capital of Italy?"
Fetching 6 files: 100%|█████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 75800.67it/s]
==========
The capital of Italy is Rome (Italian: Roma).
==========
Prompt: 42 tokens, 317.802 tokens-per-sec
Generation: 12 tokens, 69.677 tokens-per-sec
Peak memory: 1.859 GB
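
Besides the CLI, mlx-lm also exposes a small Python API. Here is a minimal sketch (the exact keyword arguments may vary between mlx-lm versions):

from mlx_lm import load, generate

# Downloads the model from the Hugging Face Hub on first use
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

response = generate(
    model,
    tokenizer,
    prompt="What's the capital of Italy?",
    max_tokens=100,
    verbose=True,
)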

Great, now that I was familiar with the tool, it was time to follow Antirez’s steps and start hacking.

  • First, I wanted to download the mlx-lm repo locally in order to change the mlx_lm.generate function.
git clone git@github.com:ml-explore/mlx-examples.git

(yes, the mlx_lm CLI is implemented in the ml-explore/mlx-examples repo)

  • Time to look for the generate utility
❯ fd --type f generate
flux/static/generated-mlx.png
llava/generate.py
llms/gguf_llm/generate.py
llms/mlx_lm/examples/generate_response.py
llms/mlx_lm/examples/pipeline_generate.py
llms/mlx_lm/generate.py
llms/tests/test_generate.py
musicgen/generate.py
stable_diffusion/generated-mlx.png

Bingo: llms/mlx_lm/generate.py!


To modify the inference process, we need to understand how the code works at a high level. The main chain of function calls starts from generate.py and flows through several functions in utils.py:

generate.py (main)
  └── utils.py
      ├── generate()
      │   └── stream_generate()
      │       └── generate_step()
      │           └── _step()  # This is where inference happens

The key function we need to modify is _step() within generate_step(). This is where the actual token inference takes place and where we’ll implement our token replacement strategy.


Let’s examine how the _step function works. At its core, it processes input tokens and produces probabilities for the next token in the sequence. Here’s a simplified look at its implementation:

def _step(y):
    with mx.stream(generation_stream):
        # Generate logits from the model using the input tokens
        logits = model(y[None], cache=prompt_cache)
        # Get logits for last token only
        logits = logits[:, -1, :]

        # Apply any logits processors, if specified. I believe none are applied in this case
        # (the default value is `None` in the `generate_step` function)
        if logits_processors:
            nonlocal tokens
            tokens = mx.concat([tokens, y]) if tokens is not None else y

            for processor in logits_processors:
                logits = processor(tokens, logits)

        # KV-cache quantization optimization, we can ignore it here
        quantize_cache_fn(prompt_cache)

        # Convert logits to log probabilities
        logprobs = logits - mx.logsumexp(logits, keepdims=True)
        # Sample the next token using the provided sampler, which I believe is `argmax` by default
        y = sampler(logprobs)

        # Return the next token `y` and the log probabilities
        # `logprobs` for each token in the vocabulary
        return y, logprobs.squeeze(0)
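
As a side note on the sampler: with the default settings (temperature 0), sampling boils down to a greedy argmax over the log-probabilities, conceptually something like this (a simplified sketch, not the actual mlx-lm code):

import mlx.core as mx

def greedy_sampler(logprobs: mx.array) -> mx.array:
    # Greedy decoding: always pick the most likely next token
    return mx.argmax(logprobs, axis=-1)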

With this understanding of the _step function, we can now implement our test-time scaling modification.

  • Copy/paste what Antirez did and voilà, the die is cast 🎉

I modified the _step function to intercept the generation of the “end of thinking” token </think> and replace it with the Wait token, which would prompt the model to continue reasoning.

    def _step(y):
        with mx.stream(generation_stream):
            logits = model(y[None], cache=prompt_cache)
            logits = logits[:, -1, :]

            if logits_processors:
                nonlocal tokens
                tokens = mx.concat([tokens, y]) if tokens is not None else y

                for processor in logits_processors:
                    logits = processor(tokens, logits)

            quantize_cache_fn(prompt_cache)

            logprobs = logits - mx.logsumexp(logits, keepdims=True)
            y = sampler(logprobs)

            ### HACK START
            #
            if True:
                # Converts a token ID to its text representation
                token_text = tokenizer._tokenizer.decode(int(y[0]))
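                # "</think>" is usually split into multiple tokens ("</", "think", ">"),
                # so checking for "</" catches the start of the closing tag (an assumption about this tokenizer)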
                if "</" in token_text:
                    # Converts text back to token ID
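                    # (index 1 skips the leading special token, e.g. BOS, that encode() typically prepends; an assumption about this tokenizer)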
                    alt_token = tokenizer._tokenizer.encode(" Wait, ")[1]
                    y[0] = alt_token
            #
            ### HACK END

            return y, logprobs.squeeze(0)

The if True condition is just a convenient way to enable/disable the </think> token replacement, useful for running some tests.
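
If you don't want to edit the code every time, the flag could also come from an environment variable, something along these lines (a tiny sketch; FORCE_REASONING is a name I made up):

import os

# Hypothetical toggle: run with FORCE_REASONING=1 to enable the hack
FORCE_REASONING = os.environ.get("FORCE_REASONING", "0") == "1"

# ...then, inside _step(), use `if FORCE_REASONING:` instead of `if True:`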


Now, it’s time to play 🥁.

Results

I experimented by running the modified code, and indeed this trick forces the LLM to generate longer chains of thought, sometimes with better results.

Methodology

  • Run the prompt with no chain-of-thought
  • Run the prompt with the chain-of-thought
  • Run the prompt with the chain-of-thought and hack to extend the reasoning

To simplify the testing, I created a small script to run the custom implementation, with the hack disabled or enabled depending on the case being tested. The script simply executes the main function from the generate.py module loaded from the local repository 🤓.

cat llms/run_mlx_lm_generate.sh
#!/Users/nickgnd/.local/share/mise/installs/python/3.12.4/bin/python3.12
# -*- coding: utf-8 -*-
import sys
from mlx_lm.generate import main
if __name__ == '__main__':
    sys.exit(main())

Let’s see some examples.


A classic - how many R letters are there in the word strawberry?

I couldn’t resist testing Llama (mlx-community/Llama-3.2-3B-Instruct-4bit) with this classic question.

Hi. how many r letters are there in the word strawberry?

Regular version ❌

❯ mlx_lm.generate --prompt "Hi. how many r letters are there in the word strawberry?"
==========
Let's count the R's in the word "strawberry".

Here's the correct count:

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are 2 R's in the word "strawberry".
==========
Prompt: 48 tokens, 346.599 tokens-per-sec
Generation: 75 tokens, 63.538 tokens-per-sec
Peak memory: 1.862 GB

Chain of thought - NO “Wait” replacement ❌

❯ ./run_mlx_lm_generate.sh --prompt "Hi. how many r letters are there in the word strawberry? Before replying, do a chain of thoughts style thinking between <think> </think> tags. Open and close the tags ONLY once. At the end: reply." -m 500
==========
<think>
To determine the number of R's in the word "strawberry", I should first recall the definition of the letter R.
Next, I should think about the word "strawberry" and its composition of letters.
The word "strawberry" consists of 10 letters: S-T-R-A-W-B-E-R-R-Y.
Now, I need to identify the number of R's in the word.
Upon examining the word, I notice that there are two R's.
Therefore, the answer is that there are 2 R's in the word "strawberry".</think>

Reply: There are 2 R's in the word "strawberry".
==========
Prompt: 82 tokens, 482.281 tokens-per-sec
Generation: 141 tokens, 63.015 tokens-per-sec
Peak memory: 1.880 GB

Chain of thought - WITH “Wait” replacement ✅

❯ ./run_mlx_lm_generate.sh --prompt "Hi. how many r letters are there in the word strawberry? Before replying, do a chain of thoughts style thinking between <think> </think> tags. Open and close the tags ONLY once. At the end: reply." -m 500
==========
<think>
To determine the number of R's in the word "strawberry", I should first recall the definition of the letter R.
Next, I should think about the word "strawberry" and its composition of letters.
The word "strawberry" consists of 10 letters: S-T-R-A-W-B-E-R-R-Y.
Now, I need to identify the number of R's in the word.
Upon examining the word, I notice that there are two R's.
Therefore, the answer is that there are 2 R's in the word "strawberry".
Wait, I made a mistake, there are actually 3 R's in the word "strawberry".
==========
Prompt: 82 tokens, 432.280 tokens-per-sec
Generation: 144 tokens, 59.999 tokens-per-sec
Peak memory: 1.880 GB

Wow! As you can see, forcing the extra reasoning makes the LLM acknowledge the error and eventually provide the correct answer 🤩.


Another example - Where’s the book if I flip the table?

Here’s the full prompt:

Hi. If you put a book on a table and then turn the table upside down, where is the book likely to be?

Regular version 🤔

❯ mlx_lm.generate --prompt "Hi. If you put a book on a table and then turn the table upside down, where is the book likely to be??." -m 500 --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
==========
If you turn the table upside down, the book would be on the floor beneath the table.
==========
Prompt: 33 tokens, 126.428 tokens-per-sec
Generation: 21 tokens, 32.750 tokens-per-sec
Peak memory: 4.126 GB

Chain of thought - NO “Wait” replacement 🤔

❯ ./run_mlx_lm_generate.sh --prompt "Hi. If you put a book on a table and then turn the table upside down, where is the book likely to be?? Before replying, do a chain of thoughts style thinking between <think> </think> tags. Open and close the tags ONLY once. At the end: reply." -m 500 --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
==========
<think>
   If you put a book on a table, it is now resting on the table's surface.
When you turn the table upside down, the surface that was previously the bottom of the table is now facing upwards.
Since the book was on the original surface, it would now be on the new bottom side of the table.
</think>

The book is likely to be underneath the table.
==========
Prompt: 67 tokens, 188.413 tokens-per-sec
Generation: 89 tokens, 31.684 tokens-per-sec
Peak memory: 4.142 GB

Chain of thought - WITH “Wait” replacement 🤯

❯ ./run_mlx_lm_generate.sh --prompt "Hi. If you put a book on a table and then turn the table upside down, where is the book likely to be?? Before replying, do a chain of thoughts style thinking between <think> </think> tags. Open and close the tags ONLY once. At the end: reply." -m 500 --model "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
==========
<think>
   If you put a book on a table, it is now resting on the table's surface. When you turn the table upside down, the surface that was previously the bottom of the table is now facing upwards. Since the book was on the original surface, it would now be on the new bottom side of the table.
 Wait, but the question asks where the book is likely to be. In reality, the book would fall off due to gravity, but since we're assuming the table is turned carefully without the book falling, it would still be on the new bottom side of the table.
 Wait, but the question asks where the book is likely to be. In a practical sense, it's not likely to be anywhere, as it would fall off. However, if we're considering the theoretical scenario where the book doesn't fall, it would be on the new bottom side of the table.
 Wait, but the question asks where the book is likely to be. In a practical sense, it's not likely to be anywhere, as it would fall off. But if we're considering a hypothetical scenario where the book doesn't fall, it would be on the new bottom side of the table.
 Wait, but the question asks where the book is likely to be. In a practical sense, it's not likely to be anywhere, as it would fall off. But if we're considering a hypothetical scenario where the book doesn't fall, it would be on the new bottom side of the table. However, in such a scenario, the book would be inaccessible and not in a place where it could be found.
 Wait, but the question asks where the book is likely to be. In a practical sense, it's not likely to be anywhere, as it would fall off. But if we're considering a hypothetical scenario where the book doesn't fall, it would be on the new bottom side of the table. However, in such a scenario, the book would be inaccessible and not in a place where it could be found. But if we're considering a hypothetical scenario where the book doesn't fall and someone could access the new bottom side of the table, it would be there.
 Wait, but the question asks where the book is likely to be. In a practical sense,
==========
Prompt: 67 tokens, 166.342 tokens-per-sec
Generation: 500 tokens, 30.267 tokens-per-sec
Peak memory: 4.190 GB

With the chain of thought and the extra reasoning, the LLM realized that the book would fall off due to gravity, even if the output gets a bit crazy 🤪. The generation is cut off at 500 tokens (see the -m 500 option) and, compared to Llama, Mistral keeps generating endlessly, although the LLM tried to close the reasoning a few times (grep for Wait in the text).
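
A possible follow-up, closer to the budget forcing described in the paper, would be to cap how many times the reasoning gets extended by counting the replacements and letting the model close the tag once the budget is spent. A rough, untested sketch on top of the hack above (wait_count and max_waits are names I made up):

# In generate_step(), next to the existing `tokens` variable:
wait_count = 0
max_waits = 3  # hypothetical budget: how many times to force extra reasoning

# ...and inside _step(), in place of the `if True:` block:
nonlocal wait_count
if wait_count < max_waits:
    token_text = tokenizer._tokenizer.decode(int(y[0]))
    if "</" in token_text:
        y[0] = tokenizer._tokenizer.encode(" Wait, ")[1]
        wait_count += 1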

Closing notes

That was fun. Many thanks to Antirez for sharing the paper and explaining it. On top of that, I had never played with MLX before. It’s such an interesting project, and I didn’t expect it would be so quick to modify the inference process of an LLM. Curiosity is contagious, happy hacking!