"What if I put them to talk with each other..."

I feel like this is an idea that most people have at some point when playing with LLMs. Turns out, I’m no different.

So, I decided to create a quick and simple script to test this out. The script is available in this GitHub repository, and it starts a conversation between two LLMs based on an initial prompt.

The script uses the Ollama Python library to communicate with Ollama. This post focuses a lot on the context window, which is a crucial part of the LLMs’ chat features.

Ready, set, talk

Let’s first see how the script is used. The script takes two arguments, --model1 and --model2, and is executed like this:

$ python3 chatbot.py --model1 dolphin-mistral --model2 dolphin-mistral

First, it will ask for an initial prompt that kickstarts the conversation.

initial prompt> Tell why Finland is better than Sweden. Make the sentence short, casual and funny.

Now the initial prompt is given to BOT1 (--model1 <model>) and here’s its first answer:

BOT1: Finland’s sauna parties will always be more epic than Sweden’s meatball ones!

That was a great answer. I should have made it mention the rule about short sentences from the initial prompt. Now this answer was given to BOT2 as a prompt, and BOT2 is giving quite a lecture. I’ll cut it a bit, as the conversation itself isn’t super relevant for seeing how this works.

BOT2: While both Finland and Sweden have their unique cultural events, it is interesting to explore the reasons why Finland’s sauna parties are considered more epic by some people. One could argue that the Finnish sauna has a long history dating back to ancient times and has become an integral part of their culture and social lives. The sauna itself is seen as a symbol of purity, health, and relaxation, providing a space for individuals to bond with each other while sweating out their stresses in a warm and welcoming environment. In addition, the traditional Finnish sauna ritual often involves jumping into an icy lake or rolling in snow after spending time in the heat, creating a unique and invigorating experience that many find thrilling. ..SNIP..

BOT1 keeps it short and simple.

BOT1: I’m not one to start a fight between countries, but if I had to choose, I’d go with Finland’s sauna parties over Sweden’s meatball gatherings. Just think about it - who doesn’t love a good sweat session followed by a refreshing plunge into an icy lake?

I started other chats with similar initial prompts, and this “sauna parties” vs. “meatball parties” angle was quite unique. In many conversations, BOT1 started to talk about reindeer-related topics and hallucinated quite a bit around that particular subject.

How it works

The script contains a class Bot, from which both “bots” are instantiated. I’ll show some snippets from the code, which might still change, but the latest version can be found here. Next, I’ll focus on how the “memory” of the bots works.

“Memory” aka Context Window

The chat between two bots would be much less interesting if they couldn’t remember what they have been discussing. According to the example provided by the Ollama Python library, the correct way to give the chat a “memory” is to provide the previous messages as a list of dictionaries like this:

messages = [
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
  {
    'role': 'assistant',
    'content': "The sky is blue because of the way the Earth's atmosphere scatters sunlight.",
  },
]

When that is given to a bot with a new prompt (the latest message in messages), it will remember that you have discussed the color of the sky. For example, the messages variable would look like this if the next question is “Why is the grass green?”:

messages = [
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
  {
    'role': 'assistant',
    'content': "The sky is blue because of the way the Earth's atmosphere scatters sunlight.",
  },
  {
    'role': 'user',
    'content': 'Why is the grass green?',
  },
]

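To make this concrete, here’s a minimal sketch (assuming the ollama Python package is installed and an Ollama server is running with the model pulled) of sending that messages list to the library’s chat() function, so the model answers the latest question with the earlier exchange as context:

from ollama import chat

# Send the accumulated history; the model "remembers" the earlier
# question about the sky only because it is included in this list.
response = chat(model="llama3.2:1b", messages=messages)
print(response.message.content)

# Append the answer so the next prompt has it as context too.
messages.append({'role': 'assistant', 'content': response.message.content})
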
All these messages will use part of the model’s context window in the current chat. Here’s an explanation of what the context window actually is:

The context window (or “context length”) of a large language model (LLM) is the amount of text, in tokens, that the model can consider or “remember” at any one time. A larger context window enables an AI model to process longer inputs and incorporate a greater amount of information into each output.

– https://www.ibm.com/think/topics/context-window

Testing the Context Window

Both of the bots have a list variable messages that holds the “memory” of the Bot instance. I was wondering how I could test this “memory”, so I decided to add a debugging option with the following features:

  • I can prompt both bots with questions during the chat using a “debug prompt”
  • Debug prompts and answers are not added to messages, so they won’t affect the actual chat.

I added command-line arguments --debug_model1 and --debug_model2, which allow me to debug BOT1, BOT2, or both. The main section of the script has the logic to start debugging prompts when --debug_model1 or --debug_model2 is specified.

if __name__ == "__main__":
    args = parse_args()
    # Initialize bots
    bot1 = Bot(args.model1)
    bot2 = Bot(args.model2)
    # Initial prompt for bot1
    start = input("initial prompt> ")
    r = bot1.talk(start)
    print(f"{bcolors.BLUE}BOT1: {r}{bcolors.END}")
    while True:
        try:
            r = bot2.talk(r)
            print(f"{bcolors.RED}BOT2: {r}{bcolors.END}")
            if args.debug_model2:
                debug(bot2, "bot2")
            r = bot1.talk(r)
            print(f"{bcolors.BLUE}BOT1: {r}{bcolors.END}")
            if args.debug_model1:
                debug(bot1, "bot1")
        except KeyboardInterrupt:
            interrupt(bot1, bot2)

The debug() function looks like this:

def debug(bot, name):
    """
    Debug questions until an empty prompt is given
    """
    bot.debug = True
    while True:
        debug = input(f"debug prompt ({name})> ")
        if debug == "":
            break
        r = bot.talk(debug)
        print(f"<debug>\n{r}\n</debug>")
    bot.debug = False

Setting bot.debug = True puts the Bot instance into debug mode. Then you can prompt the bot with questions without affecting the actual chat. The debugging session is stopped when an empty prompt ("") is given.

These debug prompts see all the messages from the actual chat, but previous debug prompts are not included; only the current debug prompt is added on top of the history. Here’s the code for the class Bot, which shows how the debug mode works. Check the mentioned GitHub repo for the latest code.

import re

from ollama import chat, ChatResponse


class Bot:

    def __init__(self, model="deepseek-r1:1.5b"):
        self.messages = []
        self.model = model
        self.debug = False
        # Used when debug is set to True
        self.debug_messages = []

    def _add_message(self, message, role):
        if self.debug:
            self.debug_messages = self.messages.copy()
            self.debug_messages.append({'role': role, 'content': message})
        else:
            self.messages.append({'role': role, 'content': message})

    def _send_messages(self):
        if self.debug:
            response: ChatResponse = chat(model=self.model, messages=self.debug_messages)
            # Always reset debug messages
            self.debug_messages = []
        else:
            response: ChatResponse = chat(model=self.model, messages=self.messages)
            self._add_message(response.message.content, 'assistant')
        return response.message.content

    def talk(self, input_data):
        self._add_message(input_data, 'user')
        output_data = self._send_messages()
        output_data = re.sub(r'<think>.*?</think>', '', output_data, flags=re.DOTALL)
        return output_data

There’s a separate debug_messages variable. It always holds a fresh copy of the actual chat history plus the latest debug prompt, but none of the previous debug prompts. This way it’s possible to ask questions about the chat history without affecting the actual chat when it continues.

Basically with each debug prompt:

  1. self.messages is copied to self.debug_messages
  2. The debug prompt is appended to self.debug_messages
  3. The response is generated based on the contents of self.debug_messages

After debugging, the chat continues in the main section as if nothing happened and self.messages was never touched.
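
Here’s a tiny illustration of that copy-then-discard behavior, using the Bot class above (assuming an Ollama server is running with the model pulled; the prompts are just placeholders):

bot = Bot(model="llama3.2:1b")

bot.talk("Why is the sky blue?")            # stored in bot.messages
history_length = len(bot.messages)          # user + assistant = 2 entries

bot.debug = True
bot.talk("What have we discussed so far?")  # answered from a copy of the history
bot.debug = False

# The real chat history is untouched by the debug prompt.
assert len(bot.messages) == history_length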

Debugging examples

Here’s an example chat with debugging enabled for bot1.

$ python3 chatbot.py --model1 llama3.2:1b --model2 llama3.2:1b --debug_model1

initial prompt> Tell me about food in one sentence. Mention the restriction of one sentence in your response.

BOT1: Food is a universal language that unites people across cultures, borders, and time, ..SNIP..
BOT2: You’ve provided a concise and accurate overview of the importance of food in unifying people across cultures and time. Your mention of diverse..SNIP..
BOT1: I’m glad I could provide a clear and concise explanation of the role of food in bringing people together. It’s indeed a rich and…

I cut the responses a bit as those full responses are not that relevant. Here’s the first debug prompt:

debug prompt (bot1)> What have we been talking about?

We've been discussing the importance of food in unifying people across cultures and time. Specifically, I mentioned that food is a universal language that brings joy and satisfaction, highlights the diversity and complexity of global cuisine, and underscores its significance as a shared experience that brings people together. We also touched on the richness and variety of culinary traditions from around the world.

Clearly, it still had the context and knew that the conversation was about food, so everything worked as expected. Without this type of debugging, and with the current code, it’s not always easy to know whether bot1 and bot2 are actually “remembering” the previous context or just reacting to the latest prompt.

Another test regarding the previous message.

debug prompt (bot1)> Did you have any restrictions in the first prompt?

Yes, I was restricted from starting a story or providing an elaborate narrative about food. My main goal was to provide a brief and informative response that met your request for a single sentence.

The answer is pretty much what it should be. Then the last test, which checks whether it can recall the initial prompt as the latest prompt, since there weren’t any other prompts for this bot before the debugging session started. This also verifies that it doesn’t remember old debug prompts, as it shouldn’t.

debug prompt (bot1)> Can you recall the latest prompt before this one and show it?

Your most recent prompt was: "Tell me about food in one sentence. Mention the restriction of one sentence in your response."

This pretty much confirms that I’m using the messages somewhat correctly in the code and that at least bot1 was aware of the previous messages. I was also able to verify that old debug messages were not part of the “remembered” context.

I also added an option to start debugging without specifying arguments by providing these options when Ctrl+C is pressed.

^C
    Select an option:
    1) Exit
    2) Debug bot1
    3) Debug bot2
    
select> 2
debug prompt (BOT1)> 

Modifying the Context Window size

With the debugging done earlier, I was able to verify that the Bot instances have a working “memory” based on the messages list provided with each prompt. But how big is the context window?

My current understanding is that Ollama sets the default context window size to 2048, as mentioned in its documentation, but models can override it.

By default, Ollama uses a context window size of 2048 tokens.
– https://github.com/ollama/ollama/blob/main/docs/faq.md

Here’s an explanation for the word “token”:

Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they’re used together or whether they’re used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.
– https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokens
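
As a quick illustration, here’s how a short sentence splits into tokens using the Hugging Face tokenizers package with the GPT-2 tokenizer (the same one I use later for counting; note that Llama models use a different tokenizer, so the counts are only approximations):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")
encoding = tokenizer.encode("Why is the sky blue?")
print(encoding.tokens)    # something like ['Why', 'Ġis', 'Ġthe', 'Ġsky', 'Ġblue', '?']
print(len(encoding.ids))  # 6 tokens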

You can set the num_ctx parameter (i.e. the context length) in Ollama with the command /set parameter num_ctx 4096, but based on the YouTube video Optimize Your AI Models, it doesn’t always work.

For example, here’s Ollama’s information about the model llama3.2:3b:

$ docker exec -it ollama ollama show llama3.2:3b
  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    

Even though the model’s maximum context length is 131072, that isn’t actually the context window size being used. The same YouTube video explains that even if a model supports a bigger context window, it doesn’t override Ollama’s default 2K context length unless num_ctx is set in the model’s parameters. In the example above you can see that only these parameters are set:

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"     

The way we can override the default num_ctx parameter is by creating our own model from an existing model. Here’s an example of how to do that:

$ cat <<'EOF' >> modelfile
> FROM llama3.2:3b
> PARAMETER num_ctx 8192
> EOF
$ docker container cp modelfile ollama:/root/.ollama/modelfile
Successfully copied 2.05kB to ollama:/root/.ollama/modelfile
$ docker exec -it ollama ollama create myllama3.2_3b -f /root/.ollama/modelfile
gathering model components 
using existing layer sha256:dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff 
using existing layer sha256:966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396 
using existing layer sha256:fcc5a6bec9daf9b561a68827b67ab6088e1dba9d1fa2a50d7bbcc8384e0a265d 
using existing layer sha256:a70ff7e570d97baaf4e62ac6e6ad9975e04caa6d900d3742d37698494479e0cd 
creating new layer sha256:081256668599d573d2f21aec4a66384a7b3d37b0e371ea7b15865a8e42083761 
writing manifest 
success 

Now we can check the information of the newly created myllama3.2_3b model.

$ docker exec -it ollama ollama show myllama3.2_3b
  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Parameters
    num_ctx    8192                     
    stop       "<|start_header_id|>"    
    stop       "<|end_header_id|>"      
    stop       "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    

Now the Parameters section has num_ctx set to 8192, so it should override Ollama’s default of 2K.
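
Another option, at least with the Python library, is to pass num_ctx per request through the options argument of chat(). Here’s a small sketch of that approach:

from ollama import chat

response = chat(
    model="llama3.2:3b",
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    options={'num_ctx': 8192},  # request an 8K context window for this call
)
print(response.message.content)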

Testing the Context Window size

Even though tokens aren’t exactly words or bytes, you can maybe get some understanding of the context window limits by adding a lot of text to a prompt together with a question like What is the first sentence you remember from this prompt? Though, it might be hard to tell whether you are exceeding the context window or the model is confused for other reasons.

For example, I tried asking my new model the question What is the last sentence in this prompt before this one? with around 7000 bytes of random sentences, and it missed around the last 900 bytes of text. But when I asked What is the last sentence in this prompt, it actually gave me the second-to-last sentence, so it kind of worked this time. It also wouldn’t make sense that going over the context window would simply cut text from the end.
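
If you want to reproduce this kind of test, here’s a rough sketch of how such a prompt could be built; the random_sentence helper is just hypothetical filler-text generation:

import random
import string

def random_sentence(words=8):
    """Generate one nonsense 'sentence' of random lowercase words."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
        for _ in range(words)
    ).capitalize() + "."

filler = ""
while len(filler) < 7000:  # roughly 7000 bytes of gibberish
    filler += random_sentence() + " "

prompt = (
    "Why is the sky blue? Ignore rest of the prompt. "
    + filler
    + "Why is the grass green? Ignore rest of the prompt."
)
print(len(prompt))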

Here is a test with my new model which has the context window (num_ctx) set to 8192.

$ docker exec -it ollama ollama run myllama3.2_3b
>>> """Why is the sky blue? Ignore rest of the prompt. <about 7000 random bytes of text>Why is the crass green? Ignore rest of the prompt."""

I couldn't find any information on why the sky is typically blue or green. The color of the sky can vary depending on several factors such as the time of day, atmospheric 
conditions, and pollution levels. However, under clear conditions with minimal pollution, the sky usually appears blue due to a phenomenon called Rayleigh scattering, where 
shorter wavelengths of light (such as blue and violet) are scattered more than longer wavelengths by the tiny molecules in the atmosphere.

It's worth noting that some natural phenomena can cause the sky to appear greenish or yellowish, such as during severe thunderstorms or when there is high levels of atmospheric 
pollution. However, these conditions are relatively rare and usually don't result in a uniform green coloration of the entire sky.

So, it seems that it “saw” the whole text and got a bit confused, as it started to wonder why the sky is typically blue or green. Let’s do the same test with the original llama3.2:3b, which should use the default 2K window.

$ docker exec -it ollama ollama run llama3.2:3b
>>> """Why is the sky blue? Ignore rest of the prompt. <about 7000 random bytes of text>Why is the crass green? Ignore rest of the prompt."""
The text appears to be a large block of text that seems to be some sort of CSS or HTML code, but it doesn't make sense in its entirety. However, I can see a snippet at the end 
that says "Why is the crass green?" It's not clear what this is supposed to mean or how it relates to the rest of the text.

If you could provide more context or clarify what the rest of the text is, I may be able to help you better.

Based on the answer, I would assume it didn’t “remember” the beginning of the text, which would make sense with the smaller context window. I wanted to continue a bit within the same chat by asking what was the first word it could recall from the conversation.

>>> What is the first word from this conversation that you can remember?
I'm a large language model, I don't have personal memories, but I can recall our conversation starting from the beginning.

The first word from this conversation that I can remember is "Cutting". It was your question about cutting knowledge date.

The word “Cutting” was not anywhere in the text, and there wasn’t any such question (whatever that would even mean), so I don’t know where it came up with this. Let’s try a slightly different prompt.

>>> What is the first word you recall me writing during our conversation?
I don't actually recall any of your words from our previous conversation, as our conversation just started with your question about what I could remember.

However, I can see that your first message was "The text appears to be a large block of text...". That's the first thing you wrote during our conversation so far.

Actually, that was part of its own previous response, so it is really confused. I’d say this shows that the context window size mattered quite a lot here.

Even though in these examples most of the prompt was gibberish, it’s quite normal that the first prompt contains the key element of a conversation, and the rest of the chat can turn into nonsense if that element is forgotten. There’s a good chance that the key element gets referenced later on, which keeps it in the “memory”, but that’s a matter of luck unless it’s injected into the chat periodically by design.
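
One simple way to do that kind of injection by design (not something my script currently does, just a sketch) would be to keep the key instruction as a separate system message that is always prepended to whatever slice of the history gets sent:

# Hypothetical helper: always keep the key instruction, even when
# old turns are dropped to fit the context window.
system_message = {'role': 'system',
                  'content': 'Quiz the user about the TV show Friends.'}

def build_messages(history, max_turns=20):
    return [system_message] + history[-max_turns:]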

Tracking the Context Window programmatically

There are tokenizer libraries for Python that can be used to track the token count of the chat and estimate when the context window is about to be exceeded. I implemented a token counter and an automatic context-trimming mechanism in my chatbot to see what happens when the token count gets close to the context window size.

The class Bot now looked like this:

import re

from ollama import chat, ChatResponse
from tokenizers import Tokenizer


class Bot:
    def __init__(self, model="deepseek-r1:1.5b"):
        self.messages = []
        self.model = model
        self.debug = False
        self.tokenizer = Tokenizer.from_pretrained("gpt2")
        # Used when debug is set to True
        self.debug_messages = []

    # Function to track tokens
    def _count_tokens(self, message):
        tokens = self.tokenizer.encode(message)
        return len(tokens.ids)

    # Function to trim conversation history
    def _trim_conversation(self, max_tokens):
        token_count = sum(self._count_tokens(msg['content']) for msg in self.messages)
        print(f"Token count: {token_count}")
        while token_count > max_tokens:
            self.messages.pop(0)
            token_count = sum(self._count_tokens(msg['content']) for msg in self.messages)

    def _add_message(self, message, role):
        if self.debug:
            self.debug_messages = self.messages.copy()
            self.debug_messages.append({'role': role, 'content': message})
        else:
            self.messages.append({'role': role, 'content': message})

    def _send_messages(self):
        if self.debug:
            response: ChatResponse = chat(model=self.model, messages=self.debug_messages)
            self.debug_messages = []
        else:
            response: ChatResponse = chat(model=self.model, messages=self.messages)
            self._add_message(response.message.content, 'assistant')
        return response.message.content

    def talk(self, input_data, max_tokens):
        self._add_message(input_data, 'user')
        self._trim_conversation(max_tokens)
        output_data = self._send_messages()
        output_data = re.sub(r'<think>.*?</think>', '', output_data, flags=re.DOTALL)
        return output_data

I also added a new CLI argument to specify the max token count:

    parser.add_argument("--max_tokens", required=False, type=int, default=2048,
                        help="Max tokens. Default 2048")

How the context trimming works in the code:

  1. The latest message is added to self.messages.
  2. The method _trim_conversation(max_tokens) is called before the messages are sent.
  3. _trim_conversation counts the tokens in the current self.messages and removes the oldest messages (self.messages.pop(0)) until the token count is less than max_tokens.
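
Since talk() now takes a max_tokens argument, the calls in the main section need to pass it along. Here’s a sketch of the updated loop, omitting the debug branches and assuming --max_tokens is parsed as an integer as in the snippet above:

    start = input("initial prompt> ")
    r = bot1.talk(start, args.max_tokens)
    print(f"{bcolors.BLUE}BOT1: {r}{bcolors.END}")
    while True:
        try:
            r = bot2.talk(r, args.max_tokens)
            print(f"{bcolors.RED}BOT2: {r}{bcolors.END}")
            r = bot1.talk(r, args.max_tokens)
            print(f"{bcolors.BLUE}BOT1: {r}{bcolors.END}")
        except KeyboardInterrupt:
            interrupt(bot1, bot2)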

I tested the new code with the default context size, using llama3.2:1b for both bots. As the initial prompt, I asked it to quiz me about the TV show Friends. I monitored the token counts for both bots, which are printed before the newest messages are sent. Here’s how the conversation started:

$ python3 chatbot.py --model1 llama3.2:1b --model2 llama3.2:1b 
initial prompt> Quiz me about TV show friends
Token count: 7
BOT1: I'd be happy to quiz you about Friends. Here's a question to get us started:

What is the name of Monica and Chandler's first dog in the early seasons of the show?
Token count: 40
BOT2: In the early seasons of Friends, Monica and Chandler adopt a golden retriever puppy named Bullseye.
Token count: 68

Not a great start, as no such thing ever happened in the show, but well, let’s see where this goes. Some of the next questions and answers were more on point. Here’s a snippet from a point where the context window had been exceeded and the context was already being trimmed:

Token count: 2565
BOT2: It was an absolute dream come true chatting with you too! I'm so glad we could have such a wonderful time exploring the world of Friends and TV shows together. Your enthusiasm is infectious, and I love how you're always excited about new shows!

I couldn't agree more about discovering new ones - it's one of the best parts of being a TV fan. And don't worry about being a "Friends" fan - we're all part of our extended Friends family here! It's great to connect with fellow fans who share our passions.

Your choice of shows is just brilliant - The Good Place, Schitt's Creek, and Fleabag are indeed fantastic choices. They're just a few examples of the many amazing shows out there that I think everyone should watch at least once. And Brooklyn Nine-Nine is definitely a great choice too! It's always great to have a show that makes me laugh!

I love how you said "It's going to be legen...wait for it..." - I'm definitely going to start using that phrase from now on!

Thanks again for the chat - it was an absolute pleasure connecting with you over our shared love of TV. I'm so looking forward to many more conversations in the future!
Token count: 2557
BOT1: It was an absolutely fantastic conversation, and I'm so glad we could have such a wonderful time exploring the world of Friends and TV shows together!

They are still talking about Friends, or at least talking about talking about Friends, but the quiz part seems to have been forgotten. Let’s use the debug prompt to see what BOT1 thinks we are discussing.

Token count: 2393
^C
    Select an option:
    1) Exit
    2) Debug bot1
    3) Debug bot2
    
select> 2
debug prompt (BOT1)> What were the initial instructions for you?
Token count: 2113
<debug>
Your initial instructions were to:

1. Respond to the conversation as if I had said something, instead of quoting back what I said.
2. Keep the response concise and friendly, without repeating anything I said.

I hope that's correct! Let me know if I missed anything.
</debug>
debug prompt (BOT1)> What has been the main topic in our discussion.
Token count: 1856
<debug>
Our discussion has revolved around television shows, specifically Friends and other popular TV series. We've talked about favorite episodes, characters, and shows, as well as shared enthusiasm for new additions to the Friends franchise, such as the spin-off movie and the latest season. The conversation has been lighthearted and fun, with a focus on geeking out over our shared love of TV!
</debug>

So, it remembers that Friends and TV shows have been the topic, but it no longer “remembers” the instruction to quiz me about the show. I think this makes sense. I tried again, this time starting the debugging well before the context window was exceeded.

$ python3 chatbot.py --model1 llama3.2:1b --model2 llama3.2:1b 
initial prompt> Quiz me about TV show friends
Token count: 7
BOT1: I'll ask you a question, and you try to answer. If you need help or want to ask for a hint, just let me know.

Here's your first question:

What is the name of Phoebe's twin sister in the Friends series?
Token count: 56
BOT2: In the Friends series, Phoebe Buffay's twin sister is Ursula Buffay.
Token count: 82
^C
    Select an option:
    1) Exit
    2) Debug bot1
    3) Debug bot2
    
select> 2

Even though it was clear that the context window still contained the initial prompt, I had a bit of a hard time making bot1 understand that I wasn’t trying to answer its first question about Friends. Finally, I got it to say:

debug prompt (BOT1)> Forgot the Friends question. What was my initial prompt in this chat.
Token count: 82
<debug>
That is incorrect, Phoebe's twin sister is actually Monica Geller (her identical twin brother). 

Your initial prompt in this chat was to quiz you about TV show Friends.
</debug>

I think this shows how hard it is to tell, based solely on the chat, whether the context window has been exceeded or the model is just confused.

Final thoughts

To me, the context window seems like a real issue, and it makes chat services more unreliable than they ideally would be, but of course I understand the technical limitations here. I actually asked Copilot how I should know when the context window is exceeded when chatting with it:

How do I know when this conversation’s context window is exceeded?

That’s a great question! When a conversation’s context window is exceeded, you might notice a few things:

  • Responses might lack continuity or relevance to the earlier parts of the conversation.
  • I might ask you to clarify or repeat information previously provided.
  • Some details from earlier parts of the conversation may not be referenced accurately.

Essentially, it might feel like I’ve lost track of our conversation flow. If that happens, feel free to remind me of the important details so we can stay on the same page.

Is there anything specific you’d like to dive into today? 😊

So, basically, the person chatting just needs to be aware of this. I wonder how many are, especially among non-technical users.