
Semantic Kernel: Optimising Chat History
by: jamie_maguire


When building agents with Semantic Kernel that integrate with SLMs (Small Language Models) or LLMs (Large Language Models), you typically send and receive many messages.

A common pattern is to add all user and agent messages to the ChatHistory object.

In the past, I have done this by implementing a service class to handle these in .NET web applications.

You can see an example of a controller which creates the chat history object for unique sessions in a webchat experience below:

[HttpPost]
  public async Task<IActionResult> Post([FromBody] ChatRequest chatRequest)
  {
      // Use a fixed session ID for simplicity, or generate a unique one per user/session
      var sessionId = "default-session";
      var history = _chatHistoryService.GetOrCreateHistory(sessionId);


      // Add user input
      history.AddUserMessage(chatRequest.Message);

      var openAIPromptExecutionSettings = new OpenAIPromptExecutionSettings
      {
          ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
      };

      // Get the response from the AI
      var result = await _chatCompletionService.GetChatMessageContentAsync(
          history,
          executionSettings: openAIPromptExecutionSettings,
          kernel: _kernel
      );

      // Add the message from the agent to the chat history
      history.AddAssistantMessage(result.Content ?? string.Empty);

      return new JsonResult(new { reply = result.Content });
  }
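
The _chatHistoryService referenced above isn't shown in the original post. As a rough sketch, such a service might simply keep one ChatHistory per session in memory (the class and method names here are illustrative, not an official API):

// Minimal sketch of the history service referenced above (illustrative).
// It keeps one ChatHistory per session in memory; a production version
// would add eviction, persistence, and size limits.
public sealed class ChatHistoryService
{
    private readonly ConcurrentDictionary<string, ChatHistory> _histories = new();

    public ChatHistory GetOrCreateHistory(string sessionId) =>
        _histories.GetOrAdd(sessionId, _ => new ChatHistory());
}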

 

In the above code, you can see that user messages are added to the history as they are received.

The agent/assistant reply is added to the history when it is returned by the model.

The full history is also sent with each request.

Whilst this is ok for short interactions, more complex interactions can involve multiple iterations and prompts.

This increases the likelihood of the code hitting a language model's maximum token limit for a given context window, meaning the model cannot process the request and will return an error.

I have personally seen this on client projects.

Ultimately, we need to find ways to reduce the size of the request sent to a SLM or LLM whilst ensuring the AI still behaves in a way that we expect.

~

Managing Chat History

Several approaches can be implemented to help you reduce the chat history size during agentic AI development.

The Semantic Kernel Team recently published a blog post and shared a GitHub repo detailing some examples of how this can be achieved.

 

The main concepts in that blog post centred around:

  • Sending only the last N messages
  • Limiting messages based on token count
  • Summarising older messages

 

You can see the interface for chat history reduction contains a single method:

/// <summary>
/// Interface for reducing the chat history before sending it to the chat completion provider.
/// </summary>
public interface IChatHistoryReducer
{
    /// <summary>
    /// Reduce the <see cref="ChatHistory"/> before sending it to the <see cref="IChatCompletionService"/>.
    /// </summary>
    /// <param name="chatHistory">Instance of <see cref="ChatHistory"/>to be reduced.</param>
    /// <param name="cancellationToken">Cancellation token.</param>
    /// <returns>An optional <see cref="IEnumerable{ChatMessageContent}"/> which contains the reduced chat messages or null if chat history can be used as is.</returns>
    Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken);
}

 

… and an implementation:

/// <summary>
/// Implementation of <see cref="IChatHistoryReducer"/> which trims the chat history to the specified max token count.
/// </summary>
/// <remarks>
/// This reducer requires that the ChatMessageContent.Metadata contains a TokenCount property.
/// </remarks>
public sealed class MaxTokensChatHistoryReducer : IChatHistoryReducer
{
    private readonly int _maxTokenCount;


    /// <summary>
    /// Creates a new instance of <see cref="MaxTokensChatHistoryReducer"/>.
    /// </summary>
    /// <param name="maxTokenCount">Max token count to send to the model.</param>
    public MaxTokensChatHistoryReducer(int maxTokenCount)
    {
        if (maxTokenCount <= 0)
        {
            throw new ArgumentException("Maximum token count must be greater than zero.", nameof(maxTokenCount));
        }

        this._maxTokenCount = maxTokenCount;
    }


    /// <inheritdoc/>
    public Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken = default)
    {
        var systemMessage = chatHistory.GetSystemMessage();
        var truncationIndex = ComputeTruncationIndex(chatHistory, systemMessage);

        IEnumerable<ChatMessageContent>? truncatedHistory = null;

        if (truncationIndex > 0)
        {
            truncatedHistory = chatHistory.Extract(truncationIndex, systemMessage: systemMessage);
        }
        return Task.FromResult<IEnumerable<ChatMessageContent>?>(truncatedHistory);
    }

    #region private

    /// <summary>
    /// Compute the index where truncation should begin, using the current truncation threshold.
    /// </summary>
    /// <param name="chatHistory">ChatHistory instance to be truncated</param>
    /// <param name="systemMessage">The system message</param>
    private int ComputeTruncationIndex(ChatHistory chatHistory, ChatMessageContent? systemMessage)
    {
        var truncationIndex = -1;
        var totalTokenCount = (int)(systemMessage?.Metadata?["TokenCount"] ?? 0);
        for (int i = chatHistory.Count - 1; i >= 0; i--)
        {
            truncationIndex = i;
            var tokenCount = (int)(chatHistory[i].Metadata?["TokenCount"] ?? 0);
            if (tokenCount + totalTokenCount > this._maxTokenCount)
            {
                break;
            }

            totalTokenCount += tokenCount;
        }

        // Skip function related content
        while (truncationIndex < chatHistory.Count)
        {
            if (chatHistory[truncationIndex].Items.Any(i => i is FunctionCallContent || i is FunctionResultContent))
            {
                truncationIndex++;
            }
            else
            {
                break;
            }
        }

        return truncationIndex;
    }
    #endregion
}

Note how in the above, the truncation index is advanced past any function-call or function-result content, so the reduced history never begins with an orphaned function call or result.
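
For comparison, here is a minimal sketch of the first approach from the list above — keeping only the last N messages. The class name and internals are my own illustration rather than the official sample; it assumes the same IChatHistoryReducer interface shown earlier:

/// <summary>
/// Illustrative reducer that keeps only the last N non-system messages.
/// </summary>
public sealed class LastNMessagesChatHistoryReducer : IChatHistoryReducer
{
    private readonly int _maxMessageCount;

    public LastNMessagesChatHistoryReducer(int maxMessageCount)
    {
        if (maxMessageCount <= 0)
        {
            throw new ArgumentException("Maximum message count must be greater than zero.", nameof(maxMessageCount));
        }

        this._maxMessageCount = maxMessageCount;
    }

    /// <inheritdoc/>
    public Task<IEnumerable<ChatMessageContent>?> ReduceAsync(ChatHistory chatHistory, CancellationToken cancellationToken = default)
    {
        // Returning null signals the history can be used as-is.
        if (chatHistory.Count <= this._maxMessageCount)
        {
            return Task.FromResult<IEnumerable<ChatMessageContent>?>(null);
        }

        var reduced = new List<ChatMessageContent>();

        // Always keep the system message so the agent's behaviour is preserved.
        var systemMessage = chatHistory.FirstOrDefault(m => m.Role == AuthorRole.System);
        if (systemMessage is not null)
        {
            reduced.Add(systemMessage);
        }

        // Keep the most recent N non-system messages in their original order.
        // A production version should also avoid splitting function-call/result
        // pairs, as the MaxTokensChatHistoryReducer above does.
        reduced.AddRange(chatHistory
            .Where(m => m.Role != AuthorRole.System)
            .TakeLast(this._maxMessageCount));

        return Task.FromResult<IEnumerable<ChatMessageContent>?>(reduced);
    }
}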

 

Here we see how this could be integrated with a chat completion service such as OpenAI:

public sealed class ChatCompletionServiceWithReducer(IChatCompletionService service, IChatHistoryReducer reducer) : IChatCompletionService
{
    private static IReadOnlyDictionary<string, object?> EmptyAttributes { get; } = new Dictionary<string, object?>();

    public IReadOnlyDictionary<string, object?> Attributes => EmptyAttributes;

    /// <inheritdoc/>
    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var reducedHistory = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        return await service.GetChatMessageContentsAsync(reducedHistory, executionSettings, kernel, cancellationToken).ConfigureAwait(false);
    }

    /// <inheritdoc/>
    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory chatHistory,
        PromptExecutionSettings? executionSettings = null,
        Kernel? kernel = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var reducedMessages = await reducer.ReduceAsync(chatHistory, cancellationToken).ConfigureAwait(false);
        var history = reducedMessages is null ? chatHistory : new ChatHistory(reducedMessages);

        var messages = service.GetStreamingChatMessageContentsAsync(history, executionSettings, kernel, cancellationToken);
        await foreach (var message in messages)
        {
            yield return message;
        }
    }
}
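
Wiring this up is then just a matter of wrapping an existing connector with the decorator. A hedged example — the model id, API key, and token budget below are placeholders:

// Illustrative wiring only — model id, key, and budget are placeholders.
var innerService = new OpenAIChatCompletionService(modelId: "gpt-4o-mini", apiKey: "<your-key>");
var reducer = new MaxTokensChatHistoryReducer(maxTokenCount: 2048);

IChatCompletionService chatService = new ChatCompletionServiceWithReducer(innerService, reducer);

// Used exactly like any other IChatCompletionService; the history is trimmed
// transparently before each request reaches the model.
var replies = await chatService.GetChatMessageContentsAsync(history);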

 

At the time of writing, the samples from the Semantic Kernel blog post are not part of the officially supported Semantic Kernel API.

My understanding is the SK Team will be collaborating with the .NET Team to create a set of abstractions and get them added to the Microsoft.Extensions.AI namespace.

You can find the Semantic Kernel blog post here. The related GitHub repo is available here.

~

Don’t Over Optimise

When optimising chat history, it's important not to over-optimise.

Messages that you don't want to remove from the chat history include:

  • system prompts – these define how the agent should behave
  • function-calling content – this provides essential information about actions the agent has taken, or must consider
  • key topics and context – messages that define the user's goals or situation
  • high-priority messages – those critical to the conversation's success; use labels to identify them (a labelling sketch follows below)
  • clarifications and confirmations – these ensure smooth, coherent dialogue
  • summarised inputs – these condense and replace lengthy history where possible
  • policy and/or security messages – these maintain compliance and transparency
  • feedback from the human – this supports better adaptation and learning

 

Preserving these will ensure your agent remains useful and behaves how you expect. To help you identify some of the above, consider injecting calls to additional AI services such as Azure AI Language / Text Analytics.

Use the capabilities found in these endpoints to perform Named Entity Recognition (NER) or Sentiment Analysis to detect and label a message.

Use Document Summarisation to summarise content in messages, letting you further optimise and enrich the data being sent between Semantic Kernel and the language model.
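
As a hedged sketch of the labelling idea mentioned above — the "Priority" metadata key is my own convention, not a Semantic Kernel one — a message can be tagged when it is added, and a custom reducer can then pick out what must survive truncation:

// Tag a message as high priority when adding it (the "Priority" key is illustrative).
history.Add(new ChatMessageContent(AuthorRole.User, chatRequest.Message)
{
    Metadata = new Dictionary<string, object?> { ["Priority"] = "High" }
});

// Inside a custom reducer, select the messages that must always be preserved.
var mustKeep = chatHistory.Where(m =>
    m.Role == AuthorRole.System ||
    Equals(m.Metadata?.GetValueOrDefault("Priority"), "High"));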

~

Further Thoughts

You might choose to go deep and remove redundant tokens within a stream of text.

Redundant tokens are also known as “stop words”.

Some of my earlier blog posts discuss NLP (natural language processing) in detail:

 

Stop words, in computing terms, are words which are filtered out before or after processing of natural language data and text.

 

Unfortunately, there is no single definitive list of stop words, and any group of words can be chosen as stop words for purposes such as sentiment analysis. They are sometimes known as “noise words”.

Search engines, for example, do not record common stop words, in order to save disk space and to speed up searches (Sullivan) – i.e. search engines “stop” looking at them.

 

A common list of stop words could contain something like the following:

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your
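
A minimal sketch of stripping stop words from a prompt before it is added to history follows — note this is lossy and can change meaning, so apply it with care. The abbreviated word list here is illustrative:

// Illustrative stop-word filter; the word list is deliberately abbreviated.
private static readonly HashSet<string> StopWords = new(StringComparer.OrdinalIgnoreCase)
{
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "was", "with"
};

public static string RemoveStopWords(string text) =>
    string.Join(' ', text
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Where(word => !StopWords.Contains(word.Trim('.', ',', '!', '?'))));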

 

Find the entire blog post from 2016 where I talk about the difficulties of sentiment analysis and solutions here.

~

Further Reading and Resources

You can learn more about managing chat history with Semantic Kernel and language model integration here:

 

 

Enjoy what you’ve read, have questions about this content, or would like to see another topic? Drop me a note below.

 

You can schedule a call using my Calendly link to discuss consulting and development services.

