Semantic Kernel: Implementing 100% Local RAG Using Phi-3 With Local Embeddings
by: Jamie Maguire


In an earlier blog post, we saw how to run the small language model, Phi-3, on your local machine using the ONNX Runtime.

This let you create a simple agent hosted within a console application.

The agent had no access to specific data and would use existing generative AI capabilities to answer human prompts.

To increase the usefulness of your AI agents, you can ground them in your own custom data.

A pattern often used to help achieve this is Retrieval-Augmented Generation, or RAG for short.

 

RAG modifies interactions with a language model, letting the model respond to human prompts with references to your own data.

Most of the examples we’ve seen in recent months show how to implement RAG by consuming cloud services such as OpenAI or Azure Search. That might not be suitable for every use case; it wasn’t for me on certain projects.

In this blog post we see how to implement a 100% local RAG solution using Semantic Kernel.  We also learn about some basic RAG concepts such as vectors and embeddings.

Other topics include:

  • Semantic Kernel Volatile Memory Store
  • Embeddings Generation
  • Semantic Text Memory
  • Adding Embeddings to Semantic Text Memory
  • Challenges with the existing Semantic Kernel SDK

 

A demo of the agent in action is also included.

~

Volatile Memory Store

The volatile memory store is a simple, in-memory embeddings store.  You can use it to create collections of memories and to add, remove, update, and fetch memories.  Memories are stored as embeddings.

Learn more about Volatile Memory here.
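
As a rough sketch of what the store's API looks like (this assumes the Microsoft.SemanticKernel.Memory namespace from the memory plugin package; the collection name is illustrative):

using Microsoft.SemanticKernel.Memory;

// Everything lives in process memory, so the contents are lost when the app stops.
var store = new VolatileMemoryStore();

// Collections group related memories together.
await store.CreateCollectionAsync("codingfacts");
Console.WriteLine(await store.DoesCollectionExistAsync("codingfacts")); // True

// Records (text plus an embedding) are then upserted, fetched, and removed by id.
// In practice you rarely call the store directly; SemanticTextMemory (covered below)
// wraps these operations and generates the embeddings for you.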

~

Embeddings

An embedding is a representation of data.

When creating agents, this data normally consists of words and sentences.  This data is then converted into numerical vectors that make it easier for the machine to understand and infer meaning.

Once the data is in numerical form, it can be used for tasks such as finding the similarities between words or sentences, clustering similar content, or retrieving relevant information based on a human or agent prompt.

 

A common approach to help measure the similarity between words is cosine similarity.  Learn more about cosine similarity here.
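
As a simple illustration of the idea, here is a minimal sketch of cosine similarity computed over two embedding vectors (plain C#, no Semantic Kernel APIs involved; the numbers are made up):

// Cosine similarity = dot(a, b) / (|a| * |b|); 1 means the vectors point the same way, 0 means unrelated.
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Two toy three-dimensional "embeddings" that point in a similar direction
var similarity = CosineSimilarity(new float[] { 0.2f, 0.8f, 0.1f }, new float[] { 0.25f, 0.75f, 0.05f });
Console.WriteLine(similarity); // prints a value close to 1.0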

~

Embedding Generator

Embeddings (numerical vectors) are created by an Embedding Generator.

Examples include the BERT ONNX Text Embedding Generation Service, or OpenAI’s text embedding API/service such as text-embedding-ada-002.

Either of these will convert text into numerical vectors (embeddings) that can be stored in a memory store.

 

You can see an example of how this can look with Semantic Kernel here:

  var memoryWithCustomDb = new MemoryBuilder()
      .WithOpenAITextEmbeddingGeneration("text-embedding-ada-002", apiKey)
      .WithMemoryStore(new VolatileMemoryStore())
      .Build();
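
For the fully local setup used later in this post, here is a rough equivalent sketch (assuming the Semantic Kernel ONNX connector package and a downloaded bge-micro-v2 model; paths are illustrative) that generates an embedding directly:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

var kernel = Kernel.CreateBuilder()
    .AddBertOnnxTextEmbeddingGeneration(
        @"Models\bge-micro-v2\onnx\model.onnx",  // ONNX embedding model
        @"Models\bge-micro-v2\vocab.txt")        // tokenizer vocabulary
    .Build();

var generator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

// One numerical vector is returned per input string.
var vectors = await generator.GenerateEmbeddingsAsync(new[] { "C# is a programming language." });
Console.WriteLine($"Embedding dimensions: {vectors[0].Length}");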

~

Semantic Text Memory

The Semantic Text Memory class gives you methods to save, fetch, and search for information in a Semantic Memory Store (such as a Volatile Memory Store).

Two of the key methods are:

  • SaveInformationAsync
  • SaveReferenceAsync

 

 

The method SaveInformationAsync saves information into the semantic memory, keeping a copy of the source information.  This method expects a memory record with the following parameters:

  • string collection
  • string text
  • string id
  • string? description = null
  • string? additionalMetadata = null

 

e.g.

var x = await memory.SaveInformationAsync("codingfacts", id: "info1",
    text: "C# is a programming language.", kernel: kernel);

 

This method is best if you want to load content and subsequently query the textual contents.

For example, you can load an entire HTML file with code examples into memory.  When interacting with the agent, it can formulate a good response to a coding problem using the in-memory vectorised data.
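
A rough sketch of that approach (hypothetical file name; memory is the SemanticTextMemory instance created later in this post):

// Load a whole document and store it as a single memory record.
// A production solution would typically split the file into smaller chunks first.
var html = await File.ReadAllTextAsync("csharp-examples.html");

await memory.SaveInformationAsync(
    collection: "codingfacts",
    text: html,
    id: "csharp-examples-page",
    description: "Page of C# code examples");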

 

The method SaveReferenceAsync saves information into the semantic memory, keeping only a reference to the source information.

 

This method expects data with the following parameters:

  • string collection
  • string text
  • string externalId
  • string externalSourceName
  • string? description = null
  • string? additionalMetadata = null

 

You might use this to reference external assets such as URLs, guides, or other resources.  Querying this type of in-memory data returns the asset along with a relevance score indicating how well it matches your prompt/query.
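
A rough sketch of that pattern (hypothetical collection, URL, and ids; memory is the SemanticTextMemory instance created later in this post):

// Only the descriptive text is embedded; the external id points back to the asset itself.
await memory.SaveReferenceAsync(
    collection: "codingguides",
    text: "Official C# documentation and language reference.",
    externalId: "https://learn.microsoft.com/dotnet/csharp/",
    externalSourceName: "MicrosoftLearn",
    description: "C# docs");

// Searching returns the matching reference together with a relevance score.
await foreach (var match in memory.SearchAsync("codingguides", "Where can I read about C#?", limit: 3))
{
    Console.WriteLine($"{match.Metadata.Id} (relevance: {match.Relevance:F2})");
}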

~

Bringing It All Together

Bringing all these pieces together, we need to do the following to implement a 100% local RAG solution:

  1. Load the Phi-3 model
  2. Create the Semantic Kernel
  3. Add the Chat Completions Service and Embeddings Generator Service
  4. Setup a Volatile Memory Store
  5. Setup the Memory and add it to the Store
  6. Add memories to the memory

 

Simple, right?

~

A Problem

I had been reading and following Arafat Tehsin’s fantastic blog whilst working on this and then encountered a problem.

Specifically, on the final step (adding memories to the memory).

You can see this here:

 

After some digging, it turns out there is an error in the Semantic Kernel ONNX Connector when you try to use local embeddings.

Specifically, the call to .AddLocalTextEmbeddingGeneration() fails when setting up the kernel.

I shared the screenshot with Arafat, who raised an issue on GitHub with Microsoft (thank you!), which is currently being investigated.

~

The Solution

A temporary solution to the issue above is to manually load the embeddings capability using a different model.  Shout out to Jose Luis Latorre Millas and David Puplava for this.

 

You manually add the local embeddings model by using the following code:

// Paths to the local embedding model (bge-micro-v2) and its vocabulary file
var textModelPath = @"E:\Source\Models\bge-micro-v2\onnx\model.onnx";
var bgaVocab = @"E:\Source\Models\bge-micro-v2\vocab.txt";

// Load the embedding model into the kernel
var builder = Kernel.CreateBuilder();
builder.AddBertOnnxTextEmbeddingGeneration(textModelPath, bgaVocab);

var kernel = builder.Build();

// Resolve the local embedding generation service
var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

 

For reference, we have the entire code listing:

    // Using directives (namespace layout assumed for a recent Semantic Kernel release)
    using Microsoft.SemanticKernel;
    using Microsoft.SemanticKernel.ChatCompletion;
    using Microsoft.SemanticKernel.Connectors.OpenAI;
    using Microsoft.SemanticKernel.Embeddings;
    using Microsoft.SemanticKernel.Memory;
    using Microsoft.SemanticKernel.Plugins.Memory;

    // Paths to the local Phi-3 model and the local embedding model
    var modelPath = @"Models\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4";
    var modelId = "localphi3onnx";
    var textModelPath = @"\Models\bge-micro-v2\onnx\model.onnx";
    var bgaVocab = @"E:\Source\Models\bge-micro-v2\vocab.txt";


    var builder = Kernel.CreateBuilder();

    builder.AddOnnxRuntimeGenAIChatCompletion(modelId, modelPath);
    builder.AddBertOnnxTextEmbeddingGeneration(textModelPath, bgaVocab);

    var kernel = builder.Build();


    // Create services such as chatCompletionService and embeddingGeneration
    var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();
    var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

    // Setup a memory store and create a memory out of it
    var memoryStore = new VolatileMemoryStore();
    var memory = new SemanticTextMemory(memoryStore, embeddingGenerator);

    // Loading it for Save, Recall and other methods
    kernel.ImportPluginFromObject(new TextMemoryPlugin(memory));
    string MemoryCollectionName = "MyCustomDataCollection";
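
    // Assumption (not shown verbatim in the original post): populate the memory with the
    // custom facts listed later in this article. CodingFact property names are assumed here.
    var factId = 0;
    foreach (var fact in GetCustomFacts())
    {
        await memory.SaveInformationAsync(
            collection: MemoryCollectionName,
            text: fact.Text,
            id: $"fact{factId++}",
            description: fact.Title,
            additionalMetadata: fact.Metadata);
    }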

    // Start the conversation
    while (true)
    {
        // Get user input
        Console.Write("User > ");
        
        var question = Console.ReadLine()!;

        // Enable auto function calling
        OpenAIPromptExecutionSettings executionSettings = new()
        {
            ToolCallBehavior = ToolCallBehavior.EnableKernelFunctions,
            MaxTokens = 1000
        };


        // Invoke the kernel with the user input
        var response = kernel.InvokePromptStreamingAsync(

            promptTemplate: @"in as few words as possible, answer this Question: 
                             Answer the question using the memory content: ",
            arguments: new KernelArguments(executionSettings)
            {
                { "input", question },
                { "collection", MemoryCollectionName }
            }
            );


        Console.Write("\nAssistant > ");

        string combinedResponse = string.Empty;
        await foreach (var message in response)
        {
            //Write the response to the console
            Console.Write(message);
            combinedResponse += message;
        }
        Console.WriteLine();
    }

~

Demo – Interacting with the Agent and Reading from Memory

Here, we supply the prompt: User > who developed c#?.  The agent replies with the text: Microsoft developed C# in 2000:

 

We can check this is accurate by looking at the custom data that was saved to the agent's memory:

public static IEnumerable<CodingFact> GetCustomFacts()
{
    var facts = new CodingFact[]
    {
        new("C# was developed by Microsoft and released in 2000.", "C# History", "Developer: Microsoft"),
        new(".NET Framework, supporting C#, is an open-source development platform.", "Platform", ".NET: Open-source"),
        new("C# is a statically-typed language.", "Typing", "Type System: Statically-Typed"),
        new("C# supports object-oriented and component-oriented programming.", "Programming Paradigms", "Paradigms: OOP, COP"),
        new("LINQ in C# allows querying data in a declarative manner.", "LINQ", "Feature: Declarative Queries"),
        new("C# has built-in garbage collection for automatic memory management.", "Memory Management", "Feature: Garbage Collection"),
        new("The async and await keywords in C# are used for asynchronous programming.", "Asynchronous Programming", "Keywords: async, await"),
        new("C# supports generics for type-safe data structures.", "Generics", "Feature: Type-Safety"),
        new("Delegates in C# are type-safe and similar to function pointers in C++.", "Delegates", "Comparison: Function Pointers"),
        new("C# uses try, catch, and finally blocks for exception handling.", "Exception Handling", "Keywords: try, catch, finally")
    };

    return facts;
}
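
The CodingFact type itself isn't shown in the post; a minimal record along these lines (property names are my assumption) would match the three-argument constructor used above:

public record CodingFact(string Text, string Title, string Metadata);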

 

To further verify the agent is leveraging in-memory content (embedded as vectors), we can ask it something that will not be in the training set.

We can ask the agent, does Jamie Maguire have a blog?:

 

The agent has no knowledge of this.

We can, however, add this to the agent's memory:

new("Jamie Maguire does indeed have a blog. You can find it at www.jamiemaguire.net.","Jamie Maguire Blog", "URL: www.jamiemaguire.net")

 

After rerunning the agent, we get a response that answers the prompt with accurate information:

 

Perfect.

~

Shout Outs

Shout out to the following people for helping arrive at this solution:

  • Arafat Tehsin
  • Jose Luis Latorre Millas
  • David Puplava

~

Summary

In this blog post, we’ve seen how to implement 100% local RAG using Phi-3 and Local Embeddings.

We saw the agent in action.

In a future blog post, we will explore working with agent memory in more detail.  This is an evolving area with several options.

~

Further Reading and Resources

You can learn more about Phi-3 and the ONNX Runtime here:

 

Enjoy what you’ve read, have questions about this content, or would like to see another topic? Drop me a note below.

You can schedule a call using my Calendly link to discuss consulting and development services.

 

