Semantic Kernel: Implementing 100% Local RAG Using Phi-3 With Local Embeddings
by: Jamie Maguire


In an earlier blog post, we saw how to run the small language model, Phi-3, on your local machine using the ONNX Runtime.

This let you create a simple agent hosted within a console application.

The agent had no access to specific data and would use existing generative AI capabilities to answer human prompts.

To increase the usefulness of your AI agents, you can ground them in your own custom data.

A pattern often used to help achieve this is Retrieval-Augmented Generation, or RAG for short.

 

RAG modifies interactions with a language model, letting the model respond to human prompts with references to your own data.

Most of the examples we’ve seen in recent months show how to implement RAG by consuming cloud services such as OpenAI or Azure Search. That might not be suitable for every use case; it wasn’t for me on certain projects.

In this blog post we see how to implement a 100% local RAG solution using Semantic Kernel.  We also learn about some basic RAG concepts such as vectors and embeddings.

Other topics include:

  • Semantic Kernel Volatile Memory Store
  • Embeddings Generation
  • Semantic Text Memory
  • Adding Embeddings to Semantic Text Memory
  • Challenges with the existing Semantic Kernel SDK

 

A demo of the agent in action is also included.

~

Volatile Memory Store

The volatile memory store is a simple, in-memory embeddings store.  You can use it to create collections of memories and to add, remove, update, and fetch memories.  Memories are stored as embeddings.

Learn more about Volatile Memory here.
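
As a rough sketch of what the store's API looks like (this assumes the Microsoft.SemanticKernel.Memory namespace from the memory plugin package; the collection name is illustrative):

using Microsoft.SemanticKernel.Memory;

// Everything lives in process memory, so the contents are lost when the app stops.
var store = new VolatileMemoryStore();

// Collections group related memories together.
await store.CreateCollectionAsync("codingfacts");
Console.WriteLine(await store.DoesCollectionExistAsync("codingfacts")); // True

// Records (text plus an embedding) are then upserted, fetched, and removed by id.
// In practice you rarely call the store directly; SemanticTextMemory (covered below)
// wraps these operations and generates the embeddings for you.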

~

Embeddings

An embedding is a representation of data.

When creating agents, this data normally consists of words and sentences.  This data is then converted into numerical vectors that make it easier for the machine to understand and infer meaning.

Once the data is in numerical form, it can be used for tasks such as finding the similarities between words or sentences, clustering similar content, or retrieving relevant information based on a human or agent prompt.

 

A common approach to help measure the similarity between words is cosine similarity.  Learn more about cosine similarity here.
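
As a simple illustration of the idea, here is a minimal sketch of cosine similarity computed over two embedding vectors (plain C#, no Semantic Kernel APIs involved; the numbers are made up):

// Cosine similarity = dot(a, b) / (|a| * |b|); 1 means the vectors point the same way, 0 means unrelated.
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Two toy three-dimensional "embeddings" that point in a similar direction
var similarity = CosineSimilarity(new float[] { 0.2f, 0.8f, 0.1f }, new float[] { 0.25f, 0.75f, 0.05f });
Console.WriteLine(similarity); // prints a value close to 1.0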

~

Embedding Generator

Embeddings (numerical vectors) are created by an Embedding Generator.

Examples include the BERT ONNX Text Embedding Generation Service, or OpenAI’s text embedding API/service such as text-embedding-ada-002.

Either of these will convert text into numerical vectors (embeddings) that can be stored in a memory store.

 

You can see an example of how this can look with Semantic Kernel here:

  var memoryWithCustomDb = new MemoryBuilder()
      .WithOpenAITextEmbeddingGeneration("text-embedding-ada-002", apiKey)
      .WithMemoryStore(new VolatileMemoryStore())
      .Build();
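
For the fully local setup used later in this post, here is a rough equivalent sketch (assuming the Semantic Kernel ONNX connector package and a downloaded bge-micro-v2 model; paths are illustrative) that generates an embedding directly:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

var kernel = Kernel.CreateBuilder()
    .AddBertOnnxTextEmbeddingGeneration(
        @"Models\bge-micro-v2\onnx\model.onnx",  // ONNX embedding model
        @"Models\bge-micro-v2\vocab.txt")        // tokenizer vocabulary
    .Build();

var generator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

// One numerical vector is returned per input string.
var vectors = await generator.GenerateEmbeddingsAsync(new[] { "C# is a programming language." });
Console.WriteLine($"Embedding dimensions: {vectors[0].Length}");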

~

Semantic Text Memory

The Semantic Text Memory class gives you methods to save, fetch, and search for information in a Semantic Memory Store (such as a Volatile Memory Store).

Two of the key methods are:

  • SaveInformationAsync
  • SaveReferenceAsync

 

 

The method SaveInformationAsync saves information into the semantic memory, keeping a copy of the source information.  This method expects a memory record with the following parameters:

  • string collection
  • string text
  • string id
  • string? description = null
  • string? additionalMetadata = null

 

e.g.

var x = await memory.SaveInformationAsync("codingfacts", id: "info1",
    text: "C# is a programming language.", kernel: kernel);

 

This method is best if you want to load content and subsequently query the textual contents.

For example, you can load an entire HTML file with code examples into memory.  When interacting with the agent, it can formulate a good response to a coding problem using the in-memory vectorised data.
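
A rough sketch of that approach (hypothetical file name; memory is the SemanticTextMemory instance created later in this post):

// Load a whole document and store it as a single memory record.
// A production solution would typically split the file into smaller chunks first.
var html = await File.ReadAllTextAsync("csharp-examples.html");

await memory.SaveInformationAsync(
    collection: "codingfacts",
    text: html,
    id: "csharp-examples-page",
    description: "Page of C# code examples");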

 

The method SaveReferenceAsync saves information into the semantic memory, keeping only a reference to the source information.

 

This method expects data with the following parameters:

  • string collection
  • string text
  • string externalId
  • string externalSourceName
  • string? description = null
  • string? additionalMetadata = null

 

You might use this to reference external assets such as URLs, guides, or other resources.  Querying this type of in-memory data returns the asset along with a relevance score indicating how well it matches your prompt/query.
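
A rough sketch of that pattern (hypothetical collection, URL, and ids; memory is the SemanticTextMemory instance created later in this post):

// Only the descriptive text is embedded; the external id points back to the asset itself.
await memory.SaveReferenceAsync(
    collection: "codingguides",
    text: "Official C# documentation and language reference.",
    externalId: "https://learn.microsoft.com/dotnet/csharp/",
    externalSourceName: "MicrosoftLearn",
    description: "C# docs");

// Searching returns the matching reference together with a relevance score.
await foreach (var match in memory.SearchAsync("codingguides", "Where can I read about C#?", limit: 3))
{
    Console.WriteLine($"{match.Metadata.Id} (relevance: {match.Relevance:F2})");
}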

~

Bringing It All Together

Bringing all these pieces together, we need to do the following to implement a 100% local RAG solution:

  1. Load the Phi-3 model
  2. Create the Semantic Kernel
  3. Add the Chat Completions Service and Embeddings Generator Service
  4. Setup a Volatile Memory Store
  5. Setup the Memory and add it to the Store
  6. Add memories to the memory

 

Simple, right?

~

A Problem

I had been reading and following Arafat Tehsin’s fantastic blog whilst working on this and then encountered a problem.

Specifically, on the final step (adding memories to the memory).

You can see this here:

 

After some digging, it turns out there is an error in the Semantic Kernel ONNX Connector when you try to use local embeddings.

Specifically, the call to .AddLocalTextEmbeddingGeneration() fails when setting up the kernel.

I shared the screenshot with Arafat, who raised an issue on GitHub with Microsoft (thank you!), which is currently being investigated.

~

The Solution

A temporary solution to the issue above is to manually load the embeddings capability using a different model.  Shout out to Jose Luis Latorre Millas and David Puplava for this.

 

You manually add the local embeddings model by using the following code:

// Paths to the local embedding model (bge-micro-v2) and its vocabulary file
var textModelPath = @"E:\Source\Models\bge-micro-v2\onnx\model.onnx";
var bgaVocab = @"E:\Source\Models\bge-micro-v2\vocab.txt";

// Load the embedding model into the kernel
var builder = Kernel.CreateBuilder();
builder.AddBertOnnxTextEmbeddingGeneration(textModelPath, bgaVocab);

var kernel = builder.Build();

// Resolve the local embedding generation service
var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

 

For reference, we have the entire code listing:

    // Using directives (namespace layout assumed for a recent Semantic Kernel release)
    using Microsoft.SemanticKernel;
    using Microsoft.SemanticKernel.ChatCompletion;
    using Microsoft.SemanticKernel.Connectors.OpenAI;
    using Microsoft.SemanticKernel.Embeddings;
    using Microsoft.SemanticKernel.Memory;
    using Microsoft.SemanticKernel.Plugins.Memory;

    // Paths to the local Phi-3 model and the local embedding model
    var modelPath = @"Models\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4";
    var modelId = "localphi3onnx";
    var textModelPath = @"\Models\bge-micro-v2\onnx\model.onnx";
    var bgaVocab = @"E:\Source\Models\bge-micro-v2\vocab.txt";


    var builder = Kernel.CreateBuilder();

    builder.AddOnnxRuntimeGenAIChatCompletion(modelId, modelPath);
    builder.AddBertOnnxTextEmbeddingGeneration(textModelPath, bgaVocab);

    var kernel = builder.Build();


    // Create services such as chatCompletionService and embeddingGeneration
    var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();
    var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

    // Setup a memory store and create a memory out of it
    var memoryStore = new VolatileMemoryStore();
    var memory = new SemanticTextMemory(memoryStore, embeddingGenerator);

    // Loading it for Save, Recall and other methods
    kernel.ImportPluginFromObject(new TextMemoryPlugin(memory));
    string MemoryCollectionName = "MyCustomDataCollection";
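
    // Assumption (not shown verbatim in the original post): populate the memory with the
    // custom facts listed later in this article. CodingFact property names are assumed here.
    var factId = 0;
    foreach (var fact in GetCustomFacts())
    {
        await memory.SaveInformationAsync(
            collection: MemoryCollectionName,
            text: fact.Text,
            id: $"fact{factId++}",
            description: fact.Title,
            additionalMetadata: fact.Metadata);
    }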

    // Start the conversation
    while (true)
    {
        // Get user input
        Console.Write("User > ");
        
        var question = Console.ReadLine()!;

        // Enable auto function calling
        OpenAIPromptExecutionSettings executionSettings = new()
        {
            ToolCallBehavior = ToolCallBehavior.EnableKernelFunctions,
            MaxTokens = 1000
        };


        // Invoke the kernel with the user input
        var response = kernel.InvokePromptStreamingAsync(

            promptTemplate: @"in as few words as possible, answer this Question: 
                             Answer the question using the memory content: ",
            arguments: new KernelArguments(executionSettings)
            {
                { "input", question },
                { "collection", MemoryCollectionName }
            }
            );


        Console.Write("\nAssistant > ");

        string combinedResponse = string.Empty;
        await foreach (var message in response)
        {
            //Write the response to the console
            Console.Write(message);
            combinedResponse += message;
        }
        Console.WriteLine();
    }

~

Demo – Interacting with the Agent and Reading from Memory

Here, we supply the prompt: User > who developed c#?.  The agent replies with the text: Microsoft developed C# in 2000:

 

We can check this is accurate by looking at the custom data that was saved to the agent's memory:

public static IEnumerable<CodingFact> GetCustomFacts()
{
    var facts = new CodingFact[]
    {
        new("C# was developed by Microsoft and released in 2000.", "C# History", "Developer: Microsoft"),
        new(".NET Framework, supporting C#, is an open-source development platform.", "Platform", ".NET: Open-source"),
        new("C# is a statically-typed language.", "Typing", "Type System: Statically-Typed"),
        new("C# supports object-oriented and component-oriented programming.", "Programming Paradigms", "Paradigms: OOP, COP"),
        new("LINQ in C# allows querying data in a declarative manner.", "LINQ", "Feature: Declarative Queries"),
        new("C# has built-in garbage collection for automatic memory management.", "Memory Management", "Feature: Garbage Collection"),
        new("The async and await keywords in C# are used for asynchronous programming.", "Asynchronous Programming", "Keywords: async, await"),
        new("C# supports generics for type-safe data structures.", "Generics", "Feature: Type-Safety"),
        new("Delegates in C# are type-safe and similar to function pointers in C++.", "Delegates", "Comparison: Function Pointers"),
        new("C# uses try, catch, and finally blocks for exception handling.", "Exception Handling", "Keywords: try, catch, finally")
    };

    return facts;
}
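
The CodingFact type itself isn't shown in the post; a minimal record along these lines (property names are my assumption) would match the three-argument constructor used above:

public record CodingFact(string Text, string Title, string Metadata);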

 

To further verify the agent is leveraging in-memory content (embedded as vectors), we can ask it something that will not be in the training set.

We can ask the agent, does Jamie Maguire have a blog?:

 

The agent has no knowledge of this.

We can, however, add this to the agent's memory:

new("Jamie Maguire does indeed have a blog. You can find it at www.jamiemaguire.net.","Jamie Maguire Blog", "URL: www.jamiemaguire.net")

 

After rerunning the agent, we get a response that answers the prompt with accurate information:

 

Perfect.

~

Shout Outs

Shout out to the following people for helping arrive at this solution:

  • Arafat Tehsin
  • Jose Luis Latorre Millas
  • David Puplava

~

Summary

In this blog post, we’ve seen how to implement 100% local RAG using Phi-3 and Local Embeddings.

We saw the agent in action.

In a future blog post, we will explore working with agent memory in more detail.  This is an evolving area with several options.

~

Further Reading and Resources

You can learn more about Phi-3 and the ONNX Runtime here:

 

Enjoy what you’ve read, have questions about this content, or would like to see another topic? Drop me a note below.

You can schedule a call using my Calendly link to discuss consulting and development services.

 

