But it worked in the notebook!

Displaying Imagen images in web applications 

In addition to generating text, many LLMs can generate images.  Of course they’re fun to look at and share, but you can also use them to personalize web pages and  show your product in a variety of worlds the customer creates.  We’re only starting to see all the possible uses of generated images. 

It’s easy to get started with Imagen, Google’s text-to-image model.  The Imagen documentation page includes a link that lets you try out Imagen in your browser.  If you’re interested in seeing code for image generation, a quick start is available in this notebook.  Displaying an image named myImage in a notebook is as easy as calling myImage.show().

Encouraged by this, you may be ready to add Imagen to a web application immediately.  More good news–there’s a Codelab that walks you through the steps to do this.  And if you’re familiar with deploying to Cloud Run and using Flask, most of that lab shouldn’t be too hard.  (And if you’re not familiar with them, it’s a great way to learn more.)  Creating an image comes down to creating a model, asking it to generate an image given a prompt, and showing the image:

from vertexai.preview.vision_models import ImageGenerationModel

generation_model = ImageGenerationModel.from_pretrained("imagegeneration@006")

response = generation_model.generate_images(
    prompt="A watercolor painting of autumn leaves falling in the wind",
)[0]

response.show()

But then reality strikes.  When you go to use the show() method that worked so smoothly in a notebook, it doesn’t work.  And sure enough, the documentation for GeneratedImage makes that clear: show() only displays the image when the code is running in a notebook environment.

So, we need to find an alternative to show that will work in a non-notebook environment.  Fortunately, we have a lot of possibilities.  Let’s explore them.

Our constraints

Besides creating an image that will be displayed, what other constraints do we have?  Some possible requirements are:

  • Allow multiple users to use the application simultaneously.
    • This means that when we create an image, we can’t just save it to a file with a fixed name; if there are multiple users, they would all try to use that name.  Who knows which image they’d get?
  • Provide for user privacy.
    • If someone is using the app to look at potential designs for the next great water bottle, we don’t want someone else to be able to retrieve that image accidentally.
  • Reduce storage needs.
    • You could create keys for each user and store their images with that key as part of the image name, but this may take up a lot of space in your app’s storage.
  • The image needs to be in a form that can be displayed by the HTML tag <img src="{{image_uri}}">, so we’ll need a URL for the image.

In short, we need a way to give each user an image that is accessible only to the user who requested it, but without storing user images.  Sounds tricky.

But it can be done!

Data URL to the rescue

There is a less common, though still completely standard, way to make this work.  Instead of giving the img tag a URL that needs to be fetched from some server, give it a URL that contains the entire image itself.  Doing that requires using a data URL.

Instead of passing a location, data URLs send all of the data in the URL.  An example is below (from [1]):

data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAB4AAAAeCAIAAAC0Ujn1AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAEDSURBVEhLtZJBEoMwDAP7lr6nn+0LqUGChsVOwoGdvTSSNRz6Wh7jxvT7+wn9Y4LZae0e+rXLeBqjh45rBtOYgy4V9KYxlOpqRjmNiY4+uJBP41gOI5BM40w620AknTVwGgfSWQMK0tnOaRpV6ewCatLZxn8aJemsAGXp7JhGLBX1wYlUtE4jkIpnwKGM9xeepG7mwblMpl2/CUbCJ7+6CnQzAw5lvD/8DxGIpbMClKWzdjpASTq7gJp0tnGaDlCVzhpQkM52OB3gQDrbQCSdNSTTAc7kMAL5dIDjjj64UE4HmEh1NaM3HWAIulQwmA4wd+i4ZjwdYDR00GqWsyPrizLD76QCPOHqP2cAAAAAElFTkSuQmCC

The beginning indicates that this URL holds data originally in the image/png format, Base64 encoded in the remainder of the URL.  This particular URL encodes a small icon.

Yes, most modern browsers can handle very long URLs.  So, instead of storing the image and sending its URL, with the privacy and space issues that may lead to, we’ll generate an image, convert it to Base64, and send it directly to the function that renders the web page.

Using the Data URL

There are still complications that we’ll have to deal with.  When you examine the documentation for GeneratedImage, the only method that looks useful is save().

The save() method requires a location to save the image to, which sounds like we’re back to our original problem.  But it turns out Python has us covered in this case.  The tempfile library provides the ability to create a variety of temporary files and directories, including NamedTemporaryFile.  We can use the name of that temporary file with GeneratedImage’s save(), as shown below:

with tempfile.NamedTemporaryFile("wb") as f:
    filename = f.name
    response.save(filename, include_generation_parameters=False)
    # process the saved file here, before it goes away

The generated image is saved to the temporary file.  We don’t care what the name is, just that it has a unique name.  Now we can create the Base64 encoding so we can send the image to the template to be displayed.  To do this, we’ll need to:

  • open the file we just wrote
    with open(filename, "rb") as image_file:
  • read it in
    binary_image = image_file.read()
  • get its Base64 encoding
    base64_image = base64.b64encode(binary_image).decode("utf-8")
  • create the data URL holding this encoding:
    image_url = f"data:image/png;base64,{base64_image}"

Since all of this will be included in the with statement, the temporary file will be closed and automatically cleaned up when this is done.  The final code to get a data URL (image_url) from a generated image (response) is:

import base64
import tempfile

with tempfile.NamedTemporaryFile("wb") as f:
    filename = f.name
    # Save the generated image to the temporary file...
    response.save(filename, include_generation_parameters=False)
    # ...then read it back and build the data URL before the file goes away.
    with open(filename, "rb") as image_file:
        binary_image = image_file.read()
        base64_image = base64.b64encode(binary_image).decode("utf-8")
        image_url = f"data:image/png;base64,{base64_image}"

The final step is to render an HTML template that will display the image in image_url.
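To make that last step concrete, here is a minimal Flask sketch of how the pieces might fit together.  The /generate route, the prompt form field, and the result.html template are hypothetical names used only for illustration, and the sketch assumes the Vertex AI project has already been initialized; the Codelab walks through a complete, deployable version.

import base64
import tempfile

from flask import Flask, render_template, request
from vertexai.preview.vision_models import ImageGenerationModel

app = Flask(__name__)
generation_model = ImageGenerationModel.from_pretrained("imagegeneration@006")

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.form["prompt"]
    response = generation_model.generate_images(prompt=prompt)[0]

    # Save to a temporary file, then read it back and build the data URL.
    with tempfile.NamedTemporaryFile("wb") as f:
        filename = f.name
        response.save(filename, include_generation_parameters=False)
        with open(filename, "rb") as image_file:
            binary_image = image_file.read()

    base64_image = base64.b64encode(binary_image).decode("utf-8")
    image_url = f"data:image/png;base64,{base64_image}"

    # result.html (hypothetical) contains: <img src="{{image_uri}}">
    return render_template("result.html", image_uri=image_url)

A form on another page would POST the user’s prompt to /generate, and the template simply drops the data URL into the img tag.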

And that’s it

At least for being able to get a URL from a generated image.  If you want to go through all the steps to create a web app to get a prompt from a user and display the image generated for the prompt, take a look at this Codelab.  

I look forward to seeing how you can use this ability in your apps!

[1] https://www.learningtree.com/blog/encoding-image-css-html/, Retrieved 4 November 2024.

A storied introduction to RAG

One of my nightmares is being on Jeopardy! and having the worst board ever.  Along with categories like Minor League Right Fielders of the 1970s and The Periodic Table in Portuguese, I’d expect to see LLM Abbreviations.  While I do know LLM stands for Large Language Model, there seems to be a new LLM-related abbreviation every day.

It’s time to face my fears and take on one of those abbreviations–RAG.  But instead of getting into technical details right away, let’s tell a story that will hopefully make RAG a lot less daunting.

What is RAG?

The abbreviation RAG itself is not that bad; it stands for Retrieval-Augmented Generation.  That’s a pretty good description.  Before generating an answer, RAG will retrieve documents that are pertinent to the question and send them along to the LLM to get the final answer.

RAG is one way to improve how LLMs generate their output.  There are several alternatives to RAG, such as

  • retraining an entire model from scratch (a very expensive operation) 
  • working to improve the prompts sent to the model 
  • tuning the model with more test cases specific to your use case

So, when is RAG a better choice than these?  

  • RAG gives the LLM access to current information.  Since retraining is so expensive, it isn’t done often and some popular LLMs may not have been trained for years.  
  • RAG gives the LLM access to specific information for a company.  A bakery may use a general LLM to help with planning, but by using RAG, it can add exact information about the bakery’s tools, recipes, and supplies.   
  • Since the user is providing information they believe to be true, RAG tends to reduce hallucinations.

A story

I’ve had the privilege of knowing Dr. Shannon Duvall for years, and one of her research areas is storytelling in computer science [1].  Stories can make complex topics more concrete, and at first glance, RAG is a complex topic.  So, let’s start with a story.

Chris owns a travel agency and Pat works there.  

Chris finished travel agent school in 2022 and has been too busy to keep up with what’s new.

Fortunately, the travel agency gets lots of brochures from around the world that have more recent information.  Now, these brochures are in lots of foreign languages, and they use different words to say the same thing (like “on the water”, “beachfront”, “walk from your room to the ocean”, “sur la mer”, “на берегу океана”, etc.).  So Pat has encoded them into a bunch of numbers, based on their attributes.  Pat uses a very involved process to do this encoding.  There’s no clear pattern between the document contents and the encoding, but it is amazingly consistent!

When Madison, a customer, comes and asks for advice, Pat encodes the request using the same process and looks at the brochures’ encodings and finds the ones that match the best.  Again, Pat has a fancy way of checking for how well brochures match, but we trust Pat to do the match well.

Once Pat has all the best brochures ready, they are given to Chris, who also gets Madison’s request.  Chris has a rule that any response must include information from at least one of those brochures.   Based on all of this, Chris responds to Madison.

Comments on the story

On first read, this story may seem to have nothing to do with computing or LLMs, but let’s break it down.

  • Chris represents the LLM that will ultimately generate a recommendation for Madison.
  • But Chris’s knowledge is outdated, so it will need to be augmented.
  • There are lots of travel brochures that can be used to help Chris make a current decision and Pat will decide which of these are pertinent.
  • Fortunately, the information for each of the brochures has been given a numeric encoding that is stored in the travel agency’s computer before Madison ever showed up.
  • So Pat will run a simple program to encode Madison’s request and determine which of the brochures are the closest matches.  
  • These brochures are then given to Chris along with the original request.  Using this additional, pertinent material, Chris can make a better recommendation to Madison. 

So, how does that relate to doing retrieval-augmented generation?

First, RAG is not a one-step process.

  • You need to first create a set of authoritative material and create an encoding from it.  This does not need to be done for each request of the LLM.
  • Then you need to encode the request and use it to retrieve the pertinent documents.
    • Be sure you’re using the same encoding system for both the main set of documents and the request!
  • Then you can send the query along with the documents to the “main” LLM to create a response.

RAG doesn’t change the underlying model.  Because of this, it has a relatively low cost per request, just needing to encode a single prompt.

And there’s not just one way to do it.  You still need to pick a place to store the embeddings (a vector store) and decide how to determine how close two points in that vector space are (the similarity measure).  If you have a lot of embeddings (at least 5,000), it may increase efficiency to create a vector index to improve your search speed.  And you still have your choice of the main LLM to feed these intermediate results into.
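To tie these pieces together, here is a toy sketch of the whole flow in Python.  Everything in it is a stand-in: the embed function is a deliberately crude character-count “embedding”, the list of brochures is a stand-in vector store, and call_llm is a placeholder for whatever main LLM you choose.  The point is only to show the shape of the steps above.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model.  The rule that matters:
    # use the SAME encoding process for the documents and for the request.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: one common choice of similarity measure.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def call_llm(prompt: str) -> str:
    # Placeholder for the "main" LLM (Gemini, Gemma, etc.).
    raise NotImplementedError

# Step 1 (done once, ahead of time): encode the authoritative material.
brochures = [
    "Beachfront bungalows, a short walk from your room to the ocean...",
    "Alpine chalets with ski-in, ski-out access...",
]
brochure_embeddings = [embed(b) for b in brochures]  # a toy "vector store"

def answer(request: str, top_n: int = 1) -> str:
    # Step 2 (per request): encode the request with the same process and
    # find the closest matches among the documents.
    request_embedding = embed(request)
    ranked = sorted(
        zip(brochures, brochure_embeddings),
        key=lambda pair: similarity(request_embedding, pair[1]),
        reverse=True,
    )
    context = "\n\n".join(doc for doc, _ in ranked[:top_n])

    # Step 3: send the retrieved documents along with the request to the main LLM.
    prompt = f"Using only the material below, answer the request.\n\n{context}\n\nRequest: {request}"
    return call_llm(prompt)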

Why use numbers?

Why do we need to translate the documents into numbers?  Put simply, computers are very, very good at processing numbers and nowhere near as good at dealing with text.  By converting to numbers, we play into the strengths of computing.  (My apologies to any cyber-archaeologists who read this in 20 to 100 years and laugh at the claim that computers are not as good at handling text as they are at numbers.)  Converting to numbers removes differences in wording and languages.   

If you look into embeddings, you’ll usually find examples like plotting foods on a “spicy” to “bland” axis (and it may go to 2 or 3 dimensions, adding in “hot” or “cold” and “healthy” or “indulgent”).  I’m not going to do that here because it can be misleading.  First, most embeddings are done in hundreds of dimensions.  Second, usually the axes do not have clear meaning (like “temperature” or “spiciness”).   But you can see a simple example of the embeddings for individual words at the Embedding Projector, which visualizes high dimensional embeddings for 10,000 common English words and the closest words to them.  I searched for “dog” and while I expected some of the results (like “pet”), others were more surprising (like “hat”).

Next steps

Hopefully, this introduction has let you see that RAG is not as daunting a topic as it may have seemed.  Understanding something and implementing it can be vastly different.  Come back for future posts to get into the details of finding and embedding data and passing it on to an LLM.

[1] Shannon Duvall. 2008. Computer science fairy tales. J. Comput. Sci. Coll. 24, 2 (December 2008), 98–104.

Sampling in LLMs

There’s been a lot written about Large Language Models (LLMs) and ways to adjust them to your particular needs.  Sampling is one of these ways; it is fairly easy to use and lets the user vary the output of an LLM such as Gemma.  This post will describe some common sampling techniques available on many different LLMs, using a small example text, and will also show code for applying them to Gemma, which is trained on a much larger data set.

Why sample?

LLMs work much like predictive text in editors: given a preceding group of words, an LLM will select the next word or phrase based on the data it has been trained on.  It takes the context and, using the training data, selects the most likely next word or phrase.  For example, if the context is “have a happy”, it is likely to be followed by “birthday” or “new year” and very unlikely to be followed by “xylophone”, since “have a happy xylophone” is not a frequently occurring phrase.

To make LLMs seem more creative, the most common prediction may not always be the one selected.  There are a number of ways to guide the LLM in making a selection.  This process is known as sampling.  

We’ll look at a number of ways to do sampling, both in general and on the LLM Gemma in the notebook available on GitHub.    Gemma is a family of open lightweight generative AI models built on the same technology as Gemini, Google’s largest and most capable LLM.  It is designed to be easy to customize.   

Getting some data

To make the ideas here more concrete, let’s analyze a real text, The Little Red Hen.  This is an English folk tale with just under 1400 words.  

Yes, this is a massive simplification.  But sometimes starting simple works better.

We can look at how often certain word pairs occur in the text.  Not surprisingly, the word “red” appears 16 times and each time is followed by the word “hen”.   The word “little” also appears 16 times, but is only followed by “red” 14 times.  The other times it’s followed by “fluff-balls” and “body,” one time each.

The word “the” is more interesting.  It appears 109 times and is followed by 58 different words.  45 of these words only appear after “the” once and 6 appear after “the” 2 times each.  The other 7 words are:

word        count   frequency
little         10        9.2%
pig             9        8.3%
cat             9        8.3%
rat             8        7.3%
wheat           8        7.3%
barnyard        4        3.7%
bread           4        3.7%

Frequency values have been rounded.

These 7 words appear after the word “the” almost 48% of the time “the” appears.  Let’s look at how different sampling methods select from these choices.
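If you’d like to reproduce counts like these yourself, here is a rough sketch.  It assumes the text of the story has been saved to a local file named little_red_hen.txt (a hypothetical name), and the exact numbers will vary a little with how you tokenize the text.

import re
from collections import Counter

# Deliberately simple tokenization: lowercase words, keeping hyphens and apostrophes.
with open("little_red_hen.txt") as f:
    words = re.findall(r"[a-z'-]+", f.read().lower())

# Count how often each word follows the word "the".
followers = Counter(nxt for prev, nxt in zip(words, words[1:]) if prev == "the")
total = sum(followers.values())
for word, count in followers.most_common(7):
    print(f"{word:10s} {count:3d} {count / total:6.1%}")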

Greedy sampling

Greedy sampling is quite simple–for the next token, pick the one with the highest frequency.  In our example, the word selected to follow “the” will always be “little” since it has the highest frequency. 

This is the default in Gemma.  Using Keras, a deep learning API for Python, you can create a Gemma model and set its sampler to greedy using the code below:

import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.compile(sampler="greedy")

Greedy sampling is simple enough, but doesn’t lead to much variety in responses.  If you’re working in the Gemma notebook, you can try it out using the code below.  You should notice that you get the same response each time you run it.

print(gemma_lm.generate('Are cats or dogs better?', max_length=32))

Top K Sampling

Top K sampling can give more variety.  Instead of always selecting the token with the highest frequency, the user can specify a K value–the number of tokens with the highest frequencies to select from.  So, for k = 5 in our example above, one of the tokens “little”, “pig”, “cat”, “rat”, or “wheat” would be selected.  Since “little” is most frequent, it would be selected most often, but all 5 would be possible responses.  The relative percentages of these 5 are shown below.

word        count   frequency
little         10       22.7%
pig             9       20.4%
cat             9       20.4%
rat             8       18.1%
wheat           8       18.1%

Frequency values have been rounded.

To use the Top K sampler in Gemma, you need to first create the sampler since you need to specify a value for k.  Once that’s done, you can recompile the model without having to recreate it, making it easier to experiment with different samplers:

sampler = keras_nlp.samplers.TopKSampler(k=5)
gemma_lm.compile(sampler=sampler)

If you try that, you will notice some variety in the responses you get.  You can also try different values of k to see how they affect the output.

Top P Sampling

Top P sampling is similar to Top K, but instead of specifying how many tokens to include in the pool, Top P specifies how much of the total frequency the pool should cover, taking tokens from most to least frequent.  So, for a Top P sample of the data with p = 25%, tokens would be taken from the most frequent (“little”) to the next most frequent (“pig” and “cat”) until a total of at least 25% frequency has been met.  These three words have a combined frequency of 25.8%.

Using a Top P sampler in Gemma works much like a Top K sampler. 

sampler = keras_nlp.samplers.TopPSampler(p=0.25)
gemma_lm.compile(sampler=sampler)

Try it again and see how the output changes as you rerun the code and change the value of p.

Random Sampling

Random sampling includes all possible values in determining the next token, using the probability of each token as the chance of selecting it.  So there would be a 9.2% chance of selecting “little” after “the” and a 0.9% chance of selecting “big” since the combination “the big” appears once in the text.  
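To try this with Gemma, KerasNLP provides a Random sampler that plugs in the same way as the Top K and Top P samplers; this snippet assumes the same gemma_lm model created earlier.

sampler = keras_nlp.samplers.RandomSampler()
gemma_lm.compile(sampler=sampler)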

Temperature

In addition to changing the type of sampling done, a function can be applied to the counts, making the differences between the values either more or less significant.  The temperature is a value used in this function, and it can be passed as a parameter to almost all types of sampling.  You can think of this function as raising a base (determined by the temperature) to the count power.

Consider what happens with the 7 most frequent values.  When you use the counts as the exponent on a base of 2, the values end up widely spread apart, with the largest (at 39.5%) more than 60 times the frequency of the smallest (at 0.6%).  But when the base is 1.2, the largest frequency will only be about 3 times the frequency of the smallest.

Word        Count   2^count   distribution   1.2^count   distribution
little         10      1024          39.5%        6.19          21.2%
pig             9       512          19.8%        5.16          17.6%
cat             9       512          19.8%        5.16          17.6%
rat             8       256           9.9%        4.30          14.7%
wheat           8       256           9.9%        4.30          14.7%
barnyard        4        16           0.6%        2.07           7.1%
bread           4        16           0.6%        2.07           7.1%
sum            52      2592                      29.26
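To see where these numbers come from, here is a short snippet that recomputes both distributions.  The bases 2 and 1.2 are just the illustrative values used in the table, not values Gemma uses internally.

counts = {"little": 10, "pig": 9, "cat": 9, "rat": 8, "wheat": 8, "barnyard": 4, "bread": 4}

for base in (2, 1.2):
    # Raise the base to the count power, then normalize to get a distribution.
    weights = {word: base ** count for word, count in counts.items()}
    total = sum(weights.values())
    print(f"base = {base}")
    for word, weight in weights.items():
        print(f"  {word:10s} {weight:8.2f} {weight / total:6.1%}")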

Temperature can be added as a parameter to any of the samplers we’ve used thus far as in:

sampler = keras_nlp.samplers.TopPSampler(p=0.25, temperature=0.7)
gemma_lm.compile(sampler=sampler)

What’s Next

In this tutorial, you learned how to modify the output of Gemma by using different sampling techniques. Here are a few suggestions for what to learn next: