Limit the amount of requests to your API Gateway. Do it advanced!

JESPROTECH

Welcome to my video about the AI Rate Limiting Advanced plugin for the Kong Gateway. This plugin allows us to control traffic to the AI providers that we configure with the AI Proxy plugin. If you missed the AI Proxy video, and you are figuring out what to do with it or just want to know more, then one good thing to do is to check out the first video I made about Kong's AI plugins over here: https://youtu.be/6Z8wWX-liBs . The plugin in this particular video, the AI Rate Limiting Advanced plugin, makes the most sense if we are using multiple LLMs under different routes of common services. It allows for a seamless configuration and better cost controls. I cover all the basics in the video. I hope you enjoy the video and be sure to stay tech, keep programming, be kind and have a good one everyone!

---

Chapters:

00:00:00 Start
00:00:33 Introduction
00:03:19 Configuring Mistral AI
00:04:59 The AI Advanced Rate Limiting Plugin in Detail
00:09:19 Interpreting Rate Limiting Headers
00:10:32 Checking out the configuration in Kong Konnect
00:11:07 Talking about Kong Semantic AI Cache - https://youtu.be/b3dAMZOhr58
00:11:31 Checking out the example
00:12:10 Getting back to the AI Semantic Cache plugin - https://youtu.be/b3dAMZOhr58
00:15:02 Launching examples and interpreting results
00:20:39 End notes and conclusion
00:21:22 See you in the next video!
00:22:11 Disclaimer

---

Soundtrack

- https://soundcloud.com/joaoesperancinha/slow-guitar-15-jesprotech

---

Related videos:

- https://youtu.be/Kw5GZnMnVhw
- https://youtu.be/rJKbAzjb5lQ
- https://youtu.be/z3Y4NQgjGLE
- https://youtu.be/KE3VTYtLvnI
- https://youtu.be/6Z8wWX-liBs
- https://youtu.be/vRH4qLZ7tz8
- https://youtu.be/Yhv19le0sBw

---

Source code

- https://github.com/jesperancinha/kong-test-drives
- https://github.com/jesperancinha/jeorg-cloud-test-drives

---

Sources:

- https://docs.konghq.com/hub/kong-inc/ai-rate-limiting-advanced/

---

As a short disclaimer, I'd like to mention that I'm not associated or affiliated with any of the brands eventually shown, displayed, or mentioned in this video.

---

All my work and personal interests are also discoverable on other different sites:

- My Website - https://joaofilipesabinoesperancinha.nl/
- Reddit - https://www.reddit.com/user/jesperancinha
- Credly - https://www.credly.com/users/joao-esperancinha/badges
- Pinterest - https://nl.pinterest.com/jesperancinha/
- Facebook - https://www.facebook.com/joaofisaes/
- Spotify - https://open.spotify.com/user/jlnozkcomrxgsaip7yvffpqqm
- Daily Motion - https://www.dailymotion.com/jofisaes
- Bluesky - https://bsky.app/profile/jesperancinha.bsky.social

---

If you have any questions about this video, please leave a comment in the comment section below and I will be more than happy to help you or discuss any related topic you'd like to discuss.

Transcript

00:30The AI Rate Limiting Advanced plugin allows us to, well, control the rate of our requests

00:39to our API, but this one is specific for AI.

00:44So if you are used to dealing with the Rate Limiting plugin or with the Rate Limiting

00:48Advanced plugin, this is an improvement of that.

00:52And before I continue talking about this plugin, let me just go with you through the documentation

00:59for it.

01:00If we go to the introduction, we will find that it explains that the AI Rate Limiting Advanced

01:05plugin provides rate limiting for the providers used by any AI plugins.

01:10And my question when I started reading this documentation was, wait a minute, we already

01:15have rate limiting plugins in the Kong API gateway.

01:19There are at least two of them that are very well known.

01:24One is the Rate Limiting and the other one is the Rate Limiting Advanced plugins.

01:29But there are more.

01:31There is one, Service Protection, which allows us to define rate limits for services, and

01:37it is exclusive to services.

01:39Then we've got Rate Limiting, which allows us to configure rate limiting for the consumer group,

01:45for the consumer, for routes, for services, and the same goes for the Rate Limiting Advanced.

01:53Now, I have spoken a bit about the rate limiting plugins in an article, this one over here, where

01:59I discussed more details about other plugins that can potentially be used in the Kong gateway.

02:06And I explained that using the OSS version of Kong.

02:09So there, I just used the local container and simply configured the plugins to figure out

02:15how to work with the rate limiting plugins.

02:18But the AI Rate Limiting Advanced is something different.

02:22This plugin is specific for AI, specific for our requests to our large language models.

02:29And there are multiple configurations that we can do with this.

02:32We can even connect this plugin to a Redis database.

02:35And there are advantages for that.

02:38In general terms, it means that we can, for example, make sure that the rate limiting is

02:43proportional to the different large language models that we may have configured in Kong

02:48Konnect or in our Kong Enterprise Edition.

02:51The purpose here is to understand how this plugin works, regardless of using the Redis database

02:57or not, because in this case, it is not mandatory.

03:00What is mandatory for this plugin to work is the AI proxy.

03:05And the AI proxy, we can configure it, as we have seen in this video, to connect to multiple

03:12different large language models.

03:13We have seen how to configure this with Gemini.

03:16And we have seen how to configure this with Mistral.

03:20And for Mistral, we did a special configuration.

03:23We went to the website, like over here, and we created an API key.

03:29Now, I have already created another API key for this example, so that we don't have to go again

03:33through the configuration of Mistral.

03:35In this case, we can just go to the API keys here, get one API key, and then use that API

03:43key in our example.

03:46Another thing that is important for us to even start considering the use of the AI Rate

03:51Limiting Advanced plugin is the fact that we need to understand how many large

03:58language models we want to use in our services and routes.

04:02For example, we can configure the AI proxy plugin per route or per service.

04:07If we configure it per route, we will have a good advantage when we try to use the AI Rate

04:13Limiting Advanced plugin.

04:14And the reason for that is that we can then configure, for example, two different routes

04:20with two different AI proxy plugins.

04:22And in the service, on the service level, we can then configure the AI rate limiting advanced

04:29plugin, the one we are seeing today, and make sure that it will affect the two different

04:35routes.

04:35Because with that plugin, we can configure rate limiting both for one large language model

04:41and another, for as many as we want to configure, because we pass them in as an array.

04:46So keep that in mind.

04:48If we have AI proxy configured in multiple routes, we can configure them all at the same time if

04:54they belong to one single service.

04:56Because that is what we saw just now in the configuration of the AI Rate Limiting Advanced

05:03plugin, where it says that we can use this on a service, route, consumer, and consumer group

05:10level.

05:10And that is just one example of how we can use the configuration of this plugin to affect

05:16multiple different routes.
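To make the "array of providers" idea concrete, here is a minimal sketch of what one service-level plugin body covering two providers might look like. The plugin name comes from the documentation URL in the description, and each entry carries a provider name, a limit, and a window size in seconds. The helper `build_ai_rate_limit_plugin` and the numbers are hypothetical, and the exact field types may differ between Kong versions, so check the configuration reference before using it.

```python
import json


def build_ai_rate_limit_plugin(providers):
    """Build a plugin body from (name, limit, window_size) tuples.

    Hypothetical helper for illustration only; it is not part of any Kong SDK.
    """
    return {
        "name": "ai-rate-limiting-advanced",
        "config": {
            "llm_providers": [
                {"name": name, "limit": limit, "window_size": window_size}
                for name, limit, window_size in providers
            ]
        },
    }


# One plugin at the service level, covering the providers behind two routes.
# The limits and window sizes are arbitrary example values.
plugin = build_ai_rate_limit_plugin([("mistral", 10, 120), ("gemini", 60, 30)])
print(json.dumps(plugin, indent=2))
```

The point of the sketch is only the shape: one plugin, one array, one entry per LLM provider that the routes of the service forward to.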

05:19But to understand this better, let's first have a look at the configuration and what it

05:24provides and what we are going to configure in our example.

05:27If we go to the configuration reference of the AI Rate Limiting Advanced plugin,

05:32we can see that we've got all of these different nodes that we configure.

05:36What is important, and mandatory, is that we configure the LLM providers node.

05:42The LLM providers node is where we tell this plugin which of the LLM providers it is going to

05:50affect, which one of the routes that are being forwarded by the AI proxy plugin are going to be

05:55affected by this rate limiting.

05:56Now, if we configure this by service or by route, we will affect either the service and all the routes bound to that service or the route that we want to configure.

06:07And the important bit here, and it's important that we focus exactly on what this actually does, is we've got three different properties per configuration that we do for one particular large language model.

06:22The first one is the window size.

06:25This means the size of a window in seconds.

06:29If you think about it, that means that we are going to affect the rate that we are going to make the requests for a certain window.

06:36And you probably noticed that I am talking just about rate, not talking about the number of requests.

06:43That is because the AI rate limiting plugin does not limit per request.

06:48It limits per token count and per cost.

06:52And what this means is that a token is, roughly, a word or a piece of a word, for example.

06:56Individual letters also count toward how much it costs to, for example, ask a question to the LLM model.

07:06This is something that depends on the LLM model itself, and it depends on how we configure it.

07:12And so, it is not a number that we can control directly.

07:18It is a number that we can know if we know exactly how our large language model works.

07:25But we probably can do better if we make the requests and make some experiments in our application to figure out what requests have what size.

07:36Because we want, of course, to make sure that we limit the requests that we are sending to the LLM.

07:40We also need to give in the name of our LLM model.

07:44So, here is where we specify OpenAI, Azure, or Anthropic, or Cohere, or Mistral, Llama 2, Bedrock, Gemini, Hugging Face, or Request Prompt.

07:53These are different large language model types that we can use from different brands.

07:58And so, we've got window size, name, but probably the one that is more interesting here is the limit.

08:05What do we mean by limit?

08:06As I mentioned before, this is about costs.

08:08This is about the number of tokens, the token size, the number of letters, and it is about also the different large language models.

08:16The important thing here is that we know what kind of limit we want to give for a certain window.

08:22Let's say, for example, an arbitrary number, 60.

08:25If we say that we can only spend a cost of 60 in a window size of 30 seconds,

08:32then that means that in that 30-second window, I can only send up to a total cost of 60.

08:41And that means that I can send 2, 3, 4, 5 requests, just as long as the sum of all the costs of those requests does not exceed 60.

08:50If I send 1 request that exceeds 60 in those 30 seconds, then the amount of time that I have to wait will be

08:59calculated as a function of how many tokens I have gone over the limit.

09:06But this is something we will see in practice when we run our example.
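The arithmetic with a limit of 60 and a window of 30 seconds can be sketched with a toy limiter. This is a simplified model under stated assumptions, not Kong's actual algorithm: the class name `CostWindowLimiter` is hypothetical, the window is a plain fixed window, and a request is admitted whenever some budget remains, with its cost charged only afterwards (the cost is not known before the LLM answers). It does not reproduce the longer waiting time that Kong reports when a single request overshoots the limit.

```python
class CostWindowLimiter:
    """Toy fixed-window limiter that counts cost instead of requests."""

    def __init__(self, limit, window_size):
        self.limit = limit              # maximum total cost per window
        self.window_size = window_size  # window length in seconds
        self.window_start = 0
        self.used = 0

    def _roll(self, now):
        # Start a fresh window once the current one has fully elapsed.
        elapsed = now - self.window_start
        if elapsed >= self.window_size:
            self.window_start = now - elapsed % self.window_size
            self.used = 0

    def allow(self, now):
        # Admit the request while any budget is left in the window.
        self._roll(now)
        return self.used < self.limit

    def record(self, now, cost):
        # Charge the cost once it is known, after the response.
        self._roll(now)
        self.used += cost

    def remaining(self, now):
        self._roll(now)
        return max(self.limit - self.used, 0)


limiter = CostWindowLimiter(limit=60, window_size=30)
for t, cost in [(0, 20), (5, 20), (10, 20), (15, 5), (30, 5)]:
    if limiter.allow(t):
        limiter.record(t, cost)
        print(f"t={t:>2}s cost={cost:>2} admitted, remaining={limiter.remaining(t)}")
    else:
        print(f"t={t:>2}s cost={cost:>2} rejected, remaining={limiter.remaining(t)}")
```

Three requests of cost 20 use up the whole budget, the request at 15 seconds is rejected, and the request at 30 seconds lands in a new window and is admitted again.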

09:09It is also important to understand that when we are using the AI Rate Limiting Advanced plugin,

09:15we are also using another functionality that it gives,

09:22which is these headers that give us information on the status of the request to our LLM model

09:29and the status of the rate limit.

09:32For example, we can see here that they are talking about the Rate Limit Reset,

09:37the Rate Limit Retry-After,

09:38and the Rate Limit for Azure, where, in this case, we are seeing a 30.

09:45And when we are establishing the Rate Limiting, when we are establishing the Window Size,

09:50the Window Size will be reflected over here.

09:54We will see here, for example, if we say 30, that will be 30 seconds.

09:57But if we say, for example, one minute, that means 60 seconds.

10:01If we configure 60 there in the total amount of seconds, we will see here minute instead of 30.

10:06And so on, there are other kinds of headers that we receive back

10:10that will inform us of the current state of the Rate Limit for one particular large language model.

10:18And this could be one LLM in one route, or two or three or four routes of the same type in one service.

10:24But of course, our example is much simpler than that.

10:28What I have already prepared for us is in Kong Konnect,

10:32I already have two different plugins configured for our route and service.

10:36So let's see how this all is configured together.

10:39If we go here, for example, to the Gateway service,

10:41we see already that I have one service configured with this host and port.

10:45This is not necessary in terms of the definition of the host and port,

10:53but we do need to define something here so that this works.

10:56Because this will be proxied by the AI proxy,

10:59and therefore the URL here takes no effect.

11:01When we go here to our routes, we see that this is just a simple route.

11:05And as we have seen in this video over here, the previous video to this one,

11:09we see that this route is simply a route to Mistral.

11:15So here we are just going to use this path so that we can access Mistral.

11:23But the routes are accessible now because I have these plugins configured.

11:29And that is what we are going to see.

11:30If we go to the example project, kong-test-drives,

11:34I already have a file there called README AI Rate Limiting Advanced.

11:39In this file, I have already an automated configuration system

11:44that we can use to give it our environment variables.

11:48We can then call these scripts safely from our command line

11:52to make sure that we get these plugins configured.

11:55And so let me show you that.

11:57So the first one is the AI proxy.

12:00And for the AI proxy, I have configured here Mistral.

12:02All of these variables are something that we need to pass in on the command line.

12:06If you want to know more about how this works,

12:09I've explained that already in this video over here.

12:11Make sure to watch that one.

12:12It will be all in the description.

12:13Make sure also to follow the chapters in the description

12:16so that you know exactly which video is this one and where to find it.

12:20So this one is just a simple AI proxy plugin you have seen before.

12:25We've got here the provider Mistral.

12:27We also use the Mistral format of OpenAI.

12:29We say that the upstream URL is chat completions.

12:32And we are simply using this as a chatbot.

12:35We are using our LLM model simply as a chatbot.

12:39And then we enter the configuration of the topic of this video,

12:43which is the configuration for the AI Rate Limiting Advanced plugin.

12:47And here we have the complete configuration where

12:50we are specifying that any configuration for this particular route,

12:57which we could also have configured here at the service level.

13:01But because we are just showing how it works,

13:04we will configure it only for one route.

13:06This route will have this plugin attached to it,

13:10which means that it will have this LLM provider configuration

13:14where the Mistral LLM model that we are using with our AI proxy plugin

13:21will be limited to 10, which is the weight of the question

13:25or the cost of the questions we are making,

13:29which is the sum of the tokens and the letters

13:32and the calculation that needs to give it a cost

13:35and the window size of 2 minutes, 120 seconds.

13:41So if we now go to our command line

13:44from IntelliJ, because this is already configured locally,

13:48I already have here everything configured

13:50and I can show you, I already have here at the top

13:54from this point onwards,

14:00I've got here the container running

14:02and I can show it to you also.

14:04If I do docker ps,

14:06you'll find that I have here a Kong gateway running locally

14:09as a data plane node for our Kong Konnect website.

14:13So now this is connected to Kong Konnect

14:15and we are configuring this container

14:18from that point to here.

14:23Having said that, we can now test our configuration

14:27and let's just ask one question

14:30that will take a long input,

14:31which is what was the first popular breakthrough

14:34of Dire Straits,

14:35which led them to become one of the most known bands

14:39in the last decades?

14:41So this is a very long question

14:43and it will certainly cost more than 10

14:45because I've got here one word, one word, one word.

14:49I've got more than 10 words

14:50and I've got words that are longer

14:52than two or three or four characters

14:54and I've got different words

14:55and I've got also punctuations.

14:58I've got all of this stuff.

15:00So this will have a cost.

15:01Which cost do you think that this question will have?

15:04I know it by heart

15:05because I've done this example before,

15:07but if I run this now,

15:08how much do you think this will cost?

15:11And do you think that, okay, it was too quick.

15:16I was going to ask if you think

15:17that we were going to get an answer anyway.

15:20Well, it does answer.

15:21It says Dire Straits' first popular breakthrough

15:22was their single Sultans of Swing

15:25from their self-titled debut album released in 1978.

15:30So that's a long time ago,

15:31but this is just one answer.

15:34I'm not sure if this is correct or not.

15:36I believe it is,

15:37but the important thing

15:39that we want to have a look at here

15:41is how are the headers

15:44that we want to have a look at behaving.

15:46So we see that we've got an X-AI-RateLimit-Limit-120-mistral

15:49header of 10.

15:52This is true

15:53because this is how large the total cost of our requests may be.

15:57The X-AI-RateLimit-Remaining-120-mistral header says here 10.

16:0110, that is how much the cost can be.

16:05We can see here the rate limit by size query cost.

16:10Our whole query took 290.

16:13So if you said anything less than this,

16:15you were wrong.

16:15If you said anything above this,

16:18you were wrong.

16:18If you said something around this,

16:20you were probably correct in your estimation

16:23of how big this question would be.

16:25Now, I don't know how this is calculated

16:26because this also depends on Mistral,

16:28but effectively we have here a cost for this question.

16:32So we can now get an idea of how costly these questions are.

16:36But we also see other things.

16:39We see that nothing seems to be affected by this request

16:43because this is the state

16:45not of how we leave the rate limiting,

16:48but this is how we entered rate limiting.

16:50That means that if we make a follow-up question,

16:56like for example,

16:58who are Dire Straits?

17:01Now, if we send this,

17:04it says: AI token rate limit

17:06exceeded for provider(s): mistral.

17:08But now, here comes the fun part.

17:11We can have a look at the different headers

17:14that Kong has given back to us.

17:17And now we can see that

17:18the X-AI-RateLimit-Remaining-120-mistral header is zero.

17:21We don't have any more credits,

17:26let's call it that way,

17:27to send these requests.

17:28We have here that the reset will be at 171.

17:33This means that we need to wait 171 seconds still

17:37before we can continue.

17:39Remember, we asked a question that costs 290

17:43and our limit is 10.

17:45Now, we wouldn't wait 120 seconds.

17:50We would have to wait much more than that

17:52because we've wasted so many tokens

17:54at one point where we were able to make a question.

17:57So now, rate limiting is in effect

17:59and we cannot ask any other question

18:00until this deadline expires,

18:03until 171 seconds go by.

18:06There are differences between these two.

18:08The X-AI-RateLimit-Retry-After-120-mistral header

18:11means that we can make a retry after 171 seconds.

18:14The plain X-AI-RateLimit-Retry-After header

18:16also means that we can make another retry after 171 seconds.

18:19But this means the combination

18:21of all of the different rate limiting

18:25that we have configured in our gateway.

18:27So this one is the same

18:29because we only have one in effect right now

18:31and so that means that

18:32they will always match each other

18:34and it seems like a repetition of one or the other

18:37but they are not really that.

18:39And then finally, we've got here

18:41the X-AI-RateLimit-Limit-120-mistral header, 10.

18:46That is the current limit we have for cost

18:48and this one is the reset, again 171.
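The header reading above can be sketched as a small parser that groups the per-provider values by provider and window. The header pattern `X-AI-RateLimit-<Kind>-<window>-<provider>` is an assumption reconstructed from the names read out in this walkthrough, the function `parse_rate_limit_headers` is hypothetical, and the sample values are the ones seen after the second question. Headers without a window and provider suffix are skipped.

```python
import re

# Assumed pattern: X-AI-RateLimit-<Kind>-<window>-<provider>
HEADER_RE = re.compile(
    r"^X-AI-RateLimit-(Limit|Remaining|Reset|Retry-After)-(\d+)-([\w-]+)$",
    re.IGNORECASE,
)


def parse_rate_limit_headers(headers):
    """Group per-provider rate limit headers by (provider, window)."""
    state = {}
    for key, value in headers.items():
        match = HEADER_RE.match(key)
        if not match:
            continue  # not a per-provider rate limit header
        kind = match.group(1).lower()
        window = int(match.group(2))
        provider = match.group(3).lower()
        state.setdefault((provider, window), {})[kind] = int(value)
    return state


# Values taken from the rejected follow-up question in the demo.
sample_headers = {
    "Content-Type": "application/json",
    "X-AI-RateLimit-Limit-120-mistral": "10",
    "X-AI-RateLimit-Remaining-120-mistral": "0",
    "X-AI-RateLimit-Reset-120-mistral": "171",
    "X-AI-RateLimit-Retry-After-120-mistral": "171",
    "X-AI-RateLimit-Reset": "171",
    "X-AI-RateLimit-Retry-After": "171",
}

state = parse_rate_limit_headers(sample_headers)
for (provider, window), values in state.items():
    print(provider, window, values)
```

A client could use the `retry-after` value of the provider it is calling to decide how long to pause before sending the next question.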

18:52And then now, I think those 171 seconds

18:55have gone by already

18:56and if we make the question once again,

18:59we should now be getting a response.

19:04Which is taking a while.

19:08Maybe I should have used this plugin,

19:10the AI Cache plugin.

19:12But anyways, it says now that

19:14it tells the story,

19:16it seems like you might be confused

19:17between two different things,

19:18Dire Straits, which is a British rock band,

19:20and Dire Straits, which is a phrase

19:23meaning a serious or desperate situation.

19:26It is true.

19:27One or the other.

19:28It doesn't matter.

19:29We made the question

19:30and the AI model is now trying to figure out

19:33a way to give us a response

19:34and this is a very, very correct response

19:37because one is the band

19:38and the other one is the popular saying.

19:41And now, what is our limitation?

19:43Well, if we go here further above,

19:46we will see that our question now,

19:48the total cost of this question is 262.

19:51So this means that the question is smaller,

19:54but the cost is still 262.

19:56That means we have already reached the border.

19:58We are already way above 10.

20:01So that means if we ask a question now,

20:03for example, who is Brian May?

20:06And if we run it,

20:07we should still be rate limited.

20:12And now it says: AI token rate

20:14limit exceeded for provider(s): mistral.

20:16Again, we have exceeded the rate limitation

20:19and that means that we get too many requests

20:21and we should be getting now

20:23the state of our rate limiting.

20:28And that means that the state is,

20:29again, we don't have any more credits.

20:31We need to wait 197 seconds still

20:33and only then we will be able to

20:35make another question to our AI model.

20:39All right, everyone.

20:40So this is my short presentation

20:41about the AI Rate Limiting

20:45Advanced plugin.

20:46This is a plugin that has multiple functionalities,

20:49not just this.

20:49This is a bit of a crash course

20:51into how to use it.

20:53There will be more videos coming in

20:54with a lot more detail about it.

20:56Make sure to stay tuned to this channel

20:58and make sure to subscribe to this channel

21:00so that you don't miss out

21:01on following videos explaining this.

21:03Now, if you enjoyed this video,

21:04make sure to give us a like,

21:06make sure to subscribe to the channel.

21:08As I mentioned before,

21:09make sure to leave your comment.

21:11It's very important for you

21:12to give a comment to this video.

21:14I will answer questions that you put in there

21:16and don't forget to activate

21:17that notification bell

21:18so that you are up to date

21:20with the videos that I post over here.

21:22Thank you so much for watching the video

21:24and until the next one,

21:26be sure to stay tech,

21:28keep programming,

21:29be kind

21:30and have a good one.

21:31Bye.

22:03As a short disclaimer, I'd like to mention that I'm not associated or affiliated with

22:15any of the brands eventually shown, displayed or mentioned in this video.
