Notes by Anshuman

Large Language Models (LLMs) like GPT-3.5 by Open AI, have indeed taken the world by storm recently. They have been able to show a remarkable ability to emulate natural language that society is still figuring out how to figure out the ramifications of it. The quality of the text generated by LLMs is often indistinguishable from text written by humans. This opens opportunities for automating tasks that traditionally required a human touch. It can write creative literature, which was once thought of as being impossible to be done by a computer. It can imitate the style of various famous authors, singers and poets. Apart from that, models such as the one powering ChatGPT or its successor, GPT 4 or Github CoPilot have shown a remarkable ability to offer code completion which certainly has put some members of the tech community uncertain about their careers. These models are also capable of doing tasks that would normally be outsourced to perform repetitive tasks.

It is now estimated that ChatGPT has a 100 million active users. With that much volume of users interacting with a model trained on 45TB of data, there is bound to be some fluctuating performance. And with a great amount of users, comes fluctuating response times [1])( https://community.openai.com/t/open-ai-gpt-4-api-is-absurdly-slow/186721) 2. (And even occasionally, outages such as this https://openai.com/blog/march-20-chatgpt-outage). In this blog, I demonstrate how we can potentially combat such an issue which is leverage a feature Open AI offers which many of its competitors have not (at the time of writing). The feature is known as Streaming. As per the documentation available, https://platform.openai.com/docs/api-reference/completions/create

stream Boolean Optional Defaults to false

Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

Before, we proceed ahead, here’s an explanation about SSE, by my co-author, GPT 4, Server-Sent Events (SSE) is a standard that allows a web server to push real-time updates to the client, or the web browser. Unlike WebSockets, which is a two-way communication channel between client and server, SSE is a one-way channel from the server to the client. In the context of a web application, here's a brief overview of how SSE works:

The client (usually a web browser) opens a connection to the server and requests updates.
The server sends updates to the client over this connection as and when they are available.
The connection remains open, allowing the server to send real-time updates to the client asynchronously.
The client can close the connection at any time, typically when the user navigates away from the page or closes it.

This makes SSE a perfect choice for applications that require real-time data updates from the server, such as live news updates, social media feeds, or real-time analytics.

Advantages of SSE:

• SSE is built upon HTTP, which means it operates well with existing web infrastructure such as load balancers and scaling solutions.

• SSE allows automatic reconnection if the connection is dropped, and it also allows the event to be identified where it was interrupted to continue from there.

• SSE is simpler to implement than WebSockets when only server-to-client communication is required.

Drawbacks of SSE:

• SSE is not as widely supported in web browsers as WebSockets. For example, Internet Explorer does not support SSE.

• SSE is a one-way communication channel (server to client), so if the client needs to send data to the server, it would require a separate AJAX call or a different mechanism.

• SSE messages are UTF-8 encoded text data only. If you need to send binary data, you would have to use another method or encode your binary data as text.

Please note that SSE and WebSockets are different tools with different strengths, and the choice between the two depends on the specific needs and constraints of your project.

So in order to leverage this feature, I decided to build a simple demo using React and Flask (Python3) due to various StackOverflow threads that I happened to go through1 2. *(please consider this as a proof of concept and not a production grade application)

Here’s the source code: https://github.com/anshumankmr/Streaming-ChatGPT-Responses-in-ReactJS

I decided to leverage Flask due to its easy set up and almost native support for SSE without the need of any extra dependencies. I plan on writing one for Node as well in the future.

In order to implement this on the client side, I decided not to use EventSource due to its poor support amongst browsers and the lack of the feature to pass a request body. This could be a hindrance when building a more complicated application, in my opinion, though the reader is free to use EventSource if they feel like it.

Why I believe this Streaming feature is a must have is that it simply reduces the waiting time for a user, so that they can see the response as soon as your server receives it. I also believe placing it on the server side might be suitable for some, considering the fact they may have to use their own API key and storing such a value on the client side can be considered as a security violation.

References: https://stackoverflow.com/questions/12232304/how-to-implement-server-push-in-flask-framework

Demo